WO2021120145A1

WO2021120145A1 - Voice conversion method and apparatus, computer device and computer-readable storage medium

Info

Publication number: WO2021120145A1
Application number: PCT/CN2019/126865
Authority: WO
Inventors: 刘洋; 李柏; 丁万; 黄东延; 熊友军
Original assignee: 深圳市优必选科技股份有限公司
Priority date: 2019-12-20
Filing date: 2019-12-20
Publication date: 2021-06-24
Also published as: CN111108558A; CN111108558B

Abstract

A voice conversion method and apparatus, a computer device, and a computer-readable storage medium. The method comprises: acquiring a voice to be converted and an original conversion model, the format of the original conversion model being an online format (202); performing format conversion on the original conversion model to obtain a target conversion model in an offline format (204); performing feature extraction on the voice to obtain features to be converted (206); inputting the features into the target conversion model to obtain target features outputted by the target conversion model (208); and obtaining a target voice according to the target features outputted by the target conversion model, wherein the target voice has the same voice content as the voice to be converted, and the target voice has a different sound from the voice to be converted (210). The voice conversion method may not only perform high-quality voice conversion in an offline state, but also has a fast running speed and can achieve real-time voice conversion.

Description

Voice conversion method, device, computer equipment and computer readable storage medium

Technical field

This application relates to the field of audio processing technology, and in particular to a voice conversion method, device, computer equipment, and computer-readable storage medium.

Background technique

Voice conversion technology is a technology that converts the source voice into the target voice while keeping the semantic content unchanged. The source voice is the voice uttered by the first human voice, and the target voice is the voice uttered by the second human voice. That is, the source voice emitted by the first human voice is converted into the target voice emitted by the second human voice with the same semantics through voice conversion technology.

With the rapid development of deep neural network technology, the voice conversion method based on deep learning has high voice similarity, good voice quality and good fluency. The current deep learning-based speech conversion method mainly includes two steps. First, a large amount of speech data is used to train the conversion model, and then the trained model is used for speech conversion. Because training requires high computing resources, there are few offline resources and low performance. It is easy to run out of resources when used for training. Even if it can be trained, the efficiency is very low, and the time cost is too high and difficult to use. Therefore, the current deep learning-based voice conversion function can only be realized by relying on online high-performance servers, and cannot be used offline.

Application content

Based on this, it is necessary to address the above problems and propose a voice conversion method, device, computer equipment, and storage medium that can still perform high-quality voice conversion in an offline state.

A voice conversion method, the method includes:

Acquiring the voice to be converted and the original conversion model, where the format of the original conversion model is an online format;

Format conversion of the original conversion model to obtain a target conversion model in offline format;

Performing feature extraction on the voice to be converted to obtain the feature to be converted;

Input the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model;

The target voice is obtained according to the target feature output by the target conversion model, the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from the voice to be converted.

A device for voice conversion, the device includes:

An acquisition module for acquiring the voice to be converted and the original conversion model, the format of the original conversion model is an online format;

A format conversion module for format conversion of the original conversion model to obtain a target conversion model in an offline format;

The feature extraction module is used to perform feature extraction on the voice to be converted to obtain the feature to be converted;

The feature conversion module is configured to input the features to be converted into the target conversion model to obtain the target features output by the target conversion model;

The result module is configured to obtain a target voice according to the target feature output by the target conversion model, the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from the voice to be converted.

A computer device includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:

A computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, the processor executes the following steps:

The embodiments of this application have the following beneficial effects:

The above voice conversion method, device, computer equipment and computer readable storage medium, by acquiring the voice to be converted and the original conversion model, since the original conversion model cannot work in an offline state, the features of the voice to be converted are extracted to obtain the features to be converted. After the format of the original conversion model is converted to the offline format, the target feature can be obtained according to the features to be converted and the target conversion model in the offline format, and then the target voice can be obtained according to the target feature. This voice conversion method can not only perform high-quality voice conversion in an offline state, but also runs fast, and can realize real-time voice conversion.

Description of the drawings

In order to more clearly describe the technical solutions in the embodiments of the present application or the prior art, the following will briefly introduce the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are only These are some embodiments of the present application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative work.

among them:

Figure 1 is an application environment diagram of a voice conversion method in an embodiment;

Figure 2 is a flowchart of a voice conversion method in an embodiment;

Figure 3 is a flowchart of a voice conversion method in an embodiment;

FIG. 4 is a schematic diagram of segmentation processing of the voice to be converted in an embodiment;

Figure 5 is a structural block diagram of a voice conversion device in an embodiment;

Fig. 6 is a structural block diagram of a computer device in an embodiment.

Detailed ways

The technical solutions in the embodiments of the present application will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

Fig. 1 is an application environment diagram of a voice conversion method in an embodiment. As shown in Figure 1, the voice conversion method is applied to a voice conversion system. The voice conversion system includes a terminal. The terminal may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, and a notebook computer. The terminal includes a microphone, a conversion unit, and a player. The microphone is used to obtain the voice to be converted. The conversion unit is used to convert the voice to be converted into a target voice with the same voice content but a different voice. The player is used to play the target voice.

As shown in Figure 2, in one embodiment, a voice conversion method is provided. The method can be applied to terminals, servers, and other voice conversion devices. In this embodiment, it is applied to a voice conversion device as an example. In the offline state, after the voice conversion device obtains the voice to be converted, the following voice conversion method can obtain the target voice with the same voice content and different voice as the voice to be converted. The voice conversion method specifically includes the following steps:

Step 202: Obtain the voice to be converted and the original conversion model, where the format of the original conversion model is an online format.

Among them, the voice to be converted refers to the voice that is emitted by the human voice to be converted and is to be converted into the target human voice.

Among them, the online format refers to the saving format of files that can be opened or work normally only when the network is connected.

Among them, the original conversion model refers to a model in which the input is the feature to be converted of the voice to be converted, and the output is the target feature of the target voice, which is used to obtain the target feature of the target voice according to the feature of the voice to be converted in the state of network connection .

Step 204: Perform format conversion on the original conversion model to obtain an offline format target conversion model.

Among them, the offline format refers to the saving format of files that can be opened or work normally when disconnected from the network.

Among them, the target conversion model is used to obtain the target characteristics of the target voice according to the characteristics of the voice to be converted when the network is disconnected.

The original conversion model is formatted to obtain an offline format target conversion model. Exemplarily, the original conversion model is a model file trained by the TensorFlow (a machine learning library developed by Google, using the python language) framework. The original conversion model is saved in the online format CheckPoint (abbreviated ckpt), which can be converted to save format It is the offline format JetSoft Shield Now (jsn in short) to obtain the target conversion model. The original conversion model in ckpt format records a lot of information, such as some parameters and data used when training the original conversion model. This part of the data is not needed in the process of voice conversion in the offline state, so it is necessary to convert the original conversion model to the save format. Excess data will be removed when it is in jsn format, which is equivalent to simplifying and compressing the model file, which can improve the running speed in offline state, thereby increasing the speed of voice conversion and realizing real-time voice conversion.

Step 206: Perform feature extraction on the voice to be converted to obtain the feature to be converted.

The feature to be converted is used to input the target conversion model to obtain the target feature corresponding to the voice to be converted.

Obtain the spectral features of the voice to be converted according to the voice to be converted, such as the Mel spectrum of the voice to be converted, extract the features of the converted voice, and determine the feature to be converted of the voice to be converted based on these features.

Step 208: Input the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model.

Among them, the target feature is used to obtain a target voice with the same voice content and different voice as the voice to be converted.

In the offline state, when the target conversion model is in the running state, the feature to be converted is input to the target conversion model, and the target conversion model directly outputs the target feature corresponding to the feature to be converted.

Step 210: Obtain a target voice according to the target feature output by the target conversion model, the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from the voice to be converted.

Wherein, the target voice refers to a voice whose voice content is the same as the voice to be converted and whose voice is different from the voice to be converted.

According to the target characteristics, the fundamental frequency, spectrum envelope, and non-periodical characteristics of the target voice can be obtained, and the Mel spectrum of the target voice can be determined, and the target voice can be obtained according to the Mel spectrum of the target voice. Exemplarily, the feature to be converted is binarized 130-dimensional serialized data, and the target feature obtained by inputting the target conversion model is also 130-dimensional serialized data. After denormalization, the lf0, mgc, and bap features of the target voice are obtained. Then use SPTK to convert the data into f0, sp, and ap features. From the f0, sp, and ap of the target voice, the mel spectrum of the target voice can be determined, and the mel spectrum of the target voice can be used to obtain the target voice.

In the above voice conversion method, by acquiring the voice to be converted and the original conversion model, since the original conversion model cannot work in an offline state, the features of the voice to be converted are extracted to obtain the features to be converted, and after the format of the original conversion model is converted to an offline format, According to the features to be converted and the target conversion model in offline format, the target features can be obtained, and then the target speech can be obtained according to the target features. This voice conversion method can not only perform high-quality voice conversion in an offline state, but also runs fast, and can realize real-time voice conversion.

In one embodiment, step 206 performs feature extraction on the voice to be converted to obtain the feature to be converted, including: performing periodic feature extraction and aperiodic feature extraction on the voice to be converted to obtain the period corresponding to the voice to be converted Features and aperiodic features, where the periodic features include a fundamental frequency and a spectrum envelope; the features to be converted are obtained according to the periodic features and the aperiodic features.

There are multiple sound sources that produce acoustic energy in the vocal tract when a person speaks. Among them, non-periodic sound sources include aspiration, friction, and blasting sound generated at the lips, teeth, throat, and vocal tract, while periodic sound sources are It is generated by the vibration of the vocal cords at the glottis, so the voice to be converted includes periodic components and non-periodic components, and the corresponding spectral features of the voice to be converted include periodic features and non-periodic features. In this embodiment, the Mel spectrum of the voice to be converted is used as the spectral feature for description.

Among them, the fundamental frequency (Fundamental Frequency, f0) refers to a group of sine waves forming the original signal, the sine wave with the lowest frequency is the fundamental frequency, and the others are overtones. The spectral envelope (spectral envelope, sp) refers to the envelope obtained by connecting the highest amplitude points of different frequencies through a smooth curve. Aperiodic sequence (aperiodic parameter, ap) refers to aperiodic signal parameters of speech.

Among them, the periodic feature refers to the fundamental frequency and spectrum envelope in the Mel spectrum of the voice to be converted.

Among them, the aperiodic feature refers to the aperiodic sequence in the Mel spectrum of the voice to be converted.

According to the periodic feature and the non-periodic feature, the feature data as the input of the target conversion model can be obtained through processing, and the feature data is the feature to be converted. Exemplarily, a set of characteristic data is obtained according to the periodic characteristic and the aperiodic characteristic, and the characteristic data is calculated and formatted to obtain the characteristic to be converted.

In one embodiment, obtaining the feature to be converted according to the periodic feature and the aperiodic feature includes: obtaining a target dimensional feature according to the periodic feature and the aperiodic feature, and the target dimensional feature has a higher dimension than all. The sum of the dimensionality of the periodic feature and the non-periodic feature; performing format conversion on the target dimensional feature to obtain the feature to be converted.

Wherein, the target dimensional feature refers to a good feature whose dimension obtained according to the periodic feature and the aperiodic feature is higher than the dimension of the periodic feature and the aperiodic feature. The low-dimensional periodic features and the non-periodic features are mapped to obtain high-dimensional target dimensional features, which can improve the quality of synthesized speech.

Exemplarily, the periodic features f0 and sp are obtained according to the Mel spectrum of the voice to be converted, and the non-periodic feature ap is used to process the three features using the Speech Signal Processing Toolkit (SPTK) Get 1-dimensional lF0 (take the logarithm of F0), 41-dimensional mgc and 1-dimensional aperiodicity (band aperiodicity, bap), and calculate the 1-dimensional voice (voice, unvoice, abbreviated as vuv) according to lf0 Data, find the first derivative and the second derivative of lf0, mgc, and bap respectively, and obtain 1×2, 41×2, and 1×2 dimensional data respectively. Finally, the data vuv, lf0 and its derivatives, mgc and its derivatives, bap and its derivatives are normalized to obtain a total of 130-dimensional serialized data. The 130-dimensional serialized data is used as the target dimensional feature.

The target dimension feature is formatted to meet the input format requirements of the target conversion model, and the feature data obtained by the format conversion is the feature to be converted. Exemplarily, when the input format of the target conversion model is required to be binary data, binary conversion is performed on the target dimensional feature, and the obtained binary data is the feature to be converted.

In one embodiment, the target conversion model runs based on the Computer Unified Device Architecture Recursive Neural Network Toolkit (Computed Unified Device Architecture Recursive Neural Network Toolkit, CURRENNT).

Among them, CURRENNT is an open source parallel implementation of a deep parallel neural network (Recurrent Neural Network, RNN). It supports the Graphics Processing Unit (GPU) through NVIDIA's Computer Unified Device Architecture (CUDA). CURRENNT supports one-way and two-way RNNs with Long Short-Term Memory (LSTM) storage units, thereby overcoming the problem of vanishing gradients.

Put the target conversion model in CURRENNT, the target conversion model is in the running state, and put the features to be converted into the same CURRENNT, the features to be converted will be input into the target conversion model, and the target The conversion model outputs the target feature corresponding to the feature to be converted.

As shown in Figure 3, in one embodiment, the method further includes:

Step 306: Perform segmentation processing on the voice to be converted to obtain multiple segmented voices.

Due to the limited computing resources of offline devices, if the duration of the voice to be converted is long, the voice to be converted is directly converted, the running speed is slow, and the real-time voice conversion cannot be realized. The voice to be converted is processed in segments to obtain multiple segmented voices. Due to the short duration of the segmented voices, the conversion can be performed quickly, thereby greatly improving the running speed. Exemplarily, when the duration of the voice to be converted is greater than the preset duration, the voice to be converted is segmented according to a preset condition. As shown in FIG. 4, the voice to be converted 41 is divided into 3 segments evenly according to the length of time, and 3 segmented voices 42 are obtained.

Step 308: Perform feature extraction on the multiple segmented voices to obtain multiple segmented features.

Among them, the segmented feature refers to the feature to be converted corresponding to each segmented voice.

The feature extraction is performed on each segmented voice respectively, and the feature to be converted corresponding to each segmented voice is obtained according to the extracted features, that is, the segmented feature of each segmented voice is obtained.

Step 310: Input each of the segmented features into the target conversion model in parallel to obtain a target segmented feature corresponding to each of the segmented features.

Among them, the target segment feature refers to the target feature corresponding to each segment feature.

After obtaining multiple segmented features, call multiple cores of the central processing unit (CPU) to convert multiple segmented features at the same time, open multiple processes, and each process individually executes the input of the segmented features to the target conversion In the model, the target segment feature corresponding to the segment feature is obtained. Inputting each of the segmented features into the target conversion model in parallel is much faster than sequentially converting each of the segmented features, thereby facilitating real-time speech conversion.

Step 312: Obtain a target voice according to the target segment feature corresponding to each of the segment features.

The target segmented features corresponding to each of the segmented features can be synthesized to obtain the target feature, and the target speech can be obtained according to the target feature; the corresponding target segmented speech can also be obtained according to the target segmented feature, and the segmented speech can be synthesized Target voice. Exemplarily, the voice to be converted is segmented into 5 segmented voices, and 5 corresponding segmented features are obtained from the 5 segmented voices, and the 5 corresponding segmented features are input into the target conversion model to obtain 5 corresponding targets Segmentation features: According to the 5 corresponding target segmentation features, 5 corresponding target segmented voices are obtained, and the 5 corresponding target segmented voices can be synthesized to obtain the target voice.

In one embodiment, any two of the target segmentation features that are adjacent in time among the plurality of target segmentation features include overlapping features, and step 312 is obtained according to the target segmentation feature corresponding to each of the segmentation features. The target voice includes: obtaining the result according to the target segment feature corresponding to each of the segment features and the overlapping features of any two target segment features adjacent in time among the plurality of target segment features. Describe the target voice.

As shown in FIG. 4, in order to prevent the segmentation processing of the voice to be converted 41 from causing errors in subsequent feature extraction or loss of some features, during segmentation processing, multiple segmented voices 42 may be adjacent in time. Any two of the segmented speech 42 includes an overlapping portion 421.

Wherein, the overlapping feature refers to that the overlapping part 421 included in any two segmented voices 42 adjacent in time in the plurality of segmented speeches 42 is converted to obtain the corresponding target feature.

The target segmentation features corresponding to each of the segmentation features are merged together to obtain a merged feature, according to the overlapping features of any two target segmentation features that are adjacent in time among the plurality of target segmentation features , The target feature can be obtained by adjusting the merged feature, and then the target voice can be obtained according to the target feature. Exemplarily, the voice to be converted is segmented into 2 segmented voices, and 2 target segmented features are obtained after conversion. The target segmented feature I is (A+C _A ), and the target segmented feature II is (C _B +B), the overlap feature of the standard segment feature I and the target segment feature II is C. In the process of obtaining the target feature, the first 1/2 of the overlap feature C in the target segment feature I can be retained, that is, before C _A. The last 1/2 of the overlapping feature C in the target segmentation feature II, that is, after C _B , the target feature is (A + CA _before + C _{B after} + B), and the target voice is obtained according to the target feature.

In one embodiment, it is obtained according to the target segmentation feature corresponding to each of the segmentation features and the overlapping feature of any two target segmentation features adjacent in time among the plurality of the target segmentation features The target speech includes: acquiring a feature weight set, the feature weight set includes a first feature weight and a second feature weight, the first feature weight and the second feature weight are any two targets that are adjacent in time The weights corresponding to the overlapping features in the segmented features; according to the target segmented feature corresponding to each of the segmented features, any two target segments that are adjacent in time among the plurality of target segmented features The overlapping features of the features and the feature weight set obtain the target speech.

Among them, the feature weight set is used to determine the weights of overlapping features of any two target segment features that are adjacent in time in the two target segment features.

Exemplarily, the voice to be converted is segmented into 2 segmented voices, and 2 target segmented features are obtained after conversion. The target segmented feature I is (A+C _A ), and the target segmented feature II is (C _B +B), the overlap feature of the standard segment feature I and the target segment feature II is C, the first feature weight in the feature weight set is m, which is used to determine the weight of the overlap feature C in the target segment feature I, and the second feature The weight is n, which is used to determine the weight of the overlapping feature C in the target segmented feature II. The target feature of the voice to be converted is (A+m×C _A +n×C _B +B), and the target voice is obtained according to the target feature.

As shown in FIG. 5, in one embodiment, a voice conversion device is provided, and the device includes:

The obtaining module 502 is configured to obtain the voice to be converted and the original conversion model, and the format of the original conversion model is an online format;

The format conversion module 504 is used for format conversion of the original conversion model to obtain an offline format target conversion model;

The feature extraction module 506 is configured to perform feature extraction on the voice to be converted to obtain the feature to be converted;

The feature conversion module 508 is configured to input the features to be converted into the target conversion model to obtain the target features output by the target conversion model;

The result module 510 is configured to obtain a target voice according to the target feature output by the target conversion model, the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from the voice to be converted.

The above voice conversion device obtains the voice to be converted and the original conversion model. Since the original conversion model cannot work in an offline state, the features of the voice to be converted are extracted to obtain the features to be converted, and after the format of the original conversion model is converted to an offline format, According to the features to be converted and the target conversion model in offline format, the target features can be obtained, and then the target speech can be obtained according to the target features. This voice conversion method can not only perform high-quality voice conversion in an offline state, but also runs fast, and can realize real-time voice conversion.

In one embodiment, the feature extraction module 506 is configured to perform periodic feature extraction and aperiodic feature extraction on the voice to be converted to obtain periodic features and aperiodic features corresponding to the voice to be converted, and the periodic features include Fundamental frequency and spectrum envelope; the feature to be converted is obtained according to the periodic feature and the non-periodic feature.

In an embodiment, the feature extraction module 506 is specifically configured to obtain a target dimensional feature according to the periodic feature and the aperiodic feature, and the target dimensional feature has a higher dimension than the periodic feature and the aperiodic feature The sum of the dimensions; format conversion of the target dimensional feature to obtain the feature to be converted.

In one embodiment, the target conversion model runs based on a computer unified device architecture recurrent neural network toolkit framework.

In one embodiment, the feature extraction module 506 is configured to perform segmentation processing on the voice to be converted to obtain multiple segmented voices, and perform feature extraction on the multiple segmented voices to obtain multiple segmented features The feature conversion module 508 is configured to input each of the segmented features into the target conversion model in parallel to obtain the target segmented feature corresponding to each of the segmented features; the result module 510 is configured to The target segmentation feature corresponding to each of the segmentation features obtains the target voice.

In one embodiment, any two of the target segmentation features that are adjacent in time among the plurality of target segmentation features include overlapping features, and the result module 510 is configured to correspond to each of the segmentation features. The target voice is obtained by overlapping features of the target segmentation feature and any two temporally adjacent target segmentation features among the plurality of target segmentation features.

In one embodiment, the result module 510 is used to obtain a feature weight set, the feature weight set includes a first feature weight and a second feature weight, and the first feature weight and the second feature weight are relative in time. The weight corresponding to the overlapping feature in any two adjacent target segmentation features; according to the target segmentation feature corresponding to each of the segmentation features, and the temporally adjacent ones of the multiple target segmentation features The target speech is obtained by overlapping features of any two target segmented features and the feature weight set.

Fig. 6 shows an internal structure diagram of a computer device in an embodiment. The computer device can be a terminal, a server, or a voice conversion device. As shown in Figure 6, the computer device includes a processor, a memory, and a network interface connected through a system bus. Among them, the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program. When the computer program is executed by the processor, the processor can realize the voice conversion method. A computer program may also be stored in the internal memory, and when the computer program is executed by the processor, the processor can execute the voice conversion method. Those skilled in the art can understand that the structure shown in FIG. 6 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied. The specific computer device may Including more or fewer parts than shown in the figure, or combining some parts, or having a different arrangement of parts.

In one embodiment, a computer device is provided, which includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:

The above-mentioned computer equipment obtains the voice to be converted and the original conversion model. Since the original conversion model cannot work in an offline state, the features of the voice to be converted are extracted to obtain the features to be converted, and after the format of the original conversion model is converted to the offline format, according to The feature to be converted and the target conversion model in the offline format can obtain the target feature, and then the target voice can be obtained according to the target feature. This voice conversion method can not only perform high-quality voice conversion in an offline state, but also runs fast, and can realize real-time voice conversion.

In one embodiment, the performing feature extraction on the voice to be converted to obtain the feature to be converted includes: performing periodic feature extraction and aperiodic feature extraction on the voice to be converted to obtain the period corresponding to the voice to be converted Features and aperiodic features, where the periodic features include a fundamental frequency and a spectrum envelope; the features to be converted are obtained according to the periodic features and the aperiodic features.

In one embodiment, the obtaining the feature to be converted according to the periodic feature and the aperiodic feature includes: obtaining a target dimensional feature according to the periodic feature and the aperiodic feature, and the target dimensional feature has a high dimensionality Based on the sum of the dimensions of the periodic feature and the non-periodic feature; performing format conversion on the target dimensional feature to obtain the feature to be converted.

In one embodiment, the target conversion model runs based on the recurrent neural network toolkit framework of a computer unified device architecture.

In one embodiment, the performing feature extraction on the voice to be converted to obtain the feature to be converted includes: performing segmentation processing on the voice to be converted to obtain multiple segmented voices; Perform feature extraction on speech to obtain multiple segmented features; the inputting the features to be converted into the target conversion model to obtain the target features output by the target conversion model includes: parallelizing each of the segmented features Inputting the target conversion model to obtain the target segment feature corresponding to each of the segment features; the obtaining the target voice according to the target feature output by the target conversion model includes: according to each of the segment features corresponding The target voice is obtained by the target segmentation feature.

In one embodiment, any two of the target segmentation features that are adjacent in time among the plurality of target segmentation features include overlapping features; the target segmentation feature corresponding to each of the segmentation features is obtained The target voice includes: obtaining the result according to the target segment feature corresponding to each of the segment features and the overlapping features of any two target segment features adjacent in time among the plurality of target segment features. Describe the target voice.

In an embodiment, the target segmentation feature corresponding to each of the segmentation features and the overlap of any two target segmentation features adjacent in time among the plurality of the target segmentation features Obtaining the target voice from features includes: acquiring a feature weight set, the feature weight set including a first feature weight and a second feature weight, and the first feature weight and the second feature weight are any two that are adjacent in time. Weights corresponding to overlapping features in the target segmentation features; according to the target segmentation feature corresponding to each of the segmentation features, any two targets that are adjacent in time among the plurality of target segmentation features The overlapping features of the segmented features and the feature weight set are used to obtain the target speech.

In one embodiment, a computer-readable storage medium is provided that stores a computer program, and when the computer program is executed by a processor, the processor executes the following steps:

The above-mentioned computer-readable storage medium obtains the voice to be converted and the original conversion model. Since the original conversion model cannot work in an offline state, the features of the voice to be converted are extracted to obtain the features to be converted, and the format of the original conversion model is converted to an offline format. Then, the target features can be obtained according to the features to be converted and the target conversion model in offline format, and then the target speech can be obtained according to the target features. This voice conversion method can not only perform high-quality voice conversion in an offline state, but also runs fast, and can realize real-time voice conversion.

In an embodiment, any two of the target segmentation features that are adjacent in time among the plurality of target segmentation features include overlapping features; and the target segmentation feature is obtained according to the target segmentation feature corresponding to each of the segmentation features. The target speech includes: obtaining the result according to the target segment feature corresponding to each of the segment features and the overlapping features of any two target segment features that are adjacent in time among the plurality of target segment features. Describe the target voice.

It should be noted that the above-mentioned voice conversion method, voice conversion device, computer equipment, and computer-readable storage medium belong to a general inventive concept. The content in the embodiments of the voice conversion method, voice conversion device, computer equipment, and computer-readable storage medium Can be applied to each other.

A person of ordinary skill in the art can understand that all or part of the processes in the above-mentioned embodiment methods can be implemented by instructing relevant hardware through a computer program. The program can be stored in a non-volatile computer readable storage medium. Here, when the program is executed, it may include the procedures of the above-mentioned method embodiments. Wherein, any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. As an illustration and not a limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain Channel (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.

The technical features of the above embodiments can be combined arbitrarily. In order to make the description concise, all possible combinations of the technical features in the above embodiments are not described. However, as long as there is no contradiction in the combination of these technical features, they should be It is considered as the range described in this specification.

The above-mentioned embodiments only express several implementation manners of the present application, and their descriptions are more specific and detailed, but they should not be understood as a limitation to the scope of the present application. It should be pointed out that for those of ordinary skill in the art, without departing from the concept of this application, several modifications and improvements can be made, and these all fall within the protection scope of this application. Therefore, the scope of protection of the patent of this application shall be subject to the appended claims.

Claims

A voice conversion method, the method includes:

Acquiring the voice to be converted and the original conversion model, where the format of the original conversion model is an online format;

Format conversion of the original conversion model to obtain a target conversion model in offline format;

Performing feature extraction on the voice to be converted to obtain the feature to be converted;

Input the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model;

The target voice is obtained according to the target feature output by the target conversion model, the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from the voice to be converted.
The voice conversion method according to claim 1, wherein the feature extraction of the voice to be converted to obtain the feature to be converted comprises:

Performing periodic feature extraction and aperiodic feature extraction on the voice to be converted to obtain periodic features and aperiodic features corresponding to the voice to be converted, the periodic features including a fundamental frequency and a spectrum envelope;

The feature to be converted is obtained according to the periodic feature and the non-periodic feature.
The voice conversion method according to claim 2, wherein the obtaining the feature to be converted according to the periodic feature and the non-periodic feature comprises:

Obtaining a target dimensional feature according to the periodic feature and the aperiodic feature, where the dimensionality of the target dimensional feature is higher than the sum of the dimensionality of the periodic feature and the aperiodic feature;

Perform format conversion on the target dimensional feature to obtain the feature to be converted.
The speech conversion method according to claim 1, wherein the target conversion model runs based on a recurrent neural network toolkit framework of a computer unified device architecture.
The voice conversion method according to claim 1, wherein the feature extraction of the voice to be converted to obtain the feature to be converted comprises:

Performing segmentation processing on the voice to be converted to obtain multiple segmented voices;

Performing feature extraction on the multiple segmented voices to obtain multiple segmented features;

The inputting the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model includes:

Input each of the segmented features into the target conversion model in parallel to obtain a target segmented feature corresponding to each of the segmented features;

The obtaining the target voice according to the target feature output by the target conversion model includes:

The target voice is obtained according to the target segment feature corresponding to each of the segment features.
The voice conversion method according to claim 5, wherein any two of the target segment features that are adjacent in time among the plurality of target segment features include overlapping features; The target voice is obtained by the target segmentation feature corresponding to the segment feature, including:

The target voice is obtained according to the target segment feature corresponding to each of the segment features and the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features.
The voice conversion method according to claim 6, wherein the target segment feature corresponding to each of the segment features is adjacent in time among the plurality of target segment features The target speech is obtained by overlapping features of any two target segmentation features, including:

Obtain a feature weight set, the feature weight set includes a first feature weight and a second feature weight, the first feature weight and the second feature weight are overlapping features of any two target segmented features that are adjacent in time Corresponding weight;

Obtained according to the target segmentation feature corresponding to each of the segmentation features, the overlapping feature of any two target segmentation features adjacent in time among the plurality of the target segmentation features, and the feature weight set The target voice.
A voice conversion device, characterized in that the device includes:

An acquisition module for acquiring the voice to be converted and the original conversion model, the format of the original conversion model is an online format;

A format conversion module for format conversion of the original conversion model to obtain a target conversion model in an offline format;

The feature extraction module is used to perform feature extraction on the voice to be converted to obtain the feature to be converted;

The feature conversion module is configured to input the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model;

The result module is configured to obtain a target voice according to the target feature output by the target conversion model, the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from the voice to be converted.
A computer-readable storage medium storing a computer program, and when the computer program is executed by a processor, the processor executes the steps of the voice conversion method according to any one of claims 1 to 7.
A computer device, comprising a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the speech according to any one of claims 1 to 7 Steps of the conversion method.