CN111108558B - Voice conversion method, device, computer equipment and computer readable storage medium


Info

Publication number
CN111108558B
CN111108558B
Authority
CN
China
Prior art keywords
target
feature
voice
converted
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201980003120.8A
Other languages
Chinese (zh)
Other versions
CN111108558A (en)
Inventor
刘洋
李柏
丁万
黄东延
熊友军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youbixuan Intelligent Robot Co ltd
Original Assignee
Ubtech Robotics Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ubtech Robotics Corp filed Critical Ubtech Robotics Corp
Publication of CN111108558A publication Critical patent/CN111108558A/en
Application granted granted Critical
Publication of CN111108558B publication Critical patent/CN111108558B/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiments of the invention disclose a voice conversion method and apparatus, a computer device, and a computer-readable storage medium. The voice conversion method comprises the following steps: acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format; performing format conversion on the original conversion model to obtain a target conversion model in an offline format; performing feature extraction on the voice to be converted to obtain features to be converted; inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model; and obtaining a target voice according to the target features output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted. This voice conversion method not only performs high-quality voice conversion in an offline state, but also runs fast enough to achieve real-time voice conversion.

Description

Voice conversion method, device, computer equipment and computer readable storage medium
Technical Field
The present invention relates to the field of audio processing technologies, and in particular, to a voice conversion method, a voice conversion device, a computer device, and a computer readable storage medium.
Background
Voice conversion is a technology that converts a source voice into a target voice while keeping the semantic content unchanged, where the source voice is speech uttered in a first voice and the target voice is speech uttered in a second voice. That is, voice conversion turns source speech uttered in the first voice into target speech with the same meaning uttered in the second voice.
With the rapid development of deep neural network technology, deep-learning-based voice conversion achieves high speaker similarity, good voice quality, and good fluency in the converted speech. Current deep-learning-based voice conversion mainly comprises two steps: first a conversion model is trained with a large amount of voice data, and then the trained model performs the conversion. Training places high demands on computing resources, while an offline device has few resources and low performance; training on an offline device easily exhausts its resources, and even where training is possible it is inefficient, with a time cost too high for practical use. Consequently, the deep-learning-based voice conversion function has only been realizable on high-performance online servers and cannot be used in an offline state.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a voice conversion method, apparatus, computer device, and storage medium capable of performing high-quality voice conversion in an offline state.
A voice conversion method, the method comprising:
acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
performing feature extraction on the voice to be converted to obtain features to be converted;
inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model;
and obtaining a target voice according to the target features output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted.
A voice conversion apparatus, the apparatus comprising:
an acquisition module, used for acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
a format conversion module, used for performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
a feature extraction module, used for performing feature extraction on the voice to be converted to obtain features to be converted;
a feature conversion module, used for inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model;
and a result module, used for obtaining a target voice according to the target features output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
performing feature extraction on the voice to be converted to obtain features to be converted;
inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model;
and obtaining a target voice according to the target features output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted.
A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
performing feature extraction on the voice to be converted to obtain features to be converted;
inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model;
and obtaining a target voice according to the target features output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted.
The embodiment of the invention has the following beneficial effects:
according to the voice conversion method, the voice conversion device, the computer equipment and the computer readable storage medium, the voice to be converted and the original conversion model are obtained, and the original conversion model cannot work in an offline state, so that the characteristics of the voice to be converted are extracted to obtain the characteristics to be converted, after the format of the original conversion model is converted into the offline format, the target characteristics can be obtained according to the characteristics to be converted and the target conversion model in the offline format, and then the target voice is obtained according to the target characteristics. The voice conversion method not only can carry out voice conversion with high quality in an off-line state, but also has high running speed and can realize real-time voice conversion.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the invention, and that a person skilled in the art may obtain other drawings from them without inventive effort.
Wherein:
FIG. 1 is a diagram of an application environment for a speech conversion method in one embodiment;
FIG. 2 is a flow chart of a method of speech conversion in one embodiment;
FIG. 3 is a flow chart of a method of speech conversion in one embodiment;
FIG. 4 is a schematic diagram of segmentation processing of speech to be converted in one embodiment;
FIG. 5 is a block diagram of a speech conversion device according to one embodiment;
FIG. 6 is a block diagram of a computer device in one embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort fall within the scope of the present invention.
FIG. 1 is a diagram of an application environment for a speech conversion method in one embodiment. As shown in fig. 1, the voice conversion method is applied to a voice conversion system. The voice conversion system comprises a terminal, which can be a desktop terminal or a mobile terminal; the mobile terminal can be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The terminal comprises a microphone, a conversion unit, and a player: the microphone acquires the voice to be converted, the conversion unit converts the voice to be converted into a target voice with the same speech content but a different voice, and the player plays the target voice.
As shown in fig. 2, in one embodiment, a speech conversion method is provided. The method can be applied to a terminal, a server, or another voice conversion device; this embodiment takes a voice conversion device as an example. In an offline state, after the voice conversion device acquires the voice to be converted, it can obtain, through this voice conversion method, a target voice that has the same speech content as the voice to be converted but a different voice. The voice conversion method specifically comprises the following steps:
step 202: and acquiring the voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format.
The voice to be converted refers to speech uttered in the source voice that is to be converted into the target voice.
The online format refers to a storage format of a file that can be opened or run normally only in a network-connected state.
The original conversion model is a model whose input is the features to be converted of the voice to be converted and whose output is the target features of the target voice; it is used, in a network-connected state, to obtain the target features of the target voice from the features to be converted of the voice to be converted.
Step 204: and carrying out format conversion on the original conversion model to obtain an offline format target conversion model.
The offline format refers to a storage format of a file that can still be opened or run normally when disconnected from the network.
The target conversion model is used for obtaining the target features of the target voice from the features to be converted of the voice to be converted in a network-disconnected state.
Format conversion is performed on the original conversion model to obtain the target conversion model in an offline format. Illustratively, the original conversion model is a model file trained with the TensorFlow framework (a machine learning library developed by Google and used from the Python language), and its storage format is the online format CheckPoint (abbreviated ckpt); this storage format can be converted into the offline format JetSoft Shield Now (abbreviated jsn) to obtain the target conversion model. The ckpt-format original conversion model records a large amount of extra information, such as parameters and data used while training it, which the offline voice conversion process does not need. Removing this redundant data when converting the storage format into the jsn format effectively simplifies and compresses the model file, which increases the running speed in the offline state, thereby increasing the voice conversion speed and making real-time voice conversion achievable.
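The patent does not disclose the ckpt-to-jsn converter itself. As a rough, hedged analogue of the idea (stripping training-only data so that only what inference needs remains), the following sketch freezes a TensorFlow v1 checkpoint into a single constant graph; the file paths and the output node name are assumptions, and the frozen .pb file stands in for, but is not, the jsn format described above.

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Load the graph structure and trained weights from the ckpt files
# (paths and node name are placeholders, not taken from the patent).
saver = tf.train.import_meta_graph("conversion_model.ckpt.meta")
with tf.Session() as sess:
    saver.restore(sess, "conversion_model.ckpt")
    # Fold variables into constants and drop training-only nodes
    # (optimizer state, gradients), keeping only what inference needs.
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names=["target_features"])

with tf.gfile.GFile("conversion_model_offline.pb", "wb") as f:
    f.write(frozen.SerializeToString())
```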
Step 206: and extracting the characteristics of the voice to be converted to obtain the characteristics to be converted.
The features to be converted are the input of the target conversion model and are used to obtain the target features corresponding to the voice to be converted.
Spectral features of the voice to be converted, such as its mel spectrum, are obtained from the voice to be converted; features are extracted from the voice to be converted on this basis, and the features to be converted are determined from the extracted features.
Step 208: and inputting the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model.
The target features are used to obtain a target voice that has the same speech content as the voice to be converted but a different voice.
In an offline state, with the target conversion model in a running state, the features to be converted are input into the target conversion model, which directly outputs the target features corresponding to the features to be converted.
Step 210: and obtaining target voice according to the target characteristics output by the target conversion model, wherein the voice content of the target voice is the same as the voice to be converted, and the voice of the target voice is different from the voice to be converted.
The target voice refers to speech whose content is the same as that of the voice to be converted but whose voice is different.
Features of the target voice such as the fundamental frequency, spectral envelope, and aperiodicity can be obtained from the target features; the mel spectrum of the target voice is determined from them, and the target voice is obtained from its mel spectrum. Illustratively, the features to be converted are binary 130-dimensional serialized data, and the target features obtained from the target conversion model are likewise 130-dimensional serialized data. The lf0, mgc, and bap feature data of the target voice are obtained by inverse normalization; these are converted into the f0, sp, and ap features through SPTK; the mel spectrum of the target voice is determined from its f0, sp, and ap; and the target voice is obtained from its mel spectrum.
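A minimal sketch of this final synthesis step, assuming the WORLD vocoder exposed by the pyworld package as a stand-in for the patent's SPTK-based pipeline; the denormalized lf0, sp, and ap arrays, the sample rate, and the frame period are assumed inputs.

```python
import numpy as np
import pyworld

def synthesize_target(lf0, sp, ap, fs=16000, frame_period=5.0):
    """Reconstruct a waveform from (assumed) denormalized target features."""
    f0 = np.where(lf0 > 0, np.exp(lf0), 0.0).ravel()  # undo the log-f0 transform
    # WORLD expects float64: f0 of shape (T,), sp and ap of shape (T, fft//2 + 1).
    return pyworld.synthesize(f0.astype(np.float64),
                              sp.astype(np.float64),
                              ap.astype(np.float64),
                              fs, frame_period)
```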
According to the voice conversion method described above, the voice to be converted and the original conversion model are acquired; because the original conversion model cannot work in an offline state, its format is converted into an offline format, features are extracted from the voice to be converted to obtain the features to be converted, the target features are obtained from the features to be converted and the offline-format target conversion model, and the target voice is then obtained from the target features. This voice conversion method not only performs high-quality voice conversion in an offline state, but also runs fast enough to achieve real-time voice conversion.
In one embodiment, the feature extraction on the voice to be converted in step 206 to obtain the features to be converted includes: performing periodic feature extraction and non-periodic feature extraction on the voice to be converted to obtain the periodic features and non-periodic features corresponding to the voice to be converted, wherein the periodic features comprise the fundamental frequency and the spectral envelope; and obtaining the features to be converted according to the periodic features and the non-periodic features.
When a person speaks, several sound sources generate acoustic energy in the vocal tract. The non-periodic sources include breath, frication, and plosive sounds produced at the lips, teeth, throat, and vocal tract, while the periodic source is produced by vocal cord vibration at the glottis. The speech to be converted therefore contains periodic components and non-periodic components, and its spectral features correspondingly include periodic features and non-periodic features. In this embodiment, the mel spectrum of the voice to be converted is used as the spectral feature.
The fundamental frequency (f0) is defined by regarding the signal as a sum of sine waves: the sine wave with the lowest frequency is the fundamental, and the others are overtones. The spectral envelope (sp) is the envelope obtained by connecting the amplitude peaks at different frequencies with a smooth curve. The aperiodic parameter (ap) describes the non-periodic signal components of the speech.
Here, the periodic features refer to the fundamental frequency and the spectral envelope in the mel spectrum of the speech to be converted.
The non-periodic feature refers to the non-periodic sequence in the mel spectrum of the speech to be converted.
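As a sketch of this analysis step, assuming the WORLD analysis functions from the pyworld package (the patent itself names SPTK only for the later processing), the following extracts f0, the spectral envelope, and the aperiodicity from a waveform; `wav` and `fs` are assumed inputs.

```python
import numpy as np
import pyworld

def extract_world_features(wav: np.ndarray, fs: int, frame_period=5.0):
    """Split speech into periodic (f0, sp) and non-periodic (ap) features."""
    wav = wav.astype(np.float64)
    f0, t = pyworld.harvest(wav, fs, frame_period=frame_period)  # fundamental frequency
    sp = pyworld.cheaptrick(wav, f0, t, fs)  # smooth spectral envelope
    ap = pyworld.d4c(wav, f0, t, fs)         # per-band aperiodicity
    return f0, sp, ap
```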
Through processing, feature data that serve as the features to be converted, i.e., the input of the target conversion model, can be obtained from the periodic features and the non-periodic features. Illustratively, a set of feature data is obtained from the periodic and non-periodic features, and the features to be converted are obtained by performing calculations on these data and converting their format.
In one embodiment, obtaining the features to be converted according to the periodic features and the non-periodic features includes: obtaining a target dimension feature according to the periodic features and the non-periodic features, wherein the dimension of the target dimension feature is higher than the sum of the dimensions of the periodic features and the non-periodic features; and performing format conversion on the target dimension feature to obtain the features to be converted.
The target dimension feature is a feature obtained from the periodic features and the non-periodic features whose dimension is higher than theirs. Mapping the low-dimensional periodic and non-periodic features to a high-dimensional target dimension feature improves the quality of the synthesized speech.
Illustratively, the periodic features f0 and sp and the non-periodic feature ap are obtained from the mel spectrum of the voice to be converted and processed with the Speech Signal Processing Toolkit (SPTK) to obtain 1-dimensional lf0 (the logarithm of f0), 41-dimensional mel-generalized cepstral (mgc) data, 1-dimensional band aperiodicity (bap) data, and 1-dimensional voiced/unvoiced (vuv) data. The first and second derivatives of lf0, mgc, and bap are then computed, giving additional data of 1×2, 41×2, and 1×2 dimensions respectively. Finally, vuv, lf0 and its derivatives, mgc and its derivatives, and bap and its derivatives are normalized, yielding 130-dimensional serialized data in total (1 + 1×3 + 41×3 + 1×3 = 130). These 130-dimensional serialized data are used as the target dimension feature.
The target dimension feature is format-converted so that it meets the input format requirement of the target conversion model; the feature data obtained through this format conversion are the features to be converted. For example, when the target conversion model requires binary input, the target dimension feature is converted into binary data, and the resulting binary data are the features to be converted.
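A numpy sketch of assembling the 130-dimensional feature vector described above, assuming lf0, mgc, bap, and vuv have already been computed (e.g., with SPTK) and that per-dimension mean/std statistics are available; the derivative scheme and the final binary layout are assumptions the patent does not fix.

```python
import numpy as np

def deltas(x: np.ndarray) -> np.ndarray:
    """First derivative along the time axis (simple central difference)."""
    return np.gradient(x, axis=0)

def build_features(lf0, mgc, bap, vuv, mean, std) -> bytes:
    """Stack vuv + {lf0, mgc, bap} with 1st/2nd derivatives -> (T, 130).

    Shapes: lf0 (T, 1), mgc (T, 41), bap (T, 1), vuv (T, 1);
    1 + 3*1 + 3*41 + 3*1 = 130 dimensions per frame.
    """
    parts = [vuv]
    for feat in (lf0, mgc, bap):
        d1 = deltas(feat)
        parts += [feat, d1, deltas(d1)]
    stacked = np.concatenate(parts, axis=1)         # (T, 130)
    normalized = (stacked - mean) / std             # per-dimension normalization
    return normalized.astype(np.float32).tobytes()  # binary model input
```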
In one embodiment, the target conversion model runs on the Compute Unified Device Architecture Recurrent Neural Network Toolkit (CURRENNT) framework.
CURRENNT is an open-source parallel implementation of deep recurrent neural networks (RNNs) that supports graphics processing units (GPUs) through NVIDIA's Compute Unified Device Architecture (CUDA). CURRENNT supports unidirectional and bidirectional RNNs with long short-term memory (LSTM) cells, which overcome the vanishing-gradient problem.
The target conversion model is loaded into CURRENNT, where it is in a running state; the features to be converted are then placed in the same CURRENNT instance and input into the target conversion model, which outputs the target features corresponding to the features to be converted.
As shown in fig. 3, in one embodiment, the method further comprises:
step 306: and carrying out segmentation processing on the voice to be converted to obtain a plurality of segmented voices.
Because an offline device has limited computing resources, directly converting a voice to be converted with a long duration is slow and cannot achieve real-time conversion. The voice to be converted is therefore segmented to obtain several segmented voices; since each segment is short, it can be converted quickly, which greatly increases the running speed. Illustratively, when the duration of the voice to be converted exceeds a preset duration, the voice to be converted is segmented according to preset conditions. As shown in fig. 4, the speech 41 to be converted is divided equally by duration into 3 segments, giving 3 segmented voices 42.
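A sketch of this segmentation step, assuming a fixed number of equal-duration segments with a small overlap between neighbors (the overlap is used later when the converted segments are merged); `wav` is an assumed 1-D sample array, and the overlap length is illustrative.

```python
import numpy as np

def segment_speech(wav: np.ndarray, n_segments: int = 3, overlap: int = 800):
    """Split a waveform into equal-length segments sharing `overlap` samples."""
    seg_len = len(wav) // n_segments
    segments = []
    for i in range(n_segments):
        start = max(0, i * seg_len - overlap // 2)
        end = min(len(wav), (i + 1) * seg_len + overlap // 2)
        segments.append(wav[start:end])
    return segments
```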
Step 308: and extracting the characteristics of the segmented voices to obtain a plurality of segmented characteristics.
The segmented features are the features to be converted corresponding to each segmented voice.
Feature extraction is performed on each segmented voice separately, and the features to be converted corresponding to each segmented voice are obtained from the extracted features, giving the segmented feature of each segmented voice.
Step 310: and inputting each segment feature into the target conversion model in parallel to obtain a target segment feature corresponding to each segment feature.
The target segment features refer to target features corresponding to each segment feature.
After the several segmented features are obtained, several cores of the central processing unit (CPU) are invoked to convert them simultaneously: several processes are started, and each process independently inputs one segmented feature into the target conversion model to obtain the target segment feature corresponding to that segmented feature. Inputting the segmented features into the target conversion model in parallel is much faster than converting them one after another, which helps achieve real-time voice conversion.
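A sketch of the parallel conversion step using Python's multiprocessing module, assuming a `convert_segment` function that runs the target conversion model on one segment's features; the function body here is an identity placeholder and the pool size is illustrative.

```python
from multiprocessing import Pool
import numpy as np

def convert_segment(segment_features: np.ndarray) -> np.ndarray:
    """Stand-in for running the target conversion model on one segment;
    a real implementation would invoke the offline-format model here."""
    return segment_features  # identity placeholder

def convert_all(segment_features_list, workers: int = 4):
    # One worker process per segment; pool.map preserves segment order.
    # Call from under `if __name__ == "__main__":` on platforms that spawn processes.
    with Pool(processes=workers) as pool:
        return pool.map(convert_segment, segment_features_list)
```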
Step 312: and obtaining target voice according to the target segmentation characteristics corresponding to each segmentation characteristic.
The target segment features corresponding to the segmented features can be combined into the target features, from which the target voice is obtained; alternatively, the corresponding target segmented voice can be obtained from each target segment feature, and the target segmented voices synthesized into the target voice. Illustratively, the voice to be converted is segmented into 5 segmented voices, 5 corresponding segmented features are obtained from them and input into the target conversion model to obtain 5 corresponding target segment features, 5 corresponding target segmented voices are obtained from these, and the 5 target segmented voices are synthesized into the target voice.
In one embodiment, any two target segment features adjacent in time among the plurality of target segment features include an overlapping feature, and step 312, obtaining the target voice according to the target segment feature corresponding to each segmented feature, includes: obtaining the target voice according to the target segment feature corresponding to each segmented feature and the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features.
As shown in fig. 4, to prevent some features from being corrupted or lost near the cut points when features are subsequently extracted from the speech 41 to be converted, any two segmented voices 42 adjacent in time among the plurality of segmented voices 42 may include an overlapping portion 421 in the segmentation process.
The overlapping feature refers to the target feature obtained by converting the overlapping portion 421 included in any two segmented voices 42 adjacent in time among the plurality of segmented voices 42.
The target segment features corresponding to the segmented features are combined to obtain a combined feature; the combined feature is adjusted according to the overlapping features of any two target segment features adjacent in time to obtain the target features, and the target voice is obtained from the target features. Illustratively, the voice to be converted is segmented into 2 segmented voices, which are converted into 2 target segment features: target segment feature I is (A + C_A) and target segment feature II is (C_B + B), where C is the overlapping feature of target segment feature I and target segment feature II. When the target features are formed, the first half of the overlapping feature C in target segment feature I (the front half of C_A) can be kept, together with the second half of the overlapping feature C in target segment feature II (the back half of C_B); the target features are then (A + front half of C_A + back half of C_B + B), and the target voice is obtained from the target features.
In one embodiment, the obtaining the target voice according to the target segment feature corresponding to each segmented feature and the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features includes: acquiring a feature weight set, wherein the feature weight set comprises a first feature weight and a second feature weight, and the first feature weight and the second feature weight are the weights corresponding to the overlapping feature in any two target segment features adjacent in time; and obtaining the target voice according to the target segment feature corresponding to each segmented feature, the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features, and the feature weight set.
The feature weight set is used to determine the respective weights of the overlapping feature of any two target segment features adjacent in time within those two target segment features.
Illustratively, the voice to be converted is segmented into 2 segmented voices, which are converted into 2 target segment features: target segment feature I is (A + C_A) and target segment feature II is (C_B + B), and the overlapping feature of target segment feature I and target segment feature II is C. The first feature weight in the feature weight set is m, which determines the weight of the overlapping feature C in target segment feature I, and the second feature weight is n, which determines the weight of the overlapping feature C in target segment feature II. The target features of the voice to be converted are then (A + m×C_A + n×C_B + B), and the target voice is obtained from the target features.
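A numpy sketch of this weighted overlap merge, reading the formula above as a weighted sum over the overlapping frames; the scalar weights m and n are assumed to satisfy m + n = 1 so the overlap keeps unit gain (per-frame crossfade ramps would be a natural refinement), and the variable names mirror the example. With m = n = 0.5 this approximates the half-and-half splice of the earlier embodiment.

```python
import numpy as np

def merge_segments(seg1: np.ndarray, seg2: np.ndarray,
                   overlap: int, m: float = 0.5, n: float = 0.5) -> np.ndarray:
    """Merge (A + C_A) and (C_B + B): the shared part becomes m*C_A + n*C_B.

    `overlap` is the number of shared frames and must be > 0.
    """
    a, c_a = seg1[:-overlap], seg1[-overlap:]
    c_b, b = seg2[:overlap], seg2[overlap:]
    c = m * c_a + n * c_b  # weighted combination of the overlapping feature C
    return np.concatenate([a, c, b], axis=0)
```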
As shown in fig. 5, in one embodiment, there is provided a voice conversion apparatus including:
the obtaining module 502 is configured to obtain a voice to be converted and an original conversion model, where the format of the original conversion model is an online format;
a format conversion module 504, configured to perform format conversion on the original conversion model to obtain an offline format target conversion model;
the feature extraction module 506 is configured to perform feature extraction on the speech to be converted to obtain features to be converted;
the feature conversion module 508 is configured to input the feature to be converted into the target conversion model, so as to obtain a target feature output by the target conversion model;
and a result module 510, configured to obtain a target voice according to the target features output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted.
According to the voice conversion apparatus described above, the voice to be converted and the original conversion model are acquired; because the original conversion model cannot work in an offline state, its format is converted into an offline format, features are extracted from the voice to be converted to obtain the features to be converted, the target features are obtained from the features to be converted and the offline-format target conversion model, and the target voice is then obtained from the target features. This apparatus not only performs high-quality voice conversion in an offline state, but also runs fast enough to achieve real-time voice conversion.
In one embodiment, the feature extraction module 506 is configured to perform periodic feature extraction and non-periodic feature extraction on the speech to be converted to obtain the periodic features and non-periodic features corresponding to the speech to be converted, wherein the periodic features comprise the fundamental frequency and the spectral envelope; and to obtain the features to be converted according to the periodic features and the non-periodic features.
In one embodiment, the feature extraction module 506 is specifically configured to obtain a target dimension feature according to the periodic features and the non-periodic features, where the dimension of the target dimension feature is higher than the sum of the dimensions of the periodic features and the non-periodic features; and to perform format conversion on the target dimension feature to obtain the features to be converted.
In one embodiment, the target conversion model runs on the Compute Unified Device Architecture Recurrent Neural Network Toolkit framework.
In one embodiment, the feature extraction module 506 is configured to perform segmentation processing on the speech to be converted to obtain a plurality of segmented voices, and to perform feature extraction on the plurality of segmented voices to obtain a plurality of segmented features; the feature conversion module 508 is configured to input each segmented feature into the target conversion model in parallel to obtain a target segment feature corresponding to each segmented feature; and the result module 510 is configured to obtain the target voice according to the target segment feature corresponding to each segmented feature.
In one embodiment, any two target segment features adjacent in time among the plurality of target segment features include an overlapping feature, and the result module 510 is configured to obtain the target voice according to the target segment feature corresponding to each segmented feature and the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features.
In one embodiment, the result module 510 is configured to acquire a feature weight set, wherein the feature weight set comprises a first feature weight and a second feature weight, and the first feature weight and the second feature weight are the weights corresponding to the overlapping feature in any two target segment features adjacent in time; and to obtain the target voice according to the target segment feature corresponding to each segmented feature, the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features, and the feature weight set.
FIG. 6 illustrates an internal block diagram of a computer device in one embodiment. The computer device may be a terminal, a server, or a voice conversion device. As shown in fig. 6, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system, and may also store a computer program that, when executed by a processor, causes the processor to implement a speech conversion method. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the speech conversion method. It will be appreciated by those skilled in the art that the structure shown in fig. 6 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is presented comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:
acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
performing feature extraction on the voice to be converted to obtain features to be converted;
inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model;
and obtaining a target voice according to the target features output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted.
According to the computer device described above, the voice to be converted and the original conversion model are acquired; because the original conversion model cannot work in an offline state, its format is converted into an offline format, features are extracted from the voice to be converted to obtain the features to be converted, the target features are obtained from the features to be converted and the offline-format target conversion model, and the target voice is then obtained from the target features. The voice conversion it performs not only delivers high quality in an offline state, but also runs fast enough to achieve real-time voice conversion.
In one embodiment, the feature extraction of the speech to be converted to obtain the features to be converted includes: performing periodic feature extraction and non-periodic feature extraction on the voice to be converted to obtain the periodic features and non-periodic features corresponding to the voice to be converted, wherein the periodic features comprise the fundamental frequency and the spectral envelope; and obtaining the features to be converted according to the periodic features and the non-periodic features.
In one embodiment, the obtaining the features to be converted according to the periodic features and the non-periodic features includes: obtaining a target dimension feature according to the periodic features and the non-periodic features, wherein the dimension of the target dimension feature is higher than the sum of the dimensions of the periodic features and the non-periodic features; and performing format conversion on the target dimension feature to obtain the features to be converted.
In one embodiment, the target conversion model runs on the Compute Unified Device Architecture Recurrent Neural Network Toolkit framework.
In one embodiment, the feature extraction of the speech to be converted to obtain the features to be converted includes: performing segmentation processing on the speech to be converted to obtain a plurality of segmented voices; and performing feature extraction on the plurality of segmented voices to obtain a plurality of segmented features. The inputting of the features to be converted into the target conversion model to obtain the target features output by the target conversion model includes: inputting each segmented feature into the target conversion model in parallel to obtain a target segmented feature corresponding to each segmented feature. The obtaining of the target voice according to the target features output by the target conversion model includes: obtaining the target voice according to the target segmented feature corresponding to each segmented feature.
In one embodiment, any two target segment features adjacent in time among the plurality of target segment features comprise an overlapping feature; the obtaining the target voice according to the target segmented feature corresponding to each segmented feature includes: obtaining the target voice according to the target segment feature corresponding to each segmented feature and the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features.
In one embodiment, the obtaining the target voice according to the target segment feature corresponding to each segmented feature and the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features includes: acquiring a feature weight set, wherein the feature weight set comprises a first feature weight and a second feature weight, and the first feature weight and the second feature weight are the weights corresponding to the overlapping feature in any two target segment features adjacent in time; and obtaining the target voice according to the target segment feature corresponding to each segmented feature, the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features, and the feature weight set.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
performing feature extraction on the voice to be converted to obtain features to be converted;
inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model;
and obtaining a target voice according to the target features output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted.
With the above computer-readable storage medium, the voice to be converted and the original conversion model are acquired; because the original conversion model cannot work in an offline state, features are extracted from the voice to be converted to obtain the features to be converted, the format of the original conversion model is converted into an offline format, the target features are obtained from the features to be converted and the offline-format target conversion model, and the target voice is then obtained from the target features. The voice conversion method not only performs high-quality voice conversion in an offline state, but also runs fast enough to achieve real-time voice conversion.
In one embodiment, the feature extraction of the speech to be converted to obtain the features to be converted includes: performing periodic feature extraction and non-periodic feature extraction on the voice to be converted to obtain the periodic features and non-periodic features corresponding to the voice to be converted, wherein the periodic features comprise the fundamental frequency and the spectral envelope; and obtaining the features to be converted according to the periodic features and the non-periodic features.
In one embodiment, the obtaining the features to be converted according to the periodic features and the non-periodic features includes: obtaining a target dimension feature according to the periodic features and the non-periodic features, wherein the dimension of the target dimension feature is higher than the sum of the dimensions of the periodic features and the non-periodic features; and performing format conversion on the target dimension feature to obtain the features to be converted.
In one embodiment, the target conversion model runs on the Compute Unified Device Architecture Recurrent Neural Network Toolkit framework.
In one embodiment, the feature extraction of the speech to be converted to obtain the features to be converted includes: performing segmentation processing on the speech to be converted to obtain a plurality of segmented voices; and performing feature extraction on the plurality of segmented voices to obtain a plurality of segmented features. The inputting of the features to be converted into the target conversion model to obtain the target features output by the target conversion model includes: inputting each segmented feature into the target conversion model in parallel to obtain a target segmented feature corresponding to each segmented feature. The obtaining of the target voice according to the target features output by the target conversion model includes: obtaining the target voice according to the target segmented feature corresponding to each segmented feature.
In one embodiment, any two target segment features adjacent in time among the plurality of target segment features comprise an overlapping feature; the obtaining the target voice according to the target segmented feature corresponding to each segmented feature includes: obtaining the target voice according to the target segment feature corresponding to each segmented feature and the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features.
In one embodiment, the obtaining the target voice according to the target segment feature corresponding to each segmented feature and the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features includes: acquiring a feature weight set, wherein the feature weight set comprises a first feature weight and a second feature weight, and the first feature weight and the second feature weight are the weights corresponding to the overlapping feature in any two target segment features adjacent in time; and obtaining the target voice according to the target segment feature corresponding to each segmented feature, the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features, and the feature weight set.
It should be noted that the above-mentioned voice conversion method, voice conversion apparatus, computer device, and computer-readable storage medium belong to one general inventive concept, and the contents in the embodiments of the voice conversion method, voice conversion apparatus, computer device, and computer-readable storage medium are mutually applicable.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, where the program may be stored in a non-volatile computer readable storage medium, and where the program, when executed, may include processes in the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
The above examples represent only a few embodiments of the present application; although they are described in considerable detail, they are not to be construed as limiting the scope of the present application. It should be noted that a person of ordinary skill in the art may make various modifications and improvements without departing from the spirit of the present application, and these all fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application is defined by the appended claims.

Claims (7)

1. A voice conversion method, the method comprising:
acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
performing feature extraction on the voice to be converted to obtain a feature to be converted;
inputting the feature to be converted into the target conversion model to obtain a target feature output by the target conversion model;
and obtaining a target voice according to the target feature output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted;
the feature extraction of the voice to be converted to obtain the feature to be converted includes:
performing segmentation processing on the voice to be converted to obtain a plurality of segmented voices;
extracting features of the segmented voices to obtain segmented features;
the step of inputting the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model comprises the following steps:
inputting each segmented feature into the target conversion model in parallel to obtain a target segmented feature corresponding to each segmented feature;
the obtaining the target voice according to the target feature output by the target conversion model comprises the following steps:
obtaining the target voice according to the target segmented feature corresponding to each segmented feature;
wherein any two target segmented features adjacent in time among the plurality of target segmented features comprise an overlapping feature; the obtaining the target voice according to the target segmented feature corresponding to each segmented feature comprises:
obtaining the target voice according to the target segmented feature corresponding to each segmented feature and the overlapping feature of any two target segmented features adjacent in time among the plurality of target segmented features;
the obtaining the target voice according to the target segmented feature corresponding to each segmented feature and the overlapping feature of any two target segmented features adjacent in time among the plurality of target segmented features comprises:
acquiring a feature weight set, wherein the feature weight set comprises a first feature weight and a second feature weight, and the first feature weight and the second feature weight are the weights corresponding to the overlapping feature in any two target segmented features adjacent in time;
and obtaining the target voice according to the target segmented feature corresponding to each segmented feature, the overlapping feature of any two target segmented features adjacent in time among the plurality of target segmented features, and the feature weight set.
2. The voice conversion method according to claim 1, wherein the feature extraction of the voice to be converted to obtain the feature to be converted includes:
performing periodic feature extraction and non-periodic feature extraction on the voice to be converted to obtain periodic features and non-periodic features corresponding to the voice to be converted, wherein the periodic features comprise the fundamental frequency and the spectral envelope;
And obtaining the feature to be converted according to the periodic feature and the non-periodic feature.
3. The voice conversion method according to claim 2, wherein the obtaining the feature to be converted according to the periodic feature and the non-periodic feature includes:
obtaining a target dimension characteristic according to the periodic characteristic and the non-periodic characteristic, wherein the dimension of the target dimension characteristic is higher than the sum of the dimensions of the periodic characteristic and the non-periodic characteristic;
and carrying out format conversion on the target dimension characteristics to obtain the characteristics to be converted.
4. The voice conversion method according to claim 1, wherein the target conversion model runs on a Compute Unified Device Architecture Recurrent Neural Network Toolkit framework.
5. A voice conversion apparatus, the apparatus comprising:
an acquisition module, configured to acquire a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
a format conversion module, configured to perform format conversion on the original conversion model to obtain a target conversion model in an offline format;
a feature extraction module, configured to perform feature extraction on the voice to be converted to obtain a feature to be converted;
a feature conversion module, configured to input the feature to be converted into the target conversion model to obtain a target feature output by the target conversion model; and
a result module, configured to obtain a target voice according to the target feature output by the target conversion model, wherein the voice content of the target voice is the same as that of the voice to be converted, and the voice (timbre) of the target voice is different from that of the voice to be converted;
wherein the feature extraction of the voice to be converted to obtain the feature to be converted comprises:
performing segmentation processing on the voice to be converted to obtain a plurality of segmented voices; and
performing feature extraction on each segmented voice to obtain segment features;
wherein inputting the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model comprises:
inputting each segment feature into the target conversion model in parallel to obtain a target segment feature corresponding to each segment feature;
wherein obtaining the target voice according to the target feature output by the target conversion model comprises:
obtaining the target voice according to the target segment feature corresponding to each segment feature;
wherein any two temporally adjacent target segment features among the plurality of target segment features comprise overlapping features, and obtaining the target voice according to the target segment feature corresponding to each segment feature comprises:
obtaining the target voice according to the target segment feature corresponding to each segment feature and the overlapping features of any two temporally adjacent target segment features;
wherein obtaining the target voice according to the target segment feature corresponding to each segment feature and the overlapping features of any two temporally adjacent target segment features comprises:
acquiring a feature weight set, wherein the feature weight set comprises a first feature weight and a second feature weight, the first feature weight and the second feature weight being the weights applied to the overlapping features of any two temporally adjacent target segment features; and
obtaining the target voice according to the target segment feature corresponding to each segment feature, the overlapping features of any two temporally adjacent target segment features, and the feature weight set.
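The format conversion module above turns an online-format model into an offline-format one. As a loose analogy only, and not the patented mechanism, the sketch below serializes a PyTorch model to a self-contained TorchScript file that can later be loaded for inference without the defining Python code; the patent's "online" and "offline" formats are its own notions and need not correspond to these.

```python
import torch

# `ConversionRNN` is the stand-in model from the sketch under claim 4.
model_cpu = ConversionRNN(feat_dim=130).eval()

# Trace with a representative input and serialize; the resulting file is
# self-contained and loadable for inference without the Python class code.
example = torch.randn(1, 200, 130)
offline_model = torch.jit.trace(model_cpu, example)
offline_model.save("conversion_model_offline.pt")

# Later, on the offline side:
restored = torch.jit.load("conversion_model_offline.pt")
```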
6. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the voice conversion method according to any one of claims 1 to 4.
7. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the voice conversion method according to any one of claims 1 to 4.
CN201980003120.8A 2019-12-20 2019-12-20 Voice conversion method, device, computer equipment and computer readable storage medium Active CN111108558B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/126865 WO2021120145A1 (en) 2019-12-20 2019-12-20 Voice conversion method and apparatus, computer device and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN111108558A CN111108558A (en) 2020-05-05
CN111108558B true CN111108558B (en) 2023-08-04

Family

ID=70427470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980003120.8A Active CN111108558B (en) 2019-12-20 2019-12-20 Voice conversion method, device, computer equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN111108558B (en)
WO (1) WO2021120145A1 (en)

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100484666B1 (en) * 2002-12-31 2005-04-22 (주) 코아보이스 Voice Color Converter using Transforming Vocal Tract Characteristic and Method
CN1534595A (en) * 2003-03-28 2004-10-06 中颖电子(上海)有限公司 Speech sound change over synthesis device and its method
CN1645363A (en) * 2005-01-04 2005-07-27 华南理工大学 Portable real-time dialect inter-translation device and method thereof
JP4241736B2 (en) * 2006-01-19 2009-03-18 株式会社東芝 Speech processing apparatus and method
CN105023570B (en) * 2014-04-30 2018-11-27 科大讯飞股份有限公司 A kind of method and system for realizing sound conversion
US9613620B2 (en) * 2014-07-03 2017-04-04 Google Inc. Methods and systems for voice conversion
US9922138B2 (en) * 2015-05-27 2018-03-20 Google Llc Dynamically updatable offline grammar model for resource-constrained offline device
US10896669B2 (en) * 2017-05-19 2021-01-19 Baidu Usa Llc Systems and methods for multi-speaker neural text-to-speech
CN107767879A (en) * 2017-10-25 2018-03-06 北京奇虎科技有限公司 Audio conversion method and device based on tone color
CN109637551A (en) * 2018-12-26 2019-04-16 出门问问信息科技有限公司 Phonetics transfer method, device, equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103430234A (en) * 2011-03-17 2013-12-04 国际商业机器公司 Voice transformation with encoded information
CN107610717A (en) * 2016-07-11 2018-01-19 香港中文大学 Many-one phonetics transfer method based on voice posterior probability
CN107545903A (en) * 2017-07-19 2018-01-05 南京邮电大学 A kind of phonetics transfer method based on deep learning
CN107785030A (en) * 2017-10-18 2018-03-09 杭州电子科技大学 A kind of phonetics transfer method
CN110097890A (en) * 2019-04-16 2019-08-06 北京搜狗科技发展有限公司 A kind of method of speech processing, device and the device for speech processes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ying Yaopeng et al., "Design and Development of a Cross-Software Text-to-Speech APP," Fujian Computer, 2019, Vol. 35, No. 4, pp. 115-116. *

Also Published As

Publication number Publication date
CN111108558A (en) 2020-05-05
WO2021120145A1 (en) 2021-06-24

Similar Documents

Publication Publication Date Title
US11848002B2 (en) Synthesis of speech from text in a voice of a target speaker using neural networks
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
US8571857B2 (en) System and method for generating models for use in automatic speech recognition
CN111433847B (en) Voice conversion method, training method, intelligent device and storage medium
US11741942B2 (en) Text-to-speech synthesis system and method
US11355097B2 (en) Sample-efficient adaptive text-to-speech
CN108492818B (en) Text-to-speech conversion method and device and computer equipment
US9009050B2 (en) System and method for cloud-based text-to-speech web services
US11049491B2 (en) System and method for prosodically modified unit selection databases
CN107240401B (en) Tone conversion method and computing device
WO2021134581A1 (en) Prosodic feature prediction-based speech synthesis method, apparatus, terminal, and medium
WO2023116660A2 (en) Model training and tone conversion method and apparatus, device, and medium
KR20210032809A (en) Real-time interpretation method and apparatus
US20240161727A1 (en) Training method for speech synthesis model and speech synthesis method and related apparatuses
CN113506586A (en) Method and system for recognizing emotion of user
CN113362804A (en) Method, device, terminal and storage medium for synthesizing voice
JP2023529699A (en) clear text echo
CN111108558B (en) Voice conversion method, device, computer equipment and computer readable storage medium
CN112712789A (en) Cross-language audio conversion method and device, computer equipment and storage medium
CN112201253A (en) Character marking method and device, electronic equipment and computer readable storage medium
WO2023116243A1 (en) Data conversion method and computer storage medium
WO2023197206A1 (en) Personalized and dynamic text to speech voice cloning using incompletely trained text to speech models
US11335321B2 (en) Building a text-to-speech system from a small amount of speech data
CN114299910B (en) Training method, using method, device, equipment and medium of speech synthesis model
WO2022140966A1 (en) Cross-language voice conversion method, computer device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231211

Address after: Room 601, 6th Floor, Building 13, No. 3 Jinghai Fifth Road, Beijing Economic and Technological Development Zone (Tongzhou), Tongzhou District, Beijing, 100176

Patentee after: Beijing Youbixuan Intelligent Robot Co.,Ltd.

Address before: 518000 16th and 22nd Floors, C1 Building, Nanshan Zhiyuan, 1001 Xueyuan Avenue, Nanshan District, Shenzhen City, Guangdong Province

Patentee before: Shenzhen UBTECH Technology Co.,Ltd.
