WO2022140966A1 - Cross-language voice conversion method, computer device, and storage medium - Google Patents

Cross-language voice conversion method, computer device, and storage medium

Info

Publication number
WO2022140966A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
training
converted
speech
vector
Prior art date
Application number
PCT/CN2020/140344
Other languages
French (fr)
Chinese (zh)
Inventor
赵之源
王若童
黄东延
Original Assignee
深圳市优必选科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司
Priority to PCT/CN2020/140344
Publication of WO2022140966A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • the present application relates to the field of computer technology, and in particular, to a cross-language voice conversion method, computer device and storage medium.
  • Machine learning and deep learning rely on massive data and the powerful processing power of computers, and have made major breakthroughs in the fields of image, speech, and text. Since the same type of framework can achieve good results in different fields, neural network algorithm models that have been used to solve text and image problems are all applied to the field of speech.
  • the existing neural network models applied in the field of speech can capture the characteristics of the target speaker's voice, so as to stably synthesize other utterances of that speaker, and they approach the level of real people in timbre similarity and naturalness of the language, but
  • the synthesized speech can only be in the same language as the target speaker's own speech;
  • the target speaker's voice cannot be synthesized into speech uttered by that speaker in another language. If the target speaker can only speak Chinese, only Chinese speech can be synthesized in that voice, and speech in other languages cannot be synthesized.
  • an embodiment of the present application provides a method for cross-language voice conversion, the method comprising:
  • the language used by the voice content of the voice to be converted is different from the language used by the voice content of the sample voice;
  • the voice content of the target voice is the same as the voice content of the to-be-converted voice.
  • an embodiment of the present application provides a computer device, comprising a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to perform the following steps:
  • the language used by the voice content of the voice to be converted is different from the language used by the voice content of the sample voice;
  • the voice content of the target voice is the same as the voice content of the to-be-converted voice.
  • an embodiment of the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, causes the processor to perform the following steps:
  • the language used by the voice content of the voice to be converted is different from the language used by the voice content of the sample voice;
  • the voice content of the target voice is the same as the voice content of the to-be-converted voice.
  • in the embodiments of the present application, a voice to be converted and a sample voice whose voice contents use different languages are obtained, and both are input into a pre-trained voice conversion model to obtain a target voice whose voice content is the same as that of the voice to be converted while simulating the sample voice;
  • this solves the problem that the target speaker's voice cannot be synthesized into speech uttered by the target speaker in another language, and achieves the beneficial effect of synthesizing the target user's voice across languages.
  • Fig. 1 is the application environment diagram of the cross-language speech conversion method in one embodiment
  • Fig. 2 is the flow chart of the cross-language speech conversion method in one embodiment
  • Fig. 3 is the flow chart of step S130 in the cross-language speech conversion method in one embodiment
  • Fig. 4 is a flowchart of step S110 in the cross-language speech conversion method in one embodiment
  • Fig. 5 is the flow chart of step S120 in the cross-language speech conversion method in one embodiment
  • Fig. 6 is a flowchart of step S410 in the cross-language speech conversion method in one embodiment
  • Fig. 7 is the flow chart of the speech conversion model training method in one embodiment
  • FIG. 8 is a structural block diagram of a computer device in one embodiment.
  • FIG. 1 is an application environment diagram of a method for cross-language speech conversion in one embodiment.
  • the cross-language voice conversion method is applied to a cross-language voice conversion system.
  • the cross-language voice conversion system includes a terminal 110 and a server 120 .
  • the terminal 110 and the server 120 are connected through a network, and the terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like.
  • the server 120 can be implemented by an independent server or a server cluster composed of multiple servers.
  • the terminal 110 is used to acquire the voice to be converted and the sample voice of the target user and upload them to the server 120.
  • the language used for the voice content of the voice to be converted is different from the language used for the voice content of the sample voice.
  • the server 120 is used to receive the voice to be converted and the sample voice of the target user; preprocess the voice to be converted to obtain the voice feature to be converted, and preprocess the sample voice to obtain the sample voice feature; take the voice feature to be converted and the sample voice feature as input and use a pre-trained voice conversion model to obtain the target voice feature; and convert the target voice feature into a target voice that simulates the sample voice, the voice content of the target voice being the same as the voice content of the voice to be converted.
  • the above-mentioned cross-language voice conversion method can also be applied directly to the terminal 110. In that case the terminal 110 obtains the voice to be converted and the sample voice of the target user, the language used by the voice content of the voice to be converted being different from the language used by the voice content of the sample voice; preprocesses the voice to be converted to obtain the voice feature to be converted and preprocesses the sample voice to obtain the sample voice feature;
  • takes the voice feature to be converted and the sample voice feature as input and uses a pre-trained voice conversion model to obtain the target voice feature; and converts the target voice feature into a target voice that simulates the sample voice, the voice content of the target voice being the same as the voice content of the voice to be converted.
  • a method for cross-language speech conversion is provided.
  • the method can be applied to both a terminal and a server, and this embodiment is described by taking the application to a terminal as an example.
  • the cross-language voice conversion method specifically includes the following steps:
  • when executing the cross-language voice conversion method, the user may run it on a mobile device, such as a mobile phone.
  • first, the user needs to input the voice to be converted and the sample voice of the target user, wherein the voice content of the voice to be converted is the content the user ultimately wants to obtain,
  • and the sample voice of the target user carries the vocal characteristics in which the user ultimately wants that content spoken.
  • in addition, the language used by the voice content of the voice to be converted is different from the language used by the voice content of the sample voice; for example, the voice to be converted may be Chinese while the sample voice is English, or the voice to be converted may mix English and Chinese while the sample voice is English. As long as the two languages are partially different, or not entirely the same, they are regarded as different. For instance, to make a target speaker who only speaks Chinese say "Yes", the user simply records "Yes" as the voice to be converted and obtains any piece of that speaker's Chinese speech as the sample voice.
  • S120 Preprocess the speech to be converted to obtain speech features to be converted, and perform preprocessing on the sample speech to obtain sample speech features.
  • the voice conversion model is a neural network model, which is pre-trained with a large number of user voices.
  • the input and output during the training process are also voice features.
  • the voice conversion model can extract the voice content from the voice features to be converted and the vocal characteristics from the sample voice features and combine them; therefore, after the voice features to be converted and the sample voice features are input into the pre-trained voice conversion model, the target voice features are obtained.
  • finally, the target voice feature needs to be converted into the target voice by another preset neural network model; the target voice obtained from the target voice feature produced by the voice conversion model simulates the vocal characteristics of the sample voice, and the voice content it utters is the voice content of the voice to be converted, so cross-language voice conversion is completed.
  • the other preset neural network model may be a WaveNet neural network model, a WaveRNN neural network model, and so on.
  • in the embodiments of the present application, a voice to be converted and a sample voice whose voice contents use different languages are obtained, and both are input into a pre-trained voice conversion model to obtain a target voice whose voice content is the same as that of the voice to be converted while simulating the sample voice;
  • this solves the problem that the target speaker's voice cannot be synthesized into speech uttered by the target speaker in another language, and achieves the beneficial effect of synthesizing the target user's voice across languages.
  • step S130 specifically includes:
  • the speech feature to be converted is the Mel cepstrum to be converted
  • the example speech feature is the example Mel cepstrum.
  • the speech conversion model includes a first encoder, a second encoder, a length regulator and a decoder.
  • the first encoder is built based on the FastSpeech framework, and the first encoder includes FFT Block (Feed-Forward Transformer Block, FFT block), which is based on a non-autoregressive self-attention mechanism and a one-dimensional convolutional neural network.
  • the network is generated, so that the first encoder does not depend on the output of the previous frame, and can perform parallel operations, thereby greatly speeding up the generation of target speech features.
  • the first encoder includes a CNN (Convolutional Neural Network) model, a Positional Encoding (position-based embedding) model and an FFT Block
  • the second encoder includes an LSTM (Long Short-Term Memory, long short-term memory network) model , Linear (linear regression algorithm) model, as well as pooling layer and normalization layer
  • length adjuster includes CNN model and Linear model
  • decoder includes FFT Block, Linear model, Post-Net and output layer.
  • the Mel cepstrum to be converted is input into the first encoder, and the CNN model in the first encoder is used to compress the Mel cepstrum to be converted to obtain Bottle-neck features, so as to better extract speech content, and then based on the parallel operation of the FFT Block, the first vector is quickly output.
  • the vector length of the first vector is the maximum input sequence length in the batch (Batch), and the remaining sequences that are not long enough are zero-padded at the end.
  • the first vector is used as the extracted speech content.
  • the partial example Mel cepstrum is input to the second encoder, and the second encoder outputs a second vector, wherein the partial example Mel cepstrum is randomly intercepted from the example speech feature, i.e. the example Mel cepstrum.
  • specifically, after the example speech is converted into the example Mel cepstrum, a preset number of segments are randomly cut from the target user's example Mel cepstrum and spliced together as the partial example Mel cepstrum; the resulting second vector serves as the extracted vocal characteristics.
  • the length regulator can also obtain, through its own two convolutional layers, the predicted extension length of each frame in the third vector according to the third vector, which is equivalent to predicting the duration of each frame in the Mel cepstrum, and extend the third vector into the fourth vector according to the predicted extension lengths.
  • for example, if the speech content corresponding to the third vector is "你好吗" ("How are you") and its feature length is 3, and the predicted extension lengths obtained by the length regulator from the third vector are [4, 2, 3], then in the resulting fourth vector the feature length of "你" is 4, that of "好" is 2, and that of "吗" is 3.
  • the fourth vector is input to the decoder to obtain the predicted Mel cepstrum, and the predicted Mel cepstrum is used as the target speech feature.
  • the first encoder does not depend on the output of the previous frame and can perform parallel operations, thereby greatly speeding up the generation of target speech features.
  • step S110 specifically includes:
  • S310: Acquire the text to be converted and the example voice of the target user.
  • S320: Convert the text to be converted into synthesized speech as the voice to be converted, the language used by the speech content of the voice to be converted being different from the language used by the speech content of the example voice.
  • if the voice read aloud by the user were used directly as the input voice feature of the subsequent voice conversion model, factors on the user's side, such as coughing or slurred speech, could interfere with the input voice feature; to avoid this,
  • the text to be converted is obtained, its text content being the same as the voice content of the voice to be converted, and then TTS (Text To Speech) technology is used to convert the text to be converted into synthesized speech, which serves as the voice to be converted. By converting text with the same content into clear and accurate synthesized speech, interference caused by the user is eliminated.
  • the function of the first encoder in the speech conversion model is to remove the vocal characteristic s_i of the input speech from the input sequence and keep only the speech content c; each frame of the input sequence can then be regarded as carrying a content component together with the speaker component, as formulated in the detailed description below.
  • step S120 specifically includes:
  • when the voice to be converted is preprocessed to obtain the voice features to be converted, a short-time Fourier transform is first performed on the voice to be converted to obtain the amplitude spectrum and the phase spectrum, which converts the waveform of the voice to be converted from the time domain to the frequency domain and facilitates the extraction of speech features; only the amplitude spectrum is then filtered to obtain the Mel spectrum.
  • the filter used for filtering can be a Filter Bank (Mel filter bank), which is designed around the way human hearing resolves different frequencies:
  • the filters at low frequencies are denser while the filters at high frequencies are sparser, so the filtering result better matches human auditory perception.
  • cepstral analysis is then performed on the Mel spectrum, and the resulting Mel cepstrum to be converted is used as the speech feature to be converted. It should be noted that the example voice needs to be processed in the same way as the voice to be converted, which is not repeated in this embodiment of the present application (a preprocessing sketch in code follows this list).
  • by converting the speech to be converted into a Mel cepstrum, the embodiment of the present application not only approximates the characteristics of the human vocal mechanism and the nonlinear auditory system, but also facilitates the training and the input and output of the neural network model.
  • step S410 specifically includes:
  • since there are blank (silent) portions at the head and tail of the speech to be converted, in order for the speech conversion model to align, learn and convert better, the blanks at the beginning and end of the speech to be converted are removed before the short-time Fourier transform is performed to obtain the amplitude spectrum, giving the first modified speech to be converted. In addition, to better suit the short-time Fourier transform, after the first modified speech to be converted is obtained it is also pre-emphasized, framed and windowed to obtain the second modified speech to be converted.
  • steps S510 and S520 in this embodiment of the present application may be selectively executed according to user requirements.
  • a method for training a speech conversion model is provided.
  • the method can be applied to both a terminal and a server, and this embodiment is described by taking the application to a terminal as an example.
  • the speech conversion model training specifically includes the following steps:
  • Preprocess the training speech to obtain training speech features, preprocess the first training example speech to obtain first training example speech features, and preprocess the second training example speech to obtain second training example speech features.
  • the training example speech includes the first training example speech and the second training example speech, wherein the speech content of the first training example speech is the same as the speech content of the training speech,
  • and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech.
  • the first training example speech is the predicted speech we ultimately want to obtain,
  • while the second training example speech provides the speech features used as model input.
  • the training speech needs to be preprocessed to obtain the training speech features
  • the first training example speech is preprocessed to obtain the first training example speech features
  • the second training example speech is preprocessed to obtain the second training example speech features
  • the training speech feature is the training Mel cepstrum
  • the first training example speech feature is the first training example Mel cepstrum
  • the second training example speech feature is the second training example Mel cepstrum.
  • after the training predicted Mel cepstrum is obtained, the training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum, i.e. the loss between the predicted value and the actual value, also needs to be calculated; finally, backpropagation is performed according to the training loss to update the training weights of the speech conversion model until the speech conversion model converges (a training-step sketch in code follows this list).
  • for example, if the training speech includes "YES",
  • the first training example speech with the same content, i.e. "YES" uttered by the training user, needs to be obtained at the same time,
  • together with a second training example speech whose content is in a different language,
  • i.e. speech uttered by the training user in another language, such as "good". When there is enough data in the training set, the "good" uttered by the training user serves as the first training example speech whenever the training speech includes "good", and in that case there is no need to additionally acquire a second training example speech.
  • the languages used by the speech content of the training speech include the language used by the speech content of the speech to be converted in actual use, that is, the language of the speech to be converted participates in the training of the speech conversion model; the training users also include the target user, that is, the target user participates in the training of the speech conversion model as a training user, so that cross-language conversion can be achieved more accurately.
  • since the first encoder does not depend on the output of the previous frame, the training speed of the speech conversion model is greatly accelerated.
  • Figure 8 shows an internal structure diagram of a computer device in one embodiment.
  • the computer device may be a terminal or a server.
  • the computer device includes a processor, memory, and a network interface connected by a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system, and also stores a computer program, which, when executed by the processor, enables the processor to implement a method for cross-language voice conversion.
  • a computer program can also be stored in the internal memory, and when the computer program is executed by the processor, can cause the processor to execute the cross-language speech conversion method.
  • a computer device comprising a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor performs the following steps:
  • the voice feature to be converted is a Mel cepstrum to be converted
  • the example voice feature is an example Mel cepstrum
  • the voice conversion model includes a first encoder, a second encoder, a length regulator and a decoder
  • the first encoder includes an FFT Block
  • the voice feature to be converted and the sample voice feature are used as input
  • obtaining the target voice feature using the pre-trained voice conversion model includes: inputting the Mel cepstrum to be converted into the first encoder to obtain a first vector;
  • inputting a partial example Mel cepstrum into the second encoder to obtain a second vector, the partial example Mel cepstrum being obtained by random interception from the example Mel cepstrum;
  • splicing the first vector and the second vector to obtain a third vector;
  • inputting the third vector into the length regulator to obtain a fourth vector;
  • inputting the fourth vector into the decoder to obtain a predicted Mel cepstrum as the target voice feature.
  • the first encoder is configured to compress the Mel cepstrum to obtain the first vector
  • the length regulator is configured to obtain the predicted extension length of each frame in the third vector according to the third vector and to extend the third vector into the fourth vector according to the predicted extension lengths.
  • the training of the speech conversion model includes: acquiring a training speech, a first training example speech and a second training example speech of a training user, wherein the speech content of the first training example speech is the same as the speech content of the training speech,
  • and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech;
  • the training speech is preprocessed to obtain training speech features;
  • the first training example speech is preprocessed to obtain first training example speech features;
  • the second training example speech is preprocessed to obtain the second training example speech feature
  • the training speech feature is the training Mel cepstrum
  • the first training example speech feature is the first training example Mel cepstrum
  • the training Mel cepstrum is input to the first encoder to obtain a first vector;
  • part of the second training example Mel cepstrum is input to the second encoder to obtain a second vector, the partial second training example Mel cepstrum being obtained by random interception from the second training example Mel cepstrum;
  • the third vector is obtained by splicing the first vector and the second vector;
  • the third vector is input to the length adjuster to obtain the fourth vector;
  • the fourth vector is input to the decoder to obtain the training predicted Mel cepstrum; the training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum is computed; and backpropagation is performed based on the training loss to update the training weights of the speech conversion model until the speech conversion model converges.
  • the acquiring the speech to be converted includes: acquiring the text to be converted; and converting the text to be converted into synthesized speech as the speech to be converted.
  • the preprocessing of the speech to be converted to obtain the speech features to be converted includes: performing a short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum; filtering the amplitude spectrum to obtain a Mel spectrum; and performing cepstral analysis on the Mel spectrum to obtain the Mel cepstrum to be converted, which is used as the speech feature to be converted.
  • the performing of the short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum includes: removing the blank parts at the beginning and end of the speech to be converted to obtain a first modified speech to be converted; performing pre-emphasis, framing and windowing on the first modified speech to be converted to obtain a second modified speech to be converted; and performing a short-time Fourier transform on the second modified speech to be converted to obtain the amplitude spectrum.
  • a computer-readable storage medium which stores a computer program, and when the computer program is executed by a processor, causes the processor to perform the following steps:
  • the voice feature to be converted is a Mel cepstrum to be converted
  • the example voice feature is an example Mel cepstrum
  • the voice conversion model includes a first encoder, a second encoder, a length regulator and a decoder
  • the first encoder includes an FFT Block
  • the voice feature to be converted and the sample voice feature are used as input
  • obtaining the target voice feature using the pre-trained voice conversion model includes: inputting the Mel cepstrum to be converted into the first encoder to obtain a first vector;
  • inputting a partial example Mel cepstrum into the second encoder to obtain a second vector, the partial example Mel cepstrum being obtained by random interception from the example Mel cepstrum;
  • splicing the first vector and the second vector to obtain a third vector;
  • inputting the third vector into the length regulator to obtain a fourth vector;
  • inputting the fourth vector into the decoder to obtain a predicted Mel cepstrum as the target voice feature.
  • the first encoder is configured to compress the Mel cepstrum to obtain the first vector
  • the length regulator is configured to obtain the predicted extension length of each frame in the third vector according to the third vector and to extend the third vector into the fourth vector according to the predicted extension lengths.
  • the training of the speech conversion model includes: acquiring a training speech, a first training example speech and a second training example speech of a training user, wherein the speech content of the first training example speech is the same as the speech content of the training speech,
  • and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech;
  • the training speech is preprocessed to obtain training speech features;
  • the first training example speech is preprocessed to obtain first training example speech features;
  • the second training example speech is preprocessed to obtain the second training example speech feature
  • the training speech feature is the training Mel cepstrum
  • the first training example speech feature is the first training example Mel cepstrum
  • the training Mel cepstrum is input to the first encoder to obtain a first vector;
  • part of the second training example Mel cepstrum is input to the second encoder to obtain a second vector, the partial second training example Mel cepstrum being obtained by random interception from the second training example Mel cepstrum;
  • the third vector is obtained by splicing the first vector and the second vector;
  • the third vector is input to the length adjuster to obtain the fourth vector;
  • the fourth vector is input to the decoder to obtain the training predicted Mel cepstrum; the training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum is computed; and backpropagation is performed based on the training loss to update the training weights of the speech conversion model until the speech conversion model converges.
  • the acquiring the speech to be converted includes: acquiring the text to be converted; and converting the text to be converted into synthesized speech as the speech to be converted.
  • the preprocessing of the speech to be converted to obtain the speech features to be converted includes: performing a short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum; filtering the amplitude spectrum to obtain a Mel spectrum; and performing cepstral analysis on the Mel spectrum to obtain the Mel cepstrum to be converted, which is used as the speech feature to be converted.
  • the performing of the short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum includes: removing the blank parts at the beginning and end of the speech to be converted to obtain a first modified speech to be converted; performing pre-emphasis, framing and windowing on the first modified speech to be converted to obtain a second modified speech to be converted; and performing a short-time Fourier transform on the second modified speech to be converted to obtain the amplitude spectrum.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM), among others.
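The preprocessing chain referenced in the list above (trim leading and trailing silence, pre-emphasis, framing and windowing, short-time Fourier transform, Mel filtering, cepstral analysis) can be sketched as follows. This is a minimal illustration assuming librosa and commonly used frame and band sizes; the patent does not prescribe these values or this library.

```python
# Sketch of the preprocessing described above (steps S510/S520 and S120).
# Hop/window sizes, 80 Mel bands and 40 cepstral coefficients are illustrative
# assumptions; the patent does not fix these values.

import librosa
import numpy as np
import scipy.fftpack

def mel_cepstrum(path: str, sr: int = 16000, n_fft: int = 1024,
                 hop: int = 256, n_mels: int = 80, n_ceps: int = 40) -> np.ndarray:
    wav, _ = librosa.load(path, sr=sr)
    wav, _ = librosa.effects.trim(wav)                 # remove leading/trailing blanks
    wav = librosa.effects.preemphasis(wav)             # pre-emphasis
    # framing + windowing + short-time Fourier transform; keep only the magnitude
    mag = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop, window="hann"))
    # Mel filter bank: denser filters at low frequencies, sparser at high ones
    mel = librosa.feature.melspectrogram(S=mag ** 2, sr=sr, n_mels=n_mels)
    # cepstral analysis: log followed by a DCT along the frequency axis
    cep = scipy.fftpack.dct(np.log(mel + 1e-6), axis=0, norm="ortho")[:n_ceps]
    return cep.T                                       # shape: (frames, n_ceps)
```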
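Likewise, the training procedure described above (predict a Mel cepstrum from the training speech plus intercepted second-training-example segments, compare it with the first training example Mel cepstrum, and backpropagate the loss) can be sketched as one training step. The optimizer and the L1 reconstruction loss are illustrative assumptions, not choices stated in the patent.

```python
# One training step of the speech conversion model as described above.
# The optimizer choice and the L1 reconstruction loss are assumptions.

import torch

def train_step(model, optimizer, training_mel, example_segment_mel, target_mel):
    """training_mel: Mel cepstrum of the training speech (model input, language A)
    example_segment_mel: randomly intercepted second-training-example segments
    target_mel: first training example Mel cepstrum (same content, ground truth)"""
    optimizer.zero_grad()
    predicted_mel = model(training_mel, example_segment_mel)
    loss = torch.nn.functional.l1_loss(predicted_mel, target_mel)
    loss.backward()                 # backpropagate the training loss
    optimizer.step()                # update the training weights
    return loss.item()
```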

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A cross-language voice conversion method, comprising: obtaining a voice to be converted and an example voice of a target user, wherein the language used by voice content of the voice to be converted is different from the language used by voice content of the example voice (S110); preprocessing the voice to be converted to obtain a voice feature to be converted, and preprocessing the example voice to obtain an example voice feature (S120); taking the voice feature to be converted and the example voice feature as inputs, and using a pre-trained voice conversion model to obtain a target voice feature (S130); and converting the target voice feature into a target voice simulating the example voice, wherein voice content of the target voice is the same as the voice content of the voice to be converted (S140). Thus, the cross-language synthesis of the voice of the target user is implemented. Also provided are a computer device and a storage medium.

Description

Cross-language voice conversion method, computer device and storage medium
Technical Field
The present application relates to the field of computer technology, and in particular to a cross-language voice conversion method, a computer device and a storage medium.
Background Art
Machine learning and deep learning rely on massive data and the powerful processing capability of computers, and have made major breakthroughs in the fields of images, speech and text. Since the same type of framework can achieve good results in different fields, neural network models that were originally used to solve text and image problems have also been applied to speech.
Existing neural network models applied to speech can capture the characteristics of a target speaker's voice and thus stably synthesize other utterances of that speaker, approaching human level in timbre similarity and naturalness of the language. However, the synthesized speech can only be in the same language as the target speaker's own speech; the target speaker's voice cannot be synthesized into speech in another language. If the target speaker only speaks Chinese, only Chinese speech can be synthesized in that voice, not speech in other languages.
Content of the Application
In view of this, it is necessary to provide a cross-language voice conversion method, a computer device and a storage medium that address the above problem.
In a first aspect, an embodiment of the present application provides a cross-language voice conversion method, the method comprising:
acquiring a voice to be converted and an example voice of a target user, wherein the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice;
preprocessing the voice to be converted to obtain a voice feature to be converted, and preprocessing the example voice to obtain an example voice feature;
taking the voice feature to be converted and the example voice feature as input, and using a pre-trained voice conversion model to obtain a target voice feature;
converting the target voice feature into a target voice that simulates the example voice, the voice content of the target voice being the same as the voice content of the voice to be converted.
In a second aspect, an embodiment of the present application provides a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
acquiring a voice to be converted and an example voice of a target user, wherein the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice;
preprocessing the voice to be converted to obtain a voice feature to be converted, and preprocessing the example voice to obtain an example voice feature;
taking the voice feature to be converted and the example voice feature as input, and using a pre-trained voice conversion model to obtain a target voice feature;
converting the target voice feature into a target voice that simulates the example voice, the voice content of the target voice being the same as the voice content of the voice to be converted.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
acquiring a voice to be converted and an example voice of a target user, wherein the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice;
preprocessing the voice to be converted to obtain a voice feature to be converted, and preprocessing the example voice to obtain an example voice feature;
taking the voice feature to be converted and the example voice feature as input, and using a pre-trained voice conversion model to obtain a target voice feature;
converting the target voice feature into a target voice that simulates the example voice, the voice content of the target voice being the same as the voice content of the voice to be converted.
In the embodiments of the present application, a voice to be converted and an example voice whose voice contents use different languages are obtained and input into a pre-trained voice conversion model, yielding a target voice whose content is the same as that of the voice to be converted while simulating the example voice. This solves the problem that a target speaker's voice cannot be synthesized into speech in another language and achieves cross-language synthesis of the target user's voice.
Description of the Drawings
In order to describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
In the drawings:
Fig. 1 is an application environment diagram of the cross-language voice conversion method in one embodiment;
Fig. 2 is a flowchart of the cross-language voice conversion method in one embodiment;
Fig. 3 is a flowchart of step S130 of the cross-language voice conversion method in one embodiment;
Fig. 4 is a flowchart of step S110 of the cross-language voice conversion method in one embodiment;
Fig. 5 is a flowchart of step S120 of the cross-language voice conversion method in one embodiment;
Fig. 6 is a flowchart of step S410 of the cross-language voice conversion method in one embodiment;
Fig. 7 is a flowchart of the voice conversion model training method in one embodiment;
Fig. 8 is a structural block diagram of a computer device in one embodiment.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
Fig. 1 is an application environment diagram of the cross-language voice conversion method in one embodiment. Referring to Fig. 1, the cross-language voice conversion method is applied to a cross-language voice conversion system. The cross-language voice conversion system includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network. The terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer and the like. The server 120 may be implemented by an independent server or by a server cluster composed of multiple servers. The terminal 110 is used to acquire the voice to be converted and the example voice of the target user and upload them to the server 120, the language used by the voice content of the voice to be converted being different from the language used by the voice content of the example voice. The server 120 is used to receive the voice to be converted and the example voice of the target user; preprocess the voice to be converted to obtain the voice feature to be converted, and preprocess the example voice to obtain the example voice feature; take the voice feature to be converted and the example voice feature as input and use a pre-trained voice conversion model to obtain the target voice feature; and convert the target voice feature into a target voice that simulates the example voice, the voice content of the target voice being the same as the voice content of the voice to be converted.
In another embodiment, the above cross-language voice conversion method may also be applied directly to the terminal 110. The terminal 110 is used to acquire the voice to be converted and the example voice of the target user, the language used by the voice content of the voice to be converted being different from the language used by the voice content of the example voice; preprocess the voice to be converted to obtain the voice feature to be converted, and preprocess the example voice to obtain the example voice feature; take the voice feature to be converted and the example voice feature as input and use a pre-trained voice conversion model to obtain the target voice feature; and convert the target voice feature into a target voice that simulates the example voice, the voice content of the target voice being the same as the voice content of the voice to be converted.
As shown in Fig. 2, in one embodiment a cross-language voice conversion method is provided. The method can be applied to a terminal or to a server; this embodiment takes application to a terminal as an example. The cross-language voice conversion method specifically includes the following steps:
S110: Acquire the voice to be converted and the example voice of the target user, the language used by the voice content of the voice to be converted being different from the language used by the voice content of the example voice.
In this embodiment, the user may execute the cross-language voice conversion method on a mobile device such as a mobile phone. First, the user needs to input the voice to be converted and the example voice of the target user, where the voice content of the voice to be converted is the content the user ultimately wants to obtain and the example voice of the target user carries the vocal characteristics in which the user ultimately wants that content spoken. In addition, the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice: the voice to be converted may be Chinese while the example voice is English, or the voice to be converted may mix English and Chinese while the example voice is English. It should be noted that as long as the languages of the two voice contents are partially different, or not entirely the same, they are regarded as different. For example, if the user wants a target speaker A, who only speaks Chinese, to say "Yes", the user only needs to record "Yes" as the voice to be converted and obtain an example voice of A, which can be any piece of Chinese speech spoken by A.
S120: Preprocess the voice to be converted to obtain the voice feature to be converted, and preprocess the example voice to obtain the example voice feature.
S130: Take the voice feature to be converted and the example voice feature as input, and use a pre-trained voice conversion model to obtain the target voice feature.
S140: Convert the target voice feature into a target voice that simulates the example voice, the voice content of the target voice being the same as the voice content of the voice to be converted.
In this embodiment, after the voice to be converted and the example voice are obtained, the voice to be converted is preprocessed to obtain the voice feature to be converted and the example voice is preprocessed to obtain the example voice feature, so that they can be fed to the voice conversion model. The voice conversion model is a neural network model trained in advance on the voices of a large number of training users, and its inputs and outputs during training are also voice features. The voice conversion model can extract the voice content from the voice feature to be converted and the vocal characteristics from the example voice feature and combine them, so inputting the voice feature to be converted and the example voice feature into the pre-trained voice conversion model yields the target voice feature. Finally, the target voice feature is converted into the target voice by another preset neural network model. The target voice obtained from the target voice feature produced by the voice conversion model simulates the vocal characteristics of the example voice, while its voice content is that of the voice to be converted; since the two voice contents use different languages, cross-language voice conversion is completed. The other preset neural network model may be a WaveNet neural network model, a WaveRNN neural network model, or the like.
In the embodiments of the present application, a voice to be converted and an example voice whose voice contents use different languages are obtained and input into a pre-trained voice conversion model, yielding a target voice whose content is the same as that of the voice to be converted while simulating the example voice. This solves the problem that a target speaker's voice cannot be synthesized into speech in another language and achieves cross-language synthesis of the target user's voice.
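Putting steps S110 to S140 together, the inference flow can be sketched as follows. The helper names and call signatures are assumptions made for illustration only; the conversion model and the WaveNet/WaveRNN-style vocoder are treated as black boxes here.

```python
# Hypothetical end-to-end inference sketch of steps S110-S140. The helper names
# (extract_mel_cepstrum, conversion_model, vocoder) are illustrative assumptions,
# not APIs defined by the patent; the vocoder stands in for a WaveNet/WaveRNN model.

import numpy as np

def extract_mel_cepstrum(waveform: np.ndarray, sr: int) -> np.ndarray:
    """Preprocessing placeholder for S120 (STFT -> Mel filtering -> cepstral analysis)."""
    raise NotImplementedError("see the preprocessing sketch earlier in this document")

def cross_language_convert(src_wav, example_wav, sr, conversion_model, vocoder):
    src_feat = extract_mel_cepstrum(src_wav, sr)            # S120: content carrier
    example_feat = extract_mel_cepstrum(example_wav, sr)     # S120: timbre carrier
    target_feat = conversion_model(src_feat, example_feat)   # S130: pre-trained model
    return vocoder(target_feat)                              # S140: neural vocoder -> waveform
```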
In one embodiment, as shown in Fig. 3, step S130 specifically includes:
S210: Input the Mel cepstrum to be converted into the first encoder to obtain a first vector.
S220: Input a partial example Mel cepstrum into the second encoder to obtain a second vector, the partial example Mel cepstrum being obtained by random interception from the example Mel cepstrum.
In this embodiment, the voice feature to be converted is the Mel cepstrum to be converted and the example voice feature is the example Mel cepstrum. After the voice feature to be converted and the example voice feature are obtained, they can be input into the pre-trained voice conversion model, which includes a first encoder, a second encoder, a length regulator and a decoder. The first encoder is built on the FastSpeech framework and includes FFT Blocks (Feed-Forward Transformer Blocks). An FFT Block is built from a non-autoregressive self-attention mechanism and a one-dimensional convolutional neural network, so the first encoder does not depend on the output of the previous frame and can run in parallel, which greatly speeds up the generation of the target voice feature. Specifically, the first encoder includes a CNN (convolutional neural network) model, a positional encoding model and FFT Blocks; the second encoder includes an LSTM (Long Short-Term Memory) model, a Linear model, a pooling layer and a normalization layer; the length regulator includes a CNN model and a Linear model; and the decoder includes FFT Blocks, a Linear model, a Post-Net and an output layer.
Specifically, the Mel cepstrum to be converted is input into the first encoder, where the CNN model compresses it into bottleneck features so that the speech content can be extracted better, and the FFT Blocks then output the first vector quickly through parallel computation. The vector length of the first vector is the maximum input sequence length in the batch, and the remaining shorter sequences are zero-padded at the end; the resulting first vector serves as the extracted speech content. A partial example Mel cepstrum is then input into the second encoder, which outputs a second vector; the partial example Mel cepstrum is obtained by random interception from the example voice feature, i.e. the example Mel cepstrum. Specifically, after the example voice is converted into the example Mel cepstrum, a preset number of segments are randomly cut from the target user's example Mel cepstrum and spliced together as the partial example Mel cepstrum, and the resulting second vector serves as the extracted vocal characteristics.
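The module layout just described (a FastSpeech-style content encoder, an LSTM-based example-voice encoder, a length regulator and an FFT-block decoder) can be pictured with a much-simplified PyTorch sketch. Dimensions, the number of FFT Blocks, and the omission of positional encoding, pooling/normalization details and the decoder with its Post-Net are simplifications for illustration, not the patent's actual configuration.

```python
# Much-simplified sketch of the encoder modules described above. Sizes and the
# FFT-block internals are illustrative assumptions; positional encoding and the
# decoder/Post-Net are omitted for brevity.

import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """Feed-Forward Transformer block: self-attention + 1-D convolutions."""
    def __init__(self, dim=256, heads=2, kernel=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                        # x: (batch, frames, dim)
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + c)

class ContentEncoder(nn.Module):
    """First encoder: CNN bottleneck + FFT blocks over the source Mel cepstrum."""
    def __init__(self, n_mel=80, dim=256):
        super().__init__()
        self.bottleneck = nn.Conv1d(n_mel, dim, kernel_size=1)
        self.blocks = nn.Sequential(FFTBlock(dim), FFTBlock(dim))

    def forward(self, mel):                      # mel: (batch, frames, n_mel)
        x = self.bottleneck(mel.transpose(1, 2)).transpose(1, 2)
        return self.blocks(x)                    # per-frame "first vector"

class SpeakerEncoder(nn.Module):
    """Second encoder: LSTM + linear projection + mean pooling over example segments."""
    def __init__(self, n_mel=80, dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mel, dim, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, example_mel):              # (batch, frames, n_mel)
        h, _ = self.lstm(example_mel)
        return torch.nn.functional.normalize(self.proj(h.mean(dim=1)), dim=-1)
```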
S230: Splice the first vector and the second vector to obtain a third vector.
S240: Input the third vector into the length regulator to obtain a fourth vector.
S250: Input the fourth vector into the decoder to obtain a predicted Mel cepstrum as the target voice feature.
In this embodiment, after the first vector and the second vector are obtained, they are spliced into a third vector, which is input into the length regulator. Because the first vector has been compressed by the first encoder, the length regulator uses its own two convolutional layers to obtain, from the third vector, a predicted extension length for each frame of the third vector, which amounts to predicting the duration of each frame in the Mel cepstrum, and then extends the third vector into the fourth vector according to the predicted extension lengths. For example, if the speech content corresponding to the third vector is "你好吗" ("How are you") with a feature length of 3 and the length regulator predicts the extension lengths [4, 2, 3] from the third vector, then in the resulting fourth vector the feature length of "你" is 4, that of "好" is 2 and that of "吗" is 3. Finally, the fourth vector is input into the decoder to obtain the predicted Mel cepstrum, which is used as the target voice feature.
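The length-regulation step in the "你好吗" example can be reproduced in a few lines; the duration predictor itself (the regulator's two convolutional layers) is omitted and the example durations [4, 2, 3] are hard-coded.

```python
# Sketch of length regulation: each frame of the third vector is repeated by its
# predicted extension length. Durations are the example values from the text.

import torch

third_vector = torch.randn(3, 256)          # 3 frames ("你", "好", "吗")
durations = torch.tensor([4, 2, 3])         # predicted extension lengths

fourth_vector = torch.repeat_interleave(third_vector, durations, dim=0)
print(fourth_vector.shape)                  # torch.Size([9, 256])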
In this embodiment of the present invention, because the FFT Blocks are generated from a non-autoregressive self-attention mechanism and a one-dimensional convolutional neural network, the first encoder does not depend on the output of the previous frame and can run in parallel, which greatly speeds up the generation of the target speech feature.
In one embodiment, as shown in FIG. 4, step S110 specifically includes:
S310. Acquire the text to be converted and the example speech of the target user.
S320. Convert the text to be converted into synthesized speech as the speech to be converted, where the language used by the speech content of the speech to be converted is different from the language used by the speech content of the example speech.
In this embodiment, if speech read aloud by the user were directly acquired as the speech to be converted and used as the input speech feature of the subsequent voice conversion model, factors attributable to the user, such as coughing or slurred pronunciation, could interfere with the input speech feature. To avoid this, the text to be converted is acquired in this embodiment, where the text content of the text to be converted is the same as the speech content of the speech to be converted, and TTS (Text-To-Speech) technology is then used to convert the text to be converted into synthesized speech, which serves as the speech to be converted. By converting text with the same content into clear and accurate synthesized speech, interference arising from the user's own behavior is eliminated.
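A small sketch of the input pipeline this embodiment describes: the text to be converted is rendered by a TTS engine, and that synthesized waveform, rather than a user recording, is what gets preprocessed. `synthesize_speech` is a placeholder for whatever TTS system is used; it is not an API named in the patent.

```python
def build_input_speech(text_to_convert, synthesize_speech):
    """Return the 'speech to be converted' from text via a caller-supplied TTS function.

    `synthesize_speech(text) -> waveform` is a stand-in for any TTS engine. Using a
    fixed synthetic voice keeps the input voice characteristic constant (the s_0 of
    the derivation below), so user-dependent artefacts such as coughs never reach
    the voice conversion model.
    """
    return synthesize_speech(text_to_convert)
```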
Further, to illustrate that using synthesized speech as the input to the voice conversion model removes interference attributable to the user, consider the model in use. Assume the feature sequence of the speech feature to be converted is x = (x_1, x_2, …, x_n), where n denotes the n-th frame on the time axis of the Mel cepstrum to be converted, and the feature sequence of the target speech feature predicted by the voice conversion model is y = (y_1, y_2, …, y_m), where m likewise denotes the m-th frame on the time axis of the predicted Mel cepstrum. The aim is for the feature sequence predicted by the voice conversion model to be as close as possible to the target feature sequence ŷ of the actual speech feature. (The mathematical symbols in this passage appear only as images in the published text; ŷ, s_i and ŝ_t are used here to stand for them.)
It is assumed that every frame of the input feature sequence contains two latent variables: one is the speech content of the input speech, c = (c_1, c_2, …, c_n), and the other is the voice characteristic of the input speech, s = (s_1, s_2, …, s_i); the target sequence ŷ likewise contains the voice characteristic of the target user. Here i denotes the input speech and t denotes the target user, with i ∈ {1, 2, …, j} and t ∈ {1, 2, …, k}, where j is the number of input speech samples in the whole input data set and k is the number of target users in the whole input data set.
The role of the first encoder in the voice conversion model is to remove the voice characteristic s_i of the input speech from the input sequence and keep only the speech content c, so that the input sequence can be expressed in the form of Equation (1) (published as an image).
Because the method converts TTS-synthesized speech into real human speech in order to separate the user's voice characteristic from the speech content, the input speech carries only a single voice characteristic, namely that of the synthesized speech, denoted s_0, which can be regarded as a constant. By Bayes' theorem, Equation (1) can then be rewritten as Equation (2) (published as an image). For the predicted sequence y, the same reasoning gives Equation (3) (published as an image).
Here ŝ_t is the output of the second encoder and c is the output of the first encoder; the two are combined, adjusted by the length regulator, and used as the input of the decoder, which finally outputs the predicted sequence y. Since c and ŝ_t come from two different sequences, they can be regarded as mutually independent. Combining Equations (2) and (3) therefore gives Equation (4) (published as an image).
Equation (4) shows that when the input speech is a fixed synthesized speech, the predicted sequence y depends only on the input sequence x, the training user's voice characteristic ŝ_t, and the speech content c. This removes the interference that directly acquiring speech read aloud by the user as the input speech would otherwise introduce into the extraction of speech content by the voice conversion model.
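The four equations referenced above are published only as images, so they cannot be reproduced exactly here. Under the assumption that the argument is the usual probabilistic one, a plausible reading of Equations (1)-(4) is sketched below; the exact notation in the original filing may differ.

```latex
% Hedged reconstruction only -- the original equations appear as images in the filing.
\begin{align}
  x &\sim p\bigl(x \mid c,\; s_i\bigr)
      && \text{(1) input frames generated from content and input voice}\\
  p\bigl(c \mid x,\, s_0\bigr)
    &= \frac{p\bigl(x \mid c,\, s_0\bigr)\,p(c)}{p\bigl(x \mid s_0\bigr)}
       \;\propto\; p\bigl(c \mid x\bigr)
      && \text{(2) Bayes' rule with the constant synthetic voice } s_0\\
  y &\sim p\bigl(y \mid c,\; \hat{s}_t\bigr)
      && \text{(3) target frames from content and target voice}\\
  p\bigl(y \mid x,\, \hat{s}_t\bigr)
    &= \sum_{c} p\bigl(y \mid c,\, \hat{s}_t\bigr)\, p\bigl(c \mid x\bigr)
      && \text{(4) combine (2) and (3), assuming } c \perp \hat{s}_t
\end{align}
```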
In one embodiment, as shown in FIG. 5, step S120 specifically includes:
S410. Perform a short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum.
S420. Filter the amplitude spectrum to obtain a Mel spectrum.
S430. Perform cepstral analysis on the Mel spectrum to obtain the Mel cepstrum to be converted as the speech feature to be converted.
In this embodiment, when the speech to be converted is preprocessed to obtain the speech feature to be converted, a short-time Fourier transform is first performed on the speech to be converted, yielding an amplitude spectrum and a phase spectrum. This converts the waveform of the speech to be converted from the time domain to the frequency domain and facilitates the extraction of speech features; only the amplitude spectrum is taken and filtered to obtain the Mel spectrum. The filtering may be performed with a Filter Bank. Reflecting the principle that human hearing resolves low-frequency sounds more finely than high-frequency sounds, the filters are denser with larger threshold values at low frequencies and sparser with smaller threshold values at high frequencies, so the filtering result better matches human auditory characteristics. Finally, to obtain features closer to the human vocal mechanism and to the nonlinear human auditory system, cepstral analysis is performed on the Mel spectrum to obtain the Mel cepstrum (MFC, Mel-Frequency Cepstrum), which is used as the speech feature to be converted. It should be noted that the target speech needs to undergo the same processing as the speech to be converted, which is not repeated in this embodiment of the present application.
By converting the speech to be converted into a Mel cepstrum, this embodiment of the present application not only obtains features closer to the human vocal mechanism and the nonlinear auditory system, but also facilitates the training and the input and output of the neural network model.
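The three-step feature extraction described above (STFT → Mel filtering → cepstral analysis) can be sketched as follows with librosa and SciPy. The FFT size, hop length and number of Mel filters are illustrative assumptions, not values given in the patent.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def extract_mel_cepstrum(waveform, sr=16000, n_fft=1024, hop=256, n_mels=80):
    """S410-S430: STFT amplitude spectrum -> Mel filter bank -> cepstral analysis."""
    spec = np.abs(librosa.stft(waveform, n_fft=n_fft, hop_length=hop))   # amplitude spectrum only
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)      # denser filters at low frequencies
    mel_spec = mel_fb @ spec                                             # Mel spectrum
    # Cepstral analysis: log followed by a discrete cosine transform.
    return dct(np.log(mel_spec + 1e-6), axis=0, norm="ortho").T          # (frames, coefficients)
```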
In one embodiment, as shown in FIG. 6, step S410 specifically includes:
S510. Remove the leading and trailing blank portions of the speech to be converted to obtain a first corrected speech to be converted.
S520. Perform pre-emphasis, framing and windowing on the first corrected speech to be converted to obtain a second corrected speech to be converted.
S530. Perform a short-time Fourier transform on the second corrected speech to be converted to obtain an amplitude spectrum.
In this embodiment, because blank portions exist at the beginning and end of the speech to be converted, the leading and trailing blank portions are removed before the short-time Fourier transform is performed, yielding a first corrected speech to be converted; this helps the voice conversion model align, learn and convert better. In addition, to better suit the short-time Fourier transform, after the first corrected speech to be converted is obtained, pre-emphasis, framing and windowing are performed on it to obtain a second corrected speech to be converted. Pre-emphasis boosts the high-frequency content of the speech to be converted and filters out part of the noise, while framing and windowing make the speech to be converted smoother and more continuous. Finally, a short-time Fourier transform is performed on the second corrected speech to be converted to obtain the amplitude spectrum. Steps S510 and S520 in this embodiment of the present application can be selectively executed according to user requirements.
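The optional conditioning steps S510 and S520 can be sketched as below. The trim threshold, pre-emphasis coefficient, frame length and hop length are illustrative assumptions.

```python
import numpy as np
import librosa

def condition_waveform(waveform, frame_len=1024, hop=256, coef=0.97):
    """S510: trim leading/trailing silence. S520: pre-emphasis, framing, windowing."""
    trimmed, _ = librosa.effects.trim(waveform, top_db=30)                  # first corrected speech
    emphasized = np.append(trimmed[0], trimmed[1:] - coef * trimmed[:-1])   # boost high frequencies
    frames = librosa.util.frame(emphasized, frame_length=frame_len, hop_length=hop).T
    return frames * np.hanning(frame_len)                                   # second corrected speech, windowed
```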
As shown in FIG. 7, in one embodiment a method for training a voice conversion model is provided. The method can be applied to a terminal or to a server; this embodiment is described by taking application to a terminal as an example. The training of the voice conversion model specifically includes the following steps:
S610. Acquire a training speech, and a first training example speech and a second training example speech of a training user.
S620. Preprocess the training speech to obtain a training speech feature, preprocess the first training example speech to obtain a first training example speech feature, and preprocess the second training example speech to obtain a second training example speech feature.
S630. Input the training Mel cepstrum into the first encoder to obtain a first vector.
S640. Input a partial second training example Mel cepstrum into the second encoder to obtain a second vector, where the partial second training example Mel cepstrum is obtained by random excerpting from the second training example Mel cepstrum.
S650. Splice the first vector and the second vector to obtain a third vector.
S660. Input the third vector into the length regulator to obtain a fourth vector.
S670. Input the fourth vector into the decoder to obtain a training predicted Mel cepstrum.
S680. Calculate the training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum.
S690. Perform backpropagation according to the training loss to update the training weights of the voice conversion model until the voice conversion model converges.
In this embodiment, when the voice conversion model is trained, the training speech and the training user's training example speeches are first acquired. The training example speeches include a first training example speech and a second training example speech, where the speech content of the first training example speech is the same as the speech content of the training speech, and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech. The first training example speech is the predicted speech that is ultimately needed, and the second training example speech serves as the speech feature input to the model. The training speech is then preprocessed to obtain the training speech feature, the first training example speech is preprocessed to obtain the first training example speech feature, and the second training example speech is preprocessed to obtain the second training example speech feature, where the training speech feature is the training Mel cepstrum, the first training example speech feature is the first training example Mel cepstrum, and the second training example speech feature is the second training example Mel cepstrum. The subsequent operations are the same as those of steps S210-S250 of the embodiments of the present application and are not repeated here. After the training predicted Mel cepstrum is obtained, the training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum, i.e. the loss between the predicted value and the actual value, is calculated, and backpropagation is finally performed according to the training loss to update the training weights of the voice conversion model until the voice conversion model converges.
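A condensed sketch of one pass of the training loop in steps S630-S690 follows. The loss function (L1 on the Mel cepstrum), the optimizer and the model signature are assumptions; the patent only states that a training loss between the predicted Mel cepstrum and the first training example Mel cepstrum is computed and backpropagated until convergence.

```python
import torch.nn.functional as F

def train_step(model, optimizer, train_mel, partial_second_example_mel, first_example_mel):
    """One S630-S690 pass: forward through the conversion model, loss, backpropagation."""
    optimizer.zero_grad()
    predicted_mel = model(train_mel, partial_second_example_mel)   # S630-S670
    loss = F.l1_loss(predicted_mel, first_example_mel)             # S680 (L1 is an assumption)
    loss.backward()                                                # S690: backpropagation
    optimizer.step()                                               # update training weights
    return loss.item()
```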
Although two kinds of training example speech need to be acquired, no additional data collection is required when the training set is large enough. For example, if the training speech includes "YES", then a first training example speech with the same speech content, i.e. "YES" uttered by the training user, needs to be acquired, and in addition a second training example speech whose speech content uses a different language, i.e. speech uttered by the training user in another language, such as "好" ("good"), also needs to be acquired. When the training set is large enough, the "好" uttered by the training user already serves as the first training example speech for the case where the training speech includes "好", so no extra second training example speech needs to be acquired.
Preferably, the languages used by the speech content of the training speech include the language used by the speech content of the speech to be converted in actual use, i.e. speech in the language of the speech to be converted participates in the training of the voice conversion model, and the training users include the target user, i.e. the target user participates in the training of the voice conversion model as a training user; in this way, cross-language conversion can be achieved more accurately. In addition, because the first encoder does not depend on the output of the previous frame, the training speed of the voice conversion model is greatly accelerated.
FIG. 8 shows an internal structure diagram of a computer device in one embodiment. The computer device may specifically be a terminal or a server. As shown in FIG. 8, the computer device includes a processor, a memory and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the cross-language voice conversion method. A computer program may also be stored in the internal memory, and when executed by the processor, it causes the processor to execute the cross-language voice conversion method. Those skilled in the art can understand that the structure shown in FIG. 8 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor, where the memory stores a computer program which, when executed by the processor, causes the processor to perform the following steps:
acquiring a speech to be converted and an example speech of a target user, where the language used by the speech content of the speech to be converted is different from the language used by the speech content of the example speech; preprocessing the speech to be converted to obtain a speech feature to be converted, and preprocessing the example speech to obtain an example speech feature; taking the speech feature to be converted and the example speech feature as input and using a pre-trained voice conversion model to obtain a target speech feature; and converting the target speech feature into a target speech simulating the example speech, where the speech content of the target speech is the same as the speech content of the speech to be converted.
In one embodiment, the speech feature to be converted is a Mel cepstrum to be converted, the example speech feature is an example Mel cepstrum, and the voice conversion model includes a first encoder, a second encoder, a length regulator and a decoder, where the first encoder includes an FFT Block. Taking the speech feature to be converted and the example speech feature as input and using the pre-trained voice conversion model to obtain the target speech feature includes: inputting the Mel cepstrum into the first encoder to obtain a first vector; inputting a partial example Mel cepstrum into the second encoder to obtain a second vector, the partial example Mel cepstrum being obtained by random excerpting from the example Mel cepstrum; splicing the first vector and the second vector to obtain a third vector; inputting the third vector into the length regulator to obtain a fourth vector; and inputting the fourth vector into the decoder to obtain a predicted Mel cepstrum as the target speech feature.
In one embodiment, the first encoder is configured to compress the Mel cepstrum to obtain the first vector, and the length regulator is configured to obtain, from the third vector, a predicted expansion length of each frame in the third vector and to expand the third vector into the fourth vector according to the predicted expansion length.
In one embodiment, the training of the voice conversion model includes: acquiring a training speech and a first training example speech and a second training example speech of a training user, where the speech content of the first training example speech is the same as the speech content of the training speech, and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech; preprocessing the training speech to obtain a training speech feature, preprocessing the first training example speech to obtain a first training example speech feature, and preprocessing the second training example speech to obtain a second training example speech feature, where the training speech feature is a training Mel cepstrum, the first training example speech feature is a first training example Mel cepstrum, and the second training example speech feature is a second training example Mel cepstrum; inputting the training Mel cepstrum into the first encoder to obtain a first vector; inputting a partial second training example Mel cepstrum into the second encoder to obtain a second vector, the partial second training example Mel cepstrum being obtained by random excerpting from the second training example Mel cepstrum; splicing the first vector and the second vector to obtain a third vector; inputting the third vector into the length regulator to obtain a fourth vector; inputting the fourth vector into the decoder to obtain a training predicted Mel cepstrum; calculating a training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum; and performing backpropagation according to the training loss to update the training weights of the voice conversion model until the voice conversion model converges.
In one embodiment, acquiring the speech to be converted includes: acquiring a text to be converted; and converting the text to be converted into synthesized speech as the speech to be converted.
In one embodiment, preprocessing the speech to be converted to obtain the speech feature to be converted includes: performing a short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum; filtering the amplitude spectrum to obtain a Mel spectrum; and performing cepstral analysis on the Mel spectrum to obtain the Mel cepstrum to be converted as the speech feature to be converted.
In one embodiment, performing the short-time Fourier transform on the speech to be converted to obtain the amplitude spectrum includes: removing the leading and trailing blank portions of the speech to be converted to obtain a first corrected speech to be converted; performing pre-emphasis, framing and windowing on the first corrected speech to be converted to obtain a second corrected speech to be converted; and performing a short-time Fourier transform on the second corrected speech to be converted to obtain the amplitude spectrum.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
acquiring a speech to be converted and an example speech of a target user, where the language used by the speech content of the speech to be converted is different from the language used by the speech content of the example speech; preprocessing the speech to be converted to obtain a speech feature to be converted, and preprocessing the example speech to obtain an example speech feature; taking the speech feature to be converted and the example speech feature as input and using a pre-trained voice conversion model to obtain a target speech feature; and converting the target speech feature into a target speech simulating the example speech, where the speech content of the target speech is the same as the speech content of the speech to be converted.
In one embodiment, the speech feature to be converted is a Mel cepstrum to be converted, the example speech feature is an example Mel cepstrum, and the voice conversion model includes a first encoder, a second encoder, a length regulator and a decoder, where the first encoder includes an FFT Block. Taking the speech feature to be converted and the example speech feature as input and using the pre-trained voice conversion model to obtain the target speech feature includes: inputting the Mel cepstrum into the first encoder to obtain a first vector; inputting a partial example Mel cepstrum into the second encoder to obtain a second vector, the partial example Mel cepstrum being obtained by random excerpting from the example Mel cepstrum; splicing the first vector and the second vector to obtain a third vector; inputting the third vector into the length regulator to obtain a fourth vector; and inputting the fourth vector into the decoder to obtain a predicted Mel cepstrum as the target speech feature.
In one embodiment, the first encoder is configured to compress the Mel cepstrum to obtain the first vector, and the length regulator is configured to obtain, from the third vector, a predicted expansion length of each frame in the third vector and to expand the third vector into the fourth vector according to the predicted expansion length.
In one embodiment, the training of the voice conversion model includes: acquiring a training speech and a first training example speech and a second training example speech of a training user, where the speech content of the first training example speech is the same as the speech content of the training speech, and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech; preprocessing the training speech to obtain a training speech feature, preprocessing the first training example speech to obtain a first training example speech feature, and preprocessing the second training example speech to obtain a second training example speech feature, where the training speech feature is a training Mel cepstrum, the first training example speech feature is a first training example Mel cepstrum, and the second training example speech feature is a second training example Mel cepstrum; inputting the training Mel cepstrum into the first encoder to obtain a first vector; inputting a partial second training example Mel cepstrum into the second encoder to obtain a second vector, the partial second training example Mel cepstrum being obtained by random excerpting from the second training example Mel cepstrum; splicing the first vector and the second vector to obtain a third vector; inputting the third vector into the length regulator to obtain a fourth vector; inputting the fourth vector into the decoder to obtain a training predicted Mel cepstrum; calculating a training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum; and performing backpropagation according to the training loss to update the training weights of the voice conversion model until the voice conversion model converges.
In one embodiment, acquiring the speech to be converted includes: acquiring a text to be converted; and converting the text to be converted into synthesized speech as the speech to be converted.
In one embodiment, preprocessing the speech to be converted to obtain the speech feature to be converted includes: performing a short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum; filtering the amplitude spectrum to obtain a Mel spectrum; and performing cepstral analysis on the Mel spectrum to obtain the Mel cepstrum to be converted as the speech feature to be converted.
In one embodiment, performing the short-time Fourier transform on the speech to be converted to obtain the amplitude spectrum includes: removing the leading and trailing blank portions of the speech to be converted to obtain a first corrected speech to be converted; performing pre-emphasis, framing and windowing on the first corrected speech to be converted to obtain a second corrected speech to be converted; and performing a short-time Fourier transform on the second corrected speech to be converted to obtain the amplitude spectrum.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program, and the program can be stored in a non-volatile computer-readable storage medium; when executed, the program may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope described in this specification.
The above embodiments only express several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent of the present application. It should be pointed out that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the patent of the present application shall be subject to the appended claims.

Claims (19)

  1. A cross-language voice conversion method, characterized in that the method comprises:
    acquiring a speech to be converted and an example speech of a target user, wherein the language used by the speech content of the speech to be converted is different from the language used by the speech content of the example speech;
    preprocessing the speech to be converted to obtain a speech feature to be converted, and preprocessing the example speech to obtain an example speech feature;
    taking the speech feature to be converted and the example speech feature as input, and using a pre-trained voice conversion model to obtain a target speech feature;
    converting the target speech feature into a target speech simulating the example speech, wherein the speech content of the target speech is the same as the speech content of the speech to be converted.
  2. The method according to claim 1, characterized in that the speech feature to be converted is a Mel cepstrum to be converted, the example speech feature is an example Mel cepstrum, the voice conversion model comprises a first encoder, a second encoder, a length regulator and a decoder, the first encoder comprises an FFT Block, and taking the speech feature to be converted and the example speech feature as input and using the pre-trained voice conversion model to obtain the target speech feature comprises:
    inputting the Mel cepstrum into the first encoder to obtain a first vector;
    inputting a partial example Mel cepstrum into the second encoder to obtain a second vector, the partial example Mel cepstrum being obtained by random excerpting from the example Mel cepstrum;
    splicing the first vector and the second vector to obtain a third vector;
    inputting the third vector into the length regulator to obtain a fourth vector;
    inputting the fourth vector into the decoder to obtain a predicted Mel cepstrum as the target speech feature.
  3. The method according to claim 2, characterized in that the first encoder is configured to compress the Mel cepstrum to obtain the first vector, and the length regulator is configured to obtain, from the third vector, a predicted expansion length of each frame in the third vector, and to expand the third vector into the fourth vector according to the predicted expansion length.
  4. The method according to claim 1, characterized in that the training of the voice conversion model comprises:
    acquiring a training speech and a first training example speech and a second training example speech of a training user, wherein the speech content of the first training example speech is the same as the speech content of the training speech, and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech;
    preprocessing the training speech to obtain a training speech feature, preprocessing the first training example speech to obtain a first training example speech feature, and preprocessing the second training example speech to obtain a second training example speech feature, wherein the training speech feature is a training Mel cepstrum, the first training example speech feature is a first training example Mel cepstrum, and the second training example speech feature is a second training example Mel cepstrum;
    inputting the training Mel cepstrum into the first encoder to obtain a first vector;
    inputting a partial second training example Mel cepstrum into the second encoder to obtain a second vector, the partial second training example Mel cepstrum being obtained by random excerpting from the second training example Mel cepstrum;
    splicing the first vector and the second vector to obtain a third vector;
    inputting the third vector into the length regulator to obtain a fourth vector;
    inputting the fourth vector into the decoder to obtain a training predicted Mel cepstrum;
    calculating a training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum;
    performing backpropagation according to the training loss to update the training weights of the voice conversion model until the voice conversion model converges.
  5. The method according to claim 1, characterized in that acquiring the speech to be converted comprises:
    acquiring a text to be converted;
    converting the text to be converted into synthesized speech as the speech to be converted.
  6. The method according to claim 2, characterized in that preprocessing the speech to be converted to obtain the speech feature to be converted comprises:
    performing a short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum;
    filtering the amplitude spectrum to obtain a Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel cepstrum to be converted as the speech feature to be converted.
  7. The method according to claim 6, characterized in that performing the short-time Fourier transform on the speech to be converted to obtain the amplitude spectrum comprises:
    removing the leading and trailing blank portions of the speech to be converted to obtain a first corrected speech to be converted;
    performing pre-emphasis, framing and windowing on the first corrected speech to be converted to obtain a second corrected speech to be converted;
    performing a short-time Fourier transform on the second corrected speech to be converted to obtain the amplitude spectrum.
  8. A computer device, comprising a memory and a processor, wherein the memory stores a computer program, and the computer program, when executed by the processor, performs the following steps:
    acquiring a speech to be converted and an example speech of a target user, wherein the language used by the speech content of the speech to be converted is different from the language used by the speech content of the example speech;
    preprocessing the speech to be converted to obtain a speech feature to be converted, and preprocessing the example speech to obtain an example speech feature;
    taking the speech feature to be converted and the example speech feature as input, and using a pre-trained voice conversion model to obtain a target speech feature;
    converting the target speech feature into a target speech simulating the example speech, wherein the speech content of the target speech is the same as the speech content of the speech to be converted.
  9. The device according to claim 8, characterized in that the speech feature to be converted is a Mel cepstrum to be converted, the example speech feature is an example Mel cepstrum, the voice conversion model comprises a first encoder, a second encoder, a length regulator and a decoder, the first encoder comprises an FFT Block, and taking the speech feature to be converted and the example speech feature as input and using the pre-trained voice conversion model to obtain the target speech feature comprises:
    inputting the Mel cepstrum into the first encoder to obtain a first vector;
    inputting a partial example Mel cepstrum into the second encoder to obtain a second vector, the partial example Mel cepstrum being obtained by random excerpting from the example Mel cepstrum;
    splicing the first vector and the second vector to obtain a third vector;
    inputting the third vector into the length regulator to obtain a fourth vector;
    inputting the fourth vector into the decoder to obtain a predicted Mel cepstrum as the target speech feature.
  10. The device according to claim 9, characterized in that the first encoder is configured to compress the Mel cepstrum to obtain the first vector, and the length regulator is configured to obtain, from the third vector, a predicted expansion length of each frame in the third vector, and to expand the third vector into the fourth vector according to the predicted expansion length.
  11. The device according to claim 8, characterized in that the training of the voice conversion model comprises:
    acquiring a training speech and a first training example speech and a second training example speech of a training user, wherein the speech content of the first training example speech is the same as the speech content of the training speech, and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech;
    preprocessing the training speech to obtain a training speech feature, preprocessing the first training example speech to obtain a first training example speech feature, and preprocessing the second training example speech to obtain a second training example speech feature, wherein the training speech feature is a training Mel cepstrum, the first training example speech feature is a first training example Mel cepstrum, and the second training example speech feature is a second training example Mel cepstrum;
    inputting the training Mel cepstrum into the first encoder to obtain a first vector;
    inputting a partial second training example Mel cepstrum into the second encoder to obtain a second vector, the partial second training example Mel cepstrum being obtained by random excerpting from the second training example Mel cepstrum;
    splicing the first vector and the second vector to obtain a third vector;
    inputting the third vector into the length regulator to obtain a fourth vector;
    inputting the fourth vector into the decoder to obtain a training predicted Mel cepstrum;
    calculating a training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum;
    performing backpropagation according to the training loss to update the training weights of the voice conversion model until the voice conversion model converges.
  12. The device according to claim 8, characterized in that acquiring the speech to be converted comprises:
    acquiring a text to be converted;
    converting the text to be converted into synthesized speech as the speech to be converted.
  13. The device according to claim 9, characterized in that preprocessing the speech to be converted to obtain the speech feature to be converted comprises:
    performing a short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum;
    filtering the amplitude spectrum to obtain a Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel cepstrum to be converted as the speech feature to be converted.
  14. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, performs the following steps:
    acquiring a speech to be converted and an example speech of a target user, wherein the language used by the speech content of the speech to be converted is different from the language used by the speech content of the example speech;
    preprocessing the speech to be converted to obtain a speech feature to be converted, and preprocessing the example speech to obtain an example speech feature;
    taking the speech feature to be converted and the example speech feature as input, and using a pre-trained voice conversion model to obtain a target speech feature;
    converting the target speech feature into a target speech simulating the example speech, wherein the speech content of the target speech is the same as the speech content of the speech to be converted.
  15. The storage medium according to claim 14, characterized in that the speech feature to be converted is a Mel cepstrum to be converted, the example speech feature is an example Mel cepstrum, the voice conversion model comprises a first encoder, a second encoder, a length regulator and a decoder, the first encoder comprises an FFT Block, and taking the speech feature to be converted and the example speech feature as input and using the pre-trained voice conversion model to obtain the target speech feature comprises:
    inputting the Mel cepstrum into the first encoder to obtain a first vector;
    inputting a partial example Mel cepstrum into the second encoder to obtain a second vector, the partial example Mel cepstrum being obtained by random excerpting from the example Mel cepstrum;
    splicing the first vector and the second vector to obtain a third vector;
    inputting the third vector into the length regulator to obtain a fourth vector;
    inputting the fourth vector into the decoder to obtain a predicted Mel cepstrum as the target speech feature.
  16. The storage medium according to claim 15, characterized in that the first encoder is configured to compress the Mel cepstrum to obtain the first vector, and the length regulator is configured to obtain, from the third vector, a predicted expansion length of each frame in the third vector, and to expand the third vector into the fourth vector according to the predicted expansion length.
  17. The storage medium according to claim 14, characterized in that the training of the voice conversion model comprises:
    acquiring a training speech and a first training example speech and a second training example speech of a training user, wherein the speech content of the first training example speech is the same as the speech content of the training speech, and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech;
    preprocessing the training speech to obtain a training speech feature, preprocessing the first training example speech to obtain a first training example speech feature, and preprocessing the second training example speech to obtain a second training example speech feature, wherein the training speech feature is a training Mel cepstrum, the first training example speech feature is a first training example Mel cepstrum, and the second training example speech feature is a second training example Mel cepstrum;
    inputting the training Mel cepstrum into the first encoder to obtain a first vector;
    inputting a partial second training example Mel cepstrum into the second encoder to obtain a second vector, the partial second training example Mel cepstrum being obtained by random excerpting from the second training example Mel cepstrum;
    splicing the first vector and the second vector to obtain a third vector;
    inputting the third vector into the length regulator to obtain a fourth vector;
    inputting the fourth vector into the decoder to obtain a training predicted Mel cepstrum;
    calculating a training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum;
    performing backpropagation according to the training loss to update the training weights of the voice conversion model until the voice conversion model converges.
  18. The storage medium according to claim 14, characterized in that acquiring the speech to be converted comprises:
    acquiring a text to be converted;
    converting the text to be converted into synthesized speech as the speech to be converted.
  19. The storage medium according to claim 15, characterized in that preprocessing the speech to be converted to obtain the speech feature to be converted comprises:
    performing a short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum;
    filtering the amplitude spectrum to obtain a Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel cepstrum to be converted as the speech feature to be converted.
PCT/CN2020/140344 2020-12-28 2020-12-28 Cross-language voice conversion method, computer device, and storage medium WO2022140966A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/140344 WO2022140966A1 (en) 2020-12-28 2020-12-28 Cross-language voice conversion method, computer device, and storage medium


Publications (1)

Publication Number Publication Date
WO2022140966A1 true WO2022140966A1 (en) 2022-07-07

Family

ID=82259001

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/140344 WO2022140966A1 (en) 2020-12-28 2020-12-28 Cross-language voice conversion method, computer device, and storage medium

Country Status (1)

Country Link
WO (1) WO2022140966A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018084604A (en) * 2016-11-21 2018-05-31 日本電信電話株式会社 Cross lingual voice synthesis model learning device, cross lingual voice synthesis device, cross lingual voice synthesis model learning method, and program
CN110808034A (en) * 2019-10-31 2020-02-18 北京大米科技有限公司 Voice conversion method, device, storage medium and electronic equipment
CN110866410A (en) * 2019-11-15 2020-03-06 深圳市赛为智能股份有限公司 Multi-language conversion method, device, computer equipment and storage medium
CN111247584A (en) * 2019-12-24 2020-06-05 深圳市优必选科技股份有限公司 Voice conversion method, system, device and storage medium
CN111785261A (en) * 2020-05-18 2020-10-16 南京邮电大学 Cross-language voice conversion method and system based on disentanglement and explanatory representation


Similar Documents

Publication Publication Date Title
JP7280386B2 (en) Multilingual speech synthesis and cross-language voice cloning
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
WO2020118521A1 (en) Multi-speaker neural text-to-speech synthesis
CN109767778B (en) Bi-L STM and WaveNet fused voice conversion method
US11355097B2 (en) Sample-efficient adaptive text-to-speech
US11335324B2 (en) Synthesized data augmentation using voice conversion and speech recognition models
Xu et al. Voice conversion based on Gaussian processes by coherent and asymmetric training with limited training data
CN112735454A (en) Audio processing method and device, electronic equipment and readable storage medium
JP2024505076A (en) Generate diverse, natural-looking text-to-speech samples
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
JP7393585B2 (en) WaveNet self-training for text-to-speech
CN113963679A (en) Voice style migration method and device, electronic equipment and storage medium
Mariya Celin et al. Data augmentation techniques for transfer learning-based continuous dysarthric speech recognition
Viacheslav et al. System of methods of automated cognitive linguistic analysis of speech signals with noise
JP7423056B2 (en) Reasoners and how to learn them
WO2023279976A1 (en) Speech synthesis method, apparatus, device, and storage medium
CN112712789B (en) Cross-language audio conversion method, device, computer equipment and storage medium
Kumar et al. Towards building text-to-speech systems for the next billion users
WO2022140966A1 (en) Cross-language voice conversion method, computer device, and storage medium
Wang et al. Learning explicit prosody models and deep speaker embeddings for atypical voice conversion
WO2022133630A1 (en) Cross-language audio conversion method, computer device and storage medium
Kiran Reddy et al. DNN-based cross-lingual voice conversion using Bottleneck Features
Jamal et al. Exploring Transfer Learning for Urdu Speech Synthesis
Kotani et al. Voice Conversion Based on Deep Neural Networks for Time-Variant Linear Transformations
Ali et al. Arabic voice system to help illiterate or blind for using computer

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20967317

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20967317

Country of ref document: EP

Kind code of ref document: A1