CN116312617A - Voice conversion method, device, electronic equipment and storage medium - Google Patents

Voice conversion method, device, electronic equipment and storage medium

Info

Publication number
CN116312617A
CN116312617A (Application No. CN202310295364.6A)
Authority
CN
China
Prior art keywords
sample
speaker
feature vector
spectrum
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310295364.6A
Other languages
Chinese (zh)
Inventor
朱清影
缪陈峰
陈婷
马骏
王少军
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310295364.6A priority Critical patent/CN116312617A/en
Publication of CN116312617A publication Critical patent/CN116312617A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

In the voice conversion method, voice conversion device, electronic equipment and storage medium provided herein, a fusion feature vector is obtained from a text sequence and the speaker features of a first speaker; a source spectrum feature vector is obtained from a source Mel spectrum of a second speaker; the fusion feature vector and the source spectrum feature vector are input into a pre-trained voice conversion acoustic model, which outputs a target Mel spectrum of the first speaker; and target voice data are obtained from the target Mel spectrum. In this way, voice conversion based on non-parallel corpus is realized: no parallel corpus needs to be collected, which improves voice conversion efficiency. In addition, because the source spectrum feature vector input to the voice conversion acoustic model is obtained from the source Mel spectrum, the model does not need to perform alignment prediction, and the target Mel spectrum it outputs is strictly aligned in time with the source Mel spectrum, further improving voice conversion efficiency. Moreover, parallel corpus is generated from non-parallel corpus, achieving data augmentation.

Description

Voice conversion method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of speech synthesis technologies, and in particular, to a speech conversion method, a device, an electronic apparatus, and a storage medium.
Background
Voice conversion refers to converting the timbre of a source speaker's speech into the timbre of another, target speaker while keeping the text content of the source speech unchanged. With the development of Internet technology, voice conversion has gradually been applied in fields such as electronic games, live video streaming and short-video applications.
Prior-art speech conversion is generally based on parallel corpus, i.e. different speakers need to record speech for the same text, and the audio needs to be time-aligned either manually or with techniques such as the DTW (Dynamic Time Warping) algorithm. Because parallel corpus is difficult to collect and requires time alignment, the parallel-corpus-based approach of the prior art is unfavorable for improving voice conversion efficiency.
Disclosure of Invention
The purpose of the present invention is to provide a voice conversion method, a voice conversion device, electronic equipment and a storage medium, so as to solve the technical problem of low voice conversion efficiency in the prior art.
The technical scheme of the application is as follows: provided is a voice conversion method including:
acquiring a fusion feature vector according to the text sequence and the speaker features of the first speaker;
acquiring a source spectrum characteristic vector according to a source Mel spectrum of the second speaker;
inputting the fusion feature vector and the source spectrum feature vector into a pre-trained voice conversion acoustic model, and outputting a target mel spectrum of the first speaker, wherein the voice conversion acoustic model is obtained by training according to a sample fusion feature vector and a sample spectrum feature vector, the sample fusion feature vector is obtained according to a sample text sequence and sample speaker features, and the sample text sequence, the sample speaker features and the sample spectrum feature vector are obtained according to the same real voice data;
and acquiring target voice data according to the target Mel spectrum.
As one embodiment, the obtaining the fusion feature vector according to the text sequence and the speaker feature of the first speaker includes:
splicing the text sequence and the speaker characteristics of the first speaker to obtain a spliced characteristic vector;
and inputting the spliced feature vector into a first neural network, and outputting the fusion feature vector.
As an embodiment, the inputting the fusion feature vector and the source spectrum feature vector into a pre-trained voice conversion acoustic model, and outputting a target mel spectrum of the first speaker includes:
acquiring a first alignment matrix according to the fusion feature vector and the source spectrum feature vector;
obtaining an index mapping vector according to the first alignment matrix and the index vector of the fusion feature vector;
acquiring a second alignment matrix according to the index mapping vector and the index vector of the fusion feature vector;
and acquiring a fusion feature alignment vector according to the second alignment matrix and the fusion feature vector.
As one embodiment, the training step of the speech conversion acoustic model includes:
respectively acquiring a corresponding sample text sequence, sample speaker characteristics and a sample source Mel spectrum according to the real voice data;
acquiring a sample fusion feature vector according to the sample text sequence and the sample speaker feature, and acquiring a sample spectrum feature vector according to a sample source Mel spectrum;
inputting the sample fusion feature vector and the sample spectrum feature vector into a voice conversion acoustic model to be trained, and outputting a sample target Mel spectrum;
and calculating a conversion error according to the sample source Mel spectrum and the sample target Mel spectrum, and adjusting parameters of the voice conversion acoustic model according to the conversion error until the voice conversion acoustic model reaches a training convergence condition.
As one embodiment, the loss function of the speech conversion acoustic model is:
$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\left\|\widehat{mel}_i - mel_i\right\|,$$
where $\mathcal{L}$ is the loss value, $\widehat{mel}_i$ is the i-th sample target Mel spectrum, $mel_i$ is the i-th sample source Mel spectrum, and $N$ is the number of samples.
As one embodiment, the obtaining the target voice data according to the target mel spectrum includes:
inputting the target Mel spectrum into a voice generator, and outputting corresponding target voice data;
the training step of the voice generator comprises the following steps:
obtaining at least one training sample, wherein the training sample comprises real voice data and a mel spectrum extracted from the real voice data;
inputting the Mel spectrum to a Mel spectrum encoder, and outputting spectral characteristic data of the Mel spectrum;
inputting the frequency spectrum characteristic data into the voice generator, and outputting target voice data;
and calculating a generation error according to the real voice data and the target voice data, and adjusting parameters of the voice generator according to the generation error until the voice generator reaches a training convergence condition.
As an implementation manner, before the obtaining the fusion feature vector according to the text sequence and the speaker feature of the first speaker, the method further includes:
performing voice recognition on source voice data of a second speaker to obtain corresponding text content, and obtaining a text sequence of the text content;
acquiring a source mel spectrum of the second speaker according to the source voice data;
and extracting the speaker characteristic from the second voice data of the first speaker as the speaker characteristic of the first speaker.
Another technical scheme of the application is as follows: provided is a voice conversion device including:
the feature fusion module is used for acquiring a fusion feature vector according to the text sequence and the speaker features of the first speaker;
the spectrum coding module is used for acquiring a source spectrum characteristic vector according to the source Mel spectrum of the second speaker;
the voice conversion module is used for inputting the fusion feature vector and the source spectrum feature vector into a pre-trained voice conversion acoustic model and outputting a target mel spectrum of the first speaker, wherein the voice conversion acoustic model is obtained by training according to a sample fusion feature vector and a sample spectrum feature vector, the sample fusion feature vector is obtained according to a sample text sequence and sample speaker features, and the sample text sequence, the sample speaker features and the sample spectrum feature vector are obtained according to the same real voice data;
and the voice acquisition module is used for acquiring target voice data according to the target Mel spectrum.
Another technical scheme of the application is as follows: there is provided an electronic device comprising a processor, a memory coupled to the processor, the memory storing program instructions executable by the processor; the processor implements the voice conversion method described above when executing the program instructions stored in the memory.
Another technical scheme of the application is as follows: there is provided a storage medium having stored therein program instructions which, when executed by a processor, implement a method of voice conversion as described above.
In the voice conversion method, voice conversion device, electronic equipment and storage medium described above, a fusion feature vector is obtained from a text sequence and the speaker features of a first speaker; a source spectrum feature vector is obtained from a source Mel spectrum of a second speaker; the fusion feature vector and the source spectrum feature vector are input into a pre-trained voice conversion acoustic model, which outputs a target Mel spectrum of the first speaker, wherein the voice conversion acoustic model is trained on a sample fusion feature vector and a sample spectrum feature vector, the sample fusion feature vector is obtained from a sample text sequence and sample speaker features, and the sample text sequence, the sample speaker features and the sample spectrum feature vector are all obtained from the same real voice data; target voice data are then obtained from the target Mel spectrum. In this way, voice conversion based on non-parallel corpus is realized: no parallel corpus needs to be collected, which improves voice conversion efficiency. In addition, because the source spectrum feature vector input to the voice conversion acoustic model is obtained from the source Mel spectrum, the model does not need to perform alignment prediction, and the target Mel spectrum it outputs is strictly aligned in time with the source Mel spectrum, further improving voice conversion efficiency. Moreover, parallel corpus is generated from non-parallel corpus, achieving data augmentation.
Drawings
FIG. 1 is a flow chart of a voice conversion method according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a training method of a speech conversion model according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a voice conversion device according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
In order to better understand the solution of the present application, the following description will make clear and complete descriptions of the technical solution of the embodiment of the present application with reference to the accompanying drawings in the embodiment of the present application. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
In the embodiment of the application, at least one refers to one or more; plural means two or more. In the description of the present application, the words "first," "second," "third," and the like are used solely for the purpose of distinguishing between descriptions and not necessarily for the purpose of indicating or implying a relative importance or order.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, the terms "comprising," "including," "having," and variations thereof herein mean "including but not limited to," unless expressly specified otherwise.
It should be noted that, in the embodiments of the present application, "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may represent: A exists alone, both A and B exist, or B exists alone.
It should be noted that in the embodiments of the present application, "connected" is understood to mean electrically connected, and two electrical components may be connected directly or indirectly between two electrical components. For example, a may be directly connected to B, or indirectly connected to B via one or more other electrical components.
An embodiment of the present application provides a voice conversion method. The main execution body of the voice conversion method includes, but is not limited to, at least one of a server, a terminal, and the like, which can be configured to execute the voice conversion method provided in the embodiment of the application. In other words, the voice conversion method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The service end includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Fig. 1 is a schematic flow chart of a voice conversion method according to an embodiment of the present application. It should be noted that, if there are substantially the same results, the method of the present application is not limited to the flow sequence shown in fig. 1. In this embodiment, the voice conversion method includes the following steps:
s10, acquiring a fusion feature vector according to the text sequence and the speaker features of the first speaker;
the text sequence may be a phoneme sequence of a text corresponding to the voice to be converted, that is, a phoneme character sequence of the text to be converted. Wherein, the phoneme is the smallest unit in the speech, and can be analyzed according to the pronunciation action in the syllable of the word, and one action forms one phoneme. For example, in chinese, there are 32 phonemes, which can be divided into initials, finals. For example, for a text to be converted, its corresponding phoneme sequence may consist of initials and finals of each word in turn. Taking the text to be converted as "hello china peace" as an example, the corresponding phoneme sequence may be "n, i, h, ao, zh, ong, g, uo, p, ing, an". It should be understood that the corresponding phoneme sequence may also be in other forms, and the form of the phoneme sequence is not specifically limited in this application. The text sequence comprises characteristic vectors of characters in the text, and the text sequence can be obtained by encoding the text by using an encoder such as one-hot encoding, d-vector or x-vector.
The speaker features of the first speaker may include timbre features obtained by feature extraction from the first speaker's voice; the speaker features may be extracted from the voice using an encoder such as one-hot encoding, a d-vector or an x-vector.
The text sequence and the speaker features of the first speaker may be input to a text-speaker encoder, which outputs the corresponding fusion feature vector. As an embodiment, the text-speaker encoder may consist of an embedding layer followed by feed-forward Transformer blocks (FFT blocks).
In this embodiment, the text sequence is usually a phoneme sequence; for Mandarin Chinese, it is usually pinyin (initial + final) plus tone. The speaker feature vector can be extracted from the original audio by a speaker feature module, and the Mel spectrum can be extracted from the same original audio. The speaker feature vector is used as a global condition: it is expanded to the length of the text sequence, spliced with the text sequence into one vector, and input into the text-speaker encoder, which outputs an implicit (hidden) representation.
As an embodiment, the fusion feature vector may be obtained as follows:
s11, splicing the text sequence and the speaker characteristics of the first speaker to obtain a spliced characteristic vector;
the text sequence and the speaker characteristics of the first speaker can be summed to realize the splicing of the two characteristic vectors.
S12, inputting the spliced feature vector into a first neural network, and outputting the fused feature vector;
the first neural network is used for extracting invisible features of the stitching feature vector, and the following description is given by taking the first neural network as an example that the first neural network may include two full-connection layers, where the first neural network may include a first full-connection layer and a second full-connection layer, and in some embodiments, step S12 specifically includes:
s121, inputting the spliced feature vector into a first full-connection layer, and extracting features of the spliced feature vector to obtain a high-dimensional fusion feature vector.
S122, inputting the high-dimensional fusion feature vector into a second full-connection layer, and carrying out feature extraction on the high-dimensional fusion feature vector to obtain the fusion feature vector;
specifically, the first fully connected layer includes a first number of nodes and the second fully connected layer includes a second number of nodes, the first number being greater than the second number. In step S121, the spliced feature vector is input into a first full-connection layer, and feature fusion is performed on each feature vector in the spliced feature vector at each node of the first full-connection layer, so as to obtain a first number of different first cross features, and the first number of different first cross features form a high-dimensional fused feature vector. In step S122, the high-dimensional fusion feature vector is input into a second full-connection layer, and feature fusion is performed on the first number of different first cross features at each node of the second full-connection layer to obtain a second number of different second cross features, where the second number of different second cross features form a fusion feature vector.
In step S10, the resulting fused feature vector is $p = [p_0, p_1, p_2, \ldots, p_i, \ldots, p_{T_1-1}]$, where $0 \le i \le T_1-1$, $T_1$ is the number of feature vectors in the fused feature vector, and $T_1$ is an integer greater than or equal to 2.
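The following minimal PyTorch sketch of steps S11-S12 broadcasts the speaker embedding over the text length, splices it with the text sequence, and fuses the result through two fully connected layers (the first wider than the second). All dimensions, layer sizes and names are illustrative assumptions rather than values specified by this application.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Toy sketch of S11-S12: concatenate text features with a broadcast
    speaker embedding, then fuse them with two fully connected layers
    (the first layer wider than the second, as described above)."""
    def __init__(self, text_dim=256, spk_dim=128, hidden_dim=512, out_dim=256):
        super().__init__()
        self.fc1 = nn.Linear(text_dim + spk_dim, hidden_dim)  # first (wider) layer
        self.fc2 = nn.Linear(hidden_dim, out_dim)             # second (narrower) layer
        self.act = nn.ReLU()

    def forward(self, text_seq, spk_emb):
        # text_seq: (T1, text_dim); spk_emb: (spk_dim,)
        spk = spk_emb.unsqueeze(0).expand(text_seq.size(0), -1)  # expand to text length
        spliced = torch.cat([text_seq, spk], dim=-1)             # S11: splice
        hidden = self.act(self.fc1(spliced))                     # S121: high-dimensional fusion
        return self.fc2(hidden)                                  # S122: fusion feature vector p

p = FeatureFusion()(torch.randn(42, 256), torch.randn(128))
print(p.shape)  # torch.Size([42, 256]) -> T1 fused feature vectors
```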
S20, acquiring a source spectrum feature vector according to a source Mel spectrum of a second speaker;
the source mel spectrum of the second speaker may be input to a mel spectrum encoder, and the corresponding source spectrum feature vector may be output, where the spectrum encoder is configured to extract Gao Weiyin features of the source mel spectrum, and output corresponding Gao Weiyin feature vectors, and each Gao Weiyin feature vector forms a source spectrum feature vector.
In step S20, the obtained source spectral feature vector is $q = [q_0, q_1, q_2, \ldots, q_j, \ldots, q_{T_2-1}]$, where $0 \le j \le T_2-1$, $T_2$ is the number of feature vectors in the source spectral feature vector, and $T_2$ is an integer greater than or equal to 2.
S30, inputting the fusion feature vector and the source spectrum feature vector into a pre-trained voice conversion acoustic model, and outputting a target Mel spectrum of the first speaker, wherein the voice conversion acoustic model is obtained by training according to a sample fusion feature vector and a sample spectrum feature vector, and the sample fusion feature vector is obtained according to a text sequence and sample speaker features;
the text sequence, the sample speaker characteristic and the sample spectrum characteristic vector are obtained according to the same real voice data.
The voice conversion method of the application converts the source voice data of the second speaker into the target voice data of the first speaker.
As an embodiment, the step S30 specifically includes the following steps:
s31, acquiring a first alignment matrix according to the fusion eigenvector and the source spectrum eigenvector;
Specifically, the first alignment matrix $\alpha$ is calculated as follows:

$$\alpha_{i,j} = \frac{\exp\!\left(p_i^{\top} q_j / \sqrt{D}\right)}{\sum_{m=0}^{T_1-1}\exp\!\left(p_m^{\top} q_j / \sqrt{D}\right)},$$

where $\alpha_{i,j}$ is the matrix element of the i-th row and j-th column in the first alignment matrix $\alpha$, $p_i$ is the i-th feature vector in the fusion feature vector, $q_j$ is the j-th feature vector in the source spectrum feature vector, $p_m$ is the m-th feature vector in the fusion feature vector, $D$ is the dimension of the outputs of the text-speaker encoder and the Mel spectrum encoder, $\exp(\cdot)$ is the exponential function with base $e$, and $T_1$ is the length of the fusion feature vector. The source spectrum feature vector $q$ can be calculated from the first alignment matrix $\alpha$ and the fusion feature vector $p$.
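A small NumPy sketch of step S31 is given below, assuming the scaled dot-product softmax form of the formula above; the function name and the shapes used are illustrative.

```python
import numpy as np

def first_alignment_matrix(p, q):
    """Sketch of S31: p is (T1, D) fused feature vectors, q is (T2, D) source
    spectral feature vectors. Returns alpha of shape (T1, T2); each column
    sums to 1 over the text axis i = 0..T1-1."""
    D = p.shape[1]
    scores = p @ q.T / np.sqrt(D)                      # (T1, T2) similarity scores
    scores -= scores.max(axis=0, keepdims=True)        # numerical stability
    alpha = np.exp(scores)
    return alpha / alpha.sum(axis=0, keepdims=True)    # normalise over the text axis

alpha = first_alignment_matrix(np.random.randn(42, 256), np.random.randn(120, 256))
print(alpha.shape, alpha.sum(axis=0)[:3])              # (42, 120) [1. 1. 1.]
```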
S32, obtaining an index mapping vector (IMV) according to the first alignment matrix and the index vector of the fusion feature vector;
As an embodiment, the index mapping vector $\pi'$ is calculated as follows:

$$\pi'_j = \sum_{i=0}^{T_1-1} \alpha_{i,j}\, k_i,$$

where $\pi'_j$ is the j-th element of the index mapping vector, $\alpha_{i,j}$ is the matrix element of the i-th row and j-th column in the first alignment matrix $\alpha$, and $k_i$ is the i-th element in the index vector $k = [0, 1, \ldots, T_1-1]$ of the fusion feature vector; the first alignment matrix $\alpha \in \mathbb{R}^{T_1 \times T_2}$.
S33, acquiring a second alignment matrix according to the index mapping vector and the index vector of the fusion feature vector;
As an embodiment, to reduce the problem of error accumulation, a bi-directional accumulation operation is designed to generate the second index mapping vector:

$$\Delta\pi'_j = \pi'_j - \pi'_{j-1}, \quad 0 < j \le T_2 - 1,$$

$$\Delta\pi_j = \mathrm{ReLU}(\Delta\pi'_j), \quad 0 < j \le T_2 - 1.$$
For the j-th time step, $\Delta\pi$ is accumulated in the forward direction and in the reverse direction, respectively:

$$\overrightarrow{\pi}_j = \sum_{m=1}^{j} \Delta\pi_m, \qquad \overleftarrow{\pi}_j = \sum_{m=j+1}^{T_2-1} \Delta\pi_m.$$

Finally, the second index mapping vector is obtained by the following formula:

$$\pi^{*}_j = (T_1 - 1)\cdot\frac{\overrightarrow{\pi}_j}{\overrightarrow{\pi}_j + \overleftarrow{\pi}_j},$$

where $\overrightarrow{\pi}_j$ and $\overleftarrow{\pi}_j$ are the forward and reverse accumulations of $\Delta\pi$ at the j-th time step, and $T_1$ is the length of the fusion feature vector.
the second index map vector pi reconstructs a second alignment matrix α' by:
Figure BDA0004144139370000095
wherein alpha is i,j 'is the matrix element of the ith row and jth column in the second alignment matrix alpha', k i For the i-th element in the index vector k of the fusion feature vector, the index vector k= [0,1, ], T 1 -1],k m For the m-th element, pi, in index vector k of the fused feature vector j * For the j-th element in the second index mapping vector pi, exp () is an exponential function based on a natural constant e, T 1 To fuse the length of feature vectors, σ 2 Is a parameter representing the coefficient to Ji Bianyi.
S34, acquiring a fusion feature alignment vector according to the second alignment matrix and the fusion feature vector;
As one embodiment, the fusion feature alignment vector $c$ is calculated as follows:

$$c_j = \sum_{i=0}^{T_1-1} \alpha'_{i,j}\, p_i,$$

where $c_j$ is the j-th fusion feature alignment vector, $\alpha'_{i,j}$ is the matrix element of the i-th row and j-th column in the second alignment matrix $\alpha'$, $p_i$ is the i-th element in the fusion feature vector $p$, and $T_1$ is the length of the fusion feature vector.
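The NumPy sketch below chains steps S32-S34 end to end, following the formulas given above (index mapping vector, bidirectional accumulation, Gaussian-shaped second alignment matrix, and the aligned fusion features); the function name, the value of sigma and the shapes are illustrative assumptions.

```python
import numpy as np

def align_fused_features(alpha, p, sigma=1.0):
    """Sketch of S32-S34: build the index mapping vector from alpha, make it
    monotonic with a bidirectional accumulation, rebuild a Gaussian-shaped
    alignment matrix, and align the fused features."""
    T1, T2 = alpha.shape
    k = np.arange(T1)                                   # index vector k = [0, ..., T1-1]

    pi_raw = alpha.T @ k                                # S32: pi'_j = sum_i alpha_ij * k_i
    delta = np.maximum(np.diff(pi_raw, prepend=pi_raw[0]), 0.0)   # ReLU(pi'_j - pi'_{j-1})
    fwd = np.cumsum(delta)                              # forward accumulation
    bwd = fwd[-1] - fwd                                 # reverse accumulation
    pi_star = (T1 - 1) * fwd / np.maximum(fwd + bwd, 1e-8)        # second index mapping vector

    dist = (k[:, None] - pi_star[None, :]) ** 2         # (T1, T2) squared index distance
    energy = -dist / (sigma ** 2)
    alpha2 = np.exp(energy - energy.max(axis=0, keepdims=True))
    alpha2 /= alpha2.sum(axis=0, keepdims=True)         # S33: second alignment matrix

    return alpha2.T @ p                                 # S34: c_j = sum_i alpha'_ij * p_i

alpha = np.random.rand(42, 120)
alpha /= alpha.sum(axis=0, keepdims=True)               # column-normalised first alignment
c = align_fused_features(alpha, np.random.randn(42, 256))
print(c.shape)  # (120, 256): one aligned fusion feature vector per Mel frame
```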
S35, acquiring the target Mel spectrum according to the fusion feature alignment vector;
the fusion feature alignment vector is input to a decoder, and the decoder outputs a predicted mel spectrum as a target mel spectrum of the first speaker. As an embodiment, the decoder may consist of several convolutional layers and one linear layer interspersed with weight normalization, a leak ReLU activation function and residual connections.
In the training process of the voice conversion acoustic model, sample speaker characteristics and sample spectrum characteristic vectors in sample fusion characteristic vectors are all derived from the same sample speaker, the sample fusion characteristic vectors are obtained according to a text sequence and the sample speaker characteristics, and the text sequence, the sample speaker characteristics and the sample spectrum characteristic vectors are obtained according to the same real voice data.
As one embodiment, the training step of the speech conversion acoustic model includes:
s41, respectively acquiring a corresponding sample text sequence, sample speaker characteristics and a sample source Mel spectrum according to the real voice data;
s42, obtaining a sample fusion feature vector according to the sample text sequence and the sample speaker feature, and obtaining a sample spectrum feature vector according to a sample source Mel spectrum;
s43, inputting the sample fusion feature vector and the sample spectrum feature vector into a voice conversion acoustic model to be trained, and outputting a sample target Mel spectrum;
s44, calculating a conversion error according to the sample source Mel spectrum and the sample target Mel spectrum, and adjusting parameters of the voice conversion acoustic model according to the conversion error until the voice conversion acoustic model reaches a training convergence condition.
In some embodiments, in step S43, a first alignment matrix is obtained according to the sample fusion feature vector and the sample spectrum feature vector, where the sample fusion feature vector is obtained according to a sample text sequence and sample speaker features; an index mapping vector is obtained according to the first alignment matrix and the index vector of the sample fusion feature vector; a second alignment matrix is obtained according to the index mapping vector and the index vector of the sample fusion feature vector; a sample fusion feature alignment vector is obtained according to the second alignment matrix and the sample fusion feature vector; and the sample target Mel spectrum is obtained according to the sample fusion feature alignment vector.
The calculation method of the voice conversion acoustic model in the training process in the step S43 is similar to the process in the steps S31 to S35, and will not be described in detail here.
In step S44, the loss function of the speech conversion acoustic model is:
$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\left\|\widehat{mel}_i - mel_i\right\|,$$
where $\mathcal{L}$ is the loss value, $\widehat{mel}_i$ is the i-th sample target Mel spectrum, $mel_i$ is the i-th sample source Mel spectrum, and $N$ is the number of samples.
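A minimal PyTorch sketch of one optimisation step for S43-S44 is shown below. The mean absolute error is used here as the conversion error, which is an assumption (the application only states that a conversion error between the predicted and source Mel spectra is computed), and the stand-in model in the usage lines is purely hypothetical.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, sample_fusion, sample_spectrum, sample_source_mel):
    """Sketch of S43-S44 for one batch: predict the sample target Mel spectrum,
    compute the conversion error against the sample source Mel spectrum, and
    adjust the model parameters."""
    predicted_mel = model(sample_fusion, sample_spectrum)   # sample target Mel spectrum
    loss = F.l1_loss(predicted_mel, sample_source_mel)      # conversion error (MAE assumed)
    optimizer.zero_grad()
    loss.backward()                                         # adjust model parameters
    optimizer.step()
    return loss.item()

# Hypothetical usage with a stand-in model:
dummy = torch.nn.Linear(80, 80)
opt = torch.optim.Adam(dummy.parameters(), lr=1e-4)
model = lambda fusion, spectrum: dummy(spectrum)            # placeholder for the acoustic model
print(training_step(model, opt, None, torch.randn(1, 120, 80), torch.randn(1, 120, 80)))
```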
During training, the input text sequence, speaker feature vector and Mel spectrum all come from the same audio. Notably, the IMV computes the alignment between the Mel spectrum and the text sequence, so the reconstructed alignment matrix, and the Mel spectrum generated from it, are perfectly aligned in time with the originally input Mel spectrum. Therefore, at inference time, only one audio Mel spectrum and the corresponding text of speaker A need to be input into the model together with the speaker feature vector of speaker B; the result is a Mel spectrum with speaker B's timbre that is parallel to speaker A's Mel spectrum. Each such pair of Mel spectra is strictly aligned in time (equal total duration and equal duration for each phoneme), which greatly reduces the learning difficulty of the model.
S40, acquiring target voice data according to the target Mel spectrum;
and inputting the target Mel spectrum into a voice generator (a HiFi-GAN generator) to obtain the corresponding target voice data.
In the embodiment, the target voice data of the first speaker is generated according to the source voice data of the second speaker, so that the voice conversion based on the non-parallel corpus is realized, the parallel corpus is not required to be collected, and the voice conversion efficiency is improved; in addition, the source spectrum feature vector input to the voice conversion acoustic model is obtained according to the source Mel spectrum, the voice conversion acoustic model does not need to carry out alignment prediction, and the target Mel spectrum output by the voice conversion acoustic model is strictly aligned with the source Mel spectrum in time, so that the voice conversion efficiency is improved; and moreover, parallel corpus is generated based on non-parallel corpus, and data enhancement is realized.
As an embodiment, the training step of the speech generator (HiFi-GAN generator) comprises:
s51, acquiring at least one training sample, wherein the training sample comprises real voice data and a Mel spectrum extracted from the real voice data;
s52, inputting the Mel spectrum to a Mel spectrum encoder, and outputting spectrum characteristic data of the Mel spectrum;
s53, inputting the frequency spectrum characteristic data into the voice generator, and outputting target voice data;
s54, calculating a generation error according to the real voice data and the target voice data, and adjusting parameters of the voice generator according to the generation error until the voice generator reaches a training convergence condition.
As an embodiment, before step S10, the method further includes the following steps:
s61, performing voice recognition on source voice data of a second speaker to obtain corresponding text content, and obtaining a text sequence of the text content;
s62, acquiring a source Mel spectrum of the second speaker according to the source voice data;
s63, extracting the speaker characteristic from the second voice data of the first speaker as the speaker characteristic of the first speaker.
In step S62, the source speech data may be converted from a time-domain signal into frequency-domain signals over a preset number of windows by means of a short-time Fourier transform; the frequency-domain signals are then converted from the linear frequency scale to the Mel scale to obtain the Mel spectrogram.
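A short sketch of this step using librosa is given below; the sampling rate, FFT size, hop length and number of Mel bands are illustrative assumptions, not values specified by this application.

```python
import librosa
import numpy as np

def extract_mel(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Sketch of S62: STFT-based Mel spectrogram extraction with librosa."""
    y, sr = librosa.load(wav_path, sr=sr)                        # time-domain signal
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(np.maximum(mel, 1e-5))                         # log-Mel spectrogram (n_mels, frames)

# mel = extract_mel("source_speaker_utterance.wav")   # hypothetical file path
```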
In step S63, a speaker feature extraction network may be trained separately to extract speaker embedding vectors, which are used in steps S10 to S40. In some embodiments, the Deep Speaker framework may be used, where the input is a piece of audio and the output is a speaker-dependent embedding vector.
The voice conversion method of this embodiment provides an efficient data augmentation scheme: high-quality parallel corpus can be generated from non-parallel corpus, effectively solving the problem that parallel corpus is difficult to collect. The generated parallel corpus is strictly time-aligned, which avoids the subsequent time alignment that traditional parallel corpus requires. Moreover, the voice conversion acoustic model has a fully parallel structure, so both training and inference are very efficient. The prior art mostly adopts unsupervised voice conversion (i.e., based on non-parallel corpus); because the optimization direction is unclear (the expected effect of the converted voice is unknown), the learning difficulty of the model is high, and more complex model structures and additional data are often required to improve the effect of the voice conversion model. In contrast, the present application can efficiently generate high-quality parallel corpus, which clarifies the optimization direction of the voice conversion model, reduces the learning difficulty, and allows the voice conversion model to be designed with a simpler and more efficient structure, making training and inference of the voice conversion task more efficient. The voice conversion acoustic model of this embodiment has no autoregressive structure, and training and inference can be performed in parallel, which ensures the efficiency of the model.
Fig. 2 is a flowchart of a training method of a speech conversion model according to an embodiment of the present application. It should be noted that, if there are substantially the same results, the method of the present application is not limited to the flow sequence shown in fig. 2. In this embodiment, the voice conversion model includes a voice conversion acoustic model and a voice generator, and the training method of the voice conversion model includes the following steps:
s71, respectively acquiring a corresponding sample text sequence, sample speaker characteristics and a sample source Mel spectrum according to real voice data;
s72, obtaining a sample fusion feature vector according to the sample text sequence and the sample speaker feature, and obtaining a sample spectrum feature vector according to a sample source Mel spectrum;
s73, inputting the sample fusion feature vector and the sample spectrum feature vector into a voice conversion acoustic model to be trained, and outputting a sample target Mel spectrum;
s74, calculating a conversion error according to the sample source Mel spectrum and the sample target Mel spectrum, and adjusting parameters of the voice conversion acoustic model according to the conversion error until the voice conversion acoustic model reaches a training convergence condition;
s75, inputting the sample frequency spectrum feature vector to the voice generator, and outputting target voice data;
s76, calculating a generation error according to the real voice data and the target voice data, and adjusting parameters of the voice generator according to the generation error until the voice generator reaches a training convergence condition.
The steps of the training method of this embodiment may be specifically referred to the description of the above embodiment.
As shown in fig. 3, an embodiment of the present application provides a voice conversion apparatus, the voice conversion apparatus 30 includes: the device comprises a feature fusion module 31, a frequency spectrum coding module 32, a voice conversion module 33 and a voice acquisition module 34, wherein the feature fusion module 31 is used for acquiring a fusion feature vector according to a text sequence and the speaker features of a first speaker; a spectrum encoding module 32, configured to obtain a source spectrum feature vector according to a source mel spectrum of the second speaker; the voice conversion module 33 is configured to input the fusion feature vector and the source spectrum feature vector into a pre-trained voice conversion acoustic model, and output a target mel spectrum of the first speaker, where the voice conversion acoustic model is obtained by training according to a sample fusion feature vector and a sample spectrum feature vector, the sample fusion feature vector is obtained according to a sample text sequence and a sample speaker feature, and the sample text sequence, the sample speaker feature and the sample spectrum feature vector are obtained according to the same real voice data; the voice acquisition module 34 is configured to acquire target voice data according to the target mel spectrum.
As an embodiment, the feature fusion module 31 is further configured to: splicing the text sequence and the speaker characteristics of the first speaker to obtain a spliced characteristic vector; and inputting the spliced feature vector into a first neural network, and outputting the fusion feature vector.
As an embodiment, the feature fusion module 31 is further configured to: inputting the spliced feature vector into a first full-connection layer, and extracting features of the spliced feature vector to obtain a high-dimensional fusion feature vector; and inputting the high-dimensional fusion feature vector into a second full-connection layer, and carrying out feature extraction on the high-dimensional fusion feature vector to obtain the fusion feature vector.
As an embodiment, the speech conversion module 33 is further configured to: acquiring a first alignment matrix according to the fusion feature vector and the source spectrum feature vector; obtaining an index mapping vector according to the first alignment matrix and the index vector of the fusion feature vector; acquiring a second alignment matrix according to the index mapping vector and the index vector of the fusion feature vector; and acquiring a fusion feature alignment vector according to the second alignment matrix and the fusion feature vector.
As an embodiment, the speech conversion module 33 is further configured to: respectively acquiring a corresponding sample text sequence, sample speaker characteristics and a sample source Mel spectrum according to the real voice data; acquiring a sample fusion feature vector according to the sample text sequence and the sample speaker feature, and acquiring a sample spectrum feature vector according to a sample source Mel spectrum; inputting the sample fusion feature vector and the sample spectrum feature vector into a voice conversion acoustic model to be trained, and outputting a sample target Mel spectrum; and calculating a conversion error according to the sample source Mel spectrum and the sample target Mel spectrum, and adjusting parameters of the voice conversion acoustic model according to the conversion error until the voice conversion acoustic model reaches a training convergence condition.
As one embodiment, the loss function of the speech conversion acoustic model is:
$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\left\|\widehat{mel}_i - mel_i\right\|,$$
where $\mathcal{L}$ is the loss value, $\widehat{mel}_i$ is the i-th sample target Mel spectrum, $mel_i$ is the i-th sample source Mel spectrum, and $N$ is the number of samples.
As an embodiment, the speech conversion module 33 is further configured to: acquire a first alignment matrix according to the sample fusion feature vector and the sample spectrum feature vector, wherein the sample fusion feature vector is acquired according to a sample text sequence and sample speaker features; obtain an index mapping vector according to the first alignment matrix and the index vector of the sample fusion feature vector; acquire a second alignment matrix according to the index mapping vector and the index vector of the sample fusion feature vector; acquire a sample fusion feature alignment vector according to the second alignment matrix and the sample fusion feature vector; and acquire the sample target Mel spectrum according to the sample fusion feature alignment vector.
As an embodiment, the voice acquisition module 34 is further configured to: and inputting the target Mel spectrum into a voice generator, and outputting corresponding target voice data.
As an embodiment, the voice acquisition module 34 is further configured to: obtaining at least one training sample, wherein the training sample comprises real voice data and a mel spectrum extracted from the real voice data; inputting the Mel spectrum to a Mel spectrum encoder, and outputting spectral characteristic data of the Mel spectrum; inputting the frequency spectrum characteristic data into the voice generator, and outputting target voice data; and calculating a generation error according to the real voice data and the target voice data, and adjusting parameters of the voice generator according to the generation error until the voice generator reaches a training convergence condition.
As an embodiment, the voice conversion apparatus 30 further includes: the feature extraction module is used for: performing voice recognition on source voice data of a second speaker to obtain corresponding text content, and obtaining a text sequence of the text content; acquiring a source mel spectrum of the second speaker according to the source voice data; and extracting the speaker characteristic from the second voice data of the first speaker as the speaker characteristic of the first speaker.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 4, the electronic device 60 includes a processor 61 and a memory 62 coupled to the processor 61.
The memory 62 stores program instructions for implementing the voice conversion method of any of the embodiments described above.
The processor 61 is configured to execute program instructions stored in the memory 62 for speech conversion.
The processor 61 may also be referred to as a CPU (Central Processing Unit ). The processor 61 may be an integrated circuit chip with signal processing capabilities. Processor 61 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Referring to FIG. 5, which is a schematic structural diagram of a storage medium according to an embodiment of the present application: the storage medium 70 stores program instructions 71 capable of implementing all the methods described above. The program instructions 71 may be stored in the storage medium in the form of a software product and include several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program codes, or a terminal device such as a computer, a server, a mobile phone, or a tablet.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units. The foregoing is only the embodiments of the present application, and is not intended to limit the scope of the patent application, and all equivalent structures or equivalent processes using the descriptions and the contents of the present application or other related technical fields are included in the scope of the patent application.
The foregoing is merely exemplary of the present application and it should be noted herein that modifications may be made by those skilled in the art without departing from the inventive concept herein, which fall within the scope of the present application.

Claims (10)

1. A method of speech conversion, comprising:
acquiring a fusion feature vector according to the text sequence and the speaker features of the first speaker;
acquiring a source spectrum characteristic vector according to a source Mel spectrum of the second speaker;
inputting the fusion feature vector and the source spectrum feature vector into a pre-trained voice conversion acoustic model, and outputting a target mel spectrum of the first speaker, wherein the voice conversion acoustic model is obtained by training according to a sample fusion feature vector and a sample spectrum feature vector, the sample fusion feature vector is obtained according to a sample text sequence and sample speaker features, and the sample text sequence, the sample speaker features and the sample spectrum feature vector are obtained according to the same real voice data;
and acquiring target voice data according to the target Mel spectrum.
2. The method of claim 1, wherein the obtaining the fusion feature vector based on the text sequence and the speaker characteristics of the first speaker comprises:
splicing the text sequence and the speaker characteristics of the first speaker to obtain a spliced characteristic vector;
and inputting the spliced feature vector into a first neural network, and outputting the fusion feature vector.
3. The voice conversion method according to claim 1, wherein inputting the fusion feature vector and the source spectrum feature vector into a pre-trained voice conversion acoustic model, outputting a target mel spectrum of the first speaker, comprises:
acquiring a first alignment matrix according to the fusion feature vector and the source spectrum feature vector;
obtaining an index mapping vector according to the first alignment matrix and the index vector of the fusion feature vector;
acquiring a second alignment matrix according to the index mapping vector and the index vector of the fusion feature vector;
and acquiring a fusion feature alignment vector according to the second alignment matrix and the fusion feature vector.
4. The method of claim 3, wherein the training step of the speech conversion acoustic model comprises:
respectively acquiring a corresponding sample text sequence, sample speaker characteristics and a sample source Mel spectrum according to the real voice data;
acquiring a sample fusion feature vector according to the sample text sequence and the sample speaker feature, and acquiring a sample spectrum feature vector according to a sample source Mel spectrum;
inputting the sample fusion feature vector and the sample spectrum feature vector into a voice conversion acoustic model to be trained, and outputting a sample target Mel spectrum;
and calculating a conversion error according to the sample source Mel spectrum and the sample target Mel spectrum, and adjusting parameters of the voice conversion acoustic model according to the conversion error until the voice conversion acoustic model reaches a training convergence condition.
5. The method of claim 4, wherein the loss function of the speech conversion acoustic model is:
$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\left\|\widehat{mel}_i - mel_i\right\|,$$
where $\mathcal{L}$ is the loss value, $\widehat{mel}_i$ is the i-th sample target Mel spectrum, $mel_i$ is the i-th sample source Mel spectrum, and $N$ is the number of samples.
6. The voice conversion method according to claim 4, wherein the obtaining target voice data from the target mel spectrum includes:
inputting the target Mel spectrum into a voice generator, and outputting corresponding target voice data;
the training step of the voice generator comprises the following steps:
obtaining at least one training sample, wherein the training sample comprises real voice data and a mel spectrum extracted from the real voice data;
inputting the Mel spectrum to a Mel spectrum encoder, and outputting spectral characteristic data of the Mel spectrum;
inputting the frequency spectrum characteristic data into the voice generator, and outputting target voice data;
and calculating a generation error according to the real voice data and the target voice data, and adjusting parameters of the voice generator according to the generation error until the voice generator reaches a training convergence condition.
7. The speech conversion method according to claim 1, wherein before the obtaining the fusion feature vector according to the text sequence and the speaker feature of the first speaker, further comprises:
performing voice recognition on source voice data of a second speaker to obtain corresponding text content, and obtaining a text sequence of the text content;
acquiring a source mel spectrum of the second speaker according to the source voice data;
and extracting the speaker characteristic from the second voice data of the first speaker as the speaker characteristic of the first speaker.
8. A speech conversion apparatus, comprising:
the feature fusion module is used for acquiring a fusion feature vector according to the text sequence and the speaker features of the first speaker;
the spectrum coding module is used for acquiring a source spectrum characteristic vector according to the source Mel spectrum of the second speaker;
the voice conversion module is used for inputting the fusion feature vector and the source spectrum feature vector into a pre-trained voice conversion acoustic model and outputting a target mel spectrum of the first speaker, wherein the voice conversion acoustic model is obtained by training according to a sample fusion feature vector and a sample spectrum feature vector, the sample fusion feature vector is obtained according to a sample text sequence and sample speaker features, and the sample text sequence, the sample speaker features and the sample spectrum feature vector are obtained according to the same real voice data;
and the voice acquisition module is used for acquiring target voice data according to the target Mel spectrum.
9. An electronic device comprising a processor, and a memory coupled to the processor, the memory storing program instructions executable by the processor; the processor, when executing the program instructions stored in the memory, implements the speech conversion method according to any one of claims 1 to 7.
10. A storage medium having stored therein program instructions which, when executed by a processor, implement the speech conversion method of any one of claims 1 to 7.
CN202310295364.6A 2023-03-23 2023-03-23 Voice conversion method, device, electronic equipment and storage medium Pending CN116312617A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310295364.6A CN116312617A (en) 2023-03-23 2023-03-23 Voice conversion method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310295364.6A CN116312617A (en) 2023-03-23 2023-03-23 Voice conversion method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116312617A true CN116312617A (en) 2023-06-23

Family

ID=86795749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310295364.6A Pending CN116312617A (en) 2023-03-23 2023-03-23 Voice conversion method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116312617A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665675A (en) * 2023-07-25 2023-08-29 上海蜜度信息技术有限公司 Voice transcription method, system, electronic equipment and storage medium
CN116665675B (en) * 2023-07-25 2023-12-12 上海蜜度信息技术有限公司 Voice transcription method, system, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111754976B (en) Rhythm control voice synthesis method, system and electronic device
CN107195296B (en) Voice recognition method, device, terminal and system
CN110246488B (en) Voice conversion method and device of semi-optimized cycleGAN model
CN109686383A (en) A kind of speech analysis method, device and storage medium
EP3910625A2 (en) Method and apparatus for utterance time estimation
KR102272554B1 (en) Method and system of text to multiple speech
CN112489629A (en) Voice transcription model, method, medium, and electronic device
CN113053357A (en) Speech synthesis method, apparatus, device and computer readable storage medium
Ho et al. Non-parallel voice conversion based on hierarchical latent embedding vector quantized variational autoencoder
CN113205792A (en) Mongolian speech synthesis method based on Transformer and WaveNet
CN116312617A (en) Voice conversion method, device, electronic equipment and storage medium
Beckmann et al. Word-level embeddings for cross-task transfer learning in speech processing
CN110930975A (en) Method and apparatus for outputting information
Luo et al. Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform.
CN112712789A (en) Cross-language audio conversion method and device, computer equipment and storage medium
Thalengala et al. Study of sub-word acoustical models for Kannada isolated word recognition system
Choi et al. Learning to Maximize Speech Quality Directly Using MOS Prediction for Neural Text-to-Speech
CN116092473A (en) Prosody annotation model, training method of prosody prediction model and related equipment
CN114333762B (en) Expressive force-based speech synthesis method, expressive force-based speech synthesis system, electronic device and storage medium
CN113314097B (en) Speech synthesis method, speech synthesis model processing device and electronic equipment
Gedam et al. Development of automatic speech recognition of Marathi numerals-a review
US11670292B2 (en) Electronic device, method and computer program
CN114299989A (en) Voice filtering method and device, electronic equipment and storage medium
Li et al. Lightspeech: Lightweight non-autoregressive multi-speaker text-to-speech
Gambhir et al. End-to-end multi-modal low-resourced speech keywords recognition using sequential Conv2D nets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination