CN111292718A - Voice conversion processing method and device, electronic equipment and storage medium - Google Patents

Voice conversion processing method and device, electronic equipment and storage medium

Info

Publication number
CN111292718A
CN111292718A (application CN202010084699.XA)
Authority
CN
China
Prior art keywords
voice
mapping
space
speech
conversion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010084699.XA
Other languages
Chinese (zh)
Inventor
孙浩然
王东
李蓝天
蔡云麒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202010084699.XA
Publication of CN111292718A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 — Speech synthesis; Text to speech systems
    • G10L 13/02 — Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 — Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047 — Architecture of speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the invention disclose a voice conversion processing method and apparatus, an electronic device, and a storage medium. The method comprises: using the space-mapping capability of a flow model, mapping the original voice from the real space into a simple, continuous hidden space to obtain a hidden-space voice; determining the conversion direction of the target voice in the hidden space, and displacing the hidden-space voice along that direction to obtain a displaced voice; and mapping the displaced voice back to the real space through the inverse mapping of the flow model to obtain the target voice, thereby realizing voice conversion from the original voice to the target voice. Because the original voice is mapped into a continuous hidden space, its characteristics are changed there, and the converted target voice is recovered by inverse mapping, the method has strong distortion resistance and does not damage other attributes, so the converted target voice is continuous and smooth; at the same time, resource consumption is low and no excessive computational overhead is introduced.

Description

Voice conversion processing method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a voice conversion processing method and device, electronic equipment and a storage medium.
Background
Voice conversion technology mainly refers to converting information about the sound source: the aim is for the converted speech to change one or more pronunciation characteristics of the source speech, based on a certain conversion rule, without changing its other characteristics. Typical voice conversions include accent conversion (speech conversion between different accents), speaker conversion (speech conversion between different speakers), and emotion conversion (speech conversion between different emotions). Voice conversion technology has wide application scenarios in the field of intelligent human-computer interaction.
The speech conversion technique can be divided into two steps of training and conversion: in the training stage, the system trains the source category voice and the target category voice to obtain a mapping rule between the source category voice and the target category voice and obtain a relation between spectrum parameters of the source category voice and the target category voice; in the conversion stage, the mapping rule obtained in the training stage is used for transforming the spectrum characteristics of the source type voice, so that the transformed voice has the characteristics of the target type voice.
The existing voice conversion methods include a conversion method based on codebook mapping, a conversion method based on a Gaussian mixture model, a conversion method based on personalized voice synthesis, and the like.
The conversion method based on codebook mapping firstly effectively reduces the feature quantity of source and target voices through a vector quantization method, and then converts the centroid vector closest to the source codebook into a corresponding target codebook through a clustering method, thereby realizing voice conversion. However, this method cannot consider the continuity of the context of the speech during quantization, which results in discontinuity of the feature space, and thus the conversion effect is not ideal.
The conversion method based on the Gaussian mixture model introduces a Gaussian mixture model to model the speech signal, replacing the "hard" clustering of vector quantization with probability-based "soft" clustering. However, this method estimates only the source feature vectors rather than joint feature vectors, still takes poor account of the contextual information of the speech, and is prone to over-fitting and over-smoothing problems.
The conversion method based on personalized speech synthesis synthesizes the speech with the target pronunciation characteristic by introducing an additional characterization vector for representing the target pronunciation characteristic into the vocoder, but the calculation amount is large and the resource consumption is high.
Disclosure of Invention
Because the existing methods have the above problems, embodiments of the present invention provide a method and an apparatus for processing speech conversion, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present invention provides a speech conversion processing method, including:
according to the space mapping capability of the flow model, mapping the original voice of the real space to a simple continuous hidden space to obtain hidden space voice;
determining a conversion direction of a target voice in the hidden space, and displacing the hidden space voice in the conversion direction to obtain a displaced voice;
and mapping the displaced voice back to the real space according to the inverse mapping of the flow model to obtain the target voice so as to realize voice conversion from the original voice to the target voice.
Optionally, before the mapping the original speech of the real space to the simple continuous hidden space according to the spatial mapping capability of the stream model to obtain the hidden space speech, the method further includes:
fusing voice samples with various characteristics together, and training to obtain the flow model when the probability of all the voice samples in an observation space is maximum and the voice samples accord with Gaussian distribution in a hidden space;
wherein the Gaussian distribution z is:
z=f(x)
where x ∼ P(x) and z ∼ N(0, 1); x is a voice sample, P(x) is the data distribution of the real space, f(x) is an invertible mapping, and N(0, 1) is the standard normal distribution.
Optionally, the mapping, according to the spatial mapping capability of the stream model, the original speech in the real space into a simple continuous hidden space to obtain a hidden space speech specifically includes:
respectively mapping an original voice A and a target voice B to a hidden space to obtain a corresponding hidden space voice:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, the determining a conversion direction of the target speech in the hidden space, and shifting the hidden space speech in the conversion direction to obtain a shifted speech specifically includes:
calculating a first center point of the original voice A from the z_Ai, and a second center point of the target voice B from the z_Bj;
determining the conversion direction Δz of the target voice according to the first center point and the second center point:
Δz = (1/N_B) Σ_j z_Bj − (1/N_A) Σ_i z_Ai
shifting the hidden-space voice z_p along the conversion direction Δz to obtain the displaced voice z′_p:
z′_p = z_p + λΔz
where λ is the step length, 0 < λ ≤ 1.
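As a hedged illustration, the shift z′_p = z_p + λΔz can be sketched in a few lines of NumPy; the latent arrays z_A and z_B below are synthetic stand-ins for codes produced by a trained flow's forward pass, not outputs of the patented model.

```python
import numpy as np

# Sketch of the hidden-space shift z'_p = z_p + lambda * dz.
# z_A, z_B are synthetic stand-ins for latent codes from a trained flow.
rng = np.random.default_rng(0)
z_A = rng.normal(loc=-1.0, size=(100, 8))   # latents with attribute A
z_B = rng.normal(loc=+1.0, size=(120, 8))   # latents with attribute B

# Conversion direction: centroid of B minus centroid of A
dz = z_B.mean(axis=0) - z_A.mean(axis=0)

lam = 1.0                                   # step length, 0 < lam <= 1
z_p = z_A[0]                                # one latent code to convert
z_shifted = z_p + lam * dz                  # now carries attribute B
```

With λ = 1 the latent moves by the full centroid offset; smaller λ gives a partial conversion.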
Optionally, the mapping the displaced speech back to the real space according to the inverse mapping of the stream model to obtain the target speech, so as to implement speech conversion between the original speech and the target speech, specifically including:
mapping the displaced voice z′_p to the target voice x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse of f.
In a second aspect, an embodiment of the present invention further provides a speech conversion processing apparatus, including:
the hidden space mapping module is used for mapping the original voice of the real space to a simple continuous hidden space according to the space mapping capacity of the stream model to obtain the hidden space voice;
the voice displacement module is used for determining the conversion direction of the target voice in the hidden space and displacing the hidden space voice in the conversion direction to obtain the displaced voice;
and the voice mapping module is used for mapping the displaced voice back to the real space according to the inverse mapping of the flow model to obtain the target voice so as to realize the voice conversion from the original voice to the target voice.
Optionally, the speech conversion processing apparatus further includes:
the model training module is used for fusing the voice samples with various characteristics together, and when the probability of all the voice samples in an observation space is maximum and the voice samples accord with Gaussian distribution in a hidden space, the flow model is obtained through training;
wherein the Gaussian distribution z is:
z=f(x)
where x ∼ P(x) and z ∼ N(0, 1); x is a voice sample, P(x) is the data distribution of the real space, f(x) is an invertible mapping, and N(0, 1) is the standard normal distribution.
Optionally, the hidden space mapping module is specifically configured to:
respectively mapping an original voice A and a target voice B to a hidden space to obtain a corresponding hidden space voice:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, the voice displacement module is specifically configured to:
according to zAiCalculating the first center point of the original speech A according to zBjCalculating a second central point of the original voice B;
determining the conversion direction delta z of the target voice according to the first central point and the second central point:
Figure BDA0002381644890000041
according to the conversion direction delta z and the hidden space voice zpCalculating to obtain voice z 'after displacement'p
z′p=zp+λΔz
Wherein, λ is step length, 0< λ ≦ 1.
Optionally, the voice mapping module specifically includes:
mapping the displaced voice z′_p to the target voice x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse of f.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, which when called by the processor are capable of performing the above-described methods.
In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium storing a computer program, which causes the computer to execute the above method.
According to the above technical solution, the original voice is mapped into a continuous hidden space, the voice characteristics are changed in the hidden space, and the converted target voice is obtained through inverse mapping; thus the distortion resistance is strong and other attributes are not damaged, so the converted target voice is more continuous and smooth. Meanwhile, resource consumption is low and no excessive computational overhead is introduced.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the following drawings show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart illustrating a voice conversion processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of speech conversion based on hidden space of flow model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a process for training a flow model according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a speech training and conversion process according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a speech conversion processing apparatus according to an embodiment of the present invention;
fig. 6 is a logic block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Fig. 1 shows a flowchart of a speech conversion processing method provided in this embodiment, which includes:
s101, according to the space mapping capacity of the flow model, mapping the original voice of the real space to a simple continuous hidden space to obtain the hidden space voice.
The flow model maps the real space into a normalized hidden space to improve the convenience of data processing.
The original voice is the voice to be converted.
The simple continuous hidden space is, for example, a vector space with a standard Gaussian distribution.
The hidden-space voice is the voice obtained by the mapping into the hidden space.
S102, determining the conversion direction of the target voice in the hidden space, and displacing the hidden space voice in the conversion direction to obtain the displaced voice.
The target voice is the final voice obtained by converting the original voice.
The conversion direction is the direction of voice conversion in the hidden space.
And the voice after the displacement is the voice obtained by performing displacement in the hidden space.
S103, mapping the displaced voice back to the real space according to the inverse mapping of the flow model to obtain the target voice so as to realize voice conversion from the original voice to the target voice.
Specifically, the stream model has strong data modeling capability, and the convenience of data processing can be greatly improved by mapping the real data space to a normalized hidden space.
The real space is sparse and discontinuous: the data lie on a complex high-dimensional manifold that is difficult to characterize, so it is hard to transform the data reasonably without damaging the manifold structure. The hidden space is different. There the data distribution is dense and continuous, e.g., Gaussian, so a point on the line connecting any two sample points has a relatively high p(z). In addition, because regions of lower probability density in the data space are compressed into smaller regions of the hidden space, a displacement in the hidden space has a higher probability of yielding an effective voice conversion. Furthermore, in the hidden space the manifold corresponding to a voice characteristic tends to be flat, so a basic characteristic of the voice can be changed by moving along a straight line. Based on these two properties, a simple displacement along the line connecting sample points realizes the conversion from one sample point to another. If one shifts from the center of a set of sample points with one attribute (e.g., speaker A) toward the center of the sample points with another attribute (e.g., speaker B), a conversion between the attributes (from speaker A to speaker B) is achieved.
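For a standard-normal hidden space, the claim that linear interpolation between two latent points stays in high-density territory can be checked numerically; the sketch below uses random vectors as stand-in latent codes and relies only on the triangle inequality, not on any property of the patented model.

```python
import numpy as np

# Under a standard-normal latent prior, density depends only on ||z||,
# and ||(z1 + z2) / 2|| <= max(||z1||, ||z2||) by the triangle inequality,
# so the midpoint is never less likely than the worse of the two endpoints.
rng = np.random.default_rng(1)
z1, z2 = rng.normal(size=16), rng.normal(size=16)

def log_density(z):
    return -0.5 * np.sum(z**2)  # log N(z; 0, I) up to an additive constant

mid = 0.5 * (z1 + z2)
print(log_density(mid) >= min(log_density(z1), log_density(z2)))  # True
```

The same bound holds for every point on the connecting segment, which is why displacement paths in the hidden space avoid meaningless low-density regions.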
As shown in fig. 2, the great circle Z is the entire hidden space, ellipse A represents the set of pronunciations of speaker A, and ellipse B represents the set of pronunciations of speaker B. When the data move along the line connecting the two centers, the conversion from speaker A to speaker B is realized. Conversions of emotion, accent, etc. can be realized in the same way.
In this embodiment, the original voice is mapped into a continuous hidden space, the voice characteristics are changed in the hidden space, and the converted target voice is obtained through inverse mapping; thus the distortion resistance is strong and other attributes are not damaged, so the converted target voice is more continuous and smooth. Meanwhile, resource consumption is low and no excessive computational overhead is introduced.
Further, on the basis of the above embodiment of the method, before S101, the method further includes:
fusing voice samples with various characteristics together, and training to obtain the flow model when the probability of all the voice samples in an observation space is maximum and the voice samples accord with Gaussian distribution in a hidden space;
wherein the Gaussian distribution z is:
z=f(x)
where x ∼ P(x) and z ∼ N(0, 1); x is a voice sample, P(x) is the data distribution of the real space, f(x) is an invertible mapping, and N(0, 1) is the standard normal distribution.
Specifically, the flow model is a generative model that has emerged in recent years. Like most generative models, its essential goal is to fit the distribution P(x) of the data space: if the distribution P(x) of the real data space were known, abundant realistic data could be obtained by sampling. However, real data are very complex, and directly estimating P(x) from the training data is clearly infeasible. The flow model therefore uses a simple normal variable z and an invertible mapping f(x) to fit the true data distribution.
The model may define a forward process and a reverse process. The forward process obtains a corresponding hidden variable z by sampling x in the observation space and transforming to the hidden space, as shown in fig. 3:
z = f(x), x ∼ P(x)
in the reverse process, an observation variable x is obtained by sampling z in a hidden space and mapping the sampling z back to an observation space:
x = f⁻¹(z), z ∼ N(0, 1)
where f⁻¹ is the inverse transformation of f (x and z have the same dimension), and can be implemented by various invertible neural networks.
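The forward/inverse symmetry can be illustrated with a toy invertible map. The elementwise affine flow below is a hypothetical single-layer stand-in (real flow models such as RealNVP or Glow stack many invertible layers), used only to show that f and f⁻¹ reconstruct each other exactly.

```python
import numpy as np

# Hypothetical single-layer invertible map f(x) = (x - mu) / sigma with
# exact inverse x = mu + sigma * z; a stand-in for a full flow model.
class AffineFlow:
    def __init__(self, mu, sigma):
        self.mu = np.asarray(mu, dtype=float)
        self.sigma = np.asarray(sigma, dtype=float)

    def forward(self, x):   # forward process: z = f(x)
        return (x - self.mu) / self.sigma

    def inverse(self, z):   # reverse process: x = f^{-1}(z)
        return self.mu + self.sigma * z

f = AffineFlow(mu=[1.0, -2.0], sigma=[0.5, 2.0])
x = np.array([2.0, 0.0])
z = f.forward(x)
x_rec = f.inverse(z)
print(np.allclose(x, x_rec))  # True: the mapping is exactly invertible
```

Exact invertibility is what lets the conversion return from the hidden space to the speech space without reconstruction loss.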
In training, it is desirable to maximize the probability of the model for all observed data, so the objective function is set as follows:
max_f Σ_i log P(x_i)
according to the knowledge of probability theory, the following steps are carried out:
P(x) = P_z(f(x)) · |det(∂f(x)/∂x)|
therefore, there are:
log P(x) = log P_z(f(x)) + log |det(∂f(x)/∂x)|
In the above formula, P_z is the distribution of the hidden variable z, which can be specified in advance. Thus, for given parameters of the mapping function f, the objective function is computable. During training, the parameters of f can be learned by various optimization algorithms (such as gradient descent), thereby optimizing the objective function.
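The change-of-variables objective can be verified on a toy affine map: for z = (x − μ)/σ the Jacobian determinant is Π(1/σ), and the resulting log P(x) must equal the closed-form Gaussian density N(x; μ, diag(σ²)). This is a sketch under that toy-map assumption, not the patent's trained model.

```python
import numpy as np

# log P(x) via change of variables for the toy map z = (x - mu) / sigma:
# log P(x) = log N(f(x); 0, I) + log|det df/dx|, with |det df/dx| = prod(1/sigma).
def log_prob_x(x, mu, sigma):
    z = (x - mu) / sigma
    log_pz = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi))  # standard-normal prior
    log_det = -np.sum(np.log(sigma))                    # Jacobian term
    return log_pz + log_det

# Sanity check against the closed-form density N(x; mu, diag(sigma^2))
mu, sigma = np.array([1.0, -2.0]), np.array([0.5, 2.0])
x = np.array([2.0, 0.0])
direct = -0.5 * np.sum(((x - mu) / sigma) ** 2 + np.log(2.0 * np.pi * sigma**2))
print(np.isclose(log_prob_x(x, mu, sigma), direct))  # True
```

Summing log_prob_x over a training set gives exactly the objective above, which gradient descent can then maximize with respect to the flow's parameters.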
Specifically, as shown in fig. 4, in this embodiment, real speech is used as an observation variable, and a strong space mapping capability of a stream model is utilized to map a real speech signal into a simple continuous hidden space through the stream model; then, finding the conversion direction of the target pronunciation characteristic in the hidden space, and shifting in the direction to realize the conversion from the original voice to the target voice in the hidden space; and finally, mapping the converted voice in the hidden space back to a real voice space by utilizing the inverse mapping of the stream model, thereby realizing the voice conversion between the source voice and the target voice.
Further, on the basis of the above method embodiment, S101 specifically includes:
respectively mapping an original voice A and a target voice B to a hidden space to obtain a corresponding hidden space voice:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, S102 specifically includes:
calculating a first center point of the original voice A from the z_Ai, and a second center point of the target voice B from the z_Bj;
determining the conversion direction Δz of the target voice according to the first center point and the second center point:
Δz = (1/N_B) Σ_j z_Bj − (1/N_A) Σ_i z_Ai
shifting the hidden-space voice z_p along the conversion direction Δz to obtain the displaced voice z′_p:
z′_p = z_p + λΔz
where λ is the step length, 0 < λ ≤ 1.
S103 specifically comprises the following steps:
mapping the displaced voice z′_p to the target voice x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse of f.
In the process of voice conversion processing, the following four stages are specifically included:
s1, training: all samples with various speech characteristics are fused together to train the stream model. The stream model obtained by training enables the probability of all sample points to be maximum in the observation space and accords with Gaussian distribution in the hidden space, namely:
z=f(x)
where x ∼ P(x), z ∼ N(0, 1).
S2, forward transformation stage: based on the flow model obtained in S1, the sample set with characteristic A and the sample set with characteristic B are mapped into the hidden space:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
s3, hidden space conversion stage: based on the hidden variables obtained in S2, the center point of the sample set having the characteristic a and the center point of the sample set having the characteristic B are calculated, respectively. The direction from the center point of the characteristic a to the center point of the characteristic B represents a characteristic conversion direction Δ z, that is:
Figure BDA0002381644890000101
For each sample point z_p to be converted that has characteristic A, a suitable step length λ is selected and the point is shifted along this direction to obtain a sample point with the target characteristic B, namely:
z′p=zp+λΔz
where 0 < λ ≤ 1.
S4, inverse transformation stage: based on the hidden-space sample point z′_p obtained by the S3 conversion, z′_p is inversely transformed back to the original voice space using the flow model obtained in S1, yielding a real voice data sample x′_p, namely:
x′p=f-1(z′p)
thus, the conversion from the pronunciation characteristic a to the pronunciation characteristic B is realized.
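Assuming a toy affine map in place of a trained flow model, the four stages S1–S4 above can be strung together as follows; μ and σ are hypothetical "trained" parameters and the sample sets are synthetic, so this only sketches the data flow of the pipeline.

```python
import numpy as np

# Four-stage sketch with a toy invertible map standing in for a trained flow.
# mu, sigma are hypothetical "trained" parameters (S1); the data are synthetic.
mu, sigma = np.array([5.0, -3.0]), np.array([2.0, 0.5])
fwd = lambda x: (x - mu) / sigma           # forward transform z = f(x)
inv = lambda z: mu + sigma * z             # inverse transform x = f^-1(z)

rng = np.random.default_rng(2)
x_A = rng.normal(loc=[4.0, -3.5], size=(50, 2))   # samples with property A
x_B = rng.normal(loc=[6.0, -2.5], size=(50, 2))   # samples with property B

z_A, z_B = fwd(x_A), fwd(x_B)              # S2: map both sets to hidden space
dz = z_B.mean(axis=0) - z_A.mean(axis=0)   # S3: conversion direction
z_shift = fwd(x_A[0]) + 1.0 * dz           # S3: shift with lambda = 1
x_converted = inv(z_shift)                 # S4: back to the speech space
```

In a real system the lambdas fwd/inv would be the forward and inverse passes of the trained flow network, and x would be spectral features of speech rather than 2-D points.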
The speech conversion processing method provided by the embodiment can be applied to the following specific scenes:
and (3) switching the accents: the conversion of the same speaker between different accents is realized. For example, a speech conversion from a northeast dialect to mandarin chinese can be achieved by mixing mandarin and northeast dialect together with the training stream model according to the above conversion method.
Speaker conversion: the voice signal of one speaker (source speaker) is converted, and the modified voice signal sounds like the other speaker (target speaker) to speak under the premise of keeping the expressed semantic information.
Emotion conversion: realizing voice conversion of a speaker under different emotions. For example, by mixing voices with different emotions together to train the flow model, the voice conversion of a speaker from positive emotion (happy, excited) to negative emotion (sad) can be realized according to the above conversion method.
Compared with typical voice conversion models, the flow-model-based conversion method converts only along the direction related to a given attribute without damaging other attributes, so the converted speech is more continuous and smooth. Moreover, the conversion is performed in a hidden space with a continuous Gaussian distribution, so meaningless data points do not appear on the conversion path and distortion is prevented; the method's distortion resistance is therefore strong. In addition, the flow model has a simple structure, involves no excessive computation, and has low overhead and resource consumption; the system software structure is also simple, since the conversion system depends on only one flow model.
Fig. 5 is a schematic structural diagram of a speech conversion processing apparatus provided in this embodiment, where the apparatus includes: a hidden space mapping module 501, a voice displacement module 502 and a voice mapping module 503, wherein:
the hidden space mapping module 501 is configured to map an original speech of a real space into a simple continuous hidden space according to a space mapping capability of a stream model to obtain a hidden space speech;
the voice displacement module 502 is configured to determine a conversion direction of a target voice in the hidden space, and displace the hidden space voice in the conversion direction to obtain a displaced voice;
the voice mapping module 503 is configured to map the displaced voice back to the real space according to inverse mapping of a stream model to obtain the target voice, so as to implement voice conversion from the original voice to the target voice.
Specifically, the hidden space mapping module 501 maps the original speech of the real space into a simple continuous hidden space according to the space mapping capability of the stream model, so as to obtain a hidden space speech; the voice displacement module 502 determines a conversion direction of a target voice in the hidden space, and displaces the hidden space voice in the conversion direction to obtain a displaced voice; the voice mapping module 503 maps the shifted voice back to the real space according to the inverse mapping of the stream model to obtain the target voice, so as to implement voice conversion from the original voice to the target voice.
In this embodiment, the original voice is mapped into a continuous hidden space, the voice characteristics are changed in the hidden space, and the converted target voice is obtained through inverse mapping; thus the distortion resistance is strong and other attributes are not damaged, so the converted target voice is more continuous and smooth. Meanwhile, resource consumption is low and no excessive computational overhead is introduced.
Further, on the basis of the above device embodiment, the voice conversion processing device further includes:
the model training module is used for fusing the voice samples with various characteristics together, and when the probability of all the voice samples in an observation space is maximum and the voice samples accord with Gaussian distribution in a hidden space, the flow model is obtained through training;
wherein the Gaussian distribution z is:
z=f(x)
where x ∼ P(x) and z ∼ N(0, 1); x is a voice sample, P(x) is the data distribution of the real space, f(x) is an invertible mapping, and N(0, 1) is the standard normal distribution.
Further, on the basis of the above apparatus embodiment, the hidden space mapping module 501 is specifically configured to:
respectively mapping an original voice A and a target voice B to a hidden space to obtain a corresponding hidden space voice:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, the voice displacement module is specifically configured to:
calculating a first center point of the original voice A from the z_Ai, and a second center point of the target voice B from the z_Bj;
determining the conversion direction Δz of the target voice according to the first center point and the second center point:
Δz = (1/N_B) Σ_j z_Bj − (1/N_A) Σ_i z_Ai
shifting the hidden-space voice z_p along the conversion direction Δz to obtain the displaced voice z′_p:
z′_p = z_p + λΔz
where λ is the step length, 0 < λ ≤ 1.
Further, on the basis of the above device embodiment, the voice mapping module 503 is specifically configured to:
map the displaced voice z′_p to the target voice x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse of f.
The speech conversion processing apparatus described in this embodiment may be used to implement the above method embodiments, and the principle and technical effect are similar, which are not described herein again.
Referring to fig. 6, the electronic device includes: a processor (processor)601, a memory (memory)602, and a bus 603;
the processor 601 and the memory 602 communicate with each other through the bus 603;
the processor 601 is used for calling the program instructions in the memory 602 to execute the methods provided by the above-mentioned method embodiments.
The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above-described method embodiments.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the method embodiments described above.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
It should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of speech conversion processing, comprising:
mapping an original speech in a real space into a simple continuous hidden space according to the spatial mapping capability of a flow model, to obtain a hidden-space speech;
determining a conversion direction of a target speech in the hidden space, and displacing the hidden-space speech in the conversion direction to obtain a displaced speech;
and mapping the displaced speech back to the real space according to an inverse mapping of the flow model to obtain the target speech, so as to implement speech conversion from the original speech to the target speech.
2. The method of claim 1, wherein before mapping an original speech in the real space into a simple continuous hidden space according to the spatial mapping capability of the flow model to obtain a hidden-space speech, the method further comprises:
fusing speech samples with various characteristics together, and training to obtain the flow model when the probability of all the speech samples in an observation space is maximized and the speech samples conform to a Gaussian distribution in the hidden space;
wherein the Gaussian-distributed hidden variable z is:
z = f(x)
x ~ P(x), z ~ N(0, 1)
where x is a speech sample, P(x) is the data distribution of the real space, f is the invertible mapping, and N(0, 1) is the standard normal distribution.
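The training criterion of claim 2 can be illustrated with a toy example. The sketch below (not the patent's actual network) uses a scalar affine flow f(x) = (x − mu) / sigma and the change-of-variables likelihood log p(x) = log N(f(x); 0, 1) + log |df/dx|; for this flow the maximum-likelihood parameters have the closed form mu = mean(x), sigma = std(x), and the mapped samples then follow the standard normal distribution. All names here are illustrative.

```python
import numpy as np

# Log-likelihood of the data under the affine flow f(x) = (x - mu) / sigma,
# by the change-of-variables formula for normalizing flows.
def log_likelihood(x, mu, sigma):
    z = (x - mu) / sigma                         # map into the hidden space
    log_pz = -0.5 * (z**2 + np.log(2 * np.pi))   # standard-normal log density
    log_det = -np.log(sigma)                     # log |df/dx| = -log sigma
    return np.mean(log_pz + log_det)

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=10_000)            # "speech samples" in real space

# Closed-form maximum-likelihood solution for this affine flow:
mu_hat, sigma_hat = x.mean(), x.std()
z = (x - mu_hat) / sigma_hat                     # hidden-space samples ~ N(0, 1)
print(abs(z.mean()) < 0.01, abs(z.std() - 1.0) < 0.01)
```

With the trained parameters, the hidden-space samples are standardized, which is exactly the "conform to a Gaussian distribution in the hidden space" condition of the claim.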
3. The speech conversion processing method according to claim 1, wherein the mapping an original speech in the real space into a simple continuous hidden space according to the spatial mapping capability of the flow model to obtain a hidden-space speech specifically comprises:
mapping an original speech A and a target speech B respectively into the hidden space to obtain the corresponding hidden-space speech:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, the determining a conversion direction of the target speech in the hidden space, and displacing the hidden-space speech in the conversion direction to obtain a displaced speech specifically comprises:
calculating a first center point of the original speech A according to z_Ai, and calculating a second center point of the target speech B according to z_Bj;
determining the conversion direction Δz of the target speech according to the first center point and the second center point:
Δz = (1/N_B) Σ_j z_Bj − (1/N_A) Σ_i z_Ai
calculating the displaced speech z′_p according to the conversion direction Δz and the hidden-space speech z_p:
z′_p = z_p + λΔz
where λ is the step size and 0 < λ ≤ 1.
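The centroid-difference direction and the shift of claim 3 can be sketched as follows. This is an illustrative example, assuming hidden vectors for speakers A and B are already available as arrays; with step size λ = 1, the shifted centroid of A lands exactly on the centroid of B.

```python
import numpy as np

# Conversion direction: second centroid (target B) minus first centroid (original A).
def conversion_direction(z_A, z_B):
    return z_B.mean(axis=0) - z_A.mean(axis=0)

# Displace a hidden-space vector along the conversion direction, 0 < lambda <= 1.
def shift(z_p, delta_z, lam=1.0):
    assert 0.0 < lam <= 1.0, "step size must satisfy 0 < lambda <= 1"
    return z_p + lam * delta_z

rng = np.random.default_rng(1)
z_A = rng.normal(-1.0, 0.3, size=(200, 8))   # hidden vectors of speech A
z_B = rng.normal(+1.0, 0.3, size=(200, 8))   # hidden vectors of speech B

dz = conversion_direction(z_A, z_B)
z_shifted = shift(z_A, dz, lam=1.0)
# With lambda = 1 the shifted centroid of A coincides with the centroid of B.
print(np.allclose(z_shifted.mean(axis=0), z_B.mean(axis=0)))
```

Intermediate values of λ interpolate between the two speakers, which is why the claim bounds the step size by 0 < λ ≤ 1.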
4. The method according to claim 1, wherein the mapping the displaced speech back to the real space according to the inverse mapping of the flow model to obtain the target speech, so as to implement the speech conversion from the original speech to the target speech, specifically comprises:
mapping the displaced speech z′_p to the target speech x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse mapping of f.
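Putting claims 1 to 4 together, the whole pipeline is: map into the hidden space with f, shift along the centroid difference, then map back with f⁻¹. The end-to-end sketch below uses a toy invertible map (per-dimension standardization) standing in for the trained flow model; all variable names are illustrative, not from the patent.

```python
import numpy as np

rng = np.random.default_rng(2)
x_A = rng.normal(size=(300, 4)) * 2.0 + 5.0   # "original speech" features
x_B = rng.normal(size=(300, 4)) * 2.0 - 5.0   # "target speech" features

# Toy invertible map f and its inverse f_inv (stand-in for the flow model).
mu = np.r_[x_A, x_B].mean(axis=0)
sigma = np.r_[x_A, x_B].std(axis=0)
f = lambda x: (x - mu) / sigma                # real space -> hidden space
f_inv = lambda z: z * sigma + mu              # inverse mapping back to real space

z_A, z_B = f(x_A), f(x_B)                     # step 1: map into the hidden space
dz = z_B.mean(axis=0) - z_A.mean(axis=0)      # step 2: conversion direction
z_conv = z_A + 1.0 * dz                       # displace with lambda = 1
x_conv = f_inv(z_conv)                        # step 3: inverse-map to real space

# The converted speech's centroid matches the target speaker's centroid.
print(np.allclose(x_conv.mean(axis=0), x_B.mean(axis=0)))
```

Because f and f⁻¹ here are linear, the centroid shift in the hidden space translates directly into a centroid shift in the real space; a trained nonlinear flow would realize the same three steps with a learned f.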
5. A speech conversion processing apparatus, comprising:
the hidden space mapping module is used for mapping an original speech in a real space into a simple continuous hidden space according to the spatial mapping capability of a flow model, to obtain a hidden-space speech;
the speech displacement module is used for determining a conversion direction of a target speech in the hidden space, and displacing the hidden-space speech in the conversion direction to obtain a displaced speech;
and the speech mapping module is used for mapping the displaced speech back to the real space according to an inverse mapping of the flow model to obtain the target speech, so as to implement the speech conversion from the original speech to the target speech.
6. The speech conversion processing apparatus according to claim 5, further comprising:
the model training module is used for fusing speech samples with various characteristics together, and training to obtain the flow model when the probability of all the speech samples in an observation space is maximized and the speech samples conform to a Gaussian distribution in the hidden space;
wherein the Gaussian-distributed hidden variable z is:
z = f(x)
x ~ P(x), z ~ N(0, 1)
where x is a speech sample, P(x) is the data distribution of the real space, f is the invertible mapping, and N(0, 1) is the standard normal distribution.
7. The apparatus according to claim 5, wherein the hidden space mapping module is specifically configured to:
map an original speech A and a target speech B respectively into the hidden space to obtain the corresponding hidden-space speech:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, the speech displacement module is specifically configured to:
calculate a first center point of the original speech A according to z_Ai, and calculate a second center point of the target speech B according to z_Bj;
determine the conversion direction Δz of the target speech according to the first center point and the second center point:
Δz = (1/N_B) Σ_j z_Bj − (1/N_A) Σ_i z_Ai
calculate the displaced speech z′_p according to the conversion direction Δz and the hidden-space speech z_p:
z′_p = z_p + λΔz
where λ is the step size and 0 < λ ≤ 1.
8. The speech conversion processing apparatus according to claim 5, wherein the speech mapping module is specifically configured to:
map the displaced speech z′_p to the target speech x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse mapping of f.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the speech conversion processing method according to any one of claims 1 to 4 when executing the program.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the speech conversion processing method according to any one of claims 1 to 4.
CN202010084699.XA 2020-02-10 2020-02-10 Voice conversion processing method and device, electronic equipment and storage medium Pending CN111292718A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010084699.XA CN111292718A (en) 2020-02-10 2020-02-10 Voice conversion processing method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN111292718A true CN111292718A (en) 2020-06-16

Family

ID=71023483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010084699.XA Pending CN111292718A (en) 2020-02-10 2020-02-10 Voice conversion processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111292718A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020072908A1 (en) * 2000-10-19 2002-06-13 Case Eliot M. System and method for converting text-to-voice
CN1835074A (en) * 2006-04-07 2006-09-20 安徽中科大讯飞信息科技有限公司 Speaking person conversion method combined high layer discription information and model self adaption
US20180012613A1 (en) * 2016-07-11 2018-01-11 The Chinese University Of Hong Kong Phonetic posteriorgrams for many-to-one voice conversion
CN108198566A (en) * 2018-01-24 2018-06-22 咪咕文化科技有限公司 Information processing method and device, electronic equipment and storage medium
CN108885608A (en) * 2016-06-09 2018-11-23 苹果公司 Intelligent automation assistant in home environment
US20190355372A1 (en) * 2018-05-17 2019-11-21 Spotify Ab Automated voiceover mixing and components therefor


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAORAN SUN ET AL: "On Investigation Of Unsupervised Speech Factorization Based On Normalization Flow", arXiv *
CHE YINGXIA ET AL: "Structured Gaussian mixture model under constraints and voice conversion with non-parallel corpora", Acta Electronica Sinica *
MA ZHEN ET AL: "Research on voice conversion based on separation of personal feature information in speech", Journal of Signal Processing *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114255737A (en) * 2022-02-28 2022-03-29 北京世纪好未来教育科技有限公司 Voice generation method and device and electronic equipment
CN114255737B (en) * 2022-02-28 2022-05-17 北京世纪好未来教育科技有限公司 Voice generation method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN111048064B (en) Voice cloning method and device based on single speaker voice synthesis data set
CN106297773A (en) A kind of neutral net acoustic training model method
CN111581470B (en) Multi-mode fusion learning analysis method and system for scene matching of dialogue system
CN114895817B (en) Interactive information processing method, network model training method and device
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
WO2021212954A1 (en) Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources
CN112860871B (en) Natural language understanding model training method, natural language understanding method and device
CN113327580A (en) Speech synthesis method, device, readable medium and electronic equipment
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN111292718A (en) Voice conversion processing method and device, electronic equipment and storage medium
CN116798405B (en) Speech synthesis method, device, storage medium and electronic equipment
CN117725936A (en) Long dialogue emotion dynamic identification method and system based on hypergraph network
Zoric et al. A real-time lip sync system using a genetic algorithm for automatic neural network configuration
Zhang et al. Promptspeaker: Speaker Generation Based on Text Descriptions
CN115798456A (en) Cross-language emotion voice synthesis method and device and computer equipment
Zorić et al. Real-time language independent lip synchronization method using a genetic algorithm
CN113611283A (en) Voice synthesis method and device, electronic equipment and storage medium
CN115148182A (en) Speech synthesis method and device
CN116863909B (en) Speech synthesis method, device and system based on factor graph
CN116825090B (en) Training method and device for speech synthesis model and speech synthesis method and device
Zhang et al. Review of the Lip-reading Recognition
CN114582314B (en) Man-machine audio-video interaction logic model design method based on ASR
CN114093389B (en) Speech emotion recognition method and device, electronic equipment and computer readable medium
CN115440198B (en) Method, apparatus, computer device and storage medium for converting mixed audio signal
WO2024008215A2 (en) Speech emotion recognition method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200616