CN111292718A - Voice conversion processing method and device, electronic equipment and storage medium - Google Patents

Voice conversion processing method and device, electronic equipment and storage medium

Info

Publication number
CN111292718A
CN111292718A (application CN202010084699.XA)
Authority
CN
China
Prior art keywords
voice
mapping
space
speech
conversion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010084699.XA
Other languages
Chinese (zh)
Inventor
孙浩然
王东
李蓝天
蔡云麒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202010084699.XA
Publication of CN111292718A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 — Speech synthesis; Text to speech systems
    • G10L 13/02 — Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 — Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047 — Architecture of speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the invention disclose a voice conversion processing method and apparatus, an electronic device, and a storage medium. The method comprises: using the space-mapping capability of a flow model, mapping the original voice from the real space into a simple, continuous hidden space to obtain a hidden-space voice; determining the conversion direction of the target voice in the hidden space, and displacing the hidden-space voice along that direction to obtain a displaced voice; and mapping the displaced voice back to the real space through the inverse mapping of the flow model to obtain the target voice, thereby realizing voice conversion from the original voice to the target voice. Because the original voice is mapped into a continuous hidden space, its characteristics are changed there, and the converted target voice is recovered by inverse mapping, the method has strong distortion resistance and does not damage other attributes, so the converted target voice is continuous and smooth; at the same time, resource consumption is low and no excessive computational overhead is introduced.

Description

Voice conversion processing method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a voice conversion processing method and device, electronic equipment and a storage medium.
Background
Voice conversion technology mainly refers to converting information about the sound source: the aim is for the converted speech to change one or more pronunciation characteristics of the source speech, based on a certain conversion rule, without changing its other characteristics. Typical voice conversions include accent conversion (speech conversion between different accents), speaker conversion (speech conversion between different speakers), and emotion conversion (speech conversion between different emotions). Voice conversion technology has wide application scenarios in the field of intelligent human-computer interaction.
The speech conversion technique can be divided into two steps of training and conversion: in the training stage, the system trains the source category voice and the target category voice to obtain a mapping rule between the source category voice and the target category voice and obtain a relation between spectrum parameters of the source category voice and the target category voice; in the conversion stage, the mapping rule obtained in the training stage is used for transforming the spectrum characteristics of the source type voice, so that the transformed voice has the characteristics of the target type voice.
The existing voice conversion methods include a conversion method based on codebook mapping, a conversion method based on a Gaussian mixture model, a conversion method based on personalized voice synthesis, and the like.
The conversion method based on codebook mapping firstly effectively reduces the feature quantity of source and target voices through a vector quantization method, and then converts the centroid vector closest to the source codebook into a corresponding target codebook through a clustering method, thereby realizing voice conversion. However, this method cannot consider the continuity of the context of the speech during quantization, which results in discontinuity of the feature space, and thus the conversion effect is not ideal.
The conversion method based on the Gaussian mixture model introduces a Gaussian mixture model to model the speech signal, replacing the "hard" clustering of vector quantization with probability-based "soft" clustering. However, this method estimates only the source feature vectors rather than joint feature vectors, still takes poor account of the contextual information of the speech, and is prone to over-fitting and over-smoothing problems.
The conversion method based on personalized speech synthesis synthesizes the speech with the target pronunciation characteristic by introducing an additional characterization vector for representing the target pronunciation characteristic into the vocoder, but the calculation amount is large and the resource consumption is high.
Disclosure of Invention
Because the existing methods have the above problems, embodiments of the present invention provide a method and an apparatus for processing speech conversion, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present invention provides a speech conversion processing method, including:
according to the space mapping capability of the flow model, mapping the original voice of the real space to a simple continuous hidden space to obtain hidden space voice;
determining a conversion direction of a target voice in the hidden space, and displacing the hidden space voice in the conversion direction to obtain a displaced voice;
and mapping the displaced voice back to the real space according to the inverse mapping of the flow model to obtain the target voice so as to realize voice conversion from the original voice to the target voice.
Optionally, before the mapping the original speech of the real space to the simple continuous hidden space according to the spatial mapping capability of the stream model to obtain the hidden space speech, the method further includes:
fusing voice samples with various characteristics together, and training to obtain the flow model when the probability of all the voice samples in an observation space is maximum and the voice samples accord with Gaussian distribution in a hidden space;
wherein the Gaussian distribution z is:
z=f(x)
where x ∼ P(x) and z ∼ N(0, 1); x is a voice sample, P(x) is the data distribution of the real space, f(x) is an invertible mapping, and N(0, 1) is the standard normal distribution.
Optionally, the mapping, according to the spatial mapping capability of the stream model, the original speech in the real space into a simple continuous hidden space to obtain a hidden space speech specifically includes:
respectively mapping an original voice A and a target voice B to a hidden space to obtain a corresponding hidden space voice:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, the determining a conversion direction of the target speech in the hidden space, and shifting the hidden space speech in the conversion direction to obtain a shifted speech specifically includes:
calculating a first center point of the original voice A from the z_Ai, and a second center point of the target voice B from the z_Bj;
determining the conversion direction Δz of the target voice according to the first center point and the second center point:
Δz = (1/N_B) Σ_j z_Bj − (1/N_A) Σ_i z_Ai
shifting the hidden-space voice z_p along the conversion direction Δz to obtain the displaced voice z′_p:
z′_p = z_p + λΔz
where λ is the step length, 0 < λ ≤ 1.
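As a hedged illustration, the shift z′_p = z_p + λΔz can be sketched in a few lines of NumPy; the latent arrays z_A and z_B below are synthetic stand-ins for codes produced by a trained flow's forward pass, not outputs of the patented model.

```python
import numpy as np

# Sketch of the hidden-space shift z'_p = z_p + lambda * dz.
# z_A, z_B are synthetic stand-ins for latent codes from a trained flow.
rng = np.random.default_rng(0)
z_A = rng.normal(loc=-1.0, size=(100, 8))   # latents with attribute A
z_B = rng.normal(loc=+1.0, size=(120, 8))   # latents with attribute B

# Conversion direction: centroid of B minus centroid of A
dz = z_B.mean(axis=0) - z_A.mean(axis=0)

lam = 1.0                                   # step length, 0 < lam <= 1
z_p = z_A[0]                                # one latent code to convert
z_shifted = z_p + lam * dz                  # now carries attribute B
```

With λ = 1 the latent moves by the full centroid offset; smaller λ gives a partial conversion.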
Optionally, the mapping the displaced speech back to the real space according to the inverse mapping of the stream model to obtain the target speech, so as to implement speech conversion between the original speech and the target speech, specifically including:
mapping the displaced voice z′_p to the target voice x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse of f.
In a second aspect, an embodiment of the present invention further provides a speech conversion processing apparatus, including:
the hidden space mapping module is used for mapping the original voice of the real space to a simple continuous hidden space according to the space mapping capacity of the stream model to obtain the hidden space voice;
the voice displacement module is used for determining the conversion direction of the target voice in the hidden space and displacing the hidden space voice in the conversion direction to obtain the displaced voice;
and the voice mapping module is used for mapping the displaced voice back to the real space according to the inverse mapping of the flow model to obtain the target voice so as to realize the voice conversion from the original voice to the target voice.
Optionally, the speech conversion processing apparatus further includes:
the model training module is used for fusing the voice samples with various characteristics together, and when the probability of all the voice samples in an observation space is maximum and the voice samples accord with Gaussian distribution in a hidden space, the flow model is obtained through training;
wherein the Gaussian distribution z is:
z=f(x)
where x ∼ P(x) and z ∼ N(0, 1); x is a voice sample, P(x) is the data distribution of the real space, f(x) is an invertible mapping, and N(0, 1) is the standard normal distribution.
Optionally, the hidden space mapping module is specifically configured to:
respectively mapping an original voice A and a target voice B to a hidden space to obtain a corresponding hidden space voice:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, the voice displacement module is specifically configured to:
according to zAiCalculating the first center point of the original speech A according to zBjCalculating a second central point of the original voice B;
determining the conversion direction delta z of the target voice according to the first central point and the second central point:
Figure BDA0002381644890000041
according to the conversion direction delta z and the hidden space voice zpCalculating to obtain voice z 'after displacement'p
z′p=zp+λΔz
Wherein, λ is step length, 0< λ ≦ 1.
Optionally, the voice mapping module specifically includes:
mapping the displaced voice z′_p to the target voice x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse of f.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, which when called by the processor are capable of performing the above-described methods.
In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium storing a computer program, which causes the computer to execute the above method.
According to the above technical solution, the original voice is mapped into a continuous hidden space, the voice characteristics are changed in the hidden space, and the converted target voice is obtained through inverse mapping; thus the distortion resistance is strong and other attributes are not damaged, so the converted target voice is more continuous and smooth. Meanwhile, resource consumption is low and no excessive computational overhead is introduced.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the following drawings show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart illustrating a voice conversion processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of speech conversion based on hidden space of flow model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a process for training a flow model according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a speech training and conversion process according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a speech conversion processing apparatus according to an embodiment of the present invention;
fig. 6 is a logic block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Fig. 1 shows a flowchart of a speech conversion processing method provided in this embodiment, which includes:
s101, according to the space mapping capacity of the flow model, mapping the original voice of the real space to a simple continuous hidden space to obtain the hidden space voice.
The flow model maps the real space into a normalized hidden space to improve the convenience of data processing.
The original voice is the voice to be converted.
The simple continuous hidden space is, for example, a vector space with a standard Gaussian distribution.
The hidden-space voice is the voice obtained by the mapping into the hidden space.
S102, determining the conversion direction of the target voice in the hidden space, and displacing the hidden space voice in the conversion direction to obtain the displaced voice.
The target voice is the final voice obtained by converting the original voice.
The conversion direction is the direction of voice conversion in the hidden space.
And the voice after the displacement is the voice obtained by performing displacement in the hidden space.
S103, mapping the displaced voice back to the real space according to the inverse mapping of the flow model to obtain the target voice so as to realize voice conversion from the original voice to the target voice.
Specifically, the stream model has strong data modeling capability, and the convenience of data processing can be greatly improved by mapping the real data space to a normalized hidden space.
The real space is sparse and discontinuous: the data lie on a complex high-dimensional manifold that is difficult to characterize, so it is hard to transform the data reasonably without damaging the manifold structure. The hidden space is different. There the data distribution is dense and continuous, e.g., Gaussian, so a point on the line connecting any two sample points has a relatively high p(z). In addition, because regions of lower probability density in the data space are compressed into smaller regions of the hidden space, a displacement in the hidden space has a higher probability of yielding an effective voice conversion. Furthermore, in the hidden space the manifold corresponding to a voice characteristic tends to be flat, so a basic characteristic of the voice can be changed by moving along a straight line. Based on these two properties, a simple displacement along the line connecting sample points realizes the conversion from one sample point to another. If one shifts from the center of a set of sample points with one attribute (e.g., speaker A) toward the center of the sample points with another attribute (e.g., speaker B), a conversion between the attributes (from speaker A to speaker B) is achieved.
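For a standard-normal hidden space, the claim that linear interpolation between two latent points stays in high-density territory can be checked numerically; the sketch below uses random vectors as stand-in latent codes and relies only on the triangle inequality, not on any property of the patented model.

```python
import numpy as np

# Under a standard-normal latent prior, density depends only on ||z||,
# and ||(z1 + z2) / 2|| <= max(||z1||, ||z2||) by the triangle inequality,
# so the midpoint is never less likely than the worse of the two endpoints.
rng = np.random.default_rng(1)
z1, z2 = rng.normal(size=16), rng.normal(size=16)

def log_density(z):
    return -0.5 * np.sum(z**2)  # log N(z; 0, I) up to an additive constant

mid = 0.5 * (z1 + z2)
print(log_density(mid) >= min(log_density(z1), log_density(z2)))  # True
```

The same bound holds for every point on the connecting segment, which is why displacement paths in the hidden space avoid meaningless low-density regions.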
As shown in fig. 2, the great circle Z is the entire hidden space, ellipse A represents the set of pronunciations of speaker A, and ellipse B represents the set of pronunciations of speaker B. When the data move along the line connecting the two centers, the conversion from speaker A to speaker B is realized. Conversions of emotion, accent, etc. can be realized in the same way.
In this embodiment, the original voice is mapped into a continuous hidden space, the voice characteristics are changed in the hidden space, and the converted target voice is obtained through inverse mapping; thus the distortion resistance is strong and other attributes are not damaged, so the converted target voice is more continuous and smooth. Meanwhile, resource consumption is low and no excessive computational overhead is introduced.
Further, on the basis of the above embodiment of the method, before S101, the method further includes:
fusing voice samples with various characteristics together, and training to obtain the flow model when the probability of all the voice samples in an observation space is maximum and the voice samples accord with Gaussian distribution in a hidden space;
wherein the Gaussian distribution z is:
z=f(x)
where x ∼ P(x) and z ∼ N(0, 1); x is a voice sample, P(x) is the data distribution of the real space, f(x) is an invertible mapping, and N(0, 1) is the standard normal distribution.
Specifically, the flow model is a generative model that has emerged in recent years. Like most generative models, its essential goal is to fit the distribution P(x) of the data space: if the distribution P(x) of the real data space were known, abundant realistic data could be obtained by sampling. However, real data are very complex, and directly estimating P(x) from the training data is clearly infeasible. The flow model therefore uses a simple normal variable z and an invertible mapping f(x) to fit the true data distribution.
The model may define a forward process and a reverse process. The forward process obtains a corresponding hidden variable z by sampling x in the observation space and transforming to the hidden space, as shown in fig. 3:
z = f(x), x ∼ P(x)
in the reverse process, an observation variable x is obtained by sampling z in a hidden space and mapping the sampling z back to an observation space:
x = f⁻¹(z), z ∼ N(0, 1)
where f⁻¹ is the inverse transformation of f (x and z have the same dimension), and can be implemented by various invertible neural networks.
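The forward/inverse symmetry can be illustrated with a toy invertible map. The elementwise affine flow below is a hypothetical single-layer stand-in (real flow models such as RealNVP or Glow stack many invertible layers), used only to show that f and f⁻¹ reconstruct each other exactly.

```python
import numpy as np

# Hypothetical single-layer invertible map f(x) = (x - mu) / sigma with
# exact inverse x = mu + sigma * z; a stand-in for a full flow model.
class AffineFlow:
    def __init__(self, mu, sigma):
        self.mu = np.asarray(mu, dtype=float)
        self.sigma = np.asarray(sigma, dtype=float)

    def forward(self, x):   # forward process: z = f(x)
        return (x - self.mu) / self.sigma

    def inverse(self, z):   # reverse process: x = f^{-1}(z)
        return self.mu + self.sigma * z

f = AffineFlow(mu=[1.0, -2.0], sigma=[0.5, 2.0])
x = np.array([2.0, 0.0])
z = f.forward(x)
x_rec = f.inverse(z)
print(np.allclose(x, x_rec))  # True: the mapping is exactly invertible
```

Exact invertibility is what lets the conversion return from the hidden space to the speech space without reconstruction loss.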
In training, it is desirable to maximize the probability of the model for all observed data, so the objective function is set as follows:
max_f Σ_i log P(x_i)
according to the knowledge of probability theory, the following steps are carried out:
P(x) = P_z(f(x)) · |det(∂f(x)/∂x)|
therefore, there are:
log P(x) = log P_z(f(x)) + log |det(∂f(x)/∂x)|
In the above formula, P_z is the distribution of the hidden variable z, which can be specified in advance. Thus, for given parameters of the mapping function f, the objective function is computable. During training, the parameters of f can be learned by various optimization algorithms (such as gradient descent), thereby optimizing the objective function.
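The change-of-variables objective can be verified on a toy affine map: for z = (x − μ)/σ the Jacobian determinant is Π(1/σ), and the resulting log P(x) must equal the closed-form Gaussian density N(x; μ, diag(σ²)). This is a sketch under that toy-map assumption, not the patent's trained model.

```python
import numpy as np

# log P(x) via change of variables for the toy map z = (x - mu) / sigma:
# log P(x) = log N(f(x); 0, I) + log|det df/dx|, with |det df/dx| = prod(1/sigma).
def log_prob_x(x, mu, sigma):
    z = (x - mu) / sigma
    log_pz = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi))  # standard-normal prior
    log_det = -np.sum(np.log(sigma))                    # Jacobian term
    return log_pz + log_det

# Sanity check against the closed-form density N(x; mu, diag(sigma^2))
mu, sigma = np.array([1.0, -2.0]), np.array([0.5, 2.0])
x = np.array([2.0, 0.0])
direct = -0.5 * np.sum(((x - mu) / sigma) ** 2 + np.log(2.0 * np.pi * sigma**2))
print(np.isclose(log_prob_x(x, mu, sigma), direct))  # True
```

Summing log_prob_x over a training set gives exactly the objective above, which gradient descent can then maximize with respect to the flow's parameters.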
Specifically, as shown in fig. 4, in this embodiment, real speech is used as an observation variable, and a strong space mapping capability of a stream model is utilized to map a real speech signal into a simple continuous hidden space through the stream model; then, finding the conversion direction of the target pronunciation characteristic in the hidden space, and shifting in the direction to realize the conversion from the original voice to the target voice in the hidden space; and finally, mapping the converted voice in the hidden space back to a real voice space by utilizing the inverse mapping of the stream model, thereby realizing the voice conversion between the source voice and the target voice.
Further, on the basis of the above method embodiment, S101 specifically includes:
respectively mapping an original voice A and a target voice B to a hidden space to obtain a corresponding hidden space voice:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, S102 specifically includes:
calculating a first center point of the original voice A from the z_Ai, and a second center point of the target voice B from the z_Bj;
determining the conversion direction Δz of the target voice according to the first center point and the second center point:
Δz = (1/N_B) Σ_j z_Bj − (1/N_A) Σ_i z_Ai
shifting the hidden-space voice z_p along the conversion direction Δz to obtain the displaced voice z′_p:
z′_p = z_p + λΔz
where λ is the step length, 0 < λ ≤ 1.
S103 specifically comprises the following steps:
mapping the displaced voice z′_p to the target voice x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse of f.
In the process of voice conversion processing, the following four stages are specifically included:
s1, training: all samples with various speech characteristics are fused together to train the stream model. The stream model obtained by training enables the probability of all sample points to be maximum in the observation space and accords with Gaussian distribution in the hidden space, namely:
z=f(x)
where x ∼ P(x), z ∼ N(0, 1).
S2, forward transformation stage: based on the flow model obtained in S1, the sample set with characteristic A and the sample set with characteristic B are mapped into the hidden space:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
s3, hidden space conversion stage: based on the hidden variables obtained in S2, the center point of the sample set having the characteristic a and the center point of the sample set having the characteristic B are calculated, respectively. The direction from the center point of the characteristic a to the center point of the characteristic B represents a characteristic conversion direction Δ z, that is:
Figure BDA0002381644890000101
For each sample point z_p to be converted that has characteristic A, a suitable step length λ is selected and the point is shifted along this direction to obtain a sample point with the target characteristic B, namely:
z′p=zp+λΔz
where 0 < λ ≤ 1.
S4, inverse transformation stage: based on the hidden-space sample point z′_p obtained by the S3 conversion, z′_p is inversely transformed back to the original voice space using the flow model obtained in S1, yielding a real voice data sample x′_p, namely:
x′p=f-1(z′p)
thus, the conversion from the pronunciation characteristic a to the pronunciation characteristic B is realized.
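Assuming a toy affine map in place of a trained flow model, the four stages S1–S4 above can be strung together as follows; μ and σ are hypothetical "trained" parameters and the sample sets are synthetic, so this only sketches the data flow of the pipeline.

```python
import numpy as np

# Four-stage sketch with a toy invertible map standing in for a trained flow.
# mu, sigma are hypothetical "trained" parameters (S1); the data are synthetic.
mu, sigma = np.array([5.0, -3.0]), np.array([2.0, 0.5])
fwd = lambda x: (x - mu) / sigma           # forward transform z = f(x)
inv = lambda z: mu + sigma * z             # inverse transform x = f^-1(z)

rng = np.random.default_rng(2)
x_A = rng.normal(loc=[4.0, -3.5], size=(50, 2))   # samples with property A
x_B = rng.normal(loc=[6.0, -2.5], size=(50, 2))   # samples with property B

z_A, z_B = fwd(x_A), fwd(x_B)              # S2: map both sets to hidden space
dz = z_B.mean(axis=0) - z_A.mean(axis=0)   # S3: conversion direction
z_shift = fwd(x_A[0]) + 1.0 * dz           # S3: shift with lambda = 1
x_converted = inv(z_shift)                 # S4: back to the speech space
```

In a real system the lambdas fwd/inv would be the forward and inverse passes of the trained flow network, and x would be spectral features of speech rather than 2-D points.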
The speech conversion processing method provided by the embodiment can be applied to the following specific scenes:
and (3) switching the accents: the conversion of the same speaker between different accents is realized. For example, a speech conversion from a northeast dialect to mandarin chinese can be achieved by mixing mandarin and northeast dialect together with the training stream model according to the above conversion method.
Speaker conversion: the voice signal of one speaker (source speaker) is converted, and the modified voice signal sounds like the other speaker (target speaker) to speak under the premise of keeping the expressed semantic information.
Emotion conversion: realizing voice conversion of a speaker under different emotions. For example, by mixing voices with different emotions together to train the flow model, the voice conversion of a speaker from positive emotion (happy, excited) to negative emotion (sad) can be realized according to the above conversion method.
Compared with typical voice conversion models, the flow-model-based conversion method converts only along the direction related to a given attribute without damaging other attributes, so the converted speech is more continuous and smooth. Moreover, the conversion is performed in a hidden space with a continuous Gaussian distribution, so meaningless data points do not appear on the conversion path and distortion is prevented; the method's distortion resistance is therefore strong. In addition, the flow model has a simple structure, involves no excessive computation, and has low overhead and resource consumption; the system software structure is also simple, since the conversion system depends on only one flow model.
Fig. 5 is a schematic structural diagram of a speech conversion processing apparatus provided in this embodiment, where the apparatus includes: a hidden space mapping module 501, a voice displacement module 502 and a voice mapping module 503, wherein:
the hidden space mapping module 501 is configured to map an original speech of a real space into a simple continuous hidden space according to a space mapping capability of a stream model to obtain a hidden space speech;
the voice displacement module 502 is configured to determine a conversion direction of a target voice in the hidden space, and displace the hidden space voice in the conversion direction to obtain a displaced voice;
the voice mapping module 503 is configured to map the displaced voice back to the real space according to inverse mapping of a stream model to obtain the target voice, so as to implement voice conversion from the original voice to the target voice.
Specifically, the hidden space mapping module 501 maps the original speech of the real space into a simple continuous hidden space according to the space mapping capability of the stream model, so as to obtain a hidden space speech; the voice displacement module 502 determines a conversion direction of a target voice in the hidden space, and displaces the hidden space voice in the conversion direction to obtain a displaced voice; the voice mapping module 503 maps the shifted voice back to the real space according to the inverse mapping of the stream model to obtain the target voice, so as to implement voice conversion from the original voice to the target voice.
In this embodiment, the original voice is mapped into a continuous hidden space, the voice characteristics are changed in the hidden space, and the converted target voice is obtained through inverse mapping; thus the distortion resistance is strong and other attributes are not damaged, so the converted target voice is more continuous and smooth. Meanwhile, resource consumption is low and no excessive computational overhead is introduced.
Further, on the basis of the above device embodiment, the voice conversion processing device further includes:
the model training module is used for fusing the voice samples with various characteristics together, and when the probability of all the voice samples in an observation space is maximum and the voice samples accord with Gaussian distribution in a hidden space, the flow model is obtained through training;
wherein the Gaussian distribution z is:
z=f(x)
where x ∼ P(x) and z ∼ N(0, 1); x is a voice sample, P(x) is the data distribution of the real space, f(x) is an invertible mapping, and N(0, 1) is the standard normal distribution.
Further, on the basis of the above apparatus embodiment, the hidden space mapping module 501 is specifically configured to:
respectively mapping an original voice A and a target voice B to a hidden space to obtain a corresponding hidden space voice:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, the voice displacement module is specifically configured to:
calculating a first center point of the original voice A from the z_Ai, and a second center point of the target voice B from the z_Bj;
determining the conversion direction Δz of the target voice according to the first center point and the second center point:
Δz = (1/N_B) Σ_j z_Bj − (1/N_A) Σ_i z_Ai
shifting the hidden-space voice z_p along the conversion direction Δz to obtain the displaced voice z′_p:
z′_p = z_p + λΔz
where λ is the step length, 0 < λ ≤ 1.
Further, on the basis of the above device embodiment, the voice mapping module 503 is specifically configured to:
map the displaced voice z′_p to the target voice x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse of f.
The speech conversion processing apparatus described in this embodiment may be used to implement the above method embodiments, and the principle and technical effect are similar, which are not described herein again.
Referring to fig. 6, the electronic device includes: a processor (processor)601, a memory (memory)602, and a bus 603;
the processor 601 and the memory 602 communicate with each other through the bus 603;
the processor 601 is used for calling the program instructions in the memory 602 to execute the methods provided by the above-mentioned method embodiments.
The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above-described method embodiments.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the method embodiments described above.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
It should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of speech conversion processing, comprising:
mapping an original speech in a real space into a simple continuous hidden space according to the spatial mapping capability of a flow model, to obtain a hidden-space speech;
determining a conversion direction of a target speech in the hidden space, and displacing the hidden-space speech in the conversion direction to obtain a displaced speech;
and mapping the displaced speech back to the real space according to an inverse mapping of the flow model to obtain the target speech, so as to implement speech conversion from the original speech to the target speech.
2. The method of claim 1, wherein before mapping an original speech in the real space into a simple continuous hidden space according to the spatial mapping capability of the flow model to obtain a hidden-space speech, the method further comprises:
fusing speech samples with various characteristics together, and training to obtain the flow model when the probability of all the speech samples in an observation space is maximized and the speech samples conform to a Gaussian distribution in the hidden space;
wherein the Gaussian-distributed hidden variable z is:
z = f(x)
x ~ P(x), z ~ N(0, 1)
where x is a speech sample, P(x) is the data distribution of the real space, f is the invertible mapping, and N(0, 1) is the standard normal distribution.
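The training criterion of claim 2 can be illustrated with a toy example. The sketch below (not the patent's actual network) uses a scalar affine flow f(x) = (x − mu) / sigma and the change-of-variables likelihood log p(x) = log N(f(x); 0, 1) + log |df/dx|; for this flow the maximum-likelihood parameters have the closed form mu = mean(x), sigma = std(x), and the mapped samples then follow the standard normal distribution. All names here are illustrative.

```python
import numpy as np

# Log-likelihood of the data under the affine flow f(x) = (x - mu) / sigma,
# by the change-of-variables formula for normalizing flows.
def log_likelihood(x, mu, sigma):
    z = (x - mu) / sigma                         # map into the hidden space
    log_pz = -0.5 * (z**2 + np.log(2 * np.pi))   # standard-normal log density
    log_det = -np.log(sigma)                     # log |df/dx| = -log sigma
    return np.mean(log_pz + log_det)

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=10_000)            # "speech samples" in real space

# Closed-form maximum-likelihood solution for this affine flow:
mu_hat, sigma_hat = x.mean(), x.std()
z = (x - mu_hat) / sigma_hat                     # hidden-space samples ~ N(0, 1)
print(abs(z.mean()) < 0.01, abs(z.std() - 1.0) < 0.01)
```

With the trained parameters, the hidden-space samples are standardized, which is exactly the "conform to a Gaussian distribution in the hidden space" condition of the claim.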
3. The speech conversion processing method according to claim 1, wherein the mapping an original speech in the real space into a simple continuous hidden space according to the spatial mapping capability of the flow model to obtain a hidden-space speech specifically comprises:
mapping an original speech A and a target speech B respectively into the hidden space to obtain the corresponding hidden-space speech:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, the determining a conversion direction of the target speech in the hidden space, and displacing the hidden-space speech in the conversion direction to obtain a displaced speech specifically comprises:
calculating a first center point of the original speech A according to z_Ai, and calculating a second center point of the target speech B according to z_Bj;
determining the conversion direction Δz of the target speech according to the first center point and the second center point:
Δz = (1/N_B) Σ_j z_Bj − (1/N_A) Σ_i z_Ai
calculating the displaced speech z′_p according to the conversion direction Δz and the hidden-space speech z_p:
z′_p = z_p + λΔz
where λ is the step size and 0 < λ ≤ 1.
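The centroid-difference direction and the shift of claim 3 can be sketched as follows. This is an illustrative example, assuming hidden vectors for speakers A and B are already available as arrays; with step size λ = 1, the shifted centroid of A lands exactly on the centroid of B.

```python
import numpy as np

# Conversion direction: second centroid (target B) minus first centroid (original A).
def conversion_direction(z_A, z_B):
    return z_B.mean(axis=0) - z_A.mean(axis=0)

# Displace a hidden-space vector along the conversion direction, 0 < lambda <= 1.
def shift(z_p, delta_z, lam=1.0):
    assert 0.0 < lam <= 1.0, "step size must satisfy 0 < lambda <= 1"
    return z_p + lam * delta_z

rng = np.random.default_rng(1)
z_A = rng.normal(-1.0, 0.3, size=(200, 8))   # hidden vectors of speech A
z_B = rng.normal(+1.0, 0.3, size=(200, 8))   # hidden vectors of speech B

dz = conversion_direction(z_A, z_B)
z_shifted = shift(z_A, dz, lam=1.0)
# With lambda = 1 the shifted centroid of A coincides with the centroid of B.
print(np.allclose(z_shifted.mean(axis=0), z_B.mean(axis=0)))
```

Intermediate values of λ interpolate between the two speakers, which is why the claim bounds the step size by 0 < λ ≤ 1.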
4. The method according to claim 1, wherein the mapping the displaced speech back to the real space according to the inverse mapping of the flow model to obtain the target speech, so as to implement the speech conversion from the original speech to the target speech, specifically comprises:
mapping the displaced speech z′_p to the target speech x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse mapping of f.
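Putting claims 1 to 4 together, the whole pipeline is: map into the hidden space with f, shift along the centroid difference, then map back with f⁻¹. The end-to-end sketch below uses a toy invertible map (per-dimension standardization) standing in for the trained flow model; all variable names are illustrative, not from the patent.

```python
import numpy as np

rng = np.random.default_rng(2)
x_A = rng.normal(size=(300, 4)) * 2.0 + 5.0   # "original speech" features
x_B = rng.normal(size=(300, 4)) * 2.0 - 5.0   # "target speech" features

# Toy invertible map f and its inverse f_inv (stand-in for the flow model).
mu = np.r_[x_A, x_B].mean(axis=0)
sigma = np.r_[x_A, x_B].std(axis=0)
f = lambda x: (x - mu) / sigma                # real space -> hidden space
f_inv = lambda z: z * sigma + mu              # inverse mapping back to real space

z_A, z_B = f(x_A), f(x_B)                     # step 1: map into the hidden space
dz = z_B.mean(axis=0) - z_A.mean(axis=0)      # step 2: conversion direction
z_conv = z_A + 1.0 * dz                       # displace with lambda = 1
x_conv = f_inv(z_conv)                        # step 3: inverse-map to real space

# The converted speech's centroid matches the target speaker's centroid.
print(np.allclose(x_conv.mean(axis=0), x_B.mean(axis=0)))
```

Because f and f⁻¹ here are linear, the centroid shift in the hidden space translates directly into a centroid shift in the real space; a trained nonlinear flow would realize the same three steps with a learned f.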
5. A speech conversion processing apparatus, comprising:
the hidden space mapping module is used for mapping an original speech in a real space into a simple continuous hidden space according to the spatial mapping capability of a flow model, to obtain a hidden-space speech;
the speech displacement module is used for determining a conversion direction of a target speech in the hidden space, and displacing the hidden-space speech in the conversion direction to obtain a displaced speech;
and the speech mapping module is used for mapping the displaced speech back to the real space according to an inverse mapping of the flow model to obtain the target speech, so as to implement the speech conversion from the original speech to the target speech.
6. The speech conversion processing apparatus according to claim 5, further comprising:
the model training module is used for fusing speech samples with various characteristics together, and training to obtain the flow model when the probability of all the speech samples in an observation space is maximized and the speech samples conform to a Gaussian distribution in the hidden space;
wherein the Gaussian-distributed hidden variable z is:
z = f(x)
x ~ P(x), z ~ N(0, 1)
where x is a speech sample, P(x) is the data distribution of the real space, f is the invertible mapping, and N(0, 1) is the standard normal distribution.
7. The apparatus according to claim 5, wherein the hidden space mapping module is specifically configured to:
map an original speech A and a target speech B respectively into the hidden space to obtain the corresponding hidden-space speech:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, the speech displacement module is specifically configured to:
calculate a first center point of the original speech A according to z_Ai, and calculate a second center point of the target speech B according to z_Bj;
determine the conversion direction Δz of the target speech according to the first center point and the second center point:
Δz = (1/N_B) Σ_j z_Bj − (1/N_A) Σ_i z_Ai
calculate the displaced speech z′_p according to the conversion direction Δz and the hidden-space speech z_p:
z′_p = z_p + λΔz
where λ is the step size and 0 < λ ≤ 1.
8. The speech conversion processing apparatus according to claim 5, wherein the speech mapping module is specifically configured to:
map the displaced speech z′_p to the target speech x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse mapping of f.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the speech conversion processing method according to any one of claims 1 to 4 when executing the program.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the speech conversion processing method according to any one of claims 1 to 4.
CN202010084699.XA 2020-02-10 2020-02-10 Voice conversion processing method and device, electronic equipment and storage medium Pending CN111292718A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010084699.XA CN111292718A (en) 2020-02-10 2020-02-10 Voice conversion processing method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN111292718A true CN111292718A (en) 2020-06-16

Family

ID=71023483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010084699.XA Pending CN111292718A (en) 2020-02-10 2020-02-10 Voice conversion processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111292718A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020072908A1 (en) * 2000-10-19 2002-06-13 Case Eliot M. System and method for converting text-to-voice
CN1835074A (en) * 2006-04-07 2006-09-20 安徽中科大讯飞信息科技有限公司 Speaking person conversion method combined high layer discription information and model self adaption
US20180012613A1 (en) * 2016-07-11 2018-01-11 The Chinese University Of Hong Kong Phonetic posteriorgrams for many-to-one voice conversion
CN108198566A (en) * 2018-01-24 2018-06-22 咪咕文化科技有限公司 Information processing method and device, electronic equipment and storage medium
CN108885608A (en) * 2016-06-09 2018-11-23 苹果公司 Intelligent automation assistant in home environment
US20190355372A1 (en) * 2018-05-17 2019-11-21 Spotify Ab Automated voiceover mixing and components therefor


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAORAN SUN ET AL: "On Investigation Of Unsupervised Speech Factorization Based On Normalization Flow", arXiv *
CHE YINGXIA ET AL: "Structured Gaussian mixture model under constraints and voice conversion with non-parallel corpora", Acta Electronica Sinica *
MA ZHEN ET AL: "Research on voice conversion based on separation of personal feature information in speech", Journal of Signal Processing *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114255737A (en) * 2022-02-28 2022-03-29 北京世纪好未来教育科技有限公司 Voice generation method and device and electronic equipment
CN114255737B (en) * 2022-02-28 2022-05-17 北京世纪好未来教育科技有限公司 Voice generation method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN111048064B (en) Voice cloning method and device based on single speaker voice synthesis data set
CN106297773A (en) A kind of neutral net acoustic training model method
CN111581470B (en) Multi-mode fusion learning analysis method and system for scene matching of dialogue system
CN114895817B (en) Interactive information processing method, network model training method and device
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
WO2021212954A1 (en) Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources
CN112860871B (en) Natural language understanding model training method, natural language understanding method and device
CN113327580A (en) Speech synthesis method, device, readable medium and electronic equipment
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN111292718A (en) Voice conversion processing method and device, electronic equipment and storage medium
CN116798405B (en) Speech synthesis method, device, storage medium and electronic equipment
CN117725936A (en) Long dialogue emotion dynamic identification method and system based on hypergraph network
Zoric et al. A real-time lip sync system using a genetic algorithm for automatic neural network configuration
Zhang et al. Promptspeaker: Speaker Generation Based on Text Descriptions
CN115798456A (en) Cross-language emotion voice synthesis method and device and computer equipment
Zorić et al. Real-time language independent lip synchronization method using a genetic algorithm
CN113611283A (en) Voice synthesis method and device, electronic equipment and storage medium
CN115148182A (en) Speech synthesis method and device
CN116863909B (en) Speech synthesis method, device and system based on factor graph
CN116825090B (en) Training method and device for speech synthesis model and speech synthesis method and device
Zhang et al. Review of the Lip-reading Recognition
CN114582314B (en) Man-machine audio-video interaction logic model design method based on ASR
CN114093389B (en) Speech emotion recognition method and device, electronic equipment and computer readable medium
CN115440198B (en) Method, apparatus, computer device and storage medium for converting mixed audio signal
WO2024008215A2 (en) Speech emotion recognition method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200616