CN111292718A - Voice conversion processing method and device, electronic equipment and storage medium - Google Patents
- Publication number: CN111292718A
- Application number: CN202010084699.XA
- Authority
- CN
- China
- Prior art keywords
- voice
- mapping
- space
- speech
- conversion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
Abstract
The embodiment of the invention discloses a voice conversion processing method and device, an electronic device, and a storage medium, wherein the method comprises the following steps: mapping original speech from the real space into a simple, continuous hidden space according to the spatial mapping capability of a flow model, to obtain hidden-space speech; determining the conversion direction of the target speech in the hidden space, and displacing the hidden-space speech along that direction to obtain displaced speech; and mapping the displaced speech back to the real space according to the inverse mapping of the flow model, to obtain the target speech and thereby realize voice conversion from the original speech to the target speech. Because the original speech is mapped into a continuous hidden space, the speech characteristics are changed in the hidden space, and the converted target speech is obtained through inverse mapping, the method resists distortion well and does not damage other attributes, so the converted target speech is continuous and smooth; at the same time, resource consumption is low and no excessive computational overhead is introduced.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a voice conversion processing method and device, electronic equipment and a storage medium.
Background
Voice conversion technology mainly refers to converting information about the sound source: based on a certain conversion rule, the converted speech should change one or more pronunciation characteristics of the source speech without changing its other characteristics. Typical speech conversions include accent conversion (speech conversion between different accents), speaker conversion (between different speakers), and mood conversion (between different moods). Voice conversion technology has wide application scenarios in the field of intelligent human-computer interaction.
Speech conversion can be divided into two steps, training and conversion. In the training stage, the system trains on source-category speech and target-category speech to obtain a mapping rule between them, i.e., the relation between their spectral parameters. In the conversion stage, the mapping rule obtained in training is used to transform the spectral features of the source-category speech, so that the transformed speech takes on the characteristics of the target-category speech.
The existing voice conversion methods include a conversion method based on codebook mapping, a conversion method based on a Gaussian mixture model, a conversion method based on personalized voice synthesis, and the like.
The conversion method based on codebook mapping first reduces the number of features of the source and target voices through vector quantization, then converts each centroid vector closest to the source codebook into the corresponding target codebook entry through clustering, thereby realizing voice conversion. However, quantization cannot take the contextual continuity of speech into account, which makes the feature space discontinuous, so the conversion effect is not ideal.
The conversion method based on the Gaussian mixture model introduces a Gaussian mixture model to model the speech signal, replacing the "hard" clustering of vector quantization with probability-based "soft" clustering. However, this method estimates only the source feature vectors rather than joint feature vectors, still takes poor account of the contextual information of speech, and is prone to overfitting and over-smoothing.
The conversion method based on personalized speech synthesis synthesizes the speech with the target pronunciation characteristic by introducing an additional characterization vector for representing the target pronunciation characteristic into the vocoder, but the calculation amount is large and the resource consumption is high.
Disclosure of Invention
Because the existing methods have the above problems, embodiments of the present invention provide a method and an apparatus for processing speech conversion, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present invention provides a speech conversion processing method, including:
according to the space mapping capability of the flow model, mapping the original voice of the real space to a simple continuous hidden space to obtain hidden space voice;
determining a conversion direction of a target voice in the hidden space, and displacing the hidden space voice in the conversion direction to obtain a displaced voice;
and mapping the displaced voice back to the real space according to the inverse mapping of the flow model to obtain the target voice so as to realize voice conversion from the original voice to the target voice.
Optionally, before the mapping the original speech of the real space to the simple continuous hidden space according to the spatial mapping capability of the stream model to obtain the hidden space speech, the method further includes:
fusing voice samples with various characteristics together, and training to obtain the flow model when the probability of all the voice samples in an observation space is maximum and the voice samples accord with Gaussian distribution in a hidden space;
wherein the Gaussian distribution z is:
z = f(x), x ~ P(x), z ~ N(0, 1)
where x is a voice sample, P(x) is the data distribution of the real space, f is the invertible mapping, and N(0, 1) is the standard normal distribution.
Optionally, the mapping, according to the spatial mapping capability of the stream model, the original speech in the real space into a simple continuous hidden space to obtain a hidden space speech specifically includes:
respectively mapping an original voice A and a target voice B to a hidden space to obtain a corresponding hidden space voice:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, the determining a conversion direction of the target speech in the hidden space, and shifting the hidden space speech in the conversion direction to obtain a shifted speech specifically includes:
calculating a first center point of the original voice A from the z_Ai, and a second center point of the target voice B from the z_Bj;
determining the conversion direction Δz of the target voice from the first center point to the second center point;
calculating the displaced voice z′_p from the conversion direction Δz and the hidden-space voice z_p:
z′_p = z_p + λΔz
where λ is the step length, 0 < λ ≤ 1.
Optionally, the mapping the displaced speech back to the real space according to the inverse mapping of the stream model to obtain the target speech, so as to implement speech conversion between the original speech and the target speech, specifically including:
mapping the displaced voice z′_p to the target voice x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse of f.
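The displacement and inverse-mapping steps above can be sketched in a few lines of numpy. The mapping f below is a fixed invertible affine transform standing in for the learned flow model, and the conversion direction and step length are illustrative values, not quantities from the patent:

```python
import numpy as np

# Stand-in invertible mapping f(x) = W x + b. The patent's f is a learned
# flow model; this fixed affine map only illustrates the mechanics.
W = np.array([[2.0, 0.5],
              [0.0, 1.5]])          # invertible: det(W) = 3 != 0
b = np.array([0.1, -0.2])

def f(x):                           # forward mapping: real space -> hidden space
    return W @ x + b

def f_inv(z):                       # inverse mapping: hidden space -> real space
    return np.linalg.solve(W, z - b)

x_orig = np.array([1.0, 2.0])       # "original voice" feature vector
z_p = f(x_orig)                     # hidden-space voice z_p
delta_z = np.array([0.3, -0.1])     # conversion direction (illustrative)
lam = 0.8                           # step length lambda, 0 < lambda <= 1
z_p_shifted = z_p + lam * delta_z   # z'_p = z_p + lambda * dz
x_p_target = f_inv(z_p_shifted)     # x'_p = f^{-1}(z'_p)

# With no displacement, the round trip recovers the original point exactly.
assert np.allclose(f_inv(f(x_orig)), x_orig)
```

Because f is invertible, the only difference between x_orig and x_p_target is the shift applied in the hidden space.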
In a second aspect, an embodiment of the present invention further provides a speech conversion processing apparatus, including:
the hidden space mapping module is used for mapping the original voice of the real space to a simple continuous hidden space according to the space mapping capacity of the stream model to obtain the hidden space voice;
the voice displacement module is used for determining the conversion direction of the target voice in the hidden space and displacing the hidden space voice in the conversion direction to obtain the displaced voice;
and the voice mapping module is used for mapping the displaced voice back to the real space according to the inverse mapping of the flow model to obtain the target voice so as to realize the voice conversion from the original voice to the target voice.
Optionally, the speech conversion processing apparatus further includes:
the model training module is used for fusing the voice samples with various characteristics together, and when the probability of all the voice samples in an observation space is maximum and the voice samples accord with Gaussian distribution in a hidden space, the flow model is obtained through training;
wherein the Gaussian distribution z is:
z = f(x), x ~ P(x), z ~ N(0, 1)
where x is a voice sample, P(x) is the data distribution of the real space, f is the invertible mapping, and N(0, 1) is the standard normal distribution.
Optionally, the hidden space mapping module is specifically configured to:
respectively mapping an original voice A and a target voice B to a hidden space to obtain a corresponding hidden space voice:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, the voice displacement module is specifically configured to:
calculating a first center point of the original voice A from the z_Ai, and a second center point of the target voice B from the z_Bj;
determining the conversion direction Δz of the target voice from the first center point to the second center point;
calculating the displaced voice z′_p from the conversion direction Δz and the hidden-space voice z_p:
z′_p = z_p + λΔz
where λ is the step length, 0 < λ ≤ 1.
Optionally, the voice mapping module specifically includes:
mapping the displaced voice z′_p to the target voice x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse of f.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, which when called by the processor are capable of performing the above-described methods.
In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium storing a computer program, which causes the computer to execute the above method.
According to the technical scheme, the original voice is mapped into a continuous hidden space, the voice characteristics are changed in the hidden space, and the converted target voice is obtained through inverse mapping; the method therefore resists distortion well and does not damage other attributes, so the converted target voice is continuous and smooth, while resource consumption is low and no excessive computational overhead is introduced.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart illustrating a voice conversion processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of speech conversion based on hidden space of flow model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a process for training a flow model according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a speech training and conversion process according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a speech conversion processing apparatus according to an embodiment of the present invention;
fig. 6 is a logic block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Fig. 1 shows a flowchart of a speech conversion processing method provided in this embodiment, which includes:
s101, according to the space mapping capacity of the flow model, mapping the original voice of the real space to a simple continuous hidden space to obtain the hidden space voice.
The flow model maps the real space into a normalized hidden space to improve the convenience of data processing.
The original speech is speech to be converted.
The simple continuous hidden space is, for example, a vector space following a standard Gaussian distribution.
The hidden space speech is speech obtained by mapping in a hidden space.
S102, determining the conversion direction of the target voice in the hidden space, and displacing the hidden space voice in the conversion direction to obtain the displaced voice.
The target voice is the final voice obtained by converting the original voice.
The conversion direction is the direction of voice conversion in the hidden space.
And the voice after the displacement is the voice obtained by performing displacement in the hidden space.
S103, mapping the displaced voice back to the real space according to the inverse mapping of the flow model to obtain the target voice so as to realize voice conversion from the original voice to the target voice.
Specifically, the stream model has strong data modeling capability, and the convenience of data processing can be greatly improved by mapping the real data space to a normalized hidden space.
The real space is sparse and discontinuous: the data lie on a complex high-dimensional manifold that is difficult to discover, so it is hard to transform the data reasonably without damaging that manifold. The hidden space is different. In the hidden space, the data distribution is dense and continuous, e.g., Gaussian. Thus, in the hidden space, a point on the line connecting any two sample points has a relatively high probability density p(z). In addition, because regions of low probability density in the data space are compressed into smaller regions of the hidden space, a displacement in the hidden space is more likely to yield a valid speech conversion. Moreover, in the hidden space the manifold corresponding to a speech characteristic tends to be flat, so a basic characteristic of the speech can be changed by moving along a straight line. Based on these two properties, a simple displacement along the line connecting sample points realizes the conversion from one sample point to another. If one shifts from the center of a set of sample points with one attribute (e.g., speaker A) toward the center of a set with another attribute (e.g., speaker B), a transition between the attributes (from speaker A to speaker B) is achieved.
As shown in fig. 2, the large circle Z is the entire hidden space, the ellipse A represents the set of various pronunciations of speaker A, and the ellipse B represents the set of various pronunciations of speaker B. When the data move along the line connecting the two class centers, the conversion from speaker A to speaker B is realized. Conversions of emotion, accent, etc. can be realized in the same way.
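The center-to-center movement described for fig. 2 can be illustrated with synthetic hidden-space points. The two Gaussian clusters below are hypothetical stand-ins for the hidden-space encodings of speakers A and B:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical hidden-space encodings of utterances by speakers A and B
# (in the patent these would come from the trained flow model f).
z_A = rng.normal(loc=[-1.0, 0.0], scale=0.3, size=(100, 2))
z_B = rng.normal(loc=[1.0, 0.5], scale=0.3, size=(100, 2))

center_A = z_A.mean(axis=0)     # center of speaker A's pronunciation set
center_B = z_B.mean(axis=0)     # center of speaker B's pronunciation set
delta_z = center_B - center_A   # direction of the line connecting the centers

# Moving one of A's points along delta_z lands it near the B cluster.
z_moved = z_A[0] + delta_z
assert np.linalg.norm(z_moved - center_B) < np.linalg.norm(z_A[0] - center_B)
```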
In the embodiment, the original voice is mapped into a continuous hidden space, the voice characteristics are changed in the hidden space, and the converted target voice is obtained through inverse mapping, so that the distortion resistance is strong, and other attributes are not damaged, so that the converted target voice is more continuous and smooth; meanwhile, the resource consumption is low, and excessive calculation overhead can not be brought.
Further, on the basis of the above embodiment of the method, before S101, the method further includes:
fusing voice samples with various characteristics together, and training to obtain the flow model when the probability of all the voice samples in an observation space is maximum and the voice samples accord with Gaussian distribution in a hidden space;
wherein the Gaussian distribution z is:
z = f(x), x ~ P(x), z ~ N(0, 1)
where x is a voice sample, P(x) is the data distribution of the real space, f is the invertible mapping, and N(0, 1) is the standard normal distribution.
Specifically, the flow model is a generative model that has emerged in recent years. Like most generative models, its essential goal is to fit the distribution P(x) of the data space: if the distribution of the real data space were known, abundant realistic data could be obtained by sampling from it. However, real data are very complex in composition, and directly estimating P(x) from the training data is clearly infeasible. The flow model therefore uses a simple normal distribution z together with an invertible mapping f(x) to fit the true data distribution.
The model may define a forward process and a reverse process. The forward process obtains a corresponding hidden variable z by sampling x in the observation space and transforming to the hidden space, as shown in fig. 3:
z = f(x), x ~ P(x)
in the reverse process, an observation variable x is obtained by sampling z in a hidden space and mapping the sampling z back to an observation space:
x = f⁻¹(z), z ~ N(0, 1)
where f⁻¹ is the inverse transformation of f (x and z have the same dimension), and can be implemented by various invertible neural networks.
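One common way to build such a reversible network, shown here only as an assumed example, is a NICE-style additive coupling layer: half of the vector is shifted by an arbitrary function of the other half, which makes the inverse exact by construction:

```python
import numpy as np

def m(h):
    # Arbitrary "network" acting on the unchanged half; any function works
    # because the coupling is inverted by subtracting the same quantity.
    return 2.0 * np.tanh(h)

def forward(x):                 # z = f(x)
    x1, x2 = x[:2], x[2:]
    return np.concatenate([x1, x2 + m(x1)])

def inverse(z):                 # x = f^{-1}(z), exact by construction
    z1, z2 = z[:2], z[2:]
    return np.concatenate([z1, z2 - m(z1)])

x = np.array([0.5, -1.2, 2.0, 0.3])
assert np.allclose(inverse(forward(x)), x)   # x and z have the same dimension
```

An additive coupling has a unit Jacobian determinant, so its log-determinant contribution to the training objective is zero; scaling couplings are used when a non-trivial determinant is needed.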
In training, it is desirable to maximize the probability the model assigns to all observed data, so the objective function is set as:
max_f Σ_i log P(x_i)
According to the change-of-variables formula of probability theory:
P(x) = P_z(f(x)) · |det(∂f(x)/∂x)|
Therefore:
log P(x) = log P_z(f(x)) + log|det(∂f(x)/∂x)|
In the above formula, P_z, the distribution of the hidden variable z, may be specified in advance. Thus, for given parameters of the mapping function f, the objective function is computable. During training, the parameters of f can be learned through various optimization algorithms (such as gradient descent), thereby optimizing the objective function.
Specifically, as shown in fig. 4, in this embodiment, real speech is used as an observation variable, and a strong space mapping capability of a stream model is utilized to map a real speech signal into a simple continuous hidden space through the stream model; then, finding the conversion direction of the target pronunciation characteristic in the hidden space, and shifting in the direction to realize the conversion from the original voice to the target voice in the hidden space; and finally, mapping the converted voice in the hidden space back to a real voice space by utilizing the inverse mapping of the stream model, thereby realizing the voice conversion between the source voice and the target voice.
Further, on the basis of the above method embodiment, S101 specifically includes:
respectively mapping an original voice A and a target voice B to a hidden space to obtain a corresponding hidden space voice:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, S102 specifically includes:
calculating a first center point of the original voice A from the z_Ai, and a second center point of the target voice B from the z_Bj;
determining the conversion direction Δz of the target voice from the first center point to the second center point;
calculating the displaced voice z′_p from the conversion direction Δz and the hidden-space voice z_p:
z′_p = z_p + λΔz
where λ is the step length, 0 < λ ≤ 1.
S103 specifically comprises the following steps:
the displaced voice z′_p is mapped to the target voice x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse of f.
In the process of voice conversion processing, the following four stages are specifically included:
s1, training: all samples with various speech characteristics are fused together to train the stream model. The stream model obtained by training enables the probability of all sample points to be maximum in the observation space and accords with Gaussian distribution in the hidden space, namely:
z=f(x)
where x ~ P(x) and z ~ N(0, 1).
S2, forward transformation stage: based on the flow model obtained in step 1, mapping the sample set with the characteristic A and the sample set with the characteristic B into a hidden space:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
S3, hidden-space conversion stage: based on the hidden variables obtained in S2, the center point of the sample set with characteristic A and the center point of the sample set with characteristic B are calculated respectively. The direction from the center of characteristic A to the center of characteristic B is the characteristic conversion direction Δz:
Δz = (1/N_B) Σ_j z_Bj − (1/N_A) Σ_i z_Ai
For each sample point z_p with characteristic A to be converted, a suitable step length λ is selected and the point is shifted along this direction to obtain a sample point with the target characteristic B:
z′_p = z_p + λΔz
where 0 < λ ≤ 1.
S4, inverse transformation stage: the hidden-space sample point z′_p obtained in S3 is transformed back to the original voice space using the flow model trained in S1, yielding a real voice data sample x′_p:
x′_p = f⁻¹(z′_p)
thus, the conversion from the pronunciation characteristic a to the pronunciation characteristic B is realized.
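The four stages S1–S4 can be strung together in one short sketch. A fixed invertible affine map again stands in for the trained flow model, and the two clusters are hypothetical samples with characteristics A and B:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical 2-D "speech" samples with characteristic A and characteristic B.
x_A = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
x_B = rng.normal(loc=[4.0, 2.0], scale=0.5, size=(200, 2))

# S1 (stand-in): a fixed invertible affine map plays the trained flow model.
W = np.array([[1.0, 0.2], [0.0, 1.0]])
b = np.array([0.5, -0.5])
f = lambda x: x @ W.T + b
f_inv = lambda z: (z - b) @ np.linalg.inv(W).T

# S2, forward transformation: map both sample sets into the hidden space.
z_A, z_B = f(x_A), f(x_B)

# S3, hidden-space conversion: direction from the A center to the B center,
# then displacement with step length lambda = 1.
delta_z = z_B.mean(axis=0) - z_A.mean(axis=0)
z_conv = z_A[0] + 1.0 * delta_z

# S4, inverse transformation: map the shifted point back to the real space.
x_conv = f_inv(z_conv)

# The converted sample ends up near the B cluster rather than the A cluster.
assert np.linalg.norm(x_conv - x_B.mean(axis=0)) < np.linalg.norm(x_conv - x_A.mean(axis=0))
```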
The speech conversion processing method provided by the embodiment can be applied to the following specific scenes:
and (3) switching the accents: the conversion of the same speaker between different accents is realized. For example, a speech conversion from a northeast dialect to mandarin chinese can be achieved by mixing mandarin and northeast dialect together with the training stream model according to the above conversion method.
Speaker conversion: transforming the voice signal of one speaker (the source speaker) so that, while the expressed semantic information is preserved, the modified signal sounds as if spoken by another speaker (the target speaker).
Emotion conversion: realizing voice conversion of a speaker between different emotions. For example, by mixing voices with different emotions together to train the flow model, the conversion of a speaker's voice from a positive emotion (happy, excited) to a negative emotion (sad) can be realized by the above conversion method.
Compared with typical voice conversion models, the flow-model-based conversion method shifts only in the direction related to a given attribute without damaging other attributes, so the converted voice is more continuous and smooth. Conversion is performed in a hidden space with a continuous Gaussian distribution, so no meaningless data points appear along the conversion path and distortion is prevented; the method is therefore highly distortion-resistant. In addition, the flow model has a simple structure with no excessive computation and low resource overhead, and the software structure is simple because the conversion system depends on only one flow model.
Fig. 5 is a schematic structural diagram of a speech conversion processing apparatus provided in this embodiment, where the apparatus includes: a hidden space mapping module 501, a voice displacement module 502 and a voice mapping module 503, wherein:
the hidden space mapping module 501 is configured to map an original speech of a real space into a simple continuous hidden space according to a space mapping capability of a stream model to obtain a hidden space speech;
the voice displacement module 502 is configured to determine a conversion direction of a target voice in the hidden space, and displace the hidden space voice in the conversion direction to obtain a displaced voice;
the voice mapping module 503 is configured to map the displaced voice back to the real space according to inverse mapping of a stream model to obtain the target voice, so as to implement voice conversion from the original voice to the target voice.
Specifically, the hidden space mapping module 501 maps the original speech of the real space into a simple continuous hidden space according to the space mapping capability of the stream model, so as to obtain a hidden space speech; the voice displacement module 502 determines a conversion direction of a target voice in the hidden space, and displaces the hidden space voice in the conversion direction to obtain a displaced voice; the voice mapping module 503 maps the shifted voice back to the real space according to the inverse mapping of the stream model to obtain the target voice, so as to implement voice conversion from the original voice to the target voice.
In the embodiment, the original voice is mapped into a continuous hidden space, the voice characteristics are changed in the hidden space, and the converted target voice is obtained through inverse mapping, so that the distortion resistance is strong, and other attributes are not damaged, so that the converted target voice is more continuous and smooth; meanwhile, the resource consumption is low, and excessive calculation overhead can not be brought.
Further, on the basis of the above device embodiment, the voice conversion processing device further includes:
the model training module is used for fusing the voice samples with various characteristics together, and when the probability of all the voice samples in an observation space is maximum and the voice samples accord with Gaussian distribution in a hidden space, the flow model is obtained through training;
wherein the Gaussian distribution z is:
z = f(x), x ~ P(x), z ~ N(0, 1)
where x is a voice sample, P(x) is the data distribution of the real space, f is the invertible mapping, and N(0, 1) is the standard normal distribution.
Further, on the basis of the above apparatus embodiment, the hidden space mapping module 501 is specifically configured to:
respectively mapping an original voice A and a target voice B to a hidden space to obtain a corresponding hidden space voice:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, the voice displacement module is specifically configured to:
calculating a first center point of the original voice A from the z_Ai, and a second center point of the target voice B from the z_Bj;
determining the conversion direction Δz of the target voice from the first center point to the second center point;
calculating the displaced voice z′_p from the conversion direction Δz and the hidden-space voice z_p:
z′_p = z_p + λΔz
where λ is the step length, 0 < λ ≤ 1.
Further, on the basis of the above device embodiment, the voice mapping module 503 is specifically configured to:
map the displaced voice z′_p to the target voice x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse of f.
The speech conversion processing apparatus described in this embodiment may be used to implement the above method embodiments, and the principle and technical effect are similar, which are not described herein again.
Referring to fig. 6, the electronic device includes: a processor (processor)601, a memory (memory)602, and a bus 603;
the processor 601 and the memory 602 communicate with each other through the bus 603;
the processor 601 is used for calling the program instructions in the memory 602 to execute the methods provided by the above-mentioned method embodiments.
The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above-described method embodiments.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the method embodiments described above.
The above-described embodiments of the apparatus are merely illustrative. Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement this without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
It should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A method of speech conversion processing, comprising:
according to the spatial mapping capability of a flow model, mapping the original speech of the real space to a simple continuous hidden space to obtain a hidden-space speech;
determining a conversion direction of a target voice in the hidden space, and displacing the hidden space voice in the conversion direction to obtain a displaced voice;
and mapping the displaced voice back to the real space according to the inverse mapping of the flow model to obtain the target voice so as to realize voice conversion from the original voice to the target voice.
2. The method of claim 1, wherein before mapping an original speech in a real space into a simple continuous hidden space according to a spatial mapping capability of a flow model to obtain a hidden-space speech, the method further comprises:
fusing speech samples with various characteristics together, and training the flow model such that the probability of all the speech samples in an observation space is maximized and the speech samples conform to a Gaussian distribution in the hidden space;
wherein the Gaussian-distributed hidden variable z is:
z = f(x)
x ~ P(x), z ~ N(0, 1), where x is the speech sample, P(x) is the data distribution of the real space, f(x) is the invertible mapping, and N(0, 1) is the standard normal distribution.
3. The speech conversion processing method according to claim 1, wherein the mapping an original speech in a real space into a simple continuous hidden space according to a spatial mapping capability of a flow model to obtain a hidden-space speech specifically comprises:
mapping an original speech A and a target speech B to the hidden space, respectively, to obtain the corresponding hidden-space speech:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, the determining a conversion direction of the target speech in the hidden space, and displacing the hidden-space speech in the conversion direction to obtain a displaced speech specifically comprises:
calculating a first central point of the original speech A according to z_Ai, and calculating a second central point of the target speech B according to z_Bj;
determining the conversion direction Δz of the target speech according to the first central point and the second central point;
calculating the displaced speech z′_p according to the conversion direction Δz and the hidden-space speech z_p:
z′_p = z_p + λΔz
where λ is the step size, 0 < λ ≤ 1.
4. The method according to claim 1, wherein the mapping the displaced speech back to the real space according to an inverse mapping of the flow model to obtain the target speech, so as to implement speech conversion from the original speech to the target speech, specifically comprises:
mapping the displaced speech z′_p to the target speech x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse mapping of f.
5. A speech conversion processing apparatus, comprising:
the hidden space mapping module is used for mapping the original speech of the real space to a simple continuous hidden space according to the spatial mapping capability of the flow model to obtain the hidden-space speech;
the voice displacement module is used for determining the conversion direction of the target voice in the hidden space and displacing the hidden space voice in the conversion direction to obtain the displaced voice;
and the voice mapping module is used for mapping the displaced voice back to the real space according to the inverse mapping of the flow model to obtain the target voice so as to realize the voice conversion from the original voice to the target voice.
6. The speech conversion processing apparatus according to claim 5, further comprising:
the model training module is used for fusing speech samples with various characteristics together, and training the flow model such that the probability of all the speech samples in an observation space is maximized and the speech samples conform to a Gaussian distribution in the hidden space;
wherein the Gaussian-distributed hidden variable z is:
z = f(x)
x ~ P(x), z ~ N(0, 1), where x is the speech sample, P(x) is the data distribution of the real space, f(x) is the invertible mapping, and N(0, 1) is the standard normal distribution.
7. The apparatus according to claim 5, wherein the hidden space mapping module is specifically configured to:
map an original speech A and a target speech B to the hidden space, respectively, to obtain the corresponding hidden-space speech:
z_Ai = f(x_Ai)
z_Bj = f(x_Bj)
correspondingly, the voice displacement module is specifically configured to:
calculate a first central point of the original speech A according to z_Ai, and calculate a second central point of the target speech B according to z_Bj;
determine the conversion direction Δz of the target speech according to the first central point and the second central point;
calculate the displaced speech z′_p according to the conversion direction Δz and the hidden-space speech z_p:
z′_p = z_p + λΔz
where λ is the step size, 0 < λ ≤ 1.
8. The apparatus for processing speech conversion according to claim 5, wherein the speech mapping module is specifically configured to:
map the displaced speech z′_p to the target speech x′_p:
x′_p = f⁻¹(z′_p)
where f⁻¹ is the inverse mapping of f.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the speech conversion processing method according to any one of claims 1 to 4 when executing the program.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the speech conversion processing method according to any one of claims 1 to 4.
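The conversion pipeline of claims 1–4 (map to the hidden space with an invertible f, shift along the inter-centroid direction Δz, map back with f⁻¹) can be sketched end to end as follows. The mapping f and the sample data are toy assumptions standing in for a trained flow model and real speech features.

```python
# End-to-end sketch of the claimed conversion under toy assumptions:
# f is a trivially invertible map, and the "speech" samples are 2-D vectors.

def f(x):        # hidden-space mapping (stand-in for a trained flow model)
    return [2.0 * xi for xi in x]

def f_inv(z):    # exact inverse mapping back to the real space
    return [zi / 2.0 for zi in z]

def mean(vs):
    """Central point of a set of equal-length vectors."""
    return [sum(v[i] for v in vs) / len(vs) for i in range(len(vs[0]))]

def convert(x_p, xs_A, xs_B, lam=1.0):
    zs_A = [f(x) for x in xs_A]          # hidden-space samples z_Ai of speech A
    zs_B = [f(x) for x in xs_B]          # hidden-space samples z_Bj of speech B
    c_A, c_B = mean(zs_A), mean(zs_B)    # first and second central points
    delta_z = [b - a for a, b in zip(c_A, c_B)]   # conversion direction
    z_p = f(x_p)                         # map the input into the hidden space
    z_shift = [z + lam * d for z, d in zip(z_p, delta_z)]  # displacement
    return f_inv(z_shift)                # target speech x'_p

x_converted = convert([1.0, 0.0],
                      xs_A=[[1.0, 0.0], [1.0, 0.0]],
                      xs_B=[[0.0, 1.0], [0.0, 1.0]])
print(x_converted)
```

With λ = 1 and the degenerate toy data above, a source sample is carried exactly onto the pre-image of the target centroid, illustrating how the single latent translation realizes the conversion.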
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010084699.XA CN111292718A (en) | 2020-02-10 | 2020-02-10 | Voice conversion processing method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010084699.XA CN111292718A (en) | 2020-02-10 | 2020-02-10 | Voice conversion processing method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111292718A true CN111292718A (en) | 2020-06-16 |
Family
ID=71023483
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010084699.XA Pending CN111292718A (en) | 2020-02-10 | 2020-02-10 | Voice conversion processing method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111292718A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020072908A1 (en) * | 2000-10-19 | 2002-06-13 | Case Eliot M. | System and method for converting text-to-voice |
CN1835074A (en) * | 2006-04-07 | 2006-09-20 | 安徽中科大讯飞信息科技有限公司 | Speaking person conversion method combined high layer discription information and model self adaption |
US20180012613A1 (en) * | 2016-07-11 | 2018-01-11 | The Chinese University Of Hong Kong | Phonetic posteriorgrams for many-to-one voice conversion |
CN108198566A (en) * | 2018-01-24 | 2018-06-22 | 咪咕文化科技有限公司 | Information processing method and device, electronic equipment and storage medium |
CN108885608A (en) * | 2016-06-09 | 2018-11-23 | 苹果公司 | Intelligent automation assistant in home environment |
US20190355372A1 (en) * | 2018-05-17 | 2019-11-21 | Spotify Ab | Automated voiceover mixing and components therefor |
- 2020-02-10 CN CN202010084699.XA patent/CN111292718A/en active Pending
Non-Patent Citations (3)
Title |
---|
HAORAN SUN ET AL: "On Investigation Of Unsupervised Speech Factorization Based On Normalization Flow", 《ARXIV》 * |
CHE YINGXIA ET AL: "Structured Gaussian mixture model under constraint conditions and non-parallel corpus voice conversion", 《ACTA ELECTRONICA SINICA》 *
MA ZHEN ET AL: "Research on voice conversion method based on separation of personal feature information in speech", 《JOURNAL OF SIGNAL PROCESSING》 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114255737A (en) * | 2022-02-28 | 2022-03-29 | 北京世纪好未来教育科技有限公司 | Voice generation method and device and electronic equipment |
CN114255737B (en) * | 2022-02-28 | 2022-05-17 | 北京世纪好未来教育科技有限公司 | Voice generation method and device and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111048064B (en) | Voice cloning method and device based on single speaker voice synthesis data set | |
CN106297773A (en) | A kind of neutral net acoustic training model method | |
CN111581470B (en) | Multi-mode fusion learning analysis method and system for scene matching of dialogue system | |
CN114895817B (en) | Interactive information processing method, network model training method and device | |
WO2023197979A1 (en) | Data processing method and apparatus, and computer device and storage medium | |
WO2021212954A1 (en) | Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources | |
CN112860871B (en) | Natural language understanding model training method, natural language understanding method and device | |
CN113327580A (en) | Speech synthesis method, device, readable medium and electronic equipment | |
CN114242033A (en) | Speech synthesis method, apparatus, device, storage medium and program product | |
CN111292718A (en) | Voice conversion processing method and device, electronic equipment and storage medium | |
CN116798405B (en) | Speech synthesis method, device, storage medium and electronic equipment | |
CN117725936A (en) | Long dialogue emotion dynamic identification method and system based on hypergraph network | |
Zoric et al. | A real-time lip sync system using a genetic algorithm for automatic neural network configuration | |
Zhang et al. | Promptspeaker: Speaker Generation Based on Text Descriptions | |
CN115798456A (en) | Cross-language emotion voice synthesis method and device and computer equipment | |
Zorić et al. | Real-time language independent lip synchronization method using a genetic algorithm | |
CN113611283A (en) | Voice synthesis method and device, electronic equipment and storage medium | |
CN115148182A (en) | Speech synthesis method and device | |
CN116863909B (en) | Speech synthesis method, device and system based on factor graph | |
CN116825090B (en) | Training method and device for speech synthesis model and speech synthesis method and device | |
Zhang et al. | Review of the Lip-reading Recognition | |
CN114582314B (en) | Man-machine audio-video interaction logic model design method based on ASR | |
CN114093389B (en) | Speech emotion recognition method and device, electronic equipment and computer readable medium | |
CN115440198B (en) | Method, apparatus, computer device and storage medium for converting mixed audio signal | |
WO2024008215A2 (en) | Speech emotion recognition method and apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200616 |
RJ01 | Rejection of invention patent application after publication |