CN114420142A - Voice conversion method, device, equipment and storage medium

Voice conversion method, device, equipment and storage medium

Info

Publication number
CN114420142A
CN114420142A (application CN202210308610.2A)
Authority
CN
China
Prior art keywords
voice
vector
speech
low
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210308610.2A
Other languages
Chinese (zh)
Inventor
胡明櫆
赵超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wofeng Times Data Technology Co., Ltd.
Original Assignee
Beijing Wofeng Times Data Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wofeng Times Data Technology Co., Ltd.
Priority to CN202210308610.2A
Publication of CN114420142A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L19/16 - Vocoder architecture
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques with the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a voice conversion method, a device, equipment and a storage medium, comprising the following steps: determining a first low-dimensional vector corresponding to a target voice based on the target voice; separating the first low-dimensional vector to obtain a target identity voice vector; and determining a final converted voice based on a sample identity voice vector and the target identity voice vector. The invention realizes one-to-many voice conversion without adding an extra conversion model to the system, and because the converted voice is obtained by calculating an offset of the Mel frequency cepstrum coefficient (MFCC) features, no large amount of parallel corpora is needed as training data, which reduces the difficulty of collecting training data. The invention is applicable to customer service platforms: addressing the problem that the voice conversion function of such platforms offers only a single voice, the one-to-many conversion technique increases the variety of voices the customer service system can convert to, thereby protecting customer privacy. The method is simple and convenient, powerful, and realizes voice conversion effectively and quickly.

Description

Voice conversion method, device, equipment and storage medium
Technical Field
The present invention relates to the field of voice conversion, and in particular, to a voice conversion method, apparatus, device, and storage medium.
Background
In recent years, artificial intelligence technology has developed rapidly, and this development has brought convenience to people's daily lives. Voice conversion, as an emerging technology within artificial intelligence, has wide application in fields such as film and television dubbing, artificial intelligence (AI) news anchoring, and privacy protection. Although voice conversion is a young technology, it has developed quickly, and after three technical iterations it has entered a golden period; current voice conversion approaches natural human speech in fluency and naturalness.
Because all conversation audio on a customer service platform is public, customer privacy is not protected by default. A voice conversion function is therefore usually embedded in the platform, so that a customer's voice can be converted into a specific voice and personal privacy is protected.
Speaker recognition (SR), also known as voiceprint recognition, is a biometric authentication technique that identifies a speaker using the speaker-specific information contained in the speech signal. In recent years, the introduction of the identity vector (i-vector) speaker modeling method based on factor analysis has markedly improved the performance of speaker recognition systems. Experiments have shown that in the factor analysis of speaker speech, the channel subspace usually also contains speaker information. The speaker and channel subspaces are therefore represented jointly by a single low-dimensional total variability space, and a speaker's voice is mapped into this space to obtain a fixed-length vector representation, the i-vector. An i-vector speaker recognition system consists mainly of three steps: sufficient statistic extraction, i-vector mapping, and likelihood ratio scoring. First, speech signal features are extracted to train a Gaussian mixture model-universal background model (GMM-UBM) representing the speech space; the trained universal background model is used to compute sufficient statistics over the frames of speech features, and these statistics are mapped into the total variability space to obtain an i-vector for each speaker's voice. Finally, the i-vectors are modeled with a probabilistic linear discriminant analysis (PLDA) model, a likelihood ratio score is computed, and the final decision is made against a set threshold.
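As a rough, non-authoritative illustration of this background pipeline, the following sketch trains a GMM-UBM with scikit-learn and computes per-utterance sufficient statistics. The component count, feature dimension, and synthetic data are assumptions for illustration; the total-variability mapping and PLDA scoring are indicated only in comments, since no implementation is prescribed here.

```python
# Hedged sketch of the GMM-UBM stage of an i-vector system; illustrative only.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
pooled_mfcc = rng.standard_normal((5000, 20))    # stand-in for pooled training MFCC frames
utterance_mfcc = rng.standard_normal((300, 20))  # stand-in for one speaker's MFCC frames

# Train the universal background model on features pooled over many speakers.
ubm = GaussianMixture(n_components=64, covariance_type="diag", max_iter=50, random_state=0)
ubm.fit(pooled_mfcc)

# Sufficient statistics of one utterance under the UBM.
post = ubm.predict_proba(utterance_mfcc)  # frame-level component posteriors
N = post.sum(axis=0)                      # zeroth-order statistics (per component)
F = post.T @ utterance_mfcc               # first-order statistics (components x features)

# N and F would next be mapped through a trained total-variability matrix to a
# fixed-length i-vector, and i-vectors compared with a PLDA likelihood-ratio score.
```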
The voice conversion function in current customer service systems is one-to-one, so the converted voice is monotonous; adding a further model for each additional conversion voice occupies more system space, and each model must be trained separately, which consumes considerable training time. At present there is no technical solution to these problems, and in particular no corresponding voice conversion method, apparatus, device or storage medium.
Disclosure of Invention
The invention aims to provide a voice conversion method, which comprises the following steps:
determining a first low-dimensional vector corresponding to a target voice based on the target voice;
separating the first low-dimensional vector to obtain a target identity voice vector;
determining a final converted voice based on a sample identity voice vector and the target identity voice vector.
According to a voice conversion method provided by the present invention, the determining a first low-dimensional vector corresponding to a target voice based on the target voice includes:
determining MFCC features of the target speech based on Mel frequency cepstral coefficient MFCC feature extraction;
processing the MFCC characteristics of the target voice by adopting a maximum likelihood estimation algorithm to determine a supervector corresponding to the target voice;
compressing the supervector to obtain the first low-dimensional vector.
According to a voice conversion method provided by the present invention, the separating the first low-dimensional vector to obtain a target identity voice vector includes:
$w = S\beta_t + c + b$

wherein $w$ is the first low-dimensional vector, $b$ is a constant, $S$ is a linear transformation, $c$ is the speech-content-related item, and $\beta_t$ is the target identity voice vector.
According to a voice conversion method provided by the present invention, the determining of the final converted voice based on the sample identity voice vector and the target identity voice vector includes:
determining a difference value between the sample identity voice vector and the target identity voice vector;
determining a third low-dimensional vector based on the difference value and the first low-dimensional vector;
determining a final converted speech based on the third low-dimensional vector.
According to a voice conversion method provided by the present invention, the determining a third low-dimensional vector based on the difference and the first low-dimensional vector includes:
$w' = w + S(\beta_s - \beta_t)$

wherein $w'$ is the third low-dimensional vector, $w$ is the first low-dimensional vector, $\beta_s$ is the sample identity voice vector, $\beta_t$ is the target identity voice vector, and $S$ is a linear transformation.
According to a voice conversion method provided by the present invention, before determining a final converted voice based on a sample identity voice vector and a target identity voice vector, the method comprises:
fitting a generic speech model to sample speech by using a Gaussian mixture model, wherein the generic speech model at least comprises MFCC features, and the sample speech is a speech set fusing a plurality of pieces of speech;
processing the generic speech model to determine a plurality of second low-dimensional vectors, the plurality of second low-dimensional vectors being vectors corresponding to the generic speech model that contain a plurality of pieces of speech information;
and extracting the identity voice information of any one of the second low-dimensional vectors to obtain a sample identity voice vector.
According to a speech conversion method provided by the present invention, the processing the generic speech model to determine a plurality of second low-dimensional vectors includes:
determining an estimate for each voice in the generic speech model by using a maximum likelihood estimation algorithm, and accumulating the estimates to obtain the supervectors corresponding to all the voices;
and compressing the supervectors corresponding to all the voices to obtain a plurality of second low-dimensional vectors.
According to a voice conversion method provided by the present invention, the extracting identity voice information of any one of the second low-dimensional vectors to obtain a sample identity voice vector includes:
separating the second low-dimensional vector based on Probabilistic Linear Discriminant Analysis (PLDA) to obtain a plurality of candidate identity speech vectors;
the plurality of candidate identity voice vectors are processed based on an extraction policy to obtain a sample identity voice vector.
According to a voice conversion method provided by the present invention, the extraction strategy includes:
random extraction;
sequential extraction; and
specified extraction.
According to a voice conversion method provided by the invention, before determining a first low-dimensional vector corresponding to a target voice based on the target voice, the method comprises the following steps:
determining a rotation frequency based on the number of times the sample voice has been used and the elapsed time;
and judging whether to replace the sample voice based on the rotation frequency and a sample voice cluster, wherein the sample voice cluster consists of a plurality of sample voices.
The present invention also provides a voice conversion apparatus, comprising:
a first determining means for determining a first low-dimensional vector corresponding to a target voice based on the target voice;
an obtaining means for separating the first low-dimensional vector to obtain a target identity voice vector;
a second determining means for determining a final converted voice based on a sample identity voice vector and the target identity voice vector.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor implements the steps of the speech conversion method when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech conversion method.
The method determines a first low-dimensional vector from the target voice, separates the first low-dimensional vector to determine a target identity voice vector, determines a third low-dimensional vector based on the sample identity voice vector and the target identity voice vector, and from this obtains the final converted voice. Compared with the prior art, the invention realizes one-to-many voice conversion without adding an extra conversion model to the system, and because the converted voice is obtained by calculating an MFCC feature offset, no large amount of parallel corpora is needed as training data, which reduces the difficulty of collecting training data. Applied to a customer service platform, whose voice conversion function otherwise offers only a single voice, the one-to-many conversion technique increases the variety of voices the customer service system can convert to, thereby protecting customer privacy. The method is simple and convenient, powerful, and realizes voice conversion effectively and quickly.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a voice conversion method according to the present invention;
FIG. 2 is a schematic flow chart of determining a first low-dimensional vector corresponding to the target speech according to the present invention;
FIG. 3 is a schematic flow chart of determining final converted speech provided by the present invention;
FIG. 4 is a second flowchart of a voice conversion method according to the present invention;
FIG. 5 is a schematic flow chart of determining a plurality of second low-dimensional vectors provided by the present invention;
FIG. 6 is a schematic flow chart of obtaining a sample identity speech vector according to the present invention;
FIG. 7 is a schematic structural diagram of a voice conversion apparatus provided in the present invention;
fig. 8 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Embodiments of the present invention are described below in conjunction with fig. 1-8.
Fig. 1 is a schematic flow diagram of the voice conversion method provided by the present invention. The invention is directed to a method for quickly performing voice conversion after a voice type to be converted to is selected, via the target voice, from multiple sample voice clusters: the low-dimensional vector (i-vector) is separated into a voice content portion and a voice identity portion, the content portion is kept unchanged while the identity portion is changed, and the final converted voice is determined by reverse derivation. The method includes:
First, step S101 is executed to determine the first low-dimensional vector corresponding to the target voice, where the target voice is the speaker's original voice information. The first low-dimensional vector, i.e. the i-vector corresponding to the speaker's original voice, is determined by means of Mel frequency cepstrum coefficient (MFCC) feature extraction, a maximum likelihood estimation algorithm, and supervector compression. The purpose of obtaining the first low-dimensional vector is to determine the target identity voice vector within it; how the first low-dimensional vector is determined from the target voice is further described below.
Then, step S102 is executed to separate the first low-dimensional vector and obtain the target identity voice vector. In such an embodiment, the separation follows the decomposition

$w = S\beta_t + c + b$

where $w$ is the first low-dimensional vector, $b$ is a constant, $S$ is a linear transformation, $c$ is the speech-content-related item, and $\beta_t$ is the target identity voice vector. That is, probabilistic linear discriminant analysis (PLDA) separates the first low-dimensional vector into three parts, the speech-content-related item, the target identity voice vector, and the constant; this application mainly uses the target identity voice vector. A rough sketch of this separation follows.
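For illustration only, under the decomposition above an identity vector can be estimated from a low-dimensional vector by least squares, treating the content item as residual noise; a full PLDA implementation would instead infer the posterior of the identity factor. $S$ and $b$ are assumed to come from a previously trained model.

```python
# Minimal sketch: estimate beta from w = S @ beta + c + b by least squares,
# treating the speech-content item c as residual noise. Illustrative only.
import numpy as np

def separate_identity(w: np.ndarray, S: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Estimate the identity voice vector beta from the low-dimensional vector w."""
    beta, *_ = np.linalg.lstsq(S, w - b, rcond=None)
    return beta
```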
Finally, step S103 determines the final converted voice based on the sample identity voice vector and the target identity voice vector. The sample identity voice vector is preset data in the present invention; in practice, its preparation is not a step that must be executed for every conversion.
As a preferred embodiment of the present invention, before determining a first low-dimensional vector corresponding to a target speech based on the target speech, the method includes:
The rotation frequency is determined based on the number of times the sample voice has been used and on elapsed time. During voice switching, the converted voice identity vector is selected from among the multiple sample voices, so after repeated conversions a given sample voice may have been used many times; a sample voice may also be used too frequently simply because it has not been replaced for a long time.
Further, to better manage the replacement of sample voices, a sample voice cluster may be set up, where the sample voice cluster is composed of multiple sample voices. In this embodiment, whether to replace a sample voice is decided based on the rotation frequency and the sample voice cluster: for example, a sample voice is replaced once it has been used 100 times, or once it has gone unreplaced for more than one month, as in the sketch below.
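A minimal sketch of such a rotation check, assuming a hypothetical SampleVoice record and using the 100-use and one-month thresholds from the example above:

```python
# Illustrative rotation check for sample voices; the SampleVoice record and the
# thresholds are assumptions mirroring the examples in the text.
import numpy as np
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class SampleVoice:
    identity_vector: np.ndarray         # the sample's identity voice vector
    use_count: int = 0                  # number of times this sample has been used
    added_at: Optional[datetime] = None

def should_replace(sample: SampleVoice,
                   max_uses: int = 100,
                   max_age: timedelta = timedelta(days=30)) -> bool:
    """Replace a sample once it has been used too often or kept too long."""
    too_used = sample.use_count >= max_uses
    too_old = sample.added_at is not None and datetime.now() - sample.added_at > max_age
    return too_used or too_old
```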
Because the invention involves voice switching, what is switched is the voice identity content: the speech-content-related items are kept unchanged, and the voice identity content depends on the voice identity vector. The method can therefore obtain the converted low-dimensional vector directly by changing the voice identity vector, and finally determine the converted voice.
Fig. 2 is a schematic flowchart of determining the first low-dimensional vector corresponding to the target speech provided by the present invention. As shown in Fig. 2, the specific steps of determining the first low-dimensional vector based on the target speech include:
First, step S1011 is executed: the MFCC features of the target voice are determined by MFCC feature extraction; voice signal detection is performed on the target sample, and the MFCC features are extracted and optimized.
MFCC feature extraction contains two key steps: conversion to the mel frequency scale, followed by cepstral analysis. The mel scale is a nonlinear frequency scale based on the human ear's perception of equidistant pitch changes. Cepstral analysis means taking the Fourier transform of the time-domain signal, taking the logarithm, and then taking the inverse Fourier transform; the cepstrum can be divided into the complex cepstrum, the real cepstrum, and the power cepstrum. A minimal sketch of this step is given below.
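The sketch assumes the widely used librosa API (no particular library is prescribed here); the sample rate and coefficient count are illustrative choices.

```python
# Hedged MFCC extraction sketch using librosa; parameters are assumptions.
import librosa

def extract_mfcc(wav_path: str, n_mfcc: int = 20):
    signal, sr = librosa.load(wav_path, sr=16000)                # load and resample
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # mel filter bank + log + DCT
    return mfcc.T                                                # (n_frames, n_mfcc)
```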
Then, step S1012 is executed: the MFCC features of the target speech are processed with a maximum likelihood estimation algorithm to determine the supervector corresponding to the target speech. Determining the first low-dimensional vector first requires adjusting the means of the Gaussian mixture model (GMM) of the target speech, and the MFCC features are processed based on the acoustic model TDNN to obtain a mean-centered supervector. More specifically, maximum likelihood estimation yields the maximum probability of the MFCCs corresponding to the target speech, and the adjusted means are then stacked together to obtain the supervector representation of the target speech.
Finally, step S1013 is executed: the supervector is compressed to obtain the first low-dimensional vector. The supervector from step S1012 is generally of high dimension, so it is compressed to a few hundred dimensions to obtain the first low-dimensional vector. A sketch of steps S1012 and S1013 follows.
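Continuing the Background sketch, steps S1012 and S1013 might look as follows; MAP-style mean adaptation stands in for the maximum likelihood adjustment of the GMM means, and PCA stands in for the learned compression, both being assumptions about details left open here.

```python
# Hedged sketch of steps S1012-S1013: adjust the UBM means toward the target
# speech, stack them into a supervector, and compress it. The relevance factor
# and the PCA compression are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA

def target_supervector(ubm: GaussianMixture, mfcc: np.ndarray,
                       relevance: float = 16.0) -> np.ndarray:
    post = ubm.predict_proba(mfcc)               # (n_frames, n_components)
    n_k = post.sum(axis=0)                       # soft frame counts per component
    f_k = post.T @ mfcc                          # soft feature sums per component
    alpha = (n_k / (n_k + relevance))[:, None]   # adaptation weights
    means = alpha * (f_k / np.maximum(n_k, 1e-8)[:, None]) + (1 - alpha) * ubm.means_
    return means.ravel()                         # high-dimensional supervector

# Compression to the first low-dimensional vector (a few hundred dimensions),
# with the projection fitted beforehand on training supervectors:
#   pca = PCA(n_components=400).fit(training_supervectors)
#   w = pca.transform(target_supervector(ubm, target_mfcc)[None, :])[0]
```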
Fig. 3 is a schematic flow chart of determining the final converted speech provided by the present invention. As shown in Fig. 3, the specific flow of determining the final converted speech based on the sample identity speech vector and the target identity speech vector includes:
First, step S1031 determines the difference between the sample identity voice vector and the target identity voice vector. In the present invention, this is based on the formula

$w = S\beta_t + c + b$

where $w$ is the first low-dimensional vector, $b$ is a constant, $S$ is a linear transformation, $c$ is the speech-content-related item, and $\beta_t$ is the target identity voice vector. It follows that a low-dimensional vector separates into the speech-content-related item plus the product of the identity voice vector and the linear transformation; hence $S(\beta_s - \beta_t)$, the product of the linear transformation $S$ and the difference between the sample identity voice vector $\beta_s$ and the target identity voice vector $\beta_t$, can be understood as the difference between the low-dimensional vector of the sample voice and the low-dimensional vector of the target voice.
Then, step S1032 determines a third low-dimensional vector based on the difference and the first low-dimensional vector, where:

$w' = w + S(\beta_s - \beta_t)$

where $w'$ is the third low-dimensional vector, $w$ is the first low-dimensional vector, $\beta_s$ is the sample identity voice vector, and $\beta_t$ is the target identity voice vector. Based on step S1031, $S(\beta_s - \beta_t)$ is the difference between the low-dimensional vector of the sample voice and that of the target voice, so the formula above transforms $w$ into the third low-dimensional vector $w'$, i.e. the converted low-dimensional vector.
Finally, step S1033 determines the final converted speech based on the third low-dimensional vector. In such an embodiment, the second low-dimensional vectors of the sample speech are extracted first, so that after the training phase the identity voice vectors of all speakers in the sample are available; the low-dimensional vector of the speech to be converted is then obtained and shifted to the third low-dimensional vector $w'$. The change in the low-dimensional vector is inversely decompressed to obtain the change in the supervector; since the supervector is formed from the means of the speech GMM, the change in the GMM means, and hence the offset of the MFCC features, can be obtained, and the MFCC features can then be directly synthesized into the final converted speech. A sketch of this step follows.
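Assuming a linear compression such as the PCA above, the identity swap and the inverse decompression of this step can be sketched as follows; reading the result as per-Gaussian mean offsets is an illustrative interpretation of the change of the GMM means.

```python
# Hedged sketch of step S1033: shift the low-dimensional vector by
# S(beta_s - beta_t), map the change back to the supervector space, and read
# off the per-Gaussian mean offsets that bias the MFCC features for synthesis.
import numpy as np
from sklearn.decomposition import PCA

def mfcc_mean_offsets(S: np.ndarray, beta_sample: np.ndarray, beta_target: np.ndarray,
                      pca: PCA, n_components: int, n_mfcc: int) -> np.ndarray:
    delta_w = S @ (beta_sample - beta_target)  # change in the low-dimensional vector
    # PCA's inverse map is affine (x @ components_ + mean_), so a difference of
    # low-dimensional vectors maps back through the components alone.
    delta_supervector = delta_w @ pca.components_
    return delta_supervector.reshape(n_components, n_mfcc)  # per-Gaussian mean offsets
```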
Fig. 4 is a second flowchart of a speech conversion method provided by the present invention, as shown in fig. 4, fig. 4 shows steps before determining final converted speech based on a sample identity speech vector and a target identity speech vector, including:
First, step S201 is executed: a Gaussian mixture model is used to fit a generic speech model to the sample speech, where the generic speech model at least includes MFCC features and the sample speech is a speech set fusing multiple pieces of speech. A Gaussian mixture model decomposes a phenomenon into several Gaussian probability density functions and quantifies it precisely with them; the MFCC features of the generic speech model are extracted and determined by MFCC feature extraction.
Then, step S202 is executed to process the generic speech model to determine a plurality of second low-dimensional vectors, which are the vectors corresponding to the generic speech model that contain the multiple pieces of speech information; this step is further described with reference to Fig. 5.
Finally, step S203 is executed to extract the identity voice information of any one of the second low-dimensional vectors and obtain the sample identity voice vector. PLDA can separate out the identity voice information corresponding to each second low-dimensional vector, where the identity voice information is the product of the identity voice vector and the linear transformation. The purpose of this step is to determine, based on the user's selection, one of the second low-dimensional vectors as the sample low-dimensional vector to be converted to, and thereby determine the sample identity voice vector.
Fig. 5 is a schematic flowchart of the process for determining a plurality of second low-dimensional vectors according to the present invention, and as shown in fig. 5, the specific steps of processing the generic speech model to determine a plurality of second low-dimensional vectors include:
First, step S2021 is performed: an estimate for each voice in the generic speech model is determined using a maximum likelihood estimation algorithm, and the estimates are accumulated to obtain the supervector for each voice. Specifically, determining the second low-dimensional vectors first requires adjusting the GMM means of each voice in the generic speech model, which is the maximum likelihood estimation process; after maximum likelihood estimation the maximum probability of each voice's MFCCs is obtained, and the adjusted means are then stacked to obtain the supervector representation of each voice.
Then, step S2022 is performed: the supervectors of all the voices are compressed to obtain the plurality of second low-dimensional vectors; because the supervectors are of high dimension, they are compressed to a few hundred dimensions.
Fig. 6 is a schematic flow chart of obtaining a sample identity speech vector provided by the present invention, and as shown in fig. 6, fig. 6 shows a specific step of extracting identity speech information of any one of the second low-dimensional vectors to obtain a sample identity speech vector, which includes:
First, step S2031 separates the second low-dimensional vectors based on probabilistic linear discriminant analysis (PLDA) to obtain multiple candidate identity voice vectors. The separation is the same as in step S102 above, where the first low-dimensional vector is separated to obtain the target identity voice vector, namely:

$w = S\beta + c + b$

where $w$ is the low-dimensional vector, $b$ is a constant, $S$ is a linear transformation, $c$ is the speech-content-related item, and $\beta$ is the identity voice vector. Separating the plurality of second low-dimensional vectors correspondingly with PLDA yields the plurality of candidate identity voice vectors, which serve as alternative samples for the user; when the user needs to choose, one of the candidate identity voice vectors is selected.
Then, step S2032 processes the plurality of candidate identity voice vectors based on an extraction policy to obtain the sample identity voice vector. The extraction policy includes random extraction, in which one of the candidate identity voice vectors is selected at random as the sample identity voice vector; sequential extraction, in which the candidate identity voice vectors are numbered and traversed cyclically in order, one being selected each time; and specified extraction, in which the user's preference determines which candidate identity voice vector is selected each time. An illustrative implementation is sketched below.
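The three strategies admit a straightforward implementation; the pool class below is a hypothetical helper, not a structure defined here.

```python
# Illustrative pool of candidate identity voice vectors supporting the three
# extraction strategies named above; the class itself is an assumption.
import random

class IdentityVectorPool:
    def __init__(self, candidates):
        self.candidates = list(candidates)  # candidate identity voice vectors
        self._cursor = 0                    # position used by sequential extraction

    def random_extraction(self):
        """Pick one candidate uniformly at random."""
        return random.choice(self.candidates)

    def sequential_extraction(self):
        """Cycle through the numbered candidates in order, one per conversion."""
        vec = self.candidates[self._cursor % len(self.candidates)]
        self._cursor += 1
        return vec

    def specified_extraction(self, index: int):
        """Return the candidate the user designated."""
        return self.candidates[index]
```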
Fig. 7 is a schematic structural diagram of a speech conversion apparatus provided in the present invention, including:
the first determination device 1: the first low-dimensional vector corresponding to the target speech is determined based on the target speech, and the working principle of the first determining apparatus 1 may refer to the foregoing step S101, which is not described herein again.
The voice conversion apparatus further includes an acquisition means: the first low-dimensional vector is separated to obtain the target identity voice vector, and the working principle of the obtaining apparatus 2 may refer to the step S102, which is not described herein again.
The speech conversion apparatus further includes second determination means 3: the final converted voice is determined based on the sample identity voice vector and the target identity voice vector, and the working principle of the second determining device 3 may refer to the step S103, which is not described herein again.
Fig. 8 illustrates the physical structure of an electronic device. As shown in Fig. 8, the electronic device may include: a processor 310, a communications interface 320, a memory 330 and a communication bus 340, wherein the processor 310, the communications interface 320 and the memory 330 communicate with each other via the communication bus 340. The processor 310 may invoke logic instructions in the memory 330 to perform the speech conversion method, comprising: determining a first low-dimensional vector corresponding to a target voice based on the target voice; separating the first low-dimensional vector to obtain a target identity voice vector; and determining a final converted voice based on the sample identity voice vector and the target identity voice vector.
In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, the computer program, when executed by a processor, being capable of executing a speech conversion method provided by the above methods, the method comprising: determining a first low-dimensional vector corresponding to a target voice based on the target voice; separating the first low-dimensional vector to obtain a target identity voice vector; a final converted voice is determined based on the sample identity voice vector and the target identity voice vector.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor, implements a method of speech conversion provided by the above methods, the method comprising: determining a first low-dimensional vector corresponding to a target voice based on the target voice; separating the first low-dimensional vector to obtain a target identity voice vector; a final converted voice is determined based on the sample identity voice vector and the target identity voice vector.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (13)

1. A method of speech conversion, comprising:
determining a first low-dimensional vector corresponding to a target voice based on the target voice;
separating the first low-dimensional vector to obtain a target identity voice vector;
determining a final converted voice based on a sample identity voice vector and the target identity voice vector.
2. The method of claim 1, wherein determining a first low-dimensional vector corresponding to the target speech based on the target speech comprises:
determining MFCC features of the target speech based on Mel frequency cepstral coefficient MFCC feature extraction;
processing the MFCC characteristics of the target voice by adopting a maximum likelihood estimation algorithm to determine a supervector corresponding to the target voice;
compressing the supervector to obtain the first low-dimensional vector.
3. The method of claim 1, wherein the separating the first low-dimensional vector to obtain a target identity voice vector comprises:
$w = S\beta_t + c + b$

wherein $w$ is the first low-dimensional vector, $b$ is a constant, $S$ is a linear transformation, $c$ is the speech-content-related item, and $\beta_t$ is the target identity voice vector.
4. The method of claim 1, wherein determining the final converted speech based on the sample identity speech vector and the target identity speech vector comprises:
determining a difference value between the sample identity voice vector and the target identity voice vector;
determining a third low-dimensional vector based on the difference value and the first low-dimensional vector;
determining a final converted speech based on the third low-dimensional vector.
5. The method of claim 4, wherein determining a third low-dimensional vector based on the difference value and the first low-dimensional vector comprises:
$w' = w + S(\beta_s - \beta_t)$

wherein $w'$ is the third low-dimensional vector, $w$ is the first low-dimensional vector, $\beta_s$ is the sample identity voice vector, $\beta_t$ is the target identity voice vector, and $S$ is a linear transformation.
6. The method of claim 1, prior to determining final converted speech based on the sample identity speech vector and the target identity speech vector, comprising:
fitting a generic speech model to sample speech by using a Gaussian mixture model, wherein the generic speech model at least comprises Mel Frequency Cepstrum Coefficient (MFCC) features, and the sample speech is a speech set fusing a plurality of pieces of speech;
processing the generic speech model to determine a plurality of second low-dimensional vectors, the plurality of second low-dimensional vectors being vectors corresponding to the generic speech model that contain a plurality of pieces of speech information;
and extracting the identity voice information of any one of the second low-dimensional vectors to obtain a sample identity voice vector.
7. The method of claim 6, wherein the processing the generic speech model to determine a plurality of second low-dimensional vectors comprises:
determining an estimate for each voice in the generic speech model by using a maximum likelihood estimation algorithm, and accumulating the estimates to obtain the supervectors corresponding to all the voices;
and compressing the supervectors corresponding to all the voices to obtain a plurality of second low-dimensional vectors.
8. The method of claim 6, wherein extracting identity voice information of any one of the plurality of second low-dimensional vectors to obtain a sample identity voice vector comprises:
separating the second low-dimensional vector based on Probabilistic Linear Discriminant Analysis (PLDA) to obtain a plurality of candidate identity speech vectors;
the plurality of candidate identity voice vectors are processed based on an extraction policy to obtain a sample identity voice vector.
9. The method of claim 8, wherein the extraction strategy comprises:
random extraction;
sequential extraction; and
specified extraction.
10. The method of claim 1, prior to determining a first low-dimensional vector corresponding to a target speech based on the target speech, comprising:
determining a rotation frequency based on the number of times the sample voice has been used and the elapsed time;
and judging whether to replace the sample voice based on the rotation frequency and a sample voice cluster, wherein the sample voice cluster consists of a plurality of sample voices.
11. A speech conversion apparatus, comprising:
a first determining means for determining a first low-dimensional vector corresponding to a target voice based on the target voice;
an obtaining means for separating the first low-dimensional vector to obtain a target identity voice vector;
a second determining means for determining a final converted voice based on a sample identity voice vector and the target identity voice vector.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the speech conversion method according to any of claims 1 to 10 are implemented when the processor executes the program.
13. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech conversion method according to any one of claims 1 to 10.
Application CN202210308610.2A (filed 2022-03-28): Voice conversion method, device, equipment and storage medium; status: Pending; publication: CN114420142A.

Priority Applications (1)

Application Number: CN202210308610.2A
Priority Date / Filing Date: 2022-03-28
Title: Voice conversion method, device, equipment and storage medium

Publications (1)

Publication Number: CN114420142A
Publication Date: 2022-04-29

Family

ID: 81262761

Family Applications (1)

Application Number: CN202210308610.2A (status: Pending)
Title: Voice conversion method, device, equipment and storage medium
Priority Date: 2022-03-28; Filing Date: 2022-03-28

Country Status (1)

Country: CN
Publication: CN114420142A

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065022A * 2018-06-06 2018-12-21 平安科技(深圳)有限公司 I-vector extraction method, speaker recognition method, device, equipment and medium
CN109599091A (en) * 2019-01-14 2019-04-09 南京邮电大学 Multi-to-multi voice conversion method based on STARWGAN-GP and x vector
CN110085254A (en) * 2019-04-22 2019-08-02 南京邮电大学 Multi-to-multi phonetics transfer method based on beta-VAE and i-vector
CN111462729A (en) * 2020-03-31 2020-07-28 因诺微科技(天津)有限公司 Fast language identification method based on phoneme log-likelihood ratio and sparse representation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TOMI KINNUNEN: "Non-parallel voice conversion using i-vector PLDA: towards unifying speaker verification and transformation", ICASSP, 31 December 2017, pages 1-5 *

Similar Documents

Publication Publication Date Title
CN105976812B (en) A kind of audio recognition method and its equipment
Hossan et al. A novel approach for MFCC feature extraction
CN108597496B (en) Voice generation method and device based on generation type countermeasure network
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
CN102509547B (en) Method and system for voiceprint recognition based on vector quantization based
JP5853029B2 (en) Passphrase modeling device and method for speaker verification, and speaker verification system
CN111524527A (en) Speaker separation method, device, electronic equipment and storage medium
CN112071330B (en) Audio data processing method and device and computer readable storage medium
CN106991312B (en) Internet anti-fraud authentication method based on voiceprint recognition
CN110428853A (en) Voice activity detection method, Voice activity detection device and electronic equipment
Bagul et al. Text independent speaker recognition system using GMM
CN102376306B (en) Method and device for acquiring level of speech frame
Ayoub et al. Gammatone frequency cepstral coefficients for speaker identification over VoIP networks
CN115171731A (en) Emotion category determination method, device and equipment and readable storage medium
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
CN113823323A (en) Audio processing method and device based on convolutional neural network and related equipment
Desai et al. Speaker recognition using MFCC and hybrid model of VQ and GMM
CN109545226A (en) A kind of audio recognition method, equipment and computer readable storage medium
CN114141237A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN114125506B (en) Voice auditing method and device
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
Khanna et al. Application of vector quantization in emotion recognition from human speech
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
Koolagudi et al. Speaker recognition in the case of emotional environment using transformation of speech features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 2022-04-29