CN114420142A - Voice conversion method, device, equipment and storage medium

Voice conversion method, device, equipment and storage medium

Info

Publication number
CN114420142A
CN114420142A (application CN202210308610.2A)
Authority
CN
China
Prior art keywords
voice
vector
speech
low
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210308610.2A
Other languages
Chinese (zh)
Inventor
胡明櫆
赵超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wofeng Times Data Technology Co., Ltd.
Original Assignee
Beijing Wofeng Times Data Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wofeng Times Data Technology Co., Ltd.
Priority to CN202210308610.2A
Publication of CN114420142A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L19/16 - Vocoder architecture
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques with the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a voice conversion method, a device, equipment and a storage medium, comprising the following steps: determining a first low-dimensional vector corresponding to a target voice based on the target voice; separating the first low-dimensional vector to obtain a target identity voice vector; and determining a final converted voice based on a sample identity voice vector and the target identity voice vector. The invention realizes one-to-many voice conversion without adding an extra conversion model to the system, and because the converted voice is obtained by calculating an offset of the Mel frequency cepstrum coefficient (MFCC) features, no large amount of parallel corpora is needed as training data, which reduces the difficulty of collecting training data. The invention is applicable to customer service platforms: addressing the problem that the voice conversion function of such platforms offers only a single voice, the one-to-many conversion technique increases the variety of voices the customer service system can convert to, thereby protecting customer privacy. The method is simple and convenient, powerful, and realizes voice conversion effectively and quickly.

Description

Voice conversion method, device, equipment and storage medium
Technical Field
The present invention relates to the field of voice conversion, and in particular, to a voice conversion method, apparatus, device, and storage medium.
Background
In recent years, artificial intelligence technology has developed rapidly, and this development has brought convenience to people's daily lives. Voice conversion, as an emerging technology within artificial intelligence, has wide application in fields such as film and television dubbing, artificial intelligence (AI) news anchoring, and privacy protection. Although voice conversion is a young technology, it has developed quickly, and after three technical iterations it has entered a golden period; current voice conversion approaches natural human speech in fluency and naturalness.
Because all conversation audio on a customer service platform is public, customer privacy is not protected by default. A voice conversion function is therefore usually embedded in the platform, so that a customer's voice can be converted into a specific voice and personal privacy is protected.
Speaker recognition (SR), also known as voiceprint recognition, is a biometric authentication technique that identifies a speaker using the speaker-specific information contained in the speech signal. In recent years, the introduction of the identity vector (i-vector) speaker modeling method based on factor analysis has markedly improved the performance of speaker recognition systems. Experiments have shown that in the factor analysis of speaker speech, the channel subspace usually also contains speaker information. The speaker and channel subspaces are therefore represented jointly by a single low-dimensional total variability space, and a speaker's voice is mapped into this space to obtain a fixed-length vector representation, the i-vector. An i-vector speaker recognition system consists mainly of three steps: sufficient statistic extraction, i-vector mapping, and likelihood ratio scoring. First, speech signal features are extracted to train a Gaussian mixture model-universal background model (GMM-UBM) representing the speech space; the trained universal background model is used to compute sufficient statistics over the frames of speech features, and these statistics are mapped into the total variability space to obtain an i-vector for each speaker's voice. Finally, the i-vectors are modeled with a probabilistic linear discriminant analysis (PLDA) model, a likelihood ratio score is computed, and the final decision is made against a set threshold.
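As a rough, non-authoritative illustration of this background pipeline, the following sketch trains a GMM-UBM with scikit-learn and computes per-utterance sufficient statistics. The component count, feature dimension, and synthetic data are assumptions for illustration; the total-variability mapping and PLDA scoring are indicated only in comments, since no implementation is prescribed here.

```python
# Hedged sketch of the GMM-UBM stage of an i-vector system; illustrative only.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
pooled_mfcc = rng.standard_normal((5000, 20))    # stand-in for pooled training MFCC frames
utterance_mfcc = rng.standard_normal((300, 20))  # stand-in for one speaker's MFCC frames

# Train the universal background model on features pooled over many speakers.
ubm = GaussianMixture(n_components=64, covariance_type="diag", max_iter=50, random_state=0)
ubm.fit(pooled_mfcc)

# Sufficient statistics of one utterance under the UBM.
post = ubm.predict_proba(utterance_mfcc)  # frame-level component posteriors
N = post.sum(axis=0)                      # zeroth-order statistics (per component)
F = post.T @ utterance_mfcc               # first-order statistics (components x features)

# N and F would next be mapped through a trained total-variability matrix to a
# fixed-length i-vector, and i-vectors compared with a PLDA likelihood-ratio score.
```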
The voice conversion function in current customer service systems is one-to-one, so the converted voice is monotonous; adding a further model for each additional conversion voice occupies more system space, and each model must be trained separately, which consumes considerable training time. At present there is no technical solution to these problems, and in particular no corresponding voice conversion method, apparatus, device or storage medium.
Disclosure of Invention
The invention aims to provide a voice conversion method, which comprises the following steps:
determining a first low-dimensional vector corresponding to a target voice based on the target voice;
separating the first low-dimensional vector to obtain a target identity voice vector;
determining a final converted voice based on a sample identity voice vector and the target identity voice vector.
According to a voice conversion method provided by the present invention, the determining a first low-dimensional vector corresponding to a target voice based on the target voice includes:
determining MFCC features of the target speech based on Mel frequency cepstral coefficient MFCC feature extraction;
processing the MFCC characteristics of the target voice by adopting a maximum likelihood estimation algorithm to determine a supervector corresponding to the target voice;
compressing the supervector to obtain the first low-dimensional vector.
According to a voice conversion method provided by the present invention, the separating the first low-dimensional vector to obtain a target identity voice vector includes:
$w = S\beta_t + c + b$

wherein $w$ is the first low-dimensional vector, $b$ is a constant, $S$ is a linear transformation, $c$ is the speech-content-related item, and $\beta_t$ is the target identity voice vector.
According to a voice conversion method provided by the present invention, the determining of the final converted voice based on the sample identity voice vector and the target identity voice vector includes:
determining a difference value between the sample identity voice vector and the target identity voice vector;
determining a third low-dimensional vector based on the difference value and the first low-dimensional vector;
determining a final converted speech based on the third low-dimensional vector.
According to a voice conversion method provided by the present invention, the determining a third low-dimensional vector based on the difference and the first low-dimensional vector includes:
$w' = w + S(\beta_s - \beta_t)$

wherein $w'$ is the third low-dimensional vector, $w$ is the first low-dimensional vector, $\beta_s$ is the sample identity voice vector, $\beta_t$ is the target identity voice vector, and $S$ is a linear transformation.
According to a voice conversion method provided by the present invention, before determining a final converted voice based on a sample identity voice vector and a target identity voice vector, the method comprises:
fitting a generic speech model to sample speech by using a Gaussian mixture model, wherein the generic speech model at least comprises MFCC features, and the sample speech is a speech set fusing a plurality of pieces of speech;
processing the generic speech model to determine a plurality of second low-dimensional vectors, the plurality of second low-dimensional vectors being vectors corresponding to the generic speech model that contain a plurality of pieces of speech information;
and extracting the identity voice information of any one of the second low-dimensional vectors to obtain a sample identity voice vector.
According to a speech conversion method provided by the present invention, the processing the generic speech model to determine a plurality of second low-dimensional vectors includes:
determining an estimate for each voice in the generic speech model by using a maximum likelihood estimation algorithm, and accumulating the estimates to obtain the supervectors corresponding to all the voices;
and compressing the supervectors corresponding to all the voices to obtain a plurality of second low-dimensional vectors.
According to a voice conversion method provided by the present invention, the extracting identity voice information of any one of the second low-dimensional vectors to obtain a sample identity voice vector includes:
separating the second low-dimensional vector based on Probabilistic Linear Discriminant Analysis (PLDA) to obtain a plurality of candidate identity speech vectors;
the plurality of candidate identity voice vectors are processed based on an extraction policy to obtain a sample identity voice vector.
According to a voice conversion method provided by the present invention, the extraction strategy includes:
random extraction;
sequential extraction; and
specified extraction.
According to a voice conversion method provided by the invention, before determining a first low-dimensional vector corresponding to a target voice based on the target voice, the method comprises the following steps:
determining a rotation frequency based on the number of times the sample voice has been used and the elapsed time;
and judging whether to replace the sample voice based on the rotation frequency and a sample voice cluster, wherein the sample voice cluster consists of a plurality of sample voices.
The present invention also provides a voice conversion apparatus, comprising:
a first determining means for determining a first low-dimensional vector corresponding to a target voice based on the target voice;
an obtaining means for separating the first low-dimensional vector to obtain a target identity voice vector;
a second determining means for determining a final converted voice based on a sample identity voice vector and the target identity voice vector.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor implements the steps of the speech conversion method when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech conversion method.
The method determines a first low-dimensional vector from the target voice, separates the first low-dimensional vector to determine a target identity voice vector, determines a third low-dimensional vector based on the sample identity voice vector and the target identity voice vector, and from this obtains the final converted voice. Compared with the prior art, the invention realizes one-to-many voice conversion without adding an extra conversion model to the system, and because the converted voice is obtained by calculating an MFCC feature offset, no large amount of parallel corpora is needed as training data, which reduces the difficulty of collecting training data. Applied to a customer service platform, whose voice conversion function otherwise offers only a single voice, the one-to-many conversion technique increases the variety of voices the customer service system can convert to, thereby protecting customer privacy. The method is simple and convenient, powerful, and realizes voice conversion effectively and quickly.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a voice conversion method according to the present invention;
FIG. 2 is a schematic flow chart of determining a first low-dimensional vector corresponding to the target speech according to the present invention;
FIG. 3 is a schematic flow chart of determining final converted speech provided by the present invention;
FIG. 4 is a second flowchart of a voice conversion method according to the present invention;
FIG. 5 is a schematic flow chart of determining a plurality of second low-dimensional vectors provided by the present invention;
FIG. 6 is a schematic flow chart of obtaining a sample identity speech vector according to the present invention;
FIG. 7 is a schematic structural diagram of a voice conversion apparatus provided in the present invention;
fig. 8 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Embodiments of the present invention are described below in conjunction with fig. 1-8.
Fig. 1 is a schematic flow diagram of the voice conversion method provided by the present invention. The invention is directed to a method for quickly performing voice conversion after a voice type to be converted to is selected, via the target voice, from multiple sample voice clusters: the low-dimensional vector (i-vector) is separated into a voice content portion and a voice identity portion, the content portion is kept unchanged while the identity portion is changed, and the final converted voice is determined by reverse derivation. The method includes:
First, step S101 is executed to determine the first low-dimensional vector corresponding to the target voice, where the target voice is the speaker's original voice information. The first low-dimensional vector, i.e. the i-vector corresponding to the speaker's original voice, is determined by means of Mel frequency cepstrum coefficient (MFCC) feature extraction, a maximum likelihood estimation algorithm, and supervector compression. The purpose of obtaining the first low-dimensional vector is to determine the target identity voice vector within it; how the first low-dimensional vector is determined from the target voice is further described below.
Then, step S102 is executed to separate the first low-dimensional vector and obtain the target identity voice vector. In such an embodiment, the separation follows the decomposition

$w = S\beta_t + c + b$

where $w$ is the first low-dimensional vector, $b$ is a constant, $S$ is a linear transformation, $c$ is the speech-content-related item, and $\beta_t$ is the target identity voice vector. That is, probabilistic linear discriminant analysis (PLDA) separates the first low-dimensional vector into three parts, the speech-content-related item, the target identity voice vector, and the constant; this application mainly uses the target identity voice vector. A rough sketch of this separation follows.
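For illustration only, under the decomposition above an identity vector can be estimated from a low-dimensional vector by least squares, treating the content item as residual noise; a full PLDA implementation would instead infer the posterior of the identity factor. $S$ and $b$ are assumed to come from a previously trained model.

```python
# Minimal sketch: estimate beta from w = S @ beta + c + b by least squares,
# treating the speech-content item c as residual noise. Illustrative only.
import numpy as np

def separate_identity(w: np.ndarray, S: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Estimate the identity voice vector beta from the low-dimensional vector w."""
    beta, *_ = np.linalg.lstsq(S, w - b, rcond=None)
    return beta
```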
Finally, step S103 determines the final converted voice based on the sample identity voice vector and the target identity voice vector. The sample identity voice vector is preset data in the present invention; in practice, its preparation is not a step that must be executed for every conversion.
As a preferred embodiment of the present invention, before determining a first low-dimensional vector corresponding to a target speech based on the target speech, the method includes:
The rotation frequency is determined based on the number of times the sample voice has been used and on elapsed time. During voice switching, the converted voice identity vector is selected from among the multiple sample voices, so after repeated conversions a given sample voice may have been used many times; a sample voice may also be used too frequently simply because it has not been replaced for a long time.
Further, to better manage the replacement of sample voices, a sample voice cluster may be set up, where the sample voice cluster is composed of multiple sample voices. In this embodiment, whether to replace a sample voice is decided based on the rotation frequency and the sample voice cluster: for example, a sample voice is replaced once it has been used 100 times, or once it has gone unreplaced for more than one month, as in the sketch below.
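A minimal sketch of such a rotation check, assuming a hypothetical SampleVoice record and using the 100-use and one-month thresholds from the example above:

```python
# Illustrative rotation check for sample voices; the SampleVoice record and the
# thresholds are assumptions mirroring the examples in the text.
import numpy as np
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class SampleVoice:
    identity_vector: np.ndarray         # the sample's identity voice vector
    use_count: int = 0                  # number of times this sample has been used
    added_at: Optional[datetime] = None

def should_replace(sample: SampleVoice,
                   max_uses: int = 100,
                   max_age: timedelta = timedelta(days=30)) -> bool:
    """Replace a sample once it has been used too often or kept too long."""
    too_used = sample.use_count >= max_uses
    too_old = sample.added_at is not None and datetime.now() - sample.added_at > max_age
    return too_used or too_old
```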
Because the invention involves voice switching, what is switched is the voice identity content: the speech-content-related items are kept unchanged, and the voice identity content depends on the voice identity vector. The method can therefore obtain the converted low-dimensional vector directly by changing the voice identity vector, and finally determine the converted voice.
Fig. 2 is a schematic flowchart of determining the first low-dimensional vector corresponding to the target speech provided by the present invention. As shown in Fig. 2, the specific steps of determining the first low-dimensional vector based on the target speech include:
First, step S1011 is executed: the MFCC features of the target voice are determined by MFCC feature extraction; voice signal detection is performed on the target sample, and the MFCC features are extracted and optimized.
MFCC feature extraction contains two key steps: conversion to the mel frequency scale, followed by cepstral analysis. The mel scale is a nonlinear frequency scale based on the human ear's perception of equidistant pitch changes. Cepstral analysis means taking the Fourier transform of the time-domain signal, taking the logarithm, and then taking the inverse Fourier transform; the cepstrum can be divided into the complex cepstrum, the real cepstrum, and the power cepstrum. A minimal sketch of this step is given below.
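The sketch assumes the widely used librosa API (no particular library is prescribed here); the sample rate and coefficient count are illustrative choices.

```python
# Hedged MFCC extraction sketch using librosa; parameters are assumptions.
import librosa

def extract_mfcc(wav_path: str, n_mfcc: int = 20):
    signal, sr = librosa.load(wav_path, sr=16000)                # load and resample
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # mel filter bank + log + DCT
    return mfcc.T                                                # (n_frames, n_mfcc)
```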
Then, step S1012 is executed: the MFCC features of the target speech are processed with a maximum likelihood estimation algorithm to determine the supervector corresponding to the target speech. Determining the first low-dimensional vector first requires adjusting the means of the Gaussian mixture model (GMM) of the target speech, and the MFCC features are processed based on the acoustic model TDNN to obtain a mean-centered supervector. More specifically, maximum likelihood estimation yields the maximum probability of the MFCCs corresponding to the target speech, and the adjusted means are then stacked together to obtain the supervector representation of the target speech.
Finally, step S1013 is executed: the supervector is compressed to obtain the first low-dimensional vector. The supervector from step S1012 is generally of high dimension, so it is compressed to a few hundred dimensions to obtain the first low-dimensional vector. A sketch of steps S1012 and S1013 follows.
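Continuing the Background sketch, steps S1012 and S1013 might look as follows; MAP-style mean adaptation stands in for the maximum likelihood adjustment of the GMM means, and PCA stands in for the learned compression, both being assumptions about details left open here.

```python
# Hedged sketch of steps S1012-S1013: adjust the UBM means toward the target
# speech, stack them into a supervector, and compress it. The relevance factor
# and the PCA compression are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA

def target_supervector(ubm: GaussianMixture, mfcc: np.ndarray,
                       relevance: float = 16.0) -> np.ndarray:
    post = ubm.predict_proba(mfcc)               # (n_frames, n_components)
    n_k = post.sum(axis=0)                       # soft frame counts per component
    f_k = post.T @ mfcc                          # soft feature sums per component
    alpha = (n_k / (n_k + relevance))[:, None]   # adaptation weights
    means = alpha * (f_k / np.maximum(n_k, 1e-8)[:, None]) + (1 - alpha) * ubm.means_
    return means.ravel()                         # high-dimensional supervector

# Compression to the first low-dimensional vector (a few hundred dimensions),
# with the projection fitted beforehand on training supervectors:
#   pca = PCA(n_components=400).fit(training_supervectors)
#   w = pca.transform(target_supervector(ubm, target_mfcc)[None, :])[0]
```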
Fig. 3 is a schematic flow chart of determining the final converted speech provided by the present invention. As shown in Fig. 3, the specific flow of determining the final converted speech based on the sample identity speech vector and the target identity speech vector includes:
First, step S1031 determines the difference between the sample identity voice vector and the target identity voice vector. In the present invention, this is based on the formula

$w = S\beta_t + c + b$

where $w$ is the first low-dimensional vector, $b$ is a constant, $S$ is a linear transformation, $c$ is the speech-content-related item, and $\beta_t$ is the target identity voice vector. It follows that a low-dimensional vector separates into the speech-content-related item plus the product of the identity voice vector and the linear transformation; hence $S(\beta_s - \beta_t)$, the product of the linear transformation $S$ and the difference between the sample identity voice vector $\beta_s$ and the target identity voice vector $\beta_t$, can be understood as the difference between the low-dimensional vector of the sample voice and the low-dimensional vector of the target voice.
Then, step S1032 determines a third low-dimensional vector based on the difference and the first low-dimensional vector, where:

$w' = w + S(\beta_s - \beta_t)$

where $w'$ is the third low-dimensional vector, $w$ is the first low-dimensional vector, $\beta_s$ is the sample identity voice vector, and $\beta_t$ is the target identity voice vector. Based on step S1031, $S(\beta_s - \beta_t)$ is the difference between the low-dimensional vector of the sample voice and that of the target voice, so the formula above transforms $w$ into the third low-dimensional vector $w'$, i.e. the converted low-dimensional vector.
Finally, step S1033 determines the final converted speech based on the third low-dimensional vector. In such an embodiment, the second low-dimensional vectors of the sample speech are extracted first, so that after the training phase the identity voice vectors of all speakers in the sample are available; the low-dimensional vector of the speech to be converted is then obtained and shifted to the third low-dimensional vector $w'$. The change in the low-dimensional vector is inversely decompressed to obtain the change in the supervector; since the supervector is formed from the means of the speech GMM, the change in the GMM means, and hence the offset of the MFCC features, can be obtained, and the MFCC features can then be directly synthesized into the final converted speech. A sketch of this step follows.
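Assuming a linear compression such as the PCA above, the identity swap and the inverse decompression of this step can be sketched as follows; reading the result as per-Gaussian mean offsets is an illustrative interpretation of the change of the GMM means.

```python
# Hedged sketch of step S1033: shift the low-dimensional vector by
# S(beta_s - beta_t), map the change back to the supervector space, and read
# off the per-Gaussian mean offsets that bias the MFCC features for synthesis.
import numpy as np
from sklearn.decomposition import PCA

def mfcc_mean_offsets(S: np.ndarray, beta_sample: np.ndarray, beta_target: np.ndarray,
                      pca: PCA, n_components: int, n_mfcc: int) -> np.ndarray:
    delta_w = S @ (beta_sample - beta_target)  # change in the low-dimensional vector
    # PCA's inverse map is affine (x @ components_ + mean_), so a difference of
    # low-dimensional vectors maps back through the components alone.
    delta_supervector = delta_w @ pca.components_
    return delta_supervector.reshape(n_components, n_mfcc)  # per-Gaussian mean offsets
```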
Fig. 4 is a second flowchart of a speech conversion method provided by the present invention, as shown in fig. 4, fig. 4 shows steps before determining final converted speech based on a sample identity speech vector and a target identity speech vector, including:
First, step S201 is executed: a Gaussian mixture model is used to fit a generic speech model to the sample speech, where the generic speech model at least includes MFCC features and the sample speech is a speech set fusing multiple pieces of speech. A Gaussian mixture model decomposes a phenomenon into several Gaussian probability density functions and quantifies it precisely with them; the MFCC features of the generic speech model are extracted and determined by MFCC feature extraction.
Then, step S202 is executed to process the generic speech model to determine a plurality of second low-dimensional vectors, which are the vectors corresponding to the generic speech model that contain the multiple pieces of speech information; this step is further described with reference to Fig. 5.
Finally, step S203 is executed to extract the identity voice information of any one of the second low-dimensional vectors and obtain the sample identity voice vector. PLDA can separate out the identity voice information corresponding to each second low-dimensional vector, where the identity voice information is the product of the identity voice vector and the linear transformation. The purpose of this step is to determine, based on the user's selection, one of the second low-dimensional vectors as the sample low-dimensional vector to be converted to, and thereby determine the sample identity voice vector.
Fig. 5 is a schematic flowchart of the process for determining a plurality of second low-dimensional vectors according to the present invention, and as shown in fig. 5, the specific steps of processing the generic speech model to determine a plurality of second low-dimensional vectors include:
First, step S2021 is performed: an estimate for each voice in the generic speech model is determined using a maximum likelihood estimation algorithm, and the estimates are accumulated to obtain the supervector for each voice. Specifically, determining the second low-dimensional vectors first requires adjusting the GMM means of each voice in the generic speech model, which is the maximum likelihood estimation process; after maximum likelihood estimation the maximum probability of each voice's MFCCs is obtained, and the adjusted means are then stacked to obtain the supervector representation of each voice.
Then, step S2022 is performed: the supervectors of all the voices are compressed to obtain the plurality of second low-dimensional vectors; because the supervectors are of high dimension, they are compressed to a few hundred dimensions.
Fig. 6 is a schematic flow chart of obtaining a sample identity speech vector provided by the present invention, and as shown in fig. 6, fig. 6 shows a specific step of extracting identity speech information of any one of the second low-dimensional vectors to obtain a sample identity speech vector, which includes:
First, step S2031 separates the second low-dimensional vectors based on probabilistic linear discriminant analysis (PLDA) to obtain multiple candidate identity voice vectors. The separation is the same as in step S102 above, where the first low-dimensional vector is separated to obtain the target identity voice vector, namely:

$w = S\beta + c + b$

where $w$ is the low-dimensional vector, $b$ is a constant, $S$ is a linear transformation, $c$ is the speech-content-related item, and $\beta$ is the identity voice vector. Separating the plurality of second low-dimensional vectors correspondingly with PLDA yields the plurality of candidate identity voice vectors, which serve as alternative samples for the user; when the user needs to choose, one of the candidate identity voice vectors is selected.
Then, step S2032 processes the plurality of candidate identity voice vectors based on an extraction policy to obtain the sample identity voice vector. The extraction policy includes random extraction, in which one of the candidate identity voice vectors is selected at random as the sample identity voice vector; sequential extraction, in which the candidate identity voice vectors are numbered and traversed cyclically in order, one being selected each time; and specified extraction, in which the user's preference determines which candidate identity voice vector is selected each time. An illustrative implementation is sketched below.
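The three strategies admit a straightforward implementation; the pool class below is a hypothetical helper, not a structure defined here.

```python
# Illustrative pool of candidate identity voice vectors supporting the three
# extraction strategies named above; the class itself is an assumption.
import random

class IdentityVectorPool:
    def __init__(self, candidates):
        self.candidates = list(candidates)  # candidate identity voice vectors
        self._cursor = 0                    # position used by sequential extraction

    def random_extraction(self):
        """Pick one candidate uniformly at random."""
        return random.choice(self.candidates)

    def sequential_extraction(self):
        """Cycle through the numbered candidates in order, one per conversion."""
        vec = self.candidates[self._cursor % len(self.candidates)]
        self._cursor += 1
        return vec

    def specified_extraction(self, index: int):
        """Return the candidate the user designated."""
        return self.candidates[index]
```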
Fig. 7 is a schematic structural diagram of a speech conversion apparatus provided in the present invention, including:
the first determination device 1: the first low-dimensional vector corresponding to the target speech is determined based on the target speech, and the working principle of the first determining apparatus 1 may refer to the foregoing step S101, which is not described herein again.
The voice conversion apparatus further includes an acquisition means: the first low-dimensional vector is separated to obtain the target identity voice vector, and the working principle of the obtaining apparatus 2 may refer to the step S102, which is not described herein again.
The speech conversion apparatus further includes second determination means 3: the final converted voice is determined based on the sample identity voice vector and the target identity voice vector, and the working principle of the second determining device 3 may refer to the step S103, which is not described herein again.
Fig. 8 illustrates the physical structure of an electronic device. As shown in Fig. 8, the electronic device may include: a processor 310, a communications interface 320, a memory 330 and a communication bus 340, wherein the processor 310, the communications interface 320 and the memory 330 communicate with each other via the communication bus 340. The processor 310 may invoke logic instructions in the memory 330 to perform the speech conversion method, comprising: determining a first low-dimensional vector corresponding to a target voice based on the target voice; separating the first low-dimensional vector to obtain a target identity voice vector; and determining a final converted voice based on the sample identity voice vector and the target identity voice vector.
In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, the computer program, when executed by a processor, being capable of executing a speech conversion method provided by the above methods, the method comprising: determining a first low-dimensional vector corresponding to a target voice based on the target voice; separating the first low-dimensional vector to obtain a target identity voice vector; a final converted voice is determined based on the sample identity voice vector and the target identity voice vector.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor, implements a method of speech conversion provided by the above methods, the method comprising: determining a first low-dimensional vector corresponding to a target voice based on the target voice; separating the first low-dimensional vector to obtain a target identity voice vector; a final converted voice is determined based on the sample identity voice vector and the target identity voice vector.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (13)

1. A method of speech conversion, comprising:
determining a first low-dimensional vector corresponding to a target voice based on the target voice;
separating the first low-dimensional vector to obtain a target identity voice vector;
determining a final converted voice based on a sample identity voice vector and the target identity voice vector.
2. The method of claim 1, wherein determining a first low-dimensional vector corresponding to the target speech based on the target speech comprises:
determining MFCC features of the target speech based on Mel frequency cepstral coefficient MFCC feature extraction;
processing the MFCC characteristics of the target voice by adopting a maximum likelihood estimation algorithm to determine a supervector corresponding to the target voice;
compressing the supervector to obtain the first low-dimensional vector.
3. The method of claim 1, wherein the separating the first low-dimensional vector to obtain a target identity voice vector comprises:
$w = S\beta_t + c + b$

wherein $w$ is the first low-dimensional vector, $b$ is a constant, $S$ is a linear transformation, $c$ is the speech-content-related item, and $\beta_t$ is the target identity voice vector.
4. The method of claim 1, wherein determining the final converted speech based on the sample identity speech vector and the target identity speech vector comprises:
determining a difference value between the sample identity voice vector and the target identity voice vector;
determining a third low-dimensional vector based on the difference value and the first low-dimensional vector;
determining a final converted speech based on the third low-dimensional vector.
5. The method of claim 4, wherein determining a third low-dimensional vector based on the difference value and the first low-dimensional vector comprises:
$w' = w + S(\beta_s - \beta_t)$

wherein $w'$ is the third low-dimensional vector, $w$ is the first low-dimensional vector, $\beta_s$ is the sample identity voice vector, $\beta_t$ is the target identity voice vector, and $S$ is a linear transformation.
6. The method of claim 1, prior to determining final converted speech based on the sample identity speech vector and the target identity speech vector, comprising:
fitting a generic speech model to sample speech by using a Gaussian mixture model, wherein the generic speech model at least comprises Mel Frequency Cepstrum Coefficient (MFCC) features, and the sample speech is a speech set fusing a plurality of pieces of speech;
processing the generic speech model to determine a plurality of second low-dimensional vectors, the plurality of second low-dimensional vectors being vectors corresponding to the generic speech model that contain a plurality of pieces of speech information;
and extracting the identity voice information of any one of the second low-dimensional vectors to obtain a sample identity voice vector.
7. The method of claim 6, wherein the processing the generic speech model to determine a plurality of second low-dimensional vectors comprises:
determining an estimate for each voice in the generic speech model by using a maximum likelihood estimation algorithm, and accumulating the estimates to obtain the supervectors corresponding to all the voices;
and compressing the supervectors corresponding to all the voices to obtain a plurality of second low-dimensional vectors.
8. The method of claim 6, wherein extracting identity voice information of any one of the plurality of second low-dimensional vectors to obtain a sample identity voice vector comprises:
separating the second low-dimensional vector based on Probabilistic Linear Discriminant Analysis (PLDA) to obtain a plurality of candidate identity speech vectors;
the plurality of candidate identity voice vectors are processed based on an extraction policy to obtain a sample identity voice vector.
9. The method of claim 8, wherein the extraction strategy comprises:
random extraction;
sequential extraction; and
specified extraction.
10. The method of claim 1, prior to determining a first low-dimensional vector corresponding to a target speech based on the target speech, comprising:
determining a rotation frequency based on the number of times the sample voice has been used and the elapsed time;
and judging whether to replace the sample voice based on the rotation frequency and a sample voice cluster, wherein the sample voice cluster consists of a plurality of sample voices.
11. A speech conversion apparatus, comprising:
a first determining means for determining a first low-dimensional vector corresponding to a target voice based on the target voice;
an obtaining means for separating the first low-dimensional vector to obtain a target identity voice vector;
a second determining means for determining a final converted voice based on a sample identity voice vector and the target identity voice vector.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the speech conversion method according to any of claims 1 to 10 are implemented when the processor executes the program.
13. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech conversion method according to any one of claims 1 to 10.
Application CN202210308610.2A (filed 2022-03-28): Voice conversion method, device, equipment and storage medium; status: Pending; publication: CN114420142A.

Priority Applications (1)

Application Number: CN202210308610.2A
Priority Date / Filing Date: 2022-03-28
Title: Voice conversion method, device, equipment and storage medium

Publications (1)

Publication Number: CN114420142A
Publication Date: 2022-04-29

Family

ID: 81262761

Family Applications (1)

Application Number: CN202210308610.2A (status: Pending)
Title: Voice conversion method, device, equipment and storage medium
Priority Date: 2022-03-28; Filing Date: 2022-03-28

Country Status (1)

Country: CN
Publication: CN114420142A

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065022A * 2018-06-06 2018-12-21 平安科技(深圳)有限公司 I-vector extraction method, speaker recognition method, device, equipment and medium
CN109599091A (en) * 2019-01-14 2019-04-09 南京邮电大学 Multi-to-multi voice conversion method based on STARWGAN-GP and x vector
CN110085254A (en) * 2019-04-22 2019-08-02 南京邮电大学 Multi-to-multi phonetics transfer method based on beta-VAE and i-vector
CN111462729A (en) * 2020-03-31 2020-07-28 因诺微科技(天津)有限公司 Fast language identification method based on phoneme log-likelihood ratio and sparse representation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TOMI KINNUNEN: "Non-parallel voice conversion using i-vector PLDA: towards unifying speaker verification and transformation", ICASSP, 31 December 2017, pages 1-5 *

Similar Documents

Publication Publication Date Title
CN105976812B (en) A kind of audio recognition method and its equipment
Hossan et al. A novel approach for MFCC feature extraction
CN108597496B (en) Voice generation method and device based on generation type countermeasure network
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
CN102509547B (en) Method and system for voiceprint recognition based on vector quantization based
JP5853029B2 (en) Passphrase modeling device and method for speaker verification, and speaker verification system
CN111524527A (en) Speaker separation method, device, electronic equipment and storage medium
CN112071330B (en) Audio data processing method and device and computer readable storage medium
CN106991312B (en) Internet anti-fraud authentication method based on voiceprint recognition
CN110428853A (en) Voice activity detection method, Voice activity detection device and electronic equipment
Bagul et al. Text independent speaker recognition system using GMM
CN102376306B (en) Method and device for acquiring level of speech frame
Ayoub et al. Gammatone frequency cepstral coefficients for speaker identification over VoIP networks
CN115171731A (en) Emotion category determination method, device and equipment and readable storage medium
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
CN113823323A (en) Audio processing method and device based on convolutional neural network and related equipment
Desai et al. Speaker recognition using MFCC and hybrid model of VQ and GMM
CN109545226A (en) A kind of audio recognition method, equipment and computer readable storage medium
CN114141237A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN114125506B (en) Voice auditing method and device
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
Khanna et al. Application of vector quantization in emotion recognition from human speech
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
Koolagudi et al. Speaker recognition in the case of emotional environment using transformation of speech features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 2022-04-29