CN112750446B - Voice conversion method, device and system and storage medium

Info

Publication number: CN112750446B
Application number: CN202011609527.6A
Authority: CN (China)
Other versions: CN112750446A (Chinese-language application publication)
Inventors: 武剑桃, 李秀林
Current assignee: Beibei Qingdao Technology Co ltd
Original assignee: Beibei Qingdao Technology Co ltd
Application filed by Beibei Qingdao Technology Co ltd; priority to CN202011609527.6A
Prior art keywords: loss, speech, voice, source, audio
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003: Changing voice quality, e.g. pitch or formants
    • G10L 21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L 21/013: Adapting to target pitch
    • G10L 2021/0135: Voice conversion or morphing
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the analysis technique


Abstract

The invention provides a voice conversion method, device and system, and a storage medium. The method comprises the following steps: acquiring source speech of a source speaker; performing feature extraction on the source speech to obtain source recognition acoustic features; inputting the source recognition acoustic features into a speech recognition model to obtain a speech posterior probability of the source speaker; inputting the posterior probability vectors corresponding to at least some of a plurality of time frames into a feature conversion model to obtain a target synthesized acoustic feature, the target synthesized acoustic feature comprising synthesized acoustic feature vectors in one-to-one correspondence with those time frames, which include all valid time frames among the plurality of time frames; and performing speech synthesis based on the effective acoustic features to obtain effective speech of the target speaker. The speech recognition model or the feature conversion model also outputs source audio state information, and whether each of the plurality of time frames is a valid time frame or an invalid time frame is determined based on the source audio state information. This joint modeling approach can effectively improve the real-time performance of voice conversion.

Description

Voice conversion method, device and system and storage medium
Technical Field
The present invention relates to the field of speech signal processing technologies, and in particular, to a speech conversion method, apparatus and system, and a storage medium.
Background
In the field of speech signal processing, speech conversion (i.e., voice timbre conversion) is currently one of the more important research directions. Speech conversion aims to modify the timbre of an arbitrary speaker and convert it into the timbre of a fixed target speaker while keeping the spoken content unchanged. Speech conversion involves front-end signal processing, speech recognition, and speech synthesis techniques. A speech conversion system based on automatic speech recognition (ASR) technology can extract speaker-independent features from any source input speech and, through a feature conversion model and a vocoder, further convert these features into speech with the timbre of a specified target speaker.
In existing speech conversion technology, source speech data is generally first input into a pre-trained endpoint detection network to detect the start point and end point of the valid audio signal; the valid audio signal is then input into a speaker-independent automatic speech recognition (SI-ASR) system to extract a speech posterior probability (i.e., a phonetic posteriorgram, PPG), after which subsequent processing is performed. The endpoint detection network is typically built from a deep learning model, which takes time to train, and endpoint detection through the network often has to wait until a portion of the speech data has arrived before the start point of the valid audio signal can be determined, which causes a certain delay.
Disclosure of Invention
In order to at least partially solve the problems in the prior art, a voice conversion method, apparatus and system, and a storage medium are provided.
According to an aspect of the present invention, there is provided a voice conversion method including: acquiring source voice of a source speaker; extracting features of the source voice to obtain source recognition acoustic features of a source speaker; inputting the source recognition acoustic features into a voice recognition model to obtain the voice posterior probability of a source speaker output by the voice recognition model, wherein the voice posterior probability comprises a plurality of posterior probability vectors which are in one-to-one correspondence with a plurality of time frames; inputting posterior probability vectors corresponding to at least some time frames in the plurality of time frames into a feature conversion model to obtain target synthesized acoustic features of a target speaker output by the feature conversion model, wherein the target synthesized acoustic features comprise synthesized acoustic feature vectors corresponding to at least some time frames one by one, each time frame in the plurality of time frames belongs to an effective time frame or an ineffective time frame, the effective time frame refers to a time frame of which the corresponding source voice audio segment is an effective audio segment, the ineffective time frame refers to a time frame of which the corresponding source voice audio segment is an ineffective audio segment, and at least some time frames comprise all effective time frames in the plurality of time frames; performing speech synthesis based on the effective acoustic features to obtain effective speech of the target speaker, wherein the effective acoustic features comprise synthetic acoustic feature vectors which are in one-to-one correspondence with all effective time frames in the target synthetic acoustic features; the voice recognition model or the feature conversion model further outputs source audio state information, the source audio state information comprises a plurality of groups of frame audio state information corresponding to a plurality of time frames one by one, each group of frame audio state information indicates whether a source voice audio segment under the corresponding time frame belongs to an effective audio segment or an ineffective audio segment, and whether each time frame in the plurality of time frames belongs to the effective time frame or the ineffective time frame is determined based on the source audio state information.
Illustratively, after speech synthesis based on the effective acoustic features to obtain the effective speech of the target speaker, the method further comprises: and combining the effective voice with preset mute audio to obtain target voice of a target speaker, wherein the preset mute audio comprises mute audio fragments corresponding to all invalid time frames in a plurality of time frames one by one.
Illustratively, the speech recognition model includes a first shared network layer, a speech posterior probability output layer, and an audio state output layer, and inputting the source recognition acoustic features into the speech recognition model to obtain a speech posterior probability of the source speaker output by the speech recognition model includes: inputting the source identification acoustic features into a first shared network layer to obtain first shared features output by the first shared network layer; and respectively inputting the first shared features into the voice posterior probability output layer and the audio state output layer to obtain the voice posterior probability output by the voice posterior probability output layer and the source audio state information output by the audio state output layer.
Illustratively, the speech recognition model further outputting source audio state information, inputting posterior probability vectors corresponding to at least some of the plurality of time frames into the feature transformation model to obtain target synthesized acoustic features of the target speaker output by the feature transformation model includes: determining whether each of the plurality of time frames belongs to a valid time frame or an invalid time frame based on the source audio state information; extracting posterior probability vectors corresponding to all effective time frames from the posterior probability of the voice; the extracted posterior probability vector is input into a feature transformation model to obtain a target synthetic acoustic feature.
Illustratively, the speech recognition model also outputs source audio state information, and the method further comprises, prior to obtaining the source speech of the source speaker: acquiring sample training voice of a sample speaker, labeling voice type information corresponding to the sample training voice and labeling audio state information corresponding to the sample training voice, wherein the labeling voice type information is used for indicating voice types included in the sample training voice, and the labeling audio state information is used for indicating whether each audio segment in the sample training voice belongs to an effective audio segment or an ineffective audio segment; extracting characteristics of the sample training voice to obtain sample recognition acoustic characteristics of a sample speaker; inputting the sample recognition acoustic features into a voice recognition model to obtain the predicted voice posterior probability and the predicted audio state information of the sample speaker output by the voice recognition model; calculating a first loss based on the labeled speech class information and the predicted speech posterior probability; calculating a second loss based on the labeled audio state information and the predicted audio state information; calculating a first total loss by combining the first loss and the second loss; the speech recognition model is trained based on the first total loss.
Illustratively, calculating the first total loss in combination with the first loss and the second loss includes:
calculating a first total loss based on the following formula:
loss_net1 = α1*loss1 + β1*loss2,
α1 + β1 = 1;
Where loss_net1 is the first total loss, loss1 is the first loss, loss2 is the second loss, α1 and β1 are preset coefficients, and the value ranges of α1 and β1 are (0, 1).
Illustratively, calculating the first total loss in combination with the first loss and the second loss includes:
calculating a first total loss based on the following formula:
loss_net1 = f1(loss1*loss2) + α2*loss1 + β2*loss2;
Where loss_net1 is the first total loss, loss1 is the first loss, loss2 is the second loss, f1(loss1*loss2) is a preset function related to loss1 and loss2, α2 and β2 are preset coefficients, and the value ranges of α2 and β2 are (0, 1).
Illustratively, the feature transformation model includes a second shared network layer, a transformed feature output layer, and an audio state output layer, and inputting posterior probability vectors corresponding to at least some of the plurality of time frames into the feature transformation model to obtain a target synthesized acoustic feature of a target speaker output by the feature transformation model includes: inputting the posterior probability of the voice into a second shared network layer to obtain a second shared characteristic output by the second shared network layer; and respectively inputting the second shared characteristic into a conversion characteristic output layer and an audio state output layer to obtain the target synthesized acoustic characteristic output by the conversion characteristic output layer and the source audio state information output by the audio state output layer.
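A minimal sketch of this variant is given below, assuming PyTorch; the bidirectional LSTM shared layer, its hidden size, the PPG dimensionality and the synthesized-feature dimensionality are illustrative assumptions only, not values fixed by this description.

```python
# Sketch (assumed layer types and sizes) of the variant in which the feature
# conversion model itself carries the audio state output layer: a second shared
# network layer feeding a converted-feature (regression) head and an audio
# state (classification) head.
import torch
import torch.nn as nn

class JointFeatureConversionModel(nn.Module):
    def __init__(self, ppg_dim=218, hidden_dim=256, synth_feat_dim=60):
        super().__init__()
        self.shared = nn.LSTM(ppg_dim, hidden_dim, num_layers=2,
                              batch_first=True, bidirectional=True)    # second shared network layer
        self.feature_head = nn.Linear(2 * hidden_dim, synth_feat_dim)  # target synthesized acoustic feature
        self.state_head = nn.Linear(2 * hidden_dim, 2)                 # source audio state information

    def forward(self, ppg):                  # ppg: (batch, num_time_frames, ppg_dim)
        shared, _ = self.shared(ppg)         # second shared feature
        return self.feature_head(shared), self.state_head(shared)
```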
Illustratively, the feature transformation model also outputs source audio state information, and the method further comprises, prior to speech synthesis based on the valid acoustic features to obtain valid speech of the target speaker: determining whether each of the plurality of time frames belongs to a valid time frame or an invalid time frame based on the source audio state information; and extracting synthetic acoustic feature vectors corresponding to all effective time frames from the target synthetic acoustic features to obtain effective acoustic features.
Illustratively, the feature transformation model also outputs source audio state information, and the method further comprises, prior to obtaining the source speech of the source speaker: acquiring target training voice of a target speaker and labeled audio state information corresponding to the target training voice, wherein the labeled audio state is used for indicating whether each audio segment in the target training voice belongs to an effective audio segment or an ineffective audio segment; extracting features of the target training voice to obtain labeled recognition acoustic features and labeled synthesis acoustic features of the target speaker; inputting the labeling recognition acoustic features into a voice recognition model to obtain the predicted voice posterior probability of the target speaker output by the voice recognition model; inputting the posterior probability of the predicted voice into a feature conversion model to obtain the predicted synthesized acoustic feature and the predicted audio state information of the target speaker output by the feature conversion model; calculating a third loss based on the labeled and predicted synthesized acoustic features; calculating a fourth loss based on the labeled audio state information and the predicted audio state information; calculating a second total loss by combining the third loss and the fourth loss; the feature transformation model is trained based on the second total loss.
Illustratively, calculating the second total loss in combination with the third loss and the fourth loss includes:
The second total loss is calculated based on the following formula:
loss_net2 = α3*loss3 + β3*loss4,
α3 + β3 = 1;
Where loss_net2 is the second total loss, loss3 is the third loss, loss4 is the fourth loss, α3 and β3 are preset coefficients, and the value ranges of α3 and β3 are (0, 1).
Illustratively, calculating the second total loss in combination with the third loss and the fourth loss includes:
The second total loss is calculated based on the following formula:
loss_net2 = f2(loss3*loss4) + α4*loss3 + β4*loss4;
Where loss_net2 is the second total loss, loss3 is the third loss, loss4 is the fourth loss, f2(loss3*loss4) is a preset function related to loss3 and loss4, α4 and β4 are preset coefficients, and the value ranges of α4 and β4 are (0, 1).
Illustratively, the speech recognition model includes one or more of the following network models: a long short-term memory network model, a convolutional neural network model, a time-delay neural network model, and a deep neural network model; and/or the feature conversion model includes one or more of the following network models: a tensor-to-tensor network model, a convolutional neural network model, a sequence-to-sequence model, an attention model.
Illustratively, the source recognition acoustic feature is a Mel-frequency cepstral coefficient feature, a perceptual linear prediction feature, a filter bank feature, or a constant-Q cepstral coefficient feature, and the target synthesized acoustic feature is a Mel-cepstral feature, a Mel-frequency line spectrum pair feature, a line spectrum pair feature based on Mel-generalized cepstral analysis, or a linear predictive coding feature.
According to another aspect of the present invention, there is provided a voice conversion apparatus including: the acquisition module is used for acquiring the source voice of the source speaker; the extraction module is used for extracting the characteristics of the source voice so as to obtain the source identification acoustic characteristics of the source speaker; the first input module is used for inputting the source recognition acoustic characteristics into the voice recognition model so as to obtain the voice posterior probability of the source speaker output by the voice recognition model, wherein the voice posterior probability comprises a plurality of posterior probability vectors which are in one-to-one correspondence with a plurality of time frames; the second input module is used for inputting posterior probability vectors corresponding to at least partial time frames in the plurality of time frames into the feature conversion model to obtain target synthesized acoustic features of a target speaker output by the feature conversion model, wherein the target synthesized acoustic features comprise synthesized acoustic feature vectors corresponding to at least partial time frames one by one, each time frame in the plurality of time frames belongs to an effective time frame or an ineffective time frame, the effective time frame refers to a time frame of which the corresponding source voice audio segment is an effective audio segment, the ineffective time frame refers to a time frame of which the corresponding source voice audio segment is an ineffective audio segment, and at least partial time frames comprise all effective time frames in the plurality of time frames; the synthesis module is used for carrying out voice synthesis based on the effective acoustic features to obtain the effective voice of the target speaker, wherein the effective acoustic features comprise synthetic acoustic feature vectors which are in one-to-one correspondence with all effective time frames in the target synthetic acoustic features; the voice recognition model or the feature conversion model further outputs source audio state information, the source audio state information comprises a plurality of groups of frame audio state information corresponding to a plurality of time frames one by one, each group of frame audio state information indicates whether a source voice audio segment under the corresponding time frame belongs to an effective audio segment or an ineffective audio segment, and whether each time frame in the plurality of time frames belongs to the effective time frame or the ineffective time frame is determined based on the source audio state information.
According to another aspect of the present invention, there is also provided a speech conversion system comprising a processor and a memory, wherein the memory stores computer program instructions for executing the above-described speech conversion method when the computer program instructions are executed by the processor.
According to another aspect of the present invention, there is also provided a storage medium on which program instructions are stored, the program instructions being operable, when executed, to perform the above-described speech conversion method.
According to the voice conversion method, device, system and storage medium of the present invention, a function for discriminating the audio state is added to the speech recognition model or the feature conversion model by means of joint modeling. During voice conversion, the audio state can thus be judged in synchronization with speech recognition or feature conversion. Compared with the prior art that adopts an endpoint detection network, this scheme saves network training time and avoids delay in the conversion process, so the real-time performance of voice conversion can be effectively improved.
This summary introduces a selection of concepts in simplified form that are further described in the detailed description section. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Advantages and features of the invention are described in detail below with reference to the accompanying drawings.
Drawings
The following drawings are included to provide an understanding of the invention and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the invention and, together with their description, serve to explain the principles of the invention. In the drawings:
FIG. 1 shows a schematic flow chart of a speech conversion method according to one embodiment of the invention;
FIG. 2 shows a schematic flow diagram of a training and conversion phase of a speech conversion system according to one embodiment of the invention;
FIG. 3 shows a schematic diagram of a training architecture of a speech recognition model according to one embodiment of the invention;
FIG. 4 shows a schematic flow chart of a training and conversion phase of a speech conversion system according to another embodiment of the invention;
FIG. 5 shows a schematic diagram of a training architecture of a feature transformation model, according to one embodiment of the invention;
FIG. 6 shows a schematic block diagram of a speech conversion apparatus according to one embodiment of the present invention; and
Fig. 7 shows a schematic block diagram of a speech conversion system according to one embodiment of the invention.
Detailed Description
In the following description, numerous details are provided to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the following description illustrates preferred embodiments of the invention by way of example only and that the invention may be practiced without one or more of these details. Furthermore, some technical features that are known in the art have not been described in detail in order to avoid obscuring the invention.
The existing speech conversion scheme based on ASR technology first extracts acoustic features from massive speech training data, obtains the corresponding phoneme state set from pre-labeled texts corresponding to the speech training data, models the relation between the acoustic features and the phoneme states with a deep learning model, and trains an SI-ASR model. The trained SI-ASR model can then be used to extract PPGs from the speech of the target speaker. A PPG is a speaker-independent speech posterior probability and is mainly used to characterize the audio content of speech. The correspondence between the acoustic features of the target speaker and the PPG can then be modeled with a deep learning model, which is trained to yield a feature conversion model (Feature Converter, FC). After all models are trained, when the speech of an arbitrary source speaker is input, the PPG of the source speaker is first extracted through the SI-ASR model, the PPG is then converted through the FC into the acoustic features of the target speaker, and speech is finally synthesized through a vocoder. The resulting target speech has the same content as the source speech, and its timbre is essentially that of the target speaker.
As described above, in existing speech conversion technology, the source speech data is generally input into a pre-trained endpoint detection network to detect the start point and end point of the valid audio signal, and the valid audio signal is then input into the SI-ASR system to extract the PPG. Speech conversion technology that relies on an endpoint detection network requires long network training time and introduces delay into the conversion process, making it difficult to meet the real-time requirements of voice conversion.
In order to at least partially solve the above technical problems, embodiments of the present invention provide a voice conversion method, apparatus and system, and a storage medium. According to the embodiments of the present invention, a function for discriminating the audio state is added to the speech recognition model or the feature conversion model by means of joint modeling (corresponding modeling units are added to the model), so that the source speech can be processed without an endpoint detection network: the audio state is predicted along with the PPG or the synthesized acoustic features of the target speaker, speech synthesis can be performed only on the synthesized acoustic feature vectors corresponding to the valid audio segments, and the voice conversion result corresponding to the valid audio segments is output. The joint modeling approach provided by the invention not only saves network training time, but also avoids delay in the conversion process and improves the real-time performance of voice conversion. In addition, joint modeling also helps improve the accuracy of the voice conversion result.
For ease of understanding, the implementation of the speech conversion method according to an embodiment of the present invention will be described below in conjunction with fig. 1-5. First, fig. 1 shows a schematic flow chart of a speech conversion method 100 according to one embodiment of the invention. As shown in fig. 1, the voice conversion method 100 includes steps S110-S150.
In step S110, the source speech of the source speaker is acquired.
In step S120, feature extraction is performed on the source speech to obtain source-identifying acoustic features of the source speaker.
For distinction, in the present invention, acoustic features obtained by feature extraction and input into a speech recognition model for speech recognition are referred to as recognition acoustic features (similar to the acoustic features used in conventional speech recognition techniques), and acoustic features input into a vocoder for speech synthesis are referred to as synthesized acoustic features (similar to the acoustic features used in conventional speech synthesis techniques).
The feature extraction described herein may be implemented using any existing or future feature extraction method that may be considered part of speech recognition. Illustratively, the source-identifying acoustic features of the source speaker extracted herein may be mel-frequency cepstrum coefficient features (MFCCs), perceptual linear prediction features (PLPs), filter bank features (FBank), or constant Q cepstrum coefficient features (CQCC).
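The feature extraction step can be sketched as follows. This is a minimal illustration only, assuming the librosa library, 16 kHz audio, 13-dimensional MFCCs and a 5 ms hop; none of these values is prescribed by the present description.

```python
# Sketch of extracting source recognition acoustic features (here MFCCs) from
# the source speech; dimensionality and framing parameters are assumptions.
import librosa

def extract_source_recognition_features(wav_path, sr=16000, n_mfcc=13, hop_ms=5):
    y, sr = librosa.load(wav_path, sr=sr)                 # read and resample the source speech
    hop_length = int(sr * hop_ms / 1000)                  # one time frame every hop_ms milliseconds
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    return mfcc.T                                         # shape: (num_time_frames, n_mfcc)
```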
In step S130, the source recognition acoustic features are input into the speech recognition model to obtain the PPG (which may be referred to as the source PPG) of the source speaker output by the speech recognition model, the PPG including a plurality of posterior probability vectors corresponding one-to-one to a plurality of time frames. As will be appreciated by those skilled in the art, each posterior probability vector in the PPG may include C1 elements in one-to-one correspondence with C1 voice classes, each element representing the posterior probability of the corresponding voice class at the corresponding time frame, where C1 is an integer greater than 0.
The speech recognition model is the SI-ASR model described above. Illustratively, the speech recognition model may include one or more of the following network models: a long short-term memory network model (LSTM), a convolutional neural network model (CNN), a time-delay neural network model (TDNN), a deep neural network model (DNN).
The PPG includes a set of values corresponding to a time range and a voice class range. The time range includes a plurality of time frames, the voice class range includes a plurality of preset voice classes, and each value in the set represents the posterior probability of the corresponding voice class under the corresponding time frame. In particular, the PPG may be a time-versus-class matrix representing the posterior probability of each speech class for each specific time frame of an utterance. A speech class may refer to a word, a phoneme, or a phoneme state (senone), etc. Where the linguistic content/pronunciation of different speech utterances is the same, the PPGs obtained from the SI-ASR model are the same. In some embodiments, the PPG obtained from the SI-ASR model may represent the articulation of the speech data in a speaker-normalized space and corresponds to the speech content independently of the speaker. PPGs are therefore regarded as a bridge between the source and target speakers.
The time frames described herein may be obtained based on a framing technique, and the time span of any two time frames is the same, e.g. 5 ms. There may be overlap between adjacent time frames. The speech data (e.g., source speech, sample speech, target speech, etc.) may be pre-processed, for example by framing, before feature extraction. Those skilled in the art understand how speech data is pre-processed, which is not repeated here. Through framing, the speech data has an audio segment (e.g., a source speech audio segment) at each of a plurality of time frames; performing feature extraction on the audio segment at each time frame yields an acoustic feature vector, such as an MFCC feature vector, at that time frame. Assuming that the MFCC feature vector of the t-th time frame, denoted Xt, is input into the speech recognition model, the speech recognition model may output a posterior probability vector Pt = (p(s|Xt), s = 1, 2, ..., C1) at the t-th time frame, where p(s|Xt) is the posterior probability of each voice class s. The PPG may comprise the posterior probability vectors at several time frames. It will be understood by those skilled in the art that the number of time frames in the PPG depends on the length of the speech data (e.g., source speech, sample speech, target speech, etc.) input to the voice conversion system, i.e., the number of the "multiple time frames" mentioned in step S130 and the subsequent step S140 depends on the length of the source speech.
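As a sketch of how a posterior probability vector may be formed per time frame, the following assumes a PyTorch recognizer whose last layer emits one logit per speech class; the function and tensor shapes are illustrative assumptions.

```python
# Sketch: the softmax over the class dimension yields, for each time frame t,
# the posterior probability vector Pt = (p(s|Xt), s = 1..C1).
import torch

def compute_ppg(logits: torch.Tensor) -> torch.Tensor:
    """logits: (num_time_frames, C1) raw outputs of the speech recognition model."""
    return torch.softmax(logits, dim=-1)   # each row sums to 1: a posterior probability vector

# usage sketch: ppg = compute_ppg(net1_logits)   # shape (num_time_frames, C1)
```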
In step S140, the posterior probability vectors corresponding to at least some of the time frames are input into the feature conversion model to obtain the target synthesized acoustic feature of the target speaker output by the feature conversion model, where the target synthesized acoustic feature includes synthesized acoustic feature vectors corresponding to at least some of the time frames one to one, each of the time frames belongs to an active time frame or an inactive time frame, the active time frame refers to a time frame in which the corresponding source speech audio segment is an active audio segment, the inactive time frame refers to a time frame in which the corresponding source speech audio segment is an inactive audio segment, and at least some of the time frames include all active time frames in the time frames.
As described above, the source speech may be divided into audio segments corresponding one-to-one to the plurality of time frames, and the audio segment corresponding to each time frame may be referred to as a source speech audio segment. A valid audio segment is an audio segment whose audio content is speech content, and an invalid audio segment is an audio segment whose audio content is non-speech content. Invalid audio segments may include silence, noise, and the like. Each audio segment in the speech data (e.g., source speech, sample speech, target speech, etc.) may have a corresponding audio state, which is classified as valid or invalid and may optionally be represented by the values "1" and "0", respectively. The time frame corresponding to a valid audio segment is a valid time frame, and the time frame corresponding to an invalid audio segment is an invalid time frame.
The target synthesized acoustic feature may comprise synthesized acoustic feature vectors in one-to-one correspondence with the at least some time frames; the number of synthesized acoustic feature vectors may be one or more and depends on the number of the at least some time frames. In one example, all posterior probability vectors corresponding to the plurality of time frames (i.e., the entire PPG) may be input into the feature conversion model to obtain synthesized acoustic feature vectors corresponding one-to-one to the plurality of time frames, which constitute the target synthesized acoustic feature. In another example, only the posterior probability vectors corresponding to all valid time frames among the plurality of time frames may be input into the feature conversion model to obtain synthesized acoustic feature vectors corresponding one-to-one to all valid time frames, which constitute the target synthesized acoustic feature.
Illustratively, the feature conversion model may include one or more of the following network models: a tensor-to-tensor network model (T2T), a CNN, a sequence-to-sequence model (Seq2Seq), an attention model. For example, the feature conversion model may be a bidirectional long short-term memory network model (DBLSTM).
Illustratively, the target synthesized acoustic feature of the target speaker is a Mel-cepstral feature (MCEP), a line spectrum pair feature (LSP), a Mel-frequency line spectrum pair feature (Mel-LSP), a line spectrum pair feature based on Mel-generalized cepstral analysis (MGC-LSP), or a linear predictive coding feature (LPC).
In step S150, performing speech synthesis based on the effective acoustic features to obtain effective speech of the target speaker, where the effective acoustic features include synthetic acoustic feature vectors corresponding to all effective time frames in the target synthetic acoustic features; the voice recognition model or the feature conversion model further outputs source audio state information, the source audio state information comprises a plurality of groups of frame audio state information corresponding to a plurality of time frames one by one, each group of frame audio state information indicates whether a source voice audio segment under the corresponding time frame belongs to an effective audio segment or an ineffective audio segment, and whether each time frame in the plurality of time frames belongs to the effective time frame or the ineffective time frame is determined based on the source audio state information.
By way of example and not limitation, each set of frame audio state information may be represented by a single state value, and the source audio state information may be represented by a sequence of state values. Each state value may be 0 or 1, where 0 represents an invalid audio segment (e.g., silence, noisy ambient sound, etc.) and 1 represents a valid audio segment.
In one embodiment, the speech recognition model may output source audio state information in addition to PPG. In another embodiment, the feature transformation model may output source audio state information in addition to the target synthetic acoustic features. Whether each of the plurality of time frames is a valid time frame or an invalid time frame may be determined based on the source audio state information.
The speech synthesis may be implemented by different vocoders; those skilled in the art will understand such implementations, which are not repeated here. The vocoder may be, for example, WaveRNN, LPCNet, Griffin-Lim, etc.
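As an illustrative sketch of the speech synthesis step, the following assumes (purely for illustration) that the effective acoustic feature is a mel spectrogram and that the Griffin-Lim algorithm, one of the vocoders named above, is used; a neural vocoder such as WaveRNN or LPCNet would replace this call in practice.

```python
# Sketch of synthesizing the effective speech from the effective acoustic
# features, under the assumption that they form a power mel spectrogram.
import librosa

def synthesize_valid_speech(mel_spectrogram, sr=16000, hop_length=80):
    # mel_spectrogram: (n_mels, num_valid_time_frames)
    return librosa.feature.inverse.mel_to_audio(
        mel_spectrogram, sr=sr, hop_length=hop_length)   # Griffin-Lim based inversion
```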
In the speech synthesis, only the synthesized acoustic feature vectors corresponding to the valid time frames may be input to the vocoder, and the information of the invalid time frames may be ignored. Whether each time frame belongs to a valid time frame or an invalid time frame may be determined based on the source audio state information, which may be output by the speech recognition model or the feature conversion model along with the PPG or the target synthesized acoustic features. Thus, in this scheme the audio state is judged in synchronization with speech recognition or feature conversion during voice conversion. Compared with the prior art adopting an endpoint detection network, the scheme saves network training time and avoids delay in the conversion process, so the real-time performance of voice conversion can be effectively improved.
In addition, for some invalid audio signals (such as audio portions containing loud noise), an endpoint detection network cannot detect them accurately, so the invalid audio signals enter the subsequent speech recognition and synthesis stages, which not only wastes computing resources but also seriously affects the accuracy of the voice conversion result. In contrast, according to the voice conversion method of the embodiments of the present invention, the prediction of the audio state is integrated into the speech recognition model or the feature conversion model by means of joint modeling; the audio state modeling unit (the modeling unit related to audio state judgment) therefore participates in the training of the model together with the speech recognition modeling unit (the modeling unit related to speech recognition) or the feature conversion modeling unit (the modeling unit related to feature conversion). In theory, compared with standalone endpoint detection, a speech conversion system trained in this joint modeling manner can achieve higher endpoint detection precision, i.e., the judgment precision for invalid audio can be improved, the waste of computing resources reduced, and the accuracy of the voice conversion result improved.
After speech synthesis is performed based on the effective acoustic features to obtain the effective speech of the target speaker (step S150), the method 100 may further include: combining the effective speech with preset mute audio to obtain the target speech of the target speaker, wherein the preset mute audio comprises mute audio segments corresponding one-to-one to all invalid time frames among the plurality of time frames.
The preset mute audio may include one or more mute audio segments. In one example, each mute audio segment is machine-generated; for example, a mute audio segment may be obtained by setting the energy value of the audio segment to be less than or equal to a preset value. The preset value is set manually and is a relatively small value, for example 0. In another example, each mute audio segment may be a real audio segment recorded in silence: in an actual silent scene, audio segments of different durations (i.e., spanning different numbers of frames) may be collected as the one or more mute audio segments. In the description herein, different audio segments may in general have the same or different durations; unless otherwise specified, however, an "audio segment" refers to the audio segment under a single time frame, in which case the durations of different audio segments are the same.
The corresponding mute audio segments are placed at the timeline positions of the invalid time frames, and the corresponding valid audio segments of the valid speech are placed at the timeline positions of the valid time frames. Each valid audio segment or mute audio segment lasts for the duration of its consecutive time frames. For example, assume that the 5 source speech audio segments corresponding to the 1st to 5th frames of the source speech are invalid audio segments, the 15 source speech audio segments corresponding to the 6th to 20th frames are valid audio segments, and the 10 source speech audio segments corresponding to the 21st to 30th frames are invalid audio segments. Then, in the finally generated target speech, a mute audio segment with a duration of 5 frames is placed at the 1st to 5th frames, a valid audio segment with a duration of 15 frames (this valid audio segment is at least a part of the valid speech) is placed at the 6th to 20th frames, and a mute audio segment with a duration of 10 frames is placed at the 21st to 30th frames, until a corresponding valid audio segment or mute audio segment has been placed at every time frame into which the source speech is divided. In this way, the target speech, i.e., the final voice conversion result, is obtained.
By directly combining the valid speech with the preset mute audio, a target speech that approximately matches the source speech can be generated simply and quickly.
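A minimal sketch of this combination step follows, based on the 30-frame example above; the zero-valued silence and the samples_per_frame parameter are assumptions (a recorded mute audio segment could equally be placed).

```python
# Sketch of combining the valid speech with preset mute audio according to the
# per-frame audio state information.
import numpy as np

def assemble_target_speech(valid_speech, frame_states, samples_per_frame):
    """valid_speech: waveform synthesized for the valid frames only (concatenated);
    frame_states: per-frame list of 0/1, where 1 = valid audio segment, 0 = invalid."""
    out, cursor = [], 0
    for state in frame_states:
        if state == 1:                                   # place the next valid audio segment
            out.append(valid_speech[cursor:cursor + samples_per_frame])
            cursor += samples_per_frame
        else:                                            # place a mute audio segment
            out.append(np.zeros(samples_per_frame, dtype=valid_speech.dtype))
    return np.concatenate(out)

# e.g. frame_states = [0]*5 + [1]*15 + [0]*10 reproduces the 30-frame example above
```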
According to an embodiment of the present invention, the speech recognition model also outputs source audio state information, and the method 100 may further include, prior to acquiring the source speech of the source speaker (step S110): acquiring sample training voice of a sample speaker, labeling voice type information corresponding to the sample training voice and labeling audio state information corresponding to the sample training voice, wherein the labeling voice type information is used for indicating voice types included in the sample training voice, and the labeling audio state information is used for indicating whether each audio segment in the sample training voice belongs to an effective audio segment or an ineffective audio segment; extracting characteristics of the sample training voice to obtain sample recognition acoustic characteristics of a sample speaker; inputting the sample recognition acoustic features into a speech recognition model to obtain predicted PPG and predicted audio state information of a sample speaker output by the speech recognition model; calculating a first loss based on the labeled speech class information and the predicted PPG; calculating a second loss based on the labeled audio state information and the predicted audio state information; calculating a first total loss by combining the first loss and the second loss; the speech recognition model is trained based on the first total loss.
In this embodiment, the source audio state information is also output by the speech recognition model while the PPG is extracted.
The following briefly describes the training and practical application of a speech conversion system according to the present invention with reference to fig. 2. Fig. 2 shows a schematic flow diagram of the training and conversion phases of a speech conversion system according to one embodiment of the invention. The speech conversion system may include a speech recognition model, a feature conversion model, and a vocoder. The overall flow of PPG-based model training and actual speech conversion can be divided into three phases: a first training phase, a second training phase, and a transition phase. The first training phase is a training phase of a speech recognition model, the second training phase is a training phase of a feature conversion model, and the conversion phase refers to an actual conversion phase executed when speech conversion is actually performed after the model is trained.
During the training phase, model training may be performed using the speech of the sample speaker (which may be referred to as the first sample training speech) and the speech of the target speaker (which may be referred to as the first target training speech). The sample speaker and the target speaker may be any speaker, where the target speaker involved in training the model is consistent with the target speaker in the actual speech conversion, and the sample speaker involved in training the model may or may not be consistent with the source speaker in the actual speech conversion. For example, the speech of the sample speaker may be from a TIMIT corpus.
Referring to Fig. 2, in the first training phase, first sample training speech of a sample speaker, labeled speech class information corresponding to the first sample training speech (which may be referred to as first labeled speech class information), and labeled audio state information corresponding to the first sample training speech (which may be referred to as first labeled audio state information) are obtained from a sample speech library (e.g., the above-described TIMIT corpus). The first labeled speech class information and the first labeled audio state information are labeled data (ground truth). The first labeled speech class information may be obtained from pre-labeled text in which each of the speech classes (e.g., phoneme states) included in the first sample training speech is labeled. Feature extraction may be performed on the first sample training speech to obtain sample recognition acoustic features of the sample speaker (which may be referred to as first sample recognition acoustic features). Fig. 2 shows the first sample recognition acoustic feature as an MFCC feature, but this is merely an example and not a limitation of the present invention. Subsequently, the first sample recognition acoustic features may be input into the speech recognition model net1 to obtain a predicted PPG (which may be referred to as a first predicted PPG) and predicted audio state information (which may be referred to as first predicted audio state information) of the sample speaker. The data form of the first predicted PPG may be understood with reference to the above description of the PPG of the source speaker, which is not repeated here. Similarly, the data form of the first predicted audio state information may be understood with reference to the above description of the source audio state information, which is not repeated here. Subsequently, a first loss may be calculated based on the first labeled speech class information and the first predicted PPG, and a second loss may be calculated based on the first labeled audio state information and the first predicted audio state information. The first total loss is calculated by combining the first loss and the second loss in a manner described later. The speech recognition model net1 is then trained based on the first total loss to obtain a trained speech recognition model net1. Those skilled in the art understand how to train a network model based on a loss (i.e., a loss value), which is not described in detail here.
Referring to Fig. 2, in the second training phase, first target training speech of the target speaker is obtained from a target speech library, and feature extraction is performed on the first target training speech to obtain acoustic features of the target speaker. In the feature extraction step of the second training phase, in addition to the synthesized acoustic features of the target speaker (which may be referred to as first labeled synthesized acoustic features), the recognition acoustic features of the target speaker (which may be referred to as first labeled recognition acoustic features) may be extracted. Fig. 2 shows the first labeled synthesized acoustic feature as an MCEP feature and the first labeled recognition acoustic feature as an MFCC feature, but this is merely an example and not a limitation of the present invention. The first labeled recognition acoustic features may then be input into the trained speech recognition model net1 to obtain the PPG of the target speaker output by the model (which may be referred to as a second predicted PPG). The feature conversion model net2 is then trained based on the second predicted PPG and the first labeled synthesized acoustic features to obtain a trained feature conversion model. The trained feature conversion model realizes the mapping between the PPG and the synthesized acoustic features of the target speaker. In Fig. 2, the feature conversion model may be a DBLSTM model, which is merely an example and not a limitation of the present invention.
The vocoder may be pre-trained and may be implemented using a vocoder similar to those used in conventional speech synthesis techniques.
Subsequently, referring to Fig. 2, in the conversion phase, source recognition acoustic features are extracted from the source speech of any source speaker; these may be MFCC features, but this is merely an example. The extracted source recognition acoustic features are input into the trained speech recognition model net1 to obtain the PPG of the source speaker and the source audio state information. Each source speech audio segment in the source speech can be screened based on the source audio state information to determine each valid audio segment and its corresponding valid time frame, and each invalid audio segment and its corresponding invalid time frame. The posterior probability vectors corresponding to the valid time frames can then be input into the trained feature conversion model net2 to obtain the effective acoustic features, and speech synthesis is performed through the vocoder to obtain the effective speech. Optionally, the effective speech may be further combined with preset mute audio to obtain the desired target speech. Alternatively, the effective speech may be output directly as the target speech.
For example, in the conversion phase, additional parameters may also be extracted, such as the fundamental frequency (F0) of the source speech and the aperiodic component (AP). Furthermore, F0 may be linearly converted. These additional parameters may be supplied when speech synthesis is performed in the vocoder; for example, the effective acoustic features of the target speaker may be input to the vocoder together with the converted F0 and the AP to synthesize the effective speech.
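A sketch of one possible linear conversion of F0 follows, assuming a mean-variance mapping of log-F0 from source statistics to target statistics; this particular mapping is a common convention and an assumption here, not something prescribed by the present description.

```python
# Sketch of a linear F0 conversion in the log domain.
import numpy as np

def convert_f0(f0_source, src_mean, src_std, tgt_mean, tgt_std):
    """f0_source: per-frame F0 in Hz (0 for unvoiced frames); statistics are of log-F0."""
    f0_source = np.asarray(f0_source, dtype=np.float64)
    f0_converted = np.zeros_like(f0_source)
    voiced = f0_source > 0
    log_f0 = np.log(f0_source[voiced])
    f0_converted[voiced] = np.exp((log_f0 - src_mean) / src_std * tgt_std + tgt_mean)
    return f0_converted
```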
The general flow of training of a speech conversion system has been described above in connection with fig. 2. The manner in which the loss function of the speech recognition model is calculated during training is described below in conjunction with fig. 3. FIG. 3 shows a schematic diagram of a training architecture of a speech recognition model according to one embodiment of the invention. In the embodiment shown in fig. 3, the speech recognition model is constructed to include a first shared network layer, a PPG output layer, and an audio state output layer.
As shown in Fig. 3, the acoustic feature x of the first sample training speech (i.e., the first sample recognition acoustic feature) is input into the first shared network layer to obtain a first training shared feature. Inputting the first training shared feature into the PPG output layer yields the first predicted PPG, which may be denoted F1(x). Inputting the first training shared feature into the audio state output layer yields the first predicted audio state information, which may be denoted F2(x).
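The structure described above and shown in Fig. 3 can be sketched as follows, assuming PyTorch; the LSTM shared layer, hidden size and number of speech classes are illustrative assumptions only.

```python
# Sketch of the jointly modeled recognition network of Fig. 3: a first shared
# network layer followed by a PPG output layer and an audio state output layer.
import torch
import torch.nn as nn

class JointSpeechRecognitionModel(nn.Module):
    def __init__(self, feat_dim=13, hidden_dim=256, num_classes_c1=218):
        super().__init__()
        self.shared = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        self.ppg_head = nn.Linear(hidden_dim, num_classes_c1)   # F1(x): one logit per speech class
        self.state_head = nn.Linear(hidden_dim, 2)              # F2(x): valid / invalid audio state

    def forward(self, x):                    # x: (batch, num_time_frames, feat_dim)
        shared, _ = self.shared(x)           # first training shared feature
        ppg_logits = self.ppg_head(shared)   # softmax over these gives the predicted PPG
        state_logits = self.state_head(shared)
        return ppg_logits, state_logits
```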
In addition, the first labeled speech class information and the first labeled audio state information may be obtained in advance. The first labeled speech class information may be obtained from pre-labeled text. The pre-labeled text labels the voice class at each time in the first sample training speech, and the labeling may be in units of voice classes or in units of time frames. For example, the pre-labeled text of a sample speech whose content is "hello" (Chinese 你好) may be labeled with the phoneme sequence "n-i-h-ao" and the duration of each phoneme, and the phoneme under each time frame can be determined from this sequence, thereby obtaining the required first labeled speech class information. As another example, the first labeled speech class information of the sample speech with "hello" content may directly label the phoneme under each time frame: if "n" lasts for 10 time frames, the 1st to 10th time frames are labeled as the phoneme "n", and the other phonemes are labeled similarly, which is not repeated here.
Similarly, the first labeled audio state information may be obtained from pre-labeled state information. The pre-labeled state information may label the audio state at each time in the first sample training speech, and the labeling may be in units of consecutive audio segments or in units of time frames. For example, if in the first sample training speech the 10th to 20th frames are valid audio segments and the other frames are invalid audio segments, the beginning (10th frame) and the end (20th frame) of this segment can be marked, and a machine then automatically determines whether the audio segment under each time frame belongs to a valid audio segment or an invalid audio segment; representing the audio state under each time frame by 1 or 0 yields a numerical sequence consisting of 1s and 0s, i.e., the first labeled audio state information. As another example, the audio state of each time frame may be labeled directly, in which case the pre-labeled state information is itself the required first labeled audio state information.
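A sketch of expanding such segment-level labels into per-frame audio state information follows, assuming 1-based inclusive frame indices as in the example above; the function name is hypothetical.

```python
# Sketch: expand segment-level annotations into the per-frame 0/1 sequence.
# For the example above (frames 10-20 valid out of 30), the result has 1s at
# indices 9..19 and 0s elsewhere (0-based array indexing).
import numpy as np

def segments_to_frame_states(num_frames, valid_segments):
    """valid_segments: list of (start_frame, end_frame) pairs, 1-based and inclusive."""
    states = np.zeros(num_frames, dtype=np.int64)
    for start, end in valid_segments:
        states[start - 1:end] = 1          # mark every frame inside a valid segment
    return states

# segments_to_frame_states(30, [(10, 20)]) -> first labeled audio state information
```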
As shown in Fig. 3, the first loss loss1 may be calculated based on the first labeled speech class information and the first predicted PPG, and the second loss loss2 may be calculated based on the first labeled audio state information and the first predicted audio state information. The first total loss loss_net1 is then obtained by combining loss1 and loss2. The parameters of the speech recognition model may be iteratively optimized based on the first total loss loss_net1 until the model converges.
According to an embodiment of the present invention, calculating the first total loss in combination with the first loss and the second loss may include:
calculating a first total loss based on the following formula:
loss_net1 = α1*loss1 + β1*loss2,
α1 + β1 = 1;
Where loss_net1 is the first total loss, loss1 is the first loss, loss2 is the second loss, α1 and β1 are preset coefficients, and the value ranges of α1 and β1 are (0, 1).
According to an embodiment of the present invention, calculating the first total loss in combination with the first loss and the second loss may include:
calculating a first total loss based on the following formula:
loss_net1 = f1(loss1*loss2) + α2*loss1 + β2*loss2;
Where loss_net1 is the first total loss, loss1 is the first loss, loss2 is the second loss, f1(loss1*loss2) is a preset function of loss1 and loss2, α2 and β2 are preset coefficients, and the values of α2 and β2 both lie in the range (0, 1).
Illustratively, f1 may be a log function, a sigmoid function, or the like. Illustratively, f1 may also be a constant function, i.e., its function value is a fixed value.
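As an illustration only, the two ways of combining loss1 and loss2 described above may be written as the following Python sketch, assuming loss1 and loss2 are already-computed scalar loss values; the concrete coefficient values and the choice of a log function for f1 are assumptions, not a limitation of the present invention.

```python
import math


def total_loss_weighted(loss1, loss2, alpha1=0.7, beta1=0.3):
    """loss_net1 = α1*loss1 + β1*loss2, with α1 + β1 = 1 and α1, β1 in (0, 1)."""
    assert abs(alpha1 + beta1 - 1.0) < 1e-9
    return alpha1 * loss1 + beta1 * loss2


def total_loss_with_function(loss1, loss2, alpha2=0.5, beta2=0.5, f1=math.log):
    """loss_net1 = f1(loss1*loss2) + α2*loss1 + β2*loss2, with f1 a preset function."""
    return f1(loss1 * loss2) + alpha2 * loss1 + beta2 * loss2


# Example with illustrative scalar loss values.
print(total_loss_weighted(0.8, 0.2))       # 0.62
print(total_loss_with_function(0.8, 0.2))  # log(0.16) + 0.5 ≈ -1.33
```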
According to an embodiment of the present invention, the speech recognition model includes a first shared network layer, a PPG output layer, and an audio state output layer, and inputting the source recognition acoustic features into the speech recognition model to obtain the PPG of the source speaker output by the speech recognition model (step S130) may include: inputting the source recognition acoustic features into the first shared network layer to obtain a first shared feature output by the first shared network layer; and inputting the first shared feature into the PPG output layer and the audio state output layer respectively, to obtain the PPG output by the PPG output layer and the source audio state information output by the audio state output layer.
In this embodiment, the speech recognition model outputs the source audio state information at the same time as it extracts the PPG. In this case, the speech recognition model may be constructed to include a first shared network layer, a PPG output layer, and an audio state output layer. The model structure of the speech recognition model of this embodiment is shown in fig. 3 and can be understood with reference to that figure.
Illustratively, the first shared network layer may include one or more of the following network models: LSTM, CNN, TDNN, DNN, etc. Alternatively, the first shared network layer may be formed by splicing and combining several of the above network models. Illustratively, each of the PPG output layer and the audio state output layer may include a basic network structure and an output function layer connected to the basic network structure. Illustratively, the basic network structure may include one or more of the following network models: LSTM, CNN, TDNN, DNN, etc. Illustratively, the output function layer may include a softmax function layer or the like.
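For illustration, a minimal PyTorch sketch of the structure in fig. 3 is given below: a first shared network layer (an LSTM here) feeding a PPG output layer and an audio state output layer. All layer types and dimensions are assumptions chosen for the example and not a limitation of the present invention.

```python
import torch
import torch.nn as nn


class SpeechRecognitionNet1(nn.Module):
    """First shared network layer + PPG output layer + audio state output layer (fig. 3)."""

    def __init__(self, feat_dim=39, hidden_dim=256, num_speech_categories=218):
        super().__init__()
        # First shared network layer: an LSTM here; CNN/TDNN/DNN would also be possible.
        self.shared = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        # Each output layer: a basic network structure (Linear) plus a softmax function layer.
        self.ppg_head = nn.Linear(hidden_dim, num_speech_categories)
        self.state_head = nn.Linear(hidden_dim, 2)   # valid vs. invalid audio segment

    def forward(self, acoustic_features):
        # acoustic_features: (batch, num_frames, feat_dim), e.g. per-frame MFCCs.
        shared_feature, _ = self.shared(acoustic_features)
        ppg = torch.softmax(self.ppg_head(shared_feature), dim=-1)             # F1(x)
        audio_state = torch.softmax(self.state_head(shared_feature), dim=-1)   # F2(x)
        return ppg, audio_state


# Usage: ppg, state = SpeechRecognitionNet1()(torch.randn(1, 100, 39))
```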
According to an embodiment of the present invention, the speech recognition model further outputs the source audio state information, and inputting the posterior probability vectors corresponding to at least some of the plurality of time frames into the feature conversion model to obtain the target synthesized acoustic features of the target speaker output by the feature conversion model (step S140) may include: determining whether each of the plurality of time frames belongs to a valid time frame or an invalid time frame based on the source audio state information; extracting the posterior probability vectors corresponding to all valid time frames from the PPG; and inputting the extracted posterior probability vectors into the feature conversion model to obtain the target synthesized acoustic features.
In one example, in the case that the speech recognition model outputs the source audio state information, it may be determined, based on the source audio state information, whether each time frame belongs to a valid time frame or an invalid time frame before feature conversion is performed; the valid time frames are selected before feature conversion, and only the posterior probability vectors corresponding to all the valid time frames are input into the feature conversion model for feature conversion. In another example, in the case that the speech recognition model outputs the source audio state information, the entire PPG may instead be input into the feature conversion model for feature conversion, and the valid time frames are selected before speech synthesis is performed, so that speech synthesis is carried out only based on the synthesized acoustic feature vectors corresponding to all the valid time frames. The former example involves less computation than the latter, which is advantageous for improving the speech conversion speed.
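The former example may be sketched as follows, assuming the PPG is a (number of frames) × (number of classes) array and the source audio state information is a per-frame 0/1 sequence; the names are illustrative assumptions only.

```python
import numpy as np


def select_valid_frames(ppg, source_audio_state):
    """Keep only the posterior probability vectors of valid time frames.

    ppg: array of shape (num_frames, num_classes);
    source_audio_state: length-num_frames sequence of 1 (valid) / 0 (invalid).
    """
    valid_mask = np.asarray(source_audio_state, dtype=bool)
    return ppg[valid_mask]


# Only the returned vectors are then input into the feature conversion model, e.g.
# target_features = feature_conversion_model(select_valid_frames(ppg, states))
```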
According to an embodiment of the present invention, the feature conversion model also outputs the source audio state information, and the method 100 may further include, before acquiring the source speech of the source speaker (step S110): acquiring target training speech of the target speaker and labeled audio state information corresponding to the target training speech, wherein the labeled audio state information is used for indicating whether each audio segment in the target training speech belongs to a valid audio segment or an invalid audio segment; extracting features of the target training speech to obtain labeled recognition acoustic features and labeled synthesized acoustic features of the target speaker; inputting the labeled recognition acoustic features into the speech recognition model to obtain a predicted PPG of the target speaker output by the speech recognition model; inputting the predicted PPG into the feature conversion model to obtain predicted synthesized acoustic features and predicted audio state information of the target speaker output by the feature conversion model; calculating a third loss based on the labeled synthesized acoustic features and the predicted synthesized acoustic features; calculating a fourth loss based on the labeled audio state information and the predicted audio state information; calculating a second total loss by combining the third loss and the fourth loss; and training the feature conversion model based on the second total loss.
The following briefly describes the training and practical application of another speech conversion system according to the present invention with reference to fig. 4. Fig. 4 shows a schematic flow chart of the training and conversion phases of a speech conversion system according to another embodiment of the invention. Similarly to fig. 2, the overall flow of PPG-based model training and actual speech conversion can be divided into three phases: a first training phase, a second training phase, and a conversion phase.
In the training phases, model training may be performed using the speech of a sample speaker (which may be referred to as the second sample training speech) and the speech of the target speaker (which may be referred to as the second target training speech). For example, the second sample training speech may come from the TIMIT corpus.
Referring to fig. 4, in the first training phase, second sample training speech of the sample speaker and labeled speech category information corresponding to the second sample training speech (which may be referred to as second labeled speech category information) are obtained from a sample speech library (e.g., the above-described TIMIT corpus). The second labeled speech category information is similar to the first labeled speech category information described above and is not described again here. Feature extraction may be performed on the second sample training speech to obtain sample recognition acoustic features of the sample speaker (which may be referred to as second sample recognition acoustic features). Fig. 4 shows the second sample recognition acoustic feature as MFCC, but this is merely an example and not a limitation of the present invention. Subsequently, the second sample recognition acoustic features may be input into the speech recognition model net1 to obtain a predicted PPG of the sample speaker (which may be referred to as a third predicted PPG). Subsequently, the speech recognition model net1 may be trained based on the second labeled speech category information and the third predicted PPG, obtaining a trained speech recognition model net1.
Referring to fig. 4, in the second training phase, second target training speech of the target speaker is obtained from the target speech library, together with labeled audio state information corresponding to the second target training speech (which may be referred to as second labeled audio state information). Feature extraction is performed on the second target training speech to obtain acoustic features of the target speaker. In the feature extraction step of the second training phase, in addition to the synthesized acoustic features of the target speaker (which may be referred to as second labeled synthesized acoustic features), the recognition acoustic features of the target speaker (which may be referred to as second labeled recognition acoustic features) may be extracted. Fig. 4 shows the second labeled synthesized acoustic feature as MCEP and the second labeled recognition acoustic feature as MFCC, but this is merely an example and not a limitation of the present invention. Subsequently, the second labeled recognition acoustic features may be input into the trained speech recognition model net1 to obtain the PPG of the target speaker output by the model (which may be referred to as a fourth predicted PPG). Subsequently, the fourth predicted PPG is input into the feature conversion model net2 to obtain the predicted synthesized acoustic features and the predicted audio state information (which may be referred to as second predicted audio state information) of the target speaker output by the feature conversion model net2. Then, a third loss may be calculated based on the second labeled synthesized acoustic features and the predicted synthesized acoustic features, and a fourth loss may be calculated based on the second labeled audio state information and the second predicted audio state information. The second total loss is calculated by combining the third loss and the fourth loss in a manner described later. Subsequently, the feature conversion model net2 is trained based on the second total loss to obtain a trained feature conversion model net2. Those skilled in the art will understand how to train a network model based on a loss (i.e., a loss value), which is not described in detail here.
The vocoder of fig. 4 is similar to the vocoder of fig. 2 and is not described in detail.
Subsequently, referring to fig. 4, in the conversion phase, source recognition acoustic features may be extracted from the source speech of any source speaker; these may be MFCC features, but this is merely an example. The extracted source recognition acoustic features are input into the trained speech recognition model net1 to obtain the PPG of the source speaker. Subsequently, the PPG may be input into the trained feature conversion model net2 to obtain the target synthesized acoustic features and the source audio state information. Each source speech audio segment in the source speech can be screened based on the source audio state information to determine each valid audio segment and its corresponding valid time frame, as well as each invalid audio segment and its corresponding invalid time frame. The synthesized acoustic feature vectors corresponding to all valid time frames are selected from the target synthesized acoustic features, and speech synthesis is performed through the vocoder to obtain valid speech. Optionally, the valid speech may be further combined with preset mute audio to obtain the desired target speech, as sketched below. Alternatively, the valid speech may be output directly as the target speech.
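The optional combination of the valid speech with preset mute audio may be sketched as follows, assuming a fixed number of waveform samples per time frame, zero-valued samples as the preset mute audio, and that the vocoded valid speech contains exactly frame_len samples per valid time frame; these choices are assumptions for illustration, not a limitation of the present invention.

```python
import numpy as np


def assemble_target_speech(valid_audio, source_audio_state, frame_len=200):
    """Splice vocoded valid speech with preset mute audio according to the frame states."""
    chunks, consumed = [], 0
    for state in source_audio_state:
        if state == 1:   # valid time frame: take the next chunk of synthesized audio
            chunks.append(valid_audio[consumed:consumed + frame_len])
            consumed += frame_len
        else:            # invalid time frame: insert a preset mute audio fragment
            chunks.append(np.zeros(frame_len, dtype=valid_audio.dtype))
    return np.concatenate(chunks)
```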
For example, in the conversion phase, additional parameters may also be extracted, such as the fundamental frequency information F0 and the aperiodic component AP of the source speech. Furthermore, F0 may be linearly converted. These additional parameters may be supplied when speech synthesis is performed in the vocoder. For example, the valid acoustic features of the target speaker may be input to the vocoder together with the converted F0 and the AP to synthesize the valid speech.
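The present invention only states that F0 may be linearly converted; one common realization, given here purely as an assumption for illustration, is a linear transformation in the log-F0 domain that maps the source speaker's log-F0 statistics onto those of the target speaker.

```python
import numpy as np


def convert_f0(f0_source, src_log_mean, src_log_std, tgt_log_mean, tgt_log_std):
    """Map the source speaker's log-F0 distribution onto the target speaker's.

    f0_source: float array of per-frame F0 values in Hz (0 for unvoiced frames);
    the four statistics are log-F0 mean/std estimated from each speaker's training data.
    """
    f0_converted = np.zeros_like(f0_source)
    voiced = f0_source > 0                 # unvoiced frames (F0 == 0) stay at 0
    log_f0 = np.log(f0_source[voiced])
    f0_converted[voiced] = np.exp(
        (log_f0 - src_log_mean) / src_log_std * tgt_log_std + tgt_log_mean)
    return f0_converted
```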
The general flow of training of a speech conversion system has been described above with reference to fig. 4. The manner in which the loss of the feature conversion model is calculated during training is described below with reference to fig. 5. Fig. 5 illustrates a schematic diagram of a training architecture of a feature conversion model according to one embodiment of the invention. In the embodiment shown in fig. 5, the feature conversion model is built to include a second shared network layer, a conversion feature output layer, and an audio state output layer.
As shown in fig. 5, the fourth predicted PPG (denoted by y) output by the speech recognition model net1 is input into the second shared network layer to obtain a second training shared feature. Inputting the second training shared feature into the conversion feature output layer yields the predicted synthesized acoustic features, which may be denoted by F3(y). Inputting the second training shared feature into the audio state output layer yields the second predicted audio state information, which may be denoted by F4(y).
Further, the second labeled audio state information may be obtained in advance. The second labeled audio state information is similar to the first labeled audio state information and will not be described again.
As shown in fig. 5, a third loss loss3 may be calculated based on the second labeled synthesized acoustic features and the predicted synthesized acoustic features, and a fourth loss loss4 may be calculated based on the second labeled audio state information and the second predicted audio state information. Subsequently, a second total loss loss_net2 can be obtained by combining loss3 and loss4. The parameters of the feature conversion model may be iteratively optimized based on the second total loss loss_net2 until the model converges.
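A single optimization step of this kind may be sketched as follows in PyTorch, assuming the feature conversion model returns the predicted synthesized acoustic features and per-frame audio state logits; the loss criteria and coefficient values are illustrative assumptions, not a limitation of the present invention.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()          # for loss3: regression on the synthesized acoustic features
ce = nn.CrossEntropyLoss()  # for loss4: valid/invalid audio state classification


def train_step(net2, optimizer, fourth_predicted_ppg,
               labeled_synthesized_features, labeled_audio_states,
               alpha3=0.8, beta3=0.2):
    """One optimization step on loss_net2 = α3*loss3 + β3*loss4 (α3 + β3 = 1)."""
    predicted_features, predicted_state_logits = net2(fourth_predicted_ppg)
    loss3 = mse(predicted_features, labeled_synthesized_features)
    # predicted_state_logits: (batch, num_frames, 2); labeled_audio_states: (batch, num_frames)
    loss4 = ce(predicted_state_logits.transpose(1, 2), labeled_audio_states)
    loss_net2 = alpha3 * loss3 + beta3 * loss4
    optimizer.zero_grad()
    loss_net2.backward()
    optimizer.step()
    return loss_net2.item()
```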
According to an embodiment of the present invention, calculating the second total loss in combination with the third loss and the fourth loss includes:
The second total loss is calculated based on the following formula:
loss_net2 = α3*loss3 + β3*loss4
α3 + β3 = 1;
Where loss_net2 is the second total loss, loss3 is the third loss, loss4 is the fourth loss, α3 and β3 are preset coefficients, and the values of α3 and β3 both lie in the range (0, 1).
According to an embodiment of the present invention, calculating the second total loss in combination with the third loss and the fourth loss includes:
The second total loss is calculated based on the following formula:
loss_net2 = f2(loss3*loss4) + α4*loss3 + β4*loss4;
Where loss_net2 is the second total loss, loss3 is the third loss, loss4 is the fourth loss, f2(loss3*loss4) is a preset function of loss3 and loss4, α4 and β4 are preset coefficients, and the values of α4 and β4 both lie in the range (0, 1).
Illustratively, f2 may be a log function, a sigmoid function, or the like. Illustratively, f2 may also be a constant function, i.e., its function value is a fixed value.
According to an embodiment of the present invention, the feature conversion model includes a second shared network layer, a conversion feature output layer, and an audio state output layer, and inputting the posterior probability vectors corresponding to at least some of the time frames into the feature conversion model to obtain the target synthesized acoustic features of the target speaker output by the feature conversion model (step S140) may include: inputting the PPG into the second shared network layer to obtain a second shared feature output by the second shared network layer; and inputting the second shared feature into the conversion feature output layer and the audio state output layer respectively, to obtain the target synthesized acoustic features output by the conversion feature output layer and the source audio state information output by the audio state output layer.
In this embodiment, the feature conversion model outputs the source audio state information in addition to the target synthesized acoustic features. In this case, the feature conversion model may be constructed to include a second shared network layer, a conversion feature output layer, and an audio state output layer. The model structure of the feature conversion model of this embodiment is shown in fig. 5 and can be understood with reference to that figure.
Illustratively, the second shared network layer may include one or more of the following network models: LSTM, CNN, TDNN, DNN, etc. Alternatively, the second shared network layer may be formed by splicing and combining several of the above network models. Illustratively, each of the conversion feature output layer and the audio state output layer may include a basic network structure and an output function layer connected to the basic network structure. Illustratively, the basic network structure may include one or more of the following network models: LSTM, CNN, TDNN, DNN, etc. Illustratively, the output function layer may include a softmax function layer or the like.
According to an embodiment of the present invention, the feature conversion model also outputs the source audio state information, and the method 100 may further include, before performing speech synthesis based on the valid acoustic features to obtain the valid speech of the target speaker (step S150): determining whether each of the plurality of time frames belongs to a valid time frame or an invalid time frame based on the source audio state information; and extracting the synthesized acoustic feature vectors corresponding to all valid time frames from the target synthesized acoustic features to obtain the valid acoustic features.
The present embodiment may be understood in conjunction with the above description, and will not be described in detail.
In practical speech conversion applications, the sound conversion scenario is complex and the input data are varied, including not only clean human voice and human voice mixed with various background noise, but possibly also segments of pure background noise. In addition, in actual use, users have relatively high requirements on the real-time performance of voice conversion. The present invention combines the audio state prediction function into the modeling process, so that no separate, advance audio state detection is needed; audio state prediction can be performed synchronously with the speech recognition or feature conversion step, which reduces resource consumption and helps improve the real-time performance of voice conversion. In addition, the joint modeling approach gives the voice conversion system more accurate endpoint detection, thereby reducing unnecessary computation and the influence of invalid audio on the voice conversion result, so that the accuracy of the voice conversion result can be improved while the real-time performance is improved.
According to another aspect of the present invention, a voice conversion apparatus is provided. Fig. 6 shows a schematic block diagram of a speech conversion apparatus 600 according to one embodiment of the invention.
As shown in fig. 6, the voice conversion apparatus 600 according to an embodiment of the present invention includes an acquisition module 610, an extraction module 620, a first input module 630, a second input module 640, and a synthesis module 650. The various modules may perform the various steps/functions of the speech conversion method 100 described above in connection with fig. 1, respectively. Only the main functions of the respective components of the voice conversion apparatus 600 will be described below, and the details already described above will be omitted.
The acquisition module 610 is configured to acquire source speech of a source speaker.
The extraction module 620 is configured to perform feature extraction on the source speech to obtain source recognition acoustic features of the source speaker.
The first input module 630 is configured to input the source recognition acoustic feature into the speech recognition model to obtain a speech posterior probability of a source speaker output by the speech recognition model, where the speech posterior probability includes a plurality of posterior probability vectors corresponding to a plurality of time frames one-to-one.
The second input module 640 is configured to input posterior probability vectors corresponding to at least some time frames of the plurality of time frames into the feature conversion model to obtain a target synthesized acoustic feature of the target speaker output by the feature conversion model, where the target synthesized acoustic feature includes a synthesized acoustic feature vector corresponding to at least some time frames one to one, each time frame of the plurality of time frames belongs to an active time frame or an inactive time frame, the active time frame refers to a time frame of the corresponding source voice audio segment being an active audio segment, the inactive time frame refers to a time frame of the corresponding source voice audio segment being an inactive audio segment, and at least some time frames include all active time frames of the plurality of time frames.
The synthesizing module 650 is configured to perform speech synthesis based on the valid acoustic features to obtain valid speech of the target speaker, where the valid acoustic features include synthetic acoustic feature vectors corresponding to all valid time frames in the target synthetic acoustic features; the voice recognition model or the feature conversion model further outputs source audio state information, the source audio state information comprises a plurality of groups of frame audio state information corresponding to a plurality of time frames one by one, each group of frame audio state information indicates whether a source voice audio segment under the corresponding time frame belongs to an effective audio segment or an ineffective audio segment, and whether each time frame in the plurality of time frames belongs to the effective time frame or the ineffective time frame is determined based on the source audio state information.
According to another aspect of the present invention, a speech conversion system is provided. Fig. 7 shows a schematic block diagram of a speech conversion system 700 according to one embodiment of the invention. The speech conversion system 700 includes a processor 710 and a memory 720.
The memory 720 stores computer program instructions for implementing the corresponding steps in the speech conversion method 100 according to an embodiment of the present invention.
The processor 710 is configured to execute computer program instructions stored in the memory 720 to perform the corresponding steps of the speech conversion method 100 according to an embodiment of the present invention.
According to another aspect of the present invention, there is provided a storage medium on which program instructions are stored, which program instructions, when being executed by a computer or a processor, are adapted to carry out the respective steps of the speech conversion method 100 of an embodiment of the present invention and to carry out the respective modules in the speech conversion apparatus 600 according to an embodiment of the present invention. The storage medium may include, for example, a memory card of a smart phone, a memory component of a tablet computer, a hard disk of a personal computer, read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, or any combination of the foregoing storage media.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, e.g., the division of the elements is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another device, or some features may be omitted or not performed.
Similarly, it should be appreciated that in order to streamline the invention and aid in understanding one or more of the various inventive aspects, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the description of exemplary embodiments of the invention. However, the method of the present invention should not be construed as reflecting the following intent: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some of the modules in a speech conversion system according to embodiments of the present invention may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present invention can also be implemented as an apparatus program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present invention may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names.
The foregoing description is merely illustrative of specific embodiments of the present invention and the scope of the present invention is not limited thereto, and any person skilled in the art can easily think about variations or substitutions within the scope of the present invention. The protection scope of the invention is subject to the protection scope of the claims.

Claims (15)

1. A voice conversion method, comprising:
Acquiring source voice of a source speaker;
Extracting features of the source voice to obtain source recognition acoustic features of the source speaker;
Inputting the source recognition acoustic features into a speech recognition model to obtain a speech posterior probability of the source speaker output by the speech recognition model, wherein the speech posterior probability comprises a plurality of posterior probability vectors which are in one-to-one correspondence with a plurality of time frames;
inputting posterior probability vectors corresponding to at least some of the plurality of time frames into a feature conversion model to obtain target synthesized acoustic features of a target speaker output by the feature conversion model, wherein the target synthesized acoustic features comprise synthesized acoustic feature vectors corresponding to the at least some time frames one by one, each of the plurality of time frames belongs to an active time frame or an inactive time frame, the active time frame refers to a time frame in which the corresponding source voice audio segment is an active audio segment, the inactive time frame refers to a time frame in which the corresponding source voice audio segment is an inactive audio segment, and the at least some time frames comprise all active time frames in the plurality of time frames;
Performing speech synthesis based on effective acoustic features to obtain effective speech of the target speaker, wherein the effective acoustic features comprise synthetic acoustic feature vectors, which are in one-to-one correspondence with all effective time frames, in the target synthetic acoustic features;
The voice recognition model or the feature conversion model further outputs source audio state information, the source audio state information comprises a plurality of groups of frame audio state information corresponding to the time frames one by one, each group of frame audio state information indicates whether a source voice audio segment under a corresponding time frame belongs to an effective audio segment or an ineffective audio segment, and each time frame in the time frames belongs to an effective time frame or an ineffective time frame is determined based on the source audio state information;
The speech recognition model also outputs the source audio state information, and the method further comprises, prior to the obtaining the source speech of the source speaker:
Acquiring sample training voice of a sample speaker, labeling voice category information corresponding to the sample training voice and labeling audio state information corresponding to the sample training voice, wherein the labeling voice category information is used for indicating voice categories included in the sample training voice, and the labeling audio state information is used for indicating whether each audio segment in the sample training voice belongs to an effective audio segment or an ineffective audio segment;
Extracting features of the sample training voice to obtain sample recognition acoustic features of the sample speaker;
Inputting the sample recognition acoustic features into the speech recognition model to obtain predicted speech posterior probability and predicted audio state information of the sample speaker output by the speech recognition model;
calculating a first loss based on the labeled speech category information and the predicted speech posterior probability;
calculating a second loss based on the annotated audio state information and the predicted audio state information;
Calculating a first total loss by combining the first loss and the second loss;
Training the speech recognition model based on the first total loss;
Or alternatively,
The feature transformation model also outputs the source audio state information, and the method further comprises, prior to the obtaining the source speech of the source speaker:
acquiring target training voice of the target speaker and labeling audio state information corresponding to the target training voice, wherein the labeling audio state is used for indicating whether each audio segment in the target training voice belongs to an effective audio segment or an ineffective audio segment;
Extracting the characteristics of the target training voice to obtain the labeled recognition acoustic characteristics and labeled synthesis acoustic characteristics of the target speaker;
inputting the labeling recognition acoustic features into the voice recognition model to obtain the predicted voice posterior probability of the target speaker output by the voice recognition model;
Inputting the predicted speech posterior probability into the feature conversion model to obtain predicted synthesized acoustic features and predicted audio state information of the target speaker output by the feature conversion model;
Calculating a third loss based on the labeled synthetic acoustic features and the predicted synthetic acoustic features;
calculating a fourth loss based on the labeled audio state information and the predicted audio state information;
Calculating a second total loss by combining the third loss and the fourth loss;
training the feature transformation model based on the second total loss.
2. The speech conversion method of claim 1, wherein after the speech synthesis based on the effective acoustic features to obtain the effective speech of the target speaker, the method further comprises:
And combining the effective voice with preset mute audio to obtain target voice of the target speaker, wherein the preset mute audio comprises mute audio fragments corresponding to all invalid time frames in the plurality of time frames one by one.
3. The speech conversion method of claim 1 wherein the speech recognition model comprises a first shared network layer, a speech posterior probability output layer, and an audio state output layer, the inputting the source recognition acoustic features into the speech recognition model to obtain the speech posterior probability of the source speaker output by the speech recognition model comprising:
Inputting the source identification acoustic features into the first shared network layer to obtain first shared features output by the first shared network layer;
And respectively inputting the first shared characteristic into the voice posterior probability output layer and the audio state output layer to obtain the voice posterior probability output by the voice posterior probability output layer and the source audio state information output by the audio state output layer.
4. The speech conversion method of claim 1, wherein the speech recognition model further outputs the source audio state information, and the inputting the posterior probability vector corresponding to at least some of the plurality of time frames into the feature conversion model to obtain the target synthesized acoustic feature of the target speaker output by the feature conversion model comprises:
determining whether each of the plurality of time frames belongs to a valid time frame or an invalid time frame based on the source audio state information;
extracting posterior probability vectors corresponding to all effective time frames from the voice posterior probability;
The extracted posterior probability vector is input to the feature transformation model to obtain the target synthetic acoustic feature.
5. The speech conversion method of claim 1, wherein the calculating a first total loss by combining the first loss and the second loss comprises:
calculating the first total loss based on the following formula:
loss_net1 = α1*loss1 + β1*loss2
α1 + β1 = 1;
Wherein loss_net1 is the first total loss, loss1 is the first loss, loss2 is the second loss, α1 and β1 are preset coefficients, and the values of α1 and β1 both lie in the range (0, 1).
6. The speech conversion method of claim 1, wherein the calculating a first total loss by combining the first loss and the second loss comprises:
calculating the first total loss based on the following formula:
loss_net1 = f1(loss1*loss2) + α2*loss1 + β2*loss2;
Wherein loss_net1 is the first total loss, loss1 is the first loss, loss2 is the second loss, f1(loss1*loss2) is a preset function of loss1 and loss2, α2 and β2 are preset coefficients, and the values of α2 and β2 both lie in the range (0, 1).
7. The speech conversion method according to claim 1, wherein the feature conversion model includes a second shared network layer, a conversion feature output layer, and an audio state output layer, and the inputting the posterior probability vector corresponding to at least a portion of the time frames into the feature conversion model to obtain the target synthesized acoustic feature of the target speaker output by the feature conversion model includes:
inputting the voice posterior probability into the second shared network layer to obtain a second shared characteristic output by the second shared network layer;
and respectively inputting the second shared feature into the conversion feature output layer and the audio state output layer to obtain the target synthesized acoustic feature output by the conversion feature output layer and the source audio state information output by the audio state output layer.
8. The speech conversion method of claim 1 wherein the feature conversion model further outputs the source audio state information, the method further comprising, prior to the speech synthesis based on the valid acoustic features to obtain valid speech for the target speaker:
determining whether each of the plurality of time frames belongs to a valid time frame or an invalid time frame based on the source audio state information;
And extracting synthetic acoustic feature vectors corresponding to all effective time frames from the target synthetic acoustic features to obtain the effective acoustic features.
9. The speech conversion method of claim 1, wherein the calculating a second total loss by combining the third loss and the fourth loss comprises:
calculating the second total loss based on the following formula:
loss_net2 = α3*loss3 + β3*loss4
α3 + β3 = 1;
Wherein loss_net2 is the second total loss, loss3 is the third loss, loss4 is the fourth loss, α3 and β3 are preset coefficients, and the values of α3 and β3 both lie in the range (0, 1).
10. The speech conversion method of claim 1, wherein the calculating a second total loss by combining the third loss and the fourth loss comprises:
calculating the second total loss based on the following formula:
loss_net2 = f2(loss3*loss4) + α4*loss3 + β4*loss4;
Wherein loss_net2 is the second total loss, loss3 is the third loss, loss4 is the fourth loss, f2(loss3*loss4) is a preset function of loss3 and loss4, α4 and β4 are preset coefficients, and the values of α4 and β4 both lie in the range (0, 1).
11. The speech conversion method of any one of claims 1 to 10, wherein the speech recognition model comprises one or more of the following network models: a long-term and short-term memory network model, a convolutional neural network model, a time delay neural network model and a deep neural network model; and/or the number of the groups of groups,
The feature transformation model includes one or more of the following network models: tensor-tensor network model, convolutional neural network model, sequence-to-sequence model, attention model.
12. The speech conversion method according to any one of claims 1 to 10, wherein the source-identifying acoustic feature is a mel-frequency cepstral coefficient feature, a perceptual linear prediction feature, a filter bank feature, or a constant Q cepstral coefficient feature,
The target synthesized acoustic features are Mel cepstrum features, line spectrum pair features after Mel frequency, line spectrum pair features based on Mel generalized cepstrum analysis or linear prediction coding features.
13. A speech conversion apparatus comprising:
The acquisition module is used for acquiring the source voice of the source speaker;
The extraction module is used for extracting the characteristics of the source voice so as to obtain the source recognition acoustic characteristics of the source speaker;
a first input module, configured to input the source recognition acoustic feature into a speech recognition model, so as to obtain a speech posterior probability of the source speaker output by the speech recognition model, where the speech posterior probability includes a plurality of posterior probability vectors that are in one-to-one correspondence with a plurality of time frames;
A second input module, configured to input posterior probability vectors corresponding to at least some of the time frames into a feature conversion model to obtain a target synthesized acoustic feature of a target speaker output by the feature conversion model, where the target synthesized acoustic feature includes synthesized acoustic feature vectors corresponding to the at least some time frames one to one, each of the time frames belongs to an active time frame or an inactive time frame, the active time frame refers to a time frame in which the corresponding source voice audio segment is an active audio segment, and the inactive time frame refers to a time frame in which the corresponding source voice audio segment is an inactive audio segment, and the at least some time frames include all active time frames in the time frames;
A synthesis module, configured to perform speech synthesis based on effective acoustic features to obtain effective speech of the target speaker, where the effective acoustic features include synthetic acoustic feature vectors in the target synthetic acoustic features that are in one-to-one correspondence with all effective time frames;
The voice recognition model or the feature conversion model further outputs source audio state information, the source audio state information comprises a plurality of groups of frame audio state information corresponding to the time frames one by one, each group of frame audio state information indicates whether a source voice audio segment under a corresponding time frame belongs to an effective audio segment or an ineffective audio segment, and each time frame in the time frames belongs to an effective time frame or an ineffective time frame is determined based on the source audio state information;
The speech recognition model also outputs the source audio state information, the apparatus being further for:
Before the source voice of the source speaker is obtained, obtaining sample training voice of the sample speaker, labeling voice category information corresponding to the sample training voice and labeling audio state information corresponding to the sample training voice, wherein the labeling voice category information is used for indicating voice categories included in the sample training voice, and the labeling audio state information is used for indicating whether each audio segment in the sample training voice belongs to an effective audio segment or an ineffective audio segment;
Extracting features of the sample training voice to obtain sample recognition acoustic features of the sample speaker;
Inputting the sample recognition acoustic features into the speech recognition model to obtain predicted speech posterior probability and predicted audio state information of the sample speaker output by the speech recognition model;
calculating a first loss based on the labeled speech category information and the predicted speech posterior probability;
calculating a second loss based on the annotated audio state information and the predicted audio state information;
Calculating a first total loss by combining the first loss and the second loss;
Training the speech recognition model based on the first total loss;
Or alternatively,
The feature transformation model also outputs the source audio state information, the apparatus further being for:
before the source voice of the source speaker is acquired, acquiring target training voice of the target speaker and marked audio state information corresponding to the target training voice, wherein the marked audio state is used for indicating whether each audio segment in the target training voice belongs to an effective audio segment or an ineffective audio segment;
Extracting the characteristics of the target training voice to obtain the labeled recognition acoustic characteristics and labeled synthesis acoustic characteristics of the target speaker;
inputting the labeling recognition acoustic features into the voice recognition model to obtain the predicted voice posterior probability of the target speaker output by the voice recognition model;
inputting the predicted speech posterior probability into the feature conversion model before the source speech of the source speaker is acquired to obtain predicted synthesized acoustic features and predicted audio state information of the target speaker output by the feature conversion model;
Calculating a third loss based on the labeled synthetic acoustic features and the predicted synthetic acoustic features;
calculating a fourth loss based on the labeled audio state information and the predicted audio state information;
Calculating a second total loss by combining the third loss and the fourth loss;
training the feature transformation model based on the second total loss.
14. A speech conversion system comprising a processor and a memory, wherein the memory has stored therein computer program instructions which, when executed by the processor, are adapted to carry out the speech conversion method of any of claims 1 to 12.
15. A storage medium having stored thereon program instructions for performing the speech conversion method of any of claims 1 to 12 when run.
CN202011609527.6A 2020-12-30 2020-12-30 Voice conversion method, device and system and storage medium Active CN112750446B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011609527.6A CN112750446B (en) 2020-12-30 2020-12-30 Voice conversion method, device and system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011609527.6A CN112750446B (en) 2020-12-30 2020-12-30 Voice conversion method, device and system and storage medium

Publications (2)

Publication Number Publication Date
CN112750446A CN112750446A (en) 2021-05-04
CN112750446B true CN112750446B (en) 2024-05-24

Family

ID=75649569

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011609527.6A Active CN112750446B (en) 2020-12-30 2020-12-30 Voice conversion method, device and system and storage medium

Country Status (1)

Country Link
CN (1) CN112750446B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113380231B (en) * 2021-06-15 2023-01-24 北京一起教育科技有限责任公司 Voice conversion method and device and electronic equipment
CN113470698B (en) * 2021-06-30 2023-08-08 北京有竹居网络技术有限公司 Speaker conversion point detection method, device, equipment and storage medium
CN113793591B (en) * 2021-07-07 2024-05-31 科大讯飞股份有限公司 Speech synthesis method, related device, electronic equipment and storage medium
CN113724690B (en) * 2021-09-01 2023-01-03 宿迁硅基智能科技有限公司 PPG feature output method, target audio output method and device
CN113724718B (en) 2021-09-01 2022-07-29 宿迁硅基智能科技有限公司 Target audio output method, device and system
CN114495898B (en) * 2022-04-15 2022-07-01 中国科学院自动化研究所 Unified speech synthesis and speech conversion training method and system
CN115065482B (en) * 2022-06-16 2024-05-17 平安银行股份有限公司 Voice recognition method, voice recognition device, terminal equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101322181A (en) * 2005-11-30 2008-12-10 艾利森电话股份有限公司 Effective speech stream conversion
GB201405255D0 (en) * 2014-03-24 2014-05-07 Toshiba Res Europ Ltd Voice conversion
CN107610717A (en) * 2016-07-11 2018-01-19 香港中文大学 Many-one phonetics transfer method based on voice posterior probability
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
JP2019109306A (en) * 2017-12-15 2019-07-04 日本電信電話株式会社 Voice conversion device, voice conversion method and program
CN110223705A (en) * 2019-06-12 2019-09-10 腾讯科技(深圳)有限公司 Phonetics transfer method, device, equipment and readable storage medium storing program for executing
CN110738986A (en) * 2019-10-24 2020-01-31 数据堂(北京)智能科技有限公司 long voice labeling device and method
CN110930981A (en) * 2018-09-20 2020-03-27 深圳市声希科技有限公司 Many-to-one voice conversion system
CN111508498A (en) * 2020-04-09 2020-08-07 携程计算机技术(上海)有限公司 Conversational speech recognition method, system, electronic device and storage medium
CN111816218A (en) * 2020-07-31 2020-10-23 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2853125A1 (en) * 2003-03-27 2004-10-01 France Telecom METHOD FOR ANALYZING BASIC FREQUENCY INFORMATION AND METHOD AND SYSTEM FOR VOICE CONVERSION USING SUCH ANALYSIS METHOD.
FR2868587A1 (en) * 2004-03-31 2005-10-07 France Telecom METHOD AND SYSTEM FOR RAPID CONVERSION OF A VOICE SIGNAL
US20070213987A1 (en) * 2006-03-08 2007-09-13 Voxonic, Inc. Codebook-less speech conversion method and system
KR20080090034A (en) * 2007-04-03 2008-10-08 삼성전자주식회사 Voice speaker recognition method and apparatus
WO2013013319A1 (en) * 2011-07-25 2013-01-31 Rudzicz Frank System and method for acoustic transformation

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101322181A (en) * 2005-11-30 2008-12-10 艾利森电话股份有限公司 Effective speech stream conversion
GB201405255D0 (en) * 2014-03-24 2014-05-07 Toshiba Res Europ Ltd Voice conversion
CN107610717A (en) * 2016-07-11 2018-01-19 香港中文大学 Many-one phonetics transfer method based on voice posterior probability
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
JP2019109306A (en) * 2017-12-15 2019-07-04 日本電信電話株式会社 Voice conversion device, voice conversion method and program
CN110930981A (en) * 2018-09-20 2020-03-27 深圳市声希科技有限公司 Many-to-one voice conversion system
CN110223705A (en) * 2019-06-12 2019-09-10 腾讯科技(深圳)有限公司 Phonetics transfer method, device, equipment and readable storage medium storing program for executing
CN110738986A (en) * 2019-10-24 2020-01-31 数据堂(北京)智能科技有限公司 long voice labeling device and method
CN111508498A (en) * 2020-04-09 2020-08-07 携程计算机技术(上海)有限公司 Conversational speech recognition method, system, electronic device and storage medium
CN111816218A (en) * 2020-07-31 2020-10-23 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Che Yingxia, Yu Yibiao. "Structured Gaussian Mixture Model under Constraint Conditions and Non-parallel Corpus Voice Conversion". Acta Electronica Sinica. 2016-09-15; Vol. 44, No. 09; 2282-2288 *
Song Nan, Wu Peiwen, Yang Hongwu. "Sign Language to Chinese-Tibetan Bilingual Emotional Speech Conversion Incorporating Facial Expressions". Technical Acoustics. 2018-08-15; Vol. 37, No. 04; 372-379 *
Du, Zhihao et,al..《PAN: PHONEME-AWARE NETWORK FOR MONAURAL SPEECH ENHANCEMENT》.《IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)》.2020,6634-6638. *
Miyoshi, Hiroyuki et,al..《Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities》.《18th Annual Conference of the International-Speech-Communication-Association (INTERSPEECH 2017)》.2017,1268-1272. *
Zhou Yi et,al..《CROSS-LINGUAL VOICE CONVERSION WITH BILINGUAL PHONETIC POSTERIORGRAM AND AVERAGE MODELING》.《44th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)》.2019,6790-6794. *

Also Published As

Publication number Publication date
CN112750446A (en) 2021-05-04

Similar Documents

Publication Publication Date Title
CN112750446B (en) Voice conversion method, device and system and storage medium
US11373633B2 (en) Text-to-speech processing using input voice characteristic data
CN110097894B (en) End-to-end speech emotion recognition method and system
CN112017644B (en) Sound transformation system, method and application
JP6777768B2 (en) Word vectorization model learning device, word vectorization device, speech synthesizer, their methods, and programs
CN102779508B (en) Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
CN110706690A (en) Speech recognition method and device
CN112581963B (en) Voice intention recognition method and system
CN102013253A (en) Speech recognition method based on speed difference of voice unit and system thereof
CN112750445B (en) Voice conversion method, device and system and storage medium
CN111916054B (en) Lip-based voice generation method, device and system and storage medium
Vegesna et al. Application of emotion recognition and modification for emotional Telugu speech recognition
CN111968622A (en) Attention mechanism-based voice recognition method, system and device
CN112185342A (en) Voice conversion and model training method, device and system and storage medium
Dave et al. Speech recognition: A review
Kanabur et al. An extensive review of feature extraction techniques, challenges and trends in automatic speech recognition
Shen et al. Self-supervised pre-trained speech representation based end-to-end mispronunciation detection and diagnosis of Mandarin
Kadyan et al. Prosody features based low resource Punjabi children ASR and T-NT classifier using data augmentation
Rao et al. Glottal excitation feature based gender identification system using ergodic HMM
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
CN114724589A (en) Voice quality inspection method and device, electronic equipment and storage medium
Maciel et al. Five–framework for an integrated voice environment
Tripathi et al. Robust vowel region detection method for multimode speech
US20240153494A1 (en) Techniques for generating training data for acoustic models using domain adaptation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: Room 1201, Building B, Phase 1, Innovation Park, No. 1 Keyuan Weiyi Road, Laoshan District, Qingdao City, Shandong Province, 266101

Applicant after: Beibei (Qingdao) Technology Co.,Ltd.

Address before: 100192 a203a, 2 / F, building B-2, Dongsheng Science Park, Zhongguancun, 66 xixiaokou Road, Haidian District, Beijing

Applicant before: DATABAKER (BEIJNG) TECHNOLOGY Co.,Ltd.

Country or region before: China

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant