WO2022083083A1 - A voice conversion system and a training method for the voice conversion system - Google Patents

A voice conversion system and a training method for the voice conversion system

Info

Publication number
WO2022083083A1
WO2022083083A1 · PCT/CN2021/088507 · CN2021088507W
Authority
WO
WIPO (PCT)
Prior art keywords
speech
voice
training
bottleneck
feature
Prior art date
Application number
PCT/CN2021/088507
Other languages
English (en)
French (fr)
Inventor
司马华鹏
毛志强
龚雪飞
Original Assignee
南京硅基智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京硅基智能科技有限公司
Priority to EP21749056.4A priority Critical patent/EP4016526B1/en
Priority to US17/430,793 priority patent/US11875775B2/en
Publication of WO2022083083A1 publication Critical patent/WO2022083083A1/zh

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • G10L19/173Transcoding, i.e. converting between two coded representations avoiding cascaded coding-decoding
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • G10L21/007Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013Adapting to target pitch
    • G10L2021/0135Voice conversion or morphing
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the present disclosure relates to the field of speech computing algorithms, and in particular, to a voice transformation system and a training method for the voice transformation system.
  • voice robots for the purpose of voice interaction have gradually entered the public eye.
  • the emergence of voice robots has changed the working nature of the existing telephone business.
  • voice robots are used in real estate, education, finance, tourism and other industries to perform voice interaction functions, thereby replacing manual voice interaction with users.
  • Speech conversion technology is a research branch of speech signal processing that covers the fields of speaker recognition, speech recognition and speech synthesis; it aims to change the personalized information of speech while keeping the original semantic information unchanged, so that the speech of a particular speaker (i.e., the source speaker) sounds like the speech of another particular speaker (i.e., the target speaker).
  • the statistical conversion method represented by the Gaussian mixture model has become a classic method in this field.
  • this kind of algorithm still has some defects.
  • the classical methods of using a Gaussian mixture model for speech conversion are mostly based on one-to-one conversion tasks and require parallel text, that is, the training sentences used by the source speaker and the target speaker have the same content, and their spectral features need to be aligned frame by frame by dynamic time warping (DTW) before the mapping relationship between spectral features can be obtained through model training; since the text of a non-parallel corpus is not parallel, the above-mentioned Gaussian mixture model cannot be used for speech conversion in that case.
  • the embodiment of the present disclosure proposes a voice conversion solution trained on non-parallel corpora, which removes the dependence on parallel text and can achieve a voice-changing effect with small samples, so as to solve the technical problem that voice conversion is difficult to realize under conditions of limited resources and equipment.
  • a sound transformation system including:
  • the speaker-independent speech recognition model includes at least a bottleneck layer, and the speaker-independent speech recognition model is configured to transform the Mel cepstrum feature of the input source speech into the bottleneck feature of the source speech through the bottleneck layer;
  • An attention voice-changing network which is configured to transform the bottleneck features of the source speech into Mel cepstral features that match the target speech
  • a neural network vocoder configured to convert Mel cepstral features corresponding to the target speech into speech output.
  • a sound transformation method including:
  • a method for training a speaker-independent speech recognition model including:
  • the numbers of the character codes converted from the text in the multi-person speech recognition training corpus and the Mel cepstrum features of the multi-person speech recognition training corpus are input into the speaker-independent speech recognition model, and the back-propagation algorithm is run for iterative optimization until the speaker-independent speech recognition model converges.
  • a training method for an attention voice-changing network including:
  • the bottleneck features of the target speech are input into the basic attention voice-changing network, the Mel cepstrum features corresponding to the target speaker are used as the ground truth, and the attention voice-changing network is trained by the method of deep transfer learning.
  • a method for training a neural network vocoder including:
  • the Mel cepstrum feature of the target speech and the sound signal of the target speech are input into the pre-trained neural network vocoder, and the neural network vocoder is trained by the method of deep transfer learning.
  • a terminal including the aforementioned sound conversion system.
  • a computer-readable storage medium is provided, and a computer program is stored in the computer-readable storage medium, wherein the computer program is configured to execute one of the aforementioned methods when running.
  • a sound conversion system including:
  • a speech recognition model including at least a bottleneck layer, the speech recognition model is configured to: convert the Mel cepstrum features of the input source speech into bottleneck features through the bottleneck layer; and input the bottleneck features from the bottleneck layer into the attention voice-changing network;
  • the attention voice-changing network is configured to: convert the input bottleneck features into Mel cepstral features consistent with the target speech;
  • a neural network vocoder configured to: convert the Mel cepstrum features consistent with the target speech into imitation target speech and output; the imitation target speech is the audio generated by the conversion of the source speech.
  • a ninth aspect provides a training method for a sound transformation system, which is applied to the sound transformation system described in the fifth aspect, including:
  • according to the mapping relationship between the characters in the multi-person speech recognition training corpus and character codes, the characters in the multi-person speech recognition training corpus are converted into numbers
  • the embodiments of the present disclosure get rid of the dependence on parallel text, realize the conversion of any speaker to multiple speakers, improve the flexibility, and solve the technical problem that it is difficult to realize voice conversion under the condition of limited resources and equipment.
  • the trained speaker-independent speech recognition model can be used for any source speaker, that is, it is speaker-independent (Speaker-Independent, SI); the speaker-independent speech recognition model only needs to be trained once, and subsequent small-sample tasks only need to use the trained model to extract the corresponding features, so that voices can be converted in real time, meeting the need for real-time voice conversion.
  • the audio bottleneck feature is more abstract: it reflects the decoupling of the speech content from the speaker's timbre, and it is not so tightly bound to phoneme categories (there is no explicit one-to-one correspondence), so it can, to a certain extent, alleviate the inaccurate pronunciation caused by ASR (Automatic Speech Recognition) errors.
  • the embodiment of the present disclosure realizes the rapid training of voice transformation. Compared with the general voice transformation network, the data volume requirement is significantly reduced.
  • the training time of the system in the embodiment of the present disclosure can be shortened to 5 to 20 minutes, which greatly reduces the dependence on training corpus; practicality is significantly enhanced.
  • FIG. 1 is a training flow chart of a speaker-independent speech recognition model (SI-ASR model) according to an embodiment of the present disclosure;
  • FIG. 2 is a training flow chart of an attention voice-changing network (Attention voice-changing network) according to an embodiment of the present disclosure
  • FIG. 3 is a flowchart of training a neural network vocoder according to an embodiment of the present disclosure;
  • FIG. 4 is a flowchart of voice conversion according to an embodiment of the present disclosure;
  • FIG. 5 is an architectural diagram of an attention voice-changing network (Attention voice-changing network) according to an embodiment of the present disclosure
  • FIG. 6 is a network architecture diagram of a neural network vocoder according to an embodiment of the present disclosure.
  • FIG. 7 is a network architecture diagram of a speaker-independent speech recognition model (SI-ASR model) according to an embodiment of the present disclosure.
  • Embodiments of the present disclosure provide a sound conversion system, including:
  • a speaker-independent speech recognition (SI-ASR) model, which adopts a five-layer DNN structure in which the fourth layer is a bottleneck layer, and transforms the Mel cepstrum features (MFCC) of the source speech into bottleneck features of the source speech;
  • the ASR model converts speech into text, and the model outputs the probability that the audio corresponds to each word, and PPG is the carrier of this probability.
  • PPG-based methods use PPG as the output of the SI-ASR model.
  • PPG Phonetic PosteriorGram, which is a matrix that maps each audio time frame to the posterior probability of a phoneme category.
  • PPG can represent the rhythm and prosody information of a speech to a certain extent, and at the same time remove the features related to the speaker's timbre, so it is speaker-independent.
  • PPG is defined as follows: P_t = (p(s|X_t), s = 1, 2, 3, …, C), where C is the number of phonemes, s is a phoneme (represented as an integer), X_t is the MFCC feature of the t-th audio frame, and p(s|X_t) is the posterior probability of phoneme s.
  • although the PPG feature can remove the speaker's timbre characteristics, the ASR model has a certain error rate in text recognition, so the posterior probabilities output by the model may be inaccurate. This can cause individual words in the finally converted audio to be pronounced inaccurately or incorrectly, and even introduce noise.
  • the speaker-independent speech recognition model in the embodiment of the present disclosure is set up to include a bottleneck layer, which transforms the Mel cepstrum features of the input source speech into bottleneck features of the source speech, and the bottleneck features of the source speech are output from the bottleneck layer to the attention voice-changing network.
  • the extraction of Bottleneck features is related to the construction of the ASR model, and different ASR models have different depths. Taking a five-layer DNN neural network structure as an example, one of the layers can be set as the Bottleneck layer (bottleneck layer), that is, one Bottleneck (DNN) layer is placed within a four-layer DNN structure to form the ASR model. To achieve better results, it is preferable to place the Bottleneck layer at the third or fourth layer of the ASR model, that is, to take the output of the third or fourth layer as the Bottleneck feature. It should be noted that, in the embodiment of the present disclosure, the Bottleneck layer is preferably placed at the fourth layer of the ASR model to avoid interference from timbre information that may be present at other positions.
  • This embodiment adopts a five-layer DNN structure in which the fourth layer is a Bottleneck layer (bottleneck layer), that is, a three-layer DNN structure, one Bottleneck (DNN) layer and one further DNN layer form the ASR model; after the model is trained, audio features are input and the output of the Bottleneck layer is taken as the voice-changing feature. Training the ASR model with a corpus containing a large number of different speakers captures the information shared by all speakers and removes the speakers' individual information.
  • the Bottleneck feature is different from the acoustic feature of the audio. It is a language feature of the audio and does not contain information such as the speaker's timbre.
  • the Bottleneck layer is generally a middle layer of the model, but in order to keep the output features as free of timbre information as possible, the embodiments of the present disclosure improve the network design so that the Bottleneck layer is as close as possible to the output of the ASR network; only then do the extracted features exclude timbre information. Therefore, the embodiment of the present disclosure extracts the output of the penultimate layer of the ASR network as the Bottleneck feature. Practice has proved that this removes timbre information well while preserving linguistic information.
  • the Bottleneck feature is more abstract than the PPG feature: it reflects the decoupling of speech content from the speaker's timbre, and it is not so tightly bound to phoneme categories (there is no explicit one-to-one correspondence), so it can, to a certain extent, alleviate the inaccurate pronunciation caused by ASR recognition errors. In actual tests, for audio obtained by voice conversion with Bottleneck features, the pronunciation accuracy is significantly higher than that of the PPG method, and there is no significant difference in timbre.
  • Attention voice-changing network which transforms the bottleneck feature of the source speech into a Mel cepstral feature consistent with the target speech
  • the Attention voice-changing network is based on a seq2seq (sequence-to-sequence) architecture.
  • the main improvements are as follows: first, a layer of bidirectional RNN (Bidirectional Recurrent Neural Network) is used to encode the BN features output by the SI-ASR model into high-dimensional features; second, an Attention mechanism is combined to link encoding and decoding, avoiding the instability caused by manual alignment; third, the decoder network is simplified by using a two-layer DNN structure followed by one layer of RNN, and then multi-layer SelfAttention (self-attention) with residual connections is used as a Post-Net to convert the BN features into acoustic features (see FIG. 5 for details).
  • GRU (Gated Recurrent Unit) encoding refers to gated-recurrent-unit encoding.
  • the voice changing network of the embodiment of the present disclosure uses simpler and more direct acoustic features.
  • the PPG method converts the PPG into the features required by the vocoder and also needs to combine the F0 (fundamental frequency) and AP (aperiodic component) features before using the vocoder to restore the audio.
  • because the F0 feature contains speaker information, although it can make the converted voice fuller, it sacrifices timbre as a cost.
  • the Attention voice-changing network of the embodiment of the present disclosure can directly predict and output all required vocoder parameters, without the need to manually extract filter features such as F0 and AP.
  • the input and output and process design of the network are greatly simplified, making the model more concise and efficient, and on the other hand, the audio after voice change is more like the target speaker.
  • the network scale of the embodiment of the present disclosure is relatively small, the running speed is fast, and real-time voice change can be realized.
  • currently 10 s of audio requires only 1 s of conversion time, and the effect of real-time voice change can be achieved through engineered streaming encapsulation.
  • Neural network vocoder which converts the Mel cepstrum features consistent with the target speech into speech output.
  • the Neural Network Vocoder employs a variant of WaveRNN that restores acoustic features to audio output.
  • the embodiment of the present disclosure encodes the acoustic features into features in a high-dimensional space, and then uses a recurrent neural network to restore the high-dimensional features to an audio output.
  • the specific neural network vocoder structure is shown in FIG. 6, in which the GRU (Gated Recurrent Unit) network is a gated recurrent unit network, "2-layer FC (Fully Connected)" refers to two fully connected layers, and softmax is the normalized exponential function.
  • the embodiment of the present disclosure also introduces a voice transformation system training method, which includes the following three parts: A1-A3:
  • A1. SI-ASR model (speaker-independent speech recognition model) training stage.
  • This stage trains the SI-ASR model used to extract the features for the Attention voice-changing network training stage and the Bottleneck features (also referred to as BN features) in the voice conversion stage; the model is trained with a training corpus containing many speakers. Once trained, it can be used for any source speaker, that is, it is speaker-independent (SI), hence the name SI-ASR model; after training, it can be used directly later without retraining.
  • the SI-ASR model (Speaker Independent Speech Recognition Model) training phase includes the following steps (see Figure 1):
  • B1. Preprocess the multi-person ASR training corpus. This preprocessing includes de-blanking and normalizing the training audio.
  • De-blanking detects and cuts out long pauses and silences in the audio (excluding normal pauses between words). Normalization is to unify the audio data within a certain range.
  • the text in the training corpus is cleaned and proofread to correct cases where text and audio do not correspond accurately; the text is regularized to convert numbers, dates, decimals and unit symbols into Chinese characters. If the model is trained on words, word segmentation tools (such as jieba or pkuseg) need to be called to segment the text. The text is then converted into pinyin and phonemes. All characters, words, pinyin or phonemes appearing in the corpus are counted to generate a vocabulary, which is uniformly encoded as integer representations; the training here is based on phoneme units.
  • MFCC stands for Mel-scale Frequency Cepstral Coefficients, one of the most commonly used speech features in speech recognition and speech synthesis.
  • Kaldi is an open source speech recognition framework for research purposes.
  • the model adopts a Deep Neural Network (DNN) architecture.
  • The network architecture of the speaker-independent speech recognition model (SI-ASR model) is shown in FIG. 7.
  • A2. Attention voice-changing network training stage.
  • a voice-changing network based on the sequence-to-sequence (seq2seq) architecture and the Attention mechanism (hereinafter referred to as the Attention voice-changing network) is obtained, which is used in the voice conversion stage to transform the BN features extracted by the SI-ASR model into the acoustic features required by the vocoder.
  • the voice-changing network at this stage needs to be trained separately for different target speakers. After training, the voice timbre of any source speaker can be converted into the target speaker timbre;
  • the training phase of the Attention voice-changing network includes the following steps (see Figure 2):
  • the training speech preprocessing includes noise reduction, blank removal, volume normalization, etc.
  • the audio is de-blanked and normalized in the same way as in B1.
  • Noise reduction uses existing noise reduction models to process training speech to reduce the impact of noise.
  • the MFCC features are extracted in the same way as in B2; the acoustic features are the features required by the neural network vocoder.
  • in order to be closer to the perception of the human ear, the embodiment of the present disclosure uses the mel spectrum as the acoustic feature.
  • in theory, the pronunciation differences between languages are not very large, so a deep transfer learning method can be used to reuse previously trained network parameters and learned features, which greatly reduces the difficulty of model training, the amount of training data required, and the collection cost.
  • A3. Neural network vocoder training stage. In this stage, a deep-neural-network-based vocoder (Deep Neural Network Vocoder) is obtained by training, which is used to transform acoustic features into the target speech signal.
  • the vocoder training phase includes the following steps (see Figure 3):
  • This step is the same as the preprocessing operation in C1.
  • Different vocoders use different acoustic features, and the present embodiment of the present disclosure uses the mel spectrum as the acoustic feature.
  • D4 Input the acoustic features and the sound signal of the target speech into the pre-trained neural network vocoder model, and train the model by the method of deep transfer learning.
  • Embodiments of the present disclosure also include a sound transformation method.
  • the input source voice is transformed into the target voice signal output, that is, the voice conforms to the target speaker's voice characteristics, but the speech content is the same as the source voice.
  • the sound conversion stage includes the following steps (see Figure 4):
  • E4 Use the neural network vocoder trained in D4 to convert the acoustic features (mel spectrum) into speech output.
  • the trained speaker-independent speech recognition model can be used for any source speaker, that is, it is speaker-independent (Speaker-Independent, SI); the speaker-independent speech recognition model only needs to be trained once, and subsequent small-sample tasks only need to use the trained model to extract the corresponding features.
  • the embodiment of the present disclosure further includes a terminal, where the terminal uses the sound conversion system described in the first embodiment.
  • the terminal may be a mobile terminal, PC device, wearable device or the like equipped with an automatic voice response or prompt service system, or may be a voice robot with an automatic voice response or prompt service function, which is not limited in this embodiment of the present disclosure.
  • Embodiments of the present disclosure further include a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, wherein the computer program is configured, when run, to execute the method described in Embodiment A4 above to perform voice conversion, and module training uses the methods described in the above-mentioned Embodiments A1-A3.
  • the present disclosure also provides a sound transformation system, which includes: a speech recognition model, including at least a bottleneck layer, the speech recognition model can be configured to: convert the Mel cepstrum features of the input source speech into bottleneck features through the bottleneck layer; and input the bottleneck features from the bottleneck layer into the attention voice-changing network.
  • the technical solutions provided by the embodiments of the present disclosure are applied in scenarios where the voice spoken by the source speaker (the source voice) needs to be converted into a voice that matches the target speaker; that is, the source voice is the speech spoken by the source speaker before the conversion begins.
  • the speech recognition model is trained from a large number of training corpora, and the trained speech recognition model can be applied to any source speaker, that is, the speech recognition model is independent.
  • the speech recognition model may include a five-layer DNN structure, wherein the third layer or the fourth layer may be the bottleneck layer.
  • the Bottleneck layer is placed in the fourth layer of the ASR model to avoid the interference of possible timbre information in other positions.
  • the attention voice-changing network can be configured to: convert the input bottleneck features into Mel cepstral features consistent with the target speech;
  • the target voice is the voice spoken by the target speaker. Therefore, in order to convert the source speech into a speech that matches the target speech, an attention voice-changing network is used to convert the bottleneck features of the source speech into Mel cepstral features that match the target speech.
  • the attention voice-changing network may include a one-layer bidirectional RNN structure.
  • the technical solution provided by the present disclosure further includes a neural network vocoder, and the neural network vocoder can be configured to: convert the Mel cepstrum features consistent with the target speech into an imitation target speech and output it; the imitation target speech is the audio generated by converting the source speech.
  • the neural network vocoder can convert Mel cepstrum features into audio output, and the generated audio is the imitation target speech. That is to say, the imitation target voice is the voice that matches the target voice, and through the above process, the voice transformation is realized.
  • the present disclosure also provides a training method for a sound transformation system, which is applied to the sound transformation system described in the above-mentioned embodiments, including:
  • S1 Convert the characters in the multi-person speech recognition training corpus into numbers according to the mapping relationship between the characters in the multi-person speech recognition training corpus and character codes.
  • the character encoding refers to a conversion form that converts any input character into a fixed form.
  • Chinese characters need to be converted into corresponding phonemes so that neural network calculations can be performed, and the character encoding can be ASCII code or can take other forms, which is not specifically limited in the present disclosure.
  • S4 Perform iterative optimization until the speech recognition model converges, so as to train the speech recognition model.
  • Training the speech recognition model is the process of establishing the connection between the model and the training corpus.
  • the specific text content of the training corpus is not limited in this disclosure.
  • the method also includes preprocessing the multi-person speech recognition training corpus; the preprocessing includes removing blanks and normalizing.
  • the process of removing blanks can remove excessively long pauses and silences in the training corpus, thereby improving the quality of the training corpus.
  • the process of normalization is to normalize the volume of the training corpus: if the volume of the training corpus is sometimes loud and sometimes quiet, it will affect the training effect, so normalization keeps the volume of the training corpus within a certain range; the specific range can be designed according to the actual situation, which is not specifically limited in the present disclosure.
  • the method also includes:
  • S6 Input the bottleneck feature of the target speech into the attention voice-changing network; input the Mel cepstrum feature of the target voice as the real value to the attention-voice-changing network;
  • the target speech refers to the speech spoken by the target speaker during the training process.
  • Using the target voice to train the attention voice-changing network can establish a connection between the target voice and the attention voice-changing network, so that the attention voice-changing network can convert the bottleneck feature of the source voice into a Mel consistent with the target voice. Cepstral features.
  • the step of converting the Mel cepstral features of the target speech into the bottleneck features is performed by the pre-trained speech recognition model.
  • the method also includes:
  • the neural network vocoder is trained through the relationship between the Mel cepstrum features of the target speech and the sound signal of the target speech, so that the neural network vocoder can convert Mel cepstrum features consistent with the target speech into audio and output it.
  • the method also includes preprocessing the target speech; the preprocessing includes de-blanking and normalization.
  • de-blanking and normalization can prevent excessive pauses, silences, and excessive or low volume in the audio from affecting the subsequent training of the attention voice-changing network and the neural network vocoder.
  • the method further includes: extracting parameters to obtain the Mel cepstrum feature of the multi-person speech recognition training corpus, the Mel cepstrum feature of the target speech, and the Mel cepstrum feature of the source speech.
  • the voice conversion system and the training method of the voice conversion system provided by the present disclosure can, based on the source voice spoken by any source speaker and the target voice spoken by the target speaker, convert the source voice into an audio output that matches the target voice, and are highly practical.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Machine Translation (AREA)

Abstract

A voice conversion scheme trained on non-parallel corpora, which removes the dependence on parallel text and solves the technical problem that voice conversion is difficult to realize under conditions of limited resources and equipment, including a voice conversion system and a training method for the voice conversion system. The trained speaker-independent speech recognition model can be used for any source speaker, i.e., it is speaker-independent. The bottleneck features of audio are more abstract than phonetic posteriorgram features: they reflect the decoupling of speech content from speaker timbre while not being so tightly bound to phoneme classes (there is no explicit one-to-one correspondence), which to a certain extent alleviates the inaccurate pronunciation caused by ASR recognition errors. For audio obtained by voice conversion with bottleneck features, the pronunciation accuracy is significantly higher than that of the phonetic posteriorgram method, and there is no significant difference in timbre; by means of transfer learning, the dependence on training corpora can be greatly reduced.

Description

A voice conversion system and a training method for the voice conversion system
The present disclosure claims priority to the Chinese patent application filed with the Chinese Patent Office on October 21, 2020, with application number 202011129857.5 and invention title "A voice conversion system, method and application", the entire contents of which are incorporated into the present disclosure by reference.
Technical Field
The present disclosure relates to the field of speech computing algorithms, and in particular to a voice conversion system and a training method for the voice conversion system.
Background
With the continuous development of computer technology and ever deeper work in the field of artificial intelligence, voice robots aimed at voice interaction have gradually entered the public eye. The emergence of voice robots has changed the nature of existing telephone business; at present, voice robots are used in real estate, education, finance, tourism and other industries to perform voice interaction functions, thereby replacing manual voice interaction with users.
To optimize the customer experience, using voice conversion technology to change the voice characteristics of a voice robot is one important direction of improvement.
Voice conversion technology is a research branch of speech signal processing. It covers the fields of speaker recognition, speech recognition and speech synthesis, and aims to change the personalized information of speech while keeping the original semantic information unchanged, so that the speech of a particular speaker (i.e., the source speaker) sounds like the speech of another particular speaker (i.e., the target speaker). After years of development, many different methods have emerged in the field of voice conversion, among which the statistical conversion method represented by the Gaussian mixture model has become a classic method in this field. However, this kind of algorithm still has certain defects. For example, the classical methods that use the Gaussian mixture model for voice conversion are mostly based on one-to-one conversion tasks and require parallel text, that is, the training sentences used by the source speaker and the target speaker have the same content, and the spectral features of the training sentences must be aligned frame by frame by dynamic time warping (DTW) before the mapping relationship between spectral features can be obtained through model training; since the text of a non-parallel corpus is not parallel, the above Gaussian mixture model cannot be used for voice conversion.
Summary
To solve the above problems, the embodiments of the present disclosure propose a voice conversion scheme trained on non-parallel corpora, which removes the dependence on parallel text, can achieve a voice-changing effect with small samples, and solves the technical problem that voice conversion is difficult to realize under conditions of limited resources and equipment.
The embodiments of the present disclosure adopt the following technical solutions:
In a first aspect, a voice conversion system is provided, including:
a speaker-independent speech recognition model, including at least a bottleneck layer, the speaker-independent speech recognition model being configured to transform the Mel cepstrum features of the input source speech into bottleneck features of the source speech through the bottleneck layer;
an attention voice-changing network, configured to transform the bottleneck features of the source speech into Mel cepstrum features that match the target speech; and
a neural network vocoder, configured to convert the Mel cepstrum features that match the target speech into a speech output.
In a second aspect, a voice conversion method is provided, including:
transforming the Mel cepstrum features of the source speech into bottleneck features of the source speech;
transforming the bottleneck features of the source speech into Mel cepstrum features that match the target speech; and
converting the Mel cepstrum features that match the target speech into a speech output.
In a third aspect, a training method for a speaker-independent speech recognition model is provided, including:
inputting the numbers of the character codes converted from the text of a multi-person speech recognition training corpus, together with the Mel cepstrum features of the multi-person speech recognition training corpus, into the speaker-independent speech recognition model, and running the back-propagation algorithm for iterative optimization until the speaker-independent speech recognition model converges.
In a fourth aspect, a training method for an attention voice-changing network is provided, including:
transforming the Mel cepstrum features of the target speech into bottleneck features of the target speech; and
inputting the bottleneck features of the target speech into a basic attention voice-changing network, taking the Mel cepstrum features corresponding to the target speaker as the ground truth, and training the attention voice-changing network by means of deep transfer learning.
In a fifth aspect, a training method for a neural network vocoder is provided, including:
inputting the Mel cepstrum features of the target speech and the sound signal of the target speech into a pre-trained neural network vocoder, and training the neural network vocoder by means of deep transfer learning.
In a sixth aspect, a terminal is provided, including the aforementioned voice conversion system.
In a seventh aspect, a computer-readable storage medium is provided, in which a computer program is stored, wherein the computer program is configured to execute one of the aforementioned methods when run.
In an eighth aspect, a voice conversion system is provided, including:
a speech recognition model, including at least a bottleneck layer, the speech recognition model being configured to convert the Mel cepstrum features of the input source speech into bottleneck features through the bottleneck layer, and to input the bottleneck features from the bottleneck layer into an attention voice-changing network;
the attention voice-changing network being configured to convert the input bottleneck features into Mel cepstrum features that match the target speech; and
a neural network vocoder, the neural network vocoder being configured to convert the Mel cepstrum features that match the target speech into an imitation target speech and output it, the imitation target speech being the audio generated by converting the source speech.
In a ninth aspect, a training method for a voice conversion system is provided, applied to the voice conversion system described in the foregoing fifth aspect, including:
converting the characters in a multi-person speech recognition training corpus into numbers according to the mapping relationship between the characters in the multi-person speech recognition training corpus and character codes;
inputting the converted numbers and the Mel cepstrum features of the multi-person speech recognition training corpus into the speech recognition model;
running the back-propagation algorithm; and
performing iterative optimization until the speech recognition model converges, so as to train the speech recognition model.
Through the above technical solutions, the embodiments of the present disclosure remove the dependence on parallel text, realize conversion from any speaker to multiple speakers, improve flexibility, and solve the technical problem that voice conversion is difficult to realize under conditions of limited resources and equipment.
Specifically:
1. The trained speaker-independent speech recognition model can be used for any source speaker, i.e., it is speaker-independent (Speaker-Independent, SI); the training method for the speaker-independent speech recognition model only needs to be run once, and subsequent small-sample tasks only need to use the trained model to extract the corresponding features, so that voices can be converted in real time, meeting the need for real-time voice conversion.
2. Compared with phonetic posteriorgram (PPG) features, the bottleneck features of audio are more abstract: they reflect the decoupling of speech content from speaker timbre while not being so tightly bound to phoneme classes (there is no explicit one-to-one correspondence), so they can, to a certain extent, alleviate the inaccurate pronunciation caused by ASR (Automatic Speech Recognition) errors. In actual tests, for audio obtained by voice conversion with bottleneck features, the pronunciation accuracy is significantly higher than that of the PPG method, and there is no significant difference in timbre.
3. The embodiments of the present disclosure realize fast training of voice conversion. Compared with a general voice-changing network, the amount of data required is significantly reduced; the training time of the system of the embodiments of the present disclosure can be shortened to 5 to 20 minutes, which greatly reduces the dependence on training corpora and significantly enhances the practicality of the system.
Brief Description of the Drawings
In order to explain the technical solutions of the present disclosure more clearly, the drawings required in the embodiments are briefly introduced below. Obviously, for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a training flow chart of a speaker-independent speech recognition model (SI-ASR model) according to an embodiment of the present disclosure;
FIG. 2 is a training flow chart of an attention voice-changing network according to an embodiment of the present disclosure;
FIG. 3 is a training flow chart of a neural network vocoder according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of voice conversion according to an embodiment of the present disclosure;
FIG. 5 is an architecture diagram of an attention voice-changing network according to an embodiment of the present disclosure;
FIG. 6 is a network architecture diagram of a neural network vocoder according to an embodiment of the present disclosure;
FIG. 7 is a network architecture diagram of a speaker-independent speech recognition model (SI-ASR model) according to an embodiment of the present disclosure.
Detailed Description
The present disclosure will be described in detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some of the embodiments of the present disclosure rather than all of them; based on the embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present disclosure.
An embodiment of the present disclosure provides a voice conversion system, including:
(1) A speaker-independent speech recognition (SI-ASR) model, which adopts a five-layer DNN (Deep Neural Network) structure in which the fourth layer is a bottleneck layer, and transforms the Mel cepstrum features (MFCC) of the source speech into bottleneck features of the source speech.
An ASR model converts speech into text; the model outputs the probability that the audio corresponds to each word, and the PPG is the carrier of this probability. PPG-based methods use the PPG as the output of the SI-ASR model.
PPG stands for Phonetic PosteriorGram; it is a matrix that maps each audio time frame to the posterior probabilities of phoneme classes. To a certain extent, the PPG can represent the rhythm and prosody information of the spoken content of a piece of speech while removing the features related to the speaker's timbre, so it is speaker-independent. The PPG is defined as follows:
P_t = (p(s|X_t), s = 1, 2, 3, …, C);
where C is the number of phonemes, s is a phoneme (represented as an integer), X_t is the MFCC feature of the t-th audio frame, and p(s|X_t) is the posterior probability of phoneme s.
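By way of illustration only (this sketch is not part of the disclosed embodiments), a PPG matrix of this form can be assembled by stacking a per-frame softmax posterior over the phoneme inventory; in the following numpy sketch the frame count, the phoneme inventory size and the random scores standing in for the acoustic-model output are all hypothetical.

import numpy as np

def softmax(logits, axis=-1):
    # numerically stable softmax over the phoneme axis
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: T audio frames, C phoneme classes (both arbitrary here).
T, C = 200, 218
frame_logits = np.random.randn(T, C)      # stand-in for the ASR output for each MFCC frame X_t

ppg = softmax(frame_logits, axis=-1)       # row t holds (p(s|X_t), s = 1..C)
assert ppg.shape == (T, C)
assert np.allclose(ppg.sum(axis=-1), 1.0)  # each row is a posterior distribution over phonemes
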
In practice, it has been found that although PPG features can remove the speaker's timbre characteristics, the ASR model has a certain error rate in text recognition, so the posterior probabilities output by the model may be inaccurate. This can cause individual words in the finally converted audio to be pronounced inaccurately or incorrectly, and even introduce noise.
To address this problem, the speaker-independent speech recognition model in the embodiments of the present disclosure is set up to include a bottleneck layer; the Mel cepstrum features of the input source speech are transformed into bottleneck features of the source speech through the bottleneck layer, and the bottleneck features of the source speech are output from the bottleneck layer to the attention voice-changing network.
The extraction of bottleneck features is related to how the ASR model is constructed, and different ASR models have different depths. Taking a five-layer DNN neural network structure as an example, one of the layers can be set as the bottleneck layer, i.e., one bottleneck (DNN) layer is placed within a four-layer DNN structure to form the ASR model. To achieve better results, it is preferable to place the bottleneck layer at the third or fourth layer of the ASR model, i.e., to take the output of the third or fourth layer as the bottleneck feature. It should be noted that, in the embodiments of the present disclosure, the bottleneck layer is preferably placed at the fourth layer of the ASR model to avoid interference from timbre information that may be present at other positions.
The following is described in detail as a preferred implementation of the embodiments of the present disclosure and is not intended to limit the scope of protection of the embodiments of the present disclosure; within the scope of protection determined by the claims of the embodiments of the present disclosure, other implementations can still solve the technical problem addressed by the embodiments of the present disclosure.
This embodiment adopts a five-layer DNN structure in which the fourth layer is a bottleneck layer, i.e., a three-layer DNN structure, one bottleneck (DNN) layer and one further DNN layer form the ASR model. After the model is trained, audio features are input and the output of the bottleneck layer is taken as the voice-changing feature. Unlike the acoustic features of the audio, the bottleneck feature is a linguistic feature of the audio and does not contain information such as the speaker's timbre. Training the ASR model with a corpus containing a large number of different speakers captures the information shared by all speakers and removes the speakers' individual information. The bottleneck layer is usually a middle layer of the model, but in order to keep the output features as free of timbre information as possible, the embodiments of the present disclosure improve the network design so that the bottleneck layer is as close as possible to the output of the ASR network; only then do the extracted features exclude timbre information. Therefore, the embodiments of the present disclosure take the output of the penultimate layer of the ASR network as the bottleneck feature. Practice has shown that this removes timbre information well while preserving linguistic information.
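As a non-limiting sketch of such a structure, the following PyTorch module places a narrow bottleneck layer as the fourth of five layers and exposes the penultimate-layer output as the BN feature; the layer widths, the MFCC dimension and the phoneme count are illustrative guesses, not values specified by the disclosure.

import torch
import torch.nn as nn

class SIASRDNN(nn.Module):
    """Five-layer DNN sketch: three hidden layers, a narrow bottleneck as the
    fourth layer, and an output layer over phoneme classes."""
    def __init__(self, mfcc_dim=39, hidden=1024, bottleneck=256, num_phones=218):
        super().__init__()
        self.front = nn.Sequential(            # layers 1-3
            nn.Linear(mfcc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.bottleneck = nn.Sequential(       # layer 4: the bottleneck (BN) layer
            nn.Linear(hidden, bottleneck), nn.ReLU(),
        )
        self.out = nn.Linear(bottleneck, num_phones)  # layer 5: phoneme posteriors

    def forward(self, mfcc, return_bn=False):
        h = self.front(mfcc)
        bn = self.bottleneck(h)                # penultimate-layer output = BN feature
        logits = self.out(bn)
        return (logits, bn) if return_bn else logits

# Extracting BN features for one utterance of placeholder MFCC frames:
model = SIASRDNN()
mfcc = torch.randn(200, 39)                   # 200 frames, 39-dimensional MFCC (placeholder)
_, bn_features = model(mfcc, return_bn=True)  # (200, 256) speaker-independent features
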
Compared with PPG features, bottleneck features are more abstract: they reflect the decoupling of speech content from the speaker's timbre while not being so tightly bound to phoneme classes (there is no explicit one-to-one correspondence), so they can, to a certain extent, alleviate the inaccurate pronunciation caused by ASR recognition errors. In actual tests, for audio obtained by voice conversion with bottleneck features, the pronunciation accuracy is significantly higher than that of the PPG method, and there is no significant difference in timbre.
(2) An attention voice-changing network, which transforms the bottleneck features of the source speech into Mel cepstrum features that match the target speech.
The attention voice-changing network is based on a seq2seq (sequence-to-sequence) architecture. The main improvements are as follows: first, a layer of bidirectional RNN (Bidirectional Recurrent Neural Network) is used to encode the BN features output by the SI-ASR model into high-dimensional features; second, an attention mechanism is combined to link encoding and decoding, avoiding the instability caused by manual alignment; third, the decoder network is simplified by using a two-layer DNN structure followed by one layer of RNN, and then multi-layer self-attention with residual connections is used as a Post-Net to convert the BN features into acoustic features (see FIG. 5 for details). GRU (Gated Recurrent Unit) encoding refers to gated-recurrent-unit encoding.
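A highly simplified, non-authoritative PyTorch sketch of this kind of network is given below; it keeps the ingredients described above (a bidirectional GRU encoder, attention linking encoding and decoding, a small DNN-plus-RNN decoder, and a residual self-attention Post-Net), but every dimension and the exact attention form are assumptions rather than the disclosed design.

import torch
import torch.nn as nn

class AttentionVCNet(nn.Module):
    """Sketch of the BN-to-mel conversion network; dimensions are guesses."""
    def __init__(self, bn_dim=256, enc_dim=256, mel_dim=80, post_layers=3):
        super().__init__()
        self.encoder = nn.GRU(bn_dim, enc_dim, batch_first=True, bidirectional=True)
        self.memory_proj = nn.Linear(2 * enc_dim, 256)   # encoder memory for attention
        self.prenet = nn.Sequential(                     # two-layer DNN on the decoder side
            nn.Linear(2 * enc_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(256, num_heads=4, batch_first=True)
        self.dec_rnn = nn.GRU(256, 256, batch_first=True)
        self.to_mel = nn.Linear(256, mel_dim)
        self.post = nn.ModuleList([                      # self-attention Post-Net blocks
            nn.MultiheadAttention(mel_dim, num_heads=4, batch_first=True)
            for _ in range(post_layers)
        ])

    def forward(self, bn):                               # bn: (batch, frames, bn_dim)
        enc, _ = self.encoder(bn)                        # high-dimensional encoding of BN features
        memory = self.memory_proj(enc)
        query = self.prenet(enc)                         # decoder-side queries, one per output frame
        ctx, _ = self.attn(query, memory, memory)        # attention links encoding and decoding
        dec, _ = self.dec_rnn(ctx)
        mel = self.to_mel(dec)
        for layer in self.post:                          # residual self-attention refinement
            res, _ = layer(mel, mel, mel)
            mel = mel + res
        return mel                                       # (batch, frames, mel_dim)

mel = AttentionVCNet()(torch.randn(1, 200, 256))         # example: 200 BN frames -> 200 mel frames
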
Compared with the PPG method, the voice-changing network of the embodiments of the present disclosure uses simpler and more direct acoustic features. The PPG method converts the PPG into the features required by the vocoder and must also incorporate F0 (fundamental frequency) and AP (aperiodic component) features before using the vocoder to restore the audio. Because the F0 feature contains speaker information, although it can make the converted voice fuller, it sacrifices timbre as a cost. The attention voice-changing network of the embodiments of the present disclosure can directly predict and output all the required vocoder parameters, without manually extracting filter features such as F0 and AP. On the one hand, this greatly simplifies the input/output and process design of the network and makes the model more concise and efficient; on the other hand, it makes the converted audio sound more like the target speaker. In addition, the network of the embodiments of the present disclosure is relatively small and runs fast, so real-time voice changing can be realized; currently 10 s of audio requires only 1 s of conversion time, and real-time voice changing can be achieved through engineered streaming encapsulation.
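A minimal sketch of such streaming encapsulation is shown below; the chunk length and the convert_chunk callable (standing in for the full conversion pipeline) are assumptions, not elements of the disclosure, and a production implementation would additionally handle overlap at chunk boundaries.

def stream_convert(wav, convert_chunk, chunk_sec=1.0, sr=16000):
    """Toy streaming wrapper: cut incoming audio into short chunks and convert
    each one as it arrives, approximating real-time voice change."""
    step = int(chunk_sec * sr)
    for start in range(0, len(wav), step):
        yield convert_chunk(wav[start:start + step])   # converted audio for this chunk
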
(3) A neural network vocoder, which converts the Mel cepstrum features that match the target speech into a speech output.
The neural network vocoder adopts a variant of WaveRNN to restore acoustic features to an audio output. The embodiments of the present disclosure encode the acoustic features into features in a high-dimensional space and then use a recurrent neural network to restore the high-dimensional features to an audio output; the specific neural network vocoder structure is shown in FIG. 6, in which the GRU (Gated Recurrent Unit) network is a gated recurrent unit network, "2-layer FC (Fully Connected)" refers to two fully connected layers, and softmax is the normalized exponential function.
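For illustration only, the following toy sketch follows the same outline (conditioning features derived from the mel spectrum, a GRU, two fully connected layers and a softmax over quantised sample values); the hop size, the layer widths and the 8-bit output are assumptions and do not reproduce the WaveRNN variant of the embodiments.

import torch
import torch.nn as nn

class TinyWaveRNNSketch(nn.Module):
    """Mel frames are upsampled to the sample rate, concatenated with the previous
    sample value, passed through a GRU and two fully connected layers, and a
    softmax gives the distribution over the next quantised sample."""
    def __init__(self, mel_dim=80, hop=200, rnn_dim=512, classes=256):
        super().__init__()
        self.hop = hop
        self.cond = nn.Linear(mel_dim, 128)              # encode mel into conditioning features
        self.rnn = nn.GRU(128 + 1, rnn_dim, batch_first=True)
        self.fc1 = nn.Linear(rnn_dim, 256)
        self.fc2 = nn.Linear(256, classes)

    def forward(self, mel, prev_samples):
        # mel: (B, T_frames, mel_dim); prev_samples: (B, T_frames*hop, 1) in [-1, 1]
        c = self.cond(mel)                               # (B, T_frames, 128)
        c = c.repeat_interleave(self.hop, dim=1)         # crude upsampling to the sample rate
        x, _ = self.rnn(torch.cat([c, prev_samples], dim=-1))
        x = torch.relu(self.fc1(x))
        return torch.log_softmax(self.fc2(x), dim=-1)    # per-sample class log-probabilities

logp = TinyWaveRNNSketch()(torch.randn(1, 10, 80), torch.zeros(1, 10 * 200, 1))
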
The embodiments of the present disclosure also introduce a training method for a voice conversion system, which includes the following three parts A1-A3:
A1. SI-ASR model (speaker-independent speech recognition model) training stage. This stage trains the SI-ASR model used to extract the features for the attention voice-changing network training stage and the bottleneck features (BN features) in the voice conversion stage. The model is trained with a training corpus containing many speakers. Once trained, it can be used for any source speaker, that is, it is speaker-independent (Speaker-Independent, SI), hence the name SI-ASR model; after training, it can be used directly later without retraining.
The SI-ASR model (speaker-independent speech recognition model) training stage includes the following steps (see FIG. 1):
B1. Preprocess the multi-person ASR training corpus.
This preprocessing includes de-blanking and normalizing the training audio. De-blanking detects and cuts out overly long pauses and silences in the audio (excluding normal pauses between words). Normalization unifies the audio data within a certain range.
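A minimal preprocessing sketch along these lines, assuming librosa for silence detection and simple peak normalisation, is shown below; the silence threshold and peak level are illustrative choices, and a real implementation would keep short natural pauses between words rather than cutting every silent interval.

import numpy as np
import librosa

def preprocess_audio(path, sr=16000, top_db=40, peak=0.95):
    """De-blank and normalise one training utterance (illustrative settings)."""
    wav, _ = librosa.load(path, sr=sr)
    # De-blanking: keep only the detected non-silent intervals.
    intervals = librosa.effects.split(wav, top_db=top_db)
    if len(intervals):
        wav = np.concatenate([wav[s:e] for s, e in intervals])
    # Normalisation: scale the waveform into a fixed amplitude range.
    wav = wav / (np.abs(wav).max() + 1e-8) * peak
    return wav
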
The text in the training corpus is cleaned and proofread to correct cases where text and audio do not correspond accurately; the text is regularized to convert numbers, dates, decimals and unit symbols into Chinese characters. If the model is trained on words, word segmentation tools (such as jieba or pkuseg) need to be called to segment the text. The text is then converted into pinyin and phonemes. All characters, words, pinyin or phonemes appearing in the corpus are collected to generate a vocabulary, which is uniformly encoded as integer representations; training is performed on phoneme units.
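The following sketch illustrates the segmentation-and-encoding step under the assumption that jieba is used for segmentation and pypinyin (an assumed choice, not named by the disclosure) for pinyin conversion; text regularisation and the pinyin-to-phoneme step are omitted for brevity.

import jieba                      # word segmentation
from pypinyin import lazy_pinyin  # pinyin conversion (assumed library choice)

def build_vocab(texts):
    """Segment each transcript, convert it to pinyin units, collect a vocabulary,
    and encode every unit as an integer ID."""
    encoded, vocab = [], {}
    for text in texts:
        units = []
        for word in jieba.lcut(text):
            units.extend(lazy_pinyin(word))      # e.g. "你好" -> ["ni", "hao"]
        for u in units:
            vocab.setdefault(u, len(vocab))      # first-seen order -> integer code
        encoded.append([vocab[u] for u in units])
    return encoded, vocab

ids, vocab = build_vocab(["你好世界", "语音变换"])
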
B2. Parameter extraction to obtain the MFCC features (Mel cepstrum features) of the training corpus audio.
Mel-scale Frequency Cepstral Coefficients (MFCC) are among the most commonly used speech features in speech recognition and speech synthesis. This kind of feature does not depend on the nature of the signal and makes no assumptions or restrictions on the input signal; it is robust, draws on the research results of auditory models, better matches the auditory characteristics of the human ear, and still has good recognition performance when the signal-to-noise ratio decreases. The MFCC extraction process includes the following steps: pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering, logarithm, and discrete cosine transform.
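A step-by-step numpy sketch of this extraction chain is shown below; the frame length, hop size, filter-bank size and coefficient count are illustrative defaults rather than the settings used in the embodiments.

import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_wave(wav, sr=16000, n_fft=512, hop=160, n_mels=40, n_mfcc=13):
    """Illustrative MFCC pipeline mirroring the stages listed above."""
    wav = np.append(wav[0], wav[1:] - 0.97 * wav[:-1])                 # pre-emphasis
    n_frames = 1 + (len(wav) - n_fft) // hop                           # framing
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wav[idx] * np.hamming(n_fft)                              # windowing
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2                    # FFT -> power spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    log_mel = np.log(power @ mel_fb.T + 1e-10)                         # Mel filtering + log
    return dct(log_mel, type=2, axis=-1, norm="ortho")[:, :n_mfcc]     # DCT -> MFCC

mfcc = mfcc_from_wave(np.random.randn(16000))                          # one second of noise as a stand-in
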
B3. SI-ASR model training.
The SI-ASR model is trained with the Kaldi framework. Kaldi is an open-source speech recognition framework for research purposes. The model adopts a Deep Neural Network (DNN) architecture.
The text in the training corpus is converted into the numbers of its character codes and input into the SI-ASR model together with the audio MFCC features, and the back-propagation algorithm is run for iterative optimization until the model converges.
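The actual training uses Kaldi; purely as an illustration of the iterate-until-convergence loop, the following PyTorch sketch trains a toy frame-level classifier on placeholder MFCC frames and phoneme IDs. In practice the per-frame targets would come from forced alignment, which this sketch glosses over.

import torch
import torch.nn as nn

# Toy stand-in for the SI-ASR model: MFCC frames in, per-frame phoneme logits out.
model = nn.Sequential(nn.Linear(39, 512), nn.ReLU(), nn.Linear(512, 218))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def make_batch():
    # Placeholder batch: in a real setup the MFCC frames come from the multi-person
    # corpus and the integer targets from the encoded phoneme labels.
    return torch.randn(256, 39), torch.randint(0, 218, (256,))

prev = float("inf")
for step in range(10_000):                       # iterative optimization
    mfcc, phone_ids = make_batch()
    loss = loss_fn(model(mfcc), phone_ids)
    optimizer.zero_grad()
    loss.backward()                              # back-propagation
    optimizer.step()
    if step % 500 == 0:
        if abs(prev - loss.item()) < 1e-4:       # crude convergence check
            break
        prev = loss.item()
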
The network architecture of the speaker-independent speech recognition model (SI-ASR model) is shown in FIG. 7.
A2. Attention voice-changing network training stage. This stage trains a voice-changing network based on the sequence-to-sequence architecture and the attention mechanism (hereinafter referred to as the attention voice-changing network), which is used in the voice conversion stage to transform the BN features extracted by the SI-ASR model into the acoustic features required by the vocoder. The voice-changing network of this stage needs to be trained separately for different target speakers; once trained, the timbre of any source speaker's voice can be converted into the target speaker's timbre.
The attention voice-changing network training stage includes the following steps (see FIG. 2):
C1. Preprocess the target speech of the target speaker.
Preprocessing of the training speech includes noise reduction, de-blanking, volume normalization, etc. The audio is de-blanked and normalized in the same way as in step B1. Noise reduction uses an existing noise reduction model to process the training speech and reduce the influence of noise.
Converting one person's voice requires roughly 5 to 20 minutes of recordings, i.e., 50 to 200 sentences, which greatly simplifies the tedious recording work; moreover, voice changing does not require text verification, so these training audio recordings are used both to train the attention voice-changing network and to train the neural network vocoder.
C2. Parameter extraction to obtain the MFCC features and acoustic features of the target speech.
The MFCC features are extracted in the same way as in step B2. The acoustic features are the features required by the neural network vocoder. The embodiments of the present disclosure currently use the Mel spectrum as the acoustic feature, in order to be closer to human auditory perception.
C3. Use the SI-ASR model trained in B3 to transform the MFCC features into BN features.
C4. Load a basic attention voice-changing network trained on a large-scale corpus.
C5. Input the BN features into the basic attention voice-changing network, take the acoustic features as the ground truth, and train the model by means of deep transfer learning.
In theory, the pronunciation differences between languages are not very large, so a deep transfer learning method can be used to reuse previously trained network parameters and learned features, which greatly reduces the difficulty of model training, the amount of training data required, and the collection cost.
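A minimal sketch of this deep transfer learning step is given below: a base network is loaded and fine-tuned on the small target-speaker set with its acoustic features as the ground truth. The file name, the toy network, the frozen layer and all hyper-parameters are placeholders, not values from the disclosure; freezing the first layer is only one possible strategy.

import torch
import torch.nn as nn

# Toy stand-in for the base attention voice-changing network (BN frames -> mel frames).
net = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 80))

torch.save(net.state_dict(), "base_attention_vc.pt")      # stands in for large-corpus pre-training
net.load_state_dict(torch.load("base_attention_vc.pt"))   # C4: load the pre-trained base network
for p in net[0].parameters():
    p.requires_grad = False                                # reuse low-level features as-is

optimizer = torch.optim.Adam((p for p in net.parameters() if p.requires_grad), lr=1e-4)
loss_fn = nn.L1Loss()

for epoch in range(50):                                    # C5: fine-tune on the target speaker
    bn = torch.randn(32, 256)                              # target-speaker BN features (placeholder)
    mel_truth = torch.randn(32, 80)                        # target-speaker mel frames = ground truth
    loss = loss_fn(net(bn), mel_truth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
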
A3. Neural network vocoder training stage. This stage trains a deep-neural-network-based vocoder (Deep Neural Network Vocoder), which is used to transform acoustic features into the target speech signal.
The vocoder training stage includes the following steps (see FIG. 3):
D1. Preprocess the target speech of the target speaker.
This step is the same as the preprocessing operation in C1.
D2. Parameter extraction to obtain the acoustic features of the target speech.
Different vocoders use different acoustic features; the embodiments of the present disclosure currently use the Mel spectrum as the acoustic feature.
D3. Load a pre-trained vocoder model.
D4. Input the acoustic features and the sound signal of the target speech into the pre-trained neural network vocoder model, and train the model by means of deep transfer learning.
The embodiments of the present disclosure also include a voice conversion method. The input source speech is converted and output as a target speech signal, that is, speech whose sound matches the voice characteristics of the target speaker but whose spoken content is the same as the source speech. The voice conversion stage includes the following steps (see FIG. 4):
E1. Perform parameter extraction on the source speech to be converted to obtain its MFCC features.
E2. Use the SI-ASR model trained in B3 to transform the MFCC features into BN features.
E3. Use the attention voice-changing network trained in C5 to transform the BN features into acoustic features (Mel spectrum).
E4. Use the neural network vocoder trained in D4 to convert the acoustic features (Mel spectrum) into a speech output.
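Putting steps E1-E4 together, the conversion stage can be sketched as the following function; the four callables stand for an MFCC extractor and the components trained in B3, C5 and D4, and their names are illustrative rather than defined by the disclosure.

def convert_voice(source_wav, extract_mfcc, si_asr, attention_vc, vocoder):
    """Sketch of the conversion stage E1-E4 with assumed component callables."""
    mfcc = extract_mfcc(source_wav)          # E1: parameter extraction
    bn = si_asr(mfcc)                        # E2: MFCC -> speaker-independent BN features
    mel = attention_vc(bn)                   # E3: BN features -> target-speaker Mel spectrum
    return vocoder(mel)                      # E4: Mel spectrum -> waveform in the target voice
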
In this way, the trained speaker-independent speech recognition model can be used for any source speaker, i.e., it is speaker-independent (Speaker-Independent, SI); the training method for the speaker-independent speech recognition model only needs to be run once, and subsequent small-sample tasks only need to use the trained model to extract the corresponding features.
The embodiments of the present disclosure also include a terminal, which uses the voice conversion system described in the first embodiment.
The terminal may be a mobile terminal, PC device, wearable device or the like equipped with an automatic voice response or prompt service system, or may be a voice robot with an automatic voice response or prompt service function, which is not limited in the embodiments of the present disclosure.
The embodiments of the present disclosure also include a computer-readable storage medium in which a computer program is stored, wherein the computer program is configured, when run, to execute the method described in Embodiment A4 above to perform voice conversion, and module training uses the methods described in Embodiments A1-A3 above.
The present disclosure also provides a voice conversion system, including: a speech recognition model, including at least a bottleneck layer, the speech recognition model being configurable to convert the Mel cepstrum features of the input source speech into bottleneck features through the bottleneck layer, and to input the bottleneck features from the bottleneck layer into an attention voice-changing network.
The technical solutions provided by the embodiments of the present disclosure are applied in scenarios where the speech spoken by a source speaker (the source speech) needs to be converted into speech that matches the target speaker; that is, the source speech is the speech spoken by the source speaker before the conversion begins. In practical applications, the speech recognition model is trained from a large number of training corpora, and the trained speech recognition model can be applied to any source speaker; in other words, the speech recognition model is independent.
Specifically, the speech recognition model may include a five-layer DNN structure, in which the third or fourth layer may be the bottleneck layer. Optionally, the bottleneck layer is placed at the fourth layer of the ASR model to avoid interference from timbre information that may be present at other positions.
The attention voice-changing network may be configured to convert the input bottleneck features into Mel cepstrum features that match the target speech.
In this embodiment, the target speech is the speech spoken by the target speaker. Therefore, in order to convert the source speech into speech that matches the target speech, the attention voice-changing network is used to convert the bottleneck features of the source speech into Mel cepstrum features that match the target speech.
The attention voice-changing network may include a one-layer bidirectional RNN structure.
The technical solution provided by the present disclosure also includes a neural network vocoder, which may be configured to convert the Mel cepstrum features that match the target speech into an imitation target speech and output it; the imitation target speech is the audio generated by converting the source speech.
In practical applications, the neural network vocoder can convert Mel cepstrum features into an audio output, and the generated audio is the imitation target speech. That is to say, the imitation target speech is speech that matches the target speech, and voice conversion is realized through the above process.
The present disclosure also provides a training method for a voice conversion system, applied to the voice conversion system described in the above embodiments, including:
S1: Convert the characters in a multi-person speech recognition training corpus into numbers according to the mapping relationship between the characters in the multi-person speech recognition training corpus and character codes.
In the technical solution of the present disclosure, character encoding refers to a conversion form that converts any input character into a fixed form. Note that Chinese characters need to be converted into corresponding phonemes so that neural network calculations can be performed; the character encoding may be ASCII code or may take other forms, which is not specifically limited in the present disclosure.
S2: Input the converted numbers and the Mel cepstrum features of the multi-person speech recognition training corpus into the speech recognition model.
S3: Run the back-propagation algorithm.
S4: Perform iterative optimization until the speech recognition model converges, so as to train the speech recognition model.
Training the speech recognition model is the process of establishing the connection between the model and the training corpus. In this embodiment, the specific text content of the training corpus is not limited by the present disclosure.
The method also includes preprocessing the multi-person speech recognition training corpus; the preprocessing includes de-blanking and normalization. In this embodiment, the de-blanking process can remove overly long pauses and silences from the training corpus, thereby improving its quality. The normalization process normalizes the volume of the training corpus: if the volume of the training corpus is sometimes loud and sometimes quiet, the training effect will be affected, so normalization keeps the volume of the training corpus within a certain range; the specific range can be designed according to the actual situation and is not specifically limited by the present disclosure.
The method also includes:
S5: Convert the Mel cepstrum features of the target speech into bottleneck features.
S6: Input the bottleneck features of the target speech into the attention voice-changing network; input the Mel cepstrum features of the target speech into the attention voice-changing network as the ground truth.
S7: Train the attention voice-changing network using a deep transfer learning method.
In this embodiment, the target speech refers to the speech spoken by the target speaker during the training process. Training the attention voice-changing network with the target speech establishes the connection between the target speech and the attention voice-changing network, so that the attention voice-changing network can convert the bottleneck features of the source speech into Mel cepstrum features that match the target speech.
In some embodiments of the present disclosure, the step of converting the Mel cepstrum features of the target speech into bottleneck features is performed by the pre-trained speech recognition model.
The method also includes:
S8: Input the Mel cepstrum features of the target speech and the target speech into the neural network vocoder.
S9: Train the neural network vocoder using a deep transfer learning method.
In this embodiment, the neural network vocoder is trained through the relationship between the Mel cepstrum features of the target speech and the sound signal of the target speech, so that the neural network vocoder can convert Mel cepstrum features that match the target speech into audio and output it.
The method also includes preprocessing the target speech; the preprocessing includes de-blanking and normalization.
In this embodiment, de-blanking and normalization can prevent overly long pauses, silences, and excessively loud or quiet volume in the audio from affecting the subsequent training of the attention voice-changing network and the neural network vocoder.
The method also includes parameter extraction to obtain the Mel cepstrum features of the multi-person speech recognition training corpus, the Mel cepstrum features of the target speech, and the Mel cepstrum features of the source speech.
It can be seen from the above technical solutions that the voice conversion system and the training method for the voice conversion system provided by the present disclosure can, based on the source speech spoken by any source speaker and the target speech spoken by the target speaker, convert the source speech into an audio output that matches the target speech, and are highly practical.
The above embodiments are only used to illustrate the technical solutions of the present disclosure and not to limit them. Although the present disclosure has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements can be made to the technical solutions of the present disclosure without departing from the spirit and scope of the technical solutions of the present disclosure, and all of these shall be covered by the scope of the claims of the present disclosure. Techniques, shapes and constructions not described in detail in the present disclosure are all known technologies.

Claims (10)

  1. A voice conversion system, comprising:
    a speech recognition model, including at least a bottleneck layer, the speech recognition model being configured to: convert the Mel cepstrum features of an input source speech into bottleneck features through the bottleneck layer; and input the bottleneck features from the bottleneck layer into an attention voice-changing network;
    the attention voice-changing network being configured to: convert the input bottleneck features into Mel cepstrum features that match a target speech; and
    a neural network vocoder, the neural network vocoder being configured to: convert the Mel cepstrum features that match the target speech into an imitation target speech and output it, the imitation target speech being the audio generated by converting the source speech.
  2. The voice conversion system according to claim 1, wherein the speech recognition model comprises a five-layer DNN structure, in which the third or fourth layer is the bottleneck layer.
  3. The voice conversion system according to claim 1, wherein the attention voice-changing network comprises a one-layer bidirectional RNN structure.
  4. A training method for a voice conversion system, applied to the voice conversion system according to any one of claims 1-3, comprising:
    converting the characters in a multi-person speech recognition training corpus into numbers according to the mapping relationship between the characters in the multi-person speech recognition training corpus and character codes;
    inputting the converted numbers and the Mel cepstrum features of the multi-person speech recognition training corpus into the speech recognition model;
    running the back-propagation algorithm; and
    performing iterative optimization until the speech recognition model converges, so as to train the speech recognition model.
  5. The training method for a voice conversion system according to claim 4, wherein the method further comprises preprocessing the multi-person speech recognition training corpus, the preprocessing including de-blanking and normalization.
  6. The training method for a voice conversion system according to claim 4, wherein the method further comprises:
    converting the Mel cepstrum features of a target speech into bottleneck features;
    inputting the bottleneck features of the target speech into the attention voice-changing network, and inputting the Mel cepstrum features of the target speech into the attention voice-changing network as the ground truth; and
    training the attention voice-changing network using a deep transfer learning method.
  7. The training method for a voice conversion system according to claim 6, wherein the step of converting the Mel cepstrum features of the target speech into bottleneck features is performed by the pre-trained speech recognition model.
  8. The training method for a voice conversion system according to claim 6, wherein the method further comprises:
    inputting the Mel cepstrum features of the target speech and the target speech into the neural network vocoder; and
    training the neural network vocoder using a deep transfer learning method.
  9. The training method for a voice conversion system according to any one of claims 6-8, wherein the method further comprises preprocessing the target speech, the preprocessing including de-blanking and normalization.
  10. The training method for a voice conversion system according to claim 4, wherein the method further comprises: parameter extraction to obtain the Mel cepstrum features of the multi-person speech recognition training corpus, the Mel cepstrum features of the target speech, and the Mel cepstrum features of the source speech.
PCT/CN2021/088507 2020-10-21 2021-04-20 A voice conversion system and a training method for the voice conversion system WO2022083083A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21749056.4A EP4016526B1 (en) 2020-10-21 2021-04-20 Sound conversion system and training method for same
US17/430,793 US11875775B2 (en) 2020-10-21 2021-04-20 Voice conversion system and training method therefor

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011129857.5A CN112017644B (zh) 2020-10-21 2020-10-21 一种声音变换系统、方法及应用
CN202011129857.5 2020-10-21

Publications (1)

Publication Number Publication Date
WO2022083083A1 true WO2022083083A1 (zh) 2022-04-28

Family

ID=73527418

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/088507 WO2022083083A1 (zh) 2020-10-21 2021-04-20 一种声音变换系统以及声音变换系统的训练方法

Country Status (4)

Country Link
US (1) US11875775B2 (zh)
EP (1) EP4016526B1 (zh)
CN (1) CN112017644B (zh)
WO (1) WO2022083083A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115132204A (zh) * 2022-06-10 2022-09-30 腾讯科技(深圳)有限公司 一种语音处理方法、设备、存储介质及计算机程序产品

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112017644B (zh) 2020-10-21 2021-02-12 南京硅基智能科技有限公司 一种声音变换系统、方法及应用
CN112489629A (zh) * 2020-12-02 2021-03-12 北京捷通华声科技股份有限公司 语音转写模型、方法、介质及电子设备
CN112767958B (zh) * 2021-02-26 2023-12-26 华南理工大学 一种基于零次学习的跨语种音色转换系统及方法
CN113345452B (zh) * 2021-04-27 2024-04-26 北京搜狗科技发展有限公司 语音转换方法、语音转换模型的训练方法、装置和介质
CN113327583A (zh) * 2021-05-24 2021-08-31 清华大学深圳国际研究生院 一种基于ppg一致性的最优映射跨语言音色转换方法及系统
CN113345431A (zh) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 跨语言语音转换方法、装置、设备及介质
CN113724696A (zh) * 2021-08-09 2021-11-30 广州佰锐网络科技有限公司 一种语音关键词的识别方法及系统
CN113838452B (zh) * 2021-08-17 2022-08-23 北京百度网讯科技有限公司 语音合成方法、装置、设备和计算机存储介质
CN113689866B (zh) * 2021-08-18 2023-04-25 北京百度网讯科技有限公司 一种语音转换模型的训练方法、装置、电子设备及介质
CN113724690B (zh) * 2021-09-01 2023-01-03 宿迁硅基智能科技有限公司 Ppg特征的输出方法、目标音频的输出方法及装置
CN113724718B (zh) 2021-09-01 2022-07-29 宿迁硅基智能科技有限公司 目标音频的输出方法及装置、系统
CN113763987A (zh) * 2021-09-06 2021-12-07 中国科学院声学研究所 一种语音转换模型的训练方法及装置
CN113674735B (zh) * 2021-09-26 2022-01-18 北京奇艺世纪科技有限公司 声音转换方法、装置、电子设备及可读存储介质
CN114360557B (zh) * 2021-12-22 2022-11-01 北京百度网讯科技有限公司 语音音色转换方法、模型训练方法、装置、设备和介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090222258A1 (en) * 2008-02-29 2009-09-03 Takashi Fukuda Voice activity detection system, method, and program product
US20150161994A1 (en) * 2013-12-05 2015-06-11 Nuance Communications, Inc. Method and Apparatus for Speech Recognition Using Neural Networks with Speaker Adaptation
CN108777140A (zh) * 2018-04-27 2018-11-09 南京邮电大学 一种非平行语料训练下基于vae的语音转换方法
CN109671423A (zh) * 2018-05-03 2019-04-23 南京邮电大学 训练数据有限情形下的非平行文本语音转换方法
CN111680591A (zh) * 2020-05-28 2020-09-18 天津大学 一种基于特征融合和注意力机制的发音反演方法
CN112017644A (zh) * 2020-10-21 2020-12-01 南京硅基智能科技有限公司 一种声音变换系统、方法及应用

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10476872B2 (en) * 2015-02-20 2019-11-12 Sri International Joint speaker authentication and key phrase identification
US10176819B2 (en) * 2016-07-11 2019-01-08 The Chinese University Of Hong Kong Phonetic posteriorgrams for many-to-one voice conversion
EP4007997B1 (en) * 2019-08-03 2024-03-27 Google LLC Controlling expressivity in end-to-end speech synthesis systems

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090222258A1 (en) * 2008-02-29 2009-09-03 Takashi Fukuda Voice activity detection system, method, and program product
US20150161994A1 (en) * 2013-12-05 2015-06-11 Nuance Communications, Inc. Method and Apparatus for Speech Recognition Using Neural Networks with Speaker Adaptation
CN108777140A (zh) * 2018-04-27 2018-11-09 南京邮电大学 一种非平行语料训练下基于vae的语音转换方法
CN109671423A (zh) * 2018-05-03 2019-04-23 南京邮电大学 训练数据有限情形下的非平行文本语音转换方法
CN111680591A (zh) * 2020-05-28 2020-09-18 天津大学 一种基于特征融合和注意力机制的发音反演方法
CN112017644A (zh) * 2020-10-21 2020-12-01 南京硅基智能科技有限公司 一种声音变换系统、方法及应用

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4016526A4

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115132204A (zh) * 2022-06-10 2022-09-30 腾讯科技(深圳)有限公司 一种语音处理方法、设备、存储介质及计算机程序产品
CN115132204B (zh) * 2022-06-10 2024-03-22 腾讯科技(深圳)有限公司 一种语音处理方法、设备、存储介质及计算机程序产品

Also Published As

Publication number Publication date
EP4016526A1 (en) 2022-06-22
US20220310063A1 (en) 2022-09-29
CN112017644B (zh) 2021-02-12
CN112017644A (zh) 2020-12-01
EP4016526B1 (en) 2024-02-21
US11875775B2 (en) 2024-01-16
EP4016526A4 (en) 2022-06-22

Similar Documents

Publication Publication Date Title
WO2022083083A1 (zh) 一种声音变换系统以及声音变换系统的训练方法
CN112767958B (zh) 一种基于零次学习的跨语种音色转换系统及方法
CN110827801B (zh) 一种基于人工智能的自动语音识别方法及系统
Lin et al. A unified framework for multilingual speech recognition in air traffic control systems
Ghai et al. Literature review on automatic speech recognition
CN113470662A (zh) 生成和使用用于关键词检出系统的文本到语音数据和语音识别系统中的说话者适配
CN112581963B (zh) 一种语音意图识别方法及系统
KR20230056741A (ko) 목소리 변환 및 스피치 인식 모델을 사용한 합성 데이터 증강
JP7335569B2 (ja) 音声認識方法、装置及び電子機器
CN111710326A (zh) 英文语音的合成方法及系统、电子设备及存储介质
Bachate et al. Automatic speech recognition systems for regional languages in India
CN113744722A (zh) 一种用于有限句库的离线语音识别匹配装置与方法
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
CN115836300A (zh) 用于文本到语音的自训练WaveNet
Zhao et al. Research on voice cloning with a few samples
CN114974218A (zh) 语音转换模型训练方法及装置、语音转换方法及装置
CN112216270B (zh) 语音音素的识别方法及系统、电子设备及存储介质
Ajayi et al. Systematic review on speech recognition tools and techniques needed for speech application development
Gupta et al. An Automatic Speech Recognition System: A systematic review and Future directions
CN115410596A (zh) 构音异常语料扩增方法及系统、语音辨识平台,及构音异常辅助装置
Tzudir et al. Low-resource dialect identification in Ao using noise robust mean Hilbert envelope coefficients
Xiao et al. Automatic voice query service for multi-accented mandarin speech
JP7146038B2 (ja) 音声認識システム及び方法
Vyas et al. Study of Speech Recognition Technology and its Significance in Human-Machine Interface
Swamy Speech Enhancement, Databases, Features and Classifiers in Automatic Speech Recognition: A Review

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021749056

Country of ref document: EP

Effective date: 20210811

NENP Non-entry into the national phase

Ref country code: DE