CN111462769B - End-to-end accent conversion method - Google Patents


Info

Publication number
CN111462769B
Authority
CN
China
Prior art keywords
accent
speaker
channel
speech
voice
Prior art date
Legal status
Active
Application number
CN202010239586.2A
Other languages
Chinese (zh)
Other versions
CN111462769A (en)
Inventor
刘颂湘
王迪松
曹悦雯
孙立发
吴锡欣
康世胤
吴志勇
刘循英
蒙美玲
Current Assignee
Shenzhen Dadan Shusheng Technology Co ltd
Original Assignee
Shenzhen Dadan Shusheng Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Dadan Shusheng Technology Co., Ltd.
Priority to CN202010239586.2A
Publication of CN111462769A
Application granted
Publication of CN111462769B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 Changing voice quality, e.g. pitch or formants
    • G10L 21/007 Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L 21/013 Adapting to target pitch
    • G10L 2021/0135 Voice conversion or morphing
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/0018 Speech coding using phonetic or linguistic decoding of the source; Reconstruction using text-to-speech synthesis
    • G10L 19/04 Analysis-synthesis using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Abstract

The invention discloses an end-to-end accent conversion method for converting non-native accents into native accents. It belongs to the technical field of voice processing and can also be used to convert the speech of patients with dysarthria into standard speech. The signal parameters derived from the non-native-accent speech and a speaker vector are input to a speech synthesis module, and the output of the speech synthesis module finally passes through a neural network vocoder to synthesize native-accent speech of a specific speaker. The beneficial effect is that non-native accents can be converted into native accents without any guidance from native-accent reference audio during the conversion process, while the speaker's original timbre is maintained.

Description

End-to-end accent conversion method
Technical Field
The invention relates to the technical field of voice processing, in particular to an end-to-end accent conversion method.
Background
Speech recognition technology is widely applied. Existing speech recognition systems are basically built on standard pronunciation of the national language, and they convert a speaker's standard speech into text with high accuracy. In real life, however, most people's pronunciation is not standard and carries a local accent to a greater or lesser degree; in order to communicate with them better, the non-native accent needs to be converted into a native accent. In addition, many patients with dysarthria currently cannot communicate normally with others, and converting their non-standard pronunciation into standard pronunciation is particularly important for their daily communication. The traditional accent conversion method converts the speaker identity of native-accent source speech into the identity of the non-native speaker, i.e. only the timbre is changed while the underlying content and pronunciation remain unchanged. Because native-accent reference audio is required during the conversion phase, such methods are difficult to use in real life. This is therefore a problem that needs to be solved at present.
Disclosure of Invention
The invention aims to provide an end-to-end accent conversion method, so as to solve the problem that conventional methods require native-accent reference audio when converting a non-native accent into a native accent.
In order to solve the above technical problem, the technical scheme of the invention is as follows: an end-to-end accent conversion method, comprising an accent conversion system for realizing the accent conversion method, wherein the accent conversion system comprises a speech recognition module, a speaker encoder, a speech synthesis module and a neural network vocoder; the speech recognition module is used to adjust the acoustic features of the input non-native-accent speech into signal parameters of the native accent, and the signal parameters are related only to the spoken content of the non-native-accent speech; the signal parameters and the speaker vector are input to the speech synthesis module, and the output of the speech synthesis module finally passes through the neural network vocoder to synthesize native-accent speech of the specific speaker.
As a preferred embodiment of the present invention, the speaker encoder is a scalable speaker-verification neural network framework that converts the acoustic frames generated from input speech of arbitrary length into a fixed-dimensional speaker embedding vector that is related only to the speaker.
As a preferred embodiment of the present invention, the speech synthesis module uses the speaker embedding vector and a mean square error loss L_TTS to train the speech synthesis model so that it generates only native-accent speech in the timbre corresponding to the speaker embedding.
As a preferred embodiment of the present invention, the speech recognition module is configured to adjust the acoustic features of the non-native accent into the signal parameters of the native accent.
As a preferred embodiment of the present invention, the neural network vocoder is a WaveRNN network, an LPCNet or a WaveNet.
As a preferred embodiment of the present invention, the method comprises the steps of: a. collecting speech information and corresponding text information of a plurality of speakers, and training the speaker encoder and the speech synthesis module; b. screening out a speaker with a non-native accent as the target speaker; c. using the text samples and speech samples of the target speaker for training of the speech recognition module.
The beneficial effect of adopting the above technical scheme is that the end-to-end accent conversion method provided by the invention can convert non-native accents into native accents without any guidance from native-accent reference audio during the conversion process, while keeping the speaker's original timbre.
Drawings
FIG. 1 is a schematic diagram of a training phase of the present invention;
FIG. 2 is a schematic diagram of the conversion phase of the present invention;
FIG. 3 is a graph showing the mean opinion score results during the experimental stage of the present invention;
FIG. 4 is a schematic diagram of the test results of the present invention at the experimental stage.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings. The description of these embodiments is provided to assist understanding of the present invention, but is not intended to limit the present invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
This embodiment provides an end-to-end accent conversion method, which comprises an accent conversion system for realizing the accent conversion method. The accent conversion system comprises a speech recognition module, a speaker encoder, a speech synthesis module and a neural network vocoder. The speech recognition module is used to adjust the acoustic features of the input non-native-accent speech into signal parameters of the native accent, and the signal parameters are related only to the spoken content of the non-native-accent speech. The signal parameters and the speaker vector are input into the speech synthesis module, and the output of the speech synthesis module finally passes through the neural network vocoder to synthesize native-accent speech of the specific speaker. In short, the speech recognition module adjusts the acoustic features of the non-native accent into the signal parameters of the native accent.
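The following is a minimal sketch of how the four modules could be chained at conversion time. The class and method names (AccentConversionPipeline, convert, and the four module callables) are illustrative assumptions, not the patent's implementation.

```python
# Sketch of the four-module conversion pipeline described above. The module
# objects are hypothetical callables, not the patent's actual models.
import torch

class AccentConversionPipeline:
    def __init__(self, speaker_encoder, asr_model, tts_model, vocoder):
        self.speaker_encoder = speaker_encoder  # acoustic frames -> speaker embedding
        self.asr_model = asr_model              # acoustic frames -> linguistic representation
        self.tts_model = tts_model              # linguistic repr. + speaker embedding -> mel spectrogram
        self.vocoder = vocoder                  # mel spectrogram -> waveform

    @torch.no_grad()
    def convert(self, acoustic_feats: torch.Tensor, accent_embedding: torch.Tensor) -> torch.Tensor:
        """acoustic_feats: (T, D) frames computed from non-native speech."""
        spk_emb = self.speaker_encoder(acoustic_feats)            # keeps the original timbre
        asr_in = torch.cat([acoustic_feats,
                            accent_embedding.expand(acoustic_feats.size(0), -1)], dim=-1)
        linguistic = self.asr_model(asr_in)                       # accent-independent content
        mel = self.tts_model(linguistic, spk_emb)                 # native-accent mel spectrogram
        return self.vocoder(mel)                                  # time-domain waveform
```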
The accent conversion method mainly comprises the following steps: a. collecting speech information and corresponding text information of a plurality of speakers, and training the speaker encoder and the speech synthesis module; b. screening out a speaker with a non-native accent as the target speaker; c. using the text samples and speech samples of the target speaker for training of the speech recognition module.
The speaker encoder is a scalable speaker-verification neural network framework that generates variable-length acoustic features from input speech of arbitrary length and converts them into a fixed-dimensional speaker embedding vector. The speaker encoder model used here is a scalable and highly accurate speaker-verification neural network that produces a fixed-dimensional speaker embedding from a series of acoustic frames computed from an utterance of arbitrary length. The method uses the speaker embedding to condition the TTS model on a reference speech signal of the target speaker, so that the generated speech carries the target speaker's identity. The speaker encoder is trained by optimizing a generalized end-to-end (GE2E) speaker verification loss, so that embeddings of utterances from the same speaker have high cosine similarity while embeddings of utterances from different speakers are far apart in the embedding space. The speaker encoder is expected to learn a representation relevant to speech synthesis that captures the characteristics of non-native-accent speakers not seen during training.
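The GE2E objective mentioned above can be sketched as follows. This is a simplified illustration (the scaling parameters w and b are fixed rather than learned, and an utterance is not excluded from its own speaker centroid), not the exact loss used in the patent.

```python
# Simplified sketch of a generalized end-to-end (GE2E) speaker-verification loss.
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(embeddings: torch.Tensor, w: float = 10.0, b: float = -5.0) -> torch.Tensor:
    """embeddings: (num_speakers, utts_per_speaker, emb_dim), L2-normalized per utterance."""
    n_spk, n_utt, _ = embeddings.shape
    centroids = F.normalize(embeddings.mean(dim=1), dim=-1)   # (n_spk, emb_dim)
    flat = embeddings.reshape(n_spk * n_utt, -1)              # speaker-major flattening
    sim = w * flat @ centroids.t() + b                        # scaled cosine similarities
    labels = torch.arange(n_spk).repeat_interleave(n_utt)     # true speaker of each utterance
    # Pull each utterance toward its own speaker's centroid, push it away from others.
    return F.cross_entropy(sim, labels)
```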
The speech synthesis module uses the speaker embedding vector and a mean square error loss L_TTS to train the speech synthesis model so that it generates only native-accent speech in the corresponding timbre. The model is an attention-based encoder-decoder model that supports multiple speakers. As shown in FIG. 1, the speaker embedding vector of the desired target speaker, calculated by the speaker encoder, is concatenated with the TTS decoder output at each time step, and the attention-based TTS decoder uses it as an additional input to generate the mel spectrogram. The TTS model is trained with the mean square error (MSE) loss L_TTS to generate only native-accent speech whose identity/timbre is determined by the speaker embedding. We map the text transcript into a phoneme sequence as input to the TTS model, since it has been shown that using phoneme sequences increases convergence speed and improves the pronunciation of rare words and proper nouns.
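One way this per-step speaker conditioning could look is sketched below; the layer sizes and the module name SpeakerConditionedMelProjection are illustrative assumptions, not the patent's exact configuration.

```python
# Sketch: concatenate the speaker embedding with the TTS decoder output at each
# time step before projecting to a mel frame, and apply an MSE loss (L_TTS).
import torch
import torch.nn as nn

class SpeakerConditionedMelProjection(nn.Module):
    def __init__(self, dec_dim=1024, spk_dim=256, n_mels=80):
        super().__init__()
        self.mel_proj = nn.Linear(dec_dim + spk_dim, n_mels)

    def forward(self, dec_out: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # dec_out: (B, T_dec, dec_dim); spk_emb: (B, spk_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, dec_out.size(1), -1)
        return self.mel_proj(torch.cat([dec_out, spk], dim=-1))  # (B, T_dec, n_mels)

mse = nn.MSELoss()
# L_TTS = mse(predicted_mel, target_mel)  # MSE over the generated mel spectrogram
```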
The multi-task ASR model is additionally connected to a fully-connected conversion layer for calculating a connectionist temporal classification loss L_CTC. We learn an accent-independent linguistic representation from the acoustic features using the accent ASR model. The ASR model uses an end-to-end attention-based encoder-decoder framework. Given a pair of audio and its phoneme transcription, the TTS decoder calculates speech features from the audio and a linguistic representation from the phoneme sequence, respectively. We add a fully-connected (FC) conversion layer to the ASR encoder and calculate a connectionist temporal classification (CTC) loss L_CTC to stabilize the training process. Since the training data of the ASR model contains accented speech, we concatenate the accent embedding with the acoustic features of each frame as input to the ASR model and add an accent classifier on top of the ASR encoder, making the model more robust when recognizing accented speech. We assume that different accents are associated with different speakers; a speaker's accent embedding is obtained here by averaging all of that speaker's embeddings. The output of the accent classifier is used to calculate a cross-entropy loss L_ACC. The phoneme labels and linguistic representations in the two streams are predicted using the attention-based ASR decoder. A cross-entropy loss L_CE is used for phoneme label prediction, and an MSE loss L_TTSE is used to measure the linguistic difference between the TTS decoder output hl and the ASR decoder output Hl.
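The accent embedding and the frame-wise concatenation described above could be computed as in the following sketch; the function and variable names are assumptions for illustration.

```python
# Sketch: form an accent embedding by averaging the speaker encoder's embeddings
# over all of a speaker's utterances, then attach it to every acoustic frame.
import torch

def accent_embedding_from_utterances(utt_embeddings: torch.Tensor) -> torch.Tensor:
    """utt_embeddings: (num_utterances, emb_dim) from the speaker encoder."""
    return utt_embeddings.mean(dim=0)                      # (emb_dim,)

def attach_accent_embedding(acoustic_feats: torch.Tensor,
                            accent_emb: torch.Tensor) -> torch.Tensor:
    """acoustic_feats: (T, feat_dim); returns (T, feat_dim + emb_dim) ASR input."""
    tiled = accent_emb.unsqueeze(0).expand(acoustic_feats.size(0), -1)
    return torch.cat([acoustic_feats, tiled], dim=-1)
```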
The neural network vocoder may be a WaveRNN network, LPCNet or WaveNet. A WaveRNN network is preferably employed here as the neural network vocoder, and training is carried out using an open-source PyTorch implementation. Since the mel spectrogram captures all the relevant details required for high-quality speech synthesis, we train the WaveRNN using only mel spectrograms from multiple speakers, without adding any speaker embeddings.
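A vocoder training step conditioned only on mel spectrograms (with no speaker embedding) could look like the sketch below; the vocoder interface, a model returning per-sample logits over quantized waveform values, is an assumption and does not reproduce the actual WaveRNN implementation.

```python
# Sketch of a mel-only conditioned vocoder training step.
import torch
import torch.nn.functional as F

def vocoder_training_step(vocoder, optimizer, mel, waveform_targets):
    """mel: (B, n_mels, T_frames); waveform_targets: (B, T_samples) quantized sample ids."""
    optimizer.zero_grad()
    logits = vocoder(mel)                                   # assumed shape (B, T_samples, n_classes)
    loss = F.cross_entropy(logits.transpose(1, 2), waveform_targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```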
In the training phase, as shown in FIG. 1, we first train the speaker encoder model. Then the multi-speaker TTS model is trained with the loss L_TTS using only native English speech data. After that, the accent conversion (AC) ASR model is first pre-trained using speech data from a plurality of native speakers and one non-native speaker, and is then fine-tuned using only the speech data of the non-native target speaker. In both stages the ASR model is trained with the multi-task loss in Equation 2, and the WaveRNN is trained using only speech data of native-accent speakers. In the accent conversion phase, as shown in FIG. 2, acoustic features are first calculated from the non-native-accent speech. The speaker encoder then receives the acoustic features and outputs a speaker embedding vector representing the identity of the non-native-accent speaker. The accent embedding is the average embedding of the non-native-accent speaker. We concatenate the accent embedding with the acoustic features of each frame and feed the result into the ASR model to generate the linguistic representation Hl; the attention-based TTS decoder then combines this linguistic representation with the speaker embedding to generate native-accent acoustic features. Finally, we use the WaveRNN model to transform the acoustic features into a time-domain waveform, which is expected to present the accent more naturally.
The accent conversion method comprises the following steps: a. collecting speech information of a plurality of speakers; b. screening out the speaker with the Hindi-accented (non-native) speech as the target speaker; c. using part of the target speaker's speech samples for training of the multi-speaker TTS model and the neural network vocoder, and another part for training of the multi-task ASR model.
During the experimental stage, we used a speech corpus containing 109 speakers and about 44 hours of clean speech. In this study we selected speaker p248, who has a Hindi accent, as the target speaker; speakers without the Hindi accent are regarded as native English speakers. The audio is resampled to 22.05 kHz. For training of the TTS model and the WaveRNN model, we only use speech data of 105 native speakers. The mel spectrogram has 80 channels and is calculated with a 50 ms window and a 12.5 ms frame shift. 1000 samples were randomly drawn as a validation set, and the remaining samples were used for training. For training of the accent ASR model, the native-accent speech data and the speech data of p248 are used; a 40-channel mel spectrogram calculated with a 25 ms window and a 10 ms shift, together with its delta features, is used as the acoustic feature. The numbers of p248 utterances used for training, validation and testing are 326, 25 and 25, respectively. The SI-ASR model used for extracting PPGs (phonetic posteriorgrams) was trained on the TIMIT dataset.
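The two feature front-ends described above could be computed with torchaudio as sketched below; the FFT sizes and the log compression are assumptions, since the text only specifies window lengths, hops and channel counts.

```python
# Sketch of the TTS/vocoder and ASR feature front-ends at 22.05 kHz.
import torch
import torchaudio

SR = 22050

# TTS/vocoder features: 80-channel mel spectrogram, 50 ms window, 12.5 ms hop.
tts_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR, n_fft=2048,
    win_length=int(0.050 * SR), hop_length=int(0.0125 * SR), n_mels=80)

# ASR features: 40-channel mel spectrogram, 25 ms window, 10 ms hop, plus deltas.
asr_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR, n_fft=1024,
    win_length=int(0.025 * SR), hop_length=int(0.010 * SR), n_mels=40)

def asr_features(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, T_samples) at 22.05 kHz -> (T_frames, 80) log-mel + delta features."""
    mel = asr_mel(waveform).clamp(min=1e-5).log()           # (1, 40, T_frames)
    delta = torchaudio.functional.compute_deltas(mel)       # (1, 40, T_frames)
    return torch.cat([mel, delta], dim=1).squeeze(0).t()    # (T_frames, 80)
```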
FIG. 1 shows the training phase of the method. Hs is the speaker representation, hl is the linguistic representation of the TTS decoder, and Hl is the linguistic representation of the ASR decoder.
FIG. 2 shows the conversion phase of the method. Hs and Hl represent the speaker representation and the linguistic representation, respectively. L_TTSE is expressed as follows:
L_TTSE = (1/N) Σ_{n=1}^{N} || hl(n) − Hl(n) ||²
where N is the number of training samples. The ASR model is trained with a multi-task loss: L_ASR = λ1·L_CE + λ2·L_TTSE + λ3·L_CTC + λ4·L_ACC (Equation 2), where the λ are hyper-parameters weighting the four losses.
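The multi-task objective of Equation 2 could be assembled as in the following sketch; the tensor shapes and helper names are assumptions, and only the weighted combination of the four losses follows the text.

```python
# Sketch of the multi-task ASR objective L_ASR = λ1·L_CE + λ2·L_TTSE + λ3·L_CTC + λ4·L_ACC.
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
ce_loss = nn.CrossEntropyLoss()
mse_loss = nn.MSELoss()

def asr_multitask_loss(phone_logits, phone_labels,                     # (N, C), (N,) flattened
                       ctc_log_probs, ctc_targets, in_lens, tgt_lens,  # log_probs: (T, B, C)
                       h_l_tts, h_l_asr,                               # linguistic representations
                       accent_logits, accent_labels,                   # (B, n_accents), (B,)
                       lambdas=(0.5, 0.1, 0.5, 0.1)):
    l_ce = ce_loss(phone_logits, phone_labels)                # phoneme label prediction
    l_ttse = mse_loss(h_l_asr, h_l_tts)                       # match the TTS-side representation
    l_ctc = ctc_loss(ctc_log_probs, ctc_targets, in_lens, tgt_lens)
    l_acc = ce_loss(accent_logits, accent_labels)             # accent classification
    l1, l2, l3, l4 = lambdas
    return l1 * l_ce + l2 * l_ttse + l3 * l_ctc + l4 * l_acc
```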
The speaker encoder is a 3-layer LSTM with 256 hidden nodes followed by a 256-unit projection layer. The output is the L2-normalized hidden state of the last layer, a vector of 256 elements. The TTS model adopts the same architecture as in the referenced work. The ASR encoder is a 5-layer bi-directional LSTM (BLSTM) with 320 units per direction. 300-dimensional location-aware attention is used in the attention layer. The ASR decoder is a single-layer LSTM with 320 units. The accent classifier is a 2-layer 1-dimensional convolutional network with 128 channels and a kernel size of 3, followed by an average pooling layer and a final FC output layer. In Equation 2, λ1 = 0.5, λ2 = 0.1, λ3 = 0.5 and λ4 = 0.1 are chosen heuristically to bring the four loss terms to a similar numerical scale. Transform 1 and Transform 2 in the baseline approach are 2-layer and 4-layer BLSTMs, respectively, with 128 units per direction.
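Two of the components whose sizes are given above, the speaker encoder and the accent classifier, could be written as in the following sketch; details the text does not specify (input feature dimension, number of accent classes, padding) are assumptions.

```python
# Sketch of the 3-layer LSTM speaker encoder with 256-unit projection and
# L2-normalized output, and the 2-layer 1-D convolutional accent classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, frames):                      # frames: (B, T, feat_dim)
        _, (h, _) = self.lstm(frames)
        emb = self.proj(h[-1])                      # final hidden state of the last layer
        return F.normalize(emb, dim=-1)             # L2-normalized 256-dim embedding

class AccentClassifier(nn.Module):
    def __init__(self, in_dim=640, n_accents=2):    # 640 assumes a 320-unit BLSTM encoder
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU())
        self.out = nn.Linear(128, n_accents)

    def forward(self, enc_states):                  # enc_states: (B, T, in_dim)
        x = self.convs(enc_states.transpose(1, 2))  # (B, 128, T)
        x = x.mean(dim=-1)                          # average pooling over time
        return self.out(x)
```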
The speaker encoder model was trained for 1000k steps using the Adam optimizer with a batch size of 640 and a learning rate of 0.0001. The TTS model was trained for 100k steps using the Adam optimizer with a batch size of 16 and a learning rate of 0.001. The accent ASR model is first pre-trained for 160k steps using an Adadelta optimizer with a batch size of 16 on the speech data from the native speakers and p248. Then, keeping the batch size and learning rate unchanged, it is fine-tuned for another 5k steps using only the speech data from p248.
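The optimizer settings listed above could be set up as in this sketch; the model objects are placeholders, and the Adadelta learning rate, which the text does not state, is left at the library default.

```python
# Sketch of the optimizer configuration for the three separately trained models.
import torch

def build_optimizers(speaker_encoder, tts_model, asr_model):
    spk_opt = torch.optim.Adam(speaker_encoder.parameters(), lr=1e-4)  # 1000k steps, batch size 640
    tts_opt = torch.optim.Adam(tts_model.parameters(), lr=1e-3)        # 100k steps, batch size 16
    asr_opt = torch.optim.Adadelta(asr_model.parameters())             # 160k pre-train + 5k fine-tune, batch size 16
    return spk_opt, tts_opt, asr_opt
```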
As shown in FIG. 3, (a) gives the mean opinion score results with 95% confidence intervals, and (b) gives the speaker similarity preference test results. "AB-BL", "P-BL" and "P-AB" denote the comparisons "ablation method vs. baseline method", "proposed method vs. baseline method" and "proposed method vs. ablation method", respectively.
As shown in FIG. 4, the accent preference test results are presented. "AB-BL", "P-BL", "P-AB", "P-L2" and "P-L1" denote the comparisons "ablation method vs. baseline method", "proposed method vs. baseline method", "proposed method vs. ablation method", "proposed method vs. non-native-accent recording" and "proposed method vs. native-accent recording", respectively.
Three perceptual listening tests were used to evaluate the conversion performance of the baseline (BL), proposed (P) and ablation (AB) systems: a mean opinion score (MOS) test of audio naturalness, a speaker similarity XAB test, and an accent AB test. We randomly selected 20 utterances from the test utterances of speaker p248 for evaluation.
Audio naturalness. In the MOS test, audio naturalness is rated on a five-point scale (from 1, bad, to 5, excellent). Audio generated by the three systems and the non-native-accent reference recordings ("L2-Ref") are randomly ordered before being presented to the listeners; each group of audio corresponds to the same text content, and the listeners may replay the audio as often as they wish. The proposed method obtains a MOS of 4.0 (the L2 reference recordings receive a MOS of 4.4), which is statistically significantly higher than the baseline method. The MOS of the proposed method is also slightly higher than that of the ablation system, which indicates that adding the accent embedding and the accent classifier helps to extract accent-independent linguistic content that benefits speech synthesis. Since we use seq2seq-based ASR and TTS models, the converted speech has pronunciation patterns closer to the native accent, for example in duration and speaking rate, which differ considerably from the source accented speech. We encourage the reader to listen to the audio samples.
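For reference, a MOS and its 95% confidence interval can be computed from raw listener ratings as in this small sketch; the ratings array is a placeholder, not data from the experiment.

```python
# Sketch: mean opinion score with a normal-approximation 95% confidence interval.
import numpy as np

def mos_with_ci(ratings, z=1.96):
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    half_width = z * ratings.std(ddof=1) / np.sqrt(len(ratings))
    return mean, (mean - half_width, mean + half_width)
```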
Speaker similarity. We compare the speaker similarity between the converted speech and the non-native-accent speech. In the XAB test, X is a non-native-accent reference sample. We present pairs of speech samples (A and B) with the same text content as the reference and ask the listeners to decide which sample is closer to the reference in timbre. The listeners may replay the audio and may choose "no preference (NP)" if they cannot tell the difference. To avoid the influence of the linguistic content, the audio is played in reverse. We can see that the baseline system achieves better similarity than the proposed system. This result is reasonable, since the speech synthesis model in the proposed approach never sees any speech data of the non-native accent; the synthesis model has to infer the speaker's timbre from the speaker embedding generated by the speaker encoder from speech alone (i.e. voice cloning). Access to speech data from more speakers would facilitate training of a more general speaker encoder: very good voice cloning performance has been achieved by training a speaker encoder with speech data from 18K speakers, but we do not have access to such a large corpus. We found no statistically significant difference between the proposed and ablation systems (p-value 0.36). In the accent AB test, we first let the participants listen to reference audio of the native accent and of the non-native accent. Pairs of speech samples (A and B) with the same text content are then presented, and the listeners are asked to choose the sample whose accent sounds more native. The results are shown in FIG. 4. According to the preference tests "P-BL" and "P-AB", the listeners are confident that the proposed method produces more native-sounding accents than the baseline and ablation methods (p < 0.001), while the ablation system also obtains better accent performance than the baseline (p < 0.001). The DTW step in the baseline approach may introduce alignment errors, and the neural-network mapping from L2-PPGs to L1-PPGs may be imperfect. From the results of "P-L2" and "P-L1" we can conclude that the proposed method removes the non-native accent from second-language speech, so that the converted speech is close to the native accent; the p-values are 2.3×10^-8 and 0.06, respectively.
An end-to-end accent conversion method is presented herein; it is the first model that can convert non-native accents into native accents without any guidance from native-accent reference audio during conversion. It consists of four independently trained neural networks: a speaker encoder, a multi-speaker TTS model, a multi-task ASR model, and a neural network vocoder. The experimental results show that the method can convert non-native-accented English speech into speech that is difficult to distinguish from native-accented speech in terms of accent. The synthesis model is expected to generate the desired timbre of the target speaker from the speaker embedding obtained from the speaker encoder. Although English accent conversion is taken as the example, the same framework may be used for accent conversion in any other language.
In addition, the end-to-end accent conversion method provided herein can also be used for pronunciation correction for patients with dysarthria; causes of dysarthria include brain injury, stroke, Parkinson's disease and amyotrophic lateral sclerosis. Pronunciation correction is particularly important for the patients' daily communication. The implementation is consistent with the steps above: the non-native-accent data only needs to be replaced by the patient's non-standard speech data, and the other training and conversion procedures are the same as those for non-native accent conversion.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, and yet fall within the scope of the invention.

Claims (6)

1. An end-to-end accent conversion method, characterized by comprising an accent conversion system for realizing the accent conversion method, wherein the accent conversion system comprises a speech recognition module, a speaker encoder, a speech synthesis module and a neural network vocoder; the speech recognition module is used to adjust the acoustic features of input non-native-accent speech into signal parameters of a native accent, and the signal parameters are related only to the spoken content of the non-native-accent speech; the signal parameters and the speaker vector are input to the speech synthesis module, and the output of the speech synthesis module finally passes through the neural network vocoder to synthesize native-accent speech of a specific speaker;
wherein the speaker embedding vector of the desired target speaker, calculated by the speaker encoder, is concatenated at each time step with a TTS decoder output of a TTS model, and the TTS decoder generates a mel spectrogram with the concatenated speaker embedding vector as an additional input;
the speech synthesis module comprises a multi-task ASR model, wherein a fully-connected conversion layer is additionally arranged on an ASR encoder of the multi-task ASR model and is used for calculating a connectionist temporal classification loss, and the multi-task ASR model calculates speech features from the audio and a linguistic representation from a phoneme sequence through the TTS decoder, respectively;
wherein an accent classifier is arranged on top of the ASR encoder, the output of the accent classifier is used for calculating a cross-entropy loss, and a cross-entropy loss is used for phoneme label prediction;
wherein the input of the multi-task ASR model is a feature formed by concatenating an accent embedding, obtained from the speaker vector, with the acoustic features of each frame.
2. The end-to-end accent conversion method of claim 1, wherein the speaker encoder is a scalable speaker-verification neural network framework that converts the acoustic frames generated from input speech of arbitrary length into a fixed-dimensional speaker embedding vector, the speaker embedding vector being related only to the speaker.
3. The end-to-end accent conversion method of claim 2, wherein the speech synthesis module trains the speech synthesis model with a mean square error loss L_TTS using the speaker embedding vector, so as to generate only native-accent speech in the timbre corresponding to the speaker embedding.
4. The end-to-end accent conversion method of claim 1, wherein the speech recognition module is configured to adjust the acoustic features of the non-native accent into the signal parameters of the native accent, wherein during training the signal parameters of the native accent are extracted from the non-native-accent speech of the plurality of speakers in the speech synthesis module, and the signal parameters of the native accent are related only to the speech content and are independent of the speaker.
5. The end-to-end accent conversion method of claim 1, wherein the neural network vocoder is a WaveRNN network, LPCNet or WaveNet.
6. The end-to-end accent conversion method of claim 1, comprising the steps of: a. collecting speech information and corresponding text information of a plurality of speakers, and training the speaker encoder and the speech synthesis module; b. screening out a speaker with a non-native accent as the target speaker; c. using the text samples and speech samples of the target speaker for training of the speech recognition module.
CN202010239586.2A 2020-03-30 2020-03-30 End-to-end accent conversion method Active CN111462769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010239586.2A CN111462769B (en) 2020-03-30 2020-03-30 End-to-end accent conversion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010239586.2A CN111462769B (en) 2020-03-30 2020-03-30 End-to-end accent conversion method

Publications (2)

Publication Number Publication Date
CN111462769A CN111462769A (en) 2020-07-28
CN111462769B true CN111462769B (en) 2023-10-27

Family

ID=71681783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010239586.2A Active CN111462769B (en) 2020-03-30 2020-03-30 End-to-end accent conversion method

Country Status (1)

Country Link
CN (1) CN111462769B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233646A (en) * 2020-10-20 2021-01-15 携程计算机技术(上海)有限公司 Voice cloning method, system, device and storage medium based on neural network
CN112786052A (en) * 2020-12-30 2021-05-11 科大讯飞股份有限公司 Speech recognition method, electronic device and storage device
CN113223542B (en) * 2021-04-26 2024-04-12 北京搜狗科技发展有限公司 Audio conversion method and device, storage medium and electronic equipment
US11948550B2 (en) 2021-05-06 2024-04-02 Sanas.ai Inc. Real-time accent conversion model
CN113593534B (en) * 2021-05-28 2023-07-14 思必驰科技股份有限公司 Method and device for multi-accent speech recognition
CN113327575B (en) * 2021-05-31 2024-03-01 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113345431A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Cross-language voice conversion method, device, equipment and medium
CN113470622B (en) * 2021-09-06 2021-11-19 成都启英泰伦科技有限公司 Conversion method and device capable of converting any voice into multiple voices
CN116994553A (en) * 2022-09-15 2023-11-03 腾讯科技(深圳)有限公司 Training method of speech synthesis model, speech synthesis method, device and equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101359473A (en) * 2007-07-30 2009-02-04 国际商业机器公司 Auto speech conversion method and apparatus
CN101399044A (en) * 2007-09-29 2009-04-01 国际商业机器公司 Voice conversion method and system
CN102982809A (en) * 2012-12-11 2013-03-20 中国科学技术大学 Conversion method for sound of speaker
CN108108357A (en) * 2018-01-12 2018-06-01 京东方科技集团股份有限公司 Accent conversion method and device, electronic equipment
CN110335584A (en) * 2018-03-29 2019-10-15 福特全球技术公司 Neural network generates modeling to convert sound pronunciation and enhancing training data
CN110600047A (en) * 2019-09-17 2019-12-20 南京邮电大学 Perceptual STARGAN-based many-to-many speaker conversion method

Also Published As

Publication number Publication date
CN111462769A (en) 2020-07-28

Similar Documents

Publication Publication Date Title
CN111462769B (en) End-to-end accent conversion method
Schädler et al. Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition
Kingsbury et al. Robust speech recognition using the modulation spectrogram
Sarikaya et al. High resolution speech feature parametrization for monophone-based stressed speech recognition
Womack et al. N-channel hidden Markov models for combined stressed speech classification and recognition
Liu et al. End-to-end accent conversion without using native utterances
CN101359473A (en) Auto speech conversion method and apparatus
Doshi et al. Extending parrotron: An end-to-end, speech conversion and speech recognition model for atypical speech
Zhao et al. Using phonetic posteriorgram based frame pairing for segmental accent conversion
Yang et al. Speech representation disentanglement with adversarial mutual information learning for one-shot voice conversion
Geng et al. Speaker adaptation using spectro-temporal deep features for dysarthric and elderly speech recognition
CN110570842B (en) Speech recognition method and system based on phoneme approximation degree and pronunciation standard degree
Zheng et al. CASIA voice conversion system for the voice conversion challenge 2020
Matsubara et al. High-intelligibility speech synthesis for dysarthric speakers with LPCNet-based TTS and CycleVAE-based VC
KR20190135853A (en) Method and system of text to multiple speech
Das et al. Understanding the effect of voice quality and accent on talker similarity
Quamer et al. Zero-shot foreign accent conversion without a native reference
Nazir et al. Deep learning end to end speech synthesis: A review
Leung et al. Applying articulatory features to telephone-based speaker verification
Sae-Tang et al. Feature windowing-based Thai text-dependent speaker identification using MLP with backpropagation algorithm
Andra et al. Improved transcription and speaker identification system for concurrent speech in Bahasa Indonesia using recurrent neural network
CN114023343A (en) Voice conversion method based on semi-supervised feature learning
Hsu et al. Speaker-dependent model interpolation for statistical emotional speech synthesis
Shinde et al. Vowel Classification based on LPC and ANN
CN111259188A (en) Lyric alignment method and system based on seq2seq network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220728

Address after: 518000 Honglang North 408, Zhongli Chuangye community, No. 49, Dabao Road, Dalang community, Xin'an street, Bao'an District, Shenzhen, Guangdong Province

Applicant after: Shenzhen Dadan Shusheng Technology Co.,Ltd.

Address before: 518101 2710, building 2, huichuangxin Park, No. 2 Liuxian Avenue, Xingdong community, Xin'an street, Bao'an District, Shenzhen, Guangdong Province

Applicant before: SPEECHX LTD.

GR01 Patent grant