CN116386613B - Model training method for enhancing command word voice - Google Patents

Model training method for enhancing command word voice

Info

Publication number
CN116386613B
Authority
CN
China
Prior art keywords
word
corpus
audio
command
command word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310650948.0A
Other languages
Chinese (zh)
Other versions
CN116386613A (en)
Inventor
温登峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chipintelli Technology Co Ltd
Original Assignee
Chipintelli Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chipintelli Technology Co Ltd
Priority to CN202310650948.0A
Publication of CN116386613A
Application granted
Publication of CN116386613B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

S1: perform initial training to obtain an original speech recognition model MD1. S2: acquire the command word entries C1 used by a client project, select corresponding audio for screening, and expand the audio of the command words that need expansion. S3: remove the audio with decoding errors to obtain a corrected command word corpus B4. S4: record the actual noise of the customer's product environment, and apply noise addition and reverberation to the corrected command word corpus B4 to obtain a command word expansion corpus B5. S5: select an initial chip-side model MD2 trained on a corpus, and fine-tune it with the command word expansion corpus B5 obtained in step S4 to obtain a final model MD3. The invention uses an existing corpus to generate the missing command word corpus; it improves the model's recognition rate in real scenarios without significantly increasing labor cost and meets customers' application requirements.

Description

Model training method for enhancing command word voice
Technical Field
The invention belongs to the technical field of voice signal processing, and particularly relates to a model training method for enhancing command word voice.
Background
In recent years, with the continued development of artificial intelligence, voice devices of all kinds have come into ever more frequent use. However, because computing power at the chip side is limited, practical applications of voice chips are mainly based on command words. The conventional end-side command word recognition process trains a continuous speech recognition model on a large corpus and then uses that model to recognize the command words of a given product. Command word recognition in quiet environments is essentially a solved problem; recognition in noisy and far-field environments, however, remains a major challenge.
Developers therefore usually record a dedicated corpus for the command words specified by the client and use it to fine-tune the initial model. Recording such a corpus, however, consumes considerable manpower and material resources and adds cost. Another option is to synthesize the command word corpus with a speech synthesis model, but this raises a new problem: a corresponding text-to-speech (TTS) model must be trained, and the training benefit of a synthetic corpus is limited.
Disclosure of Invention
To overcome the deficiencies of the prior art, the present invention discloses a model training method for command word speech enhancement.
The invention relates to a model training method for enhancing command word voice, which comprises the following steps:
S1: perform initial training to obtain an original speech recognition model MD1, and establish a quiet corpus;
S2: acquire the command word entries C1 used by a client project, the entries comprising at least one command word;
S21: select corresponding audio from the existing quiet corpus A1 according to the command word entries C1, count the number of audio entries in A1 for each command word, set a first screening threshold to screen all command words in C1, and screen out the command words to be expanded, namely those whose audio counts are below the first screening threshold;
perform audio expansion on the command words to be expanded; the expansion method comprises the following steps:
S22: segment the command words: the first segmentation decomposes a command word into one or more words, and the second segmentation, performed on the result of the first, divides each word into one or more single characters;
S23: set a second screening threshold and, according to the first segmentation result, screen the quiet corpus A1 for audio entries containing each word; if the number of audio entries for a word is below the second screening threshold, screen A1 again according to the second segmentation result, keeping the single characters whose audio counts exceed the second screening threshold;
if neither screening pass yields results, lower the second screening threshold and repeat step S23;
step S23 thus yields, by screening, original audio containing the results of both segmentations;
S24: align the audio obtained in step S23 using the original speech recognition model MD1 to obtain the time labels of each word or single character within the audio, and cut out the audio containing only the corresponding word or single character according to those time labels; the cut-out audio forms the segmentation sub-corpus B1;
S25: repeat steps S22-S24 for each command word, and merge all segmentation sub-corpora B1 into the segmentation corpus B2;
S26: randomly select audio from the segmentation corpus B2 and combine it into the whole-word command corpus B3;
S3: decode the synthesized whole-word command corpus B3 with the original speech recognition model MD1, and remove the audio with decoding errors to obtain the corrected command word corpus B4;
S4: record the actual noise of the customer's product environment, and apply noise addition and reverberation to the corrected command word corpus B4 to obtain the command word expansion corpus B5;
S5: select an initial chip-side model MD2 trained on a corpus, and fine-tune it with the command word expansion corpus B5 obtained in step S4 to obtain the final model MD3.
Preferably, in step S1, the original speech recognition model MD1 is trained using a CTC/RNNT training method.
Preferably, in step S4, part of the audio is selected from the corrected command word corpus B4, played in the corresponding noise environment, and picked up; the picked-up audio forms the supplementary noise corpus B6;
in step S5, the command word expansion corpus B5 and the supplementary noise corpus B6 are used together to fine-tune MD2 and obtain the final model MD3.
The invention uses an existing corpus to generate the missing command word corpus; it improves the model's recognition rate in real scenarios without significantly increasing labor cost and meets customers' application requirements.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention.
Description of the embodiments
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings; the description is illustrative, not limiting.
The invention relates to a model training method for enhancing command word voice, which comprises the following steps:
S1: initial training is carried out on a large corpus of more than 10,000 hours to obtain an original speech recognition model MD1. The model is trained with a character-level CTC/RNNT method (Connectionist Temporal Classification and Recurrent Neural Network Transducer); for Chinese, the modeling unit is the Chinese character;
a quiet corpus is also established, containing a large number of utterances recorded in a quiet environment; the audio covers single characters, short sentences, long sentences, and whole articles, and the speakers include people of different genders, ages, and pronunciation habits;
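As a rough illustration of the character-level objective named in S1, the following is a minimal sketch of CTC training in PyTorch; the LSTM encoder, feature dimension, and vocabulary size are illustrative assumptions rather than the patent's architecture (the embodiment below actually trains a Conformer in k2).

```python
# Minimal sketch of character-level CTC training for MD1 (assumptions: PyTorch;
# the encoder, 80-dim filterbank features, and vocabulary size are placeholders).
import torch
import torch.nn as nn

VOCAB = 5000  # assumed: Chinese characters plus the CTC blank (id 0)
encoder = nn.LSTM(input_size=80, hidden_size=512, num_layers=4, batch_first=True)
proj = nn.Linear(512, VOCAB)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_step(feats, feat_lens, targets, target_lens):
    """feats: (B, T, 80) features; targets: 1-D tensor of concatenated char ids."""
    hidden, _ = encoder(feats)                    # (B, T, 512)
    log_probs = proj(hidden).log_softmax(dim=-1)  # (B, T, VOCAB)
    # nn.CTCLoss expects time-major input: (T, B, VOCAB)
    return ctc(log_probs.transpose(0, 1), targets, feat_lens, target_lens)
```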
s2: obtaining command word entries C1 used by client projects, wherein the command word entries comprise at least one command word, taking a tea bar machine project as an example, the command word entries C1 comprise command words such as tea mode, black tea mode, green tea mode, tea boiling closing, tea boiling stopping and the like,
s21, firstly selecting corresponding audio in the existing quiet corpus A1 according to command word entries C1, counting the number of audio entries of each command word in the command word entries C1 in the quiet corpus A1, setting a first screening threshold value to screen command words in the command word entries C1, screening command words to be expanded, and independently listing audio with the command words less than the screening threshold value, such as less than 1000 entries, as the command words to be expanded C2, wherein the number of audio entries in the current nectar mode is assumed to be 0, so that the number of audio of each command word in the command words to be expanded is required to be expanded. The specific expansion method is as follows:
s22, word segmentation is carried out on the command words, wherein the first word segmentation is to decompose the command words into more than one word, the second word segmentation is carried out on the basis of the first word segmentation, and each word is divided into more than one single word;
s23, setting a second screening threshold, screening audio entries comprising each word from the quiet corpus A1 according to the first word segmentation result, and screening again in the quiet corpus A1 according to the second word segmentation result if the number of the audio entries of the single word is less than the second screening threshold; screening out single words with the number of the audio frequency greater than a second screening threshold value;
if the results cannot be screened out by the two screening steps, the second screening threshold is set higher, and the second screening threshold is lowered; repeating step S23;
the step S23 of obtaining the original audio containing the two word segmentation results through screening;
s24, aligning the audio obtained in the step S23 by using an original voice recognition model MD1 model to obtain a corresponding time tag of the word or the single word in the audio, and cutting out the audio only containing the corresponding word or the single word according to the corresponding time tag, wherein the cut-out audio is used as a word segmentation sub-corpus B1;
s25, repeating the steps S21-S24 for each command word, and combining all word segmentation sub-corpora B1 to obtain a word segmentation corpus B2;
s26, randomly screening audio from the word segmentation corpus B2 to combine to obtain a command word whole word corpus B3;
for example, one embodiment is shown in FIG. 1:
(1) the command word "fruit tea mode" is segmented: the first segmentation yields the words "fruit tea" and "mode", and the second segmentation, which continues segmenting the first result, yields the single characters of "fruit", "tea", and "mode";
(2) the second screening threshold is set to 50; audio entries corresponding to "fruit tea" and "mode" are screened from corpus A1 according to the first segmentation, and if the number of audio entries for an entry is below 50, the corpus is screened again according to the second segmentation; for example, if the audio for "fruit tea" still numbers fewer than 50 entries, the corpus is screened using the single characters "fruit" and "tea";
(3) the original audio containing both segmentation results, obtained by the screening of steps (1) and (2), is collected;
(4) the audio obtained in step (3) is aligned with the original speech recognition model MD1 to obtain the time labels of words or single characters such as "fruit tea" and "mode" within the audio, and the audio containing only the corresponding word or single character is cut out according to those time labels; the cut-out audio forms the segmentation sub-corpus B1;
for example, the audio of "fruit" and "mode" is cut out of the audio of "fruit candy", "group purchase mode", and so on;
(5) steps (1)-(4) are repeated for each command word, and all segmentation sub-corpora B1 are merged into the segmentation corpus B2;
(6) audio is randomly selected from the segmentation corpus B2 and combined into the whole-word command corpus B3;
for example, the segmentation corpus B2 contains many audio clips of "fruit tea", "fruit", "tea", and "mode"; some of them are selected, or randomly sampled, and randomly spliced into the complete command word "fruit tea mode"; the result is used as the whole-word command corpus B3;
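As a minimal sketch of steps (3)-(6), assuming MD1 alignment has already produced start/end offsets in seconds for each word or character, the cutting and splicing could look as follows (Python with the soundfile package; all helper names are illustrative, not the patent's tooling):

```python
# Minimal sketch: cut aligned clips out of screened audio (S24) and splice
# them into synthetic whole command words (S26). Assumes mono audio and that
# alignment offsets come from MD1; names and the 30 ms gap are illustrative.
import random
import numpy as np
import soundfile as sf

def cut_segment(wav_path, start_s, end_s, out_path):
    """Keep only the audio spanned by a unit's time labels."""
    audio, sr = sf.read(wav_path)
    sf.write(out_path, audio[int(start_s * sr):int(end_s * sr)], sr)

def splice_command(unit_clip_lists, sr=16000, gap_ms=30):
    """Pick one random clip per unit (e.g. 'fruit tea', then 'mode') and
    concatenate them, separated by short silences, into one utterance."""
    gap = np.zeros(int(sr * gap_ms / 1000), dtype=np.float32)
    parts = []
    for clips in unit_clip_lists:
        audio, _ = sf.read(random.choice(clips))
        parts.extend([audio.astype(np.float32), gap])
    return np.concatenate(parts[:-1])  # drop the trailing gap
```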
the step of screening and word segmentation is adopted twice, because the pronunciation habit of a person carries out word segmentation on a whole word, then pronouncing is carried out, the pronunciations of the words are more consistent, the corpus obtained by the two word segmentation not only maintains the complete word pronunciation, but also is replaced by single word audio when the condition of insufficient complete word pronunciation audio is not satisfied, thus being beneficial to obtaining the optimal screening result and being beneficial to the segmentation of the corpus;
s3: the raw speech recognition model MD1 is used to decode the synthesized command word whole word library B3,
the decoding is to identify the synthesized audio, if the identified text is different from the corresponding command word text, the decoding is considered to be wrong, and the audio needs to be deleted to make the corpus put into training more effectively. The reason for the decoding error is usually that the original speech recognition model MD1 has systematic small probability decoding error or that the synthesized audio is distorted due to the audio segmentation inaccuracy;
removing the audio with decoding errors to obtain a corrected command word corpus B4;
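A minimal sketch of this rejection step, assuming a decode function that wraps MD1 inference and returns the recognized text (a placeholder, not an actual API):

```python
# Minimal sketch of step S3: keep a synthetic clip only if MD1 decodes it
# back to the intended command word text.
def reject_decode_errors(b3_entries, decode):
    """b3_entries: iterable of (wav_path, command_text). Returns corpus B4."""
    b4 = []
    for wav_path, text in b3_entries:
        if decode(wav_path) == text:  # recognized text matches the label
            b4.append((wav_path, text))
    return b4
```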
s4, recording the actual noise of the customer home products, and carrying out noise adding and reverberation processing on the corrected command word corpus B4 to obtain a command word expansion corpus B5;
a small amount of audio can be selected from the correction command word corpus B4 to be played in the corresponding noise environment so as to simulate the real environment, and a voice chip is used for picking up the voice to obtain a supplementary noise corpus B6; the corresponding noise environment is the noise environment where the audio corresponding command word is always located, for example, the equipment corresponding to the audio of 'I want to receive water' is a tea bar machine, and the corresponding noise environment is the noise generated during water heating and water receiving;
the supplementary noise corpus B6 is noise audio of real environment sound collection, and has better effect compared with the command word expansion corpus B5 generated by simulated noise addition;
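A minimal sketch of the simulated augmentation that produces B5, assuming numpy/scipy and an externally recorded room impulse response (RIR); the SNR handling and normalization are illustrative choices:

```python
# Minimal sketch of step S4: mix recorded product noise into a clean clip at a
# target SNR, then add reverberation by convolving with a room impulse response.
import numpy as np
from scipy.signal import fftconvolve

def augment(clean, noise, rir, snr_db=10.0):
    if len(noise) < len(clean):                   # loop noise to cover the clip
        noise = np.tile(noise, len(clean) // len(noise) + 1)
    noise = noise[:len(clean)]
    clean_pow = np.mean(clean ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_pow / (noise_pow * 10 ** (snr_db / 10)))
    noisy = clean + scale * noise                 # noise at the requested SNR
    reverbed = fftconvolve(noisy, rir)[:len(noisy)]
    peak = np.max(np.abs(reverbed)) + 1e-12
    return reverbed / max(1.0, peak)              # normalize only if clipping
```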
s5, selecting an initial model MD2 used for corpus training chip end, wherein the MD2 model is a basic model which can be operated on a chip and is constructed for chip end training, and training the MD2 by using the command word expansion corpus B5 and the supplementary noise corpus B6 obtained in the step S4 to obtain a final model MD3.
The invention uses an existing corpus to generate the missing command word corpus; it improves the model's recognition rate in real scenarios without significantly increasing labor cost and meets customers' application requirements.
The invention improves the recognition rate by expanding the command word material. For model training, the two-pass screening and segmentation follows human pronunciation habits: a whole phrase is segmented into words before being pronounced, and pronunciation is most consistent within words, which benefits corpus segmentation. Cutting out single characters is comparatively harder, because a single character rarely occurs on its own in a stretch of speech.
This embodiment is implemented with the open-source speech recognition tools k2 and Kaldi. First, an end-to-end Conformer model is trained on the k2 platform with tens of thousands of hours of basic corpus to serve as the original speech recognition model MD1, which is used for aligning and decoding the subsequent corpora. Conformer is a speech recognition model proposed by Google in 2020.
About 100 command words of a tea bar machine are screened and expanded according to step S2 of the training method to obtain the whole-word command corpus B3, and the rejection of step S3 is completed.
In step S4, the noise produced while the tea bar machine is running is recorded and used for noise addition, yielding the command word expansion corpus B5 and the supplementary noise corpus B6, which improve recognition in the noisy environments of actual use.
In the Kaldi environment, an initial chip-side model MD2 with an f-TDNN (Factorized Time-Delay Neural Network) structure is trained on thousands of hours of basic corpus; the command word expansion corpus B5 and the supplementary noise corpus B6 obtained above are then used to fine-tune MD2, giving the final model MD3 adapted to the specific command words.
In this embodiment, based on the tea bar machine test items, each test set contains 220 audio samples; the PC-side test results of each model are shown in Table 1:
Model                    MD2     MD3     MD4
Parameter count          850K    850K    850K
Quiet                    98%     99%     100%
Music                    79%     87%     93%
Water-dispensing noise   90%     95%     96%
Water-heating noise      90%     93%     97%
Table 1: Tea bar machine test results
In Table 1, MD2 is the model trained on the basic corpus only; MD3 is the model trained after the expanded command word corpus was generated with the method of this patent; and MD4 is the model trained with command word audio recorded in high fidelity (i.e., genuinely recorded command word speech). The test results show that the expanded command word corpus improves the model's recognition under the actual use case.
In the foregoing description of preferred embodiments, absent obvious contradiction or reliance on a particular preferred embodiment, the preferred embodiments may be combined in any overlapping manner. The embodiments and the specific parameters therein are only intended to clearly describe the inventor's verification process, not to limit the scope of the invention, which remains defined by the claims; all equivalent structural changes made using the contents of this specification and drawings fall within the scope of the invention.

Claims (3)

1. A model training method for command word speech enhancement, comprising the following steps:
S1: perform initial training to obtain an original speech recognition model MD1, and establish a quiet corpus;
S2: acquire the command word entries C1 used by a client project, the entries comprising at least one command word;
S21: select corresponding audio from the existing quiet corpus A1 according to the command word entries C1, count the number of audio entries in A1 for each command word, set a first screening threshold to screen all command words in C1, and screen out the command words to be expanded, namely those whose audio counts are below the first screening threshold;
perform audio expansion on the command words to be expanded; the expansion method comprises the following steps:
S22: segment the command words: the first segmentation decomposes a command word into one or more words, and the second segmentation, performed on the result of the first, divides each word into one or more single characters;
S23: set a second screening threshold and, according to the first segmentation result, screen the quiet corpus A1 for audio entries containing each word; if the number of audio entries for a word is below the second screening threshold, screen A1 again according to the second segmentation result, keeping the single characters whose audio counts exceed the second screening threshold;
if neither screening pass yields results, lower the second screening threshold and repeat step S23;
step S23 thus yields, by screening, original audio containing the results of both segmentations;
S24: align the audio obtained in step S23 using the original speech recognition model MD1 to obtain the time labels of each word or single character within the audio, and cut out the audio containing only the corresponding word or single character according to those time labels; the cut-out audio forms the segmentation sub-corpus B1;
S25: repeat steps S22-S24 for each command word, and merge all segmentation sub-corpora B1 into the segmentation corpus B2;
S26: randomly select audio from the segmentation corpus B2 and combine it into the whole-word command corpus B3;
S3: decode the synthesized whole-word command corpus B3 with the original speech recognition model MD1, and remove the audio with decoding errors to obtain the corrected command word corpus B4;
S4: record the actual noise of the customer's product environment, and apply noise addition and reverberation to the corrected command word corpus B4 to obtain the command word expansion corpus B5;
S5: select an initial chip-side model MD2 trained on a corpus, and fine-tune it with the command word expansion corpus B5 obtained in step S4 to obtain the final model MD3.
2. The model training method according to claim 1, wherein in step S1 the original speech recognition model MD1 is trained using a CTC/RNNT training method.
3. The model training method according to claim 1, wherein:
in step S4, part of the audio is selected from the corrected command word corpus B4, played in the corresponding noise environment, and picked up; the picked-up audio forms the supplementary noise corpus B6;
in step S5, the command word expansion corpus B5 and the supplementary noise corpus B6 are used together to fine-tune MD2 and obtain the final model MD3.
CN202310650948.0A 2023-06-05 2023-06-05 Model training method for enhancing command word voice Active CN116386613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310650948.0A CN116386613B (en) 2023-06-05 2023-06-05 Model training method for enhancing command word voice


Publications (2)

Publication Number Publication Date
CN116386613A CN116386613A (en) 2023-07-04
CN116386613B (en) 2023-07-25

Family

ID=86973587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310650948.0A Active CN116386613B (en) 2023-06-05 2023-06-05 Model training method for enhancing command word voice

Country Status (1)

Country Link
CN (1) CN116386613B (en)


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160225372A1 (en) * 2015-02-03 2016-08-04 Samsung Electronics Company, Ltd. Smart home connected device contextual learning using audio commands
US11011157B2 (en) * 2018-11-13 2021-05-18 Adobe Inc. Active learning for large-scale semi-supervised creation of speech recognition training corpora based on number of transcription mistakes and number of word occurrences
US11645460B2 (en) * 2020-12-28 2023-05-09 Genesys Telecommunications Laboratories, Inc. Punctuation and capitalization of speech recognition transcripts

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595696A (en) * 2018-05-09 2018-09-28 长沙学院 A kind of human-computer interaction intelligent answering method and system based on cloud platform
EP3617930A1 * 2018-08-28 2020-03-04 Accenture Global Solutions Limited Training data augmentation for conversational AI bots
CN112530417A (en) * 2019-08-29 2021-03-19 北京猎户星空科技有限公司 Voice signal processing method and device, electronic equipment and storage medium
CN110853625A (en) * 2019-09-18 2020-02-28 厦门快商通科技股份有限公司 Speech recognition model word segmentation training method and system, mobile terminal and storage medium
CN112151021A (en) * 2020-09-27 2020-12-29 北京达佳互联信息技术有限公司 Language model training method, speech recognition device and electronic equipment
CN112151080A (en) * 2020-10-28 2020-12-29 成都启英泰伦科技有限公司 Method for recording and processing training corpus
CN114692634A (en) * 2022-01-27 2022-07-01 清华大学 Chinese named entity recognition and classification method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on Keyword Extraction of Word2vec Model in Chinese Corpus; Chenchen Zhang; 2018 IEEE/ACIS 17th International Conference on Computer and Information Science (ICIS); full text *
Research on an Embedded Continuous Speech Recognition System for Small and Medium Vocabularies; Lin Weimin; China Master's Theses Full-text Database; full text *

Also Published As

Publication number Publication date
CN116386613A (en) 2023-07-04

Similar Documents

Publication Publication Date Title
Vasquez et al. Melnet: A generative model for audio in the frequency domain
Chandna et al. Wgansing: A multi-voice singing voice synthesizer based on the wasserstein-gan
CN113470662A (en) Generating and using text-to-speech data for keyword spotting systems and speaker adaptation in speech recognition systems
CN108922518A (en) voice data amplification method and system
Song et al. Noise invariant frame selection: a simple method to address the background noise problem for text-independent speaker verification
CN105023580B (en) Unsupervised noise estimation based on separable depth automatic coding and sound enhancement method
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
CN1835075B (en) Speech synthetizing method combined natural sample selection and acaustic parameter to build mould
Yağlı et al. Artificial bandwidth extension of spectral envelope along a Viterbi path
CN111508470B (en) Training method and device for speech synthesis model
Ai et al. SampleRNN-based neural vocoder for statistical parametric speech synthesis
CN110246489A (en) Audio recognition method and system for children
Ronanki et al. A Template-Based Approach for Speech Synthesis Intonation Generation Using LSTMs.
CN114267372A (en) Voice noise reduction method, system, electronic device and storage medium
Du et al. Noise-robust voice conversion with domain adversarial training
Lee et al. A new voice transformation method based on both linear and nonlinear prediction analysis
Koizumi et al. Miipher: A robust speech restoration model integrating self-supervised speech and text representations
CN113436607B (en) Quick voice cloning method
Zhang et al. AccentSpeech: Learning accent from crowd-sourced data for target speaker TTS with accents
Ma et al. Two-stage training method for Japanese electrolaryngeal speech enhancement based on sequence-to-sequence voice conversion
CN116386613B (en) Model training method for enhancing command word voice
Kumar et al. Towards building text-to-speech systems for the next billion users
CN116092475B (en) Stuttering voice editing method and system based on context-aware diffusion model
Du et al. Effective wavenet adaptation for voice conversion with limited data
CN114005428A (en) Speech synthesis method, apparatus, electronic device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant