CN110827791B - Edge-device-oriented speech recognition-synthesis combined modeling method - Google Patents

Edge-device-oriented speech recognition-synthesis combined modeling method

Info

Publication number
CN110827791B
Authority
CN
China
Prior art keywords
model
audio
synthesis
positive
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910847985.4A
Other languages
Chinese (zh)
Other versions
CN110827791A (en)
Inventor
王海
秦晨光
张晓�
刘艺
赵子鑫
高岭
任杰
郑杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest University
Original Assignee
Northwest University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest University
Priority to CN201910847985.4A
Publication of CN110827791A
Application granted
Publication of CN110827791B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/26 Speech to text systems
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques using neural networks

Abstract

A speech recognition-synthesis combined modeling method oriented to edge devices is a model iteration method that integrates speech recognition and speech synthesis technology at the back end. It draws on research into real-time computation and the task-distribution strategies of edge computing, and is inspired by the party game of passing a message along without distorting it. A real-time, efficient processing module is built on the speech enhancement techniques of the audio processing field, and an iterative recognition-and-synthesis model for Chinese dialects is built on speech recognition and speech synthesis technology. By fully exploiting the characteristics of these speech technologies, the method realizes a dialect processing model that recognizes, synthesizes and runs efficiently, makes effective use of the processing capability of the edge environment, and, by combining speech recognition with speech synthesis, yields a speech model with richer functions and more robust performance.

Description

Edge-device-oriented speech recognition-synthesis combined modeling method
Technical Field
The invention belongs to the technical field of edge computing and audio research, relates to edge servers, speech enhancement, speech recognition, speech synthesis and neural networks, and particularly relates to a speech recognition-synthesis combined modeling method oriented to edge devices.
Background
Since Industry 4.0, the rapid rise of artificial intelligence and the Internet of Things (IoT) has brought great potential for convenience in everyday life, and a large number of intelligent products have appeared. Meanwhile, with the development of edge computing in recent years, edge computing strategies can effectively distribute heavy computational workloads, address real-time requirements, and improve the computing capability of a model, opening up almost unlimited possibilities for continuously strengthening and extending the functions of intelligent products.
The continuous development of neural networks and deep learning has driven major breakthroughs in related research, most visibly in the speech and image domains. In recent years, within NLP and speech recognition, speech processing, speech recognition and speech synthesis have developed rapidly and attracted wide attention. However, some technical problems still require further study and optimization, such as the real-time performance of machine processing, the robustness of intelligent applications, and the comprehensiveness of object characteristics. When some Chinese dialect data were tested with the 'dictation major' WeChat mini-program, several of the models under test recognized them poorly. The causes are mostly poor characteristics of the input data, insufficient model performance, and defects in the operating procedure. Therefore, beyond defect detection, improving model performance and data processing is an effective way to raise accuracy and solve this problem. It is thus important to make full use of the speech field and mobile computing technology to optimize model effectiveness, and to extend application functions by exploiting model characteristics.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a speech recognition-synthesis combined modeling method oriented to edge devices, in which a real-time, efficient processing module is built on the speech enhancement techniques of the audio processing field, an iterative speech recognition-and-synthesis model for Chinese dialects is built on speech recognition and speech synthesis technology, and the characteristics of these speech technologies are fully exploited to realize a real-time dialect processing model that recognizes, synthesizes and runs efficiently.
To achieve this aim, the invention adopts the following technical scheme:
an edge-device-oriented speech recognition-synthesis joint modeling method is characterized by comprising the following steps:
1) collecting a data set sample:
collecting audio data from various environments and dividing it into class a, clean audio recorded in a quiet environment, and class b, different types of noise audio, where class b is drawn from a noise library;
2) performing data processing:
first performing noise fusion: adding noise to the clean audio and packaging each clean recording together with its corresponding noise-added version;
3) building an edge server:
building stable edge server equipment with a reliable upload and download interface; deploying a speech enhancement module whose algorithms comprise spectral subtraction, wavelet hard thresholding and a GAN network model and which, in combination with a speech recognition engine, obtains an optimal result by voting and uses it as the optimization step of audio preprocessing; performing audio front-end processing on this layer of devices, including dereverberation, noise reduction and noise separation; screening the best model for the corresponding noise through fused processing with wavelets, spectral subtraction and neural network models; and selecting, by voting, the model that yields the higher audio quality to process the audio, as sketched after step 6) below;
4) building a Chinese dialect voice recognition model:
adopting a basic model architecture based on CNN + RNN, using the CNN to represent secondary feature processing after MFCC speech features are extracted, taking the audio spectrogram as an input feature, and normalizing the extracted spectrogram feature vector and the MFCC feature vector; building a Chinese dialect speech synthesis model that provides an interface for multi-dimensional cross fusion based on an improved WaveNet model; and deploying the joint speech recognition and speech synthesis models in the upper-layer cloud, which receives the processing results of the edge layer as its input source;
5) processing the data of step 3) for the first time on the edge service layer devices thus built: performing audio preprocessing by means of speech enhancement to improve the machine intelligibility of the audio, extracting features from the speech sample set, and passing them through the Chinese dialect speech recognition model of step 4) to obtain positive samples T1 with accuracy acc1 and negative samples F1 with accuracy acc2; inputting T1 and F1 respectively into the Chinese dialect speech synthesis model of step 4), whose synthesized audio outputs fall into four categories, namely T11, T12, F21 and F22; wherein T11 means the speech recognition result is a positive sample and the speech synthesis result is a positive sample, T12 means the recognition result is positive and the synthesis result is negative, F21 means the recognition result is negative and the synthesis result is positive, and F22 means both the recognition and synthesis results are negative;
6) evaluating the dominance ratio of the corresponding features according to the proportion of correct samples, screening out the dominant feature expression set as the staged model features, adjusting the hyper-parameters and weights to train the speech recognition model until the loss converges and saving the model, reloading it and combining it with the speech synthesis module, iteratively training the model through the feedback parameter-update mechanism of neural network back-propagation, and setting a reasonable number of training iterations by tuning the model hyper-parameters so that the network converges faster and more economically; this optimizes the final effect of the model and, once the model's performance gradually converges and stabilizes, its robustness is assured.
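The patent does not publish code for the voting step named in step 3); the Python sketch below only illustrates the idea of running several candidate enhancers (spectral subtraction, wavelet hard thresholding, a GAN model) and keeping the output that a scoring routine rates highest. The function names, the stand-in enhancers and the stand-in score are all hypothetical, chosen only to make the sketch run.

```python
import numpy as np

def vote_best_enhancement(noisy, sr, enhancers, score_fn):
    """Apply each candidate enhancer and keep the output that score_fn rates highest.

    enhancers: dict of name -> callable(audio, sr) -> enhanced audio
    score_fn:  callable(audio, sr) -> float, e.g. a recognition-engine confidence
    """
    candidates = {name: fn(noisy, sr) for name, fn in enhancers.items()}
    scores = {name: score_fn(audio, sr) for name, audio in candidates.items()}
    winner = max(scores, key=scores.get)          # voting: highest-scoring enhancer wins
    return winner, candidates[winner], scores

# Usage with trivial stand-ins; real spectral-subtraction / wavelet / GAN models would go here.
sr = 16000
noisy = np.random.randn(sr).astype(np.float32)    # one second of fake noisy audio
enhancers = {
    "spectral_subtraction": lambda x, sr: x * 0.9,
    "wavelet_hard_threshold": lambda x, sr: np.where(np.abs(x) > 0.5, x, 0.0),
    "gan": lambda x, sr: x,
}
score_fn = lambda x, sr: -float(np.mean(x ** 2))  # stand-in "quality" score
winner, enhanced, scores = vote_best_enhancement(noisy, sr, enhancers, score_fn)
print(winner, scores)
```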
Further, the processed speech data are converted into corresponding text by speech recognition: using MFCC features and the strength of CNNs at feature extraction, a CCLD (convolutional network & LSTM) Chinese speech recognition network model is built, in which MFCC audio features serve as the reference, a CNN extracts the key features and outputs them to an RNN consisting of LSTM layers, and finally a three-layer DNN is attached as the output discriminator; a Chinese speech recognition engine is built around the characteristics of Chinese speech data and the Chinese speech recognition model is trained; the samples classified by the model are then divided into correct samples and incorrect samples.
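The paragraph above names the CCLD pipeline (CNN feature extraction, LSTM layers, a three-layer DNN as output discriminator), but the patent does not disclose layer sizes or other hyper-parameters. The PyTorch sketch below is one plausible reading of that pipeline; every dimension (number of MFCCs, channel widths, hidden sizes, vocabulary size) is an assumption added for illustration.

```python
import torch
import torch.nn as nn

class CCLD(nn.Module):
    """Sketch of a CNN -> LSTM -> three-layer DNN recognizer over MFCC frames."""

    def __init__(self, n_mfcc=40, vocab_size=1000):
        super().__init__()
        self.cnn = nn.Sequential(                       # secondary feature processing of MFCCs
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),           # pool only along the feature axis
        )
        self.lstm = nn.LSTM(input_size=32 * (n_mfcc // 2),
                            hidden_size=256, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.dnn = nn.Sequential(                       # three fully connected layers as output
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, vocab_size),
        )

    def forward(self, mfcc):                            # mfcc: (batch, frames, n_mfcc)
        x = mfcc.unsqueeze(1)                           # (batch, 1, frames, n_mfcc)
        x = self.cnn(x)                                 # (batch, 32, frames, n_mfcc // 2)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)  # (batch, frames, 32 * n_mfcc // 2)
        x, _ = self.lstm(x)                             # (batch, frames, 512)
        return self.dnn(x)                              # per-frame logits over the vocabulary

# Usage: 8 utterances, 200 frames, 40 MFCCs each
logits = CCLD()(torch.randn(8, 200, 40))
print(logits.shape)                                     # torch.Size([8, 200, 1000])
```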
Further, the positive and negative samples are respectively converted into corresponding audio samples by the speech synthesis model, and whether the corresponding recognition results are correct is counted and classified again into positive and negative samples: the improved WaveNet-based speech synthesis model converts text into audio, where the text data are the output of the previous model and therefore already split into positive and negative samples; after passing through the speech synthesis model each of them again receives a corresponding positive or negative label, so the resulting recognition outcomes carry the sample attributes "positive-positive", "positive-negative", "negative-positive" and "negative-negative".
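As a small illustration of the bookkeeping described above, the sketch below maps a pair of outcomes (recognition correct?, re-synthesis correct?) onto the four sample attributes. How "correct" is judged for the synthesized audio is left abstract here, since the patent does not specify the comparison; only the labelling is shown.

```python
from collections import Counter

def categorize(recognition_correct: bool, synthesis_correct: bool) -> str:
    """Map an (ASR correct?, re-synthesis correct?) pair onto the four sample attributes."""
    if recognition_correct and synthesis_correct:
        return "positive-positive"   # T11
    if recognition_correct:
        return "positive-negative"   # T12
    if synthesis_correct:
        return "negative-positive"   # F21
    return "negative-negative"       # F22

# Usage: tally a batch of (recognition, synthesis) outcomes
outcomes = [(True, True), (True, False), (False, True), (True, True), (False, False)]
counts = Counter(categorize(r, s) for r, s in outcomes)
print(counts)
```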
Further, according to the dominant-feature proportion of the classification results, the feature combination closest to the original audio is screened out by computing a comparison against the original audio: "positive-positive" samples correspond to the class A feature group, "positive-negative" samples to the class B feature group, "negative-positive" samples to the class C feature group and "negative-negative" samples to the class D feature group, with priority A > B > C > D; the accuracy of each class of samples is then computed and used as the scoring criterion of feature superiority to screen out the dominant features.
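A minimal sketch of using per-group statistics as the scoring criterion follows. The A/B/C/D grades match the paragraph above, while the numeric weights and the aggregation into a per-feature-set score are assumptions added only to make the priority A > B > C > D computable.

```python
from collections import defaultdict

GRADE = {"positive-positive": "A", "positive-negative": "B",
         "negative-positive": "C", "negative-negative": "D"}
WEIGHT = {"A": 1.0, "B": 0.7, "C": 0.4, "D": 0.0}   # assumed weights, only A > B > C > D is given

def dominance_scores(samples):
    """samples: iterable of (feature_set_name, attribute) pairs.
    Returns a score per feature set, used to screen the dominant feature expression set."""
    totals, score = defaultdict(int), defaultdict(float)
    for feature_set, attribute in samples:
        totals[feature_set] += 1
        score[feature_set] += WEIGHT[GRADE[attribute]]
    return {f: score[f] / totals[f] for f in totals}

# Usage: rank two hypothetical feature sets by their dominance score
samples = [("mfcc", "positive-positive"), ("mfcc", "positive-negative"),
           ("spectrogram", "negative-positive"), ("spectrogram", "positive-positive")]
ranked = sorted(dominance_scores(samples).items(), key=lambda kv: kv[1], reverse=True)
print(ranked)   # the highest-scoring feature set is kept as the staged model feature
```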
The invention has the beneficial effects that:
1) Based on considerations of performance and resources, the invention provides a model fusion method oriented to a large edge environment; it uses the real-time processing and task scheduling of edge devices to reasonably schedule audio sources with different degrees of noise, and combines the speech recognition module with the speech synthesis module, greatly enriching the creative scope of the new model.
2) The method evaluates the dominance ratio of the corresponding features according to the proportion of correct samples, screens out dominant feature expression sets as staged model features, and continuously iterates the training of the model, optimizing its final effect; once the model's performance gradually converges and stabilizes, its robustness is well assured.
3) The richer processing capability of the edge environment is used effectively, speech recognition and speech synthesis technologies are combined, and a speech model with richer functions and more robust performance is obtained.
4) The speech environment and voice experience of human-computer interaction are improved to a certain degree, bringing a comfortable experience to practical users.
5) This new modeling idea offers a way forward for audio devices and demonstrates the great expressive power of feature-rich audio equipment.
Drawings
FIG. 1 is an overall architecture diagram;
FIG. 2 is a diagram of an edge-side speech enhancement model;
FIG. 3 is an iterative diagram of speech recognition and speech synthesis models.
Detailed Description
The invention will be further described with reference to the following drawings and examples, but the invention is not limited to the following examples:
As shown in Figs. 1, 2 and 3, an edge-device-oriented speech recognition-synthesis joint modeling method includes the following steps:
1) A data set sample is acquired. All audio data (six dialects: Shanxi, Minnan, Changsha, Sichuan, Hebei and Shanghai), recorded at a 16 kHz sampling rate and stored in PCM format, are divided into (a) clean audio recorded in a quiet environment and (b) different types of noise audio (specifically white noise, pink noise, babble noise and the like, classified with reference to a noise library);
2) Data processing is performed. First, noise fusion is carried out: noise is added to the clean audio, and each clean recording is packaged together with its corresponding noise-added version;
3) An edge server is built; audio front-end processing, including dereverberation, noise reduction and noise separation, is performed on this layer of devices; the best model for the corresponding noise is screened through fused processing with wavelets, spectral subtraction and neural network models, and the model that yields the higher audio quality is selected by voting to process the audio;
4) A Chinese dialect speech recognition model is built on a basic CNN + RNN architecture; the CNN represents secondary feature processing after MFCC speech features are extracted, the audio spectrogram is taken as an input feature, and the extracted spectrogram feature vector and MFCC feature vector are normalized;
5) A Chinese dialect speech synthesis model is built, providing an interface for multi-dimensional cross fusion based on an improved WaveNet model;
6) The joint speech recognition and speech synthesis models are deployed in the upper-layer cloud, which receives the processing results of the edge layer as its input source;
7) Features are extracted from the speech sample set and passed through the Chinese dialect speech recognition model of step 4) to obtain positive samples T1 with accuracy acc1 and negative samples F1 with accuracy acc2;
8) T1 and F1 are respectively input into the Chinese dialect speech synthesis model of step 5); the synthesized audio outputs corresponding to these inputs fall into four categories, namely T11, T12, F21 and F22, where T11 means the speech recognition result is a positive sample and the speech synthesis result is a positive sample, T12 means the recognition result is positive and the synthesis result is negative, F21 means the recognition result is negative and the synthesis result is positive, and F22 means both the recognition and synthesis results are negative;
9) The dominance ratio of the corresponding features is evaluated according to the proportion of correct samples, the dominant feature expression set is screened out as the staged model features, the hyper-parameters and weights are adjusted to train the speech recognition model until the loss converges, the model is saved and reloaded to be combined with the speech synthesis module, the model is iteratively trained through the feedback parameter-update mechanism of neural network back-propagation, and a reasonable number of training iterations is set by tuning the model hyper-parameters, optimizing the final effect of the model; once the model's performance gradually converges and stabilizes, its robustness is assured, as sketched below.
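To make the iteration of steps 7)-9) concrete, the control-flow sketch below shows one way the recognize-synthesize-score-retrain loop and its convergence test could be organized. The callables transcribe, synthesize, evaluate and retrain are placeholders for the trained models and the dominance-ratio evaluation, not the patent's actual interfaces; only the loop and the stopping rule are illustrated.

```python
def joint_iteration(train_set, transcribe, synthesize, evaluate, retrain,
                    max_rounds=20, patience=3, tol=1e-3):
    """Control-flow sketch of the recognize -> synthesize -> score -> retrain loop."""
    best_loss, stale = float("inf"), 0
    for _ in range(max_rounds):
        texts = [transcribe(x) for x in train_set]            # step 7: positive/negative split
        audios = [synthesize(t) for t in texts]               # step 8: T11 / T12 / F21 / F22
        loss, dominant = evaluate(train_set, texts, audios)   # step 9: dominance ratio
        retrain(dominant)                                      # keep only the dominant features
        if best_loss - loss < tol:                             # loss has stopped improving
            stale += 1
            if stale >= patience:
                break                                          # model effect has stabilized
        else:
            best_loss, stale = loss, 0
    return best_loss

# Usage with trivial stand-ins, just to exercise the control flow
best = joint_iteration(
    train_set=[0.3, 0.7, 0.9],
    transcribe=lambda x: f"text({x})",
    synthesize=lambda t: t.upper(),
    evaluate=lambda d, t, a: (sum(d) / len(d), ["mfcc"]),
    retrain=lambda feats: None,
)
print(best)
```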

Claims (4)

1. An edge-device-oriented speech recognition-synthesis joint modeling method is characterized by comprising the following steps:
1) collecting a data set sample:
collecting audio data from various environments and dividing it into class a, clean audio recorded in a quiet environment, and class b, different types of noise audio, where class b is drawn from a noise library;
2) performing data processing:
performing noise fusion processing: adding noise to the clean audio and packaging the clean audio data together with the corresponding noise-added audio data;
3) building an edge server:
building stable edge server equipment with a reliable upload and download interface; deploying a speech enhancement module whose algorithms comprise spectral subtraction, wavelet hard thresholding and a GAN network model and which, in combination with a speech recognition engine, obtains an optimal result by voting and uses it as the optimization step of audio preprocessing, performs audio front-end processing on this layer of devices, including dereverberation, noise reduction and noise separation, screens the best model for the corresponding noise through fused processing with wavelets, spectral subtraction and neural network models, and selects, by voting, the model that yields the higher audio quality to process the audio;
4) building a Chinese dialect speech recognition model:
adopting a basic model architecture based on CNN + RNN, using the CNN to represent secondary feature processing after MFCC speech features are extracted, taking the audio spectrogram as an input feature, and normalizing the extracted spectrogram feature vector and the MFCC feature vector; building a Chinese dialect speech synthesis model that provides an interface for multi-dimensional cross fusion based on an improved WaveNet model; and deploying the joint speech recognition and speech synthesis models in the upper-layer cloud, which receives the processing results of the edge layer as its input source;
5) processing the data of step 3) for the first time on the edge service layer devices thus built: performing audio preprocessing by means of speech enhancement to improve the machine intelligibility of the audio, extracting features from the speech sample set, and passing them through the Chinese dialect speech recognition model of step 4) to obtain positive samples T1 with accuracy acc1 and negative samples F1 with accuracy acc2; inputting T1 and F1 respectively into the Chinese dialect speech synthesis model of step 4), whose synthesized audio outputs corresponding to these inputs fall into four categories, namely T11, T12, F21 and F22; wherein T11 means the speech recognition result is a positive sample and the speech synthesis result is a positive sample, T12 means the recognition result is positive and the synthesis result is negative, F21 means the recognition result is negative and the synthesis result is positive, and F22 means both the recognition and synthesis results are negative;
6) evaluating the dominance ratio of the corresponding features according to the proportion of correct samples, screening out the dominant feature expression set as the staged model features, adjusting the hyper-parameters and weights to train the speech recognition model until the loss converges and saving the model, reloading it and combining it with the speech synthesis module, iteratively training the model through the feedback parameter-update mechanism of neural network back-propagation, and setting a reasonable number of training iterations by tuning the model hyper-parameters, thereby optimizing the final effect of the model and ensuring its robustness once the model's performance gradually converges and stabilizes.
2. The edge-device-oriented speech recognition-synthesis combined modeling method according to claim 1, wherein the processed speech data are converted into corresponding text by speech recognition: using MFCC features and the strength of CNNs at feature extraction, a CCLD Chinese speech recognition network model is built, in which MFCC audio features serve as the reference, a CNN extracts the key features and outputs them to an RNN consisting of LSTM layers, and finally a three-layer DNN is attached as the output discriminator; a Chinese speech recognition engine is built around the characteristics of Chinese speech data, the Chinese speech recognition model is trained, and the samples classified by the model are then divided into correct samples and incorrect samples.
3. The edge-device-oriented speech recognition-synthesis combined modeling method according to claim 1, wherein the positive and negative samples are respectively converted into corresponding audio samples by the speech synthesis model, whether the corresponding recognition results are correct is counted, and they are classified again into positive and negative samples; the improved WaveNet-based speech synthesis model converts text into audio, the text data being the output of the previous model and therefore already divided into positive and negative samples; after passing through the speech synthesis model each of them again receives a corresponding positive or negative label, so the resulting recognition outcomes carry the sample attributes "positive-positive", "positive-negative", "negative-positive" and "negative-negative".
4. The edge-device-oriented speech recognition-synthesis combined modeling method according to claim 1, wherein, according to the dominant-feature proportion of the classification results, the feature combination closest to the original audio is screened out by computing a comparison against the original audio: "positive-positive" samples correspond to the class A feature group, "positive-negative" samples to the class B feature group, "negative-positive" samples to the class C feature group and "negative-negative" samples to the class D feature group, with priority A > B > C > D; the accuracy of each class of samples is then computed and used as the scoring criterion of feature superiority to screen out the features.
CN201910847985.4A 2019-09-09 2019-09-09 Edge-device-oriented speech recognition-synthesis combined modeling method Active CN110827791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910847985.4A CN110827791B (en) 2019-09-09 2019-09-09 Edge-device-oriented speech recognition-synthesis combined modeling method

Publications (2)

Publication Number Publication Date
CN110827791A CN110827791A (en) 2020-02-21
CN110827791B (en) 2022-07-01

Family

ID=69547963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910847985.4A Active CN110827791B (en) 2019-09-09 2019-09-09 Edge-device-oriented speech recognition-synthesis combined modeling method

Country Status (1)

Country Link
CN (1) CN110827791B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111540345B (en) * 2020-05-09 2022-06-24 北京大牛儿科技发展有限公司 Weakly supervised speech recognition model training method and device
CN111583913B (en) * 2020-06-15 2020-11-03 深圳市友杰智新科技有限公司 Model training method and device for speech recognition and speech synthesis and computer equipment
CN111833878A (en) * 2020-07-20 2020-10-27 中国人民武装警察部队工程大学 Chinese voice interaction non-inductive control system and method based on raspberry Pi edge calculation
CN113823314B (en) * 2021-08-12 2022-10-28 北京荣耀终端有限公司 Voice processing method and electronic equipment
CN114765025A (en) * 2022-04-25 2022-07-19 脸萌有限公司 Method for generating and recognizing speech recognition model, device, medium and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109256118A (en) * 2018-10-22 2019-01-22 江苏师范大学 End-to-end Chinese dialects identifying system and method based on production auditory model
CN109616093A (en) * 2018-12-05 2019-04-12 平安科技(深圳)有限公司 End-to-end phoneme synthesizing method, device, equipment and storage medium
CN109712609A (en) * 2019-01-08 2019-05-03 华南理工大学 A method of it solving keyword and identifies imbalanced training sets
US10373073B2 (en) * 2016-01-11 2019-08-06 International Business Machines Corporation Creating deep learning models using feature augmentation

Also Published As

Publication number Publication date
CN110827791A (en) 2020-02-21

Similar Documents

Publication Publication Date Title
CN110827791B (en) Edge-device-oriented speech recognition-synthesis combined modeling method
CN106104674B (en) Mixing voice identification
Ding et al. Autospeech: Neural architecture search for speaker recognition
CN109599091B (en) Star-WAN-GP and x-vector based many-to-many speaker conversion method
CN1731509A (en) Mobile speech synthesis method
CN109256118B (en) End-to-end Chinese dialect identification system and method based on generative auditory model
CN111402928B (en) Attention-based speech emotion state evaluation method, device, medium and equipment
CN105304078B (en) Target sound data training device and target sound data training method
CN112069310A (en) Text classification method and system based on active learning strategy
CN1924994B (en) Embedded language synthetic method and system
CN110060691B (en) Many-to-many voice conversion method based on i-vector and VARSGAN
CN112102813B (en) Speech recognition test data generation method based on context in user comment
CN112861984B (en) Speech emotion classification method based on feature fusion and ensemble learning
Wang Research and design of smart home speech recognition system based on deep learning
CN115101085A (en) Multi-speaker time-domain voice separation method for enhancing external attention through convolution
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
Lee et al. NAS-TasNet: neural architecture search for time-domain speech separation
CN114282646A (en) Light power prediction method and system based on two-stage feature extraction and improved BilSTM
Zhao et al. Transferring age and gender attributes for dimensional emotion prediction from big speech data using hierarchical deep learning
Ding et al. Speech emotion features selection based on BBO-SVM
KR101727306B1 (en) Languange model clustering based speech recognition apparatus and method
Rana Emotion classification from noisy speech-A deep learning approach
Yang et al. Essence knowledge distillation for speech recognition
CN110489348A (en) A kind of software function defect method for digging based on transfer learning
CN110619886B (en) End-to-end voice enhancement method for low-resource Tujia language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant