CN110827791A - Edge-device-oriented speech recognition-synthesis combined modeling method - Google Patents
- Publication number
- Publication number: CN110827791A (application number CN201910847985.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING (under G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS)
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/063—Training (creation of reference templates; adaptation to the characteristics of the speaker's voice)
- G10L15/26—Speech to text systems
- G10L25/24—Speech or voice analysis techniques in which the extracted parameters are the cepstrum
- G10L25/30—Speech or voice analysis techniques using neural networks
Abstract
An edge-device-oriented speech recognition-synthesis joint modeling method. Drawing on research into real-time computation and the workload distribution of edge-computing strategies, and inspired by the "imitate without distortion" party game (akin to the telephone game), the method iterates a model that couples speech recognition and speech synthesis at the back end. A real-time, high-efficiency processing module is built on speech-enhancement techniques from the audio-processing field, and an iterative recognition and synthesis model for Chinese dialects is built on speech recognition and speech synthesis technology. By fully exploiting the characteristics of these speech technologies, the method realizes a dialect-processing model capable of recognition and synthesis at high efficiency, makes effective use of the processing capacity of the edge environment, and combines speech recognition with speech synthesis to yield a speech model with richer functionality and more robust performance.
Description
Technical Field
The invention belongs to the technical fields of edge computing and audio research. It involves edge servers, speech enhancement, speech recognition, speech synthesis, and neural networks, and in particular relates to an edge-device-oriented speech recognition-synthesis joint modeling method.
Background
In the era of Industry 4.0, the rapid rise of artificial intelligence and the Internet of Things (IoT) has brought great convenience to everyday life, and a large number of intelligent products have emerged. Meanwhile, with the development of edge computing in recent years, edge-computing strategies can effectively distribute heavy computational workloads, meet real-time requirements, and increase a model's effective computing capacity. This opens up broad possibilities for continually extending the functions of intelligent products.
Continued advances in neural networks and deep learning have driven major breakthroughs in related research, most visibly in the speech and image domains. In recent years, speech processing, speech recognition, and speech synthesis have developed rapidly and attracted wide attention. Nevertheless, several technical problems still call for optimization: the real-time performance of machine processing, the robustness of intelligent applications, and the completeness of object features, among others. When some Chinese dialect data were tested with the "Dictation Master" WeChat mini-program, several of the models under test recognized them poorly. The causes are mostly poor input-data characteristics, insufficient model capability, and program defects. Besides defect detection, therefore, improving model performance and data processing is an effective way to raise accuracy and address this problem. It is thus important to make full use of model-optimization techniques from the speech field and mobile-computing technology, and to extend application functionality by exploiting model characteristics.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention aims to provide an edge-device-oriented speech recognition-synthesis joint modeling method. A real-time, high-efficiency processing module is built on speech-enhancement techniques from the audio-processing field; an iterative speech recognition and synthesis model for Chinese dialects is built on speech recognition and speech synthesis technology; and the characteristics of these technologies are fully exploited to realize a real-time dialect-processing model that recognizes, synthesizes, and runs efficiently.
To achieve this aim, the invention adopts the following technical scheme:
an edge-device-oriented speech recognition-synthesis joint modeling method is characterized by comprising the following steps:
1) collecting a data set sample:
collecting audio data from various environments and dividing it into two classes: (a) clean audio recorded in a quiet environment and (b) noise audio of different types, where class (b) is drawn from a noise library;
2) performing data processing:
first performing noise fusion: noise is added to the clean audio, and each clean clip is packaged together with its corresponding noise-added clip;
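For illustration, the noise-fusion step can be sketched as mixing a noise clip into clean audio at a chosen signal-to-noise ratio. The helper names and the default SNR below are illustrative assumptions of this sketch; the patent does not specify the mixing parameters.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mixture has the requested SNR (dB)."""
    # Tile or trim the noise so it covers the whole clean clip.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    p_clean = np.mean(np.asarray(clean, dtype=np.float64) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=np.float64) ** 2)
    # Scale factor that brings the noise power to p_clean / 10^(snr_db/10).
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

def make_pairs(clean_clips, noise, snr_db=10.0):
    """Package each clean clip with its noise-added counterpart, as in step 2)."""
    return [(c, mix_at_snr(c, noise, snr_db)) for c in clean_clips]
```

A 16 kHz PCM clip loaded as a float array can be passed directly; the paired output matches the "clean audio plus corresponding noise-added audio" packaging described above.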
3) building an edge server:
building stable edge-server equipment that provides reliable upload and download interfaces, together with an algorithmic speech-enhancement module. The module applies spectral subtraction, wavelet hard thresholding, and a GAN network model, and uses a speech recognition engine with a voting scheme to obtain the optimal result, which serves as the optimization step of audio preprocessing. Audio front-end processing (dereverberation, noise reduction, noise separation, and the like) is performed on this layer of equipment; the fusion of wavelet, spectral-subtraction, and neural-network models is used to screen the model best suited to the noise at hand, and the voting scheme selects the method that yields the higher audio quality;
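As one of the enhancement candidates named above, spectral subtraction can be sketched with a short-time Fourier transform: estimate the noise magnitude spectrum, subtract it from each frame, and resynthesize by overlap-add. Estimating the noise from the first few frames and the 1% spectral floor are assumptions of this sketch, not parameters given by the patent.

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=512, hop=256, noise_frames=5):
    """Magnitude spectral subtraction with overlap-add resynthesis."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    frames = np.stack([noisy[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Noise estimate: average magnitude of the leading (assumed noise-only) frames.
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Subtract the estimate; keep a small spectral floor to limit musical noise.
    clean_mag = np.maximum(mag - noise_mag, 0.01 * mag)
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(noisy))
    for i in range(n_frames):
        out[i * hop : i * hop + frame_len] += clean[i]
    return out
```

The wavelet and GAN candidates would expose the same audio-in, audio-out signature, so all three can feed the voting step unchanged.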
4) building a Chinese dialect speech recognition model:
adopting a basic CNN + RNN model architecture, in which the CNN performs secondary feature processing after the MFCC speech features are extracted; the audio spectrogram is taken as an input feature, and the extracted spectrogram feature vector and MFCC feature vector are normalized. A Chinese dialect speech synthesis model is built on an improved WaveNet model and provides an interface for multi-dimensional cross fusion. The joint speech recognition and speech synthesis models are deployed in the upper-layer cloud and receive the processing results of the edge layer as their input source;
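The normalization of the spectrogram and MFCC feature vectors in step 4) can be sketched as per-coefficient mean-and-variance normalization of each stream followed by concatenation along the feature axis. The frames-by-coefficients layout and the epsilon guard are assumptions of this sketch.

```python
import numpy as np

def normalize(features, eps=1e-8):
    """Zero-mean, unit-variance normalization per coefficient (frames x dims)."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + eps)

def fuse_features(spectrogram_feats, mfcc_feats):
    """Normalize each stream, then concatenate along the feature axis
    to form the input fed to the CNN + RNN recognition model."""
    return np.concatenate([normalize(spectrogram_feats), normalize(mfcc_feats)],
                          axis=1)
```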
5) processing the data on the edge-service-layer equipment built in step 3): audio preprocessing by speech enhancement improves the machine intelligibility of the audio; the features of the speech sample set are then extracted and passed through the Chinese dialect speech recognition model of step 4), yielding positive samples T1 with accuracy acc1 and negative samples F1 with accuracy acc2. T1 and F1 are input into the Chinese dialect speech synthesis model of step 4), and the synthesized audio falls into four output categories, T11, T12, F21, and F22: T11 means the speech recognition result is positive and the speech synthesis result is positive; T12 means the recognition result is positive and the synthesis result is negative; F21 means the recognition result is negative and the synthesis result is positive; F22 means both the recognition and synthesis results are negative;
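Step 5) amounts to tallying two booleans per sample: whether recognition was correct and whether the synthesized audio was judged correct. A minimal sketch of the bookkeeping follows, with the two evaluation functions left as caller-supplied placeholders, since the patent does not define how a synthesized clip is judged; reading acc1 and acc2 as the shares of recognition-positive and recognition-negative samples is likewise one plausible interpretation.

```python
def split_samples(samples, recognize_ok, synthesize_ok):
    """Partition samples into T11/T12/F21/F22 and report acc1/acc2.

    recognize_ok / synthesize_ok are callables returning True or False
    for a given sample, standing in for the real model pipelines.
    """
    buckets = {"T11": [], "T12": [], "F21": [], "F22": []}
    key_of = {(True, True): "T11", (True, False): "T12",
              (False, True): "F21", (False, False): "F22"}
    for s in samples:
        buckets[key_of[(recognize_ok(s), synthesize_ok(s))]].append(s)
    n_pos = len(buckets["T11"]) + len(buckets["T12"])  # recognition positives (T1)
    n_neg = len(buckets["F21"]) + len(buckets["F22"])  # recognition negatives (F1)
    acc1 = n_pos / len(samples) if samples else 0.0
    acc2 = n_neg / len(samples) if samples else 0.0
    return buckets, acc1, acc2
```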
6) evaluating the dominance ratio of the corresponding features according to the proportion of correct samples, and screening out a dominant-feature expression set as the staged model features; adjusting hyper-parameters and weights to train the speech recognition model until the loss converges, saving the model, and reloading it in combination with the speech synthesis module; iteratively training the model through the back-propagation feedback mechanism of the neural network, and setting a reasonable number of training iterations by tuning the hyper-parameters so that the network converges faster and more economically, thereby optimizing the final model. When the model's performance gradually converges and stabilizes, its robustness is assured.
Further, the processed speech data is converted into the corresponding text by speech recognition. A CCLD Chinese speech recognition network model is built with MFCC features, exploiting the feature-extraction strength of CNNs: MFCC audio features serve as the reference input, a CNN extracts key features and feeds them to an RNN with LSTM layers, and a three-layer DNN is attached as the output discriminator. A Chinese speech recognition engine is built around the speech characteristics of the Chinese data, the Chinese speech recognition model is trained, and the samples classified by the model are then divided into correct and incorrect samples.
Further, the positive and negative samples are converted into corresponding audio samples by the speech synthesis model; whether each corresponding recognition result is correct is counted, and the results are again classified as positive or negative. The text is converted into audio by the improved WaveNet-based speech synthesis model, where the text data is the output of the previous model and therefore already consists of positive and negative samples; after passing through the speech synthesis model, each sample acquires a further positive or negative label, so the resulting outcomes carry the sample attributes "positive-positive", "positive-negative", "negative-positive", and "negative-negative".
Further, according to the dominant-feature proportions of the classification results, the feature combination closest to the original audio is screened out by computation and comparison against the original audio: "positive-positive" samples form the class-A feature group, "positive-negative" samples the class-B group, "negative-positive" samples the class-C group, and "negative-negative" samples the class-D group, with priority A > B > C > D. The accuracy of each class of samples is then computed and used as the scoring criterion of feature superiority to screen out the dominant features.
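The grading rule in the paragraph above can be sketched directly: map each sample's recognition-synthesis outcome pair to a grade, then score each grade for dominant-feature screening. Treating the per-grade share of samples as the "accuracy" score, and breaking ties by the fixed A > B > C > D priority, are interpretations of the text rather than formulas the patent spells out.

```python
# Priority order from the patent: A > B > C > D.
GRADE_OF = {("positive", "positive"): "A", ("positive", "negative"): "B",
            ("negative", "positive"): "C", ("negative", "negative"): "D"}

def grade_and_score(outcomes):
    """outcomes: list of (recognition, synthesis) labels, each 'positive'/'negative'.

    Returns per-grade counts, the share of samples in each grade (used here
    as the feature-superiority score), and the resulting grade ranking.
    """
    counts = {"A": 0, "B": 0, "C": 0, "D": 0}
    for pair in outcomes:
        counts[GRADE_OF[pair]] += 1
    total = len(outcomes) or 1
    scores = {g: c / total for g, c in counts.items()}
    priority = {"A": 0, "B": 1, "C": 2, "D": 3}
    ranking = sorted(counts, key=lambda g: (-scores[g], priority[g]))
    return counts, scores, ranking
```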
The invention has the beneficial effects that:
1) Based on considerations of performance and resources, the invention provides a model-fusion method oriented to large edge environments. The real-time processing and task scheduling of edge devices are used to reasonably schedule audio sources of different noise levels, and the speech recognition and speech synthesis modules are combined, greatly enriching the creative scope of the new model.
2) The method evaluates the dominance ratio of corresponding features according to the proportion of correct samples, screens out dominant-feature expression sets as staged model features, and iteratively trains the model to optimize its final performance; once the model's performance gradually converges and stabilizes, its robustness is well assured.
3) The richer processing capacity of the edge environment is used effectively, and speech recognition is combined with speech synthesis to design a speech model with richer functionality and more robust performance.
4) The speech environment and speech experience of human-computer interaction are improved to a certain degree, bringing a comfortable experience to users in practical applications.
5) This novel modeling idea offers a way forward for audio equipment and demonstrates the great expressive power of feature-rich audio devices.
Drawings
FIG. 1 is an overall architecture diagram;
FIG. 2 is a diagram of an edge-side speech enhancement model;
FIG. 3 is an iterative diagram of speech recognition and speech synthesis models.
Detailed Description
The invention is further described below with reference to the drawings and examples, but it is not limited to the following examples:
as shown in fig. 1, 2 and 3, an edge device-oriented speech recognition-synthesis joint modeling method includes the following steps:
1) A data set sample is acquired. All audio data are divided into (a) clean audio recorded in a quiet environment and (b) noise audio of different types (for example white noise, pink noise, and babble, as catalogued in a noise library). The audio is sampled at 16 kHz and stored in PCM format, and the clean data cover six dialects: Shanxi, Minnan, Changsha, Sichuan, Hebei, and Shanghai;
2) Data processing is performed. Noise fusion comes first: noise is added to the clean audio, and each clean clip is packaged together with its corresponding noise-added clip;
3) An edge server is built. Audio front-end processing (dereverberation, noise reduction, noise separation, and the like) is performed on this layer of equipment; the fusion of wavelet, spectral-subtraction, and neural-network models is used to screen the model best suited to the noise at hand, and a voting scheme selects the method that yields the higher audio quality;
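The model-screening-plus-voting idea in step 3) can be sketched as: run each enhancement candidate over the noisy input, let several quality judges score every output, and pick the candidate with the most first-place votes. The judge functions here are placeholders (the patent uses measures such as a speech recognition engine's result), and the tie-breaking by mean score is an assumption of this sketch.

```python
import numpy as np

def vote_best_enhancement(noisy, methods, judges):
    """methods: {name: fn(audio) -> audio}; judges: [fn(audio) -> float].

    Each judge votes for its highest-scoring candidate; the candidate
    with the most votes wins, with mean score as the tie-breaker.
    """
    enhanced = {name: fn(noisy) for name, fn in methods.items()}
    votes = {name: 0 for name in enhanced}
    all_scores = {name: [] for name in enhanced}
    for judge in judges:
        scored = {name: judge(audio) for name, audio in enhanced.items()}
        votes[max(scored, key=scored.get)] += 1
        for name, s in scored.items():
            all_scores[name].append(s)
    winner = max(votes, key=lambda n: (votes[n], float(np.mean(all_scores[n]))))
    return winner, enhanced[winner]
```

In the patent's setting, the `methods` dictionary would hold the spectral-subtraction, wavelet-thresholding, and GAN enhancers, while each judge would wrap a quality estimate for one noise type.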
4) A Chinese dialect speech recognition model is built on a basic CNN + RNN architecture, in which the CNN performs secondary feature processing after the MFCC speech features are extracted; the audio spectrogram serves as an input feature, and the extracted spectrogram feature vector and MFCC feature vector are normalized;
5) A Chinese dialect speech synthesis model is built on an improved WaveNet model and provides an interface for multi-dimensional cross fusion;
6) The joint speech recognition and speech synthesis models are deployed in the upper-layer cloud, receiving the processing results of the edge layer as their input source;
7) The features of the speech sample set are extracted and passed through the Chinese dialect speech recognition model of step 4), yielding positive samples T1 with accuracy acc1 and negative samples F1 with accuracy acc2;
8) T1 and F1 are input into the Chinese dialect speech synthesis model of step 5), and the synthesized audio falls into four output categories, T11, T12, F21, and F22 (T11: the recognition result is positive and the synthesis result is positive; T12: the recognition result is positive and the synthesis result is negative; F21: the recognition result is negative and the synthesis result is positive; F22: both results are negative);
9) The dominance ratio of the corresponding features is evaluated according to the proportion of correct samples, and a dominant-feature expression set is screened out as the staged model features. Hyper-parameters and weights are adjusted to train the speech recognition model until the loss converges; the model is saved and then reloaded in combination with the speech synthesis module. The model is iteratively trained through the back-propagation feedback mechanism of the neural network, a reasonable number of training iterations is set by tuning the hyper-parameters, and the final model is optimized. When the model's performance gradually converges and stabilizes, its robustness is assured.
Claims (4)
1. An edge-device-oriented speech recognition-synthesis joint modeling method is characterized by comprising the following steps:
1) collecting a data set sample:
collecting audio data from various environments and dividing it into two classes: (a) clean audio recorded in a quiet environment and (b) noise audio of different types, where class (b) is drawn from a noise library;
2) performing data processing:
performing noise fusion: noise is added to the clean audio, and the clean audio data and the corresponding noise-added audio data are packaged and assembled together;
3) building an edge server:
building stable edge-server equipment that provides reliable upload and download interfaces, together with an algorithmic speech-enhancement module. The module applies spectral subtraction, wavelet hard thresholding, and a GAN network model, and uses a speech recognition engine with a voting scheme to obtain the optimal result, which serves as the optimization step of audio preprocessing. Audio front-end processing (dereverberation, noise reduction, noise separation, and the like) is performed on this layer of equipment; the fusion of wavelet, spectral-subtraction, and neural-network models is used to screen the model best suited to the noise at hand, and the voting scheme selects the method that yields the higher audio quality;
4) building a Chinese dialect speech recognition model:
adopting a basic CNN + RNN model architecture, in which the CNN performs secondary feature processing after the MFCC speech features are extracted; the audio spectrogram is taken as an input feature, and the extracted spectrogram feature vector and MFCC feature vector are normalized. A Chinese dialect speech synthesis model is built on an improved WaveNet model and provides an interface for multi-dimensional cross fusion. The joint speech recognition and speech synthesis models are deployed in the upper-layer cloud and receive the processing results of the edge layer as their input source;
5) processing the data on the edge-service-layer equipment built in step 3): audio preprocessing by speech enhancement improves the machine intelligibility of the audio; the features of the speech sample set are then extracted and passed through the Chinese dialect speech recognition model of step 4), yielding positive samples T1 with accuracy acc1 and negative samples F1 with accuracy acc2. T1 and F1 are respectively input into the Chinese dialect speech synthesis model of step 4), and the synthesized audio falls into four output categories, T11, T12, F21, and F22: T11 means the speech recognition result is positive and the speech synthesis result is positive; T12 means the recognition result is positive and the synthesis result is negative; F21 means the recognition result is negative and the synthesis result is positive; F22 means both the recognition and synthesis results are negative;
6) evaluating the dominance ratio of the corresponding features according to the proportion of correct samples, and screening out a dominant-feature expression set as the staged model features; adjusting hyper-parameters and weights to train the speech recognition model until the loss converges, saving the model, and reloading it in combination with the speech synthesis module; iteratively training the model through the back-propagation feedback mechanism of the neural network, setting a reasonable number of training iterations by tuning the hyper-parameters, and optimizing the final model. When the model's performance gradually converges and stabilizes, its robustness is assured.
2. The edge-device-oriented speech recognition-synthesis joint modeling method of claim 1, wherein the processed speech data is converted into corresponding text by speech recognition; a CCLD Chinese speech recognition network model is built with MFCC features, exploiting the feature-extraction strength of CNNs: MFCC audio features serve as the reference input, a CNN extracts key features and feeds them to an RNN with LSTM layers, and a three-layer DNN is attached as the output discriminator; a Chinese speech recognition engine is built around the speech characteristics of the Chinese data, the Chinese speech recognition model is trained, and the samples classified by the model are then divided into correct and incorrect samples.
3. The edge-device-oriented speech recognition-synthesis joint modeling method of claim 1, wherein the positive and negative samples are respectively converted into corresponding audio samples by the speech synthesis model, whether each corresponding recognition result is correct is counted, and the results are again classified as positive or negative; the text is converted into audio by the improved WaveNet-based speech synthesis model, where the text data is the output of the previous model and therefore already consists of positive and negative samples; after passing through the speech synthesis model, each sample acquires a further positive or negative label, so the resulting outcomes carry the sample attributes "positive-positive", "positive-negative", "negative-positive", and "negative-negative".
4. The edge-device-oriented speech recognition-synthesis joint modeling method of claim 1, wherein, according to the dominant-feature proportions of the classification results, the feature combination closest to the original audio is screened out by computation and comparison against the original audio: "positive-positive" samples form the class-A feature group, "positive-negative" samples the class-B group, "negative-positive" samples the class-C group, and "negative-negative" samples the class-D group, with priority A > B > C > D; the accuracy of each class of samples is then computed and used as the scoring criterion of feature superiority to screen out the dominant features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910847985.4A CN110827791B (en) | 2019-09-09 | 2019-09-09 | Edge-device-oriented speech recognition-synthesis combined modeling method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110827791A true CN110827791A (en) | 2020-02-21 |
CN110827791B CN110827791B (en) | 2022-07-01 |
Family
ID=69547963
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910847985.4A Active CN110827791B (en) | 2019-09-09 | 2019-09-09 | Edge-device-oriented speech recognition-synthesis combined modeling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110827791B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111540345A (en) * | 2020-05-09 | 2020-08-14 | 北京大牛儿科技发展有限公司 | Weakly supervised speech recognition model training method and device |
CN111583913A (en) * | 2020-06-15 | 2020-08-25 | 深圳市友杰智新科技有限公司 | Model training method and device for speech recognition and speech synthesis and computer equipment |
CN111833878A (en) * | 2020-07-20 | 2020-10-27 | 中国人民武装警察部队工程大学 | Chinese voice interaction non-inductive control system and method based on raspberry Pi edge calculation |
CN113823314A (en) * | 2021-08-12 | 2021-12-21 | 荣耀终端有限公司 | Voice processing method and electronic equipment |
WO2023211369A3 (en) * | 2022-04-25 | 2024-03-21 | 脸萌有限公司 | Speech recognition model generation method and apparatus, speech recognition method and apparatus, medium, and device |
- 2019-09-09: application CN201910847985.4A filed in China; granted as patent CN110827791B (status: Active)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10373073B2 (en) * | 2016-01-11 | 2019-08-06 | International Business Machines Corporation | Creating deep learning models using feature augmentation |
CN109256118A (en) * | 2018-10-22 | 2019-01-22 | Jiangsu Normal University | End-to-end Chinese dialect identification system and method based on a generative auditory model |
CN109616093A (en) * | 2018-12-05 | 2019-04-12 | Ping An Technology (Shenzhen) Co., Ltd. | End-to-end speech synthesis method, apparatus, device and storage medium |
CN109712609A (en) * | 2019-01-08 | 2019-05-03 | South China University of Technology | Method for addressing training-sample imbalance in keyword recognition |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111540345A (en) * | 2020-05-09 | 2020-08-14 | 北京大牛儿科技发展有限公司 | Weakly supervised speech recognition model training method and device |
CN111540345B (en) * | 2020-05-09 | 2022-06-24 | 北京大牛儿科技发展有限公司 | Weakly supervised speech recognition model training method and device |
CN111583913A (en) * | 2020-06-15 | 2020-08-25 | 深圳市友杰智新科技有限公司 | Model training method and device for speech recognition and speech synthesis and computer equipment |
CN111833878A (en) * | 2020-07-20 | 2020-10-27 | Engineering University of the Chinese People's Armed Police Force | Imperceptible Chinese voice-interaction control system and method based on Raspberry Pi edge computing |
CN113823314A (en) * | 2021-08-12 | 2021-12-21 | Honor Device Co., Ltd. | Voice processing method and electronic device |
CN113823314B (en) * | 2021-08-12 | 2022-10-28 | Beijing Honor Device Co., Ltd. | Voice processing method and electronic device |
WO2023211369A3 (en) * | 2022-04-25 | 2024-03-21 | 脸萌有限公司 | Speech recognition model generation method and apparatus, speech recognition method and apparatus, medium, and device |
Also Published As
Publication number | Publication date |
---|---|
CN110827791B (en) | 2022-07-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110827791B (en) | Edge-device-oriented speech recognition-synthesis combined modeling method | |
Ding et al. | Autospeech: Neural architecture search for speaker recognition | |
CN109599091B (en) | Star-WAN-GP and x-vector based many-to-many speaker conversion method | |
WO2018014469A1 (en) | Voice recognition processing method and apparatus | |
CN102779510B (en) | Speech emotion recognition method based on feature space self-adaptive projection | |
CN109256118B (en) | End-to-end Chinese dialect identification system and method based on generative auditory model | |
CN1731509A (en) | Mobile speech synthesis method | |
CN111402928B (en) | Attention-based speech emotion state evaluation method, device, medium and equipment | |
CN112069310A (en) | Text classification method and system based on active learning strategy | |
CN112861984B (en) | Speech emotion classification method based on feature fusion and ensemble learning | |
CN112102813B (en) | Speech recognition test data generation method based on context in user comment | |
CN1924994B (en) | Embedded speech synthesis method and system | |
CN108647206B (en) | Chinese spam email identification method based on a chaotic-particle-swarm-optimized CNN | |
CN114282646A (en) | Optical power prediction method and system based on two-stage feature extraction and an improved BiLSTM | |
CN115101085A (en) | Multi-speaker time-domain speech separation method based on convolution-enhanced external attention | |
CN111583965A (en) | Voice emotion recognition method, device, equipment and storage medium | |
Lee et al. | NAS-TasNet: neural architecture search for time-domain speech separation | |
Yang et al. | Omni-sparsity dnn: Fast sparsity optimization for on-device streaming e2e asr via supernet | |
Zhang et al. | LD-CNN: A lightweight dilated convolutional neural network for environmental sound classification | |
CN110619886B (en) | End-to-end voice enhancement method for low-resource Tujia language | |
Zhao et al. | Transferring age and gender attributes for dimensional emotion prediction from big speech data using hierarchical deep learning | |
CN111091809B (en) | Regional accent recognition method and device based on depth feature fusion | |
Ye et al. | Tdcgan: Temporal dilated convolutional generative adversarial network for end-to-end speech enhancement | |
Ding et al. | Speech emotion features selection based on BBO-SVM | |
Rana | Emotion classification from noisy speech-A deep learning approach |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||