CN109272990B - Voice recognition method based on convolutional neural network - Google Patents
Voice recognition method based on convolutional neural network
- Publication number
- CN109272990B CN201811112506.6A
- Authority
- CN
- China
- Prior art keywords
- layer
- convolutional
- layers
- acoustic model
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Abstract
The invention provides a voice recognition method based on a convolutional neural network, which is better at extracting high-level features, has a simple modeling process, is easy to train, generalizes well, and can be widely applied to various voice recognition scenes. It includes: S1: preprocessing an input original voice signal; S2: extracting key characteristic parameters reflecting the characteristics of the voice signal to form a feature vector sequence; S3: constructing an acoustic model in an end-to-end manner based on a DCNN model, with Connectionist Temporal Classification (CTC) as the loss function; S4: training the acoustic model to obtain a trained acoustic model; S5: inputting the feature vector sequence to be recognized obtained in step S2 into the trained acoustic model to obtain a recognition result; S6: performing subsequent operations based on the recognition result obtained in step S5 to obtain the word string that the speech signal outputs with maximum probability, where the word string is the recognized text of the original speech.
Description
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice recognition method based on a convolutional neural network.
Background
In speech recognition technology, the GMM-HMM (Gaussian Mixture Model–Hidden Markov Model) has long played the leading role as the acoustic model. However, owing to its characteristics, a GMM-HMM acoustic model requires a preliminary alignment step before training, in which the data of each frame is aligned with its corresponding label; this alignment process is cumbersome and the training time is long. Moreover, because the model is a combination of a GMM and an HMM, the concrete modeling process is relatively complicated to implement, which imposes certain limitations in practical applications of speech recognition technology.
Disclosure of Invention
In order to solve the problems of long training time, complex modeling process and limited applicability of existing acoustic models, the invention provides a voice recognition method based on a convolutional neural network, which is better at extracting high-level features, has a simple modeling process, is easy to train, generalizes well, and can be applied more widely to various voice recognition scenes.
The technical scheme of the invention is as follows: the speech recognition method based on the convolutional neural network comprises the following steps:
s1: inputting original voice, preprocessing the original voice signal, and performing related transformation processing;
s2: extracting key characteristic parameters reflecting the characteristics of the voice signals to form a characteristic vector sequence;
s3: constructing an acoustic model;
s4: training the acoustic model to obtain a trained acoustic model;
s5: inputting the feature vector sequence to be recognized obtained in the step S2 into the trained acoustic model to obtain a recognition result;
s6: performing subsequent operations based on the recognition result obtained in step S5 to obtain a word string capable of outputting the speech signal with the maximum probability, where the word string is the language word recognized by the original speech;
the method is characterized in that:
in step S3, the acoustic model is constructed in an end-to-end manner based on a DCNN network model and using Connectionist Temporal Classification (CTC) as the loss function.
It is further characterized in that:
the structure of the acoustic model comprises a plurality of convolutional layer structures, two full-connection layers and a CTC loss function which are sequentially arranged;
extracting voice features by adopting convolutional layers with 32 convolutional kernels in a first layer and a second layer of the convolutional layer structures; the third layer and the fourth layer adopt the convolution layers of 64 convolution kernels to extract voice features; the fifth layer begins to extract speech higher layer features for the convolutional layers of the multiple successive 128 convolutional kernels;
if the number of layers of the convolutional layers in the plurality of convolutional layer structures is an even number, starting from the first convolutional layer, every two continuous convolutional layers are followed by one pooling layer; if the number of the layers of the convolutional layers in the plurality of convolutional layer structures is an odd number, starting from the first convolutional layer, one pooling layer follows every two consecutive convolutional layers, and performing pooling operation of the pooling layer again after continuous operation of the last three convolutional layers;
the structure of the plurality of convolution layers has 8 layers or 9 layers in total;
the training formula of the acoustic model in step S4 is as follows:
Hi= Wi* X + bi ,
wherein:
i=1,……k,
Hirepresents the characteristic diagram of the ith image,
Wirepresents the weight of the ith feature map,
x represents the feature map of the previous layer input,
bia bias term representing the ith feature map;
the CTC loss function is:
wherein:
the sum of the probabilities representing how many paths the probability of outputting the label sequence is,
x represents the input of the user, x represents the input,
X = x1,x2,……xTrepresenting the input sequence, the subscripts represent the time from 1 to T,
Y = y1,y2,……ykand represents an output corresponding to X,
yi= yi 1,yi 2,……,yi krepresenting the conditional probability distribution of the ith frame of the output sequence, where i = 1,2, … … K,
and pi represents an output path of the input signal,
μ denotes the output label sequence,
a many-to-one relationship exists between pi and mu,
b represents the mapping relation from the path to the label sequence;
the sizes of convolution kernels of the convolution layers are all set to be 3 x 3;
the step length of the pooling layer is 2 multiplied by 2, and the maximum pooling operation is performed;
in step S2, a spectrogram is used as a voice feature; the frame length in the spectrogram is set to be 25ms, the frame shift is set to be 10ms, and the window function uses a Hamming window function.
In the technical scheme provided by the invention, the acoustic model is constructed on a single DCNN network model, which greatly reduces the number of model parameters; the resulting acoustic model is easier to train, the training time is shorter and the training speed higher, and because the convolutional neural network generalizes well, it fits audio data collected in a variety of scenes, giving the technical scheme a wide range of application. The invention constructs the acoustic model end-to-end with Connectionist Temporal Classification (CTC) as the loss function: no traditional pre-alignment operation is needed, training requires only an input sequence and an output sequence, the sequence prediction probability is output directly without external post-processing, the training time is reduced and the modeling process is simplified.
Drawings
FIG. 1 is a schematic diagram of a network architecture according to the present invention;
FIG. 2 is a schematic diagram of a speech feature spectrogram extraction process;
FIG. 3 is a schematic diagram of a network of the present invention including 7 convolutional layers;
FIG. 4 is a schematic diagram of a network of the present invention including 8 convolutional layers;
fig. 5 is a schematic diagram of a network including 9 convolutional layers according to the present invention.
Detailed Description
As shown in fig. 1 to 5, the technical solution of the present invention implements an acoustic model in an end-to-end manner based on a DCNN (Deep Convolutional Neural Network) network model and the CTC (Connectionist Temporal Classification) method; it comprises the following steps:
s1: inputting original voice, preprocessing an original voice signal, and performing related transformation processing;
s2: extracting key characteristic parameters reflecting the characteristics of the voice signals to form a characteristic vector sequence;
s3: constructing an acoustic model; the acoustic model is constructed in an end-to-end manner on the basis of a DCNN network model, with Connectionist Temporal Classification (CTC) as the loss function;
the structure of the acoustic model comprises a plurality of convolution layers, two full-connection layers and a CTC loss function which are sequentially arranged; the structure of a plurality of convolution layers is: the first layer and the second layer adopt 32 convolution kernels to extract voice features; the third layer and the fourth layer adopt 64 convolution kernels to extract voice features; the fifth layer extracts the voice higher layer characteristics for the multilayer continuous convolution layer with 128 convolution kernels; if the number of layers of the convolutional layers in the plurality of convolutional layer structures is even, each two continuous convolutional layers from the first convolutional layer is followed by one pooling layer; if the number of the layers of the convolutional layers in the plurality of convolutional layer structures is an odd number, starting from the first convolutional layer, each two continuous convolutional layers are followed by one pooling layer, and after the last three convolutional layers are continuously operated, performing pooling operation of the pooling layers again; the sizes of convolution kernels of the convolution layers are all set to be 3 x 3; maximum pooling operation with 2 x 2 pooling layers and step size of 2; by adopting the convolution layers of 128 multilayer convolution kernels to extract the higher-layer characteristics of the voice, the parameter number of the acoustic model can be controlled, the network is ensured not to be over-fitted, and the practicability of the acoustic model is ensured; the pooling operation by adopting the pooling layer mainly reduces the dimension of the voice characteristic diagram, reduces the number of parameters and can enhance the noise resistance of the voice;
s4: training an acoustic model to obtain a trained acoustic model;
s5: inputting the feature vector sequence to be recognized obtained in the step S2 into a trained acoustic model to obtain a recognition result;
s6: constructing a language model, and training the language model to obtain a trained language model; inputting the recognition result obtained in the step S5 into a trained language model, and performing subsequent speech decoding operation to obtain a word string capable of outputting the speech signal with the maximum probability, where the word string is a language word recognized by the original speech;
the method comprises the steps of forming a feature vector sequence by analyzing key feature parameters aiming at an input voice signal, establishing a recognition network by a trained acoustic model, a trained language model and a dictionary, and searching an optimal path in the network according to a search algorithm, wherein the path is a word string capable of outputting the voice signal with the maximum probability, and the word string is a character of recognized original voice.
In step S2, a spectrogram is used as the voice feature. The extraction process is as follows: the original voice is input; the voice is divided into frames and windowed, with a frame length of 25 ms, a frame shift of 10 ms, and a Hamming window as the window function; a fast Fourier transform then converts the voice signal from the time domain to the frequency domain, and the logarithm is taken to obtain the spectrogram. The spectrogram represents time and frequency jointly and is a visual representation of the time-frequency distribution of speech energy, effectively exploiting the correlation between the time and frequency domains; the feature vector sequence obtained from spectrogram analysis preserves the original features well, so the subsequent operations of the acoustic model on it are more accurate. Compared with other window functions, the Hamming window effectively reduces spectral leakage, making the data input to the acoustic model more accurate.
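The extraction pipeline just described (framing, Hamming window, FFT, logarithm) can be sketched with NumPy; the 16 kHz sampling rate is an assumption, since the patent does not state one:

```python
import numpy as np

def spectrogram(signal, sr=16000, frame_ms=25, shift_ms=10):
    """Log-magnitude spectrogram: framing, Hamming windowing, FFT, log."""
    frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)       # 160 samples (10 ms frame shift)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // shift
    frames = np.stack([signal[i * shift : i * shift + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))  # time domain -> frequency domain
    return np.log(spectrum + 1e-10)                 # epsilon avoids log(0)
```

One second of 16 kHz audio yields 98 frames of 201 frequency bins, which form the feature vector sequence fed to the acoustic model.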
The CTC loss function is:
p(μ | X) = Σ_{π ∈ B⁻¹(μ)} p(π | X),
wherein:
the probability of outputting the label sequence is the sum of the probabilities of all paths that map to it,
X = x_1, x_2, ……, x_T represents the input sequence, the subscripts denoting times 1 to T,
Y = y_1, y_2, ……, y_T represents the output corresponding to X,
y_i = (y_i^1, y_i^2, ……, y_i^K) represents the conditional probability distribution over the K labels at frame i of the output sequence,
π represents an output path,
μ represents the output label sequence,
a many-to-one relationship exists between π and μ, μ = B(π),
B represents the mapping from paths to label sequences.
Finally, one decoding mode of CTC is given, namely maximum path decoding:
μ* ≈ B(π*), with π* = argmax_π p(π | X),
where π* represents the path with maximum probability.
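Maximum path decoding as just described, taking the per-frame argmax path π* and applying the collapsing map B, can be sketched as follows; the blank-at-index-0 convention is an assumption:

```python
import numpy as np

def max_path_decode(probs, blank=0):
    """CTC maximum-path decoding: take the most probable label at each
    frame (the single best path pi*), then apply the mapping B by
    collapsing repeated labels and removing blanks."""
    path = np.argmax(probs, axis=1)          # pi*: best label per frame
    labels, prev = [], blank
    for p in path:
        if p != blank and p != prev:         # B: collapse repeats, drop blanks
            labels.append(int(p))
        prev = p
    return labels
```

A blank frame between two identical labels keeps them distinct in the output, which is exactly why CTC introduces the blank symbol.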
The training formula of the acoustic model in step S4 is:
H_i = W_i * X + b_i, i = 1, ……, k,
wherein:
H_i represents the i-th output feature map,
W_i represents the convolution kernel weights of the i-th feature map,
X represents the feature map input from the previous layer,
b_i represents the bias term of the i-th feature map.
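The formula H_i = W_i * X + b_i can be illustrated with a minimal NumPy sketch of one output feature map; a single input channel is assumed, and the "valid" convolution is implemented as cross-correlation, as deep learning frameworks typically do:

```python
import numpy as np

def feature_map(X, W, b):
    """Compute one output feature map H_i = W_i * X + b_i, where * is a
    2-D valid convolution (cross-correlation). Sketch for a single input
    channel; a real conv layer also sums over input channels."""
    kh, kw = W.shape
    H = np.array([[np.sum(X[r:r + kh, c:c + kw] * W) + b
                   for c in range(X.shape[1] - kw + 1)]
                  for r in range(X.shape[0] - kh + 1)])
    return H
```

For a 4 × 4 input and a 3 × 3 kernel, the valid output is 2 × 2, matching how the 3 × 3 kernels in the model shrink each feature map.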
an open source 30-hour voice data set of Qinghua university is used for training an acoustic model, the data set is divided into a training set, a verification set and a test set, the number of the linguistic data is 10000, 893 and 2495 sentences respectively, and the linguistic data are recorded in a clean and noiseless environment.
As shown in fig. 3, the network structure comprises 7 convolutional layers, with the feature vector sequence obtained from spectrogram analysis as input. The first and second layers use 32 convolution kernels each to extract voice features; the third and fourth layers use 64 kernels each; the fifth, sixth and seventh layers are 3 consecutive convolutional layers with 128 kernels each. The first and second convolutional layers are followed by a pooling layer; the third and fourth convolutional layers are followed by a pooling layer; the fifth, sixth and seventh layers are followed by a pooling layer; two consecutive fully connected layers and the CTC loss function are then connected in turn.
As shown in fig. 4, the network structure comprises 8 convolutional layers, with the feature vector sequence obtained from spectrogram analysis as input. The first and second layers use 32 convolution kernels each to extract voice features; the third and fourth layers use 64 kernels each; the fifth through eighth layers are 4 consecutive convolutional layers with 128 kernels each. Every two consecutive convolutional layers are followed by a pooling layer, after which two consecutive fully connected layers and the CTC loss function are connected in turn.
As shown in fig. 5, the network structure comprises 9 convolutional layers, with the feature vector sequence obtained from spectrogram analysis as input. The first and second layers use 32 convolution kernels each to extract voice features; the third and fourth layers use 64 kernels each; the fifth through ninth layers are 5 consecutive convolutional layers with 128 kernels each. The first and second convolutional layers are followed by a pooling layer; the third and fourth layers are followed by a pooling layer; the fifth and sixth layers, two consecutive convolutional layers with 128 kernels, are followed by a pooling layer; the seventh, eighth and ninth layers with 128 kernels are followed by a pooling layer and then connected in turn to two consecutive fully connected layers and the CTC loss function.
The network models were built with the deep learning library Keras. The network structures with 7, 8 and 9 convolutional layers, together with a traditional acoustic model built on a GMM-HMM using the Kaldi speech recognition toolkit, were tested in an experimental environment with a GTX-1070Ti graphics card (compute capability 6.1) and an I7-7700K CPU, using the 30-hour Tsinghua University voice data set. The results are shown in table 1 below:
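As a rough sanity check on model size, the convolutional-layer parameter count of the 8-layer variant can be computed directly; a single input channel for the spectrogram is an assumption, and fully connected layers are omitted:

```python
# Parameter count of the 8-convolutional-layer variant (3x3 kernels):
# each kernel has k*k*in_ch weights plus one bias per output channel.
def conv_params(in_ch, out_ch, k=3):
    return (k * k * in_ch + 1) * out_ch

# Channel progression: input -> 32, 32, 64, 64, 128, 128, 128, 128
channels = [1, 32, 32, 64, 64, 128, 128, 128, 128]
total = sum(conv_params(i, o) for i, o in zip(channels, channels[1:]))
```

Under these assumptions the eight convolutional layers hold well under a million parameters, consistent with the claim that the parameter count stays within an acceptable range.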
TABLE 1 test results
It can be seen that the error rate of the acoustic model of this technical scheme is lower than that of the traditional acoustic model built on the GMM-HMM, and within this technical scheme the error rate keeps decreasing and the fitting capacity of the network keeps improving as the network depth increases. The fitting capacity is best when the acoustic model has 9 convolutional layers, and the number of parameters is within an acceptable range with both 8 and 9 layers.
Claims (6)
1. The speech recognition method based on the convolutional neural network comprises the following steps:
s1: inputting original voice, preprocessing the original voice signal, and performing related transformation processing;
s2: extracting key characteristic parameters reflecting the characteristics of the voice signals to form a characteristic vector sequence;
s3: constructing an acoustic model;
s4: training the acoustic model to obtain a trained acoustic model;
s5: inputting the feature vector sequence to be recognized obtained in the step S2 into the trained acoustic model to obtain a recognition result;
s6: performing subsequent operations based on the recognition result obtained in step S5 to obtain a word string capable of outputting the speech signal with the maximum probability, where the word string is the language word recognized by the original speech;
the method is characterized in that:
in step S3, the acoustic model is constructed in an end-to-end manner based on a DCNN network model and using Connectionist Temporal Classification (CTC) as the loss function;
the structure of the acoustic model comprises a plurality of convolutional layers, two fully connected layers and a CTC loss function, arranged in sequence;
the first and second of the convolutional layers use 32 convolution kernels each to extract voice features; the third and fourth layers use 64 convolution kernels each; from the fifth layer onward, multiple successive convolutional layers with 128 kernels each extract higher-level speech features;
if the number of convolutional layers is even, every two consecutive convolutional layers, starting from the first, are followed by one pooling layer; if the number of convolutional layers is odd, every two consecutive convolutional layers, starting from the first, are followed by one pooling layer, and one further pooling operation is performed after the last three consecutive convolutional layers;
the plurality of convolutional layers totals 8 or 9 layers.
2. The convolutional neural network-based speech recognition method of claim 1, wherein: the training formula of the acoustic model in step S4 is as follows:
H_i = W_i * X + b_i, i = 1, ……, k,
wherein:
H_i represents the i-th output feature map,
W_i represents the convolution kernel weights of the i-th feature map,
X represents the feature map input from the previous layer,
b_i represents the bias term of the i-th feature map.
3. The convolutional neural network-based speech recognition method of claim 1, wherein: the CTC loss function is:
p(μ | X) = Σ_{π ∈ B⁻¹(μ)} p(π | X),
wherein:
the probability of outputting the label sequence is the sum of the probabilities of all paths that map to it,
X = x_1, x_2, ……, x_T represents the input sequence, the subscripts denoting times 1 to T,
Y = y_1, y_2, ……, y_T represents the output corresponding to X,
y_i = (y_i^1, y_i^2, ……, y_i^K) represents the conditional probability distribution over the K labels at frame i of the output sequence,
π represents an output path,
μ represents the output label sequence,
a many-to-one relationship exists between π and μ, μ = B(π),
B represents the mapping from paths to label sequences.
4. The convolutional neural network-based speech recognition method of claim 1, wherein: the convolution kernels of the convolutional layers are all set to 3 × 3 in size.
5. The convolutional neural network-based speech recognition method of claim 1, wherein: the pooling layer is a 2 x 2 maximum pooling operation with a step size of 2.
6. The convolutional neural network-based speech recognition method of claim 1, wherein: in step S2, a spectrogram is used as a voice feature; the frame length in the spectrogram is set to be 25ms, the frame shift is set to be 10ms, and the window function uses a Hamming window function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811112506.6A CN109272990B (en) | 2018-09-25 | 2018-09-25 | Voice recognition method based on convolutional neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109272990A CN109272990A (en) | 2019-01-25 |
CN109272990B true CN109272990B (en) | 2021-11-05 |
Family
ID=65197268
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811112506.6A Active CN109272990B (en) | 2018-09-25 | 2018-09-25 | Voice recognition method based on convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109272990B (en) |
Families Citing this family (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109861932B (en) * | 2019-02-15 | 2021-09-03 | 中国人民解放军战略支援部队信息工程大学 | Short wave Morse message automatic identification method based on intelligent image analysis |
CN111951785B (en) * | 2019-05-16 | 2024-03-15 | 武汉Tcl集团工业研究院有限公司 | Voice recognition method and device and terminal equipment |
CN110148408A (en) * | 2019-05-29 | 2019-08-20 | 上海电力学院 | A kind of Chinese speech recognition method based on depth residual error |
CN110176228A (en) * | 2019-05-29 | 2019-08-27 | 广州伟宏智能科技有限公司 | A kind of small corpus audio recognition method and system |
CN110197666B (en) * | 2019-05-30 | 2022-05-10 | 广东工业大学 | Voice recognition method and device based on neural network |
CN112102817A (en) * | 2019-06-18 | 2020-12-18 | 杭州中软安人网络通信股份有限公司 | Speech recognition system |
CN112133292A (en) * | 2019-06-25 | 2020-12-25 | 南京航空航天大学 | End-to-end automatic voice recognition method for civil aviation land-air communication field |
CN110364184B (en) * | 2019-07-15 | 2022-01-28 | 西安音乐学院 | Intonation evaluation method based on deep convolutional neural network DCNN and CTC algorithm |
CN110992941A (en) * | 2019-10-22 | 2020-04-10 | 国网天津静海供电有限公司 | Power grid dispatching voice recognition method and device based on spectrogram |
CN110956201B (en) * | 2019-11-07 | 2023-07-25 | 江南大学 | Convolutional neural network-based image distortion type classification method |
CN111009235A (en) * | 2019-11-20 | 2020-04-14 | 武汉水象电子科技有限公司 | Voice recognition method based on CLDNN + CTC acoustic model |
CN110853629A (en) * | 2019-11-21 | 2020-02-28 | 中科智云科技有限公司 | Speech recognition digital method based on deep learning |
CN110931046A (en) * | 2019-11-29 | 2020-03-27 | 福州大学 | Audio high-level semantic feature extraction method and system for overlapped sound event detection |
CN110930985B (en) * | 2019-12-05 | 2024-02-06 | 携程计算机技术(上海)有限公司 | Telephone voice recognition model, method, system, equipment and medium |
CN110930996B (en) * | 2019-12-11 | 2023-10-31 | 广州市百果园信息技术有限公司 | Model training method, voice recognition method, device, storage medium and equipment |
CN111048082B (en) * | 2019-12-12 | 2022-09-06 | 中国电子科技集团公司第二十八研究所 | Improved end-to-end speech recognition method |
CN110827801B (en) * | 2020-01-09 | 2020-04-17 | 成都无糖信息技术有限公司 | Automatic voice recognition method and system based on artificial intelligence |
CN111243578A (en) * | 2020-01-10 | 2020-06-05 | 中国科学院声学研究所 | Chinese mandarin character-voice conversion method based on self-attention mechanism |
CN111210807B (en) * | 2020-02-21 | 2023-03-31 | 厦门快商通科技股份有限公司 | Speech recognition model training method, system, mobile terminal and storage medium |
CN111246026A (en) * | 2020-03-11 | 2020-06-05 | 兰州飞天网景信息产业有限公司 | Recording processing method based on convolutional neural network and connectivity time sequence classification |
CN112068555A (en) * | 2020-08-27 | 2020-12-11 | 江南大学 | Voice control type mobile robot based on semantic SLAM method |
CN111986661B (en) * | 2020-08-28 | 2024-02-09 | 西安电子科技大学 | Deep neural network voice recognition method based on voice enhancement in complex environment |
CN112466297B (en) * | 2020-11-19 | 2022-09-30 | 重庆兆光科技股份有限公司 | Speech recognition method based on time domain convolution coding and decoding network |
CN112786019A (en) * | 2021-01-04 | 2021-05-11 | 中国人民解放军32050部队 | System and method for realizing voice transcription through image recognition mode |
CN113808581B (en) * | 2021-08-17 | 2024-03-12 | 山东大学 | Chinese voice recognition method based on acoustic and language model training and joint optimization |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106652999A (en) * | 2015-10-29 | 2017-05-10 | Samsung SDS Co., Ltd. | System and method for voice recognition
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10936862B2 (en) * | 2016-11-14 | 2021-03-02 | Kodak Alaris Inc. | System and method of character recognition using fully convolutional neural networks |
Non-Patent Citations (2)
Title |
---|
Towards end-to-end speech recognition with deep convolutional neural networks; Ying Zhang et al.; INTERSPEECH 2016; 2016-09-12; see sections 1-3, fig. 1 and fig. 3 *
Research progress and prospects of speech recognition technology; Wang Haikun et al.; Telecommunications Science; 2018-02-20 (No. 2); entire document *
Also Published As
Publication number | Publication date |
---|---|
CN109272990A (en) | 2019-01-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109272990B (en) | Voice recognition method based on convolutional neural network | |
CN109272988B (en) | Voice recognition method based on multi-path convolution neural network | |
US11062699B2 (en) | Speech recognition with trained GMM-HMM and LSTM models | |
CN108597539B (en) | Speech emotion recognition method based on parameter migration and spectrogram | |
US11030998B2 (en) | Acoustic model training method, speech recognition method, apparatus, device and medium | |
CN109410914B (en) | Method for identifying Jiangxi dialect speech and dialect point | |
WO2018227781A1 (en) | Voice recognition method, apparatus, computer device, and storage medium | |
CN109065032B (en) | External corpus speech recognition method based on deep convolutional neural network | |
CN102800316B (en) | Optimal codebook design method for voiceprint recognition system based on nerve network | |
CN110444208A (en) | A kind of speech recognition attack defense method and device based on gradient estimation and CTC algorithm | |
Muhammad et al. | Speech recognition for English to Indonesian translator using hidden Markov model | |
CN103065629A (en) | Speech recognition system of humanoid robot | |
CN111798840B (en) | Voice keyword recognition method and device | |
CN111402928B (en) | Attention-based speech emotion state evaluation method, device, medium and equipment | |
CN115019776A (en) | Voice recognition model, training method thereof, voice recognition method and device | |
CN107093422A (en) | A kind of audio recognition method and speech recognition system | |
CN114783418B (en) | End-to-end voice recognition method and system based on sparse self-attention mechanism | |
CN114999460A (en) | Lightweight Chinese speech recognition method combined with Transformer | |
Liu et al. | Hierarchical component-attention based speaker turn embedding for emotion recognition | |
Mhiri et al. | A low latency ASR-free end to end spoken language understanding system | |
CN116010874A (en) | Emotion recognition method based on deep learning multi-mode deep scale emotion feature fusion | |
Reddy et al. | Indian sign language generation from live audio or text for tamil | |
Liu et al. | Keyword retrieving in continuous speech using connectionist temporal classification | |
Al-Rababah et al. | Automatic detection technique for speech recognition based on neural networks inter-disciplinary | |
Hu et al. | Speaker Recognition Based on 3DCNN-LSTM. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||