CN109272990B - Voice recognition method based on convolutional neural network - Google Patents

Voice recognition method based on convolutional neural network

Info

Publication number
CN109272990B
CN109272990B (application CN201811112506.6A)
Authority
CN
China
Prior art keywords
layer
convolutional
layers
acoustic model
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811112506.6A
Other languages
Chinese (zh)
Other versions
CN109272990A (en)
Inventor
曹毅
张威
翟明浩
刘晨
黄子龙
李巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN201811112506.6A priority Critical patent/CN109272990B/en
Publication of CN109272990A publication Critical patent/CN109272990A/en
Application granted granted Critical
Publication of CN109272990B publication Critical patent/CN109272990B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 2015/0631: Creating reference templates; Clustering

Abstract

The invention provides a voice recognition method based on a convolutional neural network. The method is better at extracting high-level features, has a simple modeling process, is easy to train, generalizes well, and can be widely applied to various voice recognition scenarios. It includes: S1: preprocessing an input original voice signal; S2: extracting key characteristic parameters that reflect the characteristics of the voice signal to form a feature vector sequence; S3: constructing an acoustic model in an end-to-end manner based on a DCNN model, with connectionist temporal classification (CTC) as the loss function; S4: training the acoustic model to obtain a trained acoustic model; S5: inputting the feature vector sequence to be recognized obtained in step S2 into the trained acoustic model to obtain a recognition result; S6: performing subsequent operations based on the recognition result obtained in step S5 to obtain the word string that the speech signal outputs with maximum probability, where the word string is the recognized text of the original speech.

Description

Voice recognition method based on convolutional neural network
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice recognition method based on a convolutional neural network.
Background
In speech recognition technology, the GMM-HMM (Gaussian Mixture Model - Hidden Markov Model) has long played the leading role as the acoustic model. Because of the characteristics of the GMM-HMM, however, the acoustic model must be aligned before training: the data of every frame has to be aligned with its corresponding label. This alignment process is cumbersome and the training time is long. Moreover, since the model combines a GMM with an HMM, the concrete modeling process is relatively complicated to implement, which limits its use in practical speech recognition applications.
Disclosure of Invention
In order to solve the problems of long training time, complicated modeling and limited applicability in existing acoustic models, the invention provides a voice recognition method based on a convolutional neural network. The method is better at extracting high-level features, has a simple modeling process, is easy to train, generalizes well, and can be applied more widely to various voice recognition scenarios.
The technical scheme of the invention is as follows: the speech recognition method based on the convolutional neural network comprises the following steps:
s1: inputting original voice, preprocessing the original voice signal, and performing related transformation processing;
s2: extracting key characteristic parameters reflecting the characteristics of the voice signals to form a characteristic vector sequence;
s3: constructing an acoustic model;
s4: training the acoustic model to obtain a trained acoustic model;
s5: inputting the feature vector sequence to be recognized obtained in the step S2 into the trained acoustic model to obtain a recognition result;
s6: performing subsequent operations based on the recognition result obtained in step S5 to obtain the word string that the speech signal outputs with maximum probability, where the word string is the recognized text of the original speech;
the method is characterized in that:
in step S3, the acoustic model is constructed in an end-to-end manner based on a DCNN network model, with connectionist temporal classification (CTC) as the loss function.
It is further characterized in that:
the structure of the acoustic model comprises a plurality of convolutional layers, two fully connected layers and a CTC loss function, arranged in sequence;
the first and second convolutional layers use 32 convolution kernels each to extract voice features; the third and fourth convolutional layers use 64 kernels each; from the fifth layer onward, several consecutive convolutional layers with 128 kernels each extract higher-level voice features;
if the number of convolutional layers is even, every two consecutive convolutional layers, starting from the first, are followed by one pooling layer; if the number of convolutional layers is odd, every two consecutive convolutional layers are followed by one pooling layer and the last three consecutive convolutional layers are followed by a single pooling operation;
the plurality of convolutional layers comprises 8 or 9 layers in total;
the training formula of the acoustic model in step S4 is as follows:
H_i = W_i * X + b_i

wherein:

i = 1, ..., k,

H_i represents the i-th output feature map,

W_i represents the weight of the i-th feature map,

X represents the feature map input from the previous layer,

b_i represents the bias term of the i-th feature map;
the CTC loss function is:
L(μ, x) = -ln p(μ|x) = -ln Σ_{π ∈ B^-1(μ)} p(π|x)

wherein:

y_k^t = exp(a_k^t) / Σ_{k'} exp(a_{k'}^t)

is the softmax function applied to the network output a^t at frame t,

p(π|x) = ∏_{t=1}^{T} y_{π(t)}^t

represents the probability of the output path π given the input x,

p(μ|x) = Σ_{π ∈ B^-1(μ)} p(π|x)

represents the probability of the output label sequence μ, that is, the sum of the probabilities of all paths that map to it,

x represents the input,

X = x_1, x_2, ..., x_T represents the input sequence, the subscripts denoting frames 1 to T,

Y = y^1, y^2, ..., y^T represents the output corresponding to X,

y^i = y_1^i, y_2^i, ..., y_K^i represents the conditional probability distribution over the K output labels at the i-th frame of the output sequence,

π represents an output path,

μ represents the output label sequence,

a many-to-one relationship exists between π and μ,

B represents the mapping from paths to label sequences;
the convolution kernels of the convolutional layers are all set to 3×3 in size;
the pooling layers perform a 2×2 maximum pooling operation with a stride of 2;
in step S2, a spectrogram is used as the voice feature; the frame length of the spectrogram is set to 25 ms, the frame shift to 10 ms, and a Hamming window is used as the window function.
In the technical scheme provided by the invention, the acoustic model is constructed on the basis of a DCNN model. Because it rests on a single network model, the number of parameters is greatly reduced, the constructed acoustic model is easier to train, the training time is shorter and the training speed is higher; and because a convolutional neural network generalizes well, the model fits audio data collected in a wide variety of scenarios, giving the technical scheme a broad range of application. The invention constructs the acoustic model in an end-to-end manner with connectionist temporal classification (CTC) as the loss function, so the traditional prior alignment operation is not needed: training requires only an input sequence and an output sequence, the probability of the predicted sequence is output directly, and no external post-processing is required, which reduces the training time and simplifies the modeling process.
Drawings
FIG. 1 is a schematic diagram of a network architecture according to the present invention;
FIG. 2 is a schematic diagram of a speech feature spectrogram extraction process;
FIG. 3 is a schematic diagram of a network of the present invention including 7 convolutional layers;
FIG. 4 is a schematic diagram of a network of the present invention including 8 convolutional layers;
fig. 5 is a schematic diagram of a network including 9 convolutional layers according to the present invention.
Detailed Description
As shown in figs. 1 to 5, the technical solution of the invention implements the acoustic model in an end-to-end manner based on a DCNN (Deep Convolutional Neural Network) model and the CTC (Connectionist Temporal Classification) method; it comprises the following steps:
s1: inputting original voice, preprocessing an original voice signal, and performing related transformation processing;
s2: extracting key characteristic parameters reflecting the characteristics of the voice signals to form a characteristic vector sequence;
s3: constructing the acoustic model; the acoustic model is constructed in an end-to-end manner based on a DCNN network model, with connectionist temporal classification (CTC) as the loss function;
the structure of the acoustic model comprises a plurality of convolutional layers, two fully connected layers and a CTC loss function, arranged in sequence. The convolutional layers are arranged as follows: the first and second layers use 32 convolution kernels each to extract voice features; the third and fourth layers use 64 kernels each; from the fifth layer onward, several consecutive layers with 128 kernels each extract higher-level voice features. If the number of convolutional layers is even, every two consecutive convolutional layers, starting from the first, are followed by one pooling layer; if the number is odd, every two consecutive convolutional layers are followed by one pooling layer and the last three consecutive convolutional layers are followed by a single pooling operation. The convolution kernels are all 3×3 in size, and the pooling layers perform 2×2 max pooling with a stride of 2. Using multiple 128-kernel convolutional layers to extract the higher-level voice features keeps the number of parameters of the acoustic model under control, prevents the network from overfitting and keeps the acoustic model practical; the pooling operation mainly reduces the dimensionality of the voice feature maps and the number of parameters, and also improves robustness to noise (a minimal code sketch of this structure is given after step S6 below);
s4: training an acoustic model to obtain a trained acoustic model;
s5: inputting the feature vector sequence to be recognized obtained in the step S2 into a trained acoustic model to obtain a recognition result;
s6: constructing a language model and training it to obtain a trained language model; inputting the recognition result obtained in step S5 into the trained language model and performing the subsequent speech decoding operation to obtain the word string that the speech signal outputs with maximum probability, where the word string is the recognized text of the original speech;
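As referenced in step S3 above, the following is a minimal Keras sketch of the 8-convolutional-layer variant of the acoustic model. The input shape, the number of units in the fully connected layers and all names are illustrative assumptions; the patent fixes only the kernel counts, the 3×3 kernels, the 2×2 pooling and the overall layer order.

# Minimal sketch (assumptions noted above); not the patent's reference implementation.
from tensorflow.keras import layers, models

def build_dcnn_8(num_labels, input_shape=(200, 161, 1)):
    x_in = layers.Input(shape=input_shape, name="spectrogram")
    x = x_in
    # First and second layers: 32 kernels each, then 2x2 max pooling (stride 2).
    for filters in (32, 32):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    # Third and fourth layers: 64 kernels each, then pooling.
    for filters in (64, 64):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    # Fifth to eighth layers: 128 kernels each, pooling after every two layers.
    for _ in range(2):
        for _ in range(2):
            x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    # Collapse the frequency axis, keep the (downsampled) time axis for CTC.
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    # Two fully connected layers, then per-frame label posteriors.
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(256, activation="relu")(x)
    y = layers.Dense(num_labels + 1, activation="softmax")(x)  # +1 for the CTC blank
    return models.Model(x_in, y)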
the method comprises the steps of forming a feature vector sequence by analyzing key feature parameters aiming at an input voice signal, establishing a recognition network by a trained acoustic model, a trained language model and a dictionary, and searching an optimal path in the network according to a search algorithm, wherein the path is a word string capable of outputting the voice signal with the maximum probability, and the word string is a character of recognized original voice.
In step S2, a spectrogram is used as the voice feature. The extraction process is as follows: the original voice is input; the signal is divided into frames and windowed, with a frame length of 25 ms, a frame shift of 10 ms and a Hamming window as the window function; a fast Fourier transform then converts the voice signal from the time domain to the frequency domain; and the logarithm is taken to obtain the spectrogram. The spectrogram combines the time domain and the frequency domain and is a visual representation of the time-frequency distribution of the speech energy; it effectively exploits the correlation between the two domains, so the feature vector sequence obtained from spectrogram analysis preserves the original characteristics well, and feeding it into the acoustic model makes the subsequent operations more accurate. Compared with other window functions, the Hamming window effectively reduces spectral leakage, so the data input to the acoustic model is more accurate.
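A minimal sketch of the spectrogram extraction just described (25 ms frames, 10 ms shift, Hamming window, FFT, logarithm) follows; the sampling rate and the small constant added before the logarithm are assumptions not stated in the patent.

# Minimal sketch; the 16 kHz sampling rate and the 1e-10 floor are assumptions.
import numpy as np
from scipy.signal import stft

def log_spectrogram(waveform, sample_rate=16000):
    frame_len = int(0.025 * sample_rate)    # 25 ms -> 400 samples at 16 kHz
    frame_shift = int(0.010 * sample_rate)  # 10 ms -> 160 samples
    _, _, spec = stft(waveform, fs=sample_rate, window="hamming",
                      nperseg=frame_len, noverlap=frame_len - frame_shift)
    return np.log(np.abs(spec).T + 1e-10)   # shape: (frames, frequency bins)

# Hypothetical usage on one second of silence:
features = log_spectrogram(np.zeros(16000))
print(features.shape)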
The CTC loss function is:
L(μ, x) = -ln p(μ|x) = -ln Σ_{π ∈ B^-1(μ)} p(π|x)

wherein:

y_k^t = exp(a_k^t) / Σ_{k'} exp(a_{k'}^t)

is the softmax function applied to the network output a^t at frame t,

p(π|x) = ∏_{t=1}^{T} y_{π(t)}^t

represents the probability of the output path π given the input x,

p(μ|x) = Σ_{π ∈ B^-1(μ)} p(π|x)

represents the probability of the output label sequence μ, that is, the sum of the probabilities of all paths that map to it,

x represents the input,

X = x_1, x_2, ..., x_T represents the input sequence, the subscripts denoting frames 1 to T,

Y = y^1, y^2, ..., y^T represents the output corresponding to X,

y^i = y_1^i, y_2^i, ..., y_K^i represents the conditional probability distribution over the K output labels at the i-th frame of the output sequence,

π represents an output path,

μ represents the output label sequence,

a many-to-one relationship exists between π and μ, μ = B(π),

B represents the mapping from paths to label sequences.

Finally, one decoding mode of CTC, maximum path decoding, is given:

π* = argmax_π p(π|x), μ* = B(π*),

where π* represents the path that yields the maximum probability.
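A minimal NumPy sketch of the quantities defined above follows: p(π|x) as a product of per-frame softmax outputs, and maximum path decoding followed by the many-to-one mapping B (collapse repeats, then remove blanks). The tiny posterior matrix and the blank index are invented purely for illustration.

# Illustrative sketch; the posterior values and BLANK index are assumptions.
import numpy as np

BLANK = 0  # index of the CTC blank label (any fixed index works)

def path_probability(posteriors, path):
    # p(pi|x) = product over frames t of y^t_{pi(t)} for one alignment path.
    return float(np.prod([posteriors[t, k] for t, k in enumerate(path)]))

def max_path_decode(posteriors):
    # pi* = argmax_pi p(pi|x), taken frame by frame, then mu* = B(pi*).
    best_path = posteriors.argmax(axis=1)
    labels, prev = [], None
    for k in best_path:                     # collapse repeats, drop blanks
        if k != prev and k != BLANK:
            labels.append(int(k))
        prev = k
    return labels

posteriors = np.array([[0.1, 0.8, 0.1],     # frame 1: label 1 most likely
                       [0.6, 0.3, 0.1],     # frame 2: blank most likely
                       [0.2, 0.1, 0.7]])    # frame 3: label 2 most likely
print(max_path_decode(posteriors))          # -> [1, 2]
print(path_probability(posteriors, [1, 0, 2]))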
The training formula of the acoustic model in step S4 is as follows:
H_i = W_i * X + b_i

wherein:

i = 1, ..., k,

H_i represents the i-th output feature map,

W_i represents the weight of the i-th feature map,

X represents the feature map input from the previous layer,

b_i represents the bias term of the i-th feature map.
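A minimal NumPy/SciPy sketch of the training formula H_i = W_i * X + b_i follows: each output feature map H_i is obtained by convolving the previous layer's feature map X with its own 3×3 kernel W_i and adding the bias b_i. The kernel values and the input size are invented for illustration.

# Illustrative sketch; kernels, biases and the 8x8 input are invented values.
import numpy as np
from scipy.signal import correlate2d

def conv_feature_maps(X, kernels, biases):
    # X: (H, W) input feature map; kernels: (k, 3, 3); biases: (k,)
    return np.stack([correlate2d(X, W_i, mode="same") + b_i
                     for W_i, b_i in zip(kernels, biases)])

X = np.random.rand(8, 8)
kernels = np.random.rand(4, 3, 3)   # k = 4 output feature maps
biases = np.zeros(4)
H = conv_feature_maps(X, kernels, biases)
print(H.shape)  # (4, 8, 8)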
The acoustic model is trained on the open-source 30-hour speech data set of Tsinghua University. The data set is divided into a training set, a validation set and a test set containing 10000, 893 and 2495 sentences respectively, all recorded in a clean, noise-free environment.
As shown in fig. 3, this network structure contains 7 convolutional layers and takes as input the feature vector sequence obtained from spectrogram analysis. The first and second layers use 32 convolution kernels each to extract voice features, and the third and fourth layers use 64 kernels each; the fifth, sixth and seventh layers are 3 consecutive convolutional layers with 128 kernels each. The first and second convolutional layers are followed by a pooling layer, the third and fourth convolutional layers are followed by a pooling layer, and the fifth, sixth and seventh convolutional layers are followed by a pooling layer; two consecutive fully connected layers and the CTC loss function are then connected in turn.
As shown in fig. 4, this network structure contains 8 convolutional layers and takes as input the feature vector sequence obtained from spectrogram analysis. The first and second layers use 32 convolution kernels each to extract voice features, and the third and fourth layers use 64 kernels each; the fifth to eighth layers are 4 consecutive convolutional layers with 128 kernels each. Every two consecutive convolutional layers are followed by a pooling layer; the last pooling layer is then connected in turn to two consecutive fully connected layers and the CTC loss function.
As shown in fig. 5, this network structure contains 9 convolutional layers and takes as input the feature vector sequence obtained from spectrogram analysis. The first and second layers use 32 convolution kernels each to extract voice features, and the third and fourth layers use 64 kernels each; the fifth to ninth layers are 5 consecutive convolutional layers with 128 kernels each. The first and second convolutional layers are followed by a pooling layer, the third and fourth by a pooling layer, and the fifth and sixth (two consecutive 128-kernel convolutional layers) by a pooling layer; the seventh, eighth and ninth layers (three consecutive 128-kernel convolutional layers) are followed by a final pooling layer, which is then connected in turn to two consecutive fully connected layers and the CTC loss function.
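The pooling-placement rule shared by the three figures (a pooling layer after every two consecutive convolutional layers, with the last three convolutional layers sharing a single pooling layer when the total is odd) can be sketched as a small helper; the textual layer descriptions it prints are illustrative only.

# Illustrative helper; it only prints a layer plan, not a framework model.
def conv_pool_plan(filters_per_layer):
    n = len(filters_per_layer)
    plan, i = [], 0
    while i < n:
        # Take the last three conv layers together when the total is odd.
        group = 3 if (n % 2 == 1 and n - i == 3) else 2
        for f in filters_per_layer[i:i + group]:
            plan.append(f"conv 3x3, {f} kernels")
        plan.append("maxpool 2x2, stride 2")
        i += group
    return plan

# The 9-convolutional-layer variant of fig. 5:
for step in conv_pool_plan([32, 32, 64, 64, 128, 128, 128, 128, 128]):
    print(step)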
The network models were built with the deep learning library Keras. In an experimental environment with a GTX-1070Ti graphics card (compute capability 6.1) and an Intel i7-7700K CPU, the 7-, 8- and 9-convolutional-layer network structures were tested against a traditional acoustic model built on the GMM-HMM with the Kaldi speech recognition toolkit, using the 30-hour speech data set of Tsinghua University; the results are shown in Table 1:
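A minimal sketch of attaching the CTC loss to such a Keras model with keras.backend.ctc_batch_cost follows, in the spirit of the experiments described above. The maximum label length, the optimizer and its learning rate are assumptions, and `base` is assumed to be a model that outputs per-frame label posteriors (for example the build_dcnn_8 sketch given earlier).

# Minimal sketch under the assumptions stated above.
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras import backend as K

def add_ctc(base, max_label_len=32):
    labels = layers.Input(shape=(max_label_len,), name="labels")
    input_len = layers.Input(shape=(1,), name="input_length")
    label_len = layers.Input(shape=(1,), name="label_length")
    # ctc_batch_cost(y_true, y_pred, input_length, label_length)
    ctc = layers.Lambda(
        lambda args: K.ctc_batch_cost(args[0], args[1], args[2], args[3]),
        name="ctc_loss")([labels, base.output, input_len, label_len])
    model = models.Model([base.input, labels, input_len, label_len], ctc)
    # The Lambda layer already computes the loss, so the compiled loss just
    # passes it through.
    model.compile(optimizer=optimizers.Adam(1e-4),
                  loss=lambda y_true, y_pred: y_pred)
    return model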
Table 1: Test results (the table is provided as an image in the original publication and is not reproduced here)
As the results show, the error rate of the acoustic model of the present technical scheme is lower than that of the traditional acoustic model built on the GMM-HMM; within the present scheme, the error rate keeps decreasing and the fitting capability of the network keeps improving as the network depth increases. The fitting capability is best when the acoustic model has 9 convolutional layers, and with 8 or 9 layers the number of parameters remains within an acceptable range.

Claims (6)

1. The speech recognition method based on the convolutional neural network comprises the following steps:
s1: inputting original voice, preprocessing the original voice signal, and performing related transformation processing;
s2: extracting key characteristic parameters reflecting the characteristics of the voice signals to form a characteristic vector sequence;
s3: constructing an acoustic model;
s4: training the acoustic model to obtain a trained acoustic model;
s5: inputting the feature vector sequence to be recognized obtained in the step S2 into the trained acoustic model to obtain a recognition result;
s6: performing subsequent operations based on the recognition result obtained in step S5 to obtain the word string that the speech signal outputs with maximum probability, where the word string is the recognized text of the original speech;
the method is characterized in that:
in step S3, constructing the acoustic model in an end-to-end manner based on a DCNN network model, with connectionist temporal classification (CTC) as the loss function;
the structure of the acoustic model comprises a plurality of convolutional layers, two fully connected layers and a CTC loss function, arranged in sequence;
the first and second convolutional layers use 32 convolution kernels each to extract voice features; the third and fourth convolutional layers use 64 kernels each; from the fifth layer onward, several consecutive convolutional layers with 128 kernels each extract higher-level voice features;
if the number of convolutional layers is even, every two consecutive convolutional layers, starting from the first, are followed by one pooling layer; if the number of convolutional layers is odd, every two consecutive convolutional layers are followed by one pooling layer and the last three consecutive convolutional layers are followed by a single pooling operation;
the plurality of convolutional layers comprises 8 or 9 layers in total.
2. The convolutional neural network-based speech recognition method of claim 1, wherein: the training formula of the acoustic model in step S4 is as follows:
H_i = W_i * X + b_i

wherein:

i = 1, ..., k,

H_i represents the i-th output feature map,

W_i represents the weight of the i-th feature map,

X represents the feature map input from the previous layer,

b_i represents the bias term of the i-th feature map.
3. The convolutional neural network-based speech recognition method of claim 1, wherein: the CTC loss function is:
L(μ, x) = -ln p(μ|x) = -ln Σ_{π ∈ B^-1(μ)} p(π|x)

wherein:

y_k^t = exp(a_k^t) / Σ_{k'} exp(a_{k'}^t)

is the softmax function applied to the network output a^t at frame t,

p(π|x) = ∏_{t=1}^{T} y_{π(t)}^t

represents the probability of the output path π given the input x,

p(μ|x) = Σ_{π ∈ B^-1(μ)} p(π|x)

represents the probability of the output label sequence μ, that is, the sum of the probabilities of all paths that map to it,

x represents the input,

X = x_1, x_2, ..., x_T represents the input sequence, the subscripts denoting frames 1 to T,

Y = y^1, y^2, ..., y^T represents the output corresponding to X,

y^i = y_1^i, y_2^i, ..., y_K^i represents the conditional probability distribution over the K output labels at the i-th frame of the output sequence,

π represents an output path,

μ represents the output label sequence,

a many-to-one relationship exists between π and μ,

and B represents the mapping from paths to label sequences.
4. The convolutional neural network-based speech recognition method of claim 1, wherein: the convolution kernels of the convolutional layers are all set to 3 × 3 in size.
5. The convolutional neural network-based speech recognition method of claim 1, wherein: the pooling layers perform a 2×2 maximum pooling operation with a stride of 2.
6. The convolutional neural network-based speech recognition method of claim 1, wherein: in step S2, a spectrogram is used as a voice feature; the frame length in the spectrogram is set to be 25ms, the frame shift is set to be 10ms, and the window function uses a Hamming window function.
CN201811112506.6A 2018-09-25 2018-09-25 Voice recognition method based on convolutional neural network Active CN109272990B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811112506.6A CN109272990B (en) 2018-09-25 2018-09-25 Voice recognition method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811112506.6A CN109272990B (en) 2018-09-25 2018-09-25 Voice recognition method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN109272990A CN109272990A (en) 2019-01-25
CN109272990B true CN109272990B (en) 2021-11-05

Family

ID=65197268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811112506.6A Active CN109272990B (en) 2018-09-25 2018-09-25 Voice recognition method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN109272990B (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109861932B (en) * 2019-02-15 2021-09-03 中国人民解放军战略支援部队信息工程大学 Short wave Morse message automatic identification method based on intelligent image analysis
CN111951785B (en) * 2019-05-16 2024-03-15 武汉Tcl集团工业研究院有限公司 Voice recognition method and device and terminal equipment
CN110148408A (en) * 2019-05-29 2019-08-20 上海电力学院 A kind of Chinese speech recognition method based on depth residual error
CN110176228A (en) * 2019-05-29 2019-08-27 广州伟宏智能科技有限公司 A kind of small corpus audio recognition method and system
CN110197666B (en) * 2019-05-30 2022-05-10 广东工业大学 Voice recognition method and device based on neural network
CN112102817A (en) * 2019-06-18 2020-12-18 杭州中软安人网络通信股份有限公司 Speech recognition system
CN112133292A (en) * 2019-06-25 2020-12-25 南京航空航天大学 End-to-end automatic voice recognition method for civil aviation land-air communication field
CN110364184B (en) * 2019-07-15 2022-01-28 西安音乐学院 Intonation evaluation method based on deep convolutional neural network DCNN and CTC algorithm
CN110992941A (en) * 2019-10-22 2020-04-10 国网天津静海供电有限公司 Power grid dispatching voice recognition method and device based on spectrogram
CN110956201B (en) * 2019-11-07 2023-07-25 江南大学 Convolutional neural network-based image distortion type classification method
CN111009235A (en) * 2019-11-20 2020-04-14 武汉水象电子科技有限公司 Voice recognition method based on CLDNN + CTC acoustic model
CN110853629A (en) * 2019-11-21 2020-02-28 中科智云科技有限公司 Speech recognition digital method based on deep learning
CN110931046A (en) * 2019-11-29 2020-03-27 福州大学 Audio high-level semantic feature extraction method and system for overlapped sound event detection
CN110930985B (en) * 2019-12-05 2024-02-06 携程计算机技术(上海)有限公司 Telephone voice recognition model, method, system, equipment and medium
CN110930996B (en) * 2019-12-11 2023-10-31 广州市百果园信息技术有限公司 Model training method, voice recognition method, device, storage medium and equipment
CN111048082B (en) * 2019-12-12 2022-09-06 中国电子科技集团公司第二十八研究所 Improved end-to-end speech recognition method
CN110827801B (en) * 2020-01-09 2020-04-17 成都无糖信息技术有限公司 Automatic voice recognition method and system based on artificial intelligence
CN111243578A (en) * 2020-01-10 2020-06-05 中国科学院声学研究所 Chinese mandarin character-voice conversion method based on self-attention mechanism
CN111210807B (en) * 2020-02-21 2023-03-31 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN111246026A (en) * 2020-03-11 2020-06-05 兰州飞天网景信息产业有限公司 Recording processing method based on convolutional neural network and connectivity time sequence classification
CN112068555A (en) * 2020-08-27 2020-12-11 江南大学 Voice control type mobile robot based on semantic SLAM method
CN111986661B (en) * 2020-08-28 2024-02-09 西安电子科技大学 Deep neural network voice recognition method based on voice enhancement in complex environment
CN112466297B (en) * 2020-11-19 2022-09-30 重庆兆光科技股份有限公司 Speech recognition method based on time domain convolution coding and decoding network
CN112786019A (en) * 2021-01-04 2021-05-11 中国人民解放军32050部队 System and method for realizing voice transcription through image recognition mode
CN113808581B (en) * 2021-08-17 2024-03-12 山东大学 Chinese voice recognition method based on acoustic and language model training and joint optimization

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106652999A (en) * 2015-10-29 2017-05-10 三星Sds株式会社 System and method for voice recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10936862B2 (en) * 2016-11-14 2021-03-02 Kodak Alaris Inc. System and method of character recognition using fully convolutional neural networks

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106652999A (en) * 2015-10-29 2017-05-10 三星Sds株式会社 System and method for voice recognition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Towards end-to-end speech recognition with deep convolutional neural networks; Ying Zhang et al.; INTERSPEECH 2016; 2016-09-12; see Sections 1-3, Fig. 1 and Fig. 3 *
Research progress and prospects of speech recognition technology; Wang Haikun et al.; Telecommunications Science; 2018-02-20 (No. 2); full text *

Also Published As

Publication number Publication date
CN109272990A (en) 2019-01-25

Similar Documents

Publication Publication Date Title
CN109272990B (en) Voice recognition method based on convolutional neural network
CN109272988B (en) Voice recognition method based on multi-path convolution neural network
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
CN108597539B (en) Speech emotion recognition method based on parameter migration and spectrogram
US11030998B2 (en) Acoustic model training method, speech recognition method, apparatus, device and medium
CN109410914B (en) Method for identifying Jiangxi dialect speech and dialect point
WO2018227781A1 (en) Voice recognition method, apparatus, computer device, and storage medium
CN109065032B (en) External corpus speech recognition method based on deep convolutional neural network
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on nerve network
CN110444208A (en) A kind of speech recognition attack defense method and device based on gradient estimation and CTC algorithm
Muhammad et al. Speech recognition for English to Indonesian translator using hidden Markov model
CN103065629A (en) Speech recognition system of humanoid robot
CN111798840B (en) Voice keyword recognition method and device
CN111402928B (en) Attention-based speech emotion state evaluation method, device, medium and equipment
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
CN107093422A (en) A kind of audio recognition method and speech recognition system
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN114999460A (en) Lightweight Chinese speech recognition method combined with Transformer
Liu et al. Hierarchical component-attention based speaker turn embedding for emotion recognition
Mhiri et al. A low latency ASR-free end to end spoken language understanding system
CN116010874A (en) Emotion recognition method based on deep learning multi-mode deep scale emotion feature fusion
Reddy et al. Indian sign language generation from live audio or text for tamil
Liu et al. Keyword retrieving in continuous speech using connectionist temporal classification
Al-Rababah et al. Automatic detection technique for speech recognition based on neural networks inter-disciplinary
Hu et al. Speaker Recognition Based on 3DCNN-LSTM.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant