CN115862596A - Deep learning-based spoken English speech recognition method - Google Patents
- Publication number
- CN115862596A (application CN202310194346.9A)
- Authority
- CN
- China
- Prior art keywords
- model
- english
- audio
- training
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention provides a deep-learning-based spoken English speech recognition method, belonging to the technical field of speech recognition. The method comprises the following steps: designing a Transformer-based English speech recognition model; adjusting the model structure and optimizing the parameters of the model on a training data set; and performing speech recognition on spoken English audio files collected in real time with the adjusted model to generate recognized audio text. By adopting an end-to-end Transformer-based speech recognition method and constructing a training data set of examinees' spoken English, the invention optimizes the whole network model as a unit during training, ensures a global optimum, and effectively improves recognition accuracy.
Description
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a deep-learning-based spoken English speech recognition method.
Background
In recent years, with the rapid development of pattern recognition and artificial intelligence and the deepening application of technologies such as machine learning, especially deep learning, the research and application of speech recognition technology have broadened considerably. Speech recognition technology uses a computer to transcribe a speech signal into corresponding text or commands, which is essentially a pattern recognition process.
The model most commonly used in the early days of speech recognition was the GMM-HMM, but its modeling capability is limited and it cannot fully and accurately represent speech features and structure; with the development of deep learning, more neural-network-based speech recognition models have appeared. Replacing the GMM with a DNN to model the observation-state probabilities improves recognition accuracy, but the DNN-HMM model is difficult to train: every frame of speech in the training data must be labeled, and manual labeling is laborious. In addition, LSTM-RNNs are widely used in acoustic models because they can capture the context dependencies of sequence data, but an RNN needs the output of the previous time step as input at each step, so it can only be computed serially and is slow; moreover, RNNs are prone to vanishing gradients during training, converge slowly, and require more computational resources.
At present, when existing neural-network-based speech recognition models are applied to speech recognition systems for oral English examinations, they can meet the basic demand for automatic speech recognition, but they suffer from complicated training procedures, difficult data annotation, and heavy consumption of computational resources. As a result, the speech conversion pipelines of existing systems are very complex, and the accuracy of transcribing examinees' spoken English in oral examinations still needs to be improved.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide a deep-learning-based spoken English speech recognition method that adopts an end-to-end Transformer-based approach and constructs a training data set of examinees' spoken English, so that the whole network model is optimized as a unit during training, a global optimum is ensured, and recognition accuracy is effectively improved.
In order to achieve the purpose, the invention is realized by the following technical scheme:
a spoken English speech recognition method based on deep learning comprises the following steps:
s1: designing an English voice recognition model based on a Transformer;
s2: performing model structure adjustment and parameter optimization of the English voice recognition model through a training data set;
s3: and performing voice recognition on the English spoken language audio file collected in real time by using the adjusted English voice recognition model, and generating an audio recognition text.
Further, step S1 includes:
and constructing a position information embedding module, a multi-head self-attention module, a feed-forward neural network and a cross-attention module, and combining them into a Transformer-based English speech recognition model.
Further, step S2 specifically includes the following steps:
s21: acquiring audio paired with text and manually annotated audio data for the oral English examination, and constructing a data set;
s22: preprocessing a data set, and dividing the data set into a training set and a test set according to a preset distribution proportion;
s23: carrying out logarithmic Mel spectrum feature extraction on the data set and performing feature normalization, globally scaling the input features to between -1 and 1;
s24: setting optimizer parameters and hyper-parameters for model training;
s25: and performing Model training on the English voice recognition Model by using a training set to obtain a pre-training Model, and performing fine adjustment on the pre-training Model by using a test set to obtain a final Model.
Further, step S3 specifically includes the following steps:
s31: deploying the Model obtained by training;
s32: collecting an English spoken language audio file, and suppressing noise of the English spoken language audio file by adopting an LMS adaptive filter noise reduction method to obtain a preprocessed audio sequence x;
s33: resampling the preprocessed audio sequence x to 16000 Hz and cutting it into 30-second segments to form a batch of audio segments X;
s34: carrying out logarithmic Mel spectrum feature extraction on the batch audio fragments X, and carrying out normalization processing on the features;
s35: inputting the normalized feature sequence F of the batch of audio clips X into a Model for prediction to obtain probability distribution P of the recognition text;
s36: and performing table look-up on the probability distribution P of the recognition text to obtain the audio recognition text.
Further, step S21 specifically includes:
collecting 11,000 hours of audio paired with text and 1,500 hours of manually annotated audio data for the oral English examination, and constructing this audio data into a data set.
Further, step S22 specifically includes:
resampling all audio data in the data set to 16000Hz, dividing the resampled audio data into segments with the duration of 30 seconds and identifying by utilizing a preset label to finish the preprocessing of the data set;
the preprocessed data set is divided into a training set and a testing set according to the proportion of 8:2.
Further, step S24 specifically includes:
setting the parameters of the Adam optimizer used for model training, with the learning rate following the formula lrate = d^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)), wherein d is 512 and warmup_steps is 5000;
setting the feedforward neural network as a 6-layer perceptron, and setting the number of heads of the multi-head self-attention module to 6;
forming a residual error attention module by using a multi-head self-attention module and a feedforward neural network;
the encoder and the decoder are each built from 5 residual attention modules, and the parameter dropout rate is set to 0.1.
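The warmup schedule in step S24, with d = 512 and warmup_steps = 5000, matches the standard Transformer learning-rate schedule; assuming that is the intended formula, a minimal sketch:

```python
def lrate(step: int, d: int = 512, warmup_steps: int = 5000) -> float:
    """Warmup schedule: linear increase for warmup_steps, then
    inverse-square-root decay, scaled by d^-0.5."""
    step = max(step, 1)                    # guard against step 0
    return d ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = lrate(5000)    # the maximum learning rate, reached at the end of warmup
```

With these constants the learning rate rises linearly to about 6.25e-4 at step 5000 and decays as 1/sqrt(step) afterwards.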
Further, step S25 specifically includes:
carrying out 300000 steps of iteration on the English speech recognition model by using a training set to obtain a pre-training model;
and fine-tuning the pre-training Model with the manually annotated audio data in the test set to obtain the final Model.
Further, step S1 further includes:
forming a characteristic vector with a time sequence on the word vector by a position information embedding method;
the position information characteristic is expressed using sine and cosine functions, and the position information embedding formula is: PE(p, 2i) = sin(p / 10000^(2i/d)), PE(p, 2i+1) = cos(p / 10000^(2i/d)), where p represents the position, i the dimension, and d is 512.
Further, step S31 includes:
adopting the ONNX framework to deploy the Model, and encrypting the Model for deployment so that it can run on various devices;
the adopted encryption algorithm combines each element of the model with a key, where Model_encrypt is the encrypted model, Model is the trained model, i denotes the i-th element of the model, key is a generated random string, and keyLength is the length of the key.
Compared with the prior art, the invention has the following beneficial effects. The invention provides a deep-learning-based spoken English speech recognition method: first, a Transformer-based English speech recognition model is designed; then the model structure and parameters are tuned on a training set. The data set combines audio paired with text from the Internet with manually annotated audio data for an oral English examination: the diversity of the Internet audio data helps the robustness of the trained model, which is then fine-tuned on the manually annotated data. Finally, the trained optimal model is deployed and applied to actual data, realizing spoken English speech recognition.
The invention adopts an improved end-to-end Transformer-based speech recognition method and constructs a training data set of examinees' spoken English, so that the whole network model is optimized as a unit during training, ensuring a global optimum and effectively improving recognition accuracy. Because the network adopts a relatively uniform structure, it can be deployed on low-latency, high-precision equipment with high computational efficiency. After deployment, speech features are input and English words are output, simplifying the speech recognition process.
Therefore, compared with the prior art, the invention has prominent substantive features and remarkable progress, and the beneficial effects of the implementation are also obvious.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a process flow diagram of an embodiment of the present invention.
FIG. 2 is a schematic diagram of a model structure according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a residual attention module according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings.
Fig. 1 shows a deep-learning-based spoken English speech recognition method, which includes the following steps:
s1: and designing an English voice recognition model based on a Transformer.
Specifically, a position information embedding module, a multi-head self-attention module, a feedforward neural network and a cross-attention module are constructed and combined into an English speech recognition model based on a Transformer.
As shown in fig. 2, the overall model is composed of a position information embedding part, a multi-head self-attention module, a feed-forward neural network and a cross-attention module. It addresses audio recognition for the oral English examination: without separate language modeling, a position information embedding method forms time-ordered feature vectors on the word vectors, and the position information is integrated with the attention mechanism to realize the overall audio recognition process.
In constructing the model, the timing information of the audio sequence is taken into account, so a time-ordered feature vector is formed on the word vectors by position information embedding, with the position information expressed by sine and cosine functions. The position information embedding formula is: PE(p, 2i) = sin(p / 10000^(2i/d)), PE(p, 2i+1) = cos(p / 10000^(2i/d)), where p represents the position, i the dimension, and d is 512.
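The sine-cosine position embedding described above, with position p, dimension index i and d = 512, can be sketched as follows; this assumes the standard Transformer formulation, with sines on even dimensions and cosines on odd ones:

```python
import numpy as np

def positional_encoding(num_positions: int, d: int = 512) -> np.ndarray:
    """Sinusoidal position embeddings: sin on even dims, cos on odd dims."""
    p = np.arange(num_positions)[:, None]        # positions, shape (P, 1)
    i = np.arange(d // 2)[None, :]               # dimension-pair index, shape (1, d/2)
    angles = p / np.power(10000.0, 2 * i / d)    # (P, d/2)
    pe = np.empty((num_positions, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(1500, 512)   # e.g. one row per audio frame position
```

Each row is added to the corresponding frame's feature vector, giving the model access to absolute and relative position without recurrence.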
S2: and performing model structure adjustment and parameter optimization of the English voice recognition model through a training data set.
Specifically, the method comprises the following five steps:
s21: collecting audio paired with text and manually annotated audio data for the oral English examination, and constructing a data set.
As an example, the data set is constructed by collecting 11,000 hours of audio paired with text and 1,500 hours of manually annotated audio data for the oral English examination.
S22: preprocessing the data set, and dividing the data set into a training set and a testing set according to a preset distribution proportion.
As an example, all audio data in the data set is resampled to 16000 Hz, cut into 30-second segments, and matched to its labels. The preprocessed data set is partitioned into a training set and a test set in the proportion of 8:2.
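The preprocessing above (16000 Hz audio cut into 30-second segments, then an 8:2 split) can be sketched as follows; the resampling itself is omitted, and the function names are illustrative:

```python
import numpy as np

SR = 16000           # target sample rate, per step S22
SEG = 30 * SR        # 30-second segment length in samples

def segment(audio: np.ndarray) -> list:
    """Cut an already-resampled audio array into 30 s pieces, zero-padding the last."""
    chunks = []
    for start in range(0, len(audio), SEG):
        piece = audio[start:start + SEG]
        if len(piece) < SEG:
            piece = np.pad(piece, (0, SEG - len(piece)))
        chunks.append(piece)
    return chunks

def train_test_split(items: list, ratio: float = 0.8, seed: int = 0):
    """Shuffle and split a list of examples in the 8:2 proportion."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(items))
    cut = int(len(items) * ratio)
    return [items[i] for i in idx[:cut]], [items[i] for i in idx[cut:]]

clips = segment(np.zeros(SR * 75))            # 75 s of audio -> three 30 s clips
train, test = train_test_split(list(range(10)))
```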
S23: and carrying out logarithmic Mel spectrum feature extraction on the data set, and carrying out feature normalization to globally scale the input features between-1 and 1.
S24: quantizer parameters and hyper-parameters are set for model training.
It should be noted that the method uses the Adam optimizer for model training, with the optimizer parameters set to β1 = 0.9, β2 = 0.95 and ε = 10^-8, and the learning rate follows the formula lrate = d^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)), where d is 512 and warmup_steps is 5000. The model training hyper-parameters are set as follows: the feedforward neural network is a 6-layer perceptron; the number of heads of the multi-head self-attention module is 6; a residual attention module is formed from a multi-head self-attention module and a feedforward neural network; the encoder and the decoder are each built from 5 residual attention modules; and the parameter dropout rate is set to 0.1.
As shown in fig. 3, the residual attention module is composed of a multi-head self-attention module and a feedforward neural network. The multi-head self-attention module expands the model's ability to attend to different positions: each weight matrix maps the input vectors into a different representation subspace, and the merged output matrix is compressed and fed to the feedforward neural network.
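The residual attention module of Fig. 3 can be sketched in NumPy as below. Since the model width must be divisible by the head count, the sketch uses an illustrative width of 48 with the 6 heads mentioned above; all weight shapes and scales are assumptions, and layer normalization is omitted:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, n_heads=6):
    """x: (T, d). Project to Q/K/V, split into heads, apply scaled
    dot-product attention per head, merge heads, and project out."""
    T, d = x.shape
    dh = d // n_heads                 # per-head width; d must divide evenly

    def heads(m):                     # (T, d) -> (n_heads, T, dh)
        return m.reshape(T, n_heads, dh).transpose(1, 0, 2)

    q, k, v = heads(x @ Wq), heads(x @ Wk), heads(x @ Wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))   # (h, T, T)
    out = (attn @ v).transpose(1, 0, 2).reshape(T, d)        # merge heads
    return out @ Wo

def residual_attention_block(x, attn_weights, W1, b1, W2, b2):
    """Self-attention with a residual connection, then a ReLU feed-forward
    network with a residual connection."""
    h = x + multi_head_self_attention(x, *attn_weights)
    return h + (np.maximum(h @ W1 + b1, 0.0) @ W2 + b2)

rng = np.random.default_rng(0)
d, T = 48, 10                         # illustrative sizes; 48 divides by 6 heads
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
W1, b1 = rng.standard_normal((d, 4 * d)) * 0.1, np.zeros(4 * d)
W2, b2 = rng.standard_normal((4 * d, d)) * 0.1, np.zeros(d)
y = residual_attention_block(rng.standard_normal((T, d)), (Wq, Wk, Wv, Wo), W1, b1, W2, b2)
```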
S25: and performing Model training on the English speech recognition Model by using a training set to obtain a pre-training Model, and performing fine tuning on the pre-training Model by using a test set to obtain a final Model.
As an example, the Model is trained on the data set for 300000 iteration steps to obtain a pre-trained Model, which is then fine-tuned on the manually annotated data set to obtain the final Model.
S3: and performing voice recognition on the English spoken language audio file collected in real time by using the adjusted English voice recognition model, and generating an audio recognition text.
This step realizes the deployment of the model and can be implemented through the following six steps:
s301: and deploying the trained Model.
In this step, the model is deployed with the ONNX framework and is encrypted for deployment so that it can run on various devices. The adopted encryption algorithm combines each element of the model with a generated key, where Model_encrypt is the encrypted model, Model is the trained final model, i denotes the i-th element of the model, key is the generated random string, and keyLength is the length of the key.
Encrypting the deployed ONNX model allows it to run on various devices while ensuring its security.
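The element-wise keyed encryption described above can be sketched as follows. Since the source does not reproduce the formula itself, the XOR operation (with the key cycled by keyLength) is an assumption, chosen because it is self-inverse and so allows decryption with the same key:

```python
import os

def xor_encrypt(model_bytes: bytes, key: bytes) -> bytes:
    """Combine each model byte with the key, cycled by its length.
    XOR is an assumption: the patent describes only the variables
    (i-th element, key, keyLength), not the operation."""
    klen = len(key)
    return bytes(b ^ key[i % klen] for i, b in enumerate(model_bytes))

key = os.urandom(16)                 # the generated random key string
blob = b"onnx-model-bytes"           # stand-in for the serialized ONNX model
enc = xor_encrypt(blob, key)
dec = xor_encrypt(enc, key)          # XOR is its own inverse
```

Note that a simple keyed XOR only obfuscates the model file; a production deployment would use an authenticated cipher.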
S302: preprocessing an English spoken language audio file, and suppressing the noise of the audio by adopting an LMS adaptive filter noise reduction method in the preprocessing process to obtain a preprocessed audio sequence x.
S303: the preprocessed audio sequence X is resampled to 16000Hz and cut into 30-second segments to form a batch of audio segments X.
S304: and carrying out logarithmic Mel spectrum feature extraction on the batch of audio fragments X, and carrying out normalization processing on the features.
S305: and inputting the normalized feature sequence F of the batch of audio clips X into a Model for prediction to obtain the probability distribution P of the recognition text.
S306: and performing table look-up on the probability distribution P of the obtained recognition text to obtain the audio recognition text.
As an example, the invention further provides a specific embodiment to explain the implementation of step S3. The experimental environment is a Linux system configured with an Intel(R) Xeon(R) E5-2620 v4 @ 2.10 GHz CPU, 128 GB of memory, and four NVIDIA A40 GPUs with 48 GB of video memory each; no model training is needed during this implementation, which precedes formal use of the model. 100 read-aloud recordings and 100 open-question recordings from oral English examinations are selected, each 60 seconds long.
The specific implementation steps are as follows:
s311: deploying the Model obtained by training;
s312: preprocessing the 100 read-aloud and 100 open-question spoken English audio files, suppressing noise with the LMS adaptive filter noise reduction method during preprocessing to obtain preprocessed audio sequences x;
s313: resampling the preprocessed audio sequence X to 16000Hz, and intercepting the audio sequence X into 30-second segments to form batch audio segments X;
s314: carrying out logarithmic Mel spectrum feature extraction on the obtained batch of audio fragments X, and carrying out normalization processing on the features;
s315: inputting the normalized feature sequence F of the batch of audio clips X into a Model for prediction to obtain probability distribution P of the recognition text;
s316: and performing table look-up on the obtained probability distribution P of the recognition text to obtain the audio recognition text.
Using the deep-learning-based spoken English speech recognition method, the oral English examination recordings of 200 randomly selected examinees were tested and manually quality-checked; the recognition results are shown in Table 1:
table 1: oral data identification accuracy statistical table for certain English test of 200 examinees
Data type | Accuracy rate |
---|---|
Read-aloud questions | 0.930 |
Open questions | 0.875 |
As can be seen from Table 1, the deep-learning-based spoken English speech recognition method provided by the invention effectively overcomes the defects of the prior art.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The method for recognizing spoken English based on deep learning provided by the invention is described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
Claims (10)
1. An English spoken language voice recognition method based on deep learning is characterized by comprising the following steps:
s1: designing an English voice recognition model based on a Transformer;
s2: performing model structure adjustment and parameter optimization of the English voice recognition model through a training data set;
s3: and performing voice recognition on the English spoken language audio file collected in real time by using the adjusted English voice recognition model, and generating an audio recognition text.
2. The deep learning-based spoken english speech recognition method according to claim 1, wherein said step S1 comprises:
and constructing a position information embedding module, a multi-head self-attention module, a feedforward neural network and a cross attention module, and combining the position information embedding module, the multi-head self-attention module, the feedforward neural network and the cross attention module into an English speech recognition model based on a Transformer.
3. The spoken english speech recognition method based on deep learning according to claim 2, wherein the step S2 specifically comprises the steps of:
s21: acquiring audio paired with text and manually annotated audio data for the oral English examination, and constructing a data set;
s22: preprocessing a data set, and dividing the data set into a training set and a test set according to a preset distribution proportion;
s23: carrying out logarithmic Mel spectrum feature extraction on the data set and performing feature normalization, globally scaling the input features to between -1 and 1;
s24: setting optimizer parameters and hyper-parameters for model training;
s25: and performing Model training on the English speech recognition Model by using a training set to obtain a pre-training Model, and performing fine tuning on the pre-training Model by using a test set to obtain a final Model.
4. The spoken english speech recognition method based on deep learning according to claim 3, wherein the step S3 specifically comprises the steps of:
s31: deploying the Model obtained by training;
s32: collecting an English spoken language audio file, and suppressing noise of the English spoken language audio file by adopting an LMS adaptive filter noise reduction method to obtain a preprocessed audio sequence x;
s33: resampling the preprocessed audio sequence x to 16000 Hz and cutting it into 30-second segments to form a batch of audio segments X;
s34: carrying out logarithmic Mel spectrum feature extraction on the batch audio clips X, and carrying out normalization processing on the features;
s35: inputting the normalized feature sequence F of the batch of audio clips X into a Model for prediction to obtain probability distribution P of the recognition text;
s36: and performing table look-up on the probability distribution P of the recognition text to obtain the audio recognition text.
5. The method for recognizing spoken english speech based on deep learning according to claim 3, wherein the step S21 is specifically:
collecting 11,000 hours of audio paired with text and 1,500 hours of manually annotated audio data for the oral English examination, and constructing this audio data into a data set.
6. The spoken english speech recognition method based on deep learning according to claim 3, wherein the step S22 is specifically:
resampling all audio data in the data set to 16000Hz, dividing the resampled audio data into segments with the duration of 30 seconds and identifying by utilizing a preset label to finish the preprocessing of the data set;
the preprocessed data set is divided into a training set and a testing set according to the proportion of 8:2.
7. The method for recognizing spoken english speech based on deep learning according to claim 3, wherein the step S24 is specifically:
setting the parameters of the Adam optimizer used for model training, with the learning rate following the formula lrate = d^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)), wherein d is 512 and warmup_steps is 5000;
setting 6 layers of perceptrons for the feedforward neural network, and setting the head number of the multi-head self-attention module to be 6;
forming a residual attention module by using a multi-head self-attention module and a feedforward neural network;
the encoder and the decoder are each built from 5 residual attention modules, and the parameter dropout rate is set to 0.1.
8. The method for recognizing spoken english speech based on deep learning according to claim 3, wherein the step S25 is specifically:
carrying out 300000 steps of iteration on the English speech recognition model by using a training set to obtain a pre-training model;
and fine-tuning the pre-training Model with the manually annotated audio data in the test set to obtain the final Model.
9. The method for recognizing spoken English based on deep learning of claim 2, wherein the step S1 further comprises:
forming a characteristic vector with a time sequence on the word vector by a position information embedding method;
the position information characteristic is expressed using sine and cosine functions, and the position information embedding formula is: PE(p, 2i) = sin(p / 10000^(2i/d)), PE(p, 2i+1) = cos(p / 10000^(2i/d)), where p represents the position, i the dimension, and d is 512.
10. The deep learning based spoken english recognition method according to claim 4, wherein said step S31 comprises:
adopting the ONNX framework to deploy the Model, and encrypting the Model for deployment so that it can run on various devices;
the adopted encryption algorithm combines each element of the model with a key, where Model_encrypt is the encrypted model, Model is the trained model, i denotes the i-th element of the model, key is a generated random string, and keyLength is the length of the key.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310194346.9A CN115862596A (en) | 2023-03-03 | 2023-03-03 | Deep learning-based spoken English speech recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310194346.9A CN115862596A (en) | 2023-03-03 | 2023-03-03 | Deep learning-based spoken English speech recognition method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115862596A true CN115862596A (en) | 2023-03-28 |
Family
ID=85659833
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310194346.9A Pending CN115862596A (en) | 2023-03-03 | 2023-03-03 | Deep learning-based spoken English speech recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115862596A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117912027A (en) * | 2024-03-18 | 2024-04-19 | 山东大学 | Intelligent identification method and system suitable for RPA process automation |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111916064A (en) * | 2020-08-10 | 2020-11-10 | 北京睿科伦智能科技有限公司 | End-to-end neural network speech recognition model training method |
CN113539244A (en) * | 2021-07-22 | 2021-10-22 | 广州虎牙科技有限公司 | End-to-end speech recognition model training method, speech recognition method and related device |
CN113570030A (en) * | 2021-01-18 | 2021-10-29 | 腾讯科技(深圳)有限公司 | Data processing method, device, equipment and storage medium |
CN113889085A (en) * | 2021-11-22 | 2022-01-04 | 北京百度网讯科技有限公司 | Speech recognition method, apparatus, device, storage medium and program product |
CN114333824A (en) * | 2021-12-31 | 2022-04-12 | 佛山科学技术学院 | Partial information fusion voice recognition network and method based on Transformer model and terminal |
CN114360503A (en) * | 2021-11-18 | 2022-04-15 | 腾讯科技(深圳)有限公司 | Voice recognition method, system, storage medium and terminal equipment |
CN115050371A (en) * | 2022-07-12 | 2022-09-13 | 深圳市普渡科技有限公司 | Speech recognition method, speech recognition device, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chen et al. | End-to-end neural network based automated speech scoring | |
Miao et al. | Speaker adaptive training of deep neural network acoustic models using i-vectors | |
Li et al. | Robust automatic speech recognition: a bridge to practical applications | |
Ling et al. | Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends | |
Dahl et al. | Phone recognition with the mean-covariance restricted Boltzmann machine | |
CN108172218B (en) | Voice modeling method and device | |
CN107408111A (en) | End-to-end speech recognition | |
CN111837178A (en) | Speech processing system and method for processing speech signal | |
CN112509563A (en) | Model training method and device and electronic equipment | |
Shen et al. | Interactive learning of teacher-student model for short utterance spoken language identification | |
CN113744727B (en) | Model training method, system, terminal equipment and storage medium | |
CN115862596A (en) | Deep learning-based spoken English speech recognition method | |
Marlina et al. | Makhraj recognition of Hijaiyah letter for children based on Mel-Frequency Cepstrum Coefficients (MFCC) and Support Vector Machines (SVM) method | |
CN117672176A (en) | Rereading controllable voice synthesis method and device based on voice self-supervision learning characterization | |
KR102319753B1 (en) | Method and apparatus for producing video contents based on deep learning | |
Liu | Deep convolutional and LSTM neural networks for acoustic modelling in automatic speech recognition | |
Yi et al. | Prosodyspeech: Towards advanced prosody model for neural text-to-speech | |
CN108831486B (en) | Speaker recognition method based on DNN and GMM models | |
CN116092475B (en) | Stuttering voice editing method and system based on context-aware diffusion model | |
Shahamiri et al. | An investigation towards speaker identification using a single-sound-frame | |
Zegers | Speech recognition using neural networks | |
Watrous et al. | Learned phonetic discrimination using connectionist networks | |
Saha | Development of a bangla speech to text conversion system using deep learning | |
Thangthai | Computer lipreading via hybrid deep neural network hidden Markov models | |
CN111833851B (en) | Method for automatically learning and optimizing acoustic model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20230328 |