CN115862596A - Deep learning-based spoken English speech recognition method - Google Patents

Deep learning-based spoken English speech recognition method

Info

Publication number
CN115862596A
Authority
CN
China
Prior art keywords
model
english
audio
training
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310194346.9A
Other languages
Chinese (zh)
Inventor
马磊
陈义学
夏彬彬
侯庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Original Assignee
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANDONG SHANDA OUMA SOFTWARE CO Ltd filed Critical SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority to CN202310194346.9A priority Critical patent/CN115862596A/en
Publication of CN115862596A publication Critical patent/CN115862596A/en
Pending legal-status Critical Current

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a deep-learning-based spoken English speech recognition method, belonging to the technical field of speech recognition. The method comprises the following steps: designing a Transformer-based English speech recognition model; adjusting the model structure and optimizing the parameters of the English speech recognition model on a training data set; and performing speech recognition on spoken English audio files collected in real time with the adjusted English speech recognition model to generate the audio recognition text. By adopting an end-to-end Transformer-based speech recognition method and constructing a training data set of examinees' spoken English, the whole network model is optimized as a whole during training, a global optimum is ensured, and the recognition accuracy is effectively improved.

Description

Deep learning-based spoken English speech recognition method
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a deep-learning-based spoken English speech recognition method.
Background
In recent years, with the rapid development of pattern recognition and artificial intelligence and the deepening application of machine learning, especially deep learning, the research and application fields of speech recognition technology have become increasingly broad. Speech recognition technology uses a computer to transcribe a speech signal into the corresponding text or command; it is essentially a pattern recognition process.
The model most commonly used in the early days of speech recognition was the GMM-HMM, but its modeling capability is limited and it cannot fully and accurately represent speech features and structure; with the development of deep learning, more and more neural-network-based speech recognition models have appeared. Replacing the GMM with a DNN to model the observation-state probabilities improves recognition accuracy, but the DNN-HMM model is difficult to train: every frame of speech in the training data must be labeled, and manual labeling is laborious. In addition, the LSTM-RNN is widely used in acoustic models because it can capture the context-dependent information of sequence data, but the RNN computation at each time step requires the output of the previous step as input, so it can only be computed serially and is slow; moreover, RNNs are prone to vanishing gradients during training, converge more slowly, and require more computational resources.
At present, when existing neural-network-based speech recognition models are applied to speech recognition systems for spoken English examinations, they can meet the basic requirement of automatic speech recognition, but they suffer from complicated training procedures, difficult data annotation and heavy consumption of computational resources. As a result, the speech conversion pipeline of existing speech recognition systems is very complex, and when transcribing examinees' speech in spoken English examinations, the recognition accuracy still needs to be improved.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide a deep-learning-based spoken English speech recognition method that adopts an end-to-end Transformer-based speech recognition approach and constructs a training data set of examinees' spoken English, so that the whole network model is optimized as a whole during training, a global optimum is ensured, and the recognition accuracy is effectively improved.
In order to achieve this purpose, the invention is realized by the following technical solution:
A deep-learning-based spoken English speech recognition method comprises the following steps:
S1: designing a Transformer-based English speech recognition model;
S2: adjusting the model structure and optimizing the parameters of the English speech recognition model on a training data set;
S3: performing speech recognition on spoken English audio files collected in real time with the adjusted English speech recognition model, and generating an audio recognition text.
Further, step S1 includes:
constructing a position information embedding module, a multi-head self-attention module, a feed-forward neural network and a cross-attention module, and combining them into a Transformer-based English speech recognition model.
Further, step S2 specifically includes the following steps:
S21: collecting audio paired with text and manually annotated audio data for the spoken English examination, and constructing a data set;
S22: preprocessing the data set and dividing it into a training set and a test set according to a preset proportion;
S23: extracting logarithmic Mel-spectrum features from the data set and normalizing them, globally scaling the input features to between -1 and 1;
S24: setting the optimizer parameters and hyper-parameters for model training;
S25: training the English speech recognition model on the training set to obtain a pre-trained model, and fine-tuning the pre-trained model with the test set to obtain the final Model.
Further, step S3 specifically includes the following steps:
S31: deploying the trained Model;
S32: collecting spoken English audio files and suppressing their noise with an LMS adaptive-filter noise-reduction method to obtain a preprocessed audio sequence x;
S33: resampling the preprocessed audio sequence x to 16,000 Hz and cutting it into 30-second segments to form a batch of audio segments X;
S34: extracting logarithmic Mel-spectrum features from the batch of audio segments X and normalizing the features;
S35: feeding the normalized feature sequence F of the batch of audio segments X into the Model for prediction to obtain the probability distribution P of the recognized text;
S36: looking up the probability distribution P of the recognized text in the vocabulary table to obtain the audio recognition text.
Further, step S21 specifically includes:
collecting 11,000 hours of audio paired with text and 1,500 hours of manually annotated audio data for the spoken English examination, and constructing these audio data into a data set.
Further, step S22 specifically includes:
resampling all audio data in the data set to 16,000 Hz, cutting the resampled audio into 30-second segments and marking each segment with its preset label, thereby completing the preprocessing of the data set;
the preprocessed data set is divided into a training set and a test set in the proportion 8:2.
Further, step S24 specifically includes:
setting the parameters of the Adam optimizer used for model training, and setting the learning rate learningRate according to the formula
learningRate = d^(-0.5) × min(step^(-0.5), step × warmupSteps^(-1.5))
where step is the current training step, d is 512 and warmupSteps is 5000;
configuring the feed-forward neural network as a 6-layer perceptron, and setting the number of heads of the multi-head self-attention module to 6;
forming a residual attention module from a multi-head self-attention module and a feed-forward neural network;
building the encoder and the decoder from 5 residual attention modules each, and setting the dropout rate to 0.1.
Further, step S25 specifically includes:
training the English speech recognition model on the training set for 300,000 iterations to obtain a pre-trained model;
fine-tuning the pre-trained model with the manually annotated audio data in the test set to obtain the final Model.
Further, step S1 further includes:
forming time-ordered feature vectors from the word vectors by a position information embedding method;
the position information features are expressed with sine and cosine functions, and the position information embedding formulas are:
PE(p, 2i) = sin(p / 10000^(2i/d)),  PE(p, 2i+1) = cos(p / 10000^(2i/d))
where p represents the position, i represents the dimension, and d is 512.
Further, step S31 includes:
deploying the Model with the ONNX framework, and deploying it in encrypted form so that it can run on a variety of devices;
when the encrypted deployment is performed, the encryption algorithm used is:
ModelEncrypt[i] = Model[i] XOR key[i mod keyLength]
where ModelEncrypt is the encrypted model, Model is the trained model, i represents the i-th element of the model, key represents the generated random character string, and keyLength represents the length of the key.
Compared with the prior art, the invention has the following beneficial effects. The invention provides a deep-learning-based spoken English speech recognition method: first, a Transformer-based English speech recognition model is designed; then, the structure and parameters of the designed model are tuned on a training set. The data set combines audio paired with text from the Internet with manually annotated audio data for the spoken English examination; the diversity of the Internet audio data helps make the trained model robust, and the model is then fine-tuned with the manually annotated data. Finally, the trained optimal model is deployed and applied to real data, thereby realizing spoken English speech recognition.
The invention adopts an improved end-to-end Transformer-based speech recognition method and constructs a training data set of examinees' spoken English, so that the whole network model is optimized as a whole during training, a global optimum is ensured, and the recognition accuracy is effectively improved. Because the network model uses a relatively uniform network structure, it can be deployed on low-latency, high-precision devices and is computationally efficient. After the model is deployed, speech features are input and English words are output, which simplifies the speech recognition process.
Therefore, compared with the prior art, the invention has prominent substantive features and remarkable progress, and the beneficial effects of the implementation are also obvious.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a process flow diagram of an embodiment of the present invention.
FIG. 2 is a schematic diagram of a model structure according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a residual attention module according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 shows a deep-learning-based spoken English speech recognition method, which includes the following steps:
s1: and designing an English voice recognition model based on a Transformer.
Specifically, a position information embedding module, a multi-head self-attention module, a feedforward neural network and a cross-attention module are constructed and combined into an English speech recognition model based on a Transformer.
As shown in Fig. 2, the overall model consists of a position information embedding part, a multi-head self-attention module, a feed-forward neural network part and a cross-attention module part. It can solve the audio recognition problem of the spoken English examination: without a separate language model, it forms time-ordered feature vectors from the word vectors by a position information embedding method and integrates the position information with the attention mechanism to realize the complete audio recognition process.
When constructing the model, the temporal information of the audio sequence is taken into account, so time-ordered feature vectors are formed from the word vectors by a position information embedding method. The position information features are expressed with sine and cosine functions, and the position information embedding formulas are:
PE(p, 2i) = sin(p / 10000^(2i/d)),  PE(p, 2i+1) = cos(p / 10000^(2i/d))
where p represents the position, i represents the dimension, and d is 512.
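To make the formulas concrete, the following is a minimal NumPy sketch of the sinusoidal position information embedding (not the patent's own code; the function name and the 1,500-frame example are illustrative assumptions, while d = 512 follows the text):

```python
import numpy as np

def positional_encoding(seq_len: int, d: int = 512) -> np.ndarray:
    """Sinusoidal position embeddings: PE(p, 2i) = sin(p / 10000^(2i/d)),
    PE(p, 2i+1) = cos(p / 10000^(2i/d))."""
    pe = np.zeros((seq_len, d))
    p = np.arange(seq_len)[:, None]            # positions
    two_i = np.arange(0, d, 2)[None, :]        # even dimension indices 2i
    angles = p / np.power(10000.0, two_i / d)
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

# Example: position embeddings for a 1500-frame feature sequence
print(positional_encoding(1500).shape)  # (1500, 512)
```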
S2: adjusting the model structure and optimizing the parameters of the English speech recognition model on a training data set.
Specifically, the method comprises the following five steps:
s21: collecting audio matched with the text and manual annotation audio data aiming at the oral English test, and constructing a data set.
As an example, the construction of the data set is completed by collecting audio paired with text with a duration of 1.1 ten thousand hours and manually labeled audio data for an english oral examination with a duration of 0.15 ten thousand hours.
S22: preprocessing the data set and dividing it into a training set and a test set according to a preset proportion.
As an example, all audio data in the data set are resampled to 16,000 Hz, the audio is cut into 30-second segments, and each segment is matched with its label. The preprocessed data set is divided into a training set and a test set in the proportion 8:2.
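As a rough sketch of this resampling, segmentation and 8:2 split (an illustrative assumption rather than the patent's own code: librosa is assumed for loading and resampling, and the helper names are made up):

```python
import librosa
import numpy as np

TARGET_SR = 16000          # 16,000 Hz, as specified above
SEGMENT_SECONDS = 30

def load_and_segment(path: str) -> np.ndarray:
    """Resample one audio file to 16 kHz and cut it into 30-second segments."""
    audio, _ = librosa.load(path, sr=TARGET_SR)       # resampled on load
    seg_len = TARGET_SR * SEGMENT_SECONDS
    pad = (-len(audio)) % seg_len                     # pad the tail to a full segment
    audio = np.pad(audio, (0, pad))
    return audio.reshape(-1, seg_len)                 # (num_segments, seg_len)

def split_train_test(items: list, ratio: float = 0.8, seed: int = 0):
    """Shuffle and split the preprocessed items into training and test sets (8:2)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(items))
    cut = int(len(items) * ratio)
    return [items[i] for i in order[:cut]], [items[i] for i in order[cut:]]
```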
S23: extracting logarithmic Mel-spectrum features from the data set and normalizing them, globally scaling the input features to between -1 and 1.
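A minimal sketch of the log-Mel feature extraction and global scaling follows (the frame parameters n_fft = 400, hop_length = 160 and n_mels = 80 are illustrative assumptions, and the global minimum and maximum would in practice be computed over the whole training set):

```python
import librosa
import numpy as np

def log_mel_features(segment: np.ndarray, sr: int = 16000,
                     n_mels: int = 80) -> np.ndarray:
    """Logarithmic Mel-spectrum features of one 30-second segment."""
    mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    return np.log10(np.maximum(mel, 1e-10)).T         # (frames, n_mels)

def scale_to_unit_range(features: np.ndarray,
                        global_min: float, global_max: float) -> np.ndarray:
    """Globally scale the input features to the range [-1, 1]."""
    return 2.0 * (features - global_min) / (global_max - global_min) - 1.0
```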
S24: setting the optimizer parameters and hyper-parameters for model training.
It should be noted that in this method the Adam optimizer is used for model training, with the optimizer parameters set to β1 = 0.9, β2 = 0.95 and ε = 10^(-8), and the learning rate is set according to the formula
learningRate = d^(-0.5) × min(step^(-0.5), step × warmupSteps^(-1.5))
where step is the current training step, d is 512 and warmupSteps is 5000.
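A small sketch of this warm-up schedule, assuming the standard Transformer interpretation of the formula with d = 512 and warmupSteps = 5000 taken from the text:

```python
def learning_rate(step: int, d: int = 512, warmup_steps: int = 5000) -> float:
    """Linear warm-up for warmup_steps, then inverse-square-root decay."""
    step = max(step, 1)
    return d ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The schedule rises until step 5000 and decays afterwards
for s in (100, 5000, 50000, 300000):
    print(s, round(learning_rate(s), 6))
```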
Setting model training hyper-parameters as follows:
the feedforward neural network is set as a 6-layer perceptron, the head number of the multi-head self-attention module is set as 6, a residual attention module is formed by the multi-head self-attention module and the feedforward neural network, the encoder and the decoder are respectively formed by 5 residual attention modules, and the parameter discarding rate is set as 0.1.
As shown in Fig. 3, the residual attention module is composed of a multi-head self-attention module and a feed-forward neural network. The multi-head self-attention module expands the model's ability to attend to different positions: each weight matrix maps the input vectors into a different representation subspace, and the combined output matrix is compressed and used as the input of the feed-forward neural network.
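The following PyTorch sketch illustrates one such residual attention module. It is an illustrative assumption rather than the patent's implementation: the feed-forward part is simplified to two linear layers instead of the 6-layer perceptron described above, and 8 heads are used because PyTorch's MultiheadAttention requires the model dimension 512 to be divisible by the head count, whereas the text specifies 6 heads.

```python
import torch
import torch.nn as nn

class ResidualAttentionBlock(nn.Module):
    """Multi-head self-attention followed by a feed-forward network,
    each wrapped in a residual connection (cf. Fig. 3)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.drop(attn_out))       # residual around self-attention
        x = self.norm2(x + self.drop(self.ffn(x)))    # residual around the FFN
        return x

# Stacking 5 such blocks gives an encoder of the kind described above
encoder = nn.Sequential(*[ResidualAttentionBlock() for _ in range(5)])
```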
S25: training the English speech recognition model on the training set to obtain a pre-trained model, and fine-tuning the pre-trained model with the test set to obtain the final Model.
As an example, the model is trained on the data set for 300,000 iterations to obtain a pre-trained model, and on this basis the final Model is obtained by fine-tuning with the manually annotated data set.
S3: performing speech recognition on spoken English audio files collected in real time with the adjusted English speech recognition model, and generating an audio recognition text.
This step deploys the model in practice and can be realized through the following six steps:
s301: and deploying the trained Model.
In this step, the model is deployed with the ONNX framework and is deployed in encrypted form so that it can run on a variety of devices. The encryption algorithm used is:
ModelEncrypt[i] = Model[i] XOR key[i mod keyLength]
where ModelEncrypt is the encrypted model, Model is the trained final Model, i denotes the i-th element of the model, key is a generated random character string, and keyLength is the length of the key.
Encrypting the deployed ONNX model allows it to run on a variety of devices while ensuring its security.
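A minimal sketch of such a keyed, element-wise encryption of the exported model file follows. It is an assumption consistent with the formula above (XOR with a cycled key), not a security recommendation or the patent's own code; the file names are illustrative, and since XOR is symmetric the same function also decrypts.

```python
import secrets

def xor_crypt(data: bytes, key: bytes) -> bytes:
    """Element-wise: out[i] = data[i] XOR key[i mod keyLength]."""
    key_length = len(key)
    return bytes(b ^ key[i % key_length] for i, b in enumerate(data))

# Encrypt an exported ONNX model before shipping it to devices
key = secrets.token_bytes(32)                     # generated random key
with open("model.onnx", "rb") as f:
    encrypted = xor_crypt(f.read(), key)
with open("model.onnx.enc", "wb") as f:
    f.write(encrypted)
```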
S302: preprocessing the spoken English audio files; during preprocessing, the noise of the audio is suppressed with an LMS adaptive-filter noise-reduction method to obtain the preprocessed audio sequence x.
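A minimal sketch of LMS adaptive-filter noise suppression is given below. It is an illustrative assumption: the classical LMS canceller needs a noise reference signal, which the text does not describe, and the tap count and step size are likewise assumed.

```python
import numpy as np

def lms_denoise(noisy: np.ndarray, noise_ref: np.ndarray,
                taps: int = 32, mu: float = 0.01) -> np.ndarray:
    """Adaptive noise cancellation: an FIR filter learns to predict the noise
    component from the reference, and the prediction is subtracted."""
    w = np.zeros(taps)
    clean = np.zeros_like(noisy)
    for n in range(taps, len(noisy)):
        x = noise_ref[n - taps:n][::-1]       # most recent reference samples
        noise_est = np.dot(w, x)
        e = noisy[n] - noise_est              # error signal = de-noised sample
        w += 2 * mu * e * x                   # LMS weight update
        clean[n] = e
    return clean
```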
S303: the preprocessed audio sequence x is resampled to 16,000 Hz and cut into 30-second segments to form a batch of audio segments X.
S304: logarithmic Mel-spectrum features are extracted from the batch of audio segments X and the features are normalized.
S305: the normalized feature sequence F of the batch of audio segments X is fed into the Model for prediction to obtain the probability distribution P of the recognized text.
S306: the obtained probability distribution P of the recognized text is looked up in the vocabulary table to obtain the audio recognition text.
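As a rough sketch of this look-up step (greedy decoding is an illustrative assumption, and the tiny vocabulary below is hypothetical):

```python
import numpy as np

def lookup_text(prob_dist: np.ndarray, vocab: list) -> str:
    """Map the per-step probability distribution P (steps x vocabulary size)
    to text: take the most likely entry at each step and look it up."""
    token_ids = prob_dist.argmax(axis=-1)
    return " ".join(vocab[i] for i in token_ids)

# Hypothetical example with a three-word vocabulary table
vocab = ["<blank>", "hello", "world"]
P = np.array([[0.1, 0.8, 0.1],
              [0.2, 0.1, 0.7]])
print(lookup_text(P, vocab))   # -> "hello world"
```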
As an example, the present invention further provides a specific embodiment to explain the implementation of step S3. The experimental environment is a Linux system configured with an Intel(R) Xeon(R) E5-2620 v4 @ 2.10 GHz CPU, 128 GB of memory and four NVIDIA A40 GPUs with 48 GB of memory each; the model has already been trained, so no model training is needed in this implementation before the model is put into formal use. 100 read-aloud question recordings and 100 open-question recordings from spoken English examinations are selected, each 60 seconds long.
The specific implementation steps are as follows:
s311: deploying the Model obtained by training;
s312: preprocessing each 100 spoken English reading audio and open question audio files, and suppressing the noise of the audio by using an LMS adaptive filter noise reduction method in the preprocessing process to obtain a preprocessed audio sequence x;
s313: resampling the preprocessed audio sequence X to 16000Hz, and intercepting the audio sequence X into 30-second segments to form batch audio segments X;
s314: carrying out logarithmic Mel spectrum feature extraction on the obtained batch of audio fragments X, and carrying out normalization processing on the features;
s315: inputting the normalized feature sequence F of the batch of audio clips X into a Model for prediction to obtain probability distribution P of the recognition text;
s316: and performing table look-up on the obtained probability distribution P of the recognition text to obtain the audio recognition text.
Using the deep-learning-based spoken English speech recognition method described above, the spoken-examination data of 200 randomly selected examinees from an English test were recognized and manually quality-checked; the recognition results are shown in Table 1:
Table 1: Recognition accuracy statistics for the spoken-examination data of 200 examinees in an English test

Data type               Accuracy
Read-aloud questions    0.930
Open questions          0.875
As can be seen from Table 1, the spoken English speech recognition method based on deep learning provided by the invention can effectively overcome the defects in the prior art.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The method for recognizing spoken English based on deep learning provided by the invention is described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims (10)

1. A deep-learning-based spoken English speech recognition method, characterized by comprising the following steps:
S1: designing a Transformer-based English speech recognition model;
S2: adjusting the model structure and optimizing the parameters of the English speech recognition model on a training data set;
S3: performing speech recognition on spoken English audio files collected in real time with the adjusted English speech recognition model, and generating an audio recognition text.
2. The deep-learning-based spoken English speech recognition method according to claim 1, wherein said step S1 comprises:
constructing a position information embedding module, a multi-head self-attention module, a feed-forward neural network and a cross-attention module, and combining them into a Transformer-based English speech recognition model.
3. The deep-learning-based spoken English speech recognition method according to claim 2, wherein step S2 specifically comprises the following steps:
S21: collecting audio paired with text and manually annotated audio data for the spoken English examination, and constructing a data set;
S22: preprocessing the data set and dividing it into a training set and a test set according to a preset proportion;
S23: extracting logarithmic Mel-spectrum features from the data set and normalizing them, globally scaling the input features to between -1 and 1;
S24: setting the optimizer parameters and hyper-parameters for model training;
S25: training the English speech recognition model on the training set to obtain a pre-trained model, and fine-tuning the pre-trained model with the test set to obtain the final Model.
4. The deep-learning-based spoken English speech recognition method according to claim 3, wherein step S3 specifically comprises the following steps:
S31: deploying the trained Model;
S32: collecting spoken English audio files and suppressing their noise with an LMS adaptive-filter noise-reduction method to obtain a preprocessed audio sequence x;
S33: resampling the preprocessed audio sequence x to 16,000 Hz and cutting it into 30-second segments to form a batch of audio segments X;
S34: extracting logarithmic Mel-spectrum features from the batch of audio segments X and normalizing the features;
S35: feeding the normalized feature sequence F of the batch of audio segments X into the Model for prediction to obtain the probability distribution P of the recognized text;
S36: looking up the probability distribution P of the recognized text in the vocabulary table to obtain the audio recognition text.
5. The deep-learning-based spoken English speech recognition method according to claim 3, wherein step S21 is specifically:
collecting 11,000 hours of audio paired with text and 1,500 hours of manually annotated audio data for the spoken English examination, and constructing these audio data into a data set.
6. The deep-learning-based spoken English speech recognition method according to claim 3, wherein step S22 is specifically:
resampling all audio data in the data set to 16,000 Hz, cutting the resampled audio into 30-second segments and marking each segment with its preset label, thereby completing the preprocessing of the data set;
the preprocessed data set is divided into a training set and a test set in the proportion 8:2.
7. The deep-learning-based spoken English speech recognition method according to claim 3, wherein step S24 is specifically:
setting the parameters of the Adam optimizer used for model training, and setting the learning rate learningRate according to the formula
learningRate = d^(-0.5) × min(step^(-0.5), step × warmupSteps^(-1.5))
where step is the current training step, d is 512 and warmupSteps is 5000;
configuring the feed-forward neural network as a 6-layer perceptron, and setting the number of heads of the multi-head self-attention module to 6;
forming a residual attention module from a multi-head self-attention module and a feed-forward neural network;
building the encoder and the decoder from 5 residual attention modules each, and setting the dropout rate to 0.1.
8. The deep-learning-based spoken English speech recognition method according to claim 3, wherein step S25 is specifically:
training the English speech recognition model on the training set for 300,000 iterations to obtain a pre-trained model;
fine-tuning the pre-trained model with the manually annotated audio data in the test set to obtain the final Model.
9. The deep-learning-based spoken English speech recognition method according to claim 2, wherein step S1 further comprises:
forming time-ordered feature vectors from the word vectors by a position information embedding method;
the position information features are expressed with sine and cosine functions, and the position information embedding formulas are:
PE(p, 2i) = sin(p / 10000^(2i/d)),  PE(p, 2i+1) = cos(p / 10000^(2i/d))
where p represents the position, i represents the dimension, and d is 512.
10. The deep-learning-based spoken English speech recognition method according to claim 4, wherein said step S31 comprises:
deploying the Model with the ONNX framework, and deploying it in encrypted form so that it can run on a variety of devices;
when the encrypted deployment is performed, the encryption algorithm used is:
ModelEncrypt[i] = Model[i] XOR key[i mod keyLength]
where ModelEncrypt is the encrypted model, Model is the trained model, i represents the i-th element of the model, key represents the generated random character string, and keyLength represents the length of the key.
CN202310194346.9A 2023-03-03 2023-03-03 Deep learning-based spoken English speech recognition method Pending CN115862596A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310194346.9A CN115862596A (en) 2023-03-03 2023-03-03 Deep learning-based spoken English speech recognition method

Publications (1)

Publication Number Publication Date
CN115862596A true CN115862596A (en) 2023-03-28

Family

ID=85659833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310194346.9A Pending CN115862596A (en) 2023-03-03 2023-03-03 Deep learning-based spoken English speech recognition method

Country Status (1)

Country Link
CN (1) CN115862596A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111916064A (en) * 2020-08-10 2020-11-10 北京睿科伦智能科技有限公司 End-to-end neural network speech recognition model training method
CN113570030A (en) * 2021-01-18 2021-10-29 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN113539244A (en) * 2021-07-22 2021-10-22 广州虎牙科技有限公司 End-to-end speech recognition model training method, speech recognition method and related device
CN114360503A (en) * 2021-11-18 2022-04-15 腾讯科技(深圳)有限公司 Voice recognition method, system, storage medium and terminal equipment
CN113889085A (en) * 2021-11-22 2022-01-04 北京百度网讯科技有限公司 Speech recognition method, apparatus, device, storage medium and program product
CN114333824A (en) * 2021-12-31 2022-04-12 佛山科学技术学院 Partial information fusion voice recognition network and method based on Transformer model and terminal
CN115050371A (en) * 2022-07-12 2022-09-13 深圳市普渡科技有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117912027A (en) * 2024-03-18 2024-04-19 山东大学 Intelligent identification method and system suitable for RPA process automation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20230328