CN115019782B - Voice recognition method based on CTC multilayer loss - Google Patents
Voice recognition method based on CTC multilayer loss
- Publication number: CN115019782B
- Application number: CN202210619908.5A
- Authority: CN (China)
- Prior art keywords: ctc, network, layer, training, voice
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/063—Training (under G10L15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/01—Assessment or evaluation of speech recognition systems
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/28—Constructional details of speech recognition systems
- Y02T10/40—Engine management systems (Y02T: climate change mitigation technologies related to transportation)
Abstract
A speech recognition method based on CTC multi-layer loss, belonging to the fields of pattern recognition and acoustics. The method standardizes the outputs of different layers of the speech recognition network so that each layer's output is as close as possible to the desired recognition result, thereby improving recognition performance. The method comprises a model training stage and a model testing stage. In the training stage, the preprocessed training set is input into the constructed multi-layer speech recognition network; the losses and weights of the different layers are computed, the layer losses are weighted and summed to obtain the multi-layer loss, and the loss is computed cyclically while the network parameters are updated until convergence. In the testing stage, the preprocessed test set is input into the trained multi-layer speech recognition network and the recognition result is output. The invention changes only the loss function of the CTC speech recognition model's training stage, leaving the model structure and the recognition procedure unchanged, and improves recognition accuracy with low complexity and low cost.
Description
Technical Field
The invention belongs to the fields of pattern recognition and acoustics, and particularly relates to end-to-end speech recognition technology.
Background
Speech recognition is a very important research topic in acoustics. The traditional speech recognition model is a hybrid system built on hidden Markov models, comprising an acoustic model, a language model, and a pronunciation dictionary, but such models are complex and difficult to design. End-to-end speech recognition is now the mainstream approach; compared with traditional methods such as hidden-Markov-model hybrid systems, it simplifies model design, training, and decoding. However, this improvement incurs greater computational cost: many advanced ASR architectures adopt attention-based encoder-decoder structures, which require substantial computation and larger model sizes. In addition, the decoder operates autoregressively and must compute sequentially, i.e., generation of the next token can begin only after the previous token is finished. The Connectionist Temporal Classification (CTC) model proposed in 2006 needs no separate decoder, is more compact and faster by design, and is well suited to speech recognition. In recent years, researchers have improved CTC performance by modifying the model structure and applying pre-training, but owing to CTC's conditional-independence assumption its performance still lags behind encoder-decoder models. This is commonly remedied by adding an external language model and performing beam search, but both remedies bring high complexity and high cost, so it remains difficult to improve CTC performance in a low-complexity, low-cost way.
Speech recognition technology has a wide range of application settings. For example: with voice commands, a user can issue instructions directly to devices or software by voice, which suits search scenarios such as video websites and smart hardware; in games and entertainment, speech recognition converts speech into text, meeting users' diverse chat needs; and it has important applications in subtitle generation and meeting minutes. As speech recognition becomes widely used in production and daily life, the design of low-complexity, low-cost, high-performance speech recognition models is especially important.
Disclosure of Invention
Aiming at the high complexity and high cost of the existing remedies for the CTC model's conditional-independence assumption (adding an external language model and performing beam search), the invention provides a speech recognition method based on CTC multi-layer loss. The method standardizes the outputs of different layers of the speech recognition network so that each layer's output is as close as possible to the desired recognition result, thereby improving recognition performance. Standardizing the outputs of the different layers has different value for improving the performance of the CTC multi-layer speech recognition network g, and these values are represented by different weights. To avoid the influence of setting the layer weights subjectively by hand, the weights are learned and computed through network training. During training, the outputs of the different network layers are obtained, the CTC losses and the weights of the different layers are computed, and the layer losses are weighted and summed to give the final CTC multi-layer loss; the model parameters are then updated by gradient descent until the model converges, yielding the final model. The method changes only the loss function of the CTC model's training stage and leaves the model structure and test procedure unchanged, so no additional cost is incurred at test time. It is easy to implement and effectively improves CTC speech recognition performance without increasing the model's complexity or cost.
The invention provides a speech recognition method based on CTC multi-layer loss, characterized by comprising a model training stage and a model testing stage. As shown in FIG. 1, the training stage comprises training speech preprocessing, CTC multi-layer speech recognition network construction, different-layer probability calculation, different-layer CTC loss calculation, different-layer weight calculation, CTC multi-layer loss calculation, and parameter updating until model convergence. The model testing stage comprises test speech preprocessing and test speech recognition.
1) Model training stage:
1-1) Training speech preprocessing:
The training set is S = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, meaning there are N training samples, where the i-th training sample is denoted (x_i, y_i); x_i is an input sample and y_i is the corresponding ground-truth label, i.e., the text corresponding to speech x_i. Each speech x_i in the training set is divided into T_i frames and its mel-cepstral features are computed, yielding the preprocessed training speech.
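As a concrete illustration, the framing and mel-cepstral feature extraction can be sketched in a few lines of Python. The patent names no implementation framework; the sketch below assumes PyTorch/torchaudio, 16 kHz audio, 13-dimensional MFCCs, and 25 ms frames with a 10 ms hop, all illustrative choices rather than values fixed by the patent.

```python
import torchaudio

def preprocess(wav_path):
    # Load the waveform and compute mel-cepstral (MFCC) features frame by frame.
    waveform, sample_rate = torchaudio.load(wav_path)
    mfcc = torchaudio.transforms.MFCC(
        sample_rate=sample_rate,
        n_mfcc=13,                                     # feature dimension (assumed)
        melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 40},
    )(waveform)                                        # (channels, n_mfcc, T_i)
    return mfcc.mean(dim=0).transpose(0, 1)            # (T_i, n_mfcc): one vector per frame
```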
1-2) CTC multilayer speech recognition network construction:
A CTC multi-layer speech recognition network g is constructed. As shown in FIG. 2, network g comprises L Transformer layers and 1 softmax layer; the overall network parameter is φ, and L ranges from 20 to 28. The network input is defined as the preprocessed training speech x_i, and the network output is the text information y_i corresponding to the training speech. The network function is
y_i = g(φ; x_i), i = 1, 2, …, N
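A minimal PyTorch sketch of network g follows. The feature dimension, model width, attention-head count, and vocabulary size are illustrative assumptions, as is sharing a single output projection across layers; the patent specifies only L Transformer layers followed by a softmax layer, with each layer's output x_l exposed for the multi-layer loss.

```python
import torch
import torch.nn as nn

class CTCMultiLayerNet(nn.Module):
    # Network g: L Transformer layers whose per-layer outputs x_l are all
    # mapped to frame-level label distributions for the multi-layer CTC loss.
    def __init__(self, feat_dim=13, d_model=256, nhead=4, L=24, vocab_size=4233):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(L)
        )
        self.out = nn.Linear(d_model, vocab_size)      # feeds the softmax layer

    def forward(self, x):                              # x: (batch, T, feat_dim)
        h = self.proj(x)
        log_probs_per_layer = []
        for layer in self.layers:
            h = layer(h)                               # h is x_l after layer l
            log_probs_per_layer.append(self.out(h).log_softmax(dim=-1))
        return log_probs_per_layer                     # L tensors of (batch, T, vocab)
```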
1-3) Different layer probability calculation:
The preprocessed training speech x_i is input into network g. Let x_l denote the output of training speech x_i after the l-th layer of network g; the probability P(y_i | x_l) that the ground-truth label y_i can be decoded from x_l is calculated as

P(y_i | x_l) = Σ_{q ∈ B^{-1}(y_i)} P(q | x_l), l = 1, 2, …, L

This probability is called the different-layer probability, where B^{-1}(y_i) is the set of alignments of length T_i consistent with y_i, i.e., all paths (including blank labels) by which x_l maps to y_i, and q is one such path.
1-4) Calculation of CTC loss at different layers:
The CTC loss function is defined as the sum of the negative logarithms of the probabilities P(y_i | x_l) of decoding the ground-truth labels. From the different-layer probabilities obtained in 1-3), the CTC losses of the different layers are calculated as

L_CTC_l = -ln P(y_i | x_l), l = 1, 2, …, L
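A sketch of the per-layer loss, assuming PyTorch's built-in torch.nn.CTCLoss (which computes -ln P(y|x) via the forward algorithm) is used, with blank index 0 as an assumed convention:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)          # blank id 0 is an assumption

def per_layer_ctc_losses(log_probs_per_layer, targets, input_lens, target_lens):
    # L_CTC_l = -ln P(y_i | x_l) for each layer l; CTCLoss expects
    # log-probabilities shaped (T, batch, vocab), hence the transpose.
    losses = [ctc(lp.transpose(0, 1), targets, input_lens, target_lens)
              for lp in log_probs_per_layer]
    return torch.stack(losses)                         # shape (L,): one loss per layer
```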
1-5) Different layer weight calculation:
The preprocessed training speech x_i is input into a weight calculation network f to obtain the weights α_l (l = 1, 2, …, L) of the different layers:

(α_1, α_2, …, α_L) = f(ψ; x_i)

where ψ denotes the parameters of the weight calculation network f, Σ_{l=1}^{L} α_l = 1, and 0 ≤ α_l ≤ 1. As shown in FIG. 3, the weight calculation network f consists of a convolutional neural network (CNN), a pooling layer, fully connected layers, and a softmax layer, where the number of CNN layers can be set to 4 to 6 and the number of fully connected layers to 3 to 5. The weights of the different layers represent the different value that standardizing each layer's output has for improving the performance of the CTC multi-layer speech recognition network g. Computing the weights with network f avoids setting them subjectively by hand.
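A sketch of one possible weight calculation network f, with 4 CNN layers and 3 fully connected layers (the low ends of the stated ranges); channel widths and kernel sizes are illustrative assumptions. The final softmax guarantees the constraints Σ_l α_l = 1 and 0 ≤ α_l ≤ 1.

```python
import torch.nn as nn

class WeightNet(nn.Module):
    # Network f: CNN -> pooling -> fully connected -> softmax, producing the
    # layer weights alpha_1..alpha_L for a given utterance.
    def __init__(self, feat_dim=13, L=24):
        super().__init__()
        self.cnn = nn.Sequential(                      # 4 convolutional layers
            nn.Conv1d(feat_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool1d(1)            # pooling layer
        self.fc = nn.Sequential(                       # 3 fully connected layers
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, L),
        )

    def forward(self, x):                              # x: (batch, T, feat_dim)
        h = self.pool(self.cnn(x.transpose(1, 2))).squeeze(-1)   # (batch, 64)
        return self.fc(h).softmax(dim=-1)              # alpha: (batch, L), sums to 1
```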
1-6) CTC multilayer loss calculation:
As shown in FIG. 2, the CTC losses of the different layers are weighted by the layer weights α_l and summed to obtain the CTC multi-layer loss, namely

L_MultiLayer_CTC = Σ_{l=1}^{L} α_l · L_CTC_l

where L_MultiLayer_CTC is the CTC multi-layer loss.
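Combining the two previous sketches, the multi-layer loss is a one-line weighted sum. Averaging α over the batch is one reasonable reading: the patent computes α per utterance but does not state how weights are aggregated across a mini-batch.

```python
def multilayer_ctc_loss(layer_losses, alpha):
    # L_MultiLayer_CTC = sum_l alpha_l * L_CTC_l.
    # layer_losses: (L,) from per_layer_ctc_losses; alpha: (batch, L) from WeightNet.
    return (alpha.mean(dim=0) * layer_losses).sum()
```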
1-7) Parameter updating and model convergence:
Using the gradient descent method, L_MultiLayer_CTC is minimized and the network parameters φ and ψ are updated. Steps 1-3) through 1-6) are repeated, computing the CTC multi-layer loss L_MultiLayer_CTC and updating the network parameters φ and ψ, until L_MultiLayer_CTC falls below the threshold 0.001 and the model converges. After training, the trained CTC multi-layer speech recognition network g and weight calculation network f are obtained.
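The whole training stage can then be sketched as follows, reusing the classes and functions above. The optimizer (Adam), learning rate, and the hypothetical train_loader yielding padded batches are assumptions; the patent specifies only gradient descent and the 0.001 convergence threshold.

```python
import torch

g, f = CTCMultiLayerNet(), WeightNet()
opt = torch.optim.Adam(list(g.parameters()) + list(f.parameters()), lr=1e-4)

loss = torch.tensor(float("inf"))
while loss.item() >= 0.001:                            # repeat 1-3) through 1-6)
    for x, targets, input_lens, target_lens in train_loader:  # hypothetical loader
        log_probs_per_layer = g(x)                     # 1-3) per-layer outputs
        layer_losses = per_layer_ctc_losses(
            log_probs_per_layer, targets, input_lens, target_lens)  # 1-4)
        alpha = f(x)                                   # 1-5) layer weights
        loss = multilayer_ctc_loss(layer_losses, alpha)             # 1-6)
        opt.zero_grad()
        loss.backward()                                # 1-7) update phi and psi
        opt.step()
```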
2) Model test stage:
2-1) Test speech preprocessing:
The i-th test speech sample is framed and its mel-cepstral features are computed, yielding the preprocessed test speech.
2-2) Test speech recognition:
The preprocessed test speech is input into the trained CTC multi-layer speech recognition network g to obtain the speech recognition result.
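For completeness, a hedged sketch of the test stage: the patent does not name a decoding strategy, so the example below uses greedy CTC decoding (best label per frame, collapse repeats, drop blanks), the simplest choice consistent with a plain CTC model.

```python
import torch

def recognize(g, feats):
    # feats: (T, feat_dim) preprocessed test speech; returns decoded token ids.
    g.eval()
    with torch.no_grad():
        log_probs = g(feats.unsqueeze(0))[-1][0]       # top-layer output, (T, vocab)
    ids, hyp, prev = log_probs.argmax(dim=-1).tolist(), [], None
    for i in ids:
        if i != prev and i != 0:                       # 0 = blank (assumed convention)
            hyp.append(i)
        prev = i
    return hyp                                         # map to text via the vocabulary
```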
Drawings
FIG. 1 shows the steps of the method of the present invention.
FIG. 2 shows the overall architecture of the method of the present invention.
FIG. 3 shows the composition of the weight calculation network.
Detailed Description
The invention provides a speech recognition method based on CTC multi-layer loss, characterized by comprising a model training stage and a model testing stage. As shown in FIG. 1, the model training stage comprises training speech preprocessing, CTC multi-layer speech recognition network construction, different-layer probability calculation, different-layer CTC loss calculation, different-layer weight calculation, CTC multi-layer loss calculation, and parameter updating until model convergence. The model testing stage comprises test speech preprocessing and test speech recognition. The specific implementation is described below.
1) Model training stage:
1-1) Training speech preprocessing:
The training set is S = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, meaning there are N training samples, where the i-th training sample is denoted (x_i, y_i); x_i is an input sample and y_i is the corresponding ground-truth label, i.e., the text corresponding to speech x_i. Each speech x_i in the training set is divided into T_i frames and its mel-cepstral features are computed, yielding the preprocessed training speech.
In this embodiment, the AISHELL-1 dataset is used for training; its 150-hour training set comprises 120,098 utterances.
1-2) CTC multilayer speech recognition network construction:
A CTC multi-layer speech recognition network g is constructed. As shown in FIG. 2, network g comprises L Transformer layers and 1 softmax layer; the overall network parameter is φ, and L ranges from 12 to 32. In this embodiment, L is set to 24. The network input is defined as the preprocessed training speech x_i, and the network output is the text information y_i corresponding to the speech. The network function is
y_i = g(φ; x_i), i = 1, 2, …, N
1-3) Different layer probability calculation:
The preprocessed training speech x_i is input into network g. Let x_l denote the output of training speech x_i after the l-th layer of network g; the probability P(y_i | x_l) that the ground-truth label y_i can be decoded from x_l is calculated as

P(y_i | x_l) = Σ_{q ∈ B^{-1}(y_i)} P(q | x_l), l = 1, 2, …, L

This probability is called the different-layer probability, where B^{-1}(y_i) is the set of alignments of length T_i consistent with y_i, i.e., all paths (including blank labels) by which x_l maps to y_i, and q is one such path.
1-4) Calculation of CTC loss at different layers:
The CTC loss function is defined as the sum of the negative logarithms of the probabilities P(y_i | x_l) of decoding the ground-truth labels. From the different-layer probabilities obtained in 1-3), the CTC losses of the different layers are calculated as

L_CTC_l = -ln P(y_i | x_l), l = 1, 2, …, L
In this embodiment, CTC loss is calculated for each layer of the multi-layer speech recognition network g.
1-5) Different layer weight calculation:
The preprocessed training speech x_i is input into a weight calculation network f to obtain the weights α_l (l = 1, 2, …, L) of the different layers:

(α_1, α_2, …, α_L) = f(ψ; x_i)

where ψ denotes the parameters of the weight calculation network f, Σ_{l=1}^{L} α_l = 1, and 0 ≤ α_l ≤ 1. As shown in FIG. 3, the weight calculation network f consists of a convolutional neural network (CNN), a pooling layer, fully connected layers, and a softmax layer, where the number of CNN layers can be set to 4 to 8 and the number of fully connected layers to 3 to 6. The weights of the different layers represent the different value that standardizing each layer's output has for improving the performance of the CTC multi-layer speech recognition network g. The weights are obtained by automatic learning during the training of network f, avoiding subjective manual setting.
In this embodiment, the weight calculation network f consists of a 4-layer convolutional neural network (CNN), a pooling layer, 3 fully connected layers, and a softmax layer. The trained network f automatically computes the weight parameters α_l, which exhibit high weights for the upper layers and low weights for the lower layers.
1-6) CTC multilayer loss calculation:
As shown in FIG. 2, the CTC losses of the different layers are weighted by the layer weights α_l and summed to obtain the CTC multi-layer loss, namely

L_MultiLayer_CTC = Σ_{l=1}^{L} α_l · L_CTC_l

where L_MultiLayer_CTC is the CTC multi-layer loss.
1-7) Parameter updating and model convergence:
Using the gradient descent method, L_MultiLayer_CTC is minimized and the network parameters φ and ψ are updated. Steps 1-3) through 1-6) are repeated, computing the CTC multi-layer loss L_MultiLayer_CTC and updating the network parameters φ and ψ, until L_MultiLayer_CTC falls below the threshold 0.001. After training, the trained CTC multi-layer speech recognition network g and weight calculation network f are obtained.
2) Model test stage:
2-1) Test speech preprocessing:
The i-th test speech sample is framed and its mel-cepstral features are computed, yielding the preprocessed test speech.
In this embodiment, testing uses the AISHELL-1 test set, which comprises 7,176 utterances in about 5 hours of speech.
2-2) Test speech recognition:
The preprocessed test speech is input into the trained CTC multi-layer speech recognition network g to obtain the speech recognition result.
The specific embodiments described herein are offered by way of example only to illustrate the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments or substitutions thereof without departing from the spirit of the invention or exceeding the scope of the invention as defined in the accompanying claims.
Claims (1)
1. A speech recognition method based on CTC multi-layer loss, characterized by comprising a model training stage and a model testing stage; the model training stage comprises training speech preprocessing, CTC multi-layer speech recognition network construction, different-layer probability calculation, different-layer CTC loss calculation, different-layer weight calculation, CTC multi-layer loss calculation, and parameter updating until model convergence; the model testing stage comprises test speech preprocessing and test speech recognition;
The method specifically comprises the following steps:
1) Model training stage:
1-1) Training speech preprocessing:
the training set is S = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, representing N training samples, where the i-th training sample is denoted (x_i, y_i); x_i is an input sample and y_i is the corresponding ground-truth label, i.e., the text corresponding to speech x_i;
1-2) CTC multilayer speech recognition network construction:
constructing a CTC multi-layer speech recognition network g, wherein network g comprises L Transformer layers and 1 softmax layer, the overall network parameter is φ, and L ranges from 20 to 28; defining the network input as the preprocessed training speech x_i and the network output as the text information y_i corresponding to the training speech; the network function being
y_i = g(φ; x_i), where i = 1, 2, …, N;
1-3) different layer probability calculation:
inputting the preprocessed training speech x_i into network g, letting x_l denote the output of training speech x_i after the l-th layer of network g, and calculating the probability P(y_i | x_l) that the ground-truth label y_i can be decoded from x_l as

P(y_i | x_l) = Σ_{q ∈ B^{-1}(y_i)} P(q | x_l), where l = 1, 2, …, L;

this probability being called the different-layer probability, where B^{-1}(y_i) is the set of alignments of length T_i consistent with y_i, i.e., all paths (including blank labels) by which x_l maps to y_i, and q is one such path;
1-4) calculation of CTC loss at different layers:
the CTC loss function being defined as the sum of the negative logarithms of the probabilities P(y_i | x_l) of decoding the ground-truth labels; from the different-layer probabilities obtained in 1-3), calculating the CTC losses of the different layers as L_CTC_l = -ln P(y_i | x_l), where l = 1, 2, …, L;
1-5) different layer weight calculation:
inputting the preprocessed training speech x_i into a weight calculation network f to obtain the weights α_l of the different layers, where l = 1, 2, …, L:

(α_1, α_2, …, α_L) = f(ψ; x_i)

where ψ denotes the parameters of the weight calculation network f, Σ_{l=1}^{L} α_l = 1, and 0 ≤ α_l ≤ 1; the weight calculation network f consisting of a CNN network, a pooling layer, fully connected layers, and a softmax layer, wherein the number of CNN layers is set to 4 to 6 and the number of fully connected layers is set to 3 to 5; the weights of the different layers representing the different value that standardizing each layer's output has for improving the performance of the CTC multi-layer speech recognition network g; the weights being computed by network f;
1-6) CTC multilayer loss calculation:
weighting the CTC losses of the different layers by the layer weights α_l and summing them to obtain the CTC multi-layer loss, namely

L_MultiLayer_CTC = Σ_{l=1}^{L} α_l · L_CTC_l

where L_MultiLayer_CTC is the CTC multi-layer loss;
1-7) Parameter updating and model convergence:
using the gradient descent method to minimize L_MultiLayer_CTC and update the network parameters φ and ψ; repeating steps 1-3) through 1-6), computing the CTC multi-layer loss L_MultiLayer_CTC and updating the network parameters φ and ψ, until L_MultiLayer_CTC is smaller than a threshold of 0.001 and the model converges; after training, obtaining the trained CTC multi-layer speech recognition network g and weight calculation network f;
2) Model test stage:
2-1) Test speech preprocessing:
framing the i-th test speech sample and computing its mel-cepstral features to obtain the preprocessed test speech;
2-2) test speech recognition:
inputting the preprocessed test speech into the trained CTC multi-layer speech recognition network g to obtain the speech recognition result.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210619908.5A | 2022-06-02 | 2022-06-02 | Voice recognition method based on CTC multilayer loss
Publications (2)

Publication Number | Publication Date
---|---
CN115019782A | 2022-09-06
CN115019782B | 2024-07-16
Family ID: 83072786
Citations (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN111968629A | 2020-07-08 | 2020-11-20 | 重庆邮电大学 | Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC
CN113488028A | 2021-06-23 | 2021-10-08 | 中科极限元(杭州)智能科技股份有限公司 | Speech transcription recognition training decoding method and system based on rapid skip decoding

Family Cites Families (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
US10593321B2 | 2017-12-15 | 2020-03-17 | Mitsubishi Electric Research Laboratories, Inc. | Method and apparatus for multi-lingual end-to-end speech recognition
CN114023316B | 2021-11-04 | 2023-07-21 | 匀熵科技(无锡)有限公司 | TCN-Transformer-CTC-based end-to-end Chinese speech recognition method
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant