CN115019782B - Voice recognition method based on CTC multilayer loss
- Publication number
- CN115019782B (application CN202210619908.5A)
- Authority
- CN
- China
- Prior art keywords
- ctc
- network
- layer
- speech recognition
- speech
- Prior art date
- Legal status: Active (an assumption, not a legal conclusion)
Classifications
- G10L15/063: Speech recognition; Creation of reference templates; Training of speech recognition systems (e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/01: Assessment or evaluation of speech recognition systems
- G10L15/16: Speech classification or search using artificial neural networks
- G10L15/28: Constructional details of speech recognition systems
- Y02T10/40: Engine management systems
Abstract
A speech recognition method based on CTC multi-layer loss, belonging to the fields of pattern recognition and acoustics. The method regularizes the outputs of different layers of the speech recognition network so that each layer's output is as close as possible to the desired recognition result, thereby improving recognition performance. The method comprises two stages, model training and model testing. In the training stage, the preprocessed training set is fed into the constructed multi-layer speech recognition network; the per-layer losses and per-layer weights are computed, and the weighted sum of the per-layer losses gives the multi-layer loss. The loss is computed iteratively and the network parameters are updated until convergence. In the testing stage, the preprocessed test set is fed into the trained multi-layer speech recognition network, which outputs the recognition result. The invention changes only the loss function of the CTC model's training stage and alters neither the structure of the CTC speech recognition model nor its recognition process, improving recognition accuracy with low complexity and low overhead.
Description
Technical Field

The present invention belongs to the fields of pattern recognition and acoustics, and in particular relates to end-to-end speech recognition technology.

Background

Speech recognition is an important research topic in acoustics. Traditional speech recognition models are hybrid systems based on hidden Markov models, comprising an acoustic model, a language model, and a pronunciation dictionary; such systems are complex and difficult to design. End-to-end speech recognition is now the mainstream approach: compared with traditional methods such as HMM-based hybrid systems, it simplifies model design, training, and decoding. This improvement, however, comes at a higher computational cost. Many state-of-the-art ASR architectures adopt attention-based encoder-decoder structures, which require substantial computation and large model sizes. Moreover, the decoder operates autoregressively and must compute sequentially: generation of the next token can begin only after the previous token is complete. The Connectionist Temporal Classification (CTC) model, proposed in 2006, needs no separate decoder; its design is more compact and faster, making it well suited to speech recognition. In recent years researchers have improved CTC performance by modifying the model structure and by pre-training, but because of the CTC model's conditional-independence assumption its performance still lags behind encoder-decoder models. Overcoming that assumption typically requires adding an external language model and performing beam search, both of which bring high complexity and high overhead, so it is difficult to improve CTC performance in a low-complexity, low-overhead way.

Speech recognition technology has many application environments. With voice commands, users can issue instructions directly to devices or software, which suits search scenarios such as video websites and smart hardware. In games and entertainment, speech recognition converts speech into text, meeting users' diverse chat needs; it also has important applications in subtitle generation and meeting minutes. As speech recognition is used ever more widely in production and daily life, the design of low-complexity, low-overhead, high-performance speech recognition models is especially important.
Summary of the Invention

Current approaches to the conditional-independence problem of the CTC model (adding an external language model and performing beam search) suffer from high complexity and high overhead. The present invention therefore proposes a speech recognition method based on CTC multi-layer loss. The method regularizes the outputs of different layers of the speech recognition network so that each layer's output is as close as possible to the desired recognition result, thereby improving recognition performance. Regularizing the outputs of different layers contributes differently to the performance of the CTC multi-layer speech recognition network g, and these contributions are expressed as per-layer weights. To avoid the bias introduced by setting the weights subjectively by hand, the weights are learned through network training. During training, the method obtains the outputs of the different layers of the network, computes the per-layer CTC losses and per-layer weights, and takes the weighted sum of the per-layer losses to obtain the final CTC multi-layer loss; gradient descent then updates the model parameters until the model converges, yielding the final model. Because the method changes only the loss function of the CTC model's training stage and alters neither the model's structure nor its testing procedure, the testing stage incurs no extra overhead. The method is easy to implement and effectively improves CTC speech recognition performance without increasing the CTC model's complexity or overhead.

The present invention proposes a speech recognition method based on CTC multi-layer loss, characterized by a model training stage and a model testing stage, as shown in Figure 1. The training stage comprises training-speech preprocessing, construction of the CTC multi-layer speech recognition network, per-layer probability calculation, per-layer loss calculation, per-layer weight calculation, CTC multi-layer loss calculation, and parameter updating until convergence. The testing stage comprises test-speech preprocessing and test speech recognition.
1) Model training stage:

1-1) Training-speech preprocessing:

The training set is S = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, containing N training samples. The i-th sample is (x_i, y_i), where x_i is an input utterance and y_i is the corresponding ground-truth label, i.e., the transcript of x_i. Each utterance x_i in the training set is split into T_i frames and its Mel-frequency cepstral features are computed, yielding the preprocessed training speech.
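As a minimal sketch of this preprocessing step in Python (the 25 ms window, 10 ms hop, and 13 cepstral coefficients are assumed values that the patent does not fix):

```python
import librosa

def preprocess_utterance(wav_path, sr=16000, n_mfcc=13,
                         frame_len=0.025, hop_len=0.010):
    """Split an utterance into frames and compute its Mel-cepstral
    (MFCC) features; returns a (T_i, n_mfcc) array, one row per frame."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(frame_len * sr),     # 25 ms analysis window
        hop_length=int(hop_len * sr))  # 10 ms frame shift
    return mfcc.T                      # (T_i frames, n_mfcc coefficients)
```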
1-2) Construction of the CTC multi-layer speech recognition network:

Construct a CTC multi-layer speech recognition network g. As shown in Figure 2, g comprises L Transformer layers and one softmax layer; the parameters of the whole network are denoted φ, and L ranges from 20 to 28. The network's input is the preprocessed training speech x_i, and its output is the corresponding transcript y_i. The network function is

y_i = g(φ; x_i), i = 1, 2, …, N
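A minimal PyTorch sketch of such a network is given below; it keeps every Transformer layer's output so that the per-layer quantities of steps 1-3) to 1-6) can be computed. The model dimension, head count, vocabulary size (4233 is the usual aishell-1 character inventory, used here only as a placeholder), and the sharing of a single output projection across layers are assumptions, not details fixed by the patent:

```python
import torch
import torch.nn as nn

class CTCMultiLayerNet(nn.Module):
    """Network g: an input projection, L Transformer layers, and an
    output projection feeding the softmax; forward() returns the
    frame-wise log-probabilities of every layer, not only the last."""
    def __init__(self, input_dim=13, n_layers=24, d_model=256,
                 n_heads=4, vocab_size=4233):
        super().__init__()
        self.proj = nn.Linear(input_dim, d_model)  # features -> model dim
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)])
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, x):              # x: (batch, T, input_dim)
        x = self.proj(x)
        per_layer_logprobs = []
        for layer in self.layers:
            x = layer(x)
            # frame-wise log-softmax over the vocabulary at layer l
            per_layer_logprobs.append(
                torch.log_softmax(self.out(x), dim=-1))
        return per_layer_logprobs      # list of L tensors (batch, T, V)
```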
1-3) Per-layer probability calculation:

The preprocessed training speech x_i is fed into network g. Let x_l denote the output of the l-th layer of g for x_i. The probability P(y_i | x_l) that the true label y_i can be decoded from x_l is

P(y_i | x_l) = Σ_{q ∈ B^{-1}(y_i)} Π_{t=1}^{T_i} P(q_t | x_l)

This probability is called the per-layer probability. Here B^{-1}(y_i) is the set of alignments of length T_i consistent with y_i, i.e., all paths (blank tokens included) by which x_l maps to y_i, and q is one such path.
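For intuition, the sum over B^{-1}(y_i) can be evaluated by brute force on a toy example; the sketch below is exponential in T_i and is for illustration only, since practical systems compute it with the CTC forward algorithm:

```python
import itertools
import numpy as np

def ctc_prob(probs, target, blank=0):
    """P(y_i | x_l): sum, over all length-T paths q that collapse to
    `target` under B (merge repeats, then delete blanks), of the
    product of per-frame probabilities. probs: (T, C) posteriors."""
    T, C = probs.shape
    total = 0.0
    for q in itertools.product(range(C), repeat=T):
        collapsed, prev = [], None
        for s in q:
            if s != prev and s != blank:
                collapsed.append(s)
            prev = s
        if collapsed == list(target):
            total += np.prod([probs[t, s] for t, s in enumerate(q)])
    return total

# Toy check: 2 frames, classes {blank, 'a'}, uniform posteriors.
# Paths collapsing to 'a': (a,blank), (blank,a), (a,a) -> 3 * 0.25
probs = np.full((2, 2), 0.5)
print(ctc_prob(probs, [1]))  # 0.75
```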
1-4) Per-layer CTC loss calculation:

The CTC loss function is defined as the sum of the negative logarithms of the probabilities P(y_i | x_l) of decoding the true labels. From the per-layer probabilities of step 1-3), the per-layer CTC losses are

L_CTC_l = -ln P(y_i | x_l), l = 1, 2, …, L
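In practice this negative log-probability is not computed by enumeration; a standard CTC loss routine evaluates it with the forward algorithm. A sketch using torch.nn.CTCLoss (an assumed but conventional implementation choice) follows:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, reduction='mean')

def per_layer_ctc_losses(per_layer_logprobs, targets,
                         input_lengths, target_lengths):
    """L_CTC_l = -ln P(y_i | x_l) for every layer l of network g.
    per_layer_logprobs: list of (batch, T, V) log-probability tensors."""
    losses = []
    for logp in per_layer_logprobs:
        # nn.CTCLoss expects (T, batch, V) log-probabilities
        losses.append(ctc(logp.transpose(0, 1), targets,
                          input_lengths, target_lengths))
    return torch.stack(losses)         # shape (L,), one loss per layer
```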
1-5) Per-layer weight calculation:

The preprocessed training speech x_i is fed into the weight-calculation network f, yielding the per-layer weights α_l (l = 1, 2, …, L):

(α_1, α_2, …, α_L) = f(ψ; x_i)

where ψ denotes the parameters of f, Σ_{l=1}^{L} α_l = 1, and 0 ≤ α_l ≤ 1. As shown in Figure 3, the weight-calculation network f consists of a convolutional neural network (CNN), a pooling layer, fully connected layers, and a softmax layer; the CNN may have 4 to 6 layers and the fully connected part 3 to 5 layers. The per-layer weights express that regularizing the outputs of different layers contributes differently to the performance of the CTC multi-layer speech recognition network g. Computing the weights with network f avoids setting them subjectively by hand.
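One possible realization of f is sketched below, assuming the 4 convolutional and 3 fully connected layers of the embodiment described later; the channel counts, pooling size, and hidden width are placeholders:

```python
import torch
import torch.nn as nn

class WeightNet(nn.Module):
    """Network f: CNN -> pooling -> fully connected -> softmax, mapping
    an utterance's features to L weights alpha_l; the softmax guarantees
    sum(alpha) = 1 and 0 <= alpha_l <= 1."""
    def __init__(self, n_weights=24, hidden=128):
        super().__init__()
        convs, ch = [], 1
        for _ in range(4):                       # 4 CNN layers
            convs += [nn.Conv2d(ch, 32, 3, padding=1), nn.ReLU()]
            ch = 32
        self.cnn = nn.Sequential(*convs)
        self.pool = nn.AdaptiveAvgPool2d(4)      # pooling layer
        self.fc = nn.Sequential(                 # 3 fully connected layers
            nn.Linear(32 * 4 * 4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_weights))

    def forward(self, feats):                    # feats: (batch, T, n_mfcc)
        h = self.cnn(feats.unsqueeze(1))         # add a channel dimension
        h = self.pool(h).flatten(1)
        return torch.softmax(self.fc(h), dim=-1) # (batch, L) weights
```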
1-6) CTC multi-layer loss calculation:

As shown in Figure 2, the per-layer CTC losses are weighted by α_l and summed to obtain the CTC multi-layer loss:

L_MultiLayer_CTC = Σ_{l=1}^{L} α_l · L_CTC_l

where L_MultiLayer_CTC denotes the CTC multi-layer loss.
1-7) Parameter updating until convergence:

Using gradient descent, minimize L_MultiLayer_CTC, updating the network parameters φ and ψ. Repeat steps 1-3) through 1-6), computing the CTC multi-layer loss and updating φ and ψ, until L_MultiLayer_CTC falls below the threshold 0.001; the model has then converged, training is complete, and the trained CTC multi-layer speech recognition network g and weight-calculation network f are obtained.
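Putting steps 1-3) through 1-7) together, a minimal training-loop sketch built on the components above might read as follows. The Adam optimizer (a gradient-descent variant), the learning rate, the batch-averaging of the weights, and the train_loader are assumptions; the 0.001 threshold is the one stated above:

```python
import torch

g = CTCMultiLayerNet()
f = WeightNet()
opt = torch.optim.Adam(list(g.parameters()) + list(f.parameters()), lr=1e-4)

converged = False
while not converged:
    # train_loader is an assumed DataLoader yielding padded MFCC batches
    for feats, targets, in_lens, tgt_lens in train_loader:
        per_layer_logprobs = g(feats)                      # step 1-3)
        losses = per_layer_ctc_losses(per_layer_logprobs,  # step 1-4)
                                      targets, in_lens, tgt_lens)
        alpha = f(feats).mean(dim=0)     # step 1-5), batch-averaged (L,)
        multilayer_loss = (alpha * losses).sum()           # step 1-6)
        opt.zero_grad()
        multilayer_loss.backward()       # step 1-7): gradient update
        opt.step()
        if multilayer_loss.item() < 0.001:                 # convergence
            converged = True
            break
```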
2) Model testing stage:

2-1) Test-speech preprocessing:

The i-th sample of the test speech is split into frames and its Mel-frequency cepstral features are computed, yielding the preprocessed test speech.
2-2) Test speech recognition:

The preprocessed test speech is fed into the trained CTC multi-layer speech recognition network g to obtain the speech recognition result.
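At test time only the final layer's output is needed. A minimal greedy (best-path) decoding sketch is given below; the patent does not prescribe a decoding strategy, so greedy decoding is an assumption:

```python
import torch

def greedy_decode(g, feats, blank=0):
    """Best-path CTC decoding on the final layer: take the per-frame
    argmax, merge repeated tokens, then delete blanks."""
    g.eval()
    with torch.no_grad():
        logp = g(feats)[-1]            # last layer only, (batch, T, V)
    ids = logp.argmax(dim=-1)          # (batch, T)
    results = []
    for seq in ids:
        out, prev = [], blank
        for s in seq.tolist():
            if s != prev and s != blank:
                out.append(s)
            prev = s
        results.append(out)            # token ids; map to characters next
    return results
```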
Brief Description of the Drawings

Figure 1 shows the steps of the method of the present invention.

Figure 2 shows the overall architecture of the method.

Figure 3 shows the composition of the weight-calculation network.
Detailed Description of Embodiments

The present invention proposes a speech recognition method based on CTC multi-layer loss, comprising a model training stage and a model testing stage, as shown in Figure 1. The training stage comprises training-speech preprocessing, construction of the CTC multi-layer speech recognition network, per-layer probability calculation, per-layer loss calculation, per-layer weight calculation, CTC multi-layer loss calculation, and parameter updating until convergence. The testing stage comprises test-speech preprocessing and test speech recognition. The specific implementation is described below.
1) Model training stage:

1-1) Training-speech preprocessing:

The training set is S = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, containing N training samples. The i-th sample is (x_i, y_i), where x_i is an input utterance and y_i is the corresponding ground-truth label, i.e., the transcript of x_i. Each utterance x_i in the training set is split into T_i frames and its Mel-frequency cepstral features are computed, yielding the preprocessed training speech.

In this embodiment, the aishell-1 dataset is used for training; its 150-hour training set contains 120,098 utterances.
1-2) Construction of the CTC multi-layer speech recognition network:

Construct a CTC multi-layer speech recognition network g. As shown in Figure 2, g comprises L Transformer layers and one softmax layer; the parameters of the whole network are denoted φ, and L ranges from 12 to 32. In this embodiment, L is set to 24. The network's input is the preprocessed training speech x_i, and its output is the corresponding transcript y_i. The network function is

y_i = g(φ; x_i), i = 1, 2, …, N
1-3) Per-layer probability calculation:

The preprocessed training speech x_i is fed into network g. Let x_l denote the output of the l-th layer of g for x_i. The probability P(y_i | x_l) that the true label y_i can be decoded from x_l is

P(y_i | x_l) = Σ_{q ∈ B^{-1}(y_i)} Π_{t=1}^{T_i} P(q_t | x_l)

This probability is called the per-layer probability. Here B^{-1}(y_i) is the set of alignments of length T_i consistent with y_i, i.e., all paths (blank tokens included) by which x_l maps to y_i, and q is one such path.
1-4) Per-layer CTC loss calculation:

The CTC loss function is defined as the sum of the negative logarithms of the probabilities P(y_i | x_l) of decoding the true labels. From the per-layer probabilities of step 1-3), the per-layer CTC losses are

L_CTC_l = -ln P(y_i | x_l), l = 1, 2, …, L

In this embodiment, the CTC loss of every layer of the multi-layer speech recognition network g is computed.
1-5) Per-layer weight calculation:

The preprocessed training speech x_i is fed into the weight-calculation network f, yielding the per-layer weights α_l (l = 1, 2, …, L):

(α_1, α_2, …, α_L) = f(ψ; x_i)

where ψ denotes the parameters of f, Σ_{l=1}^{L} α_l = 1, and 0 ≤ α_l ≤ 1. As shown in Figure 3, the weight-calculation network f consists of a convolutional neural network (CNN), a pooling layer, fully connected layers, and a softmax layer; the CNN may have 4 to 8 layers and the fully connected part 3 to 6 layers. The per-layer weights express that regularizing the outputs of different layers contributes differently to the performance of the CTC multi-layer speech recognition network g. The weights are learned automatically by network f during training, avoiding subjective manual weighting.

In this embodiment, the weight-calculation network f consists of a 4-layer CNN, a pooling layer, 3 fully connected layers, and a softmax layer. The weights α_l that f learns are characteristically large for the higher layers and small for the lower layers.
1-6) CTC multi-layer loss calculation:

As shown in Figure 2, the per-layer CTC losses are weighted by α_l and summed to obtain the CTC multi-layer loss:

L_MultiLayer_CTC = Σ_{l=1}^{L} α_l · L_CTC_l

where L_MultiLayer_CTC denotes the CTC multi-layer loss.
1-7) Parameter updating until convergence:

Using gradient descent, minimize L_MultiLayer_CTC, updating the network parameters φ and ψ. Repeat steps 1-3) through 1-6), computing the CTC multi-layer loss and updating φ and ψ, until L_MultiLayer_CTC falls below the threshold 0.001; training is then complete, yielding the CTC multi-layer speech recognition network g and the weight-calculation network f.
2) Model testing stage:

2-1) Test-speech preprocessing:

The i-th sample of the test speech is split into frames and its Mel-frequency cepstral features are computed.

In this embodiment, the aishell-1 dataset is used for testing; its 5-hour test set contains 7,176 utterances.
2-2) Test speech recognition:

The preprocessed test sequence is fed into the trained CTC multi-layer speech recognition network g to obtain the speech recognition result.

The specific embodiments described herein merely illustrate the spirit of the present invention. Those skilled in the art may make various modifications or additions to the described embodiments, or substitute them in similar ways, without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210619908.5A CN115019782B (en) | 2022-06-02 | 2022-06-02 | Voice recognition method based on CTC multilayer loss |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115019782A CN115019782A (en) | 2022-09-06 |
CN115019782B true CN115019782B (en) | 2024-07-16 |
Family
ID=83072786
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210619908.5A Active CN115019782B (en) | 2022-06-02 | 2022-06-02 | Voice recognition method based on CTC multilayer loss |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115019782B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111968629A (en) * | 2020-07-08 | 2020-11-20 | 重庆邮电大学 | Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC |
CN113488028A (en) * | 2021-06-23 | 2021-10-08 | 中科极限元(杭州)智能科技股份有限公司 | Speech transcription recognition training decoding method and system based on rapid skip decoding |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10593321B2 (en) * | 2017-12-15 | 2020-03-17 | Mitsubishi Electric Research Laboratories, Inc. | Method and apparatus for multi-lingual end-to-end speech recognition |
CN114023316B (en) * | 2021-11-04 | 2023-07-21 | 匀熵科技(无锡)有限公司 | TCN-transducer-CTC-based end-to-end Chinese speech recognition method |
- 2022-06-02: application CN202210619908.5A filed in China; granted as CN115019782B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN115019782A (en) | 2022-09-06 |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| GR01 | Patent grant | | |