CN115019782B - Voice recognition method based on CTC multilayer loss
- Publication number
- CN115019782B (application CN202210619908.5A)
- Authority
- CN
- China
- Prior art keywords
- ctc
- network
- layer
- speech recognition
- speech
- Prior art date
- Legal status: Active (an assumption, not a legal conclusion)
Classifications
- G10L15/063: Speech recognition; Creation of reference templates; Training of speech recognition systems (e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/01: Assessment or evaluation of speech recognition systems
- G10L15/16: Speech classification or search using artificial neural networks
- G10L15/28: Constructional details of speech recognition systems
- Y02T10/40: Engine management systems
Abstract
A speech recognition method based on CTC multi-layer loss, belonging to the fields of pattern recognition and acoustics. The method regularizes the outputs of different layers of the speech recognition network so that each layer's output is as close as possible to the desired recognition result, thereby improving recognition performance. The method comprises two stages, model training and model testing. In the training stage, the preprocessed training set is fed into the constructed multi-layer speech recognition network; the per-layer losses and per-layer weights are computed, and the weighted sum of the per-layer losses gives the multi-layer loss. The loss is computed iteratively and the network parameters are updated until convergence. In the testing stage, the preprocessed test set is fed into the trained multi-layer speech recognition network, which outputs the recognition result. The invention changes only the loss function of the CTC model's training stage and alters neither the structure of the CTC speech recognition model nor its recognition process, improving recognition accuracy with low complexity and low overhead.
Description
Technical Field

The present invention belongs to the fields of pattern recognition and acoustics, and in particular relates to end-to-end speech recognition technology.

Background

Speech recognition is an important research topic in acoustics. Traditional speech recognition models are hybrid systems based on hidden Markov models, comprising an acoustic model, a language model, and a pronunciation dictionary; such systems are complex and difficult to design. End-to-end speech recognition is now the mainstream approach: compared with traditional methods such as HMM-based hybrid systems, it simplifies model design, training, and decoding. This improvement, however, comes at a higher computational cost. Many state-of-the-art ASR architectures adopt attention-based encoder-decoder structures, which require substantial computation and large model sizes. Moreover, the decoder operates autoregressively and must compute sequentially: generation of the next token can begin only after the previous token is complete. The Connectionist Temporal Classification (CTC) model, proposed in 2006, needs no separate decoder; its design is more compact and faster, making it well suited to speech recognition. In recent years researchers have improved CTC performance by modifying the model structure and by pre-training, but because of the CTC model's conditional-independence assumption its performance still lags behind encoder-decoder models. Overcoming that assumption typically requires adding an external language model and performing beam search, both of which bring high complexity and high overhead, so it is difficult to improve CTC performance in a low-complexity, low-overhead way.

Speech recognition technology has many application environments. With voice commands, users can issue instructions directly to devices or software, which suits search scenarios such as video websites and smart hardware. In games and entertainment, speech recognition converts speech into text, meeting users' diverse chat needs; it also has important applications in subtitle generation and meeting minutes. As speech recognition is used ever more widely in production and daily life, the design of low-complexity, low-overhead, high-performance speech recognition models is especially important.
Summary of the Invention

Current approaches to the conditional-independence problem of the CTC model (adding an external language model and performing beam search) suffer from high complexity and high overhead. The present invention therefore proposes a speech recognition method based on CTC multi-layer loss. The method regularizes the outputs of different layers of the speech recognition network so that each layer's output is as close as possible to the desired recognition result, thereby improving recognition performance. Regularizing the outputs of different layers contributes differently to the performance of the CTC multi-layer speech recognition network g, and these contributions are expressed as per-layer weights. To avoid the bias introduced by setting the weights subjectively by hand, the weights are learned through network training. During training, the method obtains the outputs of the different layers of the network, computes the per-layer CTC losses and per-layer weights, and takes the weighted sum of the per-layer losses to obtain the final CTC multi-layer loss; gradient descent then updates the model parameters until the model converges, yielding the final model. Because the method changes only the loss function of the CTC model's training stage and alters neither the model's structure nor its testing procedure, the testing stage incurs no extra overhead. The method is easy to implement and effectively improves CTC speech recognition performance without increasing the CTC model's complexity or overhead.

The present invention proposes a speech recognition method based on CTC multi-layer loss, characterized by a model training stage and a model testing stage, as shown in Figure 1. The training stage comprises training-speech preprocessing, construction of the CTC multi-layer speech recognition network, per-layer probability calculation, per-layer loss calculation, per-layer weight calculation, CTC multi-layer loss calculation, and parameter updating until convergence. The testing stage comprises test-speech preprocessing and test speech recognition.
1) Model training stage:

1-1) Training-speech preprocessing:

The training set is S = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, containing N training samples. The i-th sample is (x_i, y_i), where x_i is an input utterance and y_i is the corresponding ground-truth label, i.e., the transcript of x_i. Each utterance x_i in the training set is split into T_i frames and its Mel-frequency cepstral features are computed, yielding the preprocessed training speech.
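As a minimal sketch of this preprocessing step in Python (the 25 ms window, 10 ms hop, and 13 cepstral coefficients are assumed values that the patent does not fix):

```python
import librosa

def preprocess_utterance(wav_path, sr=16000, n_mfcc=13,
                         frame_len=0.025, hop_len=0.010):
    """Split an utterance into frames and compute its Mel-cepstral
    (MFCC) features; returns a (T_i, n_mfcc) array, one row per frame."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(frame_len * sr),     # 25 ms analysis window
        hop_length=int(hop_len * sr))  # 10 ms frame shift
    return mfcc.T                      # (T_i frames, n_mfcc coefficients)
```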
1-2) Construction of the CTC multi-layer speech recognition network:

Construct a CTC multi-layer speech recognition network g. As shown in Figure 2, g comprises L Transformer layers and one softmax layer; the parameters of the whole network are denoted φ, and L ranges from 20 to 28. The network's input is the preprocessed training speech x_i, and its output is the corresponding transcript y_i. The network function is

y_i = g(φ; x_i), i = 1, 2, …, N
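A minimal PyTorch sketch of such a network is given below; it keeps every Transformer layer's output so that the per-layer quantities of steps 1-3) to 1-6) can be computed. The model dimension, head count, vocabulary size (4233 is the usual aishell-1 character inventory, used here only as a placeholder), and the sharing of a single output projection across layers are assumptions, not details fixed by the patent:

```python
import torch
import torch.nn as nn

class CTCMultiLayerNet(nn.Module):
    """Network g: an input projection, L Transformer layers, and an
    output projection feeding the softmax; forward() returns the
    frame-wise log-probabilities of every layer, not only the last."""
    def __init__(self, input_dim=13, n_layers=24, d_model=256,
                 n_heads=4, vocab_size=4233):
        super().__init__()
        self.proj = nn.Linear(input_dim, d_model)  # features -> model dim
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)])
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, x):              # x: (batch, T, input_dim)
        x = self.proj(x)
        per_layer_logprobs = []
        for layer in self.layers:
            x = layer(x)
            # frame-wise log-softmax over the vocabulary at layer l
            per_layer_logprobs.append(
                torch.log_softmax(self.out(x), dim=-1))
        return per_layer_logprobs      # list of L tensors (batch, T, V)
```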
1-3) Per-layer probability calculation:

The preprocessed training speech x_i is fed into network g. Let x_l denote the output of the l-th layer of g for x_i. The probability P(y_i | x_l) that the true label y_i can be decoded from x_l is

P(y_i | x_l) = Σ_{q ∈ B^{-1}(y_i)} Π_{t=1}^{T_i} P(q_t | x_l)

This probability is called the per-layer probability. Here B^{-1}(y_i) is the set of alignments of length T_i consistent with y_i, i.e., all paths (blank tokens included) by which x_l maps to y_i, and q is one such path.
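For intuition, the sum over B^{-1}(y_i) can be evaluated by brute force on a toy example; the sketch below is exponential in T_i and is for illustration only, since practical systems compute it with the CTC forward algorithm:

```python
import itertools
import numpy as np

def ctc_prob(probs, target, blank=0):
    """P(y_i | x_l): sum, over all length-T paths q that collapse to
    `target` under B (merge repeats, then delete blanks), of the
    product of per-frame probabilities. probs: (T, C) posteriors."""
    T, C = probs.shape
    total = 0.0
    for q in itertools.product(range(C), repeat=T):
        collapsed, prev = [], None
        for s in q:
            if s != prev and s != blank:
                collapsed.append(s)
            prev = s
        if collapsed == list(target):
            total += np.prod([probs[t, s] for t, s in enumerate(q)])
    return total

# Toy check: 2 frames, classes {blank, 'a'}, uniform posteriors.
# Paths collapsing to 'a': (a,blank), (blank,a), (a,a) -> 3 * 0.25
probs = np.full((2, 2), 0.5)
print(ctc_prob(probs, [1]))  # 0.75
```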
1-4) Per-layer CTC loss calculation:

The CTC loss function is defined as the sum of the negative logarithms of the probabilities P(y_i | x_l) of decoding the true labels. From the per-layer probabilities of step 1-3), the per-layer CTC losses are

L_CTC_l = -ln P(y_i | x_l), l = 1, 2, …, L
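In practice this negative log-probability is not computed by enumeration; a standard CTC loss routine evaluates it with the forward algorithm. A sketch using torch.nn.CTCLoss (an assumed but conventional implementation choice) follows:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, reduction='mean')

def per_layer_ctc_losses(per_layer_logprobs, targets,
                         input_lengths, target_lengths):
    """L_CTC_l = -ln P(y_i | x_l) for every layer l of network g.
    per_layer_logprobs: list of (batch, T, V) log-probability tensors."""
    losses = []
    for logp in per_layer_logprobs:
        # nn.CTCLoss expects (T, batch, V) log-probabilities
        losses.append(ctc(logp.transpose(0, 1), targets,
                          input_lengths, target_lengths))
    return torch.stack(losses)         # shape (L,), one loss per layer
```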
1-5) Per-layer weight calculation:

The preprocessed training speech x_i is fed into the weight-calculation network f, yielding the per-layer weights α_l (l = 1, 2, …, L):

(α_1, α_2, …, α_L) = f(ψ; x_i)

where ψ denotes the parameters of f, Σ_{l=1}^{L} α_l = 1, and 0 ≤ α_l ≤ 1. As shown in Figure 3, the weight-calculation network f consists of a convolutional neural network (CNN), a pooling layer, fully connected layers, and a softmax layer; the CNN may have 4 to 6 layers and the fully connected part 3 to 5 layers. The per-layer weights express that regularizing the outputs of different layers contributes differently to the performance of the CTC multi-layer speech recognition network g. Computing the weights with network f avoids setting them subjectively by hand.
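One possible realization of f is sketched below, assuming the 4 convolutional and 3 fully connected layers of the embodiment described later; the channel counts, pooling size, and hidden width are placeholders:

```python
import torch
import torch.nn as nn

class WeightNet(nn.Module):
    """Network f: CNN -> pooling -> fully connected -> softmax, mapping
    an utterance's features to L weights alpha_l; the softmax guarantees
    sum(alpha) = 1 and 0 <= alpha_l <= 1."""
    def __init__(self, n_weights=24, hidden=128):
        super().__init__()
        convs, ch = [], 1
        for _ in range(4):                       # 4 CNN layers
            convs += [nn.Conv2d(ch, 32, 3, padding=1), nn.ReLU()]
            ch = 32
        self.cnn = nn.Sequential(*convs)
        self.pool = nn.AdaptiveAvgPool2d(4)      # pooling layer
        self.fc = nn.Sequential(                 # 3 fully connected layers
            nn.Linear(32 * 4 * 4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_weights))

    def forward(self, feats):                    # feats: (batch, T, n_mfcc)
        h = self.cnn(feats.unsqueeze(1))         # add a channel dimension
        h = self.pool(h).flatten(1)
        return torch.softmax(self.fc(h), dim=-1) # (batch, L) weights
```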
1-6) CTC multi-layer loss calculation:

As shown in Figure 2, the per-layer CTC losses are weighted by α_l and summed to obtain the CTC multi-layer loss:

L_MultiLayer_CTC = Σ_{l=1}^{L} α_l · L_CTC_l

where L_MultiLayer_CTC denotes the CTC multi-layer loss.
1-7) Parameter updating until convergence:

Using gradient descent, minimize L_MultiLayer_CTC, updating the network parameters φ and ψ. Repeat steps 1-3) through 1-6), computing the CTC multi-layer loss and updating φ and ψ, until L_MultiLayer_CTC falls below the threshold 0.001; the model has then converged, training is complete, and the trained CTC multi-layer speech recognition network g and weight-calculation network f are obtained.
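Putting steps 1-3) through 1-7) together, a minimal training-loop sketch built on the components above might read as follows. The Adam optimizer (a gradient-descent variant), the learning rate, the batch-averaging of the weights, and the train_loader are assumptions; the 0.001 threshold is the one stated above:

```python
import torch

g = CTCMultiLayerNet()
f = WeightNet()
opt = torch.optim.Adam(list(g.parameters()) + list(f.parameters()), lr=1e-4)

converged = False
while not converged:
    # train_loader is an assumed DataLoader yielding padded MFCC batches
    for feats, targets, in_lens, tgt_lens in train_loader:
        per_layer_logprobs = g(feats)                      # step 1-3)
        losses = per_layer_ctc_losses(per_layer_logprobs,  # step 1-4)
                                      targets, in_lens, tgt_lens)
        alpha = f(feats).mean(dim=0)     # step 1-5), batch-averaged (L,)
        multilayer_loss = (alpha * losses).sum()           # step 1-6)
        opt.zero_grad()
        multilayer_loss.backward()       # step 1-7): gradient update
        opt.step()
        if multilayer_loss.item() < 0.001:                 # convergence
            converged = True
            break
```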
2) Model testing stage:

2-1) Test-speech preprocessing:

The i-th sample of the test speech is split into frames and its Mel-frequency cepstral features are computed, yielding the preprocessed test speech.
2-2) Test speech recognition:

The preprocessed test speech is fed into the trained CTC multi-layer speech recognition network g to obtain the speech recognition result.
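At test time only the final layer's output is needed. A minimal greedy (best-path) decoding sketch is given below; the patent does not prescribe a decoding strategy, so greedy decoding is an assumption:

```python
import torch

def greedy_decode(g, feats, blank=0):
    """Best-path CTC decoding on the final layer: take the per-frame
    argmax, merge repeated tokens, then delete blanks."""
    g.eval()
    with torch.no_grad():
        logp = g(feats)[-1]            # last layer only, (batch, T, V)
    ids = logp.argmax(dim=-1)          # (batch, T)
    results = []
    for seq in ids:
        out, prev = [], blank
        for s in seq.tolist():
            if s != prev and s != blank:
                out.append(s)
            prev = s
        results.append(out)            # token ids; map to characters next
    return results
```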
Brief Description of the Drawings

Figure 1 shows the steps of the method of the present invention.

Figure 2 shows the overall architecture of the method.

Figure 3 shows the composition of the weight-calculation network.
Detailed Description of Embodiments

The present invention proposes a speech recognition method based on CTC multi-layer loss, comprising a model training stage and a model testing stage, as shown in Figure 1. The training stage comprises training-speech preprocessing, construction of the CTC multi-layer speech recognition network, per-layer probability calculation, per-layer loss calculation, per-layer weight calculation, CTC multi-layer loss calculation, and parameter updating until convergence. The testing stage comprises test-speech preprocessing and test speech recognition. The specific implementation is described below.
1) Model training stage:

1-1) Training-speech preprocessing:

The training set is S = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, containing N training samples. The i-th sample is (x_i, y_i), where x_i is an input utterance and y_i is the corresponding ground-truth label, i.e., the transcript of x_i. Each utterance x_i in the training set is split into T_i frames and its Mel-frequency cepstral features are computed, yielding the preprocessed training speech.

In this embodiment, the aishell-1 dataset is used for training; its 150-hour training set contains 120,098 utterances.
1-2) Construction of the CTC multi-layer speech recognition network:

Construct a CTC multi-layer speech recognition network g. As shown in Figure 2, g comprises L Transformer layers and one softmax layer; the parameters of the whole network are denoted φ, and L ranges from 12 to 32. In this embodiment, L is set to 24. The network's input is the preprocessed training speech x_i, and its output is the corresponding transcript y_i. The network function is

y_i = g(φ; x_i), i = 1, 2, …, N
1-3) Per-layer probability calculation:

The preprocessed training speech x_i is fed into network g. Let x_l denote the output of the l-th layer of g for x_i. The probability P(y_i | x_l) that the true label y_i can be decoded from x_l is

P(y_i | x_l) = Σ_{q ∈ B^{-1}(y_i)} Π_{t=1}^{T_i} P(q_t | x_l)

This probability is called the per-layer probability. Here B^{-1}(y_i) is the set of alignments of length T_i consistent with y_i, i.e., all paths (blank tokens included) by which x_l maps to y_i, and q is one such path.
1-4) Per-layer CTC loss calculation:

The CTC loss function is defined as the sum of the negative logarithms of the probabilities P(y_i | x_l) of decoding the true labels. From the per-layer probabilities of step 1-3), the per-layer CTC losses are

L_CTC_l = -ln P(y_i | x_l), l = 1, 2, …, L

In this embodiment, the CTC loss of every layer of the multi-layer speech recognition network g is computed.
1-5) Per-layer weight calculation:

The preprocessed training speech x_i is fed into the weight-calculation network f, yielding the per-layer weights α_l (l = 1, 2, …, L):

(α_1, α_2, …, α_L) = f(ψ; x_i)

where ψ denotes the parameters of f, Σ_{l=1}^{L} α_l = 1, and 0 ≤ α_l ≤ 1. As shown in Figure 3, the weight-calculation network f consists of a convolutional neural network (CNN), a pooling layer, fully connected layers, and a softmax layer; the CNN may have 4 to 8 layers and the fully connected part 3 to 6 layers. The per-layer weights express that regularizing the outputs of different layers contributes differently to the performance of the CTC multi-layer speech recognition network g. The weights are learned automatically by network f during training, avoiding subjective manual weighting.

In this embodiment, the weight-calculation network f consists of a 4-layer CNN, a pooling layer, 3 fully connected layers, and a softmax layer. The weights α_l that f learns are characteristically large for the higher layers and small for the lower layers.
1-6) CTC multi-layer loss calculation:

As shown in Figure 2, the per-layer CTC losses are weighted by α_l and summed to obtain the CTC multi-layer loss:

L_MultiLayer_CTC = Σ_{l=1}^{L} α_l · L_CTC_l

where L_MultiLayer_CTC denotes the CTC multi-layer loss.
1-7) Parameter updating until convergence:

Using gradient descent, minimize L_MultiLayer_CTC, updating the network parameters φ and ψ. Repeat steps 1-3) through 1-6), computing the CTC multi-layer loss and updating φ and ψ, until L_MultiLayer_CTC falls below the threshold 0.001; training is then complete, yielding the CTC multi-layer speech recognition network g and the weight-calculation network f.
2) Model testing stage:

2-1) Test-speech preprocessing:

The i-th sample of the test speech is split into frames and its Mel-frequency cepstral features are computed.

In this embodiment, the aishell-1 dataset is used for testing; its 5-hour test set contains 7,176 utterances.
2-2) Test speech recognition:

The preprocessed test sequence is fed into the trained CTC multi-layer speech recognition network g to obtain the speech recognition result.

The specific embodiments described herein merely illustrate the spirit of the present invention. Those skilled in the art may make various modifications or additions to the described embodiments, or substitute them in similar ways, without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210619908.5A CN115019782B (en) | 2022-06-02 | 2022-06-02 | Voice recognition method based on CTC multilayer loss |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115019782A CN115019782A (en) | 2022-09-06 |
CN115019782B true CN115019782B (en) | 2024-07-16 |
Family
ID=83072786
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210619908.5A Active CN115019782B (en) | 2022-06-02 | 2022-06-02 | Voice recognition method based on CTC multilayer loss |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115019782B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111968629A (en) * | 2020-07-08 | 2020-11-20 | 重庆邮电大学 | Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC |
CN113488028A (en) * | 2021-06-23 | 2021-10-08 | 中科极限元(杭州)智能科技股份有限公司 | Speech transcription recognition training decoding method and system based on rapid skip decoding |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10593321B2 (en) * | 2017-12-15 | 2020-03-17 | Mitsubishi Electric Research Laboratories, Inc. | Method and apparatus for multi-lingual end-to-end speech recognition |
CN114023316B (en) * | 2021-11-04 | 2023-07-21 | 匀熵科技(无锡)有限公司 | TCN-transducer-CTC-based end-to-end Chinese speech recognition method |
- 2022-06-02: application CN202210619908.5A filed in China; granted as CN115019782B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN115019782A (en) | 2022-09-06 |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| GR01 | Patent grant | | |