CN113160798B - Chinese civil aviation air traffic control voice recognition method and system - Google Patents

Chinese civil aviation air traffic control voice recognition method and system

Info

Publication number
CN113160798B
CN113160798B (application CN202110467893.0A)
Authority
CN
China
Prior art keywords: voice, layer, characteristic data, air traffic, traffic control
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110467893.0A
Other languages
Chinese (zh)
Other versions
CN113160798A (en)
Inventor
罗林开
俞涵
张松飞
彭洪
黄俊祥
江居旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University
Priority to CN202110467893.0A
Publication of CN113160798A
Application granted
Publication of CN113160798B
Legal status: Active


Classifications

    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G08G 5/0095: Aspects of air-traffic control not provided for in the other subgroups of this main group
    • G10L 15/063: Training of speech recognition systems (creation of reference templates; adaptation to the characteristics of the speaker's voice)
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 25/24: Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L 2015/0631: Creating reference templates; clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a voice recognition method and system for Chinese civil aviation air traffic control. The method comprises the following steps: acquiring voice feature data, where the voice feature data is time-series feature information extracted from a voice signal; and inputting the voice feature data into a trained acoustic model to obtain a recognition result, where the recognition result represents the air traffic control Chinese phraseology text corresponding to the voice signal. The acoustic model comprises a TRM module, a BiGRU module, a fully connected layer FC and a CTC module connected in sequence; the TRM module comprises a multi-head self-attention layer, a first residual connection and layer normalization layer, a feed-forward layer and a second residual connection and layer normalization layer connected in sequence; the BiGRU module comprises a bidirectional gated recurrent unit network; the CTC module comprises a connectionist temporal classification layer; and the acoustic model is trained on air traffic control (ATC) instruction phraseology voice samples labeled with Chinese characters. The invention has the advantage of high recognition accuracy.

Description

Chinese civil aviation air traffic control voice recognition method and system
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice recognition method and a voice recognition system for Chinese civil aviation air traffic control.
Background
Air traffic control mainly commands and schedules aircraft taxiing on the ground and flying along air routes. It is an important guarantee of air traffic safety and efficiency and relies heavily on air traffic controllers. The ground-to-air calls between air traffic controllers and flight crews are closely tied to flight safety, and it is necessary to convert these calls into text records for archiving.
The existing voice recognition technology applied to Chinese civil aviation air traffic control is mainly the deep learning-based CLDNN neural network, which consists of several CNN layers, several LSTM layers and several fully connected layers; however, the recognition accuracy of this prior art scheme still needs improvement.
Disclosure of Invention
The invention aims to provide a Chinese civil aviation air traffic control voice recognition method and system with high recognition accuracy.
In order to achieve the above object, the present invention provides the following solutions:
a Chinese civil aviation air traffic control voice recognition method comprises the following steps:
acquiring voice characteristic data, wherein the voice characteristic data is time sequence characteristic information extracted based on voice signals;
inputting the voice characteristic data into a trained acoustic model to obtain a recognition result, wherein the recognition result represents air traffic control Chinese terminology characters corresponding to the voice signals; the acoustic model includes: the TRM module comprises a multi-head self-attention layer, a first residual error connection and layer standardization layer, a feedforward layer and a second residual error connection and layer standardization layer which are sequentially connected, the BiGRU module comprises a bidirectional gating circulation unit network, the CTC module comprises a connection time sequence classification layer, and the acoustic model is obtained by training blank pipe instruction term voice samples with Chinese character labels.
Optionally, before acquiring the voice feature data, the method further includes:
framing the voice signal to obtain a plurality of voice frames;
determining the voice feature data from the voice frames, where each item of voice feature data corresponds to a plurality of consecutive voice frames.
Optionally, each item of voice feature data corresponds to a reference voice frame, a set number of voice frames before the reference voice frame, and a set number of voice frames after the reference voice frame.
Optionally, when the reference voice frame lies within the first m frames or the last n frames of the voice signal, zero padding is performed before or after the voice feature data to which the reference voice frame belongs, so that every item of voice feature data has the same data length, where m and n are positive integers.
Optionally, determining the voice feature data from the voice frames specifically includes:
sampling the voice frames to obtain a plurality of sampling points;
and determining the voice feature data based on the sampling points, where each item of voice feature data corresponds to the sampling points in a plurality of consecutive voice frames.
Optionally, the voice feature data is the Mel-frequency cepstral coefficients (MFCC) of the voice.
Optionally, before framing the voice signal, the method further includes:
performing silence removal on the voice signal.
Optionally, adjacent voice frames in the voice signal overlap by a set proportion.
The invention also provides a Chinese civil aviation air traffic control voice recognition system, which comprises:
a voice feature data acquisition module, configured to acquire voice feature data, the voice feature data being time-series feature information extracted from a voice signal;
a voice recognition module, configured to input the voice feature data into a trained acoustic model to obtain a recognition result, the recognition result representing the air traffic control Chinese phraseology text corresponding to the voice signal; the acoustic model comprises a TRM module, a BiGRU module, a fully connected layer FC and a CTC module connected in sequence, wherein the TRM module comprises a multi-head self-attention layer, a first residual connection and layer normalization layer, a feed-forward layer and a second residual connection and layer normalization layer connected in sequence, the BiGRU module comprises a bidirectional gated recurrent unit network, the CTC module comprises a connectionist temporal classification layer, and the acoustic model is trained on air traffic control instruction phraseology voice samples labeled with Chinese characters.
Optionally, the Chinese civil aviation air traffic control voice recognition system further comprises:
a silence-removal module, configured to perform silence removal on the voice signal;
a framing module, configured to frame the voice signal into a plurality of voice frames, adjacent voice frames overlapping by a set proportion;
a voice feature data determining module, configured to determine the voice feature data from the voice frames, where each item of voice feature data corresponds to a plurality of consecutive voice frames and the voice feature data is the Mel-frequency cepstral coefficients of the voice.
According to the specific embodiments provided by the invention, the following technical effects are disclosed: in the acoustic model structure provided by the embodiments of the invention, the TRM module encodes the input voice features and, through a self-attention mechanism, relates the input voice frames to one another, yielding feature representations that incorporate contextual voice information. The BiGRU combines a bidirectional recurrent neural network with a gated recurrent unit network and has the advantages of both: like a gated recurrent unit network it can model temporal dependencies, and like a bidirectional recurrent neural network it has access to both past and future context. CTC solves the problem of misalignment between the voice input sequence and the label sequence, thereby enabling end-to-end voice recognition. For these reasons, the Chinese civil aviation air traffic control voice recognition method provided by the embodiments of the invention has the advantage of high recognition accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a voice recognition method for Chinese civil aviation air traffic control provided by an embodiment of the invention;
FIG. 2 is a flowchart of the Chinese civil aviation air traffic control voice recognition method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an acoustic model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a TRM module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the structure of the multi-head self-attention layer in the TRM module according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a BiGRU module according to an embodiment of the invention;
FIG. 7 is a schematic diagram of a GRU in a BiGRU module according to an embodiment of the invention;
FIG. 8 is a schematic diagram of the recognition process according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a Chinese civil aviation air traffic control voice recognition system according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a Chinese civil aviation air traffic control voice recognition method and system with high recognition accuracy.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
Referring to FIG. 1, this embodiment provides a Chinese civil aviation air traffic control voice recognition method, which includes the following steps:
step 101: acquiring voice characteristic data, wherein the voice characteristic data is time sequence characteristic information extracted based on voice signals;
step 102: inputting the voice characteristic data into a trained acoustic model to obtain a recognition result, wherein the recognition result represents air traffic control Chinese terminology characters corresponding to the voice signals; the acoustic model includes: the system comprises a TRM module, a BiGRU module, a full-connection layer FC and a CTC module which are sequentially connected, wherein the TRM module comprises a multi-head self-attention layer, a first residual error connection and layer standardization layer, a feedforward layer and a second residual error connection and layer standardization layer which are sequentially connected, the BiGRU module comprises a bidirectional gating circulation unit network, the CTC module comprises a connection time sequence classification layer, and the acoustic model is obtained by training blank pipe instruction term voice samples with Chinese character labels. Training of the acoustic model utilizes an Adam optimizer to fit training data through a back propagation algorithm, adjusts parameters on a validation set, and evaluates model quality on test data.
In the acoustic model structure provided by this embodiment, the TRM module encodes the input voice features: through the self-attention mechanism it computes the similarity of each frame's features against all frames of the input voice, fully accounts for the pronunciation and semantic connections between the input voice frames, and recomputes a feature representation that incorporates the contextual voice information. The BiGRU combines a bidirectional recurrent neural network with a gated recurrent unit network and has the advantages of both: it can model temporal dependencies like a gated recurrent unit network while having access to past and future context like a bidirectional recurrent neural network, which makes it well suited as a core module of a voice recognition acoustic model. CTC addresses the difficulty of putting the input sequence and the output sequence into one-to-one correspondence; speech is a typical case where the input sequence is not aligned with the label sequence, and CTC lets the deep learning model learn the alignment automatically, thereby realizing end-to-end voice recognition. For these reasons, the Chinese civil aviation air traffic control voice recognition method provided by this embodiment has the advantage of high recognition accuracy. In addition, since the acoustic model consists only of TRM and BiGRU layers, problems such as vanishing or exploding gradients are unlikely to occur, the training process converges easily, the required amount of training data is relatively small, and the dataset labeling cost is low.
To address the fact that parts of Chinese ATC instructions differ from standard Mandarin pronunciation, this embodiment builds its own ATC voice dataset, designs a deep learning architecture containing a self-attention mechanism, TRM-BiGRU-CTC, and trains and validates it on the ATC dataset to obtain an acoustic model for Chinese civil aviation air traffic control voice recognition. The acoustic model provided by the invention recognizes test voice with high accuracy and can recognize fast ATC speech recorded in noisy environments. The many professional expressions in ATC voice whose pronunciation differs from Mandarin, such as digits, altitudes and letters, are automatically converted into the corresponding text sequences.
As one implementation, the Chinese civil aviation air traffic control voice recognition method provided by this embodiment first collects Chinese civil aviation ATC voice; then constructs an ATC voice dataset and performs data preprocessing, which includes removing silent segments, extracting features from the ATC voice and processing those features; designs an acoustic model containing a self-attention mechanism, TRM-BiGRU-CTC, and trains it on the preprocessed ATC voice dataset; feeds the ATC voice to be recognized, after feature extraction and feature processing, into the trained acoustic model; and decodes the output of the acoustic model through connectionist temporal classification (CTC) to obtain the Chinese character sequence corresponding to the ATC voice content.
Furthermore, the format of the civil aviation air traffic control voice is fixed as WAV; if other formats such as MP3 or OGG are used, format conversion is needed to ensure that the voice data has a uniform WAV format.
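The patent does not say which tool performs this conversion; the following pydub sketch (file names illustrative, ffmpeg assumed to be installed) shows one way to normalize collected audio to the mono 8 kHz WAV format used later in this embodiment:

```python
from pydub import AudioSegment  # pydub delegates decoding to ffmpeg

def to_wav(src_path: str, dst_path: str) -> None:
    """Convert MP3/OGG/etc. to mono, 8000 Hz WAV (format auto-detected)."""
    audio = AudioSegment.from_file(src_path)
    audio = audio.set_channels(1).set_frame_rate(8000)
    audio.export(dst_path, format="wav")

to_wav("atc_sample.mp3", "atc_sample.wav")  # illustrative file names
```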
The voice data in the self-built ATC voice dataset all come from the actual operating environment of a particular air traffic control area. The collected ATC voice is manually labeled and checked against the air traffic radiotelephony phraseology guide, and the dataset's scale sufficiently covers the call phraseology of most situations in that control area, ensuring that a voice recognition model trained on the dataset fits the real environment.
Prior to step 101 in this embodiment, the method may further include:
performing a framing operation on the voice signal to obtain a plurality of voice frames, where, preferably, adjacent voice frames in the voice signal overlap by a set proportion;
determining the voice feature data from the voice frames, where each item of voice feature data corresponds to a plurality of consecutive voice frames; preferably, the voice feature data in this embodiment is the Mel-frequency cepstral coefficients of the voice. Further, each item of voice feature data may correspond to a reference voice frame together with a set number of voice frames before it and a set number after it. When the reference voice frame lies within the first m frames or the last n frames of the voice signal, zero padding is performed before or after the voice feature data to which the reference voice frame belongs, so that every item of voice feature data has the same length. In addition, determining the voice feature data from the voice frames specifically includes: sampling the voice frames to obtain a plurality of sampling points, and determining the voice feature data from the sampling points, where each item of voice feature data corresponds to the sampling points in a plurality of consecutive voice frames.
The voice feature data refers to the Mel-frequency cepstral coefficient (MFCC) features of the voice, which match the hearing characteristics of the human ear. Since each frame's features contain only a short stretch of voice information, most voice frames are insufficient to express one Chinese character, so further feature processing is required. Specifically, a frame-splicing operation is performed on every frame of the extracted MFCC features: for the current frame, the MFCC features of the m frames to its left and the n frames to its right are concatenated with the current frame and used as the features of the current voice frame, so that every frame fed to the acoustic model carries more context-related information, as in the sketch below.
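A minimal NumPy sketch of this splicing step, assuming 26-dimensional MFCC rows and the zero padding at the sequence edges described above (function and parameter names are illustrative):

```python
import numpy as np

def splice_frames(mfcc: np.ndarray, m: int = 7, n: int = 7) -> np.ndarray:
    """Concatenate each frame with its m left and n right neighbours.

    mfcc: (T, 26) per-frame features -> returns (T, (m + 1 + n) * 26).
    Frames near the edges are zero-padded, as described in the text.
    """
    T, d = mfcc.shape
    padded = np.vstack([np.zeros((m, d)), mfcc, np.zeros((n, d))])
    # Window t in the padded array covers original frames t-m .. t+n.
    return np.stack([padded[t:t + m + 1 + n].reshape(-1) for t in range(T)])
```

With m = n = 7 and 26-dimensional MFCCs this yields the 390-dimensional vectors used by the acoustic model.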
In this embodiment, before performing the framing operation on the voice signal, the method further includes performing silence removal on the voice signal.
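The text does not disclose how the silent segments are detected; the following energy-threshold gate is only an illustrative stand-in for that step (threshold and window length are assumptions):

```python
import numpy as np

def remove_silence(signal: np.ndarray, sr: int = 8000,
                   win_ms: float = 25.0, thresh: float = 0.01) -> np.ndarray:
    """Drop low-energy windows from a [-1, 1] float signal."""
    win = int(sr * win_ms / 1000)
    keep = []
    for start in range(0, len(signal) - win + 1, win):
        chunk = signal[start:start + win]
        if np.sqrt(np.mean(chunk ** 2)) > thresh:  # RMS energy gate
            keep.append(chunk)
    return np.concatenate(keep) if keep else signal
```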
Referring to FIG. 2, this embodiment may include two stages: a training phase and a recognition phase.
First, owing to the specific nature of air traffic control speech, an ATC voice dataset had to be self-built. A large amount of ATC voice was collected in the actual operating environment of a particular air traffic control area; the voice files were normalized to WAV format with a bit rate of 128 kbps and a sampling rate of 8000 Hz. The ATC audio was manually labeled and checked against the air traffic radiotelephony phraseology guide. The labels for some special pronunciations are handled as follows: (1) letter pronunciations are labeled as capital letters; for example, the letter A is pronounced "Alpha" in ATC and its label is A; (2) numbers are labeled as Arabic numerals wherever possible; in addition, for altitude numbers, different controllers or pilots may read the same altitude differently; for 2100 meters, for example, a digit-by-digit reading is labeled "21", while a "two thousand one hundred" style reading is labeled with the corresponding characters; (3) certain special waypoints, for example NASPA, are labeled directly as NASPA and treated as a single modeling unit during recognition, rather than as the five separate modeling units N, A, S, P and A. The self-built ATC dataset in this embodiment totals about 47 hours and 9,300 voice samples, of which 7,700 are used as the training set, 540 as the validation set and 1,060 as the test set. Compared with prior ATC voice datasets, the ATC dataset of this embodiment is smaller, which reduces the data labeling cost; nevertheless, it still covers most of the ground-to-air dialogue vocabulary likely to occur in the air traffic control domain. Because the number of Chinese character classes in the dataset is not much larger than the number of pinyin classes, Chinese characters can be chosen directly as the modeling unit; the most direct advantage is that no additional language model is needed for conversion, and only the acoustic model has to be trained.
Next, features are extracted from the training voice data after the silent segments have been removed. Feature extraction first requires framing, i.e. grouping N sampling points into one observation unit. Because a voice signal is short-time stationary (it can be regarded as approximately unchanged within 10-30 ms), this embodiment sets the time covered by one frame to 25 ms; since the sampling rate of the ATC voice data is 8000 Hz, each frame contains 200 sampling points. To avoid abrupt changes in the feature parameters of adjacent frames, an overlapping region is usually placed between adjacent frames; here the overlap is 12.5 ms, i.e. a frame is taken every 12.5 ms. However, some ATC voice samples in this embodiment then have too many frames, which raises the hardware requirements, so downsampling by keeping every other frame is adopted to relieve the pressure on the equipment. This embodiment extracts 26-dimensional MFCC features, i.e. 26 features per frame vector. Since one frame of voice contains only 25 ms of content, generally not enough to express a syllable, the extracted MFCC features are frame-spliced: for the current frame, the MFCC features of the m frames on its left and the n frames on its right are concatenated with the current frame and used as its features. This embodiment sets m = 7 and n = 7; if the current frame lies within the first 7 or last 7 frames of the utterance, zero padding is applied, and otherwise the MFCC features of the 7 frames on each side are spliced onto the current frame. After splicing, each frame has 390 dimensions ((7+7+1) × 26) and carries 375 ms of voice content, which solves the problem of the small information content of single-frame data and gives every frame contextual information.
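Put together, the preprocessing described above might look like the following sketch. librosa is an assumed stand-in for the unspecified MFCC implementation, and splice_frames is the helper sketched earlier:

```python
import librosa
import numpy as np

def extract_features(wav_path: str) -> np.ndarray:
    """25 ms windows, 12.5 ms hop, 26 MFCCs; every other frame is kept
    (the frame-skipping downsampling described above) and frames are
    spliced with 7 left / 7 right neighbours into 390-dim vectors."""
    y, sr = librosa.load(wav_path, sr=8000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=26,
                                n_fft=200, hop_length=100)  # (26, T)
    mfcc = mfcc.T[::2]                    # (T', 26): keep every other frame
    return splice_frames(mfcc, m=7, n=7)  # (T', 390), see earlier sketch
```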
Then the acoustic model of the Chinese civil aviation air traffic control voice recognition method containing a self-attention mechanism is constructed; the model structure is shown in FIG. 3. This application calls the network the TRM-BiGRU-CTC network. A TRM module refers to an encoder block of the Transformer model; each TRM module consists of a multi-head self-attention layer and a feed-forward layer, with residual connection and layer normalization added to each. The BiGRU module refers to a bidirectional gated recurrent unit network; the CTC module refers to connectionist temporal classification.
One important component of the acoustic model is the TRM, whose structure is shown in FIG. 4. The TRM module consists mainly of two parts, a multi-head self-attention layer and a feed-forward layer; "Add & Norm" in the figure denotes residual connection and layer normalization. Residual connections effectively mitigate gradient vanishing and network degradation, while layer normalization accelerates network convergence.
The feed-forward layer of the TRM module consists of two linear layers. In this embodiment the numbers of hidden units in the two linear layers are 1024 and 390, so the output has the same dimension as the input, which facilitates the residual connection. The first linear layer has a ReLU activation function and no Dropout; the second linear layer has no activation function and a Dropout rate of 0.3.
The structure of the multi-head self-attention layer of the TRM module is shown in FIG. 5. For the input X, X is mapped through Linear layers to three different representations Q, K and V, where Q is the query, K the key and V the value; in this embodiment the output dimension of each Linear layer is 390. The number of attention heads is set to 5: the input X undergoes 5 different linear transformations to obtain 5 different self-attention mapping representations, which are spliced together as the new feature encoding. Concretely, Q, K and V are each split along the feature dimension into 5 parts, shown as a1, a2, a3, a4 and a5 in FIG. 5, each with feature dimension 78 (390/5) and containing its slice of Q, K and V, and each part is fed into a self-attention layer. The self-attention representation in FIG. 5 is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the feature dimension of Q, K and V, equal to 78 in this embodiment. This computation takes each frame of the input multi-frame data, computes its similarity with all frames, uses the similarities as weights, and forms the weighted sum over all frames, producing a representation of each frame relative to the entire input, i.e. b1-b5 in FIG. 5: the higher the correlation between frames, the larger the similarity weight and the greater their influence on each other's representation. Concatenating b1-b5 (Concat) gives the output Y, whose feature dimension is still 390, consistent with the input X for the residual connection.
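A small PyTorch rendering of this computation, assuming a single utterance of shape (frames, 390) and externally supplied projection matrices (all names illustrative):

```python
import torch
import torch.nn.functional as F

def multihead_self_attention(X, Wq, Wk, Wv, n_heads=5):
    """Illustrates FIG. 5: project X to Q, K, V (390-dim each), split into
    5 heads of 78 dims, apply softmax(QK^T / sqrt(d_k)) V per head, concat."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # each (frames, 390)
    outputs = []
    for q, k, v in zip(Q.chunk(n_heads, dim=-1),
                       K.chunk(n_heads, dim=-1),
                       V.chunk(n_heads, dim=-1)):
        d_k = q.shape[-1]                         # 78 = 390 / 5
        # Row t holds frame t's similarity to every frame of the input.
        scores = F.softmax(q @ k.T / d_k ** 0.5, dim=-1)
        outputs.append(scores @ v)                # weighted sum over frames
    return torch.cat(outputs, dim=-1)             # Y: (frames, 390)
```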
Another important component of the acoustic model is the BiGRU. In an ordinary gated recurrent unit network (GRU), the hidden state propagates in one direction, from front to back: the state at position i depends only on the inputs from position 0 to position i and not on the inputs from position i+1 onward, i.e. the current state sees only past context. In voice recognition, however, the state at the current position is often more effective when it also incorporates future context. The basic idea of the bidirectional GRU (BiGRU) is to stack two unidirectional gated recurrent unit networks one on top of the other. The structure of the BiGRU is shown in FIG. 6, where the circles connected by rightward arrows in the first row from the bottom form the forward GRU and the circles connected by leftward arrows in the second row form the backward GRU. The same training sequence [x1, x2, x3, x4, x5] is fed front to back into the forward GRU to obtain [b1, b2, b3, b4, b5], and back to front into the backward GRU to obtain [a1, a2, a3, a4, a5]; the two sequences are then spliced together element-wise, e.g. b1 and a1 are spliced into y1, giving the output sequence [y1, y2, y3, y4, y5]. The output sequence thus provides complete context information for the state at every moment, and since the output dimension of the BiGRU is twice that of a unidirectional GRU, the BiGRU has stronger expressive power.
The principle of the unidirectional GRU inside the BiGRU module is shown in FIG. 7, where $X_t$ is the input at time t, $H_t$ the hidden state at time t and $H_{t-1}$ the hidden state at time t-1; a circle with an asterisk denotes Hadamard (element-wise) multiplication and a circle with a plus sign denotes addition. The GRU achieves long-range temporal dependency through one update gate and one reset gate. The update gate is computed as $Z_t = \sigma(W_z \cdot [X_t, H_{t-1}])$, where $\sigma$ is the Sigmoid function; the result, between 0 and 1, determines how much information from the current and past time steps is passed on. The reset gate is computed as $R_t = \sigma(W_r \cdot [X_t, H_{t-1}])$; its value, also between 0 and 1, determines how much information from past time steps is forgotten. The reset gate and update gate share the same form of formula but use different parameters to realize different functions. The candidate memory content is computed as $H'_t = \tanh(W \cdot [X_t, R_t \odot H_{t-1}])$, and the hidden state at time t is computed as $H_t = Z_t \odot H'_t + (1 - Z_t) \odot H_{t-1}$, where $Z_t \odot H'_t$ represents the updated information of time step t and $(1 - Z_t) \odot H_{t-1}$ the information carried forward from past time steps; their combination is the output at time step t.
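These gate equations translate directly into code; the following single-step sketch mirrors them (bias terms omitted for brevity, weight names illustrative):

```python
import torch

def gru_step(x_t, h_prev, Wz, Wr, W):
    """One GRU time step; weights act on the concatenation [x_t, h_prev]."""
    xh = torch.cat([x_t, h_prev], dim=-1)
    z_t = torch.sigmoid(xh @ Wz)   # update gate: how much new info to pass on
    r_t = torch.sigmoid(xh @ Wr)   # reset gate: how much past info to forget
    h_cand = torch.tanh(torch.cat([x_t, r_t * h_prev], dim=-1) @ W)
    return z_t * h_cand + (1 - z_t) * h_prev  # new hidden state H_t
```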
In this embodiment, the relevant parameters of the acoustic model are set as follows: the number of heads of the multi-head self-attention layer in the TRM is 5, the linear-layer dimension is 390 with 78 (390/5) per head, and the Dropout rate is 0.3; the hidden node counts of the two feed-forward sublayers in the TRM are 1024 and 390, with a Dropout rate of 0.3; the forward and backward GRUs in the BiGRU module each have 256 neuron nodes, a Tanh activation function and a Dropout rate of 0.3; the first fully connected layer FC after the BiGRU has 256 hidden nodes, a ReLU activation function and a Dropout rate of 0.3; because the recognition targets are Chinese characters and the ATC dataset contains 745 Chinese character classes in total, the fully connected layer FC feeding the CTC has 746 nodes (745 + blank), with no activation function and no Dropout; the loss function is the CTC loss. Training updates the network parameters with an Adam optimizer, the initial learning rate is set to 0.0005, and the Adam momentum terms are 0.9 and 0.99 respectively. During training, the MFCC features of 12 utterances are taken as the input of each batch; because the audio lengths within a batch differ and the network's batch input must be aligned, the MFCC features of the shorter utterances in each batch of 12 are zero-padded to the length of the longest one.
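Under those settings, one training step might look like the following sketch. The model class is the one sketched earlier; placing the CTC blank at index 745 is an assumption, since the text only says "745 + blank":

```python
import torch
import torch.nn as nn

model = TrmBiGruCtc()  # from the earlier sketch
ctc_loss = nn.CTCLoss(blank=745, zero_infinity=True)  # blank index assumed
optim = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.99))

def train_step(feats, feat_lens, labels, label_lens):
    """feats: (12, max_frames, 390), zero-padded to the longest utterance;
    labels: (12, max_label_len) padded character indices."""
    logits = model(feats)                               # (B, T, 746)
    log_probs = logits.log_softmax(-1).transpose(0, 1)  # CTCLoss wants (T, B, C)
    loss = ctc_loss(log_probs, labels, feat_lens, label_lens)
    optim.zero_grad()
    loss.backward()                                     # backpropagation
    optim.step()
    return loss.item()
```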
Once the trained acoustic model is obtained, the ATC voice data to be recognized can be processed. As shown in the recognition flowchart of FIG. 8, ATC voice data is collected by the air traffic control equipment and converted to WAV format if it is not already in that format. The silent segments are removed; after pre-emphasis, framing and windowing, the MFCC features are extracted, the left-right frame splicing is applied, the result is fed into the acoustic model, and the model output is CTC-decoded to obtain the predicted text of the voice content. CTC decoding uses the Beam Search method. Beam Search can be regarded as a breadth-first search that retains suboptimal solutions: where breadth-first search keeps every historical path, Beam Search keeps only the TOP-N paths (N being the beam width, beam_width). In this embodiment beam_width is set to 5. Suppose the vocabulary size is 100. When the first character is generated, since beam_width equals 5, the 5 most probable characters are chosen from the vocabulary. When the second character is generated, the preceding character may be any of those 5, so combining each with every vocabulary entry gives 5 × 100 new sequences, from which the 5 with the highest confidence are kept as the current sequences. For the third character there are again 5 × 100 possible sequences, from which the 5 most confident are kept, and this process repeats until a terminator is encountered; finally the 5 sequences with the highest confidence are obtained. Beam Search follows the idea of a greedy algorithm and does not necessarily reach the global optimum. However, since the number of voice frames is very large and the number of corresponding characters correspondingly so, seeking the global optimum would make the search space and paths prohibitively large and the search extremely inefficient; the Beam Search used in this embodiment yields a comparatively good local optimum that is acceptable in engineering terms.
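The following sketch implements the generic TOP-N beam search described above over per-frame log-posteriors. Note that it deliberately omits the blank/repeat merging that a full CTC beam decoder adds, so it is illustrative only:

```python
import numpy as np

def beam_search(log_probs: np.ndarray, beam_width: int = 5):
    """Keep the TOP-N partial sequences at each frame.

    log_probs: (T, C) per-frame log-posteriors from the acoustic model.
    Returns the beam_width best (token sequence, summed log-prob) pairs.
    """
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for frame in log_probs:
        # Expand every kept path with every class: beam_width * C candidates.
        candidates = [(seq + (c,), score + frame[c])
                      for seq, score in beams for c in range(len(frame))]
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_width]  # keep only the 5 most confident paths
    return beams
```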
The effect of the invention is verified as follows.
A commonly used evaluation index for Chinese speech recognition is the character error rate (CER). The character error rate is computed as follows: to make the recognized sequence consistent with the correct sequence, certain characters must be substituted, deleted or inserted; the total number of inserted, substituted and deleted characters, as a percentage of the total number of characters in the correct sequence, is the CER:

$$\mathrm{CER} = \frac{S + D + I}{N} \times 100\%,$$

where S, D and I are the numbers of substituted, deleted and inserted characters and N is the number of characters in the correct sequence.
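That definition can be computed with a standard Levenshtein edit distance; a compact sketch:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: (S + D + I) / N via Levenshtein distance."""
    m, n = len(reference), len(hypothesis)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                  # i deletions to reach empty hypothesis
    for j in range(n + 1):
        dp[0][j] = j                  # j insertions from empty reference
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[m][n] / max(m, 1) * 100  # percentage of reference length
```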
the prior art in the technical field of voice recognition is quite a lot, but the prior art applied in the field of voice recognition of Chinese air traffic control is mainly in a CLDNN structure, namely a deep learning architecture consisting of a plurality of layers of CNNs, a plurality of layers of LSTM and a plurality of layers of fully-connected neural networks (from a patent with application publication number CN 110335609A, a ground-air call data analysis method and system based on voice recognition). Because the voice data set used in the field of Chinese air traffic control voice recognition is generally a self-built database, and the voice tone quality, the duration, the acquisition equipment, the recording environment and the like are all different, the quality of the recognition method cannot be evaluated by directly comparing the accuracy of different data sets. Therefore, the invention only carries out a comparison experiment with CLDNN, and adjusts the structure of the ATC data set according to the characteristics of the ATC data set. The specific model structure of the CLDNN mentioned in the embodiment of the present invention is as follows: two CNN layers, the convolution kernel size is 3*3, the step length is 1, the number of filters is 32 and 64 in sequence, the first CNN layer is connected with the largest pooling layer, the pooling window is 2 x2, the windows are not overlapped, and the second CNN layer is not connected with the pooling layer; the output of the convolution layer is used as the input of the full-connection network layer to reduce the dimension, and the number of the neurons of the full-connection network layer is 512; the full-connection network layer is connected with three LSTM layers, the number of the neurons is 256, the full-connection network layer and the softmax layer are connected with the full-connection network layer, and the number of the neurons is 256 and the number of the character categories respectively. The preprocessing of the voice data is the same as that of the embodiment of the invention, the MFCC characteristics of the voice are extracted, the frame spelling of the left 7 frames and the right 7 frames is carried out, the optimizer adopts an Adam optimizer as well, the initial learning rate is 0.005, and the momentum values in Adam are respectively 0.9 and 0.99. The CLDNN model has high requirement on data volume, is limited by data scale, can generate serious gradient vanishing problem when the LSTM layer number in the structure is overlarge, and is limited by hardware performance, so that the LSTM layer number is set to be 3 in the experiment.
This "CLDNN" model has been parameter-tuned to the ATC dataset, reducing the parameter size. However, the number of model parameters is counted as 10050163, while the number of model parameters in this embodiment is 3837483, in contrast to the smaller parameter scale of this embodiment, the model keeping costs low.
Table 1 shows the recognition effect of the acoustic model TRM-BiGRU-CTC on the test set; the character error rates of BiGRU and CLDNN are listed for comparison:
TABLE 1
From the above results it is clear that, with the same data preprocessing, the recognition effect of the invention on the ATC dataset is better than both CLDNN and the reference model BiGRU.
The invention has the following advantages:
(1) ATC voice advantage: the invention builds an ATC voice dataset specifically designed around the characteristics of ATC voice.
(2) Model advantage: in the acoustic model structure provided by the invention, the TRM module encodes the input voice features: through the self-attention mechanism it computes the similarity of each frame's features against all frames of the input voice, fully accounts for the pronunciation and semantic connections between the input voice frames, and recomputes a feature representation that incorporates the contextual voice information. The BiGRU combines a bidirectional recurrent neural network with a gated recurrent unit network and has the advantages of both: it can model temporal dependencies like a gated recurrent unit network and has access to context like a bidirectional recurrent neural network. CTC addresses the difficulty of putting the input sequence and the output sequence into one-to-one correspondence; speech is a typical case where the input sequence is not aligned with the label sequence, and CTC lets the deep learning model learn the alignment automatically, thereby realizing end-to-end voice recognition. In summary, the acoustic model structure is well founded; at the same time, its main structure consists only of TRM and BiGRU layers, so problems such as vanishing or exploding gradients are unlikely, the training process converges easily, the required data volume is relatively small, and the dataset labeling cost is low. Compared with the prior art, the invention achieves a better recognition effect on an ATC voice dataset of comparatively small size.
Example 2
Referring to FIG. 9, this embodiment provides a Chinese civil aviation air traffic control voice recognition system, the system comprising:
a voice feature data acquisition module 901, configured to acquire voice feature data, the voice feature data being time-series feature information extracted from a voice signal;
a voice recognition module 902, configured to input the voice feature data into a trained acoustic model to obtain a recognition result, the recognition result representing the air traffic control Chinese phraseology text corresponding to the voice signal; the acoustic model comprises a TRM module, a BiGRU module, a fully connected layer FC and a CTC module connected in sequence, wherein the TRM module comprises a multi-head self-attention layer, a first residual connection and layer normalization layer, a feed-forward layer and a second residual connection and layer normalization layer connected in sequence, the BiGRU module comprises a bidirectional gated recurrent unit network, the CTC module comprises a connectionist temporal classification layer, and the acoustic model is trained on air traffic control instruction phraseology voice samples labeled with Chinese characters.
As one implementation of this embodiment, the Chinese civil aviation air traffic control voice recognition system further comprises:
a silence-removal module, configured to perform silence removal on the voice signal;
a framing module, configured to frame the voice signal into a plurality of voice frames, adjacent voice frames overlapping by a set proportion;
a voice feature data determining module, configured to determine the voice feature data from the voice frames, where each item of voice feature data corresponds to a plurality of consecutive voice frames and the voice feature data is the Mel-frequency cepstral coefficients of the voice.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and for identical or similar parts the embodiments may be referred to one another. Since the system disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
Specific examples have been applied herein to explain the principles and implementation of the invention; the above description of the embodiments is only intended to help understand the method of the invention and its core idea. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementation and scope of application in accordance with the idea of the invention. In view of the foregoing, the contents of this specification should not be construed as limiting the invention.

Claims (10)

1. A Chinese civil aviation air traffic control voice recognition method, comprising:
acquiring voice feature data, wherein the voice feature data is time-series feature information extracted from a voice signal;
inputting the voice feature data into a trained acoustic model to obtain a recognition result, wherein the recognition result represents the air traffic control Chinese phraseology text corresponding to the voice signal; the acoustic model comprises a TRM module, a BiGRU module, a fully connected layer FC and a CTC module connected in sequence; the TRM module comprises a multi-head self-attention layer, a first residual connection and layer normalization layer, a feed-forward layer and a second residual connection and layer normalization layer connected in sequence, wherein the feed-forward layer consists of a first linear layer and a second linear layer, the first linear layer has a ReLU activation function and no Dropout, and the second linear layer has no activation function; the number of multi-head attention heads is set to 5, and 5 different self-attention mapping representations, obtained by 5 different linear transformations of the input, are spliced together as the new feature encoding; the BiGRU module comprises a bidirectional gated recurrent unit network, in which the same training sequence [x1, x2, x3, x4, x5] is input front to back into a forward GRU to obtain the sequence [b1, b2, b3, b4, b5] and back to front into a backward GRU to obtain the sequence [a1, a2, a3, a4, a5], and the two sequences are spliced together element-wise to obtain the output sequence [y1, y2, y3, y4, y5]; the CTC module comprises a connectionist temporal classification layer; and the acoustic model is trained on air traffic control instruction phraseology voice samples labeled with Chinese characters.
2. The Chinese civil aviation air traffic control voice recognition method of claim 1, further comprising, before the acquiring of the voice feature data:
framing the voice signal to obtain a plurality of voice frames;
determining the voice feature data from the voice frames, wherein each item of voice feature data corresponds to a plurality of consecutive voice frames.
3. The method of claim 2, wherein each item of voice feature data corresponds to a reference voice frame, a set number of voice frames before the reference voice frame, and a set number of voice frames after the reference voice frame.
4. The Chinese civil aviation air traffic control voice recognition method of claim 3, wherein when the reference voice frame lies within the first m frames or the last n frames of the voice signal, zero padding is performed before or after the voice feature data to which the reference voice frame belongs, respectively, so that every item of voice feature data has the same data length, wherein m and n are both positive integers.
5. The Chinese civil aviation air traffic control voice recognition method of any one of claims 2 to 4, wherein determining the voice feature data from the voice frames specifically comprises:
sampling the voice frames to obtain a plurality of sampling points;
and determining the voice feature data based on the sampling points, wherein each item of voice feature data corresponds to the sampling points in a plurality of consecutive voice frames.
6. The Chinese civil aviation air traffic control voice recognition method of claim 5, wherein the voice feature data is the Mel-frequency cepstral coefficients of the voice.
7. The Chinese civil aviation air traffic control voice recognition method of claim 2, further comprising, before framing the voice signal:
performing silence removal on the voice signal.
8. The method of claim 2, wherein adjacent voice frames in the voice signal overlap by a set proportion.
9. A Chinese civil aviation air traffic control voice recognition system, comprising:
a voice feature data acquisition module, configured to acquire voice feature data, the voice feature data being time-series feature information extracted from a voice signal;
a voice recognition module, configured to input the voice feature data into a trained acoustic model to obtain a recognition result, the recognition result representing the air traffic control Chinese phraseology text corresponding to the voice signal; the acoustic model comprises a TRM module, a BiGRU module, a fully connected layer FC and a CTC module connected in sequence; the TRM module comprises a multi-head self-attention layer, a first residual connection and layer normalization layer, a feed-forward layer and a second residual connection and layer normalization layer connected in sequence, wherein the feed-forward layer consists of a first linear layer and a second linear layer, the first linear layer has a ReLU activation function and no Dropout, and the second linear layer has no activation function; the number of multi-head attention heads is set to 5, and 5 different self-attention mapping representations, obtained by 5 different linear transformations of the input, are spliced together as the new feature encoding; the BiGRU module comprises a bidirectional gated recurrent unit network, in which the same training sequence [x1, x2, x3, x4, x5] is input front to back into a forward GRU to obtain the sequence [b1, b2, b3, b4, b5] and back to front into a backward GRU to obtain the sequence [a1, a2, a3, a4, a5], and the two sequences are spliced together element-wise to obtain the output sequence [y1, y2, y3, y4, y5]; the CTC module comprises a connectionist temporal classification layer; and the acoustic model is trained on air traffic control instruction phraseology voice samples labeled with Chinese characters.
10. The Chinese civil aviation air traffic control voice recognition system of claim 9, further comprising:
a silence-removal module, configured to perform silence removal on the voice signal;
a framing module, configured to frame the voice signal into a plurality of voice frames, adjacent voice frames overlapping by a set proportion;
a voice feature data determining module, configured to determine the voice feature data from the voice frames, wherein each item of voice feature data corresponds to a plurality of consecutive voice frames and the voice feature data is the Mel-frequency cepstral coefficients of the voice.
CN202110467893.0A 2021-04-28 2021-04-28 Chinese civil aviation air traffic control voice recognition method and system Active CN113160798B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110467893.0A CN113160798B (en) 2021-04-28 2021-04-28 Chinese civil aviation air traffic control voice recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110467893.0A CN113160798B (en) 2021-04-28 2021-04-28 Chinese civil aviation air traffic control voice recognition method and system

Publications (2)

Publication Number Publication Date
CN113160798A (en) 2021-07-23
CN113160798B (en) 2024-04-16

Family

ID=76872012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110467893.0A Active CN113160798B (en) 2021-04-28 2021-04-28 Chinese civil aviation air traffic control voice recognition method and system

Country Status (1)

Country Link
CN (1) CN113160798B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113823275A (en) * 2021-09-07 2021-12-21 广西电网有限责任公司贺州供电局 Voice recognition method and system for power grid dispatching
CN113821053A (en) * 2021-09-28 2021-12-21 中国民航大学 Flight assisting method and system based on voice recognition and relation extraction technology
CN115206293B (en) * 2022-09-15 2022-11-29 四川大学 Multi-task air traffic control voice recognition method and device based on pre-training
CN115359784B (en) * 2022-10-21 2023-01-17 成都爱维译科技有限公司 Civil aviation land-air voice recognition model training method and system based on transfer learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3172758A1 (en) * 2016-07-11 2018-01-18 FTR Labs Pty Ltd Method and system for automatically diarising a sound recording
US11862146B2 (en) * 2019-07-05 2024-01-02 Asapp, Inc. Multistream acoustic models with dilations

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6826146B1 (en) * 1999-06-02 2004-11-30 At&T Corp. Method for rerouting intra-office digital telecommunications signals
WO2021064907A1 (en) * 2019-10-02 2021-04-08 日本電信電話株式会社 Sentence generation device, sentence generation learning device, sentence generation method, sentence generation learning method, and program
CN110838288A (en) * 2019-11-26 2020-02-25 杭州博拉哲科技有限公司 Voice interaction method and system and dialogue equipment
CN110992943A (en) * 2019-12-23 2020-04-10 苏州思必驰信息科技有限公司 Semantic understanding method and system based on word confusion network
CN111063336A (en) * 2019-12-30 2020-04-24 天津中科智能识别产业技术研究院有限公司 End-to-end voice recognition system based on deep learning
CN111243591A (en) * 2020-02-25 2020-06-05 上海麦图信息科技有限公司 Air control voice recognition method introducing external data correction
CN112217947A (en) * 2020-10-10 2021-01-12 携程计算机技术(上海)有限公司 Method, system, device and storage medium for transcribing customer service telephone voice into text
CN112420024A (en) * 2020-10-23 2021-02-26 四川大学 Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device
CN112037773A (en) * 2020-11-05 2020-12-04 北京淇瑀信息科技有限公司 N-best spoken language semantic recognition method and device, and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"ATC Speech Recognition Based on Deep Learning"; Zhang Songfei; China Master's Theses Full-text Database, Information Science and Technology Series; pp. 19-39, 46-69 *
"Keyword Spotting System Based on Deep Neural Networks"; Sun Yannan, Xia Xiuyu; Computer Systems & Applications; 2018-05-15; Vol. 27, No. 05; pp. 41-48 *
"A Survey of Machine Reading Comprehension Techniques"; Xu Xiaoling, Zheng Jianli, Yin Ziming; Journal of Chinese Computer Systems; 2020-03-15; No. 03; full text *

Also Published As

Publication number Publication date
CN113160798A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN113160798B (en) Chinese civil aviation air traffic control voice recognition method and system
EP3680894B1 (en) Real-time speech recognition method and apparatus based on truncated attention, device and computer-readable storage medium
Lin et al. A unified framework for multilingual speech recognition in air traffic control systems
CN111666381B (en) Task type question-answer interaction system oriented to intelligent control
CN113223509B (en) Fuzzy statement identification method and system applied to multi-person mixed scene
CN111339750B (en) Spoken language text processing method for removing stop words and predicting sentence boundaries
CN111785257B (en) Air traffic control voice recognition method and device for a small number of labeled samples
CN113053366B (en) Multi-modal-fusion-based readback consistency verification method for air traffic control voice
CN110399850A (en) Continuous sign language recognition method based on deep neural networks
CN112420024B (en) Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device
CN115910066A (en) Intelligent dispatching command and operation system for regional power distribution network
CN114385802A (en) Empathetic dialogue generation method integrating topic prediction and emotion inference
CN117765981A (en) Emotion recognition method and system based on cross-modal fusion of voice text
CN114944150A (en) Dual-task-based Conformer land-air communication acoustic model construction method
Helmke et al. Readback error detection by automatic speech recognition and understanding
CN117591648A (en) Empathetic dialogue reply generation method for power grid customer service based on fine-grained emotion perception
CN118193702A (en) Intelligent man-machine interaction system and method for English teaching
Sankar et al. Multistream neural architectures for cued speech recognition using a pre-trained visual feature extractor and constrained ctc decoding
Gupta et al. CRIM's Speech Transcription and Call Sign Detection System for the ATC Airbus Challenge Task.
CN110390929A (en) Chinese and English civil aviation ground-air communication acoustic model construction method based on CDNN-HMM
Shi et al. An end-to-end conformer-based speech recognition model for mandarin radiotelephony communications in civil aviation
CN114360584A (en) Phoneme-level-based speech emotion layered recognition method and system
CN113642862A (en) Method and system for identifying named entities of power grid dispatching instructions based on BERT-MBIGRU-CRF model
CN113421593A (en) Voice evaluation method and device, computer equipment and storage medium
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant