CN115206293A - Multi-task air traffic control voice recognition method and device based on pre-training - Google Patents

Multi-task air traffic control voice recognition method and device based on pre-training

Info

Publication number
CN115206293A
Authority
CN
China
Prior art keywords
training
task
dimension
traffic control
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211118845.1A
Other languages
Chinese (zh)
Other versions
CN115206293B (en)
Inventor
张子宸
林毅
张建伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202211118845.1A priority Critical patent/CN115206293B/en
Publication of CN115206293A publication Critical patent/CN115206293A/en
Application granted granted Critical
Publication of CN115206293B publication Critical patent/CN115206293B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 — Segmentation; Word boundary detection
    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 — Training
    • G10L 15/08 — Speech classification or search
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/26 — Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a multi-task air traffic control (ATC) voice recognition method and device based on pre-training. The method first acquires and preprocesses air traffic control voice data to obtain a training sample data set, which is divided into a first-stage pre-training data set and a second-stage training data set. It then constructs an air traffic control speech coding model; inputs the first-stage pre-training data set into the speech coding model for pre-training; constructs a multi-task air traffic control voice recognition model on top of the pre-trained speech coding model; establishes a loss function for the multi-task model; and trains the multi-task model with this loss function and the second-stage training data set. Finally, sentence-segmented air traffic control voice data is fed into the trained multi-task model and the results are output. By training on fewer labeled samples, the invention achieves faster and more accurate speech recognition.

Description

Multi-task air traffic control voice recognition method and device based on pre-training
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a multi-task air traffic control voice recognition method and device based on pre-training.
Background
In reality, related problems are often interconnected, and multi-task learning exploits the information hidden in multiple related tasks to improve a model's generalization ability, letting the model learn better feature representations and improving the performance of every task. Moreover, because network parameters are shared among tasks, the results of multiple tasks can be obtained in a single inference pass, which significantly reduces the amount of training data and model parameters required and makes inference more efficient.
In recent years, more and more fields of artificial intelligence have turned to unsupervised pre-training. Unsupervised pre-training uses large amounts of unlabeled data to train a general network model with strong generalization ability; the model is then fine-tuned on a small amount of labeled data for different downstream tasks, so that excellent performance is ultimately obtained from relatively few labeled samples.
In the field of intelligent air traffic control, ATC speech annotated with multiple attribute labels can supply more information sources for ATC safety assistance measures and more information for post-hoc analysis. At present there is no good way to simultaneously provide text transcription and multi-attribute classification for air traffic control speech, so this application proposes a pre-training-based multi-task air traffic control voice recognition method and device to improve recognition performance in the ATC domain and to classify multiple attributes of ATC speech.
Disclosure of Invention
The invention aims to address the lack, in the prior art, of a good way to perform text recognition and multi-attribute classification on ground-air communication in real time, by providing a multi-task air traffic control voice recognition method and device based on pre-training.
To achieve this aim, the invention adopts the following technical scheme:
A multi-task air traffic control voice recognition method based on pre-training comprises the following steps:
S1, acquiring and preprocessing air traffic control voice data to obtain a training sample data set, which comprises a first-stage pre-training data set and a second-stage training data set that is manually annotated with text labels and auxiliary task attribute labels;
S2, constructing a pre-training-based air traffic control speech coding model;
S3, inputting the first-stage pre-training data set into the air traffic control speech coding model for pre-training;
S4, constructing a multi-task air traffic control voice recognition model based on the pre-trained air traffic control speech coding model;
S5, establishing a loss function of the multi-task air traffic control voice recognition model;
S6, training the multi-task air traffic control voice recognition model based on its loss function and the second-stage training data set;
and S7, inputting air traffic control voice data segmented by sentence into the trained multi-task air traffic control voice recognition model to obtain a text recognition result and auxiliary task recognition results.
As a preferred scheme of the present invention, in step S1 the air traffic control voice data are Chinese-English speech signals without text labels, and the step comprises:
S11, performing speech pre-emphasis and framing preprocessing on the air traffic control voice data, then segmenting the preprocessed data by sentence;
S12, taking all the segmented air traffic control voice data as the first-stage pre-training data set, where each training sample contains only a single-sentence audio file;
and S13, selecting part of the segmented air traffic control voice data for manual text labeling and auxiliary task attribute labeling as the second-stage training data set, where each training sample contains a single-sentence audio file, the corresponding text label and the attribute classification labels.
As a preferred scheme of the invention, the construction of the pre-training-based air traffic control speech coding model in step S2 comprises:
S21, establishing a convolution module consisting of one-dimensional convolution layers and activation function layers, and using it to extract the speech features of the training samples;
S22, establishing a context extraction module consisting of a deep neural network, using it to extract the context information of the speech features, recorded as:
h = {h_1, h_2, …, h_N},  h_i = Encoder_i(h_{i−1}),  h_0 = c,  h_i ∈ R^(1×T×f)

where c is the output of the convolution module, h is the set of hidden-layer features output by the layers of the neural network, h_i is the hidden-layer feature output by the i-th layer, N is the total number of layers of the deep neural network, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f);
S23, establishing an output module that stacks the outputs of the last d hidden layers of the context extraction module as the output of the air traffic control speech coding model, recorded as:

Y = [h_{N−d+1}; h_{N−d+2}; …; h_N],  Y ∈ R^(d×T×f)

where Y is the multi-layer feature output of the encoder, h is the hidden-layer feature output by each layer of the neural network, d is the number of stacked layers, N is the total number of layers of the deep neural network, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(d×T×f) denotes a feature of dimension (d, T, f).
As a preferred scheme of the invention, the step S4 of constructing the multi-task air traffic control voice recognition model comprises:
S41, constructing a multi-attention module and, based on it, the auxiliary task classifiers, including a speaker role classifier, an instruction language classifier, a speaker gender classifier and an instruction intention classifier;
and S42, constructing a multi-attention module and, based on it, the speech recognition classifier.
As a preferred scheme of the invention, the construction of the multi-attention module comprises the following steps:
constructing a hierarchical (level) attention module, performing an attention operation along the level dimension on the multi-layer feature output of the encoder to obtain a level-dimension attention matrix, and multiplying this attention matrix by the multi-layer feature output of the encoder; the result is recorded as:
LR = Att_L(Y) ⊗ Y,  LR ∈ R^(1×T×f)

where LR is the output of the hierarchical attention module, Y is the multi-layer feature output of the encoder, Att_L(·) is the attention operation in the level dimension, d is the number of layers of the deep neural network, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, R^(d×T×f) denotes a feature of dimension (d, T, f), and R^(1×T×f) denotes a feature of dimension (1, T, f);
constructing attention modules for the time and frequency dimensions, performing attention operations on the output of the hierarchical attention module along the time dimension and the frequency dimension respectively to obtain a time-dimension attention matrix and a frequency-dimension attention matrix, multiplying both by the output of the hierarchical attention module, and recording the result as:
LTFR = Att_T(LR) ⊗ Att_F(LR) ⊗ LR,  LTFR ∈ R^(1×T×f)

where LTFR is the output of the time- and frequency-dimension attention modules, LR is the output of the hierarchical attention module, Att_T(·) is the attention operation in the time dimension, Att_F(·) is the attention operation in the frequency dimension, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f).
As a preferred scheme of the present invention, the construction of the auxiliary task classifiers in S41 further comprises:
S411, inputting the output of the multi-attention module into the speech recognition classifier;
and S412, inputting the output of the multi-attention module into a fully connected layer to obtain the auxiliary task classification result.
As a preferred embodiment of the present invention, the constructing of the speech recognition classifier in S42 further includes:
S421, adding the output of the speech recognition classifier's multi-attention module to the multi-attention outputs of all the auxiliary task classifiers to obtain a speech feature containing multiple kinds of speech information, recorded as:
X_ASR = LTFR_ASR + Σ_{i=1}^{n} LTFR_aux_i,  X_ASR ∈ R^(1×T×f)

where X_ASR is the speech feature containing multiple kinds of speech information, LTFR_ASR is the multi-attention module output of the speech recognition classifier, LTFR_aux_i is the multi-attention module output of the i-th auxiliary task classifier, n is the number of auxiliary task classifiers, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f);
S422, inputting the speech feature containing multiple kinds of speech information into the fully connected layer to obtain the text recognition result.
As a preferred scheme of the present invention, in step S5 the loss function of the multi-task air traffic control voice recognition model is constructed as a weighted sum of the loss function of the speech recognition classifier and the loss functions of the auxiliary task classifiers, with the weight of each task loss adjusted as a parameter during model training. The speech recognition classifier uses the connectionist temporal classification (CTC) loss, the auxiliary task classifiers all use the cross-entropy loss, and the loss function L of the multi-task air traffic control voice recognition model is recorded as:
L = λ_ASR · L_ASR + Σ_{i=1}^{n} λ_aux_i · L_aux_i

where L_ASR and L_aux_i denote the loss values of the speech recognition classifier and of the i-th auxiliary task classifier respectively, λ_ASR and λ_aux_i denote the weights of the speech recognition loss and of the i-th auxiliary task loss respectively, and n is the number of auxiliary task classifiers.
As a preferred scheme of the present invention, the training in step S6 is loop iteration training, and the single loop iteration training process is:
S61, selecting a group of training samples from the second-stage training data set;
S62, inputting the training samples into the multi-task air traffic control voice recognition model and outputting the text recognition result and the auxiliary task classification results;
and S63, adjusting the parameters of the multi-task air traffic control voice recognition model based on its loss function.
A device based on the pre-training multi-task air traffic control voice recognition method comprises at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor so that the at least one processor can perform any of the methods described above.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. By exploiting the shared information among multiple speech-related tasks, the neural network learns more shared speech features, improving the performance of every task. Meanwhile, because multi-task learning shares network parameters among related tasks, training and prediction are more efficient. Finally, the losses of the individual tasks are balanced in a learned manner, which speeds up training and further improves the accuracy of each task.
2. The method pre-trains the air traffic control speech coding model by self-supervised learning on as much speech data as possible, extracting the common characteristics of air traffic control speech, so that the speech coding model learns better speech feature representations, improving the model's generalization ability and the accuracy of downstream tasks. At the same time, good performance can be achieved with only a few labeled air traffic control speech samples.
3. By adopting the multi-attention module, hidden layer characteristic information of different layers in the voice coding process can be fully utilized, more important time sequence and frequency dimension information can be captured, more effective voice representation information is provided for downstream tasks, and the performance of each task is improved.
In conclusion, the method recognizes air traffic control speech faster and more accurately, needs fewer labeled samples during training, and can provide in real time classification information such as speaker role, language and instruction intention for the corresponding speech, supplying more information sources for ATC safety assistance measures and more information for post-hoc analysis.
Drawings
FIG. 1 is a model structure diagram of the air traffic control speech coding method of the present invention.
FIG. 2 is a block diagram of the multi-attention module according to the present invention.
FIG. 3 is a model structure diagram of the multi-task air traffic control voice recognition method of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1
To perform text transcription and multi-attribute classification of ground-air communication in real time, this embodiment provides a pre-training-based multi-task air traffic control voice recognition method and device, where the device comprises at least one processor and a memory communicatively connected to the at least one processor.
The multi-task air traffic control voice recognition method based on pre-training comprises the following steps:
s1, acquiring air traffic control voice data, and preprocessing to obtain a training sample data set, wherein the training sample data set comprises a pre-training data set in a first stage and a training data set in a second stage, wherein text labeling and auxiliary task attribute labeling are carried out manually;
specifically, firstly, a Chinese-English voice signal without a text label under the ground-air communication environment is obtained through an air traffic control internal call system, chinese-English air traffic control ground-air communication voice is recorded in real time from a ground-air communication voice recorder by using a multi-channel voice signal acquisition device, and the voice is filtered, sampled and PCM encoded to form air traffic control voice data with 8K sampling rate and 16bit sampling precision.
Next, the acquired air traffic control voice data is preprocessed in real time, including speech pre-emphasis and framing, and the preprocessed data is manually segmented by sentence into instruction speech segments, each containing the instructions of a single speaker, which are stored in memory as wav files. All these speech files form the first-stage pre-training data set, in which each training sample contains only a single-sentence audio file;
finally, about 50 hours of air traffic control voice data are randomly selected from the first-stage pre-training data set and manually annotated with text labels and the attribute labels of several auxiliary tasks; the annotations are stored as json files, and the speech and label files are organized into the second-stage training data set, in which each training sample contains a single-sentence audio file, the corresponding text label and the multi-task classification labels;
the auxiliary tasks comprise speaker role classification, instruction language classification, speaker gender classification and instruction intention classification. The speaker roles are air traffic controller and pilot, the instruction languages are Chinese and English, the speaker genders are male and female, and the instruction intentions are altitude or heading change instructions such as climb, descend, turn left and turn right.
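As an illustration of this preprocessing stage, the following sketch applies pre-emphasis and framing to a mono waveform in Python; the 0.97 pre-emphasis coefficient and the 25 ms / 10 ms frame parameters are common defaults assumed here, not values fixed by the patent.

import numpy as np

def pre_emphasis(signal: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    # Boost high frequencies: y[t] = x[t] - coeff * x[t-1]
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def frame_signal(signal: np.ndarray, sample_rate: int = 8000,
                 frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    # Split a waveform into overlapping frames of shape (num_frames, frame_len)
    frame_len = int(sample_rate * frame_ms / 1000)   # 200 samples at 8 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 80 samples at 8 kHz
    num_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    idx = (np.arange(frame_len)[None, :]
           + hop_len * np.arange(num_frames)[:, None])
    return signal[idx]

# Example: a stand-in for a 3-second, 8 kHz, 16-bit PCM clip scaled to [-1, 1]
wav = np.random.randn(8000 * 3).astype(np.float32)
frames = frame_signal(pre_emphasis(wav))
print(frames.shape)   # (num_frames, 200)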
S2, constructing an air traffic control speech coding model based on pre-training;
Specifically, the structure of the air traffic control speech coding model is shown in fig. 1; it consists of 1 convolution module, 1 context extraction module and 1 output module, as follows:
S21, a convolution module consisting of 7 one-dimensional convolution layers (Conv1d Layer) with activation function layers (GELU), used to extract the speech features of the input training samples, where each convolution layer uses 512 convolution kernels of size 1×3 with a stride of 2;
S22, a context extraction module composed of a deep neural network, used to extract the context information of the speech features, recorded as:

h = {h_1, h_2, …, h_N},  h_i = Encoder_i(h_{i−1}),  h_0 = c,  h_i ∈ R^(1×T×f)

where c is the output of the convolution module, h is the set of hidden-layer features output by the layers of the neural network, h_i is the hidden-layer feature output by the i-th layer, N is the total number of layers of the deep neural network, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f);
where each hidden layer adopts the Transformer encoder structure.
S23, an output module that stacks the outputs of the last d hidden layers of the context extraction module as the model output, serving as the input to the downstream speech recognition classifier and all auxiliary task classifiers so that each classifier can draw on information from more dimensions; it is recorded as:

Y = [h_{N−d+1}; h_{N−d+2}; …; h_N],  Y ∈ R^(d×T×f)

where Y is the multi-layer feature output of the encoder, composed of the stacked outputs of the last d hidden layers; d is the number of stacked layers, N is the total number of layers of the deep neural network, h is the hidden-layer feature output by each layer, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(d×T×f) denotes a feature of dimension (d, T, f).
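As a sketch of this encoder (not the patent's exact implementation), the PyTorch module below chains seven Conv1d+GELU blocks into a stack of Transformer encoder layers and returns the stacked outputs of the last d hidden layers; the 12-layer depth, the 768-dimensional hidden size and the 8 attention heads are illustrative assumptions beyond the stated 7 convolution layers with 512 kernels of size 3 and stride 2.

import torch
import torch.nn as nn

class ATCSpeechEncoder(nn.Module):
    # Conv feature extractor + Transformer context network; the forward pass
    # returns the stacked outputs of the last d hidden layers: (B, d, T, f).
    def __init__(self, n_layers: int = 12, d_last: int = 6,
                 conv_dim: int = 512, hidden_dim: int = 768):
        super().__init__()
        convs, in_ch = [], 1
        for _ in range(7):                       # 7 one-dimensional conv layers
            convs += [nn.Conv1d(in_ch, conv_dim, kernel_size=3, stride=2),
                      nn.GELU()]
            in_ch = conv_dim
        self.conv = nn.Sequential(*convs)
        self.proj = nn.Linear(conv_dim, hidden_dim)
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8,
                                        batch_first=True)
             for _ in range(n_layers)])
        self.d_last = d_last

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        c = self.conv(wav.unsqueeze(1)).transpose(1, 2)  # (B, T, conv_dim)
        h = self.proj(c)
        hiddens = []
        for layer in self.layers:                # h_i = Encoder_i(h_{i-1})
            h = layer(h)
            hiddens.append(h)
        return torch.stack(hiddens[-self.d_last:], dim=1)  # Y: (B, d, T, f)

encoder = ATCSpeechEncoder()
print(encoder(torch.randn(2, 16000)).shape)   # torch.Size([2, 6, T, 768])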
And S3, inputting the first-stage pre-training data set to pre-train the air traffic control speech coding model in a self-supervised learning manner and extract the common characteristics of air traffic control voice data, so that the speech coding model learns better speech feature representations, improving the model's generalization ability and the accuracy of downstream tasks, while requiring only a few labeled air traffic control speech samples for good performance.
Specifically, the pre-training method may follow wav2vec 2.0; pre-training is a loop of iterations, and a single iteration performs the following steps:
S31, selecting a group of training samples from the first-stage pre-training data set and inputting them into the air traffic control speech coding model, whose convolution module extracts the hidden-layer features of the training samples;
S32, mapping the hidden-layer features obtained in S31 into quantized hidden-layer features through a Gumbel-softmax quantization module;
S33, randomly masking the hidden-layer features obtained in S31 and feeding the masked features into the context extraction module to produce context features;
S34, constructing a contrastive learning loss over the context features output in S33 at the masked positions, where the positive sample for each masked position is the quantized feature at the same position from S32 and the negative samples are quantized features drawn from other masked positions;
and S35, updating the parameters through back propagation.
Further, the pre-training method can also adopt wav2vec, vq-wav2vec and the like.
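For concreteness, a single iteration of this pre-training loop might look like the simplified sketch below. It only approximates the wav2vec 2.0 objective — the diversity loss, the codebook details and proper within-utterance negative sampling are omitted — and model.extract_features, model.context_network and quantizer are assumed interfaces rather than the patent's actual code.

import torch
import torch.nn.functional as F

def contrastive_pretrain_step(model, quantizer, wav_batch, optimizer,
                              mask_prob=0.065, num_negatives=100, temp=0.1):
    # One simplified wav2vec 2.0-style update: mask latent frames, then
    # identify the quantized target at each masked position among negatives.
    z = model.extract_features(wav_batch)          # (B, T, C) conv latents
    q = quantizer(z)                               # (B, T, C) quantized targets

    mask = torch.rand(z.shape[:2], device=z.device) < mask_prob    # (B, T)
    z_masked = z.masked_fill(mask.unsqueeze(-1), 0.0)  # crude mask embedding
    ctx = model.context_network(z_masked)          # (B, T, C) context features

    anchors, positives = ctx[mask], q[mask]        # (M, C) each
    flat_q = q.reshape(-1, q.size(-1))             # candidate pool for negatives
    neg_idx = torch.randint(0, flat_q.size(0),
                            (anchors.size(0), num_negatives), device=z.device)
    negatives = flat_q[neg_idx]                    # (M, K, C)

    cand = torch.cat([positives.unsqueeze(1), negatives], dim=1)  # (M, 1+K, C)
    logits = F.cosine_similarity(anchors.unsqueeze(1), cand, dim=-1) / temp
    target = torch.zeros(logits.size(0), dtype=torch.long, device=z.device)
    loss = F.cross_entropy(logits, target)         # positive sits at index 0

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                               # S35: back-propagation update
    return loss.item()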
S4, constructing a multi-task air traffic control voice recognition model based on the air traffic control speech coding model;
The structure of the multi-task air traffic control voice recognition model is shown in fig. 2: it consists of the air traffic control speech coding model, several auxiliary task classifiers and a speech recognition classifier. The auxiliary task classifiers and the speech recognition classifier share the speech coding model, whose output serves as the input of every classifier. The specific steps are as follows:
Constructing the multiple attention module (LTFAtt), shown in fig. 3, as follows:
First, a hierarchical (level) attention module is constructed: an attention operation is performed along the level dimension on the multi-layer feature output of the encoder, yielding a level-dimension attention matrix, which is multiplied by the multi-layer feature output of the encoder; the result is recorded as:
LR = Att_L(Y) ⊗ Y,  LR ∈ R^(1×T×f)

where LR is the output of the hierarchical attention module, Y is the multi-layer feature output of the encoder, Att_L(·) is the attention operation in the level dimension, d is the number of layers of the deep neural network, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, R^(d×T×f) denotes a feature of dimension (d, T, f), and R^(1×T×f) denotes a feature of dimension (1, T, f). Att_L(·) adopts a neural network structure comprising two fully connected layers and a Sigmoid activation function.
Second, attention modules for the time and frequency dimensions are constructed: attention operations are performed on the output of the hierarchical attention module along the time dimension and the frequency dimension respectively, yielding a time-dimension attention matrix and a frequency-dimension attention matrix, which are multiplied by the output of the hierarchical attention module; the result is recorded as:
LTFR = Att_T(LR) ⊗ Att_F(LR) ⊗ LR,  LTFR ∈ R^(1×T×f)

where LTFR is the output of the time- and frequency-dimension attention modules, LR is the output of the hierarchical attention module, Att_T(·) is the attention operation in the time dimension, Att_F(·) is the attention operation in the frequency dimension, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f). Att_T(·) adopts a neural network structure comprising a global average pooling layer, two fully connected layers and a Sigmoid activation function; Att_F(·) adopts a neural network structure comprising two fully connected layers and a Sigmoid activation function.
By adopting the multi-attention module, hidden-layer feature information from different levels of the speech coding process is fully exploited, and time- and frequency-dimension information is captured at the same time, providing more effective speech representation information for the downstream tasks.
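Under the structures stated above (Att_L: two fully connected layers + Sigmoid; Att_T: global average pooling + two fully connected layers + Sigmoid; Att_F: two fully connected layers + Sigmoid), a minimal PyTorch sketch of the LTFAtt module might look as follows; how each pooling is placed, the level-attention input and the hidden sizes are illustrative assumptions rather than details fixed by the patent.

import torch
import torch.nn as nn

class LTFAtt(nn.Module):
    # Level + time + frequency attention over stacked encoder features
    # Y in (B, d, T, f), producing LTFR in (B, 1, T, f).
    def __init__(self, d_layers: int, feat_dim: int, reduction: int = 8):
        super().__init__()
        # Att_L: two FC layers + Sigmoid over the level (layer) dimension
        self.att_level = nn.Sequential(
            nn.Linear(d_layers, d_layers), nn.ReLU(),
            nn.Linear(d_layers, d_layers), nn.Sigmoid())
        # Att_T: average-pool over f, then two FC layers + Sigmoid per frame
        self.att_time = nn.Sequential(
            nn.Linear(1, 16), nn.ReLU(),
            nn.Linear(16, 1), nn.Sigmoid())
        # Att_F: two FC layers + Sigmoid over the feature (frequency) dimension
        self.att_freq = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // reduction), nn.ReLU(),
            nn.Linear(feat_dim // reduction, feat_dim), nn.Sigmoid())

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (B, d, T, f) -- stacked outputs of the last d encoder layers
        w_lvl = self.att_level(y.mean(dim=(2, 3)))               # (B, d)
        lr = (y * w_lvl[:, :, None, None]).sum(1, keepdim=True)  # (B, 1, T, f)
        w_t = self.att_time(lr.mean(dim=3, keepdim=True))        # (B, 1, T, 1)
        w_f = self.att_freq(lr.mean(dim=2, keepdim=True))        # (B, 1, 1, f)
        return lr * w_t * w_f                                    # LTFR

att = LTFAtt(d_layers=6, feat_dim=768)
print(att(torch.randn(2, 6, 50, 768)).shape)   # torch.Size([2, 1, 50, 768])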
S41, after the trained air traffic control speech coding model, constructing the auxiliary task classifiers based on the multi-attention module, including a speaker role classifier, an instruction language classifier, a speaker gender classifier and an instruction intention classifier; all auxiliary task classifiers share the same structure but have independent parameters. Specifically:
First, a multi-attention module is constructed that learns, from the combined level, time and frequency attention mechanisms, the information most important to the recognition result, optimizing the attention parameters by learning. Second, the attention output of the multi-attention module is fed into the speech recognition classifier, providing it with internal representations of the various tasks. Furthermore, the attention output of the multi-attention module is fed into a fully connected layer, and the class with the highest probability is taken as the auxiliary task classification result.
S42, after the trained air traffic control speech coding model, constructing the speech recognition classifier based on the multi-attention module, specifically as follows:
Similarly, a multi-attention module is first constructed that learns the information most important to the recognition result from the combined level, time and frequency attention mechanisms, optimizing the attention parameters by learning; then the attention output of this multi-attention module is added to the multi-attention outputs of all the auxiliary task classifiers, yielding a speech feature containing multiple kinds of speech information, recorded as:
X_ASR = LTFR_ASR + Σ_{i=1}^{n} LTFR_aux_i,  X_ASR ∈ R^(1×T×f)

where X_ASR is the speech feature containing multiple kinds of speech information, LTFR_ASR is the multi-attention module output of the speech recognition classifier, LTFR_aux_i is the multi-attention module output of the i-th auxiliary task classifier, n is the number of auxiliary task classifiers, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f);
and finally, the speech feature containing multiple kinds of speech information is fed into a fully connected layer to obtain the text recognition result corresponding to each speech frame.
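Reusing the LTFAtt sketch above, the fusion X_ASR = LTFR_ASR + Σ LTFR_aux_i and the fully connected heads could be wired together as below; the vocabulary size of 668 mirrors embodiment 2, while the auxiliary class counts and the mean-pooling over time before each auxiliary head are assumptions.

import torch
import torch.nn as nn

class MultiTaskATCModel(nn.Module):
    # Shared encoder + per-task LTFAtt modules + fully connected heads.
    def __init__(self, encoder, d_layers=6, feat_dim=768, vocab_size=668,
                 aux_classes=(2, 2, 2, 4)):   # role, language, gender, intent
        super().__init__()
        self.encoder = encoder
        self.asr_att = LTFAtt(d_layers, feat_dim)
        self.aux_atts = nn.ModuleList(
            [LTFAtt(d_layers, feat_dim) for _ in aux_classes])
        self.asr_head = nn.Linear(feat_dim, vocab_size)
        self.aux_heads = nn.ModuleList(
            [nn.Linear(feat_dim, c) for c in aux_classes])

    def forward(self, wav: torch.Tensor):
        y = self.encoder(wav)                             # (B, d, T, f)
        aux_feats = [att(y) for att in self.aux_atts]     # each (B, 1, T, f)
        x_asr = self.asr_att(y) + sum(aux_feats)          # X_ASR fusion
        asr_logits = self.asr_head(x_asr.squeeze(1))      # (B, T, vocab), per frame
        aux_logits = [head(f.squeeze(1).mean(dim=1))      # pool over time
                      for head, f in zip(self.aux_heads, aux_feats)]
        return asr_logits, aux_logits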
S5, establishing a loss function for the multi-task air traffic control voice recognition model that accounts for speech recognition and the auxiliary classification tasks simultaneously;
Specifically, the loss function of the speech recognition classifier uses the connectionist temporal classification loss (CTCLoss), and the loss functions of the auxiliary task classifiers all use the cross-entropy loss (CrossEntropyLoss). The loss function of the multi-task air traffic control voice recognition model is constructed as a weighted sum of the speech recognition loss and the auxiliary task losses, with the weight of each task loss adjusted as a parameter during model training; the loss function L of the multi-task model is recorded as:
L = λ_ASR · L_ASR + Σ_{i=1}^{n} λ_aux_i · L_aux_i

where L_ASR and L_aux_i denote the loss values of the speech recognition classifier and of the i-th auxiliary task classifier respectively, λ_ASR and λ_aux_i denote the weights of the speech recognition loss and of the i-th auxiliary task loss respectively, and n is the number of auxiliary task classifiers. The weights of the speech recognition loss and of all auxiliary task losses are determined in a learned manner and optimized together during model training.
Further, each weight is determined by a corresponding uncertainty variable σ_i. Each uncertainty variable is a scalar that is updated and optimized during training and determines the corresponding loss weight, e.g. in the standard homoscedastic-uncertainty form

λ_i = 1 / (2σ_i²)

with an additional log σ_i term added to the total loss so that the uncertainty variables cannot grow without bound.
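A minimal sketch of this learned weighting, assuming the common homoscedastic-uncertainty form in which each weight is exp(−s_i) for a learnable s_i = log σ_i² and s_i itself is added as a regularizer; the exact formula used by the patent may differ, so treat this as one plausible instantiation.

import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    # L = sum_i exp(-s_i) * L_i + s_i, with s_i = log(sigma_i^2) learnable.
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = torch.zeros((), device=self.log_vars.device)
        for s, loss in zip(self.log_vars, task_losses):
            total = total + torch.exp(-s) * loss + s
        return total

# Usage with one CTC loss and four cross-entropy auxiliary losses:
criterion = UncertaintyWeightedLoss(num_tasks=5)
dummy_losses = [torch.tensor(1.0, requires_grad=True) for _ in range(5)]
print(criterion(dummy_losses))   # scalar total loss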
S6, training the multi-task air traffic control voice recognition model until the network converges, based on its loss function and the second-stage training data set, using loop-iteration training; a single iteration performs the following operations:
S61, randomly selecting a group of training samples from the second-stage training data set;
S62, inputting the training samples into the multi-task air traffic control voice recognition model and outputting the speech recognition result and each auxiliary task classification result;
and S63, adjusting the relevant parameters of the multi-task air traffic control voice recognition model using its loss function.
And S7, acquiring Chinese-English speech signals without text labels from the real-time ground-air communication environment, segmenting them by sentence into air traffic control voice data, and inputting the data into the trained multi-task air traffic control voice recognition model to obtain the text recognition result and the multi-task attribute classification results.
Specifically, the air traffic control voice data is fed into the trained multi-task voice recognition model, which outputs the multi-task label probabilities; the class with the highest probability is taken as each auxiliary task classification result. The model further predicts the text label probabilities for each speech frame, and the instruction text is decoded and output according to the maximum probability.
In conclusion, the invention combines self-supervised pre-training, multi-attention and multi-task learning mechanisms, and designs an end-to-end, Chinese-English mixed, deep-learning-based multi-task air traffic control voice recognition method and model, improving speech recognition accuracy in the ATC scenario while performing multi-task attribute classification in real time, and providing more usable information for ATC post-hoc analysis and other downstream applications.
Example 2
The feasibility and performance of the technical scheme of embodiment 1 are verified as follows:
First, data are prepared with the acquisition scheme of embodiment 1: Chinese-English speech signals without text labels from the ground-air communication environment are obtained through the ATC internal communication system, the first-stage pre-training data set and second-stage training data set are built, and training, validation and test sets are formed by random selection.
The first-stage pre-training data set is as follows:
the training set comprises 774,083 utterances totaling 640.40 hours, and the validation set comprises 7,749 utterances totaling 6.40 hours;
the second stage training dataset is:
the training set comprises 58,432 utterances totaling 53.56 hours, of which 43,178 Chinese utterances total 37.00 hours and 15,254 English utterances total 16.56 hours; the test set comprises 1,603 utterances totaling 1.45 hours, of which 1,202 Chinese utterances total 1.01 hours and 401 English utterances total 0.44 hours. In the second training stage the vocabulary contains 668 characters in total, including 641 Chinese characters, 26 English letters and the space.
The test results of embodiment 2 are obtained by performing speech recognition on the test set.
Second, a baseline model is established: this embodiment uses a wav2vec 2.0 model as the air traffic control speech coding model, followed by a speech recognition classifier consisting only of a fully connected layer, to verify effectiveness; the model input is the raw waveform of the speech file.
The baseline model and the method of embodiment 1 are implemented with the PyTorch framework; the hyper-parameter configuration for model training is as follows:
learning rate: the initial learning rate is set to 1e-5 with a tri-stage learning-rate schedule (tri-stage lr schedule): warm-up over the first 10% of updates, a constant rate for the next 40%, and linear decay over the remainder (a sketch of this schedule follows the list);
batch size: 8.
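The tri-stage schedule above (10% warm-up, 40% hold, linear decay on the rest, base learning rate 1e-5) could be realized with a LambdaLR multiplier as in this sketch; the decay floor final_scale is an assumption.

import torch

def tri_stage_lambda(total_steps, warmup_frac=0.10, hold_frac=0.40,
                     final_scale=0.05):
    # Multiplier on the base LR: linear warm-up, constant hold, linear decay.
    warmup = int(total_steps * warmup_frac)
    hold = int(total_steps * hold_frac)

    def fn(step):
        if step < warmup:
            return step / max(1, warmup)
        if step < warmup + hold:
            return 1.0
        progress = (step - warmup - hold) / max(1, total_steps - warmup - hold)
        return max(final_scale, 1.0 - (1.0 - final_scale) * progress)
    return fn

model = torch.nn.Linear(10, 10)                      # stand-in model
opt = torch.optim.Adam(model.parameters(), lr=1e-5)  # initial LR as above
sched = torch.optim.lr_scheduler.LambdaLR(opt, tri_stage_lambda(100000))
# call opt.step() then sched.step() once per update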
The experimental hardware environment is: CPU 2× Intel Xeon E5-2680 v4, GPU 1× NVIDIA GeForce RTX 2080 Ti with 11 GB of memory, 128 GB of RAM, running Ubuntu 16.04.
Under the above training data and configuration, 8 experiments A1-A8 were performed, as follows:
a1: training the baseline model only on the second stage training data set to complete the speech recognition task;
a2: a pre-training learning mechanism is added during the training of the baseline model, firstly, the voice coding model part of the baseline model is pre-trained, and then, the baseline model is trained in the second stage so as to complete the voice recognition task;
a3: a multi-task learning mechanism is added during the training of the baseline model, and training is only carried out on the training data set of the second stage so as to complete the voice recognition and multi-attribute classification tasks;
a4: when the baseline model is trained, a multi-attention module is added, and only the training is carried out on the second-stage training data set to complete the voice recognition task, wherein the number of hidden layers used in the voice coding model is 6;
a5: a multi-task learning mechanism and a multi-attention module are added during the training of the baseline model, and training is only carried out on a second-stage training data set to complete the voice recognition and multi-attribute classification tasks, wherein the number of hidden layers used in the voice coding model is 6;
a6: pre-training and multi-task learning mechanisms are added when training the baseline model: the speech coding part of the baseline model is first pre-trained, and the baseline model then undergoes second-stage training to complete the speech recognition and multi-attribute classification tasks;
a7: a pre-training learning mechanism and the multi-attention module are added when training the baseline model: the speech coding part is first pre-trained, and the baseline model then undergoes second-stage training to complete the speech recognition task, with 6 hidden layers used in the speech coding model;
a8: pre-training, the multi-task learning mechanism and the multi-attention module are all added when training the baseline model: the speech coding part is first pre-trained, and the baseline model then undergoes second-stage training to complete the speech recognition and multi-attribute classification tasks, with 6 hidden layers used in the speech coding model;
the auxiliary task result in the experimental result is measured by accuracy, namely the proportion of the correctly classified samples to the total samples, and the correctness of the speech recognition is the character error rate based on Chinese characters and English lettersCER(charcter error rate) as follows:
Figure DEST_PATH_IMAGE032
wherein the content of the first and second substances,Nfor the length of the real-text label,I、D、Srepresenting the insertion, deletion and replacement operands required to convert the predictive text label to a real label, respectively.
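The CER defined above can be computed with a standard Levenshtein dynamic program over characters, as in this sketch:

def cer(reference: str, hypothesis: str) -> float:
    # Character error rate: (I + D + S) / N via edit distance over characters.
    n, m = len(reference), len(hypothesis)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                              # deletions
    for j in range(m + 1):
        dp[0][j] = j                              # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[n][m] / max(1, n)

print(cer("climb to eight thousand", "climb to eight thousand"))  # 0.0
print(cer("turn left heading 210", "turn left heading 120"))      # ~0.095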
In summary, the verification of the technical scheme considers only the performance of the acoustic model, without language model processing or optimization; the final results are compared in Table 1.
TABLE 1 (comparison of results for experiments A1-A8; table image not reproduced)
The experimental results show that, compared with the baseline model, the pre-training learning mechanism, the multi-task learning mechanism and the multi-attention module proposed here each improve the performance of the speech recognition model on this embodiment's data set. Compared with methods without it, introducing the pre-training learning mechanism yields the largest performance gain, indicating that on the air traffic control data set pre-training learns a more robust speech feature representation that ultimately supports ATC speech recognition research. Introducing the multi-task learning and multi-attention modules further improves speech recognition performance to a certain degree, and with all three mechanisms introduced, the baseline model achieves the best speech recognition performance on this data set.
In conclusion, the pre-training and multi-task learning mechanisms and the multi-attention module adopted by the method greatly promote the performance of the air traffic control speech recognition model and improve its convergence efficiency.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A multi-task air traffic control voice recognition method based on pre-training, characterized by comprising the following steps:
S1, acquiring and preprocessing air traffic control voice data to obtain a training sample data set, which comprises a first-stage pre-training data set and a second-stage training data set that is manually annotated with text labels and auxiliary task attribute labels;
S2, constructing a pre-training-based air traffic control speech coding model;
S3, inputting the first-stage pre-training data set into the air traffic control speech coding model for pre-training;
S4, constructing a multi-task air traffic control voice recognition model based on the pre-trained air traffic control speech coding model;
S5, establishing a loss function of the multi-task air traffic control voice recognition model;
S6, training the multi-task air traffic control voice recognition model based on its loss function and the second-stage training data set;
and S7, inputting real-time ground-air communication voice data segmented by sentence into the trained multi-task air traffic control voice recognition model to obtain a text recognition result and auxiliary task recognition results.
2. The pre-training-based multi-task air traffic control voice recognition method according to claim 1, wherein the air traffic control voice data in step S1 are Chinese-English speech signals without text labels, and step S1 comprises:
S11, performing speech pre-emphasis and framing preprocessing on the air traffic control voice data, then segmenting the preprocessed data by sentence;
S12, taking all the segmented air traffic control voice data as the first-stage pre-training data set, where each training sample contains only a single-sentence audio file;
and S13, selecting part of the segmented air traffic control voice data for manual text labeling and auxiliary task attribute labeling as the second-stage training data set, where each training sample contains a single-sentence audio file, the corresponding text label and the attribute classification labels.
3. The method for multi-task air traffic control speech recognition based on pre-training according to claim 1, wherein the step S2 of constructing the pre-training-based air traffic control speech coding model comprises:
S21, establishing a convolution module consisting of one-dimensional convolution layers and activation function layers, and using it to extract the speech features of the training samples;
S22, establishing a context extraction module consisting of a deep neural network, using it to extract the context information of the speech features, recorded as:
h = {h_1, h_2, …, h_N},  h_i = Encoder_i(h_{i−1}),  h_0 = c,  h_i ∈ R^(1×T×f)

where c is the output of the convolution module, h is the set of hidden-layer features output by the layers of the neural network, h_i is the hidden-layer feature output by the i-th layer, N is the total number of layers of the deep neural network, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f);
S23, establishing an output module that stacks the outputs of the last d hidden layers of the context extraction module as the output of the air traffic control speech coding model, recorded as:

Y = [h_{N−d+1}; h_{N−d+2}; …; h_N],  Y ∈ R^(d×T×f)

where Y is the multi-layer feature output of the encoder, h is the hidden-layer feature output by each layer of the neural network, d is the number of stacked layers, N is the total number of layers of the deep neural network, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(d×T×f) denotes a feature of dimension (d, T, f).
4. The multi-task air traffic control voice recognition method based on pre-training according to claim 1, wherein the step S4 of constructing the multi-task air traffic control voice recognition model comprises:
S41, constructing a multi-attention module and, based on it, the auxiliary task classifiers, including a speaker role classifier, an instruction language classifier, a speaker gender classifier and an instruction intention classifier;
and S42, constructing a multi-attention module and, based on it, the speech recognition classifier.
5. The pre-training-based multi-task air traffic control voice recognition method according to claim 4, wherein the construction of the multiple attention module comprises the following steps:
constructing a hierarchical (level) attention module, performing an attention operation along the level dimension on the multi-layer feature output of the encoder to obtain a level-dimension attention matrix, and multiplying this attention matrix by the multi-layer feature output of the encoder; the result is recorded as:
LR = Att_L(Y) ⊗ Y,  LR ∈ R^(1×T×f)

where LR is the output of the hierarchical attention module, Y is the multi-layer feature output of the encoder, Att_L(·) is the attention operation in the level dimension, d is the number of layers of the deep neural network, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, R^(d×T×f) denotes a feature of dimension (d, T, f), and R^(1×T×f) denotes a feature of dimension (1, T, f);
constructing attention modules for the time and frequency dimensions, performing attention operations on the output of the hierarchical attention module along the time dimension and the frequency dimension respectively to obtain a time-dimension attention matrix and a frequency-dimension attention matrix, multiplying both by the output of the hierarchical attention module, and recording the result as:
LTFR = Att_T(LR) ⊗ Att_F(LR) ⊗ LR,  LTFR ∈ R^(1×T×f)

where LTFR is the output of the time- and frequency-dimension attention modules, LR is the output of the hierarchical attention module, Att_T(·) is the attention operation in the time dimension, Att_F(·) is the attention operation in the frequency dimension, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f).
6. The pre-training-based multi-task air traffic control voice recognition method according to claim 4, wherein the construction of the auxiliary task classifiers in S41 further comprises:
S411, inputting the output of the multi-attention module into the speech recognition classifier;
and S412, inputting the output of the multi-attention module into a fully connected layer to obtain the auxiliary task classification result.
7. The method according to claim 4, wherein the constructing of the speech recognition classifier in S42 further comprises:
and S421, adding the output of the speech recognition classifier's multi-attention module to the multi-attention outputs of all the auxiliary task classifiers to obtain a speech feature containing multiple kinds of speech information, recorded as:
X_ASR = LTFR_ASR + Σ_{i=1}^{n} LTFR_aux_i,  X_ASR ∈ R^(1×T×f)

where X_ASR is the speech feature containing multiple kinds of speech information, LTFR_ASR is the multi-attention module output of the speech recognition classifier, LTFR_aux_i is the multi-attention module output of the i-th auxiliary task classifier, n is the number of auxiliary task classifiers, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f);
S422, inputting the speech feature containing multiple kinds of speech information into the fully connected layer to obtain the text recognition result.
8. The method according to claim 1, wherein in step S5 the loss function of the multi-task air traffic control voice recognition model is constructed as a weighted sum of the loss function of the speech recognition classifier and the loss functions of the auxiliary task classifiers, with the weight of each task loss adjusted as a parameter during model training; the speech recognition classifier uses the connectionist temporal classification (CTC) loss, the auxiliary task classifiers all use the cross-entropy loss, and the loss function L of the multi-task air traffic control voice recognition model is recorded as:
Figure DEST_PATH_IMAGE009
wherein the content of the first and second substances,
Figure 155433DEST_PATH_IMAGE010
and
Figure DEST_PATH_IMAGE011
respectively representing a speech recognition classifier andithe loss value of each of the secondary task classifiers,
Figure 368109DEST_PATH_IMAGE012
and
Figure DEST_PATH_IMAGE013
respectively representing the speech recognition loss and the secondiThe weight that each secondary task takes to lose,nindicating the number of auxiliary task classifiers.
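A PyTorch sketch of this multi-task loss: nn.CTCLoss for the recognition head, nn.CrossEntropyLoss for each auxiliary head, and the task weights held as learnable parameters, as the claim describes. The softplus constraint keeping the weights positive is an assumption the patent does not state.

```python
import torch
import torch.nn as nn

class MultiTaskLoss(nn.Module):
    """Weighted sum of a CTC loss (ASR head) and cross-entropy losses
    (auxiliary heads), with the weights trained as parameters (claim 8)."""

    def __init__(self, num_aux_tasks: int):
        super().__init__()
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)
        self.ce = nn.CrossEntropyLoss()
        # one learnable weight for the ASR loss plus one per auxiliary task
        self.raw_weights = nn.Parameter(torch.zeros(num_aux_tasks + 1))

    def forward(self, log_probs, targets, input_lens, target_lens,
                aux_logits, aux_labels):
        w = nn.functional.softplus(self.raw_weights)  # keep weights positive
        # log_probs must be (T, N, vocab) log-softmax, as nn.CTCLoss expects
        loss = w[0] * self.ctc(log_probs, targets, input_lens, target_lens)
        for i, (logits, labels) in enumerate(zip(aux_logits, aux_labels)):
            loss = loss + w[i + 1] * self.ce(logits, labels)
        return loss
```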
9. The pre-training-based multi-task air traffic control voice recognition method according to claim 1, wherein the training in step S6 is loop-iteration training, and a single iteration proceeds as follows:
S61, selecting a group of training samples from the second-stage training data set;
S62, inputting the training samples into the multi-task air traffic control voice recognition model and outputting the text recognition result and the auxiliary task classification results;
S63, adjusting the parameters of the multi-task air traffic control voice recognition model based on its loss function.
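A sketch of one such iteration (S61-S63), assuming the model returns ASR log-probabilities plus per-task auxiliary logits and that the MultiTaskLoss above serves as the criterion; the batch layout and names are illustrative.

```python
import torch

def train_step(model, criterion, optimizer, batch):
    # S61: `batch` is one group of samples from the second-stage data set
    feats, targets, input_lens, target_lens, aux_labels = batch
    # S62: forward pass yields the text and auxiliary predictions
    log_probs, aux_logits = model(feats)
    # S63: update the model parameters (including the learnable
    # task-loss weights) from the multi-task loss
    loss = criterion(log_probs, targets, input_lens, target_lens,
                     aux_logits, aux_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```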
10. A pre-training-based multi-task air traffic control voice recognition device, comprising at least one processor and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
CN202211118845.1A 2022-09-15 2022-09-15 Multi-task air traffic control voice recognition method and device based on pre-training Active CN115206293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211118845.1A CN115206293B (en) 2022-09-15 2022-09-15 Multi-task air traffic control voice recognition method and device based on pre-training

Publications (2)

Publication Number Publication Date
CN115206293A 2022-10-18
CN115206293B (en) 2022-11-29

Family

ID=83572350

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211118845.1A Active CN115206293B (en) 2022-09-15 2022-09-15 Multi-task air traffic control voice recognition method and device based on pre-training

Country Status (1)

Country Link
CN (1) CN115206293B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2405422A1 (en) * 2010-07-08 2012-01-11 Honeywell International, Inc. Speech recognition and voice training data storage and access method and apparatus
EP2874133A1 (en) * 2013-11-14 2015-05-20 Honeywell International Inc. Aircraft systems and methods for reducing and detecting read-back and hear-back errors
US11222627B1 (en) * 2017-11-22 2022-01-11 Educational Testing Service Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system
US20200135174A1 (en) * 2018-10-24 2020-04-30 Tencent America LLC Multi-task training architecture and strategy for attention-based speech recognition system
US20200160836A1 (en) * 2018-11-21 2020-05-21 Google Llc Multi-dialect and multilingual speech recognition
CN112420024A (en) * 2020-10-23 2021-02-26 四川大学 Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device
CN113160798A (en) * 2021-04-28 2021-07-23 厦门大学 Chinese civil aviation air traffic control voice recognition method and system
CN113284485A (en) * 2021-07-09 2021-08-20 中国科学院自动化研究所 End-to-end framework for unified Chinese and English mixed text generation and speech recognition
CN113889090A (en) * 2021-09-29 2022-01-04 北京中科智加科技有限公司 Multi-language recognition model construction and training method based on multi-task learning
CN114582330A (en) * 2022-03-11 2022-06-03 中国科学技术大学 Training method of voice recognition model, voice recognition method and electronic equipment
CN114596845A (en) * 2022-04-13 2022-06-07 马上消费金融股份有限公司 Training method of voice recognition model, voice recognition method and device
CN114944150A (en) * 2022-05-07 2022-08-26 深圳职业技术学院 Dual-task-based Conformer land-air communication acoustic model construction method
CN114648982A (en) * 2022-05-24 2022-06-21 四川大学 Controller voice recognition method and device based on comparative learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YI LIN: "A Unified Framework for Multilingual Speech Recognition in Air Traffic Control Systems", IEEE Transactions on Neural Networks and Learning Systems *
ZHOU KAI: "Research and Application of Speech Recognition Technology for Civil Aviation Ground-Air Communication", China Masters' Theses Full-text Database *
LIN YI: "Automatic Segmentation of Ground-Air Communication Based on CGRU Multi-Input Features", Journal of Sichuan University *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116168690A (en) * 2023-04-19 2023-05-26 易方信息科技股份有限公司 Method, device, equipment and storage medium for real-time voice desensitization based on deep learning
CN116504234A (en) * 2023-05-29 2023-07-28 镁佳(北京)科技有限公司 Method, device, equipment and medium for generating voice awakening and detecting model
CN116504234B (en) * 2023-05-29 2023-10-13 镁佳(北京)科技有限公司 Method, device, equipment and medium for generating voice awakening and detecting model
CN116453514A (en) * 2023-06-08 2023-07-18 四川大学 Multi-view-based voice keyword detection and positioning method and device
CN116453514B (en) * 2023-06-08 2023-08-25 四川大学 Multi-view-based voice keyword detection and positioning method and device
CN117577116A (en) * 2024-01-17 2024-02-20 清华大学 Training method, device, equipment and medium for continuously learning voice identification model
CN117577116B (en) * 2024-01-17 2024-03-19 清华大学 Training method, device, equipment and medium for continuously learning voice identification model

Also Published As

Publication number Publication date
CN115206293B (en) 2022-11-29

Similar Documents

Publication Publication Date Title
CN115206293B (en) Multi-task air traffic control voice recognition method and device based on pre-training
Schuller et al. The INTERSPEECH 2021 computational paralinguistics challenge: COVID-19 cough, COVID-19 speech, escalation & primates
Chen et al. End-to-end neural network based automated speech scoring
Ferrer et al. Study of senone-based deep neural network approaches for spoken language recognition
CN112017644B (en) Sound transformation system, method and application
CN111837178A (en) Speech processing system and method for processing speech signal
CN111989742A (en) Speech recognition system and method for using speech recognition system
CN107408384A (en) The end-to-end speech recognition of deployment
CN110600047A (en) Perceptual STARGAN-based many-to-many speaker conversion method
GB2326320A (en) Text to speech synthesis using neural network
CN111400469A (en) Intelligent generation system and method for voice question answering
Pokorny et al. Detection of negative emotions in speech signals using bags-of-audio-words
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
Zhang et al. Improving end-to-end single-channel multi-talker speech recognition
CN110070855A (en) A kind of speech recognition system and method based on migration neural network acoustic model
CN109671423A (en) Non-parallel text compressing method under the limited situation of training data
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
Zhao et al. End-to-end-based Tibetan multitask speech recognition
Soliman et al. Isolated word speech recognition using convolutional neural network
Thai et al. Fully convolutional ASR for less-resourced endangered languages
CN114944150A (en) Dual-task-based Conformer land-air communication acoustic model construction method
CN111090726A (en) NLP-based electric power industry character customer service interaction method
Ng et al. Teacher-student training for text-independent speaker recognition
Rouhe et al. Low resource comparison of attention-based and hybrid ASR exploiting wav2vec 2.0
CN113327585A (en) Automatic voice recognition method based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant