CN111785257A - Air traffic control speech recognition method and device for a small number of labeled samples - Google Patents

Air traffic control speech recognition method and device for a small number of labeled samples

Info

Publication number
CN111785257A
Authority
CN
China
Prior art keywords
voice
air traffic control
training
speech recognition
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010663698.0A
Other languages
Chinese (zh)
Other versions
CN111785257B (en)
Inventor
林毅
杨波
张建伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202010663698.0A
Publication of CN111785257A
Application granted
Publication of CN111785257B
Legal status: Active
Anticipated expiration

Classifications

    • G10L 15/063: Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G08G 5/0043: Traffic control systems for aircraft, e.g. air-traffic control [ATC]; traffic management of multiple aircraft from the ground
    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26: Speech to text systems
    • Y02T 10/40: Engine management systems (climate change mitigation technologies related to transportation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • General Physics & Mathematics (AREA)
  • Telephonic Communication Services (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the fields of civil aviation air traffic control (ATC) and speech recognition, and in particular to an ATC speech recognition method and device for a small number of labeled samples. Based on a neural network, the method and device pre-train the backbone of the recognition model with unlabeled data, so that an ATC speech recognition model with good recognition accuracy and high efficiency can be obtained while labeling only a small number of samples. Once ATC speech is input, the model accurately and quickly outputs the corresponding ATC instruction text, improving the usability of ATC speech recognition in applications and its scalability to new scenarios.

Description

Air traffic control speech recognition method and device for a small number of labeled samples
Technical Field
The invention relates to the fields of civil aviation air traffic control (ATC) and speech recognition, and in particular to an ATC speech recognition method and device for a small number of labeled samples.
Background
Under the existing ATC system, an air traffic controller makes control decisions based on the traffic situation information (surveillance data, flight plans, meteorological information, and the like) provided by an automation system. The controller communicates with pilots by voice over very high frequency (VHF) radio to direct the flights in the sector under their responsibility to fly safely and in order. The control call is where the human-in-the-loop (HITL) of the air-traffic closed loop is concentrated, and monitoring it in real time is highly valuable for improving the safety of control and flight operations. As a bridge between the controller and the automation system, ATC speech recognition research therefore has great practical significance.
As a typical supervised learning task, speech recognition performance depends heavily on labeled training corpora from the target application scenario. The number of training samples, their feature diversity, and their vocabulary coverage greatly affect the final performance of a speech recognition model. In general speech recognition research, the corpora behind existing models typically span thousands of hours. However, for reasons of civil aviation safety, it is not easy to collect enough control call voice data to support ATC speech recognition research. In addition, the civil aviation control process requires controllers to use standardized phraseology that embeds common ATC knowledge, such as digit pronunciations (0 read as 洞 "hole", 7 read as 拐 "crutch"), professional terms (corrected sea-level pressure QNH, RNAV), airlines (Air China, Sichuan Airlines), and flight levels. Labeling ATC speech recognition samples therefore depends on ATC domain knowledge, places requirements on the annotators, and is more time-consuming and labor-intensive than general speech transcription. The lack of corpora, and especially of labeled samples, is thus a real problem facing ATC speech recognition research.
Air traffic control comprises several phases, such as clearance delivery, tower, approach, and area control; the control calls of each phase use similar professional phraseology (for example, digit pronunciations) but also contain much that is unique to a region or control service. For example, a term tied to a control area, such as the waypoint "PIKAS", can only appear in the control calls of the corresponding area control center, while a term tied to a control service, such as "take off", can only occur in tower control calls. Because of this particularity of ATC speech samples, labeling enough training samples to cover all regions and services would require a large amount of manpower, material, and financial resources, which is impractical in real applications. Model migration based on sub-domain knowledge is therefore necessary research content for achieving high-performance ATC speech recognition.
In view of the above problems, it is highly necessary to study ATC speech recognition methods and models under sample scarcity, together with their migration across different accents, areas, and control services, so as to improve the usability and scalability of ATC speech recognition technology in applications and engineering.
Disclosure of Invention
The invention aims to address the poor accuracy and low efficiency of ATC speech recognition with only a small number of labeled samples in the prior art, and provides an ATC speech recognition method and device for a small number of labeled samples, where a labeled sample is a sample containing instruction text information.
In order to achieve the above purpose, the invention adopts the following technical scheme:
An ATC speech recognition method for a small number of labeled samples comprises the following steps:
A: collecting ATC speech and preprocessing it to obtain a Mel-frequency cepstral coefficient (MFCC) feature map;
B: inputting the MFCC feature map into a pre-established ATC speech recognition model;
C: outputting the instruction text information corresponding to the ATC speech.
A labeled sample is a sample containing instruction text information. The ATC speech recognition model comprises a backbone network and a fully-connected prediction layer; the backbone network is obtained by unsupervised pre-training with a denoising autoencoder (DAE) network, and the fully-connected prediction layer is used to optimize the model parameters. Based on the idea of data compression, the method obtains an ATC speech recognition model with good recognition accuracy and high efficiency while labeling only a small number of samples; once ATC speech is input, the model accurately and quickly outputs the corresponding instruction text, improving the usability of ATC speech recognition in applications and its scalability to new scenarios.
As a preferred embodiment of the present invention, the training of the ATC speech recognition model comprises the following steps:
S1: collecting unlabeled corpus data, obtaining the ATC speech in the unlabeled corpus data, and preprocessing it to obtain an MFCC feature map; the unlabeled corpus data comprises continuous original ATC speech;
S2: establishing a backbone network; the backbone network comprises a convolutional neural network module and a long short-term memory (LSTM) module;
S3: inputting the MFCC feature map into a denoising autoencoder network and using it to perform unsupervised pre-training of the backbone network, obtaining a first ATC speech recognition model;
S4: establishing a fully-connected prediction layer on the first ATC speech recognition model to construct a second ATC speech recognition model;
S5: performing supervised training of the second ATC speech recognition model and outputting the final ATC speech recognition model. Based on the idea of data compression, the method pre-trains the backbone of the ATC speech recognition model with unlabeled speech data, so pre-training requires nothing beyond corpus collection and simple preprocessing; at the same time, the pre-trained backbone learns the speech feature representation of a specific dataset without any labeled data, which accelerates ATC speech recognition research. The model parameters are then optimized with a fully-connected prediction layer to complete the ATC speech recognition model. A practical and usable speech recognition model training method is thereby provided for ATC research on the basis of small-scale labeled data.
As a preferred aspect of the present invention, the preprocessing of the ATC speech comprises the following steps:
Step 1: dividing the ATC speech into several speech segments, where each segment contains the voice instruction of a single speaker;
Step 2: screening the speech segments to remove silence and noise data;
Step 3: framing the speech segments with a frame length of t1 milliseconds and a frame shift of t2 milliseconds to obtain T speech frames;
Step 4: converting the T speech frames into a 13-dimensional MFCC feature map, then computing its first and second derivatives to obtain a 39-dimensional MFCC feature map of dimension (T, 39).
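As an illustration of steps 1 to 4, the following is a minimal sketch of the MFCC feature extraction, assuming librosa as the feature library, an 8 kHz sample rate, and the 20 ms / 8 ms frame parameters of embodiment 2; the function name and defaults are illustrative, not part of the patent:

```python
import librosa
import numpy as np

def mfcc_feature_map(wav_path, sr=8000, frame_ms=20, shift_ms=8):
    """Frame the audio (frame length t1, frame shift t2), extract
    13-dim MFCCs, and stack first/second derivatives -> (T, 39)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(sr * frame_ms / 1000),
        hop_length=int(sr * shift_ms / 1000))
    delta1 = librosa.feature.delta(mfcc)            # first derivative
    delta2 = librosa.feature.delta(mfcc, order=2)   # second derivative
    return np.concatenate([mfcc, delta1, delta2], axis=0).T  # (T, 39)
```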
As a preferable embodiment of the present invention, step S1 comprises:
S11: inputting the unlabeled corpus data and dividing the original ATC speech in it into several speech segments, where each segment contains the voice instruction of a single speaker;
S12: screening the speech segments to remove silence and noise data;
S13: framing the speech segments with a frame length of t1 milliseconds and a frame shift of t2 milliseconds to obtain T speech frames;
S14: converting the speech frames into a 13-dimensional MFCC feature map, then computing its first and second derivatives to obtain a 39-dimensional MFCC feature map of dimension (T, 39).
As a preferred embodiment of the present invention, the denoising autoencoder network in step S3 uses the backbone network as its encoder and a mirror of the backbone network as its decoder, with residual connections established between the corresponding hidden layers of the encoder and decoder.
As a preferable embodiment of the present invention, step S3 comprises the following steps:
S31: using the MFCC feature map as both input and output of the denoising autoencoder network to train the backbone network;
S32: applying a random mask prediction strategy to the MFCC feature map;
S33: computing the loss function of model training to obtain the first ATC speech recognition model.
The denoising autoencoder network takes the backbone network as its encoder and the mirror structure of the backbone network as its decoder, and establishes residual connections between the corresponding hidden layers of the encoder and decoder.
As a preferred embodiment of the present invention, the loss function in step S33 is computed as:

$$\mathrm{Loss}=\frac{1}{N}\sum_{i=1}^{N}\left\|m_i\odot\left(\hat{F}_i-F_i^{*}\right)\right\|^2$$

where $N$ is the number of training samples processed in a batch, $F_i^{*}$ is the speech feature of the $i$-th sample, $\hat{F}_i$ is its reconstruction, and $m_i$ is the mask used when computing the error:

$$m_i^{j}=\begin{cases}1,&\text{if the }j\text{-th frame is selected for masking}\\0,&\text{otherwise}\end{cases},\qquad j\in[1,T_i]$$

where $T_i$ is the number of speech frames of the $i$-th sample.
As a preferable embodiment of the present invention, step S32 comprises:
S321: selecting one speech segment, selecting 15% of its speech frames for mask processing, and keeping the feature values of the remaining frames unchanged;
S322: processing each selected speech frame according to the following piecewise function:

$$\hat{f}_t=\begin{cases}\mathrm{mean}(F)+\xi,&0\le p<0.8\\\mathbf{0},&0.8\le p<0.9\\f_t,&0.9\le p\le 1\end{cases}$$

where $p\in[0,1]$ is a random probability, $f_t$ is the original speech feature at time step $t$, $\hat{f}_t$ is the masked speech feature at time step $t$, $\xi$ is random noise sampled from a Gaussian distribution $N(\mu,\sigma^2)$ whose parameters are computed from the original speech features, and $\mathrm{mean}(\cdot)$ is the frame-averaging operation.
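A minimal NumPy sketch of this strategy follows, assuming the branch probabilities described in embodiment 2 (80% mean-plus-noise, 10% zero vector, 10% unchanged); the names are illustrative:

```python
import numpy as np

def random_mask(features, mask_ratio=0.15, rng=None):
    """Apply the random mask prediction strategy to one utterance's
    (T, 39) MFCC feature map; returns the masked map and the 0/1 mask."""
    rng = rng or np.random.default_rng()
    T = features.shape[0]
    masked = features.copy()
    mask = np.zeros(T, dtype=bool)
    chosen = rng.choice(T, size=max(1, int(mask_ratio * T)), replace=False)
    mask[chosen] = True
    mu, sigma = features.mean(), features.std()   # noise parameters from the data
    frame_mean = features.mean(axis=0)            # mean(F) over all frames
    for t in chosen:
        p = rng.random()
        if p < 0.8:                               # first branch: mean(F) + noise
            masked[t] = frame_mean + rng.normal(mu, sigma, size=features.shape[1])
        elif p < 0.9:                             # second branch: null vector
            masked[t] = 0.0
        # third branch: frame kept unchanged
    return masked, mask
```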
As a preferable embodiment of the present invention, step S4 comprises:
S41: importing a labeled corpus and its corresponding vocabulary;
S42: establishing a fully-connected prediction layer after the last LSTM layer of the first ATC speech recognition model and applying a TimeDistributed mechanism to it, where the number of neurons of the fully-connected prediction layer equals the number of words in the vocabulary.
The fully-connected prediction layer maps the abstract features that the backbone network extracts from the MFCC feature map onto the vocabulary, predicts the probability that the features belong to each word, and normalizes the probabilities within each time frame with a softmax activation function to obtain the most probable word.
As a preferred embodiment of the present invention, step S41 further performs sample enhancement on the labeled corpus, comprising the following steps:
S411: migrating public samples into the labeled corpus based on common ATC knowledge, improving the diversity and vocabulary coverage of the labeled corpus;
S412: randomly selecting part of the labeled training corpus for speech-rate adjustment;
S413: randomly selecting part of the labeled training corpus for random noise addition. After the labeled training corpus is processed in this way, the method uses the idea of sub-domain migration to optimize the speech recognition model on datasets from different ATC scenarios. Through sub-domain adaptation, the common ATC knowledge in the corpus gains richer diversity and vocabulary coverage; through sample enhancement, the number of training samples and the diversity of features are increased. This improves both the optimization efficiency and the final recognition performance, forming a more efficient and reasonable training corpus for ATC speech recognition research.
As a preferable embodiment of the present invention, step S5 comprises:
S51: using the MFCC feature map and the call text respectively as the input and output of the second ATC speech recognition model for model training;
S52: using a CTC function as the loss function of model training, optimizing the parameters of the backbone network and the fully-connected prediction layer, and outputting the final ATC speech recognition model.
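A minimal sketch of the CTC training objective of step S52, assuming the Keras backend helper (embodiment 2 also builds on Keras); tensor shapes and names are illustrative:

```python
import tensorflow.keras.backend as K

def ctc_loss(y_true, y_pred, input_length, label_length):
    """CTC loss for one batch.
    y_pred: (batch, T, vocab) per-frame softmax from the prediction layer;
    y_true: (batch, max_label_len) integer-encoded call texts;
    input_length / label_length: (batch, 1) valid lengths."""
    return K.ctc_batch_cost(y_true, y_pred, input_length, label_length)
```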
An ATC speech recognition device for a small number of labeled samples comprises at least one processor and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any of the methods described above.
In summary, thanks to the above technical scheme, the invention has the following beneficial effects:
1. Based on the idea of data compression, the method obtains an ATC speech recognition model with good recognition accuracy and high efficiency while labeling only a small number of samples; once ATC speech is input, the model accurately and quickly outputs the corresponding instruction text, improving the usability of ATC speech recognition in applications and its scalability to new scenarios.
2. The method pre-trains the backbone of the ATC speech recognition model with unlabeled speech data, so pre-training requires nothing beyond corpus collection and simple preprocessing; at the same time, the pre-trained backbone learns the speech feature representation of a specific dataset without any labeled data, which accelerates ATC speech recognition research. The model parameters are then optimized with a fully-connected prediction layer to complete the ATC speech recognition model, providing a practical and usable model training method for ATC research on the basis of small-scale labeled data.
3. After the labeled training corpus is processed, the method uses sub-domain migration to optimize the speech recognition model on datasets from different ATC scenarios. Through sub-domain adaptation, the common ATC knowledge in the corpus gains richer diversity and vocabulary coverage; through sample enhancement, the number of training samples and the diversity of features are increased. This improves the optimization efficiency and final recognition performance, forming a more efficient and reasonable training corpus for ATC speech recognition research.
Drawings
Fig. 1 is a flowchart of the ATC speech recognition method for a small number of labeled samples according to embodiment 1 of the present invention;
Fig. 2 is a block diagram of the LSTM used in the ATC speech recognition method according to embodiment 2 of the present invention;
Fig. 3 is a structure diagram of the backbone network of the ATC speech recognition method according to embodiment 2 of the present invention;
Fig. 4 is a configuration table of the backbone network of the ATC speech recognition method according to embodiment 2 of the present invention;
Fig. 5 is a pre-training flowchart of the ATC speech recognition method according to embodiment 2 of the present invention;
Fig. 6 is a schematic diagram of the prediction probabilities of the ATC speech recognition method according to embodiment 2 of the present invention;
Fig. 7 is a parameter migration training flowchart of the ATC speech recognition method according to embodiment 2 of the present invention;
Fig. 8 is the overall training flowchart of the ATC speech recognition method according to embodiment 2 of the present invention;
Fig. 9 is a comparison of the effects of the ATC speech recognition method according to embodiment 3 of the present invention;
Fig. 10 is a structure diagram of the deep-neural-network-based ATC speech recognition model training device according to embodiment 4 of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1
An ATC speech recognition method for a small number of labeled samples comprises the following steps:
A: collecting ATC speech and preprocessing it to obtain an MFCC feature map;
B: inputting the MFCC feature map into a pre-established ATC speech recognition model;
C: outputting the instruction text information corresponding to the ATC speech.
A labeled sample is a sample containing instruction text information. The ATC speech recognition model comprises a backbone network and a fully-connected prediction layer; the backbone network is obtained by unsupervised pre-training with a denoising autoencoder network, and the fully-connected prediction layer is used to optimize the model parameters.
The training of the ATC speech recognition model comprises the following steps:
S1: collecting unlabeled corpus data, obtaining the ATC speech in it, and preprocessing the speech to obtain an MFCC feature map; the unlabeled corpus data comprises continuous original ATC speech;
S2: establishing a backbone network comprising a convolutional neural network module and an LSTM module;
S3: inputting the MFCC feature map into a denoising autoencoder network and using it to perform unsupervised pre-training of the backbone network, obtaining a first ATC speech recognition model;
S4: establishing a fully-connected prediction layer on the first ATC speech recognition model to construct a second ATC speech recognition model;
S5: performing supervised training of the second ATC speech recognition model and outputting the final ATC speech recognition model.
The detailed flow of each step is as follows.
The preprocessing of the ATC speech comprises the following steps:
Step 1: dividing the ATC speech into several speech segments, where each segment contains the voice instruction of a single speaker;
Step 2: screening the speech segments to remove silence and noise data;
Step 3: framing the speech segments with a frame length of t1 milliseconds and a frame shift of t2 milliseconds to obtain T speech frames;
Step 4: converting the T speech frames into a 13-dimensional MFCC feature map, then computing its first and second derivatives to obtain a 39-dimensional MFCC feature map of dimension (T, 39).
Step S2 comprises:
constructing the backbone network of the ATC speech recognition model to serve as the encoder of the denoising autoencoder network; the body of the backbone network is at least one convolutional neural network module and at least one LSTM module;
the convolutional neural network module extracts abstract speech features from the MFCC feature map;
the LSTM module mines the temporal correlation between speech-frame features and, for each speech frame, outputs the probability that the extracted sequence features belong to each vocabulary word; the denoising autoencoder network reconstructs the speech features to achieve the goal of model pre-training.
In step S3, the denoising autoencoder network takes the backbone network as its encoder and the mirror structure of the backbone network as its decoder, and establishes residual connections between the corresponding hidden layers of the encoder and decoder.
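A minimal Keras sketch of such a denoising autoencoder is given below (embodiment 2 implements the invention with Keras); the layer widths are assumptions, and the encoder here stands in for the backbone:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dae(n_mfcc=39):
    """DAE for pre-training: encoder, mirrored decoder, and residual
    (skip) connections between corresponding hidden layers."""
    inp = keras.Input(shape=(None, n_mfcc))             # masked MFCC frames
    # --- encoder (stand-in for the backbone) ---
    e1 = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(inp)
    e2 = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(e1)
    # --- decoder (mirror of the encoder) ---
    d1 = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(e2)
    d1 = layers.Add()([d1, e2])                         # residual connection
    d2 = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(d1)
    d2 = layers.Add()([d2, e1])                         # residual connection
    out = layers.TimeDistributed(layers.Dense(n_mfcc))(d2)  # reconstruct frames
    model = keras.Model(inp, out, name="dae")
    model.compile(optimizer="adam", loss="mse")         # masked MSE in practice
    return model
```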
Step S3 comprises the following steps:
S31: using the MFCC feature map as both input and output of the denoising autoencoder network for model training;
S32: applying a random mask prediction strategy to the MFCC feature map, which guides the model to learn more robust high-order features;
S33: computing the loss function Loss of model training as

$$\mathrm{Loss}=\frac{1}{N}\sum_{i=1}^{N}\left\|m_i\odot\left(\hat{F}_i-F_i^{*}\right)\right\|^2$$

where $N$ is the number of training samples processed in a batch, $F_i^{*}$ is the speech feature of the $i$-th sample, $\hat{F}_i$ is its reconstruction, and $m_i$ is the mask used when computing the error:

$$m_i^{j}=\begin{cases}1,&\text{if the }j\text{-th frame is selected for masking}\\0,&\text{otherwise}\end{cases},\qquad j\in[1,T_i]$$

where $T_i$ is the number of speech frames of the $i$-th sample.
Step S32 comprises:
S321: selecting one speech segment, selecting 15% of its speech frames for mask processing, and keeping the feature values of the remaining frames unchanged;
S322: processing each selected speech frame according to the following piecewise function:

$$\hat{f}_t=\begin{cases}\mathrm{mean}(F)+\xi,&0\le p<0.8\\\mathbf{0},&0.8\le p<0.9\\f_t,&0.9\le p\le 1\end{cases}$$

where $p\in[0,1]$ is a random probability, $f_t$ is the original speech feature at time step $t$, $\hat{f}_t$ is the masked speech feature at time step $t$, $\xi$ is random noise sampled from a Gaussian distribution $N(\mu,\sigma^2)$, and $\mathrm{mean}(\cdot)$ is the frame-averaging operation.
Step S4 comprises:
model optimization training after both the unsupervised pre-training and the supervised training, performed iteratively with the back-propagation algorithm, as follows:
(1): computing the loss function of the model to be optimized under the current training parameters according to

$$J(W,b)=\frac{1}{m}\sum_{i=1}^{m}C\!\left(\hat{y}^{(i)},y^{(i)}\right)$$

to obtain the overall loss value of the model, where $W$ is the weight of a neural network layer, $b$ is the bias parameter of the layer, $m$ is the number of training samples, $C(\cdot)$ is the loss function, $\hat{y}^{(i)}$ is the predicted value of a sample, and $y^{(i)}$ is the true value of the sample;
(2): computing from the overall loss value the parameter updates of the weight $W$ and bias $b$ according to

$$W^{*}=W-\alpha\frac{\partial J(W,b)}{\partial W},\qquad b^{*}=b-\alpha\frac{\partial J(W,b)}{\partial b}$$

where $\alpha$ is the learning rate, $W^{*}$ is the updated weight, and $b^{*}$ is the updated bias;
(3): back-propagating the loss error and the weight $W$ and bias $b$ of the neural network from the last hidden layer toward the first hidden layer according to the derivative chain rule, realizing the parameter optimization of the neural network model.
Step S4 further comprises:
S41: importing a labeled corpus and its corresponding vocabulary;
S42: establishing a fully-connected prediction layer after the last LSTM layer of the first ATC speech recognition model and applying a TimeDistributed mechanism to it, where the number of neurons of the fully-connected prediction layer equals the number of words in the vocabulary.
The fully-connected prediction layer maps the abstract features that the backbone network extracts from the MFCC feature map onto the vocabulary, predicts the probability that the features belong to each word, and normalizes the probabilities within each time frame with a softmax activation function to obtain the most probable word.
Step S5 comprises:
S51: using the MFCC feature map and the call text respectively as the input and output of the speech recognition model for model training;
S52: performing feature fusion between the backbone network and the optimized model, and optimizing the parameters of the pre-trained backbone network and the prediction layer during training;
S53: using a CTC function as the loss function of model training; the probability of an input text sequence $\pi$ is finally expressed as

$$P(\pi\mid F)=\prod_{t=1}^{T}y_{\pi_t}^{t}$$

where $F=\langle f_1,\ldots,f_T\rangle$ is the input speech feature sequence, $y_k^{t}$ is the probability that the $t$-th speech frame is the $k$-th vocabulary word, $T$ is the number of speech frames, and $A$ is the vocabulary.
Example 2
This embodiment differs from embodiment 1 in that it further includes optimization of the backbone network and of the labeled corpus. The specific steps of this embodiment are as follows.
Step 1: collecting and preprocessing the unlabeled corpus data, comprising the following steps:
Step 1-1: for the collected control call voice data, first using voice activity detection (VAD) to divide the continuous original ATC speech into short utterances, each containing only the voice of a single speaker, i.e., the content of a single control instruction;
Step 1-2: simply screening the segmented speech to remove silence and noise data;
Step 1-3: framing the segmented speech with a 20 ms frame length and an 8 ms frame shift and converting it into a 13-dimensional Mel-frequency cepstral coefficient (MFCC) feature map; the number of frames of a speech signal of duration d seconds after framing is T = (d × 1000 − 20) ÷ 8 + 1;
Step 1-4: computing the first and second derivatives of the MFCC features to finally form a 39-dimensional MFCC feature map of dimension (T, 39), where T is the number of speech frames.
Step 2: constructing the backbone network of the ATC speech recognition model, comprising the following steps:
Step 2-1: constructing a backbone network whose body is a convolutional neural network (CNN) and long short-term memory (LSTM) modules, to serve as the encoder of the DAE network. The structure of the backbone network comprises:
1) the CNN module: two-dimensional convolution kernels extract abstract speech features from the MFCC feature map, and convolution kernel configurations at different scales let the network learn abstract representations of the speech features at different time-frequency resolutions. The CNN is computed as:

$$O_{i,j}=\sum_{u=1}^{m}\sum_{v=1}^{n}W_{u,v}\,X_{i+u,\,j+v}$$

where the convolution kernel size is $(m, n)$, $(i, j)$ denotes a time-frequency position in the MFCC feature map, and $X$ and $W$ denote the input data and the associated trainable weight parameters of the convolution operation, respectively.
2) the LSTM module: a bidirectional LSTM neural network layer mines the temporal correlation between speech-frame features and, for each time step (speech frame), outputs the probability that the extracted sequence features belong to each vocabulary word. The structure of the LSTM is shown in FIG. 2, and it is computed as:

$$I_t=f(W_{ix}x_t+W_{ih}h_{t-1}+W_{ic}C_{t-1}+b_i)$$
$$F_t=f(W_{fx}x_t+W_{fh}h_{t-1}+W_{fc}C_{t-1}+b_f)$$
$$C_t=F_t\cdot C_{t-1}+I_t\cdot g(W_{cx}x_t+W_{ch}h_{t-1}+b_c)$$
$$O_t=f(W_{ox}x_t+W_{oh}h_{t-1}+W_{oc}C_{t-1}+b_o)$$
$$h_t=O_t\cdot g(C_t)$$

where $t$ is the time step of prediction and computation; $I$, $F$, $C$, and $O$ are the response values of the input gate, forget gate, cell, and output gate of the LSTM unit, respectively; and the final response of the hidden unit is $h_t$. In the formulas, $W_{ix}$ is the weight of the connection between the input-gate response and the current input, the remaining weight parameters $W_{**}$ have analogous meanings, $b_{*}$ denotes the internal bias values, and $a\cdot b$ denotes the element-wise product of the vectors $a$ and $b$.
The structure of the backbone network is shown in FIG. 3, and the specific configuration of each module is shown in FIG. 4.
Step 2-2: using the backbone network as the encoder structure of the DAE network;
Step 2-3: using the mirror structure of the backbone network as the decoder of the DAE network, whose goal is to reconstruct the compressed features back into the original feature data;
Step 2-4: establishing residual connections between the corresponding hidden layers of the DAE encoder and decoder to build the complete DAE network. The residual connections provide information interaction of the speech feature map between different layers, which guides model training and improves the trainability of the model; the structure is shown in FIG. 5.
Step 3: training the DAE network with unlabeled speech signal features, comprising the following steps:
Step 3-1: using the MFCC feature map as both input and output of the DAE network for model training;
Step 3-2: applying the random mask prediction strategy to the input MFCC feature map (the input of the DAE network); the random mask prediction strategy used by the invention is as follows:
Step 3-2-1: selecting 15% of the speech frames in a single speech file for mask processing, keeping the remaining original feature values unchanged;
Step 3-2-2: processing each selected speech frame according to the following piecewise function:

$$\hat{f}_t=\begin{cases}\mathrm{mean}(F)+\xi,&0\le p<0.8\\\mathbf{0},&0.8\le p<0.9\\f_t,&0.9\le p\le 1\end{cases}$$

where $p\in[0,1]$ is a random probability, $f_t$ is the original speech feature at time step $t$, and $\hat{f}_t$ is the masked speech feature at time step $t$. $\xi$ is random noise sampled from a Gaussian distribution $N(\mu,\sigma^2)$ whose mean and variance are computed from the original speech features, and $\mathrm{mean}(\cdot)$ is the averaging operation over the frames. Of the selected speech frames, 10% become a null vector (second branch), 10% remain unchanged (third branch), and the remainder are computed as $\mathrm{mean}(F)+\xi$ (first branch).
Step 3-3: using the mean square error (MSE) as the loss function Loss of model training, computed as:

$$\mathrm{Loss}=\frac{1}{N}\sum_{i=1}^{N}\left\|m_i\odot\left(\hat{F}_i-F_i^{*}\right)\right\|^2$$

where $N$ is the number of training samples processed in a batch, $F_i^{*}$ is the speech feature of the $i$-th sample (of dimension $T\times 39$), $\hat{F}_i$ is its reconstruction, and $m_i$ is the mask used when computing the error:

$$m_i^{j}=\begin{cases}1,&\text{if the }j\text{-th frame is selected for masking}\\0,&\text{otherwise}\end{cases},\qquad j\in[1,T_i]$$

where $T_i$ is the number of speech frames.
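A minimal NumPy sketch of this masked MSE, under the assumption that only masked frames contribute to the error:

```python
import numpy as np

def masked_mse(batch_true, batch_pred, batch_mask):
    """Masked MSE over a batch. Each element of the three lists belongs
    to one sample: a (T_i, 39) original feature map F_i*, its DAE
    reconstruction, and a boolean (T_i,) mask m_i of the masked frames."""
    total = 0.0
    for f_true, f_pred, m in zip(batch_true, batch_pred, batch_mask):
        diff = (f_pred - f_true)[m]      # keep only frames with m_i^j = 1
        total += np.sum(diff ** 2)
    return total / len(batch_true)       # average over the N samples
```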
Step 4: updating the model parameters with the back-propagation algorithm and training iteratively to reduce the training loss until the output error is stable, as follows:
Step 4-1: computing the loss function (the mean square error) of the backbone network under the current training parameters according to

$$J(W,b)=\frac{1}{m}\sum_{i=1}^{m}C\!\left(\hat{y}^{(i)},y^{(i)}\right)$$

where $W$ is the weight of a neural network layer, $b$ is the bias parameter of the layer, $m$ is the number of training samples, $C(\cdot)$ is the loss function, $\hat{y}^{(i)}$ is the predicted value of a sample, and $y^{(i)}$ is the true value of the sample.
Step 4-2: computing from the overall loss value the gradients that drive the parameter updates of the weights and biases:

$$\frac{\partial J(W,b)}{\partial W},\qquad \frac{\partial J(W,b)}{\partial b}$$

Step 4-3: updating the neural network model parameters with the gradients computed at the current parameters:

$$W^{*}=W-\alpha\frac{\partial J(W,b)}{\partial W},\qquad b^{*}=b-\alpha\frac{\partial J(W,b)}{\partial b}$$

where $\alpha$ is the learning rate, $W^{*}$ is the updated weight, and $b^{*}$ is the updated bias;
Step 4-4: back-propagating the loss error and gradients of the neural network from the last hidden layer toward the first hidden layer according to the derivative chain rule, realizing the parameter optimization of the neural network model.
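For illustration, steps 4-2 and 4-3 amount to the following plain gradient-descent update (a sketch only; embodiment 2 relies on the optimizer functions shipped with Keras):

```python
def gradient_step(weights, biases, grad_w, grad_b, alpha):
    """One update: W* = W - alpha * dJ/dW, b* = b - alpha * dJ/db."""
    return weights - alpha * grad_w, biases - alpha * grad_b
```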
Step 5: designing the FC prediction layer for a specific labeled corpus and its corresponding vocabulary, as follows:
Step 5-1: designing an FC prediction layer after the last LSTM layer of the backbone network, mapping the abstract features extracted by the LSTM onto the vocabulary, where the number of neurons of the prediction layer equals the number of words in the vocabulary of the specific corpus;
Step 5-2: as shown in FIG. 6, using the TimeDistributed mechanism to predict, from the abstract feature of each LSTM time step, the probability of belonging to each vocabulary word, and normalizing the output probabilities within each frame with softmax as the activation function:

$$P(v_i\mid f_t)=\frac{e^{z_i}}{\sum_{j=1}^{V}e^{z_j}}$$

where $V$ is the number of words in the vocabulary, $v_i$ is a word in the vocabulary, $f_t$ is the speech-frame feature at time step $t$, and $z$ is the prediction-layer response for that frame.
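A minimal Keras sketch of steps 5-1 and 5-2, attaching a TimeDistributed dense layer with per-frame softmax to a backbone such as the one sketched above; the names and the extra CTC blank entry are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def add_prediction_layer(backbone, vocab_size):
    """Map the backbone's per-frame features onto the vocabulary:
    one neuron per word, softmax-normalized within each time frame."""
    probs = layers.TimeDistributed(
        layers.Dense(vocab_size + 1, activation="softmax"),  # +1: CTC blank
        name="fc_prediction")(backbone.output)
    return keras.Model(backbone.input, probs)
```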
Step 6: realizing parameter sharing of the ATC speech recognition model based on the pre-trained backbone network or on a recognition model from another dataset, as shown in FIG. 7, specifically:
Step 6-1: loading the parameters of the backbone network from the pre-trained model;
Step 6-2: if a recognition model optimized on another ATC dataset exists, loading the parameters of the backbone network from that recognition model.
Step 7: performing knowledge migration and enhancement operations on the samples of the labeled corpus to increase the number and diversity of training samples. During training, the specific labeled corpus is combined with the labeled corpus of the baseline model to form the final labeled corpus for ATC speech recognition optimization; the final training corpus is then processed as follows:
Step 7-1: sub-domain adaptation, i.e., migrating common ATC knowledge (flight call signs, numbers, and the like): if a word from the baseline model's training samples exists in the specific labeled corpus, the original word is kept; if it does not exist (for example, waypoints tied to another region), it is set to "<UNK>". This processing guides the model to establish the sequence classification relation between the speech features and the words of the target region while discarding the probability relation between the speech features and words that do not belong to the region, so that more robust model parameters can be learned;
Step 7-2: sample enhancement: enhancing the labeled training samples of the region with the proposed speech-rate adjustment and random noise addition strategies, described below (see the sketch after step 7-2-2):
Step 7-2-1: speech-rate adjustment strategy: randomly selecting 20% of the labeled training samples of the region for speech-rate adjustment; specifically, the speech rate of a sample is adjusted with the sox tool, where 10% of the samples are adjusted by a factor of 0.95 and the other 10% by a factor of 1.02;
Step 7-2-2: random noise addition strategy: randomly selecting 20% of the labeled training samples of the region for random noise addition; specifically, a Gaussian white-noise feature map (of the same dimension as the MFCC feature map) with mean 0 and variance 1 is randomly generated and applied directly to the original MFCC feature map with an element-wise addition operation.
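A minimal sketch of both strategies, assuming the sox command-line tool is installed and that its speed effect matches the factors above; the helper names are illustrative:

```python
import subprocess
import numpy as np

def adjust_speech_rate(wav_in, wav_out, factor):
    """Step 7-2-1: speed change via the sox CLI (e.g. factor 0.95 or 1.02)."""
    subprocess.run(["sox", wav_in, wav_out, "speed", str(factor)], check=True)

def add_random_noise(mfcc_map, rng=None):
    """Step 7-2-2: element-wise addition of an N(0, 1) white-noise map
    with the same shape as the MFCC feature map."""
    rng = rng or np.random.default_rng()
    return mfcc_map + rng.normal(0.0, 1.0, size=mfcc_map.shape)
```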
After the labeled training corpus is processed in this way, sub-domain adaptation gives the common ATC knowledge in the corpus richer diversity and vocabulary coverage, and sample enhancement increases the number of training samples and the diversity of features, forming a more efficient and reasonable training corpus for ATC speech recognition research. On this basis, pre-training and parameter migration are considered together to form the ATC speech recognition training scheme of the invention, shown in FIG. 8, where arrows of different colors represent the different types of knowledge migration.
Step 8: optimizing the ATC speech recognition model on the labeled corpus, comprising the following steps:
Step 8-1: using the MFCC feature map and the call text respectively as the input and output of the speech recognition model for model training;
Step 8-2: performing feature fusion between the pre-trained backbone network and the optimized model; during training, only the parameters of the pre-trained backbone network and the prediction layer are optimized, and no parameter update is applied to the baseline model. In the invention, an averaging operation is applied to the features of the pre-trained network and the optimized network, and the result is used as the feature of the optimized backbone network for computing the loss;
Step 8-3: using a CTC (connectionist temporal classification) function as the loss function of model training; for any speech input, the probability of a frame-level text sequence $\pi$ is expressed as:

$$P(\pi\mid F)=\prod_{t=1}^{T}y_{\pi_t}^{t}$$

where $F=\langle f_1,\ldots,f_T\rangle$ is the input speech feature sequence, $y_k^{t}$ is the probability that the $t$-th speech frame is the $k$-th vocabulary word, $T$ is the number of speech frames, and $A$ is the vocabulary.
Further, the final predicted text probability is:

$$P(y\mid F)=\sum_{\pi\in\Xi}P(\pi\mid F)$$

where $\Xi$ is the set of all possible frame-level text sequences of the input speech features that map to the output $y$; the final output text is obtained by removing the duplicates and the spacers. For example, assuming "_" represents the spacer, the output sequences "a_bb_c" and "_ab_c_" both correspond to the final output string "abc".
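The duplicate-and-spacer removal can be illustrated with the following small sketch:

```python
def ctc_collapse(seq, blank="_"):
    """Collapse a frame-level CTC path into the output string:
    merge repeated symbols, then drop the spacer/blank symbol."""
    out, prev = [], None
    for ch in seq:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

assert ctc_collapse("a_bb_c") == "abc"
assert ctc_collapse("_ab_c_") == "abc"
```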
Step 9: updating the model parameters with the back-propagation algorithm and training iteratively to reduce the training loss until the output error is stable.
The technical effects produced by this embodiment are analyzed and explained below.
example 3
This example is a specific implementation of the scheme of the invention and its comparison with the prior art.
The following are the data conditions used to verify the feasibility and performance of the technical scheme adopted in this embodiment:
1. Baseline model and corpus: the baseline model was trained on control call voice data collected from an area control center (303 hours of labeled training data in total); the performance obtained on a 13.5-hour validation set was a character error rate (CER) of 2.7%, i.e., a recognition accuracy of 97.3%;
2. Pre-training and migration-optimization models and corpora: the verification data are control call voice data collected from the tower control center of Chengdu Shuangliu International Airport; after preprocessing, there are 134 hours of unlabeled sample data, 23.7 hours of labeled sample data, and 1.5 hours of labeled data finally used for testing, with a vocabulary of 907 words, of which 24 are newly added.
The following are the specific software and hardware parameters and the implementation process of this embodiment:
The technical scheme of the invention is implemented with the CNN, LSTM, and FC neural network layers encapsulated by Keras, together with the related loss functions and optimizer functions; the backbone network structure is as described above. The training hyper-parameters for pre-training and migration training are as follows:
1. Pre-training: an initial learning rate of 0.001, a learning-rate decay rate of 0.9, and a batch size of 96;
2. Migration optimization: an initial learning rate of 0.00005, a learning-rate decay rate of 0.99, and a batch size of 160.
The hardware environment of the experiments: the CPU is 2 × Intel Core i7-6800K, the GPUs are 2 × NVIDIA GeForce RTX 2080Ti with 2 × 11 GB of graphics memory, the RAM is 64 GB, and the operating system is Ubuntu Linux 16.04.
Under the above training data and configurations, four groups of experiments are performed: without pre-training or migration optimization (A), with pre-training only (B), with migration optimization only (C), and with both pre-training and migration optimization (D). The experimental results are measured by the CER over Chinese characters and English letters, computed as

$$\mathrm{CER}=\frac{I+D+S}{N}\times 100\%$$

where $N$ is the length of the true text label, and $I$, $D$, and $S$ are the numbers of insertion, deletion, and substitution operations required to convert the predicted text label into the true label.
Verification results: the verification of the technical scheme only considers the performance of the acoustic model and does not involve language-model processing or optimization; the final comparison is shown in FIG. 9. According to the experimental results, both contributions of the invention greatly promote the performance of the ATC speech recognition model under small-scale labeled data, and both also improve the convergence efficiency of the model. The specific comparisons are as follows:
1. Comparing experiments A and B: the pre-training method of the invention greatly improves the final recognition performance without a baseline model and with small-scale labeled data, reducing the CER from 11.2% to 7.4%. At the same time, because the pre-training learns the feature representation of the speech signal from unlabeled data, the convergence rate of model optimization is also improved, i.e., fewer training epochs are needed to reach higher recognition performance.
2. Comparing experiments A and C: the migration optimization method of the invention obtains excellent recognition performance when a baseline model is available, with a CER of 5.4%. More importantly, because the baseline model is already optimized, the convergence rate improves further, and excellent recognition performance is reached with only 5 training epochs. The analysis shows that migrating the common ATC knowledge in the baseline model's labeled training samples greatly improves the number, diversity, and coverage of the training corpus and thus the efficiency of speech recognition.
3. From experiment D it can be seen that, by combining the advantages of the two methods of the invention, a 3.3% character error rate is obtained on the test data of the new dataset on the basis of only 23.7 hours of labeled data, realizing fast, efficient, and high-performance model migration.
That is, the scheme of the invention greatly improves the accuracy of ATC speech recognition training with a small number of labeled samples while effectively improving the training efficiency.
Example 4
As shown in FIG. 10, an ATC speech recognition device for a small number of labeled samples comprises at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the ATC speech recognition method for a small number of labeled samples described in the previous embodiments. An input/output interface may comprise a display, a keyboard, a mouse, and a USB interface for inputting and outputting data; a power supply provides electric energy for the electronic device.
Those skilled in the art will understand that all or part of the steps for realizing the method embodiments can be completed by hardware controlled by program instructions; the program can be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable memory device, a read-only memory (ROM), a magnetic disk, or an optical disk.
When the integrated unit of the present invention is implemented in the form of a software functional unit and sold or used as a separate product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. An air traffic control speech recognition method for a small number of labeled samples, characterized by comprising the following steps:
A: collecting air traffic control speech and preprocessing it to obtain a Mel-frequency cepstral coefficient (MFCC) feature map;
B: inputting the MFCC feature map into a pre-established air traffic control speech recognition model;
C: outputting the instruction text information corresponding to the air traffic control speech;
wherein the air traffic control speech recognition model comprises a backbone network and a fully-connected prediction layer; the backbone network is obtained by unsupervised pre-training with a denoising autoencoder network; and the fully-connected prediction layer is used to optimize the model parameters.
2. The ATC speech recognition method for a small number of labeled samples according to claim 1, wherein training the ATC speech recognition model comprises the following steps:
S1: collecting unlabeled corpus data, obtaining the ATC speech contained in it, and preprocessing that speech to obtain an MFCC feature map; the unlabeled corpus data comprise continuous original ATC speech;
S2: establishing a backbone network comprising a convolutional neural network module and a long short-term memory (LSTM) module;
S3: inputting the MFCC feature map into a denoising autoencoder network and using that network to perform unsupervised pre-training of the backbone network, obtaining a first ATC speech recognition model;
S4: building a fully-connected prediction layer on top of the first ATC speech recognition model to construct a second ATC speech recognition model;
S5: performing supervised training of the second ATC speech recognition model and outputting the final ATC speech recognition model.
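As a hedged outline only, the S1–S5 pipeline can be composed as below; every routine name is a placeholder injected as a callable, with concrete sketches given under claims 3, 4, 7, and 9.

```python
def train_atc_asr(unlabeled_wavs, labeled_pairs, vocab_size,
                  extract_mfcc_39, pretrain_backbone,
                  add_prediction_layer, supervised_ctc_training):
    features = [extract_mfcc_39(wav) for wav in unlabeled_wavs]   # S1: unlabeled MFCC maps
    backbone = pretrain_backbone(features)                        # S2-S3: CNN+LSTM backbone, DAE pre-training
    model = add_prediction_layer(backbone, vocab_size)            # S4: second model with prediction layer
    supervised_ctc_training(model, labeled_pairs)                 # S5: supervised CTC fine-tuning
    return model
```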
3. The ATC speech recognition method for a small number of labeled samples according to claim 2, wherein preprocessing the ATC speech comprises the following steps:
Step 1: dividing the ATC speech into several speech segments, each containing the voice instruction of a single speaker;
Step 2: screening the speech segments to remove silence and noise data;
Step 3: framing each speech segment with a frame length of t1 milliseconds and a frame shift of t2 milliseconds to obtain T speech frames;
Step 4: converting the T speech frames into a 13-dimensional MFCC feature map and computing its first- and second-order derivatives to obtain a 39-dimensional MFCC feature map of dimension (T, 39).
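A runnable sketch of this feature extraction using librosa follows; the library choice, the 8 kHz sampling rate, and the 25 ms / 10 ms defaults for t1/t2 are assumptions, not values stated in the claim.

```python
import librosa
import numpy as np

def extract_mfcc_39(wav_path, sr=8000, t1_ms=25, t2_ms=10):
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(sr * t1_ms / 1000)       # frame length of t1 milliseconds
    hop = int(sr * t2_ms / 1000)         # frame shift of t2 milliseconds
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)  # (13, T)
    d1 = librosa.feature.delta(mfcc, order=1)                 # first derivative
    d2 = librosa.feature.delta(mfcc, order=2)                 # second derivative
    feats = np.concatenate([mfcc, d1, d2], axis=0)            # (39, T)
    return feats.T                                            # (T, 39) feature map
```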
4. The ATC speech recognition method for a small number of labeled samples according to claim 3, wherein step S3 comprises the following steps:
S31: using the MFCC feature map as both input and output of the denoising autoencoder network to train the backbone network;
S32: applying a random mask prediction strategy to the MFCC feature map;
S33: calculating the loss function of model training to obtain the first ATC speech recognition model;
wherein the denoising autoencoder network takes the backbone network as the encoder and a mirror of the backbone network as the decoder, with residual connections established between the encoder and the corresponding hidden layers of the decoder.
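A hedged Keras sketch of this structure follows: the backbone (convolution plus LSTM) acts as the encoder, a mirrored stack as the decoder, and a residual connection links each encoder layer to its mirror. Layer counts and sizes are illustrative, not the patented configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_denoising_autoencoder(T=None, feat_dim=39):
    inp = layers.Input(shape=(T, feat_dim))              # masked MFCC feature map
    # --- encoder: the backbone network ---
    e1 = layers.Conv1D(64, 3, padding="same", activation="relu")(inp)
    e2 = layers.LSTM(128, return_sequences=True)(e1)
    # --- decoder: mirror image of the backbone ---
    d1 = layers.LSTM(128, return_sequences=True)(e2)
    d1 = layers.add([d1, e2])                            # residual link to the mirrored layer
    d2 = layers.Conv1D(64, 3, padding="same", activation="relu")(d1)
    d2 = layers.add([d2, e1])                            # residual link to the mirrored layer
    out = layers.TimeDistributed(layers.Dense(feat_dim))(d2)  # reconstructed features
    return Model(inp, out)
```

The skip connections let low-level spectral detail bypass the bottleneck, which generally eases reconstruction of unmasked frames and focuses learning on the masked ones.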
5. The ATC speech recognition method for a small number of labeled samples according to claim 4, wherein the loss function in step S33 is calculated as:
$$\mathcal{L}=\frac{1}{N}\sum_{i=1}^{N}\left\|\left(F_i^{*}-\hat{F}_i\right)\odot M_i\right\|^{2}$$
where $N$ is the number of training samples processed in a batch, $F_i^{*}$ is the speech feature of the $i$-th sample, $\hat{F}_i$ is the feature reconstructed by the denoising autoencoder network, and $M_i$ is the mask used in the error calculation: with $T_i$ the number of speech frames, its entry $m_i^{j}$ is 1 when the $j$-th frame is selected for masking and 0 otherwise, $j\in[1,T_i]$.
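A NumPy sketch of this masked reconstruction loss, under the reconstruction of the formula given above (the padding of variable-length segments to a common T is an added assumption):

```python
import numpy as np

def masked_reconstruction_loss(target, recon, mask):
    """target, recon: (N, T, 39) original and reconstructed features;
    mask: (N, T) with 1 for frames selected for masking, 0 otherwise."""
    err = (target - recon) ** 2 * mask[..., None]  # error counted only on masked frames
    return err.sum() / len(target)                 # average over the batch of N samples
```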
6. The ATC speech recognition method for a small number of labeled samples according to claim 4, wherein step S32 comprises:
S321: selecting a speech segment, choosing 15% of its speech frames for mask processing, and keeping the feature values of the remaining frames unchanged;
S322: processing each speech frame selected for masking according to the following piecewise function:
$$\hat{f}_t=\begin{cases}\mathrm{mean}(f), & 0\le p<0.8\\ \xi, & 0.8\le p<0.9\\ f_t, & 0.9\le p\le 1\end{cases}$$
where $p$ is a random probability, $p\in[0,1]$; $f_t$ is the original speech feature at time step $t$; $\hat{f}_t$ is the speech feature after masking at time step $t$; $\xi$ is random noise satisfying $\xi\sim\mathcal{N}(\mu,\sigma)$; and $\mathrm{mean}(\cdot)$ is the averaging operation.
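A NumPy sketch of this masking strategy follows. It implements the piecewise rule as reconstructed above (replace with the segment mean, replace with random noise, or keep, by random probability); the Gaussian noise parameters are assumptions.

```python
import numpy as np

def random_mask(feats, mask_ratio=0.15, rng=None):
    """feats: (T, 39) MFCC feature map. Returns (masked_feats, frame_mask)."""
    if rng is None:
        rng = np.random.default_rng()
    T = feats.shape[0]
    masked = feats.copy()
    mask = np.zeros(T, dtype=np.float32)
    idx = rng.choice(T, size=max(1, int(T * mask_ratio)), replace=False)  # 15% of frames
    mean_vec = feats.mean(axis=0)                       # mean(f) over the segment
    std_vec = feats.std(axis=0)                         # assumed noise scale
    for t in idx:
        mask[t] = 1.0
        p = rng.random()
        if p < 0.8:
            masked[t] = mean_vec                        # replace with the mean feature
        elif p < 0.9:
            masked[t] = rng.normal(mean_vec, std_vec)   # replace with random noise xi
        # else: keep the original frame unchanged
    return masked, mask
```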
7. The ATC speech recognition method for a small number of labeled samples according to claim 2, wherein step S4 comprises:
S41: importing a labeled corpus and a corresponding vocabulary;
S42: building a fully-connected prediction layer after the last LSTM layer of the first ATC speech recognition model and applying a TimeDistributed mechanism to it, the number of neurons in the fully-connected prediction layer being equal to the number of words in the vocabulary;
wherein the fully-connected prediction layer maps the abstract features that the backbone network extracts from the MFCC feature map onto the vocabulary, predicts the probability of each word, and normalizes the probabilities within the corresponding time frame with a softmax activation function to obtain the most likely word.
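A short Keras sketch of this step, assuming the pre-trained backbone is a functional Model whose output is a (batch, T, hidden) sequence:

```python
from tensorflow.keras import layers, Model

def add_prediction_layer(backbone, vocab_size):
    x = backbone.output                                     # (batch, T, hidden) abstract features
    probs = layers.TimeDistributed(
        layers.Dense(vocab_size, activation="softmax"))(x)  # per-frame word probabilities
    return Model(backbone.input, probs)                     # second ATC speech recognition model
```

TimeDistributed applies the same dense weights to every time frame, so each frame gets its own softmax distribution over the vocabulary.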
8. The ATC speech recognition method for a small number of labeled samples according to claim 7, wherein in step S41 the labeled corpus is further sample-enhanced, the sample enhancement comprising the following steps:
S411: migrating public samples into the labeled corpus based on general air traffic control knowledge;
S412: randomly selecting part of the labeled training corpus and adjusting its speech rate;
S413: randomly selecting part of the labeled training corpus and adding random noise to it.
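A librosa/NumPy sketch of the S412/S413 augmentations follows; the application probabilities, stretch-rate bounds, and noise level are illustrative assumptions.

```python
import librosa
import numpy as np

def augment(y, rng=None):
    """y: mono waveform. Returns a randomly augmented copy."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < 0.5:                                   # S412: adjust speech rate
        y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))
    if rng.random() < 0.5:                                   # S413: add random noise
        noise = rng.normal(0.0, y.std() * 0.05, size=y.shape)
        y = y + noise
    return y
```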
9. The ATC speech recognition method for a small number of labeled samples according to claim 8, wherein step S5 comprises:
S51: training the second ATC speech recognition model with the MFCC feature map as its input and the instruction text information as its output;
S52: using the CTC function as the loss function of model training, optimizing the parameters of the backbone network and the fully-connected prediction layer, and outputting the final ATC speech recognition model.
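A TensorFlow sketch of one supervised training step with CTC loss follows; shapes, the dense-label convention, and the blank-as-last-index choice follow tf.nn.ctc_loss conventions and are assumptions rather than patent details.

```python
import tensorflow as tf

def ctc_train_step(model, optimizer, feats, labels, feat_len, label_len):
    """feats: (batch, T, 39); labels: (batch, max_label_len) word indices."""
    with tf.GradientTape() as tape:
        probs = model(feats, training=True)              # (batch, T, vocab) softmax output
        loss = tf.reduce_mean(tf.nn.ctc_loss(
            labels=labels,
            logits=tf.math.log(probs + 1e-8),            # log-probabilities as logits
            label_length=label_len,
            logit_length=feat_len,
            logits_time_major=False,
            blank_index=-1))                             # blank = last vocabulary entry
    grads = tape.gradient(loss, model.trainable_variables)  # updates backbone + prediction layer
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

CTC marginalizes over all frame-to-word alignments, which is what lets the model train on instruction text without frame-level annotations.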
10. An air traffic control speech recognition device for a small number of labeled samples, comprising at least one processor and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, enabling the at least one processor to perform the method of any one of claims 1 to 9.
CN202010663698.0A 2020-07-10 2020-07-10 Empty pipe voice recognition method and device for small amount of labeled samples Active CN111785257B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010663698.0A CN111785257B (en) 2020-07-10 2020-07-10 Empty pipe voice recognition method and device for small amount of labeled samples

Publications (2)

Publication Number Publication Date
CN111785257A true CN111785257A (en) 2020-10-16
CN111785257B CN111785257B (en) 2022-08-26

Family

ID=72768254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010663698.0A Active CN111785257B (en) 2020-07-10 2020-07-10 Empty pipe voice recognition method and device for small amount of labeled samples

Country Status (1)

Country Link
CN (1) CN111785257B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200074989A1 (en) * 2018-08-28 2020-03-05 International Business Machines Corporation Low energy deep-learning networks for generating auditory features for audio processing pipelines
US20200211541A1 (en) * 2018-12-27 2020-07-02 At&T Intellectual Property I, L.P. Voice gateway for federated voice services
CN110335609A (en) * 2019-06-26 2019-10-15 四川大学 A kind of air-ground communicating data analysis method and system based on speech recognition
CN113192530A (en) * 2021-04-26 2021-07-30 深圳追一科技有限公司 Model training method, mouth action parameter acquisition device, mouth action parameter acquisition equipment and mouth action parameter acquisition medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
WEI-NING HSU, et al.: "Extracting domain invariant features by unsupervised learning for robust automatic speech recognition", 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
YI LIN, et al.: "ATCSpeechNet: A multilingual end-to-end speech recognition framework for air traffic control systems", Applied Soft Computing *
YI LIN, et al.: "Improving speech recognition models with small samples for air traffic control systems", https://arxiv.org/abs/2102.08015 *
YANG Hongyu, et al.: "Research and prospects of intelligent air traffic control technology", Advanced Engineering Sciences *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112466282A (en) * 2020-10-22 2021-03-09 北京仿真中心 Speech recognition system and method for aerospace professional field
CN112466282B (en) * 2020-10-22 2023-11-28 北京仿真中心 Speech recognition system and method oriented to aerospace professional field
CN112420024A (en) * 2020-10-23 2021-02-26 四川大学 Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device
CN112420024B (en) * 2020-10-23 2022-09-09 四川大学 Full-end-to-end Chinese and English mixed empty pipe voice recognition method and device
US11967306B2 (en) 2021-04-14 2024-04-23 Honeywell International Inc. Contextual speech recognition methods and systems
CN114596844A (en) * 2022-03-18 2022-06-07 腾讯科技(深圳)有限公司 Acoustic model training method, voice recognition method and related equipment
CN114648982A (en) * 2022-05-24 2022-06-21 四川大学 Controller voice recognition method and device based on comparative learning
CN114648982B (en) * 2022-05-24 2022-07-26 四川大学 Controller voice recognition method and device based on comparison learning
CN114743554A (en) * 2022-06-09 2022-07-12 武汉工商学院 Intelligent household interaction method and device based on Internet of things

Also Published As

Publication number Publication date
CN111785257B (en) 2022-08-26

Similar Documents

Publication Publication Date Title
CN111785257B (en) Empty pipe voice recognition method and device for small amount of labeled samples
Sun et al. Domain adversarial training for accented speech recognition
Zia et al. Long short-term memory recurrent neural network architectures for Urdu acoustic modeling
Oruh et al. Long short-term memory recurrent neural network for automatic speech recognition
CN110399850B (en) Continuous sign language recognition method based on deep neural network
CN107610709B (en) Method and system for training voiceprint recognition model
CN112420024B (en) Full-end-to-end Chinese and English mixed empty pipe voice recognition method and device
Yi et al. CTC regularized model adaptation for improving LSTM RNN based multi-accent mandarin speech recognition
CN107408384A (en) The end-to-end speech recognition of deployment
CN111833845B (en) Multilingual speech recognition model training method, device, equipment and storage medium
CN111353029B (en) Semantic matching-based multi-turn spoken language understanding method
Lin et al. Improving speech recognition models with small samples for air traffic control systems
Guha et al. Hybrid feature selection method based on harmony search and naked mole-rat algorithms for spoken language identification from audio signals
EP3940693A1 (en) Voice interaction-based information verification method and apparatus, and device and computer storage medium
Passricha et al. A comparative analysis of pooling strategies for convolutional neural network based Hindi ASR
Swain et al. Study of feature combination using HMM and SVM for multilingual Odiya speech emotion recognition
Lin et al. ATCSpeechNet: A multilingual end-to-end speech recognition framework for air traffic control systems
Zhao et al. End-to-end-based Tibetan multitask speech recognition
CN115910066A (en) Intelligent dispatching command and operation system for regional power distribution network
Musaev et al. Development of integral model of speech recognition system for Uzbek language
Ma et al. LID-senone extraction via deep neural networks for end-to-end language identification
Dalsaniya et al. Development of a novel database in Gujarati language for spoken digits classification
Pappas et al. Deep residual output layers for neural language generation
Lu et al. Implementation of embedded unspecific continuous English speech recognition based on HMM
CN116364097A (en) Data processing method and device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant