CN110415683A - Air traffic control voice instruction recognition method based on deep learning - Google Patents
Air traffic control voice instruction recognition method based on deep learning
- Publication number
- CN110415683A CN110415683A CN201910619285.XA CN201910619285A CN110415683A CN 110415683 A CN110415683 A CN 110415683A CN 201910619285 A CN201910619285 A CN 201910619285A CN 110415683 A CN110415683 A CN 110415683A
- Authority
- CN
- China
- Prior art keywords
- voice
- data
- deep learning
- network model
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Abstract
The invention discloses an air traffic control voice instruction recognition method based on deep learning, comprising the following steps: obtaining a voice signal to be recognized and converting it into 16-bit, 16 kHz PCM audio data; establishing a deep network model; training the deep network model with training data to obtain a speech recognition engine; performing voice segmentation on the audio data; and inputting the effective audio segments obtained by segmentation into the speech recognition engine, which outputs text recognition results. The deep network model uses convolution modules as feature extractors, processes the extracted feature data through a reshape layer and a fully connected layer, performs sequence learning with gated recurrent units, and finally performs classification learning and decision through fully connected layers to obtain the prediction result. With an artificial-intelligence deep learning engine at its core, the present invention has strong domain applicability and accent generalization ability, depends comparatively little on data volume, and significantly outperforms general-purpose speech recognition systems in the recognition of air traffic control speech.
Description
The present invention relates to the field of voice processing technology, and more particularly to a deep-learning-based voice instruction recognition method for the air traffic control field.
Background art
With the rapid growth of civil aviation, large numbers of aircraft and flights are added every year. However, there has long been a shortage of air traffic control personnel, conservatively estimated in the thousands. Although the relevant air traffic control authorities have implemented a series of measures, such as the "4+1" training mechanism, heavy attrition of control personnel persists. At the same time, because new recruits lack experience and training time and resources are scarce, the expected staffing benefits cannot be realized. The shortage of air traffic control professionals forces controllers to work overloaded, creating potential safety and efficiency problems in air traffic. Air traffic control in China is still high-intensity mental labor centered on the controller's subjective judgment; as civil aviation flourishes and aircraft movements increase sharply, understaffed control units can currently rely only on controllers performing prolonged, highly concentrated, high-intensity work, and human error is unavoidable. According to statistics, human error has caused about 80% of all aviation accidents and has become the major factor affecting aviation safety. Taking the October 11, 2016 Hongqiao Airport passenger aircraft collision incident as an example, the serious incident (a runway incursion) was caused precisely because the tower controller forgot the aircraft dynamics. It is therefore necessary to introduce a speech recognition system to record in real time the instructions issued by controllers and the readbacks of pilots, so as to reduce misunderstanding and forgetting.
In 2016, Guilin Jingzhun Measurement and Control Technology Co., Ltd. performed speech recognition using a pre-trained controller speech library. That method is limited by the existing speech database: recognition is poor for voice information that does not exactly match the stored patterns, and accuracy is low. In 2018, the 15th Research Institute of China Electronics Technology Group Corporation built an acoustic model based on continuous hidden Markov models (CHMM) for speech recognition; its accuracy falls short of neural network models. The Civil Aviation University of China used a feature-enhanced DNN-HMM model that further reduced the error rate, but DNNs are prone to overfitting and easily fall into local optima, so recognition accuracy still falls short of a CNN-GRU neural network model.
Summary of the invention
In view of the above shortcomings of the prior art, the present invention constructs a speech recognition system dedicated to air traffic control instructions. The system is built with deep learning technology around an artificial-intelligence speech recognition engine and an external information correction system; it can recognize with high accuracy the large number of specialized terms, special pronunciations and place names in air traffic control speech, achieving higher recognition accuracy for control speech.
To achieve this, the technical solution provided by the invention comprises the following steps:
S1: obtaining a voice signal to be recognized, comprising at least one of a real-time voice signal and a historical voice signal, and converting it into 16-bit, 16 kHz PCM audio data;
S2: performing voice segmentation on the audio data to obtain processed effective audio segments;
S3: establishing a deep network model;
S4: training the deep network model with training data to obtain a speech recognition engine;
S5: inputting the effective audio segments into the speech recognition engine and outputting text recognition results.
The voice segmentation comprises the following steps:
S2.1: inputting the audio data and performing a fast Fourier transform (FFT) on each audio frame (1024 sampling points per frame) to obtain a spectrum sequence M(x), retaining only the part x = 1–256;
S2.2: setting a threshold f = -30 dB, the value of which can be adjusted according to the actual situation; if M(x) > f, recording 1, otherwise recording 0, forming a new sequence M0(x);
S2.3: setting a voice threshold v = 0.2, the value of which can be adjusted according to the actual situation, and summing M0; if M0/256 > v, the frame is considered an active frame, i.e. contains voice;
S2.4: if a run of consecutive active frames exceeds 8 frames, the audio of those consecutive active frames is considered an effective audio segment.
The deep network model uses one or more structurally identical convolution modules as the feature extractor; each convolution module comprises two convolutional layers and one pooling layer. The extracted feature data are processed by a reshape layer and a fully connected layer, sequence learning is performed by gated recurrent units (GRU), and classification learning and decision are performed by at least two fully connected layers to obtain the prediction result. Voice data passes through the convolution modules, gated recurrent units and fully connected layers to obtain the prediction result, completing one full forward-propagation pass.
The deep network model is further provided with dropout layers at module junctions, and the gated recurrent unit uses a GRU neural network comprising a forward sequence learning module and a backward sequence learning module.
In the deep-learning-based air traffic control voice instruction recognition method, training the deep network model with training data into a speech recognition engine comprises the following specific steps:
S4.1: obtaining raw air traffic control command audio data;
S4.2: annotating the voice data: the audio data obtained in S4.1 are annotated with text to obtain training data; the obtained training data comprise voice data and annotation data;
S4.3: dividing the training data into groups of paired voice data and annotation data;
S4.4: training the deep network model established in S3 with the Adadelta optimizer via the backpropagation algorithm, forming a suitable deep learning speech recognition network.
When audio data are input into the speech recognition engine, the trained recognition engine converts the audio data into text results and outputs them. The text results can then be exported, saved or used by other applications.
The invention has the following advantages:
The convolutional layers used in the deep network model of the present invention have the advantages of local perception and weight sharing, and can effectively extract data features with a comparatively small number of parameters. The pooling layers have the advantages of feature invariance and feature dimensionality reduction, further compressing the data and parameters and preventing model overfitting. The data pass through four convolution modules, which extract data features from the bottom up and learn to recognize data and information, matching the way humans extract information.
Second, the gated recurrent unit (GRU) used therein is a special recurrent neural network (RNN) that simulates the human memory system. It consists of two parts, an update gate and a reset gate, which respectively control how much state information from the previous moment is remembered or forgotten, thereby learning language sequences, i.e. relating context. The GRU of the speech recognition engine of the present invention performs sequence learning in both the forward and backward directions, further ensuring the model's understanding of context. After sequence learning by the GRU, the data enter two fully connected layers for classification learning and decision, yielding a more accurate prediction result. In addition, dropout layers are provided at module junctions to prevent the model from overfitting.
Because air traffic control is highly specialized, regionally diverse and staffed by diverse personnel, control speech contains a large number of technical terms, unique place names, mixed Chinese and English, and accent differences, which pose a huge challenge for speech recognition systems. The present invention builds a speech recognition engine based on artificial intelligence technology for the recognition of control speech. Compared with traditional speech recognition engines, the artificial-intelligence-based engine not only improves recognition accuracy qualitatively (an improvement of roughly 30%–60%) but also greatly simplifies the model structure, with high training and usage efficiency.
With an artificial-intelligence deep learning engine at its core, the present invention realizes a complete, specialized speech recognition system targeted at the particularities of air traffic control speech. Compared with the general-purpose speech recognition systems of the major Internet companies, the speech recognition engine in this system is trained entirely on real air traffic control speech, giving it strong domain applicability and accent generalization ability; the scenario is highly specific, the dependence on data volume is low, and the recognition of control speech is significantly better than with general-purpose speech recognition systems.
Brief description of the drawings
Fig. 1 is a flow diagram of the voice instruction recognition method of the present invention;
Fig. 2 is a schematic diagram of the connection pattern of the deep network model of the present invention;
Fig. 3 is a flow diagram of the speech recognition engine training process.
Specific embodiment
The embodiments of the present invention are explained clearly and completely below in conjunction with the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, a deep-learning-based air traffic control voice instruction recognition method specifically uses the following steps:
S1: obtaining the voice signal to be recognized and converting it into 16-bit, 16 kHz PCM audio data.
The voice signal to be recognized is read in the form of at least one of a real-time voice stream and a historical voice stream. A historical voice stream refers to a stored audio file that is converted into a byte string for reading; the byte-string format follows the 16-bit, 16 kHz PCM format. A real-time voice stream refers to an analog audio signal converted into digital information by a device such as a sound card; the digital signal is likewise a continuous byte string in 16-bit, 16 kHz PCM format.
PCM stands for pulse code modulation. There are two ways to obtain PCM: first, an analog audio signal is converted into a digital byte-string signal by a sound card or audio capture card; second, another audio format is converted to PCM format. Here the conversion is performed on a Linux system with the ffmpeg tool. The conversion command is as follows:

ffmpeg -i "inputfile" -f wav -acodec pcm_s16le -ar 16000 "outputfile.wav"

The above command converts other types of audio files into 16-bit, 16 kHz PCM audio data.
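By way of illustration only (this sketch is not part of the original disclosure; the function name read_pcm_bytes and the file name are assumed), a minimal Python example of reading such a converted file back as the 16-bit, 16 kHz PCM byte string described above is as follows:

import wave

def read_pcm_bytes(path):
    # Open the stored audio file (historical voice stream) and return
    # its raw PCM payload as a byte string.
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2       # 16-bit samples
        assert wav.getframerate() == 16000   # 16 kHz sampling rate
        return wav.readframes(wav.getnframes())

pcm_data = read_pcm_bytes("outputfile.wav")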
S2: performing voice segmentation on the audio data to obtain processed effective audio segments.
The specific approach is as follows:
The audio data are input and a fast Fourier transform (FFT) is performed on each audio frame (1024 sampling points per frame) to obtain the spectrum sequence M(x), retaining only the part x = 1–256. A threshold f = -30 dB is set, the value of which can be adjusted according to the actual situation; if M(x) > f, 1 is recorded, otherwise 0, forming a new sequence M0(x). A voice threshold v = 0.2 is set, the value of which can be adjusted according to the actual situation, and M0 is summed; if M0/256 > v, the frame is considered an active frame, i.e. contains voice. If a run of consecutive active frames exceeds 8 frames, the audio of those consecutive active frames is considered an effective audio segment, and the segment is passed to the speech recognition engine.
Preferably, the threshold f generally ranges from -40 dB to -10 dB, depending mainly on the noise intensity: f should be greater than the mean noise intensity. Since the noise in control speech is small, -30 dB can usually distinguish silent segments from segments with sound effectively, enabling segmentation. When f is too small, all audio is treated as containing voice; when f is too large, all audio is treated as containing no voice, and segmentation cannot be completed.
For the threshold v, the value generally ranges between 0.1 and 0.9. Its function is to judge the start and end of an audio segment. When its value is too small, any slight audio fluctuation is taken as the start of audio, and even after audio activity has stopped it is difficult to regard the segment as ended. When its value is too large, even strong audio activity is not regarded as a start, and a slight drop in activity intensity is regarded as the end of the segment. An appropriate value should therefore be chosen so that small audio activity does not trigger a segment start, while larger activity does not end the segment too easily.
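The segmentation procedure described above can be sketched as follows (an illustrative NumPy sketch, not part of the original disclosure; it assumes 16-bit samples and takes digital full scale as the dB reference for f, which the text does not fix):

import numpy as np

FRAME = 1024        # sampling points per frame (S2.1)
F_DB = -30.0        # threshold f (S2.2)
V_THRESH = 0.2      # voice threshold v (S2.3)
MIN_RUN = 8         # minimum run of consecutive active frames (S2.4)

def frame_is_active(frame):
    # FFT of one 1024-sample frame; keep only x = 1..256.
    spectrum = np.abs(np.fft.fft(frame.astype(np.float64)))[1:257]
    # Magnitudes in dB relative to full scale (reference level assumed).
    m_db = 20.0 * np.log10(spectrum / (FRAME * 32768.0) + 1e-12)
    m0 = m_db > F_DB                   # M0(x): 1 above f, else 0
    return m0.sum() / 256.0 > V_THRESH

def effective_segments(samples):
    # Yield (start_frame, end_frame) pairs for runs of more than 8
    # consecutive active frames, i.e. effective audio segments.
    flags = [frame_is_active(samples[i:i + FRAME])
             for i in range(0, len(samples) - FRAME + 1, FRAME)]
    run = 0
    for i, active in enumerate(flags + [False]):
        if active:
            run += 1
        else:
            if run > MIN_RUN:
                yield (i - run, i)
            run = 0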
S3: establishing the deep network model.
The framework of the deep network model is shown in Fig. 2. The voice data are divided into an audio part and an annotation-text part: the audio part is passed into the model as input data in the form of a spectrogram, and the annotation text is converted into a corresponding digital sequence according to a specific dictionary and used as the desired output value. The input data first pass sequentially through one or more structurally identical convolution modules (CNN). Preferably, this embodiment uses 4 structurally identical convolution modules, each comprising two convolutional layers and one pooling layer.
Preferably, an embodiment of the convolution module is as follows:

from keras.layers import Conv2D, MaxPooling2D

layer_h1 = Conv2D(32, (3, 3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(input_data)  # convolutional layer
layer_h2 = Conv2D(32, (3, 3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h1)  # convolutional layer
layer_h3 = MaxPooling2D(pool_size=2, strides=None, padding='valid')(layer_h2)  # pooling layer
After feature extraction by the convolution modules, the extracted feature data are reshaped and combined by a reshape layer and a fully connected layer, and then enter the gated recurrent unit (GRU) for sequence learning.
Preferably, an embodiment of the reshape layer is as follows:

layer_h13 = Reshape((200, 3200))(layer_h12)  # reshape layer
A GRU is a special recurrent neural network (RNN) that simulates the human memory system. It consists of two parts, an update gate and a reset gate, which respectively control the memory and forgetting of the previous moment's state information, thereby learning language sequences, i.e. relating context. The GRU of the speech recognition engine of the present invention performs sequence learning in both the forward and backward directions, further ensuring the model's understanding of context.
Preferably, an embodiment of the bidirectional GRU gated recurrent unit is as follows:

layer_h15 = Bidirectional(GRU(256, return_sequences=True, return_state=False), merge_mode='concat')(layer_h14)
After sequence learning by the GRU, the data enter two fully connected layers for classification learning and decision, yielding the prediction result. In addition, dropout layers are provided at the module junctions of the model framework to prevent model overfitting:

layer_h6 = Dropout(0.1)(layer_h6)  # dropout applied at a module junction

Preferably, dropout layers can be provided at the junction of each module of the deep network model.
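Pulling the fragments above together, the following is a minimal sketch of one way to assemble the complete network in Keras. The input shape (a 1600 × 200 × 1 spectrogram), the filter counts (32, 64, 128, 128), the pool size of 1 in the fourth module (chosen so the Reshape((200, 3200)) arithmetic works out: 1600/2³ = 200 time steps and (200/2³) × 128 = 3200 features) and the output vocabulary size of 1424 are assumptions for illustration, not values fixed by the patent:

from keras.layers import (Input, Conv2D, MaxPooling2D, Dropout, Reshape,
                          Dense, GRU, Bidirectional, Activation)
from keras.models import Model

def conv_module(x, filters, pool):
    # One convolution module: two convolutional layers and one pooling layer.
    x = Conv2D(filters, (3, 3), use_bias=True, activation='relu',
               padding='same', kernel_initializer='he_normal')(x)
    x = Conv2D(filters, (3, 3), use_bias=True, activation='relu',
               padding='same', kernel_initializer='he_normal')(x)
    return MaxPooling2D(pool_size=pool, strides=None, padding='valid')(x)

input_data = Input(shape=(1600, 200, 1))   # spectrogram input (assumed size)

x = input_data
for filters, pool in ((32, 2), (64, 2), (128, 2), (128, 1)):
    x = conv_module(x, filters, pool)      # four structurally identical modules
    x = Dropout(0.1)(x)                    # dropout at each module junction

x = Reshape((200, 3200))(x)                # reshape layer: features per time step
x = Dense(128, activation='relu',
          kernel_initializer='he_normal')(x)   # fully connected layer

# Bidirectional GRU: forward and backward sequence learning.
x = Bidirectional(GRU(256, return_sequences=True, return_state=False),
                  merge_mode='concat')(x)
x = Dropout(0.1)(x)

# Two fully connected layers for classification learning and decision.
x = Dense(128, activation='relu', kernel_initializer='he_normal')(x)
x = Dense(1424, use_bias=True, kernel_initializer='he_normal')(x)
y_pred = Activation('softmax', name='softmax')(x)

model = Model(inputs=input_data, outputs=y_pred)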
Training the model's numerous parameters requires backpropagation, which is in essence the process of minimizing the model's loss function. The speech recognition engine of the present invention uses the CTC loss function. Preferably:

The CTC loss function is expressed as follows:

$$\mathrm{CTC}(S) = -\ln \prod_{(x,z) \in S} p(z \mid x) = -\sum_{(x,z) \in S} \ln p(z \mid x)$$

where p(z|x) represents the probability of the output sequence z given the input x, and S is the training set. The loss function can be interpreted as the product of the probabilities of outputting the correct labels given the samples. Here p(z|x) can be rewritten in the following form:

$$p(z \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(z)} \prod_{t=1}^{T} y^{t}_{\pi_t}$$

where x is the given input, i.e. the audio feature sequence transformed by the neural network into a symbol feature sequence; y is the given output, i.e. the correct letter-symbol sequence corresponding to the audio; π ranges over the alignment paths that collapse to z under the mapping $\mathcal{B}$; and |z'| denotes the length of the padded label sequence z' (z with blanks inserted).

For the above formula, we define a forward variable α(t, u): the sum of the forward probabilities of all paths that, at output time t, have produced the first u symbols of the output sequence z. It follows the recurrence relation

$$\alpha(t, u) = y^{t}_{l'_u} \sum_{i=f(u)}^{u} \alpha(t-1, i)$$

where u denotes the character position, l'_u denotes the label at position u, and f(u) = u-1 if l'_u is a blank or equals l'_{u-2}, otherwise f(u) = u-2.

We then define a backward variable β(t, u), meaning the sum of the probabilities of all "remaining" paths π' that can reach an output at time T that is a blank or the corresponding label. Here remaining paths refer to the portions of paths other than those described by α(t, u). It follows the recurrence relation

$$\beta(t, u) = \sum_{i=u}^{g(u)} \beta(t+1, i)\, y^{t+1}_{l'_i}$$

where g(u) = u+1 if l'_u is a blank or equals l'_{u+2}, otherwise g(u) = u+2. In the above formula, $y^{t+1}_{l'_i}$ denotes the probability of outputting label l'_i at time t+1.
The specific implementation uses the CTC loss function of Keras, as follows:

from keras import backend as K

def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    y_pred = y_pred[:, :, :]
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

The ctc_lambda_func function computes the CTC loss. Here y_pred is the result computed by the neural network, labels represents the correct results, input_length represents the lengths of the prediction batch, and label_length represents the lengths of the correct-result batch.
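A common way to attach this loss to the network during training is through a Keras Lambda layer, sketched below for illustration (not part of the original disclosure; it assumes y_pred and input_data come from the model sketch above and that label sequences are padded to an assumed maximum length of 64):

from keras.layers import Input, Lambda
from keras.models import Model

labels = Input(name='the_labels', shape=[64], dtype='float32')       # padded label sequences
input_length = Input(name='input_length', shape=[1], dtype='int64')  # network output lengths
label_length = Input(name='label_length', shape=[1], dtype='int64')  # true label lengths

# The Lambda layer evaluates the CTC loss inside the graph; its output
# name 'ctc' matches the loss dictionary used in model.compile below.
loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')(
    [y_pred, labels, input_length, label_length])

train_model = Model(inputs=[input_data, labels, input_length, label_length],
                    outputs=loss_out)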
S4: training the deep network model with training data to obtain the speech recognition engine.
The specific approach is as follows:
Obtaining raw air traffic control command audio data: the control speech to be recognized is obtained. Besides real raw control command audio data, artificially synthesized speech imitating real command speech can also be used as raw control command audio data.
Annotating the voice data: the obtained speech is annotated with text to obtain training data; the obtained training data comprise voice data and annotation data. Personnel can be organized to study professional air traffic control knowledge and annotate the control speech, so that each speech segment is annotated with the corresponding text in the corresponding language (for example, a speech segment may correspond to annotated content such as "China Eastern 3988, climb to 900 and maintain").
The training data are divided into paired groups. Preferably, each group contains 10,000 pairs; the data can be divided as needed into groups serving a common function, with the specific number per group determined by the function of the training data and the actual situation.
The deep network model is trained via the backpropagation algorithm using the Adadelta optimizer, yielding the trained speech recognition engine.
A preferred embodiment of the Adadelta optimizer is as follows:

from keras.optimizers import Adadelta

ada_d = Adadelta(lr=0.01, rho=0.95, epsilon=1e-06)
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer=ada_d)
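For completeness, a minimal sketch of fitting the compiled model on the paired data (illustrative only: it assumes the compile call above is applied to the CTC-wrapped train_model from the earlier sketch, and that x_audio, y_labels, in_lens and lab_lens are hypothetical arrays holding the spectrogram inputs, padded label sequences, network output lengths and true label lengths of one data group):

import numpy as np

# The CTC loss is computed inside the graph, so the fit target is a
# dummy array; the compiled loss function simply passes y_pred through.
dummy_target = np.zeros((len(x_audio), 1))

train_model.fit(x=[x_audio, y_labels, in_lens, lab_lens],
                y=dummy_target, batch_size=16, epochs=50)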
S5: inputting the effective audio segments into the speech recognition engine and outputting text recognition results.
The text recognition results can be output, saved or used by other applications.
By optimizing the structure of the deep neural network, the method of the embodiment of the present invention uses convolutional layers, which have the advantages of local perception and weight sharing; relative to a fully connected DNN, they can effectively extract data features with a comparatively small number of parameters and reduce the probability of overfitting. The bidirectional GRU model not only realizes the learning of language sequences but also further safeguards the model's understanding of context. The present invention additionally uses the CTC method, which removes the need to align speech with text beforehand and improves the efficiency of post-processing.
Preferred embodiments of the present solution have been shown above. It should be pointed out that, as those skilled in the art will understand, the solution is not restricted to the described embodiments; any equivalent or approximate replacement or modification made by those skilled in the art within the technical scope of the present disclosure, according to the technical solution and inventive concept of the present invention, should also be regarded as falling within the protection scope of the present invention.
Claims (6)
1. An air traffic control voice instruction recognition method based on deep learning, characterized by comprising the following steps:
S1: obtaining a voice signal to be recognized and converting it into 16-bit, 16 kHz PCM audio data;
S2: performing voice segmentation on the audio data to obtain processed effective audio segments;
S3: establishing a deep network model;
S4: training the deep network model with training data to obtain a speech recognition engine;
S5: inputting the effective audio segments into the speech recognition engine and outputting text recognition results.
2. The air traffic control voice instruction recognition method based on deep learning according to claim 1, characterized in that the voice signal in step S1 comprises a real-time voice signal and/or a historical voice signal.
3. The air traffic control voice instruction recognition method based on deep learning according to claim 1, characterized in that the voice segmentation in step S2 comprises the following steps:
S2.1: inputting the audio data and performing a fast Fourier transform (FFT) on each audio frame (1024 sampling points per frame) to obtain a spectrum sequence M(x), retaining only the part x = 1–256;
S2.2: setting a threshold f = -30 dB, the value of which can be adjusted according to the actual situation; if M(x) > f, recording 1, otherwise recording 0, forming a new sequence M0(x);
S2.3: setting a voice threshold v = 0.2, the value of which can be adjusted according to the actual situation, and summing M0; if M0/256 > v, considering the frame an active frame, i.e. containing voice;
S2.4: if a run of consecutive active frames exceeds 8 frames, considering the audio of the consecutive active frames an effective audio segment.
4. The air traffic control voice instruction recognition method based on deep learning according to claim 1, characterized in that the deep network model in step S3 uses one or more structurally identical convolution modules as the feature extractor, each convolution module comprising two convolutional layers and one pooling layer; the extracted feature data are processed by a reshape layer and a fully connected layer; sequence learning is performed by a gated recurrent unit using a bidirectional GRU neural network; and the output result is obtained by at least two fully connected layers.
5. The air traffic control voice instruction recognition method based on deep learning according to claim 4, characterized in that dropout layers are provided at the module junctions of the deep network model.
6. The air traffic control voice instruction recognition method based on deep learning according to claim 1, characterized in that step S4 specifically comprises:
S4.1: obtaining air traffic control command audio data: obtaining the control speech to be recognized;
S4.2: annotating the voice data: annotating the speech obtained in S4.1 with text to obtain training data, the obtained training data comprising voice data and annotation data;
S4.3: dividing the training data into paired groups;
S4.4: training the deep network model established in S3 with the Adadelta optimizer via the backpropagation algorithm, obtaining the trained speech recognition engine.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910619285.XA CN110415683A (en) | 2019-07-10 | 2019-07-10 | Air traffic control voice instruction recognition method based on deep learning
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910619285.XA CN110415683A (en) | 2019-07-10 | 2019-07-10 | Air traffic control voice instruction recognition method based on deep learning
Publications (1)
Publication Number | Publication Date |
---|---|
CN110415683A true CN110415683A (en) | 2019-11-05 |
Family
ID=68360925
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910619285.XA Pending CN110415683A (en) | 2019-07-10 | 2019-07-10 | Air traffic control voice instruction recognition method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110415683A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101599269A (en) * | 2009-07-02 | 2009-12-09 | 中国农业大学 | Sound end detecting method and device |
CN103730118A (en) * | 2012-10-11 | 2014-04-16 | 百度在线网络技术(北京)有限公司 | Voice signal collecting method and mobile terminal |
CN104715761A (en) * | 2013-12-16 | 2015-06-17 | 深圳百科信息技术有限公司 | Audio valid data detection methods and audio valid data detection system |
CN107577662A (en) * | 2017-08-08 | 2018-01-12 | 上海交通大学 | Towards the semantic understanding system and method for Chinese text |
CN109766523A (en) * | 2017-11-09 | 2019-05-17 | 普天信息技术有限公司 | Part-of-speech tagging method and labeling system |
CN108282262A (en) * | 2018-04-16 | 2018-07-13 | 西安电子科技大学 | Intelligent clock signal sorting technique based on gating cycle unit depth network |
CN108986791A (en) * | 2018-08-10 | 2018-12-11 | 南京航空航天大学 | For the Chinese and English languages audio recognition method and system in civil aviaton's land sky call field |
Non-Patent Citations (1)
Title |
---|
Wang Jiawen: "Research on Speech Recognition Technology for Civil Aviation Ground-Air Communication", China Excellent Master's Theses Full-text Database *
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110491371A (en) * | 2019-08-07 | 2019-11-22 | 北京悠数智能科技有限公司 | A kind of blank pipe instruction translation method for improving semantic information |
CN110808036A (en) * | 2019-11-07 | 2020-02-18 | 南京大学 | Incremental voice command word recognition method |
CN110930995A (en) * | 2019-11-26 | 2020-03-27 | 中国南方电网有限责任公司 | Voice recognition model applied to power industry |
CN110930985B (en) * | 2019-12-05 | 2024-02-06 | 携程计算机技术(上海)有限公司 | Telephone voice recognition model, method, system, equipment and medium |
CN110930985A (en) * | 2019-12-05 | 2020-03-27 | 携程计算机技术(上海)有限公司 | Telephone speech recognition model, method, system, device and medium |
CN111312228A (en) * | 2019-12-09 | 2020-06-19 | 中国南方电网有限责任公司 | End-to-end-based voice navigation method applied to electric power enterprise customer service |
CN111063336A (en) * | 2019-12-30 | 2020-04-24 | 天津中科智能识别产业技术研究院有限公司 | End-to-end voice recognition system based on deep learning |
CN111627257A (en) * | 2020-04-13 | 2020-09-04 | 南京航空航天大学 | Control instruction safety rehearsal and verification method based on aircraft motion trend prejudgment |
CN111627257B (en) * | 2020-04-13 | 2022-05-03 | 南京航空航天大学 | Control instruction safety rehearsal and verification method based on aircraft motion trend prejudgment |
CN111667830B (en) * | 2020-06-08 | 2022-04-29 | 中国民航大学 | Airport control decision support system and method based on controller instruction semantic recognition |
CN111667830A (en) * | 2020-06-08 | 2020-09-15 | 中国民航大学 | Airport control decision support system and method based on controller instruction semantic recognition |
CN112420024A (en) * | 2020-10-23 | 2021-02-26 | 四川大学 | Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device |
CN112420024B (en) * | 2020-10-23 | 2022-09-09 | 四川大学 | Full-end-to-end Chinese and English mixed empty pipe voice recognition method and device |
CN112508023A (en) * | 2020-10-27 | 2021-03-16 | 重庆大学 | Deep learning-based end-to-end identification method for code-spraying characters of parts |
CN113409787A (en) * | 2021-07-08 | 2021-09-17 | 上海民航华东空管工程技术有限公司 | Civil aviation control voice recognition system based on artificial intelligence technology |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110415683A (en) | Air traffic control voice instruction recognition method based on deep learning | |
CN107239446B (en) | A kind of intelligence relationship extracting method based on neural network Yu attention mechanism | |
CN109977234A (en) | A kind of knowledge mapping complementing method based on subject key words filtering | |
CN110309503A (en) | A kind of subjective item Rating Model and methods of marking based on deep learning BERT--CNN | |
CN108986791A (en) | For the Chinese and English languages audio recognition method and system in civil aviaton's land sky call field | |
CN105957518A (en) | Mongolian large vocabulary continuous speech recognition method | |
CN111179917B (en) | Speech recognition model training method, system, mobile terminal and storage medium | |
CN110459208A (en) | A kind of sequence of knowledge based migration is to sequential speech identification model training method | |
CN101645269A (en) | Language recognition system and method | |
CN101650943A (en) | Non-native speech recognition system and method thereof | |
CN113160798B (en) | Chinese civil aviation air traffic control voice recognition method and system | |
CN111063336A (en) | End-to-end voice recognition system based on deep learning | |
CN109949796A (en) | A kind of end-to-end framework Lhasa dialect phonetic recognition methods based on Tibetan language component | |
CN115206293B (en) | Multi-task air traffic control voice recognition method and device based on pre-training | |
CN111667830A (en) | Airport control decision support system and method based on controller instruction semantic recognition | |
CN106548775A (en) | A kind of audio recognition method and system | |
CN104751227A (en) | Method and system for constructing deep neural network | |
CN110334243A (en) | Audio representation learning method based on multilayer timing pond | |
CN111243591B (en) | Air control voice recognition method introducing external data correction | |
CN105654947A (en) | Method and system for acquiring traffic information in traffic broadcast speech | |
CN115240651A (en) | Land-air communication speaker role identification method and device based on feature fusion | |
CN114944150A (en) | Dual-task-based Conformer land-air communication acoustic model construction method | |
CN110232121B (en) | Semantic network-based control instruction classification method | |
CN112133292A (en) | End-to-end automatic voice recognition method for civil aviation land-air communication field | |
CN111090726A (en) | NLP-based electric power industry character customer service interaction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20191105 |