CN112420024A - Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device - Google Patents


Info

Publication number
CN112420024A
CN112420024A
Authority
CN
China
Prior art keywords
english
voice
chinese
traffic control
air traffic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011147669.5A
Other languages
Chinese (zh)
Other versions
CN112420024B (en)
Inventor
Yi Lin (林毅)
Bo Yang (杨波)
Jianwei Zhang (张建伟)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority claimed from application CN202011147669.5A
Publication of CN112420024A
Application granted
Publication of CN112420024B
Legal status: Active

Classifications

    • G10L 15/063: Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/005: Speech recognition; language recognition
    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise or of stress-induced speech
    • G10L 19/04: Speech or audio analysis-synthesis techniques for redundancy reduction, using predictive techniques
    • G10L 25/30: Speech or voice analysis techniques characterised by the use of neural networks
    • Y02T 10/40: Engine management systems (automated cross-sectional tag)

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to the field of civil aviation air traffic control (ATC) and speech recognition, in particular to a fully end-to-end method and device for recognizing mixed Chinese-English ATC speech. A feature learning module extracts speech features in advance, so that the mixed Chinese-English ATC speech recognition model obtains more discriminative features and adapts better to speech signals from different scenarios. By handling the whole pipeline from raw speech signal to readable instruction text within one unified framework, the method removes the language-identification step required by existing separate per-language recognition systems, simplifies the architecture of mixed-language recognition, applies the speech features to recognition more reasonably and effectively, judges pronunciation and word meaning accurately, and improves both the performance and the practicality of mixed speech recognition.

Description

Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device
Technical Field
The invention relates to the field of civil aviation air traffic control (ATC) and speech recognition, in particular to a fully end-to-end method and device for mixed Chinese-English ATC speech recognition.
Background
In civil aviation air traffic control, controllers and pilots communicate and coordinate in real time by radio voice to ensure the safety of local air traffic operations. In the current control system, control speech is transmitted over VHF (Very High Frequency) radio, whose reliability strongly affects the quality of the control speech and, in turn, the performance of speech recognition. Moreover, because communication resources are limited, a controller usually talks with the pilots of several flights in the controlled sector on the same frequency. The speakers, radio-equipment errors, and transmission environment on a given frequency (channel) therefore change constantly, and so do the characteristics of the speech carried on the channel. These properties of ATC speech pose a serious challenge to feature engineering for speech recognition, which must extract robust features that support the recognition model under differing transmission conditions. In short, solving the problem of speech feature representation in the complex ATC environment is a key step toward improving speech recognition performance.
Meanwhile, under the relevant regulations of the International Civil Aviation Organization, English is the common language of air traffic control. For historical reasons of civil aviation control in China, controllers generally speak Chinese when commanding domestic flights and English when commanding international flights. Chinese ATC also uses many vocabulary items named in English, such as waypoint names and route designators, which must be spoken in English during control. Consequently, a single control instruction in Chinese civil aviation may mix Chinese and English, for example "echo echo eight november charlie alpha two first-class sailing four-five-two". Because Chinese and English are different languages, their pronunciations and vocabularies exhibit completely different characteristics. Studying Chinese and English acoustic modeling at a common scale is therefore a key step toward mixed Chinese-English recognition; resolving the uneven distribution of Chinese and English vocabulary is likewise necessary for improving ATC speech recognition performance; and mixed Chinese-English recognition itself is a key technical problem that ATC speech recognition must solve. Existing speech recognition methods generally recognize a single language, the obtained speech signals are of poor quality with scattered features, and the scales of pronunciation and word meaning are hard to judge accurately in mixed Chinese-English recognition.
In view of these problems, there is an urgent need to study a mixed Chinese-English speech recognition method for the ATC scenario, including its model structure and training, so as to solve the poor signal quality, scattered features, and mismatched pronunciation and word-meaning scales of mixed recognition in the prior art, and to improve the usability and extensibility of ATC speech recognition in applications and engineering.
Disclosure of Invention
The invention aims to solve the problems of poor speech signal quality, scattered features, and the difficulty of accurately judging the scales of pronunciation and word meaning in mixed Chinese-English recognition in the prior art, and provides a fully end-to-end mixed Chinese-English air traffic control speech recognition method and device.
In order to achieve the above purpose, the invention provides the following technical scheme:
a full end-to-end Chinese and English mixed air traffic control voice recognition method is characterized by comprising the following steps:
a: collecting empty pipe voice and preprocessing the empty pipe voice; the empty pipe voice is audio data mixed by Chinese and English;
b: inputting the empty pipe voice into a pre-established Chinese and English mixed empty pipe voice recognition model;
c: outputting instruction information corresponding to the air traffic control voice;
the Chinese and English mixed air traffic control voice recognition model comprises a feature learning module and a voice recognition module; the feature learning module is used for extracting the voice features of the air traffic control voice in advance, and the voice recognition module is used for converting the extracted voice features into computer-readable instruction text information. The voice features are extracted in advance through the feature learning module, so that the Chinese and English mixed air traffic control voice recognition model can extract voice features with higher identifiability and better adapt to voice signals under different scenes; in the processing paradigm from an original voice signal to a readable instruction text, a unified framework is used for solving the problem of Chinese and English mixed voice recognition, so that the language attribute judgment link in the existing independent recognition system can be avoided, the system architecture of mixed voice recognition is simplified, voice characteristics can be more reasonably and effectively applied to the recognition of the model, the pronunciation and the word meaning are accurately judged, and the mixed voice recognition performance and the practicability are improved.
As a preferred scheme of the invention, construction of the mixed Chinese-English ATC speech recognition model comprises the following steps (an illustrative code sketch of the resulting cascade follows this list):
s1: input speech training samples and preprocess them to obtain unlabeled raw speech signals and segmented, labeled single-utterance speech signals;
s2: build the feature learning module from a convolutional neural network, a recurrent neural network, and fully connected layers; train it on the unlabeled raw speech in a self-supervised manner until the model error is stable; and extract speech features from the unlabeled raw speech;
s3: build the speech recognition module from a recurrent neural network and fully connected layers, and, using the extracted speech features, cascade it with the feature learning module to obtain the mixed Chinese-English ATC speech recognition model;
s4: train the cascaded model on the segmented, labeled single-utterance speech signals and the corresponding instruction texts until the model error is reduced, then output the mixed Chinese-English ATC speech recognition model.
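For illustration only, the cascade of steps s2-s3 might be sketched in PyTorch as follows; the layer counts, dimensions, and class names here are assumptions for illustration, not the patented configuration (the detailed description later gives ranges such as 1-10 CNN and 1-5 LSTM layers).

```python
import torch
import torch.nn as nn

class FeatureLearningModule(nn.Module):
    """Hidden-space encoder (CNN) plus context decoder (LSTM); sizes are illustrative."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # 1-D convolutions over raw waveform samples: frame-level encoder
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(64, feat_dim, kernel_size=8, stride=4), nn.ReLU(),
        )
        # context decoder over the frame sequence
        self.context = nn.LSTM(feat_dim, feat_dim, num_layers=2, batch_first=True)

    def forward(self, wav):                      # wav: (B, 1, T_samples)
        z = self.encoder(wav).transpose(1, 2)    # (B, T_frames, feat_dim)
        c, _ = self.context(z)                   # contextual features
        return c

class RecognitionModule(nn.Module):
    """BLSTM plus fully connected layer over the mixed Chinese/English vocabulary."""
    def __init__(self, feat_dim=256, vocab_size=1000):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, 256, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.fc = nn.Linear(512, vocab_size)     # vocabulary incl. CTC blank

    def forward(self, feats):
        h, _ = self.blstm(feats)
        return self.fc(h)                        # per-frame label logits

class MixedATCRecognizer(nn.Module):
    """Cascade of the pretrained feature module and the recognition module (steps s3-s4)."""
    def __init__(self, vocab_size=1000):
        super().__init__()
        self.features = FeatureLearningModule()
        self.recognizer = RecognitionModule(vocab_size=vocab_size)

    def forward(self, wav):
        return self.recognizer(self.features(wav))
```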
As a preferred solution of the present invention, the feature learning module comprises a hidden-space feature encoder and a context feature decoder, and learns robust speech features from unlabeled raw speech in a self-supervised manner.
The encoder obtains frame-level speech features; the decoder obtains contextual sequence features according to the temporal correlation of the speech signal. Both kinds of features can serve subsequent speech processing tasks, including but not limited to speech recognition, voiceprint recognition, and language identification. By designing a deep learning model trained with self-supervision, feature representations of speech in complex scenarios are learned from unlabeled samples to support speech recognition research, the robustness of the representation across scenarios is improved, and the recognition performance of the mixed Chinese-English ATC model is greatly improved without increasing labeling cost.
As a preferred scheme of the present invention, the trunk networks of the implicit spatial feature encoder and the context feature decoder include a convolutional neural network unit, a long-time and short-time memory unit, and a fully-connected prediction unit;
the convolutional neural network unit is used for acquiring voice features from the original voice signal, learning audio features with discriminativity, discarding interfering voice features and performing data compression;
the long-time and short-time memory unit is used for acquiring time sequence characteristics from the original voice signal and establishing a mapping relation among the voice signal, the voice characteristics and the instruction text;
and the full-connection prediction unit is used for predicting the voice characteristics of the subsequent voice signals according to the time sequence characteristics and finishing the self-supervision training of the characteristic learning module.
As a preferred embodiment of the present invention, the forward inference rule of the convolutional neural network unit is:

S_{i,j} = (X * W)_{i,j} = Σ_h Σ_k X_{i+h, j+k} · W_{h,k}

where X and S are the input and output feature maps respectively, W is a trainable weight parameter, * denotes the convolution operation, (h, k) indexes the convolution kernel of the given size, and (i, j) is a position index within the feature map.
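A direct, loop-based evaluation of this rule (equivalent to a "valid" cross-correlation as deep-learning frameworks compute it) is sketched below for reference.

```python
import numpy as np

def conv2d_valid(X, W):
    """Direct evaluation of S[i,j] = sum_{h,k} X[i+h, j+k] * W[h,k]
    over all positions where the kernel fits ('valid' mode)."""
    H, K = W.shape
    out_h = X.shape[0] - H + 1
    out_w = X.shape[1] - K + 1
    S = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            S[i, j] = np.sum(X[i:i+H, j:j+K] * W)
    return S
```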
As a preferred embodiment of the present invention, the long short-term memory unit is computed as:

i_t = sigmoid(W_{xi} x_t + W_{hi} h_{t-1})
f_t = sigmoid(W_{xf} x_t + W_{hf} h_{t-1})
o_t = sigmoid(W_{xo} x_t + W_{ho} h_{t-1})
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_{xc} x_t + W_{hc} h_{t-1})
h_t = o_t ⊙ tanh(c_t)

where the subscript t is the time step of the prediction; i, f, o, and c denote the responses of the input gate, forget gate, output gate, and cell of the unit respectively; h_t is the final hidden-unit response; W_{xi} is the weight connecting the current input to the input gate, the remaining weight parameters W_{··} having analogous meanings; and ⊙ denotes element-wise multiplication.
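A single step of this unit, written to mirror the five equations (biases are omitted, as in the text above; the weight shapes are assumptions):

```python
import torch

def lstm_step(x_t, h_prev, c_prev, p):
    """One long short-term memory step. p maps names to weight matrices:
    W_x* of shape (input_dim, hidden), W_h* of shape (hidden, hidden)."""
    i_t = torch.sigmoid(x_t @ p["W_xi"] + h_prev @ p["W_hi"])   # input gate
    f_t = torch.sigmoid(x_t @ p["W_xf"] + h_prev @ p["W_hf"])   # forget gate
    o_t = torch.sigmoid(x_t @ p["W_xo"] + h_prev @ p["W_ho"])   # output gate
    c_t = f_t * c_prev + i_t * torch.tanh(x_t @ p["W_xc"] + h_prev @ p["W_hc"])
    h_t = o_t * torch.tanh(c_t)                                  # hidden response
    return h_t, c_t
```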
As a preferred embodiment of the present invention, the forward inference rule of the fully connected prediction unit is z = σ(Wx + b), where σ is a nonlinear activation function, x is the input, z is the output, W is a trainable weight parameter, and b is a bias.
As a preferred scheme of the invention, each model is trained as follows (a schematic training-loop sketch follows this list):
a: select the model to train and, according to the model, choose the corresponding input data, output data, and training loss function;
b: set the training hyperparameters and choose the training strategies; the hyperparameters include the learning rate, batch size, and maximum number of iterations, and the strategies include the learning-rate decay, validation, and training-termination strategies;
c: update the model parameters by gradient descent and back-propagation until the error is stable.
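A generic sketch of steps a-c; the validation helper, the optimizer choice, and the hyperparameter values are placeholders, not values from the patent.

```python
import torch

def train(model, loader, loss_fn, val_fn, lr=1e-4, lr_decay=0.99,
          max_epochs=100, patience=5):
    """Gradient descent with back-propagation, exponential learning-rate
    decay, and termination once the validation error stops improving.
    val_fn is an assumed caller-supplied helper returning the current
    validation error."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=lr_decay)
    best, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for inputs, targets in loader:
            opt.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()              # back-propagation
            opt.step()                   # gradient-descent update
        sched.step()                     # learning-rate decay strategy
        val = val_fn(model)              # validation strategy
        if val < best - 1e-4:
            best, bad_epochs = val, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # error stable: terminate training
                break
    return model
```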
As a preferred embodiment of the present invention, in step s2 both the input and the output of the feature learning module during training are the unlabeled raw speech signal, with Contrastive Predictive Coding (CPC) as the loss function;
in step s3, the speech recognition module is trained with the speech features extracted in step s2 as input and the corresponding instruction texts as output, with Connectionist Temporal Classification (CTC) as the loss function;
in step s4, the mixed Chinese-English ATC speech recognition model is trained with the segmented, labeled single-utterance speech signals as input and the corresponding instruction texts as output, again with CTC as the loss function (a usage sketch of the CTC loss follows).
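As an illustration of how a CTC loss is wired up, using PyTorch's nn.CTCLoss; the tensor sizes and the blank index 0 are dummy assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

T, B, V = 120, 4, 1000                        # frames, batch, vocabulary size
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# logits: (T, B, V) per-frame scores from the recognition module
logits = torch.randn(T, B, V, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)        # CTC expects log-probabilities

targets = torch.randint(1, V, (B, 20))        # dummy instruction-text label indices
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                               # gradients flow back to the model
```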
As a preferred embodiment of the present invention, the mixed Chinese-English ATC speech recognition model further includes a Chinese-English instruction vocabulary, whose training process comprises:
for Chinese speech, Chinese characters are used as the labeling vocabulary units;
for English speech, the labeling vocabulary units are learned and generated with the BPE algorithm;
the resulting English vocabulary and the Chinese character set are merged into the final Chinese-English instruction vocabulary. The invention designs this vocabulary around a uniform pronunciation scale as the acoustic modeling unit for mixed ATC speech recognition: it unifies the pronunciation scales of Chinese and English, improves the balance between vocabulary items, makes model training and application more efficient, and supports transfer training and further learning in new application scenarios, saving computing resources and accelerating ATC speech recognition research.
As a preferred embodiment of the present invention, learning and optimizing the English vocabulary comprises the following steps:
1) input the labeled samples and obtain English subword units;
2) measure the number of English subword units, their degree of pronunciation match, and the vocabulary balance of the labeled samples; the vocabulary balance is the occurrence frequency of each English subword unit after the English word labels in the samples are converted into subword-unit sequences;
3) optimize the objective function with the BPE algorithm, increasing the objective value.
The invention uses the BPE algorithm to learn frequent substrings (English subword units) from the input corpus and collects the frequent subword units into an English dictionary. The dictionary contains both character-level and word-level vocabulary, so it combines the advantages of both for acoustic modeling in speech recognition and improves acoustic modeling accuracy and final recognition performance.
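A minimal sketch of the BPE merge-learning loop described here; it is simplified (real BPE toolkits add word-boundary markers and frequency cutoffs), and the example words and counts are illustrative.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair so that
    frequent substrings become single subword units.
    words: dict mapping an English word to its corpus frequency."""
    vocab = {tuple(w): f for w, f in words.items()}   # word -> symbol tuple
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq                 # count adjacent pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # most frequent pair
        merges.append(best)
        merged = {}
        for symbols, freq in vocab.items():           # apply the merge
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

# e.g. learn_bpe({"echo": 30, "charlie": 12, "alpha": 9}, num_merges=10)
```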
As a preferred embodiment of the present invention, the optimization objective function is:

max O = Σ_i a_i · L_i

where the a_i are the weight parameters of the different optimization objectives and the L_i are the different optimization objective functions, as follows:

L_1 = V_bpe / V represents the number of output subword units, where V_bpe and V are the number of output subword units and the number of input English words respectively;

L_2 = (1 / V_bpe) · Σ_i (η_i / η_max) represents the degree of balance of the occurrence frequencies of different vocabulary items, where η_i and η_max are the occurrence frequency of a vocabulary item and the maximum frequency respectively;

L_3 = V_s / V_bpe measures the number V_s of monosyllabic items in the output vocabulary, expressing the degree of match with Chinese pronunciation.

The invention trains the English vocabulary by maximizing this objective function, which raises the occurrence frequency of vocabulary items, effectively relieves the imbalance between different items, and finally realizes high-performance mixed Chinese-English ATC speech recognition.
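A sketch of how such an objective could be evaluated for a candidate vocabulary; the ratio forms mirror the reconstruction above and are an assumption, since the original formula images are not recoverable.

```python
def vocab_objective(subwords, freqs, n_input_words, monosyllabic_count,
                    a=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms described in the text.
    subwords: output subword list; freqs: occurrence frequency per subword;
    monosyllabic_count: number of monosyllabic units in the output lexicon;
    a: weight parameters a_i (illustrative defaults)."""
    V_bpe = len(subwords)
    L1 = V_bpe / n_input_words                          # subword-count term
    eta_max = max(freqs)
    L2 = sum(f / eta_max for f in freqs) / V_bpe        # frequency balance
    L3 = monosyllabic_count / V_bpe                     # pronunciation match
    return a[0] * L1 + a[1] * L2 + a[2] * L3
```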
A fully end-to-end mixed Chinese-English air traffic control speech recognition device comprises at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any of the methods described above.
Compared with the prior art, the invention has the following beneficial effects:
1. Pre-extracting speech features with the feature learning module lets the mixed Chinese-English ATC recognition model obtain more discriminative features and adapt better to speech signals from different scenarios; solving mixed Chinese-English recognition within one unified framework, from raw speech signal to readable instruction text, removes the language-identification step of existing separate recognition systems, simplifies the architecture of mixed recognition, applies the speech features to recognition more effectively, judges pronunciation and word meaning accurately, and improves performance and practicality.
2. A deep learning model trained by self-supervision learns feature representations of speech in complex scenarios from unlabeled samples to support speech recognition research, improving robustness across scenarios and greatly improving the recognition performance of the mixed Chinese-English ATC model without increasing labeling cost.
3. A vocabulary designed around a uniform pronunciation scale serves as the acoustic modeling unit for mixed ATC recognition; it unifies the pronunciation scales of Chinese and English, relieves the imbalance between ATC vocabulary items, improves training and application efficiency, and supports transfer training and reinforcement learning in new application scenarios, saving computing resources and accelerating ATC speech recognition research.
4. The BPE algorithm learns frequent substrings (English subword units) from the input corpus and collects them into an English dictionary containing both character-level and word-level vocabulary, combining the advantages of both for acoustic modeling and improving modeling accuracy and final recognition performance.
5. The English vocabulary is trained by maximizing the proposed objective function, which raises the occurrence frequency of vocabulary items, effectively relieves the imbalance between different items, and finally realizes high-performance mixed Chinese-English ATC speech recognition.
Drawings
Fig. 1 is a schematic flow chart of a full-end-to-end chinese-english hybrid air traffic control speech recognition method according to embodiment 1 of the present invention;
fig. 2 is a processing flow chart of a full end-to-end chinese-english hybrid air traffic control speech recognition method according to embodiment 2 of the present invention;
fig. 3 is an example of an acoustic modeling unit of a full-end-to-end chinese-english hybrid air traffic control speech recognition method according to embodiment 2 of the present invention;
fig. 4 is a schematic diagram of a model structure of a full-end-to-end chinese-english hybrid air traffic control speech recognition method according to embodiment 2 of the present invention;
fig. 5 is a schematic diagram of an experimental result of a full-end-to-end chinese-english hybrid air traffic control speech recognition method according to embodiment 4 of the present invention;
fig. 6 is a schematic diagram of an experimental result of a full-end-to-end chinese-english hybrid air traffic control speech recognition method according to embodiment 5 of the present invention;
fig. 7 is a schematic structural diagram of a full-end-to-end chinese-english hybrid air traffic control speech recognition device according to embodiment 6 of the present invention.
Detailed Description
The present invention is described in further detail below with reference to test examples and specific embodiments. It should be understood that the scope of the invention is not limited to the following examples; any technique implemented based on this disclosure falls within the scope of the invention.
Example 1
As shown in Fig. 1, a fully end-to-end mixed Chinese-English air traffic control speech recognition method comprises the following steps:
a: collect ATC speech and preprocess it; the ATC speech is audio data mixing Chinese and English;
b: feed the ATC speech into the pre-built mixed Chinese-English ATC speech recognition model;
c: output the instruction information corresponding to the ATC speech.
The model comprises a feature learning module and a speech recognition module; the feature learning module extracts the speech features of the ATC speech, and the speech recognition module optimizes the model parameters and outputs the corresponding instruction information.
Training the mixed Chinese-English ATC speech recognition model comprises the following steps:
s1: input speech training samples and preprocess them to obtain unlabeled raw speech signals and segmented, labeled single-utterance speech signals;
s2: build the feature learning module from a convolutional neural network, a recurrent neural network, and fully connected layers; train it on the unlabeled raw speech in a self-supervised manner until the model error is stable; and extract speech features from the unlabeled raw speech.
The feature learning module comprises a hidden-space feature encoder and a context feature decoder, and learns speech features from unlabeled raw speech in a self-supervised manner; the encoder obtains frame-level speech features, and the decoder obtains contextual sequence features according to the temporal correlation of the speech signal.
The convolutional neural network unit extracts speech features from the raw signal, learns discriminative audio features, discards interfering ones, and compresses the data. Its forward inference rule is:

S_{i,j} = (X * W)_{i,j} = Σ_h Σ_k X_{i+h, j+k} · W_{h,k}

where X and S are the input and output feature maps respectively, W is a trainable weight parameter, * denotes the convolution operation, (h, k) is the convolution kernel size, and (i, j) is a position coordinate within the feature map.
The long short-term memory unit captures temporal features of the raw speech signal and establishes the mapping between speech signal, speech features, and instruction text. It is computed as:

i_t = sigmoid(W_{xi} x_t + W_{hi} h_{t-1})
f_t = sigmoid(W_{xf} x_t + W_{hf} h_{t-1})
o_t = sigmoid(W_{xo} x_t + W_{ho} h_{t-1})
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_{xc} x_t + W_{hc} h_{t-1})
h_t = o_t ⊙ tanh(c_t)

where the subscript t is the time step of the prediction; i, f, o, and c denote the responses of the input gate, forget gate, output gate, and cell of the unit respectively; h_t is the final hidden-unit response; W_{xi} is the weight connecting the current input to the input gate, the remaining weight parameters W_{··} having analogous meanings; and ⊙ denotes element-wise multiplication.
The fully connected prediction unit predicts the features of subsequent speech frames from the temporal features, realizing the self-supervised training of the feature learning module. Its forward inference rule is z = σ(Wx + b), where σ is a nonlinear activation function, x is the input, z is the output, W is a trainable weight parameter, and b is a bias.
s3: build the speech recognition module from a recurrent neural network and fully connected layers, and cascade it with the feature learning module to obtain the mixed Chinese-English ATC speech recognition model;
s4: train the mixed model on a number of single-utterance speech signals and the corresponding instruction texts until the model error is reduced, then output the mixed Chinese-English ATC speech recognition model.
The model of each module is trained as follows:
a: select the model to train and, according to the model, choose the corresponding input data, output data, and training loss function;
b: set the training hyperparameters and choose the training strategies; the hyperparameters include the learning rate, batch size, and maximum number of iterations, and the strategies include the learning-rate decay, validation, and training-termination strategies;
c: update the model parameters by gradient descent and back-propagation until the error is stable.
Specifically, in step s2 the input and output of the feature learning module during training are both the unlabeled raw speech signals, with Contrastive Predictive Coding as the loss function;
in step s3, the speech recognition module is trained with the speech features extracted in step s2 as input and the corresponding instruction texts as output, with Connectionist Temporal Classification as the loss function;
in step s4, the mixed Chinese-English ATC speech recognition model is trained with the segmented, labeled single-utterance speech signals as input and the corresponding instruction texts as output, again with Connectionist Temporal Classification as the loss function.
Example 2
As shown in Fig. 2, this embodiment details the training process of the mixed Chinese-English ATC speech recognition model of embodiment 1, with the following specific steps:
Step 1: preprocess the speech recognition training samples, comprising the following processes:
Step 1-1: first, Voice Activity Detection (VAD) is used to divide continuous raw conversational speech into single audio files, each containing only one speaker's speech, i.e. the content of a single control instruction, with silence and noise removed (a simple illustrative sketch follows).
Step 1-2: according to the ATC speech content of the scheme, the readable instruction text corresponding to each audio file is labeled with Chinese characters and English words, and the unlabeled raw speech signals and segmented, labeled single-utterance speech signals are output.
Step 2: build the feature learning module on a deep neural network, learn speech feature representations from the unlabeled raw speech signal by self-supervised learning, and train the representation-learning model until its error is stable. This comprises the following steps:
Step 2-1: the model structure of the feature learning module has two parts, a hidden-space feature encoder and a context feature decoder.
The hidden-space feature encoder extracts frame-level speech feature representations from the raw speech signal (one-dimensional sample points) through a learning mechanism, and comprises 1-10 CNN and 1-5 LSTM neural network layers;
the context feature decoder mines contextual sequence speech representations from the frame-level features according to the temporal correlation of the speech signal, and comprises 1-10 CNN and 1-3 FC neural network layers.
The network structure of the hidden-space feature encoder and the context feature decoder uses convolutional neural network (CNN), long short-term memory (LSTM), and fully connected (FC) layers, as follows:
CNN layer: several two-dimensional convolutional layers extract feature representations from the input raw audio signal; each CNN layer uses convolution kernels of different sizes to mine speech features at different temporal resolutions, learn discriminative audio features, discard interference features, and compress the data volume. The forward inference rule of the CNN module is:

S_{i,j} = (X * W)_{i,j} = Σ_h Σ_k X_{i+h, j+k} · W_{h,k}

where the convolution kernel size is (h, k), (i, j) denotes a position in the feature map, X and S are the input and output feature maps respectively, W is a trainable weight parameter, and * is the convolution operation.
LSTM layer: after the CNN layers, the scheme adopts a bidirectional LSTM (BLSTM) module to mine the temporal correlation of the speech signal and establish the distribution patterns within the speech frame sequence and between it and the vocabulary label sequence. LSTM is computed as:

i_t = sigmoid(W_{xi} x_t + W_{hi} h_{t-1})
f_t = sigmoid(W_{xf} x_t + W_{hf} h_{t-1})
o_t = sigmoid(W_{xo} x_t + W_{ho} h_{t-1})
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_{xc} x_t + W_{hc} h_{t-1})
h_t = o_t ⊙ tanh(c_t)

where the subscript t denotes the time step of the prediction; i, f, o, and c denote the input gate, forget gate, output gate, and cell responses of the LSTM unit respectively; h_t is the hidden-unit response; W_{xi} is the weight connecting the current input to the input gate, the other weight parameters W_{··} having analogous meanings; and ⊙ denotes element-wise multiplication.
FC layer: after the aforementioned units, the scheme designs an FC layer as the predictor, whose purpose is to predict the features of subsequent frames from the features learned from the continuously input history, completing the self-supervised learning designed by the scheme. The forward inference rule of the FC layer is z = σ(Wx + b), where σ denotes a nonlinear activation function, x and z the input and output respectively, and W and b the trainable weight parameters and bias respectively.
Step 2-2: self-supervised learning is applied to learn feature representations from the unlabeled raw speech signal. The self-supervised task of the scheme is defined as predicting the features of future speech frames from the input historical frames, so both the input and the output of the training samples are speech signals and can be obtained directly from the inherent temporal structure of the signal.
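A minimal sketch of a CPC-style InfoNCE loss for this future-frame prediction task; the linear projection head and the use of in-batch negatives are assumptions, as the patent does not specify them.

```python
import torch
import torch.nn.functional as F

def cpc_loss(context, future, predictor, k=3):
    """From the context vector at time t, predict the encoder feature k
    steps ahead and distinguish it from 'negative' features taken from
    other times (InfoNCE). context, future: (B, T, D) tensors;
    predictor: e.g. torch.nn.Linear(D, D)."""
    B, T, D = future.shape
    pred = predictor(context[:, :-k, :])      # predictions for t+k
    target = future[:, k:, :]                 # true future features
    pred = pred.reshape(-1, D)                # (N, D)
    target = target.reshape(-1, D)
    logits = pred @ target.t()                # similarity of every pred/target pair
    labels = torch.arange(logits.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```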
Step 2-3: train and optimize the learning model on the sample data until the model converges. Specifically:
Step 2-3-1: the raw speech signal serves as the self-supervised training sample, i.e. both the input and the output are the raw speech signal;
Step 2-3-2: choose a suitable model structure, including the number of hidden layers and neurons. The scheme adopts Contrastive Predictive Coding (CPC) as the self-supervised loss function: contrastive training distinguishes whether a predicted speech feature was extracted from the real speech signal;
Step 2-3-3: choose suitable training hyperparameters, including the learning rate and its decay strategy, batch size, maximum number of iterations, validation strategy, and training-termination strategy;
Step 2-3-4: update the network model parameters by gradient descent and back-propagation until the error is stable.
Step 3: construct an acoustic modeling unit aimed at uniform pronunciation, suitable for mixed Chinese-English recognition. Specifically:
Step 3-1: the acoustic modeling units for the Chinese corpus in mixed Chinese-English recognition are Chinese characters;
Step 3-2: the scheme decomposes English words into English subword units, keeping each unit a monosyllabic pronunciation during decomposition; a related example is shown in Fig. 3;
Step 3-3: the Chinese and English acoustic modeling units are combined into the acoustic modeling vocabulary for mixed Chinese-English recognition of the scheme.
Step 4: build the speech recognition module on a recurrent neural network and fully connected layers, and cascade the feature learning module with it to form the final mixed Chinese-English ATC speech recognition model. Specifically:
Step 4-1: the speech recognition module, built on a recurrent neural network and fully connected layers, uses the acoustic modeling vocabulary for mixed Chinese-English recognition established above as its modeling units; its network comprises 1-10 LSTM layers and 1-3 FC neural network layers;
Step 4-2: the feature representation learning model and the speech recognition network are cascaded into the final mixed Chinese-English ATC speech recognition model of the scheme, whose structure is shown in Fig. 4.
Step 5: train the mixed Chinese-English ATC speech recognition model on the segmented, labeled single-utterance speech signals and readable instruction texts until the model converges. Specifically:
Step 5-1: the data pairs (raw speech signal, instruction text sequence) serve as training samples, i.e. the inputs and outputs are the segmented, labeled single-utterance speech signals and instruction text sequences respectively;
Step 5-2: choose a suitable model structure, including the number of hidden layers and neurons; the scheme adopts Connectionist Temporal Classification (CTC) as the learning loss function;
Step 5-3: choose suitable training hyperparameters, including the learning rate and its decay strategy, batch size, maximum number of iterations, validation strategy, and training-termination strategy;
Step 5-4: update the network model parameters according to the training method above until the error is stable.
Step 6: based on the above training steps, train and output the corresponding mixed Chinese-English ATC speech recognition model.
Example 3
This embodiment differs from embodiments 1 and 2 in that the mixed Chinese-English ATC speech recognition model further includes a Chinese-English instruction vocabulary.
From the viewpoint of pronunciation, Chinese characters are monosyllabic, while English words (apart from some letters) are generally polysyllabic. From the viewpoint of linguistics, the Chinese character is the basic morphological unit of Chinese, though Chinese phrases express complete meanings; for English, letters are the basic morphological units, and English words are the smallest language units with complete meaning. A Chinese phrase typically contains 2-4 characters, while an English word may contain up to 10 letters. Thus Chinese and English are not on the same scale, whether in pronunciation or linguistics, and a new method for building and training the Chinese-English instruction vocabulary is needed.
The training process of the Chinese-English instruction vocabulary comprises:
for the Chinese speech of the vocabulary, Chinese characters are used as the labeling vocabulary units;
for the English speech of the vocabulary, the BPE (Byte Pair Encoding) algorithm learns the English labeling vocabulary units needed by the method.
The BPE algorithm learns frequent substrings (English subword units) from the input corpus and collects the frequent subword units into an English dictionary. The dictionary contains both character-level and word-level vocabulary, combining the advantages of both for acoustic modeling in speech recognition and improving modeling accuracy and final recognition performance.
The method takes the labels of the English samples for ATC speech recognition as input and uses the BPE algorithm to learn the higher-frequency English subword units, so that each English word can be decomposed into a label vocabulary sequence (list) of subword units, which is also the final English label sample of the speech recognition method. For example, the labeled vocabulary sequence of the English word echo is ['e', 'cho']. Training the whole English vocabulary involves the following variables:
1) Number of output subword units: if more subword units are output, the discrimination between units is high but their reuse is low; conversely, fewer units mean higher reuse but lower discrimination.
2) Pronunciation matching degree: since the scheme targets mixed Chinese-English ATC speech recognition, establishing a unified Chinese-English pronunciation scale is paramount; as Chinese characters are monosyllabic, the pronunciation of the English subword units should be kept at the same scale as Chinese characters as far as possible.
3) Vocabulary balance in the labeled samples: the English word labels of all samples are converted into subword-unit sequences and the occurrence frequency of each unit is counted. Generally, the higher a unit's occurrence frequency, the more it benefits model training and learning; balancing the frequencies of the vocabulary items is also very important for training.
In summary, training the English vocabulary is a process of optimizing the number, balance, and pronunciation matching degree of the vocabulary, for which the scheme proposes the following optimization objective function to obtain the optimal English vocabulary.
max O = Σ_i a_i · L_i

where the a_i are the weight parameters of the different optimization objectives and the L_i are the different optimization objective functions, as follows:

L_1 = V_bpe / V indicates the number of output subword units (the larger the value, the better), where V_bpe and V are the number of output subword units and the number of input English words respectively;

L_2 = (1 / V_bpe) · Σ_i (η_i / η_max) indicates the balance of the occurrence frequencies of different vocabulary items (η_i and η_max being the occurrence frequency of a vocabulary item and the maximum frequency respectively); the larger the value, the better;

L_3 = V_s / V_bpe measures the number V_s of monosyllabic items in the output vocabulary, expressing the degree of match with Chinese pronunciation; the larger the value, the better.

In conclusion, training the English vocabulary means running the BPE algorithm so that the proposed objective value is as large as possible. Decomposing English words into the learned subword sequences raises the occurrence frequency of vocabulary items, effectively relieves the imbalance between different items, and finally realizes high-performance mixed Chinese-English ATC speech recognition.
Finally, the trained English vocabulary and the Chinese character set are combined into the final acoustic modeling vocabulary for mixed Chinese-English ATC speech recognition of the scheme.
Example 4
This embodiment is a practical application of embodiment 1. The data used to verify the feasibility and performance of the adopted technical scheme are as follows:
1. Corpora: the corpora of this embodiment include a feature Representation Learning Corpus (RLC), an unlabeled corpus without text labels, and a Speech Recognition Corpus (SRC), a labeled corpus with text labels, all collected from a real ATC system. Specifically:
1) RLC: 1280 hours in total, of which 1070 hours are Chinese and 210 hours English;
2) SRC: 58 hours in total, of which 40 hours are Chinese and 18 hours English.
The test corpus of this embodiment totals 6 hours, with 4 hours Chinese and 2 hours English.
2. Baseline model: the DeepSpeech2 model serves as the baseline to verify the effectiveness of the scheme; the model input is MFCC features.
Both the baseline model and the technical scheme of the invention are implemented with the PyTorch framework. The hyperparameter configuration for representation-learning training and speech recognition training is as follows:
1. Pretraining: initial learning rate 1e-6, decayed on a cosine schedule; during training, speech is spliced into 10-second segments before being fed to the model;
2. Transfer optimization: initial learning rate 0.0001, learning-rate decay rate 0.99, and 160 samples per batch during training.
The hardware environment of the experiments: CPU 2 × Intel Core i7-6800K; GPU 2 × NVIDIA GeForce RTX 2080Ti (2 × 11 GB); 64 GB memory; operating system Ubuntu Linux 16.04.
Under these training data and configuration conditions, six groups of experiments demonstrate the advantages of the mixed Chinese-English recognition and feature learning models of the scheme, specifically:
a1: the baseline model for Chinese-only speech recognition;
a2: the baseline model for English-only speech recognition;
a3: the baseline model for mixed Chinese-English speech recognition;
b1: the proposed model for Chinese-only speech recognition;
b2: the proposed model for English-only speech recognition;
b3: the proposed model for mixed Chinese-English speech recognition.
The experimental results are measured by the Character Error Rate (CER) over Chinese characters and English letters, computed as:

CER = (I + D + S) / N

where N is the length of the reference text label, and I, D, and S are the numbers of insertion, deletion, and substitution operations needed to convert the predicted text label into the reference label.
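A straightforward implementation of this metric via Levenshtein distance:

```python
def cer(ref, hyp):
    """Character error rate: (I + D + S) / N, computed over the Chinese
    characters / English letters of the reference string ref."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                               # all deletions
    for j in range(m + 1):
        d[0][j] = j                               # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[n][m] / n

# cer("four five two", "four five too")  ->  1/13 ≈ 0.077
```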
The verification considers only acoustic-model performance, without language-model processing or optimization; the final results are shown in Fig. 5. According to the experimental results, both aims of the invention greatly promote the performance of the ATC speech recognition model and also improve its convergence efficiency. Specifically:
1. Mixed Chinese-English speech recognition: the results show that, both for the proposed scheme and for the baseline model, the mixed Chinese-English recognition system outperforms the separate per-language recognition systems, and the mixed recognition scheme achieves the larger performance improvement over the baseline.
2. Feature learning model: the results show that the proposed feature model clearly outperforms hand-crafted feature engineering, and achieves a higher recognition rate than the baseline model even without mixed Chinese-English recognition.
Example 5
This embodiment is another practical application of embodiment 1. The data used to verify the feasibility and performance of the adopted technical scheme are as follows:
1. Corpora: as before, the corpora include an unlabeled feature Representation Learning Corpus (RLC) without text labels and a labeled Speech Recognition Corpus (SRC) with text labels, here taken from open-source corpora. Specifically:
1) RLC: the English open-source corpora LibriSpeech (train-360 and train-500) and the Chinese open-source corpus AISHELL-2; 1843 hours in total, of which 991 hours are Chinese and 852 hours English;
2) SRC: the English open-source corpus LibriSpeech (train-100) and the Chinese open-source corpus AISHELL-1; 278 hours in total, of which 178 hours are Chinese and 100 hours English.
The test corpus of this embodiment totals 10.5 hours, with 5 hours Chinese and 5.5 hours English.
2. Baseline model: the DeepSpeech2 model again serves as the baseline to verify the effectiveness of the scheme; the model input is MFCC features.
Both the baseline model and the technical scheme of the invention are implemented with the PyTorch framework, with the following hyperparameter configuration for representation-learning training and speech recognition training:
1. Pretraining: initial learning rate 1e-6, decayed on a cosine schedule; during training, speech is spliced into 10-second segments before being fed to the model;
2. Transfer optimization: initial learning rate 0.0001, learning-rate decay rate 0.99, and 160 samples per batch during training.
The hardware environment of the experiments: CPU 2 × Intel Core i7-6800K; GPU 2 × NVIDIA GeForce RTX 2080Ti (2 × 11 GB); 64 GB memory; operating system Ubuntu Linux 16.04.
Under these training data and configuration conditions, six groups of experiments demonstrate the advantages of the mixed Chinese-English recognition and feature learning models of the scheme, specifically:
c1: the baseline model for Chinese-only speech recognition;
c2: the baseline model for English-only speech recognition;
c3: the baseline model for mixed Chinese-English speech recognition;
d1: the proposed model for Chinese-only speech recognition;
d2: the proposed model for English-only speech recognition;
d3: the proposed model for mixed Chinese-English speech recognition.
the experimental results were measured using a Character Error Rate (CER) based on chinese characters and english letters. The technical scheme of the invention verifies that only the performance of the acoustic model is considered, the language model processing and optimization are not involved, and the final result is shown in figure 6. According to experimental results, the two purposes related to the method play a great promoting role in improving the performance of the voice recognition model, and meanwhile, the convergence efficiency of the model can be improved. Specifically, the method comprises the following steps:
1. Chinese-English mixed speech recognition: the experimental results show that the Chinese-English mixed recognition system of this scheme outperforms the independent recognition systems, and that the mixed recognition scheme yields a larger performance gain on the baseline model. Compared with embodiment 1, this embodiment obtains a larger improvement because its corpora are less correlated.
2. Feature learning model: the experimental results show that the feature learning model of this scheme improves performance more than manual feature engineering, and achieves a higher recognition rate than the baseline model even without Chinese-English mixed recognition. The performance gain from feature learning is smaller than in embodiment 1, because the corpora of this embodiment are near-field read speech with stable background noise, for which manual feature engineering can already extract speech features well enough to support speech recognition research.
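For reference, a minimal sketch of the CER computation used in these experiments is given below. The space-stripping normalization is an assumption; scoring Chinese per character and English per letter comes from the description above.

    def edit_distance(ref: str, hyp: str) -> int:
        """Levenshtein distance by dynamic programming over one rolling row."""
        m, n = len(ref), len(hyp)
        dp = list(range(n + 1))
        for i in range(1, m + 1):
            prev, dp[0] = dp[0], i
            for j in range(1, n + 1):
                cur = dp[j]
                dp[j] = min(dp[j] + 1,                          # deletion
                            dp[j - 1] + 1,                      # insertion
                            prev + (ref[i - 1] != hyp[j - 1]))  # substitution
                prev = cur
        return dp[n]

    def cer(reference: str, hypothesis: str) -> float:
        """CER = edit distance / reference length over Chinese chars and English letters."""
        ref = reference.replace(" ", "")
        hyp = hypothesis.replace(" ", "")
        return edit_distance(ref, hyp) / max(len(ref), 1)

    print(cer("川航八八六七 climb to 8400", "川航八六七 climb too 8400"))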
Example 6
As shown in fig. 7, a full end-to-end Chinese-English mixed air traffic control speech recognition apparatus includes at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, enabling the at least one processor to perform the full end-to-end Chinese-English mixed air traffic control speech recognition method of the foregoing embodiments. An input/output interface, which may include a display, keyboard, mouse, and USB interface, is used for data input and output; a power supply provides electric energy to the electronic device.
Those skilled in the art will understand that all or part of the steps of the above method embodiments may be implemented by hardware driven by program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments; the aforementioned storage medium includes various media capable of storing program code, such as a removable memory device, a read-only memory (ROM), a magnetic disk, or an optical disk.
When the integrated units of the present invention are implemented as software functional units and sold or used as independent products, they may also be stored in a computer-readable storage medium. On this understanding, the technical solutions of the embodiments of the present invention, or the part contributing to the prior art, may be embodied as a software product stored in a storage medium and comprising several instructions that cause a computer device (a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes a removable storage device, a ROM, a magnetic disk, an optical disk, or other media capable of storing program code.
The above description covers only preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalents, and improvements made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A full end-to-end Chinese-English mixed air traffic control speech recognition method, characterized by comprising the following steps:
A: collecting air traffic control speech and preprocessing it, the air traffic control speech being Chinese-English mixed audio data;
B: inputting the air traffic control speech into a pre-established Chinese-English mixed air traffic control speech recognition model;
C: outputting the instruction information corresponding to the air traffic control speech;
wherein the Chinese-English mixed air traffic control speech recognition model comprises a feature learning module and a speech recognition module; the feature learning module is used for extracting speech features from the air traffic control speech, and the speech recognition module is used for converting the extracted speech features into computer-readable instruction text information.
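By way of illustration only (not part of the claims), a minimal PyTorch sketch of this two-module cascade is given below. The layer sizes and kernel settings are assumptions; only the module roles and the cascading of feature learning into recognition come from the claim.

    import torch
    import torch.nn as nn

    class FeatureLearningModule(nn.Module):
        """Extracts speech features from raw Chinese-English mixed ATC audio."""
        def __init__(self, feat_dim: int = 256):
            super().__init__()
            self.encoder = nn.Sequential(   # frame-level hidden features
                nn.Conv1d(1, feat_dim, kernel_size=10, stride=5), nn.ReLU(),
                nn.Conv1d(feat_dim, feat_dim, kernel_size=8, stride=4), nn.ReLU(),
            )
            self.context = nn.LSTM(feat_dim, feat_dim, batch_first=True)

        def forward(self, wave: torch.Tensor) -> torch.Tensor:
            z = self.encoder(wave.unsqueeze(1)).transpose(1, 2)  # (B, T', C)
            c, _ = self.context(z)                               # context features
            return c

    class SpeechRecognitionModule(nn.Module):
        """Maps learned features to a Chinese-char + English-subword vocabulary."""
        def __init__(self, feat_dim: int = 256, vocab_size: int = 5000):
            super().__init__()
            self.rnn = nn.LSTM(feat_dim, feat_dim, batch_first=True)
            self.proj = nn.Linear(feat_dim, vocab_size)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            h, _ = self.rnn(feats)
            return self.proj(h)      # per-frame vocabulary logits

    # Cascade of the two modules, as in steps B and C of the claim:
    model = nn.Sequential(FeatureLearningModule(), SpeechRecognitionModule())
    logits = model(torch.randn(2, 16000 * 3))  # 2 dummy utterances of 3 s
    print(logits.shape)                        # (batch, frames, vocab)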
2. The full end-to-end Chinese-English mixed air traffic control speech recognition method according to claim 1, characterized in that building the Chinese-English mixed air traffic control speech recognition model comprises the following steps:
S1: inputting speech training samples and preprocessing them to obtain unlabeled raw speech signals and segmented, labeled individual speech signals;
S2: constructing a feature learning module based on a convolutional neural network, a recurrent neural network, and a fully connected layer; training the feature learning module on the unlabeled raw speech signals by self-supervised learning until the model error is stable, and extracting speech features from the unlabeled raw speech signals;
S3: constructing a speech recognition module based on a recurrent neural network and a fully connected layer; training the speech recognition module on the extracted speech features by supervised learning until the model error is stable, and cascading it with the feature learning module to obtain the Chinese-English mixed air traffic control speech recognition model;
S4: training the Chinese-English mixed air traffic control speech recognition model on the segmented, labeled individual speech signals and the corresponding instruction text data to reduce the model error, and outputting the final Chinese-English mixed air traffic control speech recognition model.
3. The method according to claim 2, characterized in that the feature learning module comprises a hidden-space feature encoder and a context feature decoder, and is configured to learn robust speech features from unlabeled raw speech in a self-supervised manner;
the encoder obtains frame-level speech features, and the decoder obtains context-sequence speech features according to the contextual correlation of the speech signals.
4. The full end-to-end Chinese-English mixed air traffic control speech recognition method according to claim 3, characterized in that the backbone networks of the hidden-space feature encoder and the context feature decoder comprise a convolutional neural network unit, a long short-term memory (LSTM) unit, and a fully connected prediction unit;
the convolutional neural network unit acquires speech features from the raw speech signal, learns discriminative audio features, discards interfering speech features, and performs data compression;
the LSTM unit acquires timing features from the raw speech signal and establishes the mapping among the speech signal, the speech features, and the instruction text;
the fully connected prediction unit predicts the speech features of subsequent speech signals from the timing features, completing the self-supervised training of the feature learning module.
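By way of illustration only (not part of the claims), a sketch of such a fully connected prediction unit is given below. The prediction horizon and dimensions are assumptions; only the idea of predicting subsequent speech features from the timing features comes from the claim.

    import torch
    import torch.nn as nn

    class FuturePredictionUnit(nn.Module):
        """Predicts encoder features for frames t+1 .. t+horizon from context."""
        def __init__(self, context_dim: int = 256, feat_dim: int = 256,
                     horizon: int = 12):
            super().__init__()
            # one linear head per future step
            self.heads = nn.ModuleList(
                [nn.Linear(context_dim, feat_dim) for _ in range(horizon)]
            )

        def forward(self, context: torch.Tensor) -> list:
            """context: (B, T, context_dim) from the LSTM unit."""
            return [head(context) for head in self.heads]  # each (B, T, feat_dim)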
5. The full end-to-end Chinese-English mixed air traffic control speech recognition method according to claim 2, characterized in that each model is trained as follows:
a: selecting the model to train, and selecting the corresponding input data, output data, and training loss function for that model;
b: setting the training hyper-parameters and selecting the training strategies, the hyper-parameters comprising the learning rate, batch size, and maximum number of iterations, and the strategies comprising a learning-rate decay strategy, a validation strategy, and a training-termination strategy;
c: updating the model parameters by gradient descent and back-propagation until the error is stable.
6. The full end-to-end Chinese-English mixed air traffic control speech recognition method according to claim 5, characterized in that in step S2 the feature learning module takes the unlabeled raw speech signals as both input and output during training, and uses contrastive predictive coding (CPC) as the loss function;
in step S3, the speech recognition module takes the speech features extracted in step S2 as input and the corresponding instruction text data as output during training, and uses connectionist temporal classification (CTC) as the loss function;
in step S4, the Chinese-English mixed air traffic control speech recognition model takes the segmented, labeled individual speech signals as input and the corresponding instruction text data as output during training, and uses connectionist temporal classification as the loss function.
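By way of illustration only (not part of the claims), minimal sketches of the two losses are given below. The negative-sampling scheme in the CPC term is an assumption; the use of contrastive predictive coding and connectionist temporal classification comes from the claim, and PyTorch ships CTC as torch.nn.CTCLoss.

    import torch
    import torch.nn.functional as F

    def cpc_infonce(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        """InfoNCE for one prediction step: pred/target are (B, T, C) predicted
        and true encoder features; other frames in the batch act as negatives."""
        B, T, C = pred.shape
        p = F.normalize(pred.reshape(B * T, C), dim=-1)
        t = F.normalize(target.reshape(B * T, C), dim=-1)
        logits = p @ t.t()                                # similarity matrix
        labels = torch.arange(B * T, device=pred.device)  # positives: diagonal
        return F.cross_entropy(logits, labels)

    # Supervised stages (S3, S4) use CTC over per-frame vocabulary log-probs:
    ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)
    # loss = ctc(log_probs, targets, input_lengths, target_lengths)
    # with log_probs of shape (T, B, V) and targets the flattened label ids.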
7. The full end-to-end Chinese-English mixed air traffic control speech recognition method according to claim 1, characterized in that the Chinese-English mixed air traffic control speech recognition model further comprises a Chinese-English instruction vocabulary, whose training process comprises:
for Chinese speech, using Chinese characters as the labeling vocabulary units;
for English speech, learning and generating labeling vocabulary units with the BPE algorithm;
combining the resulting English vocabulary and Chinese character inventory into the final Chinese-English instruction vocabulary.
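By way of illustration only (not part of the claims), a bare-bones BPE sketch is given below. The example words and characters are placeholders, and this omits the optimized objective of claims 8 and 9; only character-level Chinese units, BPE-learned English units, and the merged vocabulary come from the claim.

    from collections import Counter

    def _merge(word, a, b, merged):
        """Replace each adjacent (a, b) pair in a token tuple with the merge."""
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                out.append(merged); i += 2
            else:
                out.append(word[i]); i += 1
        return tuple(out)

    def learn_bpe(words, num_merges):
        """Learn BPE sub-word units from English words (frequency-weighted)."""
        corpus = Counter(tuple(w) + ("</w>",) for w in words)
        vocab = {tok for word in corpus for tok in word}
        for _ in range(num_merges):
            pairs = Counter()
            for word, freq in corpus.items():
                for pair in zip(word, word[1:]):
                    pairs[pair] += freq
            if not pairs:
                break
            (a, b), _ = pairs.most_common(1)[0]
            vocab.add(a + b)
            new_corpus = Counter()
            for word, freq in corpus.items():
                new_corpus[_merge(word, a, b, a + b)] += freq
            corpus = new_corpus
        return vocab

    english_units = learn_bpe(["climb", "maintain", "contact", "runway"], 40)
    chinese_units = set("上升下降保持联系跑道")          # per-character units
    vocabulary = sorted(english_units | chinese_units)  # merged inventory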
8. The Chinese-English vocabulary training process of claim 7, characterized in that the learning optimization of the English vocabulary comprises the following steps:
1) inputting labeled samples and obtaining English sub-word units;
2) obtaining the number of English sub-word units, their degree of pronunciation matching, and the vocabulary balance of the labeled samples, the vocabulary balance being the occurrence frequency of each English sub-word unit after the English word labels in the samples are converted into sub-word unit sequences;
3) optimizing the objective function with the BPE algorithm to improve its value.
9. The Chinese-English vocabulary training process of claim 8, characterized in that the optimization objective function is a weighted combination of the form
F = Σ_i a_i · f_i
wherein a_i is the weight parameter of the i-th optimization objective and each f_i is a distinct optimization objective function:
f_1 reflects the number of output sub-word units, with V_bpe and V denoting the number of output sub-word units and the number of input English words respectively;
f_2 represents the balance of occurrence frequencies of different words, with η_i and η_max denoting the occurrence frequency of each vocabulary item and the maximum frequency respectively;
f_3 measures the number V_s of monosyllabic units in the output vocabulary and expresses the degree of pronunciation matching with Chinese.
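By way of illustration only (not part of the claims), a sketch of evaluating such a weighted objective is given below. The exact per-term formulas appear only as images in the source, so the term shapes here (a count ratio, a normalized frequency-balance average, and a monosyllabic share) are illustrative assumptions.

    def objective(weights, v_bpe, v_words, freqs, v_mono):
        """F = a1*f1 + a2*f2 + a3*f3 with illustrative term shapes.

        weights: (a1, a2, a3); v_bpe: output sub-word units; v_words: input
        English words; freqs: occurrence frequency eta_i of each sub-word
        unit; v_mono: monosyllabic units V_s in the output vocabulary.
        """
        a1, a2, a3 = weights
        f1 = v_bpe / v_words                                   # inventory size
        eta_max = max(freqs)
        f2 = sum(eta / eta_max for eta in freqs) / len(freqs)  # frequency balance
        f3 = v_mono / v_bpe                                    # pronunciation match
        return a1 * f1 + a2 * f2 + a3 * f3

    score = objective((0.4, 0.4, 0.2), v_bpe=800, v_words=12000,
                      freqs=[120, 90, 60], v_mono=260)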
10. A full end-to-end Chinese-English mixed air traffic control speech recognition apparatus, characterized by comprising at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 9.
CN202011147669.5A 2020-10-23 2020-10-23 Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device Active CN112420024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011147669.5A CN112420024B (en) Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device

Publications (2)

Publication Number Publication Date
CN112420024A true CN112420024A (en) 2021-02-26
CN112420024B CN112420024B (en) 2022-09-09

Family

ID=74840726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011147669.5A Active CN112420024B (en) Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device

Country Status (1)

Country Link
CN (1) CN112420024B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102084417A (en) * 2008-04-15 2011-06-01 移动技术有限责任公司 System and methods for maintaining speech-to-speech translation in the field
CN107408111A (en) * 2015-11-25 2017-11-28 百度(美国)有限责任公司 End-to-end speech recognition
US20170256254A1 (en) * 2016-03-04 2017-09-07 Microsoft Technology Licensing, Llc Modular deep learning model
WO2019198265A1 (en) * 2018-04-13 2019-10-17 Mitsubishi Electric Corporation Speech recognition system and method using speech recognition system
CN108986791A (en) * 2018-08-10 2018-12-11 南京航空航天大学 For the Chinese and English languages audio recognition method and system in civil aviaton's land sky call field
CN110415683A (en) * 2019-07-10 2019-11-05 上海麦图信息科技有限公司 A kind of air control voice instruction recognition method based on deep learning
CN110390929A (en) * 2019-08-05 2019-10-29 中国民航大学 Chinese and English civil aviaton land sky call acoustic model construction method based on CDNN-HMM
CN110491371A (en) * 2019-08-07 2019-11-22 北京悠数智能科技有限公司 A kind of blank pipe instruction translation method for improving semantic information
CN110428820A (en) * 2019-08-27 2019-11-08 深圳大学 A kind of Chinese and English mixing voice recognition methods and device
CN110782872A (en) * 2019-11-11 2020-02-11 复旦大学 Language identification method and device based on deep convolutional recurrent neural network
CN111710326A (en) * 2020-06-12 2020-09-25 携程计算机技术(上海)有限公司 English voice synthesis method and system, electronic equipment and storage medium
CN111785257A (en) * 2020-07-10 2020-10-16 四川大学 Empty pipe voice recognition method and device for small amount of labeled samples

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘娟宏 et al.: "End-to-End Speech Recognition with Deep Convolutional Neural Networks", Computer Applications and Software *
陈亚青 et al.: "Implementation of Control-Instruction Speech Recognition in a Simulated Flight Interface", Computer Systems & Applications *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160798B (en) * 2021-04-28 2024-04-16 厦门大学 Chinese civil aviation air traffic control voice recognition method and system
CN113160798A (en) * 2021-04-28 2021-07-23 厦门大学 Chinese civil aviation air traffic control voice recognition method and system
CN113077781B (en) * 2021-06-04 2021-09-07 北京世纪好未来教育科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113077781A (en) * 2021-06-04 2021-07-06 北京世纪好未来教育科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN114944148A (en) * 2022-07-09 2022-08-26 昆明理工大学 Streaming Vietnamese speech recognition method fusing external language knowledge
CN114944148B (en) * 2022-07-09 2023-08-22 昆明理工大学 Streaming Vietnam voice recognition method integrating external language knowledge
CN115206293A (en) * 2022-09-15 2022-10-18 四川大学 Multi-task air traffic control voice recognition method and device based on pre-training
CN115206293B (en) * 2022-09-15 2022-11-29 四川大学 Multi-task air traffic control voice recognition method and device based on pre-training
CN115527526A (en) * 2022-11-28 2022-12-27 南方电网数字电网研究院有限公司 End-to-end far-field speech recognition system training method and device and computer equipment
CN116894427B (en) * 2023-09-08 2024-02-27 联通在线信息科技有限公司 Data classification method, server and storage medium for Chinese and English information fusion
CN116894427A (en) * 2023-09-08 2023-10-17 联通在线信息科技有限公司 Data classification method, server and storage medium for Chinese and English information fusion
CN117524193A (en) * 2024-01-08 2024-02-06 浙江同花顺智能科技有限公司 Training method, device, equipment and medium for Chinese-English mixed speech recognition system
CN117524193B (en) * 2024-01-08 2024-03-29 浙江同花顺智能科技有限公司 Training method, device, equipment and medium for Chinese-English mixed speech recognition system

Also Published As

Publication number Publication date
CN112420024B (en) 2022-09-09

Similar Documents

Publication Publication Date Title
CN112420024B (en) Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device
Zia et al. Long short-term memory recurrent neural network architectures for Urdu acoustic modeling
Yao et al. An improved LSTM structure for natural language processing
CN108170749B (en) Dialog method, device and computer readable medium based on artificial intelligence
Tang et al. Question detection from acoustic features using recurrent neural network with gated recurrent unit
CN106484682B (en) Machine translation method, device and electronic equipment based on statistics
Bellegarda et al. State of the art in statistical methods for language and speech processing
CN111785257B (en) Empty pipe voice recognition method and device for small amount of labeled samples
CN111414481A (en) Chinese semantic matching method based on pinyin and BERT embedding
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
Lin et al. ATCSpeechNet: A multilingual end-to-end speech recognition framework for air traffic control systems
CN112541356A (en) Method and system for recognizing biomedical named entities
CN115357719A (en) Power audit text classification method and device based on improved BERT model
CN114722835A (en) Text emotion recognition method based on LDA and BERT fusion improved model
Lin et al. Towards multilingual end‐to‐end speech recognition for air traffic control
Zhang et al. Automatic repetition instruction generation for air traffic control training using multi-task learning with an improved copy network
CN114742069A (en) Code similarity detection method and device
Yao Attention-based BiLSTM neural networks for sentiment classification of short texts
CN117034961B (en) BERT-based medium-method inter-translation quality assessment method
CN114238605B (en) Automatic conversation method and device for intelligent voice customer service robot
Lu et al. Implementation of embedded unspecific continuous English speech recognition based on HMM
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
Kipyatkova et al. Experimenting with attention mechanisms in joint CTC-attention models for Russian speech recognition
Aliyu et al. Stacked language models for an optimized next word generation
Miao et al. Multi-turn dialogue model based on the improved hierarchical recurrent attention network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant