CN113160798B - Chinese civil aviation air traffic control voice recognition method and system - Google Patents

Chinese civil aviation air traffic control voice recognition method and system

Info

Publication number
CN113160798B
CN113160798B (application CN202110467893.0A)
Authority
CN
China
Prior art keywords: voice, layer, characteristic data, air traffic, traffic control
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110467893.0A
Other languages
Chinese (zh)
Other versions
CN113160798A (en)
Inventor
罗林开
俞涵
张松飞
彭洪
黄俊祥
江居旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University
Priority to CN202110467893.0A
Publication of CN113160798A
Application granted
Publication of CN113160798B
Legal status: Active


Classifications

    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G08G 5/0095: Aspects of air-traffic control not provided for in the other subgroups of this main group
    • G10L 15/063: Training of speech recognition systems (creation of reference templates; adaptation to the characteristics of the speaker's voice)
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 25/24: Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L 2015/0631: Creating reference templates; clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a voice recognition method and system for Chinese civil aviation air traffic control. The method comprises the following steps: acquiring voice feature data, where the voice feature data is time-series feature information extracted from a voice signal; and inputting the voice feature data into a trained acoustic model to obtain a recognition result, where the recognition result represents the air traffic control Chinese phraseology text corresponding to the voice signal. The acoustic model comprises a TRM module, a BiGRU module, a fully connected layer FC and a CTC module connected in sequence; the TRM module comprises a multi-head self-attention layer, a first residual connection and layer normalization layer, a feed-forward layer and a second residual connection and layer normalization layer connected in sequence; the BiGRU module comprises a bidirectional gated recurrent unit network; the CTC module comprises a connectionist temporal classification layer; and the acoustic model is trained on air traffic control (ATC) instruction phraseology voice samples labeled with Chinese characters. The invention has the advantage of high recognition accuracy.

Description

Chinese civil aviation air traffic control voice recognition method and system
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice recognition method and a voice recognition system for Chinese civil aviation air traffic control.
Background
Air traffic control mainly commands and schedules aircraft taxiing on the ground and flying along air routes. It is an important guarantee of air traffic safety and efficiency and relies heavily on air traffic controllers. The ground-to-air calls between air traffic controllers and flight crews are closely tied to flight safety, and it is necessary to convert these calls into text records for archiving.
The existing voice recognition technology applied to Chinese civil aviation air traffic control is mainly the deep learning-based CLDNN neural network, which consists of several CNN layers, several LSTM layers and several fully connected layers; however, the recognition accuracy of this prior art scheme still needs improvement.
Disclosure of Invention
The invention aims to provide a Chinese civil aviation air traffic control voice recognition method and system with high recognition accuracy.
In order to achieve the above object, the present invention provides the following solutions:
a Chinese civil aviation air traffic control voice recognition method comprises the following steps:
acquiring voice characteristic data, wherein the voice characteristic data is time sequence characteristic information extracted based on voice signals;
inputting the voice characteristic data into a trained acoustic model to obtain a recognition result, wherein the recognition result represents air traffic control Chinese terminology characters corresponding to the voice signals; the acoustic model includes: the TRM module comprises a multi-head self-attention layer, a first residual error connection and layer standardization layer, a feedforward layer and a second residual error connection and layer standardization layer which are sequentially connected, the BiGRU module comprises a bidirectional gating circulation unit network, the CTC module comprises a connection time sequence classification layer, and the acoustic model is obtained by training blank pipe instruction term voice samples with Chinese character labels.
Optionally, before acquiring the voice feature data, the method further includes:
framing the voice signal to obtain a plurality of voice frames;
determining the voice feature data from the voice frames, where each item of voice feature data corresponds to a plurality of consecutive voice frames.
Optionally, each item of voice feature data corresponds to a reference voice frame, a set number of voice frames before the reference voice frame, and a set number of voice frames after the reference voice frame.
Optionally, when the reference voice frame lies within the first m frames or the last n frames of the voice signal, zero padding is performed before or after the voice feature data to which the reference voice frame belongs, so that every item of voice feature data has the same data length, where m and n are positive integers.
Optionally, determining the voice feature data from the voice frames specifically includes:
sampling the voice frames to obtain a plurality of sampling points;
and determining the voice feature data based on the sampling points, where each item of voice feature data corresponds to the sampling points in a plurality of consecutive voice frames.
Optionally, the voice feature data is the Mel-frequency cepstral coefficients (MFCC) of the voice.
Optionally, before framing the voice signal, the method further includes:
performing silence removal on the voice signal.
Optionally, adjacent voice frames in the voice signal overlap by a set proportion.
The invention also provides a Chinese civil aviation air traffic control voice recognition system, which comprises:
a voice feature data acquisition module, configured to acquire voice feature data, the voice feature data being time-series feature information extracted from a voice signal;
a voice recognition module, configured to input the voice feature data into a trained acoustic model to obtain a recognition result, the recognition result representing the air traffic control Chinese phraseology text corresponding to the voice signal; the acoustic model comprises a TRM module, a BiGRU module, a fully connected layer FC and a CTC module connected in sequence, wherein the TRM module comprises a multi-head self-attention layer, a first residual connection and layer normalization layer, a feed-forward layer and a second residual connection and layer normalization layer connected in sequence, the BiGRU module comprises a bidirectional gated recurrent unit network, the CTC module comprises a connectionist temporal classification layer, and the acoustic model is trained on air traffic control instruction phraseology voice samples labeled with Chinese characters.
Optionally, the Chinese civil aviation air traffic control voice recognition system further comprises:
a silence-removal module, configured to perform silence removal on the voice signal;
a framing module, configured to frame the voice signal into a plurality of voice frames, adjacent voice frames overlapping by a set proportion;
a voice feature data determining module, configured to determine the voice feature data from the voice frames, where each item of voice feature data corresponds to a plurality of consecutive voice frames and the voice feature data is the Mel-frequency cepstral coefficients of the voice.
According to the specific embodiments provided by the invention, the following technical effects are disclosed: in the acoustic model structure provided by the embodiments of the invention, the TRM module encodes the input voice features and, through a self-attention mechanism, relates the input voice frames to one another, yielding feature representations that incorporate contextual voice information. The BiGRU combines a bidirectional recurrent neural network with a gated recurrent unit network and has the advantages of both: like a gated recurrent unit network it can model temporal dependencies, and like a bidirectional recurrent neural network it has access to both past and future context. CTC solves the problem of misalignment between the voice input sequence and the label sequence, thereby enabling end-to-end voice recognition. For these reasons, the Chinese civil aviation air traffic control voice recognition method provided by the embodiments of the invention has the advantage of high recognition accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a voice recognition method for Chinese civil aviation air traffic control provided by an embodiment of the invention;
FIG. 2 is a flowchart of the Chinese civil aviation air traffic control voice recognition method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an acoustic model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a TRM module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the structure of the multi-head self-attention layer in the TRM module according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a BiGRU module according to an embodiment of the invention;
FIG. 7 is a schematic diagram of a GRU in a BiGRU module according to an embodiment of the invention;
FIG. 8 is a schematic diagram of the recognition process according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a Chinese civil aviation air traffic control voice recognition system according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a Chinese civil aviation air traffic control voice recognition method and system with high recognition accuracy.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
Referring to FIG. 1, this embodiment provides a Chinese civil aviation air traffic control voice recognition method, which includes the following steps:
step 101: acquiring voice characteristic data, wherein the voice characteristic data is time sequence characteristic information extracted based on voice signals;
step 102: inputting the voice characteristic data into a trained acoustic model to obtain a recognition result, wherein the recognition result represents air traffic control Chinese terminology characters corresponding to the voice signals; the acoustic model includes: the system comprises a TRM module, a BiGRU module, a full-connection layer FC and a CTC module which are sequentially connected, wherein the TRM module comprises a multi-head self-attention layer, a first residual error connection and layer standardization layer, a feedforward layer and a second residual error connection and layer standardization layer which are sequentially connected, the BiGRU module comprises a bidirectional gating circulation unit network, the CTC module comprises a connection time sequence classification layer, and the acoustic model is obtained by training blank pipe instruction term voice samples with Chinese character labels. Training of the acoustic model utilizes an Adam optimizer to fit training data through a back propagation algorithm, adjusts parameters on a validation set, and evaluates model quality on test data.
In the acoustic model structure provided by this embodiment, the TRM module encodes the input voice features: through the self-attention mechanism it computes the similarity of each frame's features against all frames of the input voice, fully accounts for the pronunciation and semantic connections between the input voice frames, and recomputes a feature representation that incorporates the contextual voice information. The BiGRU combines a bidirectional recurrent neural network with a gated recurrent unit network and has the advantages of both: it can model temporal dependencies like a gated recurrent unit network while having access to past and future context like a bidirectional recurrent neural network, which makes it well suited as a core module of a voice recognition acoustic model. CTC addresses the difficulty of putting the input sequence and the output sequence into one-to-one correspondence; speech is a typical case where the input sequence is not aligned with the label sequence, and CTC lets the deep learning model learn the alignment automatically, thereby realizing end-to-end voice recognition. For these reasons, the Chinese civil aviation air traffic control voice recognition method provided by this embodiment has the advantage of high recognition accuracy. In addition, since the acoustic model consists only of TRM and BiGRU layers, problems such as vanishing or exploding gradients are unlikely to occur, the training process converges easily, the required amount of training data is relatively small, and the dataset labeling cost is low.
To address the fact that parts of Chinese ATC instructions differ from standard Mandarin pronunciation, this embodiment builds its own ATC voice dataset, designs a deep learning architecture containing a self-attention mechanism, TRM-BiGRU-CTC, and trains and validates it on the ATC dataset to obtain an acoustic model for Chinese civil aviation air traffic control voice recognition. The acoustic model provided by the invention recognizes test voice with high accuracy and can recognize fast ATC speech recorded in noisy environments. The many professional expressions in ATC voice whose pronunciation differs from Mandarin, such as digits, altitudes and letters, are automatically converted into the corresponding text sequences.
As one implementation, the Chinese civil aviation air traffic control voice recognition method provided by this embodiment first collects Chinese civil aviation ATC voice; then constructs an ATC voice dataset and performs data preprocessing, which includes removing silent segments, extracting features from the ATC voice and processing those features; designs an acoustic model containing a self-attention mechanism, TRM-BiGRU-CTC, and trains it on the preprocessed ATC voice dataset; feeds the ATC voice to be recognized, after feature extraction and feature processing, into the trained acoustic model; and decodes the output of the acoustic model through connectionist temporal classification (CTC) to obtain the Chinese character sequence corresponding to the ATC voice content.
Furthermore, the format of the civil aviation air traffic control voice is fixed as WAV; if other formats such as MP3 or OGG are used, format conversion is needed to ensure that the voice data has a uniform WAV format.
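The patent does not say which tool performs this conversion; the following pydub sketch (file names illustrative, ffmpeg assumed to be installed) shows one way to normalize collected audio to the mono 8 kHz WAV format used later in this embodiment:

```python
from pydub import AudioSegment  # pydub delegates decoding to ffmpeg

def to_wav(src_path: str, dst_path: str) -> None:
    """Convert MP3/OGG/etc. to mono, 8000 Hz WAV (format auto-detected)."""
    audio = AudioSegment.from_file(src_path)
    audio = audio.set_channels(1).set_frame_rate(8000)
    audio.export(dst_path, format="wav")

to_wav("atc_sample.mp3", "atc_sample.wav")  # illustrative file names
```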
The voice data in the self-built ATC voice dataset all come from the actual operating environment of a particular air traffic control area. The collected ATC voice is manually labeled and checked against the air traffic radiotelephony phraseology guide, and the dataset's scale sufficiently covers the call phraseology of most situations in that control area, ensuring that a voice recognition model trained on the dataset fits the real environment.
Prior to step 101 in this embodiment, the method may further include:
performing a framing operation on the voice signal to obtain a plurality of voice frames, where, preferably, adjacent voice frames in the voice signal overlap by a set proportion;
determining the voice feature data from the voice frames, where each item of voice feature data corresponds to a plurality of consecutive voice frames; preferably, the voice feature data in this embodiment is the Mel-frequency cepstral coefficients of the voice. Further, each item of voice feature data may correspond to a reference voice frame together with a set number of voice frames before it and a set number after it. When the reference voice frame lies within the first m frames or the last n frames of the voice signal, zero padding is performed before or after the voice feature data to which the reference voice frame belongs, so that every item of voice feature data has the same length. In addition, determining the voice feature data from the voice frames specifically includes: sampling the voice frames to obtain a plurality of sampling points, and determining the voice feature data from the sampling points, where each item of voice feature data corresponds to the sampling points in a plurality of consecutive voice frames.
The voice feature data refers to the Mel-frequency cepstral coefficient (MFCC) features of the voice, which match the hearing characteristics of the human ear. Since each frame's features contain only a short stretch of voice information, most voice frames are insufficient to express one Chinese character, so further feature processing is required. Specifically, a frame-splicing operation is performed on every frame of the extracted MFCC features: for the current frame, the MFCC features of the m frames to its left and the n frames to its right are concatenated with the current frame and used as the features of the current voice frame, so that every frame fed to the acoustic model carries more context-related information, as in the sketch below.
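A minimal NumPy sketch of this splicing step, assuming 26-dimensional MFCC rows and the zero padding at the sequence edges described above (function and parameter names are illustrative):

```python
import numpy as np

def splice_frames(mfcc: np.ndarray, m: int = 7, n: int = 7) -> np.ndarray:
    """Concatenate each frame with its m left and n right neighbours.

    mfcc: (T, 26) per-frame features -> returns (T, (m + 1 + n) * 26).
    Frames near the edges are zero-padded, as described in the text.
    """
    T, d = mfcc.shape
    padded = np.vstack([np.zeros((m, d)), mfcc, np.zeros((n, d))])
    # Window t in the padded array covers original frames t-m .. t+n.
    return np.stack([padded[t:t + m + 1 + n].reshape(-1) for t in range(T)])
```

With m = n = 7 and 26-dimensional MFCCs this yields the 390-dimensional vectors used by the acoustic model.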
In this embodiment, before performing the framing operation on the voice signal, the method further includes performing silence removal on the voice signal.
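The text does not disclose how the silent segments are detected; the following energy-threshold gate is only an illustrative stand-in for that step (threshold and window length are assumptions):

```python
import numpy as np

def remove_silence(signal: np.ndarray, sr: int = 8000,
                   win_ms: float = 25.0, thresh: float = 0.01) -> np.ndarray:
    """Drop low-energy windows from a [-1, 1] float signal."""
    win = int(sr * win_ms / 1000)
    keep = []
    for start in range(0, len(signal) - win + 1, win):
        chunk = signal[start:start + win]
        if np.sqrt(np.mean(chunk ** 2)) > thresh:  # RMS energy gate
            keep.append(chunk)
    return np.concatenate(keep) if keep else signal
```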
Referring to FIG. 2, this embodiment may include two stages: a training phase and a recognition phase.
First, owing to the specific nature of air traffic control speech, an ATC voice dataset had to be self-built. A large amount of ATC voice was collected in the actual operating environment of a particular air traffic control area; the voice files were normalized to WAV format with a bit rate of 128 kbps and a sampling rate of 8000 Hz. The ATC audio was manually labeled and checked against the air traffic radiotelephony phraseology guide. The labels for some special pronunciations are handled as follows: (1) letter pronunciations are labeled as capital letters; for example, the letter A is pronounced "Alpha" in ATC and its label is A; (2) numbers are labeled as Arabic numerals wherever possible; in addition, for altitude numbers, different controllers or pilots may read the same altitude differently; for 2100 meters, for example, a digit-by-digit reading is labeled "21", while a "two thousand one hundred" style reading is labeled with the corresponding characters; (3) certain special waypoints, for example NASPA, are labeled directly as NASPA and treated as a single modeling unit during recognition, rather than as the five separate modeling units N, A, S, P and A. The self-built ATC dataset in this embodiment totals about 47 hours and 9,300 voice samples, of which 7,700 are used as the training set, 540 as the validation set and 1,060 as the test set. Compared with prior ATC voice datasets, the ATC dataset of this embodiment is smaller, which reduces the data labeling cost; nevertheless, it still covers most of the ground-to-air dialogue vocabulary likely to occur in the air traffic control domain. Because the number of Chinese character classes in the dataset is not much larger than the number of pinyin classes, Chinese characters can be chosen directly as the modeling unit; the most direct advantage is that no additional language model is needed for conversion, and only the acoustic model has to be trained.
Next, features are extracted from the training voice data after the silent segments have been removed. Feature extraction first requires framing, i.e. grouping N sampling points into one observation unit. Because a voice signal is short-time stationary (it can be regarded as approximately unchanged within 10-30 ms), this embodiment sets the time covered by one frame to 25 ms; since the sampling rate of the ATC voice data is 8000 Hz, each frame contains 200 sampling points. To avoid abrupt changes in the feature parameters of adjacent frames, an overlapping region is usually placed between adjacent frames; here the overlap is 12.5 ms, i.e. a frame is taken every 12.5 ms. However, some ATC voice samples in this embodiment then have too many frames, which raises the hardware requirements, so downsampling by keeping every other frame is adopted to relieve the pressure on the equipment. This embodiment extracts 26-dimensional MFCC features, i.e. 26 features per frame vector. Since one frame of voice contains only 25 ms of content, generally not enough to express a syllable, the extracted MFCC features are frame-spliced: for the current frame, the MFCC features of the m frames on its left and the n frames on its right are concatenated with the current frame and used as its features. This embodiment sets m = 7 and n = 7; if the current frame lies within the first 7 or last 7 frames of the utterance, zero padding is applied, and otherwise the MFCC features of the 7 frames on each side are spliced onto the current frame. After splicing, each frame has 390 dimensions ((7+7+1) × 26) and carries 375 ms of voice content, which solves the problem of the small information content of single-frame data and gives every frame contextual information.
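Put together, the preprocessing described above might look like the following sketch. librosa is an assumed stand-in for the unspecified MFCC implementation, and splice_frames is the helper sketched earlier:

```python
import librosa
import numpy as np

def extract_features(wav_path: str) -> np.ndarray:
    """25 ms windows, 12.5 ms hop, 26 MFCCs; every other frame is kept
    (the frame-skipping downsampling described above) and frames are
    spliced with 7 left / 7 right neighbours into 390-dim vectors."""
    y, sr = librosa.load(wav_path, sr=8000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=26,
                                n_fft=200, hop_length=100)  # (26, T)
    mfcc = mfcc.T[::2]                    # (T', 26): keep every other frame
    return splice_frames(mfcc, m=7, n=7)  # (T', 390), see earlier sketch
```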
Then the acoustic model of the Chinese civil aviation air traffic control voice recognition method containing a self-attention mechanism is constructed; the model structure is shown in FIG. 3. This application calls the network the TRM-BiGRU-CTC network. A TRM module refers to an encoder block of the Transformer model; each TRM module consists of a multi-head self-attention layer and a feed-forward layer, with residual connection and layer normalization added to each. The BiGRU module refers to a bidirectional gated recurrent unit network; the CTC module refers to connectionist temporal classification.
One important component of the acoustic model is the TRM, whose structure is shown in FIG. 4. The TRM module consists mainly of two parts, a multi-head self-attention layer and a feed-forward layer; "Add & Norm" in the figure denotes residual connection and layer normalization. Residual connections effectively mitigate gradient vanishing and network degradation, while layer normalization accelerates network convergence.
The feed-forward layer of the TRM module consists of two linear layers. In this embodiment the numbers of hidden units in the two linear layers are 1024 and 390, so the output has the same dimension as the input, which facilitates the residual connection. The first linear layer has a ReLU activation function and no Dropout; the second linear layer has no activation function and a Dropout rate of 0.3.
The structure of the multi-head self-attention layer of the TRM module is shown in FIG. 5. For the input X, X is mapped through Linear layers to three different representations Q, K and V, where Q is the query, K the key and V the value; in this embodiment the output dimension of each Linear layer is 390. The number of attention heads is set to 5: the input X undergoes 5 different linear transformations to obtain 5 different self-attention mapping representations, which are spliced together as the new feature encoding. Concretely, Q, K and V are each split along the feature dimension into 5 parts, shown as a1, a2, a3, a4 and a5 in FIG. 5, each with feature dimension 78 (390/5) and containing its slice of Q, K and V, and each part is fed into a self-attention layer. The self-attention representation in FIG. 5 is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the feature dimension of Q, K and V, equal to 78 in this embodiment. This computation takes each frame of the input multi-frame data, computes its similarity with all frames, uses the similarities as weights, and forms the weighted sum over all frames, producing a representation of each frame relative to the entire input, i.e. b1-b5 in FIG. 5: the higher the correlation between frames, the larger the similarity weight and the greater their influence on each other's representation. Concatenating b1-b5 (Concat) gives the output Y, whose feature dimension is still 390, consistent with the input X for the residual connection.
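A small PyTorch rendering of this computation, assuming a single utterance of shape (frames, 390) and externally supplied projection matrices (all names illustrative):

```python
import torch
import torch.nn.functional as F

def multihead_self_attention(X, Wq, Wk, Wv, n_heads=5):
    """Illustrates FIG. 5: project X to Q, K, V (390-dim each), split into
    5 heads of 78 dims, apply softmax(QK^T / sqrt(d_k)) V per head, concat."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # each (frames, 390)
    outputs = []
    for q, k, v in zip(Q.chunk(n_heads, dim=-1),
                       K.chunk(n_heads, dim=-1),
                       V.chunk(n_heads, dim=-1)):
        d_k = q.shape[-1]                         # 78 = 390 / 5
        # Row t holds frame t's similarity to every frame of the input.
        scores = F.softmax(q @ k.T / d_k ** 0.5, dim=-1)
        outputs.append(scores @ v)                # weighted sum over frames
    return torch.cat(outputs, dim=-1)             # Y: (frames, 390)
```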
Another important component of the acoustic model is the BiGRU. In an ordinary gated recurrent unit network (GRU), the hidden state propagates in one direction, from front to back: the state at position i depends only on the inputs from position 0 to position i and not on the inputs from position i+1 onward, i.e. the current state sees only past context. In voice recognition, however, the state at the current position is often more effective when it also incorporates future context. The basic idea of the bidirectional GRU (BiGRU) is to stack two unidirectional gated recurrent unit networks one on top of the other. The structure of the BiGRU is shown in FIG. 6, where the circles connected by rightward arrows in the first row from the bottom form the forward GRU and the circles connected by leftward arrows in the second row form the backward GRU. The same training sequence [x1, x2, x3, x4, x5] is fed front to back into the forward GRU to obtain [b1, b2, b3, b4, b5], and back to front into the backward GRU to obtain [a1, a2, a3, a4, a5]; the two sequences are then spliced together element-wise, e.g. b1 and a1 are spliced into y1, giving the output sequence [y1, y2, y3, y4, y5]. The output sequence thus provides complete context information for the state at every moment, and since the output dimension of the BiGRU is twice that of a unidirectional GRU, the BiGRU has stronger expressive power.
The principle of the unidirectional GRU inside the BiGRU module is shown in FIG. 7, where $X_t$ is the input at time t, $H_t$ the hidden state at time t and $H_{t-1}$ the hidden state at time t-1; a circle with an asterisk denotes Hadamard (element-wise) multiplication and a circle with a plus sign denotes addition. The GRU achieves long-range temporal dependency through one update gate and one reset gate. The update gate is computed as $Z_t = \sigma(W_z \cdot [X_t, H_{t-1}])$, where $\sigma$ is the Sigmoid function; the result, between 0 and 1, determines how much information from the current and past time steps is passed on. The reset gate is computed as $R_t = \sigma(W_r \cdot [X_t, H_{t-1}])$; its value, also between 0 and 1, determines how much information from past time steps is forgotten. The reset gate and update gate share the same form of formula but use different parameters to realize different functions. The candidate memory content is computed as $H'_t = \tanh(W \cdot [X_t, R_t \odot H_{t-1}])$, and the hidden state at time t is computed as $H_t = Z_t \odot H'_t + (1 - Z_t) \odot H_{t-1}$, where $Z_t \odot H'_t$ represents the updated information of time step t and $(1 - Z_t) \odot H_{t-1}$ the information carried forward from past time steps; their combination is the output at time step t.
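These gate equations translate directly into code; the following single-step sketch mirrors them (bias terms omitted for brevity, weight names illustrative):

```python
import torch

def gru_step(x_t, h_prev, Wz, Wr, W):
    """One GRU time step; weights act on the concatenation [x_t, h_prev]."""
    xh = torch.cat([x_t, h_prev], dim=-1)
    z_t = torch.sigmoid(xh @ Wz)   # update gate: how much new info to pass on
    r_t = torch.sigmoid(xh @ Wr)   # reset gate: how much past info to forget
    h_cand = torch.tanh(torch.cat([x_t, r_t * h_prev], dim=-1) @ W)
    return z_t * h_cand + (1 - z_t) * h_prev  # new hidden state H_t
```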
In this embodiment, the relevant parameters of the acoustic model are set as follows: the number of heads of the multi-head self-attention layer in the TRM is 5, the linear-layer dimension is 390 with 78 (390/5) per head, and the Dropout rate is 0.3; the hidden node counts of the two feed-forward sublayers in the TRM are 1024 and 390, with a Dropout rate of 0.3; the forward and backward GRUs in the BiGRU module each have 256 neuron nodes, a Tanh activation function and a Dropout rate of 0.3; the first fully connected layer FC after the BiGRU has 256 hidden nodes, a ReLU activation function and a Dropout rate of 0.3; because the recognition targets are Chinese characters and the ATC dataset contains 745 Chinese character classes in total, the fully connected layer FC feeding the CTC has 746 nodes (745 + blank), with no activation function and no Dropout; the loss function is the CTC loss. Training updates the network parameters with an Adam optimizer, the initial learning rate is set to 0.0005, and the Adam momentum terms are 0.9 and 0.99 respectively. During training, the MFCC features of 12 utterances are taken as the input of each batch; because the audio lengths within a batch differ and the network's batch input must be aligned, the MFCC features of the shorter utterances in each batch of 12 are zero-padded to the length of the longest one.
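Under those settings, one training step might look like the following sketch. The model class is the one sketched earlier; placing the CTC blank at index 745 is an assumption, since the text only says "745 + blank":

```python
import torch
import torch.nn as nn

model = TrmBiGruCtc()  # from the earlier sketch
ctc_loss = nn.CTCLoss(blank=745, zero_infinity=True)  # blank index assumed
optim = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.99))

def train_step(feats, feat_lens, labels, label_lens):
    """feats: (12, max_frames, 390), zero-padded to the longest utterance;
    labels: (12, max_label_len) padded character indices."""
    logits = model(feats)                               # (B, T, 746)
    log_probs = logits.log_softmax(-1).transpose(0, 1)  # CTCLoss wants (T, B, C)
    loss = ctc_loss(log_probs, labels, feat_lens, label_lens)
    optim.zero_grad()
    loss.backward()                                     # backpropagation
    optim.step()
    return loss.item()
```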
Once the trained acoustic model is obtained, the ATC voice data to be recognized can be processed. As shown in the recognition flowchart of FIG. 8, ATC voice data is collected by the air traffic control equipment and converted to WAV format if it is not already in that format. The silent segments are removed; after pre-emphasis, framing and windowing, the MFCC features are extracted, the left-right frame splicing is applied, the result is fed into the acoustic model, and the model output is CTC-decoded to obtain the predicted text of the voice content. CTC decoding uses the Beam Search method. Beam Search can be regarded as a breadth-first search that retains suboptimal solutions: where breadth-first search keeps every historical path, Beam Search keeps only the TOP-N paths (N being the beam width, beam_width). In this embodiment beam_width is set to 5. Suppose the vocabulary size is 100. When the first character is generated, since beam_width equals 5, the 5 most probable characters are chosen from the vocabulary. When the second character is generated, the preceding character may be any of those 5, so combining each with every vocabulary entry gives 5 × 100 new sequences, from which the 5 with the highest confidence are kept as the current sequences. For the third character there are again 5 × 100 possible sequences, from which the 5 most confident are kept, and this process repeats until a terminator is encountered; finally the 5 sequences with the highest confidence are obtained. Beam Search follows the idea of a greedy algorithm and does not necessarily reach the global optimum. However, since the number of voice frames is very large and the number of corresponding characters correspondingly so, seeking the global optimum would make the search space and paths prohibitively large and the search extremely inefficient; the Beam Search used in this embodiment yields a comparatively good local optimum that is acceptable in engineering terms.
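The following sketch implements the generic TOP-N beam search described above over per-frame log-posteriors. Note that it deliberately omits the blank/repeat merging that a full CTC beam decoder adds, so it is illustrative only:

```python
import numpy as np

def beam_search(log_probs: np.ndarray, beam_width: int = 5):
    """Keep the TOP-N partial sequences at each frame.

    log_probs: (T, C) per-frame log-posteriors from the acoustic model.
    Returns the beam_width best (token sequence, summed log-prob) pairs.
    """
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for frame in log_probs:
        # Expand every kept path with every class: beam_width * C candidates.
        candidates = [(seq + (c,), score + frame[c])
                      for seq, score in beams for c in range(len(frame))]
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_width]  # keep only the 5 most confident paths
    return beams
```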
The effect of the invention is verified as follows.
A commonly used evaluation index for Chinese speech recognition is the character error rate (CER). The character error rate is computed as follows: to make the recognized sequence consistent with the correct sequence, certain characters must be substituted, deleted or inserted; the total number of inserted, substituted and deleted characters, as a percentage of the total number of characters in the correct sequence, is the CER:

$$\mathrm{CER} = \frac{S + D + I}{N} \times 100\%,$$

where S, D and I are the numbers of substituted, deleted and inserted characters and N is the number of characters in the correct sequence.
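That definition can be computed with a standard Levenshtein edit distance; a compact sketch:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: (S + D + I) / N via Levenshtein distance."""
    m, n = len(reference), len(hypothesis)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                  # i deletions to reach empty hypothesis
    for j in range(n + 1):
        dp[0][j] = j                  # j insertions from empty reference
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[m][n] / max(m, 1) * 100  # percentage of reference length
```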
the prior art in the technical field of voice recognition is quite a lot, but the prior art applied in the field of voice recognition of Chinese air traffic control is mainly in a CLDNN structure, namely a deep learning architecture consisting of a plurality of layers of CNNs, a plurality of layers of LSTM and a plurality of layers of fully-connected neural networks (from a patent with application publication number CN 110335609A, a ground-air call data analysis method and system based on voice recognition). Because the voice data set used in the field of Chinese air traffic control voice recognition is generally a self-built database, and the voice tone quality, the duration, the acquisition equipment, the recording environment and the like are all different, the quality of the recognition method cannot be evaluated by directly comparing the accuracy of different data sets. Therefore, the invention only carries out a comparison experiment with CLDNN, and adjusts the structure of the ATC data set according to the characteristics of the ATC data set. The specific model structure of the CLDNN mentioned in the embodiment of the present invention is as follows: two CNN layers, the convolution kernel size is 3*3, the step length is 1, the number of filters is 32 and 64 in sequence, the first CNN layer is connected with the largest pooling layer, the pooling window is 2 x2, the windows are not overlapped, and the second CNN layer is not connected with the pooling layer; the output of the convolution layer is used as the input of the full-connection network layer to reduce the dimension, and the number of the neurons of the full-connection network layer is 512; the full-connection network layer is connected with three LSTM layers, the number of the neurons is 256, the full-connection network layer and the softmax layer are connected with the full-connection network layer, and the number of the neurons is 256 and the number of the character categories respectively. The preprocessing of the voice data is the same as that of the embodiment of the invention, the MFCC characteristics of the voice are extracted, the frame spelling of the left 7 frames and the right 7 frames is carried out, the optimizer adopts an Adam optimizer as well, the initial learning rate is 0.005, and the momentum values in Adam are respectively 0.9 and 0.99. The CLDNN model has high requirement on data volume, is limited by data scale, can generate serious gradient vanishing problem when the LSTM layer number in the structure is overlarge, and is limited by hardware performance, so that the LSTM layer number is set to be 3 in the experiment.
This "CLDNN" model has been parameter-tuned to the ATC dataset, reducing the parameter size. However, the number of model parameters is counted as 10050163, while the number of model parameters in this embodiment is 3837483, in contrast to the smaller parameter scale of this embodiment, the model keeping costs low.
Table 1 shows the recognition effect of the acoustic model TRM-BiGRU-CTC on the test set; the character error rates of BiGRU and CLDNN are listed for comparison:
TABLE 1
From the above results it is clear that, with the same data preprocessing, the recognition effect of the invention on the ATC dataset is better than both CLDNN and the reference model BiGRU.
The invention has the following advantages:
(1) ATC voice advantage: the invention builds an ATC voice dataset specifically designed around the characteristics of ATC voice.
(2) Model advantage: in the acoustic model structure provided by the invention, the TRM module encodes the input voice features: through the self-attention mechanism it computes the similarity of each frame's features against all frames of the input voice, fully accounts for the pronunciation and semantic connections between the input voice frames, and recomputes a feature representation that incorporates the contextual voice information. The BiGRU combines a bidirectional recurrent neural network with a gated recurrent unit network and has the advantages of both: it can model temporal dependencies like a gated recurrent unit network and has access to context like a bidirectional recurrent neural network. CTC addresses the difficulty of putting the input sequence and the output sequence into one-to-one correspondence; speech is a typical case where the input sequence is not aligned with the label sequence, and CTC lets the deep learning model learn the alignment automatically, thereby realizing end-to-end voice recognition. In summary, the acoustic model structure is well founded; at the same time, its main structure consists only of TRM and BiGRU layers, so problems such as vanishing or exploding gradients are unlikely, the training process converges easily, the required data volume is relatively small, and the dataset labeling cost is low. Compared with the prior art, the invention achieves a better recognition effect on an ATC voice dataset of comparatively small size.
Example 2
Referring to FIG. 9, this embodiment provides a Chinese civil aviation air traffic control voice recognition system, the system comprising:
a voice feature data acquisition module 901, configured to acquire voice feature data, the voice feature data being time-series feature information extracted from a voice signal;
a voice recognition module 902, configured to input the voice feature data into a trained acoustic model to obtain a recognition result, the recognition result representing the air traffic control Chinese phraseology text corresponding to the voice signal; the acoustic model comprises a TRM module, a BiGRU module, a fully connected layer FC and a CTC module connected in sequence, wherein the TRM module comprises a multi-head self-attention layer, a first residual connection and layer normalization layer, a feed-forward layer and a second residual connection and layer normalization layer connected in sequence, the BiGRU module comprises a bidirectional gated recurrent unit network, the CTC module comprises a connectionist temporal classification layer, and the acoustic model is trained on air traffic control instruction phraseology voice samples labeled with Chinese characters.
As one implementation of this embodiment, the Chinese civil aviation air traffic control voice recognition system further comprises:
a silence-removal module, configured to perform silence removal on the voice signal;
a framing module, configured to frame the voice signal into a plurality of voice frames, adjacent voice frames overlapping by a set proportion;
a voice feature data determining module, configured to determine the voice feature data from the voice frames, where each item of voice feature data corresponds to a plurality of consecutive voice frames and the voice feature data is the Mel-frequency cepstral coefficients of the voice.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and for identical or similar parts the embodiments may be referred to one another. Since the system disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
Specific examples have been applied herein to explain the principles and implementation of the invention; the above description of the embodiments is only intended to help understand the method of the invention and its core idea. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementation and scope of application in accordance with the idea of the invention. In view of the foregoing, the contents of this specification should not be construed as limiting the invention.

Claims (10)

1. A Chinese civil aviation air traffic control voice recognition method, comprising:
acquiring voice feature data, wherein the voice feature data is time-series feature information extracted from a voice signal;
inputting the voice feature data into a trained acoustic model to obtain a recognition result, wherein the recognition result represents the air traffic control Chinese phraseology text corresponding to the voice signal; the acoustic model comprises a TRM module, a BiGRU module, a fully connected layer FC and a CTC module connected in sequence; the TRM module comprises a multi-head self-attention layer, a first residual connection and layer normalization layer, a feed-forward layer and a second residual connection and layer normalization layer connected in sequence, wherein the feed-forward layer consists of a first linear layer and a second linear layer, the first linear layer has a ReLU activation function and no Dropout, and the second linear layer has no activation function; the number of multi-head attention heads is set to 5, and 5 different self-attention mapping representations, obtained by 5 different linear transformations of the input, are spliced together as the new feature encoding; the BiGRU module comprises a bidirectional gated recurrent unit network, in which the same training sequence [x1, x2, x3, x4, x5] is input front to back into a forward GRU to obtain the sequence [b1, b2, b3, b4, b5] and back to front into a backward GRU to obtain the sequence [a1, a2, a3, a4, a5], and the two sequences are spliced together element-wise to obtain the output sequence [y1, y2, y3, y4, y5]; the CTC module comprises a connectionist temporal classification layer; and the acoustic model is trained on air traffic control instruction phraseology voice samples labeled with Chinese characters.
2. The Chinese civil aviation air traffic control voice recognition method of claim 1, further comprising, before the acquiring of the voice feature data:
framing the voice signal to obtain a plurality of voice frames;
determining the voice feature data from the voice frames, wherein each item of voice feature data corresponds to a plurality of consecutive voice frames.
3. The method of claim 2, wherein each item of voice feature data corresponds to a reference voice frame, a set number of voice frames before the reference voice frame, and a set number of voice frames after the reference voice frame.
4. The Chinese civil aviation air traffic control voice recognition method of claim 3, wherein when the reference voice frame lies within the first m frames or the last n frames of the voice signal, zero padding is performed before or after the voice feature data to which the reference voice frame belongs, respectively, so that every item of voice feature data has the same data length, wherein m and n are both positive integers.
5. The Chinese civil aviation air traffic control voice recognition method of any one of claims 2 to 4, wherein determining the voice feature data from the voice frames specifically comprises:
sampling the voice frames to obtain a plurality of sampling points;
and determining the voice feature data based on the sampling points, wherein each item of voice feature data corresponds to the sampling points in a plurality of consecutive voice frames.
6. The Chinese civil aviation air traffic control voice recognition method of claim 5, wherein the voice feature data is the Mel-frequency cepstral coefficients of the voice.
7. The Chinese civil aviation air traffic control voice recognition method of claim 2, further comprising, before framing the voice signal:
performing silence removal on the voice signal.
8. The method of claim 2, wherein adjacent voice frames in the voice signal overlap by a set proportion.
9. A Chinese civil aviation air traffic control voice recognition system, comprising:
a voice feature data acquisition module, configured to acquire voice feature data, the voice feature data being time-series feature information extracted from a voice signal;
a voice recognition module, configured to input the voice feature data into a trained acoustic model to obtain a recognition result, the recognition result representing the air traffic control Chinese phraseology text corresponding to the voice signal; the acoustic model comprises a TRM module, a BiGRU module, a fully connected layer FC and a CTC module connected in sequence; the TRM module comprises a multi-head self-attention layer, a first residual connection and layer normalization layer, a feed-forward layer and a second residual connection and layer normalization layer connected in sequence, wherein the feed-forward layer consists of a first linear layer and a second linear layer, the first linear layer has a ReLU activation function and no Dropout, and the second linear layer has no activation function; the number of multi-head attention heads is set to 5, and 5 different self-attention mapping representations, obtained by 5 different linear transformations of the input, are spliced together as the new feature encoding; the BiGRU module comprises a bidirectional gated recurrent unit network, in which the same training sequence [x1, x2, x3, x4, x5] is input front to back into a forward GRU to obtain the sequence [b1, b2, b3, b4, b5] and back to front into a backward GRU to obtain the sequence [a1, a2, a3, a4, a5], and the two sequences are spliced together element-wise to obtain the output sequence [y1, y2, y3, y4, y5]; the CTC module comprises a connectionist temporal classification layer; and the acoustic model is trained on air traffic control instruction phraseology voice samples labeled with Chinese characters.
10. The Chinese civil aviation air traffic control voice recognition system of claim 9, further comprising:
a silence-removal module, configured to perform silence removal on the voice signal;
a framing module, configured to frame the voice signal into a plurality of voice frames, adjacent voice frames overlapping by a set proportion;
a voice feature data determining module, configured to determine the voice feature data from the voice frames, wherein each item of voice feature data corresponds to a plurality of consecutive voice frames and the voice feature data is the Mel-frequency cepstral coefficients of the voice.
CN202110467893.0A 2021-04-28 2021-04-28 Chinese civil aviation air traffic control voice recognition method and system Active CN113160798B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110467893.0A CN113160798B (en) 2021-04-28 2021-04-28 Chinese civil aviation air traffic control voice recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110467893.0A CN113160798B (en) 2021-04-28 2021-04-28 Chinese civil aviation air traffic control voice recognition method and system

Publications (2)

Publication Number Publication Date
CN113160798A (en) 2021-07-23
CN113160798B (en) 2024-04-16

Family

ID=76872012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110467893.0A Active CN113160798B (en) 2021-04-28 2021-04-28 Chinese civil aviation air traffic control voice recognition method and system

Country Status (1)

Country Link
CN (1) CN113160798B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113823275A (en) * 2021-09-07 2021-12-21 广西电网有限责任公司贺州供电局 Voice recognition method and system for power grid dispatching
CN113821053A (en) * 2021-09-28 2021-12-21 中国民航大学 Flight assisting method and system based on voice recognition and relation extraction technology
CN115206293B (en) * 2022-09-15 2022-11-29 四川大学 Multi-task air traffic control voice recognition method and device based on pre-training
CN115359784B (en) * 2022-10-21 2023-01-17 成都爱维译科技有限公司 Civil aviation land-air voice recognition model training method and system based on transfer learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3172758A1 (en) * 2016-07-11 2018-01-18 FTR Labs Pty Ltd Method and system for automatically diarising a sound recording
US11862146B2 (en) * 2019-07-05 2024-01-02 Asapp, Inc. Multistream acoustic models with dilations

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6826146B1 (en) * 1999-06-02 2004-11-30 At&T Corp. Method for rerouting intra-office digital telecommunications signals
WO2021064907A1 (en) * 2019-10-02 2021-04-08 日本電信電話株式会社 Sentence generation device, sentence generation learning device, sentence generation method, sentence generation learning method, and program
CN110838288A (en) * 2019-11-26 2020-02-25 杭州博拉哲科技有限公司 Voice interaction method and system and dialogue equipment
CN110992943A (en) * 2019-12-23 2020-04-10 苏州思必驰信息科技有限公司 Semantic understanding method and system based on word confusion network
CN111063336A (en) * 2019-12-30 2020-04-24 天津中科智能识别产业技术研究院有限公司 End-to-end voice recognition system based on deep learning
CN111243591A (en) * 2020-02-25 2020-06-05 上海麦图信息科技有限公司 Air control voice recognition method introducing external data correction
CN112217947A (en) * 2020-10-10 2021-01-12 携程计算机技术(上海)有限公司 Method, system, device and storage medium for transcribing customer service telephone voice into text
CN112420024A (en) * 2020-10-23 2021-02-26 四川大学 Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device
CN112037773A (en) * 2020-11-05 2020-12-04 北京淇瑀信息科技有限公司 N-best spoken language semantic recognition method and device, and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"ATC Speech Recognition Based on Deep Learning"; Zhang Songfei; China Master's Theses Full-text Database, Information Science and Technology Series; pp. 19-39, 46-69 *
"Keyword Spotting System Based on Deep Neural Networks"; Sun Yannan, Xia Xiuyu; Computer Systems & Applications; 2018-05-15; Vol. 27, No. 05; pp. 41-48 *
"A Survey of Machine Reading Comprehension Techniques"; Xu Xiaoling, Zheng Jianli, Yin Ziming; Journal of Chinese Computer Systems; 2020-03-15; No. 03; full text *

Also Published As

Publication number Publication date
CN113160798A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN113160798B (en) Chinese civil aviation air traffic control voice recognition method and system
EP3680894B1 (en) Real-time speech recognition method and apparatus based on truncated attention, device and computer-readable storage medium
Lin et al. A unified framework for multilingual speech recognition in air traffic control systems
CN111666381B (en) Task type question-answer interaction system oriented to intelligent control
CN113223509B (en) Fuzzy statement identification method and system applied to multi-person mixed scene
CN111339750B (en) Spoken language text processing method for removing stop words and predicting sentence boundaries
CN111785257B (en) Air traffic control voice recognition method and device for a small number of labeled samples
CN113053366B (en) Multi-modal-fusion-based readback consistency verification method for air traffic control voice
CN110399850A (en) Continuous sign language recognition method based on deep neural networks
CN112420024B (en) Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device
CN115910066A (en) Intelligent dispatching command and operation system for regional power distribution network
CN114385802A (en) Empathetic dialogue generation method integrating topic prediction and emotion inference
CN117765981A (en) Emotion recognition method and system based on cross-modal fusion of voice text
CN114944150A (en) Dual-task-based Conformer land-air communication acoustic model construction method
Helmke et al. Readback error detection by automatic speech recognition and understanding
CN117591648A (en) Empathetic dialogue reply generation method for power grid customer service based on fine-grained emotion perception
CN118193702A (en) Intelligent man-machine interaction system and method for English teaching
Sankar et al. Multistream neural architectures for cued speech recognition using a pre-trained visual feature extractor and constrained ctc decoding
Gupta et al. CRIM's Speech Transcription and Call Sign Detection System for the ATC Airbus Challenge Task.
CN110390929A (en) Chinese and English civil aviation ground-air communication acoustic model construction method based on CDNN-HMM
Shi et al. An end-to-end conformer-based speech recognition model for mandarin radiotelephony communications in civil aviation
CN114360584A (en) Phoneme-level-based speech emotion layered recognition method and system
CN113642862A (en) Method and system for identifying named entities of power grid dispatching instructions based on BERT-MBIGRU-CRF model
CN113421593A (en) Voice evaluation method and device, computer equipment and storage medium
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant