CN115206293B - Multi-task air traffic control voice recognition method and device based on pre-training - Google Patents

Multi-task air traffic control voice recognition method and device based on pre-training

Info

Publication number
CN115206293B
CN115206293B (application CN202211118845.1A)
Authority
CN
China
Prior art keywords
training
traffic control
task
air traffic
dimension
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211118845.1A
Other languages
Chinese (zh)
Other versions
CN115206293A (en)
Inventor
张子宸
林毅
张建伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202211118845.1A
Publication of CN115206293A
Application granted
Publication of CN115206293B

Classifications

    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/063 Training (under G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    (all under G PHYSICS; G10 MUSICAL INSTRUMENTS; ACOUSTICS; G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING; G10L15/00 Speech recognition)


Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a pre-training-based multi-task air traffic control voice recognition method and device. The method comprises: acquiring air traffic control voice data and preprocessing it to obtain a training sample data set, which is divided into a first-stage pre-training data set and a second-stage training data set; constructing an air traffic control speech coding model; inputting the first-stage pre-training data set into the air traffic control speech coding model for pre-training; constructing a multi-task air traffic control speech recognition model on top of the pre-trained air traffic control speech coding model; establishing a loss function for the multi-task air traffic control speech recognition model; training the multi-task air traffic control speech recognition model with this loss function and the second-stage training data set; and finally, inputting air traffic control voice data segmented by sentence into the trained multi-task air traffic control speech recognition model and outputting the result. The invention achieves faster and more accurate speech recognition while training on fewer labeled samples.

Description

Multi-task air traffic control voice recognition method and device based on pre-training
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a multi-task air traffic control voice recognition method and device based on pre-training.
Background
In reality, related problems are often interconnected, and multi-task learning exploits the shared information hidden in multiple related tasks to improve the generalization capability of a model, allowing it to learn better feature representations and improving the performance of each task. At the same time, because multi-task learning shares network parameters among tasks, the results of multiple tasks can be obtained in a single inference pass, which significantly reduces the amount of data and model parameters required for training and makes the model more efficient at inference time.
In recent years, more and more fields of artificial intelligence have turned to unsupervised pre-training. Unsupervised pre-training can train a general network model with strong in-domain generalization capability on a large amount of unlabeled data; the model is then fine-tuned on a small amount of labeled data according to different downstream tasks, ultimately achieving better performance with fewer labeled samples.
In the field of air traffic control intelligence, air traffic control speech carrying multiple attribute labels can provide more information sources for air traffic control safety assistance measures and more information for post-hoc analysis. At present, there is no effective way to simultaneously provide text transcription and multiple attribute classifications for air traffic control speech. This application therefore proposes a pre-training-based multi-task air traffic control voice recognition method and device to improve recognition performance in the air traffic control domain and to classify multiple attributes of air traffic control speech.
Disclosure of Invention
The invention aims to: in view of the problem that the prior art has no effective way to perform text recognition and multiple attribute classification on ground-air and air-air communications in real time, a pre-training-based multi-task air traffic control voice recognition method and device are provided.
In order to achieve the purpose, the invention adopts the technical scheme that:
a multi-task air traffic control voice recognition method based on pre-training comprises the following steps:
S1, acquiring air traffic control voice data and preprocessing it to obtain a training sample data set, which comprises a first-stage pre-training data set and a second-stage training data set carrying manual text labels and auxiliary-task attribute labels;
s2, constructing a pre-training-based air traffic control speech coding model;
s3, inputting the pre-training data set of the first stage into an empty pipe speech coding model for pre-training;
s4, constructing a multi-task air traffic control voice recognition model based on the pre-trained air traffic control voice coding model;
s5, establishing a loss function of the multitask empty pipe voice recognition model;
s6, training the multi-task air traffic control voice recognition model based on a loss function of the multi-task air traffic control voice recognition model and a second-stage training data set;
and S7, inputting the air traffic control voice data segmented according to the sentences into the trained multi-task air traffic control voice recognition model to obtain a text recognition result and an auxiliary task recognition result.
As a preferred scheme of the present invention, in the pre-training-based multi-task air traffic control speech recognition method, the air traffic control voice data in step S1 is a Chinese-English speech signal without text labels, and the preprocessing comprises the following steps:
s11, after carrying out voice emphasis and framing pretreatment on the air traffic control voice data, segmenting the pretreated air traffic control voice data according to sentences;
s12, taking all the segmented air traffic control voice data as a pre-training data set in a first stage, wherein each training sample only comprises a single-sentence voice audio file;
and S13, selecting part of the segmented air traffic control voice data for manual text labeling and auxiliary-task attribute labeling as the second-stage training data set, wherein each training sample comprises a single-sentence speech audio file, a corresponding text label, and attribute classification labels.
As a preferred scheme of the present invention, in the pre-training-based multi-task air traffic control speech recognition method, the construction of the air traffic control speech coding model in step S2 comprises the following steps:
s21, establishing a convolution module consisting of a one-dimensional convolution layer and an activation function layer, and extracting the voice characteristics of the training sample by using the convolution module;
s22, establishing a context extraction module consisting of a deep neural network, extracting context information of the voice features by using the context extraction module, and recording the context information as:
Figure 821360DEST_PATH_IMAGE001
wherein the content of the first and second substances,cis the output of the convolution module and is,hfor the hidden layer feature output by each layer of neural network,h i is as followsiThe hidden layer characteristics output by the layer neural network,Nis the total number of layers of the deep neural network,Tin order to input the length of the speech,fin order to be a dimension of the hidden layer feature,Rthe set of real numbers is a set of real numbers,R 1 × T × f the representation feature dimension is (1,T, f);
s23, establishing an output module, and extracting the last context in the moduledAnd stacking the output of the layer hidden layer as the output of the empty pipe speech coding model, and recording as:
Figure DEST_PATH_IMAGE002
wherein, the first and the second end of the pipe are connected with each other,Yfor the multi-layer feature output of the encoder, hfor the hidden layer feature output by each layer of neural network,dthe number of layers of the deep neural network,Nis the total number of layers of the deep neural network,Tin order to input the length of the speech,fin order to be a dimension of the hidden layer feature,Rthe number is a real number set,R d × T × f represents a characteristic dimension of (d, T, f)。
As a preferred scheme of the invention, the construction of the multi-task air traffic control speech recognition model in step S4 comprises the following steps:
s41, constructing a multi-attention module, and constructing an auxiliary task classifier based on the multi-attention module, wherein the auxiliary task classifier comprises a speaker role classifier, an instruction language classifier, a speaker gender classifier and an instruction intention classifier;
and S42, constructing a multi-attention module, and constructing a voice recognition classifier based on the multi-attention module.
As a preferred scheme of the invention, the construction of the multi-attention module comprises the following steps:
constructing a level attention module: an attention operation is performed over the level dimension of the multi-layer feature output of the encoder to obtain a level-dimension attention matrix, which is multiplied with the multi-layer feature output of the encoder, recorded as:

$LR = \mathrm{Att}_{level}(Y) \cdot Y \in \mathbb{R}^{1 \times T \times f}$

where $LR$ is the output of the level attention module, $Y \in \mathbb{R}^{d \times T \times f}$ is the multi-layer feature output of the encoder, $\mathrm{Att}_{level}(\cdot)$ is the level-dimension attention operation, $d$ is the number of stacked layers of the deep neural network, $T$ is the input speech length, and $f$ is the hidden-layer feature dimension;
constructing an attention module for the time-sequence and frequency dimensions: attention operations are performed on the time-sequence dimension and the frequency dimension of the level attention output, yielding a time-sequence attention matrix and a frequency attention matrix, which are multiplied with the level attention output, recorded as:

$LTFR = \mathrm{Att}_{T}(LR) \cdot \mathrm{Att}_{F}(LR) \cdot LR \in \mathbb{R}^{1 \times T \times f}$

where $LTFR$ is the output of the time-sequence and frequency attention module, $LR$ is the output of the level attention module, $\mathrm{Att}_{T}(\cdot)$ is the time-sequence attention operation, $\mathrm{Att}_{F}(\cdot)$ is the frequency-dimension attention operation, $T$ is the input speech length, and $f$ is the hidden-layer feature dimension.
as a preferred scheme of the present invention, in a pre-training-based multi-task empty pipe speech recognition method, the construction of the auxiliary task classifier in S41 further includes:
s411, inputting output results of the multiple attention modules into a voice recognition classifier;
and S412, inputting the output result of the multi-attention module into a full connection layer to obtain an auxiliary task classification result.
As a preferred aspect of the present invention, the constructing of the speech recognition classifier in S42 further includes:
s421, adding the output result of the multi-attention module and the output of the multi-attention module of all the auxiliary task classifiers to obtain the voice characteristics containing various voice information, and recording as:
Figure DEST_PATH_IMAGE008
wherein the content of the first and second substances,X ASR in order to include the speech characteristics of a variety of speech information,LTFR ASR for the multi-attention module output of the speech recognition classifier,LTFR aux_i is as followsiThe multi-attention module output of each auxiliary task classifier,idenotes the firstiAn auxiliary task classifier for classifying the task of the task,nindicates the number of the auxiliary task classifiers,Tin order to input the length of the speech,fis the dimension of the hidden layer feature(s),Rthe number is a real number set,R 1 × T × f the representation feature dimension is (1,T, f);
s422, inputting the voice characteristics containing various voice information into the full connection layer to obtain a text recognition result.
As a preferred scheme of the present invention, in step S5 the loss function of the multi-task air traffic control speech recognition model is constructed as a weighted sum of the loss function of the speech recognition classifier and the loss functions of the auxiliary task classifiers, with the weight of each task loss adjusted as a parameter during model training. The loss function of the speech recognition classifier uses the connectionist temporal classification loss, the loss functions of the auxiliary task classifiers all use the cross-entropy loss, and the loss function $L$ of the multi-task air traffic control speech recognition model is recorded as:

$L = \lambda_{ASR} L_{ASR} + \sum_{i=1}^{n} \lambda_{aux\_i} L_{aux\_i}$

where $L_{ASR}$ and $L_{aux\_i}$ denote the loss values of the speech recognition classifier and the $i$-th auxiliary task classifier respectively, $\lambda_{ASR}$ and $\lambda_{aux\_i}$ denote the weights of the speech recognition loss and the $i$-th auxiliary task loss respectively, and $n$ is the number of auxiliary task classifiers.
As a preferred scheme of the present invention, the training in step S6 is loop iteration training, and the single loop iteration training process is:
s61, selecting a group of training samples in the second-stage training data set;
s62, inputting the training sample into a multi-task air traffic control voice recognition model, and outputting a text recognition result and an auxiliary task classification result;
and S63, adjusting the parameters of the multi-task air traffic control voice recognition model based on its loss function.
The device for the pre-training-based multi-task air traffic control voice recognition method comprises at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform any of the methods described above.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. By utilizing the association information among multiple speech-related tasks, the neural network can learn more shared speech-related features, improving the performance of each task. Meanwhile, because multi-task learning shares network parameters among related tasks, model training and prediction become more efficient. Finally, the loss of each task is balanced in a learned manner, which accelerates training and further improves the accuracy of each task.
2. The method adopts a mechanism of pre-training the air traffic control speech coding model: training proceeds on as much speech data as possible in a self-supervised manner, extracting the common characteristics of air traffic control speech data so that the speech coding model learns better speech feature representations, improving the generalization capability of the model and the accuracy of downstream tasks. Meanwhile, good results can be achieved with only a few labeled air traffic control speech samples.
3. By adopting the multi-attention module, hidden layer characteristic information of different layers in speech coding can be fully utilized, more important time sequence and frequency dimension information can be captured, more effective speech representation information is provided for downstream tasks, and the performance of each task is improved.
In conclusion, the method recognizes air traffic control speech with higher speed and accuracy, requires fewer labeled samples during training, and can provide in real time classification information such as the speaker role, language, and instruction intention corresponding to the air traffic control speech, thereby providing more information sources for air traffic control safety assistance measures and more information for post-hoc analysis.
Drawings
FIG. 1 is a model structure diagram of the air traffic control speech coding model of the present invention.
FIG. 2 is a block diagram of a multi-attention module according to the present invention.
FIG. 3 is a model structure diagram of the multi-task air traffic control speech recognition model of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1
In order to perform text transcription and multiple attribute classification on ground-air communications in real time, this embodiment provides a pre-training-based multi-task air traffic control speech recognition method and device, wherein the device comprises at least one processor and a memory communicatively connected to the at least one processor.
The multi-task air traffic control voice recognition method based on pre-training comprises the following steps:
s1, acquiring air traffic control voice data, and preprocessing to obtain a training sample data set, wherein the training sample data set comprises a first-stage pre-training data set and a second-stage training data set for manually performing text labeling and auxiliary task attribute labeling;
specifically, firstly, a Chinese-English voice signal without a text label under the ground-air communication environment is obtained through an air traffic control internal call system, chinese-English air traffic control ground-air communication voice is recorded in real time from a ground-air communication voice recorder by using a multi-channel voice signal acquisition device, and the voice is filtered, sampled and PCM encoded to form air traffic control voice data with 8K sampling rate and 16bit sampling precision.
Secondly, preprocessing the acquired air traffic control voice data in real time, including voice pre-emphasis, framing and the like, manually segmenting the preprocessed air traffic control voice data into instruction voice segments according to sentences, wherein each segment of voice only contains instructions of a single speaker, and storing the voice segments into a memory in a wav file format; constructing a pre-training data set in the first stage by using all the voice files, wherein each training sample only comprises a single-sentence voice audio file;
finally, randomly selecting empty pipe voice data for about 50 hours in a pre-training data set in the first stage, manually performing text labeling and attribute labeling of a plurality of auxiliary tasks, storing labeling results into a json file, organizing voice and label files to form a training data set in the second stage, wherein each training sample comprises a single-sentence voice audio file, a corresponding text label and a corresponding multi-task classification label;
the auxiliary tasks comprise speaker role classification, instruction language classification, speaker gender classification and instruction intention classification; the results of the speaker character classification include air traffic controllers and airplane drivers, the results of the instruction language classification include Chinese and English, the results of the speaker gender classification include males and females, and the results of the instruction intention classification include altitude or course change instructions such as ascending, descending, left-turning, right-turning and the like.
S2, constructing an air traffic control speech coding model based on pre-training;
specifically, the empty pipe speech coding model structure is shown in fig. 1, and is composed of 1 convolution module, 1 context extraction module and 1 output module, and includes the following:
s21, a convolution module which consists of 7 one-dimensional convolution layers (Conv 1d Layer) and an activation function Layer (GELU) and is used for extracting the voice characteristics of the input training sample;
wherein each convolution layer uses 512 convolution kernels of size 1 × 3 with a stride of 2;
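As a rough illustration, the convolution module might be sketched in PyTorch as follows; only the layer count, kernel count, kernel size, and stride follow the values stated above, while the module and variable names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ConvFeatureExtractor(nn.Module):
    """Convolution module sketch: 7 stacked 1-D convolutions, each followed
    by a GELU activation, turning a raw waveform into frame-level features.
    """
    def __init__(self, num_layers: int = 7, channels: int = 512,
                 kernel_size: int = 3, stride: int = 2):
        super().__init__()
        layers, in_ch = [], 1  # the raw waveform enters as a single channel
        for _ in range(num_layers):
            layers += [nn.Conv1d(in_ch, channels, kernel_size, stride=stride),
                       nn.GELU()]
            in_ch = channels
        self.conv = nn.Sequential(*layers)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        x = self.conv(wav.unsqueeze(1))  # (batch, channels, frames)
        return x.transpose(1, 2)         # (batch, T frames, f features)
```

For a one-second 8 kHz clip, `ConvFeatureExtractor()(torch.randn(1, 8000))` yields a (1, T, 512) feature sequence, with T reduced by the stride-2 layers.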
s22, a context extraction module which is composed of a deep neural network and used for extracting context information of the voice features, and the context information is recorded as:
$h_i = \mathrm{Encoder}_i(h_{i-1}), \quad h_0 = c, \quad h_i \in \mathbb{R}^{1 \times T \times f}, \quad i = 1, \dots, N$

where $c$ is the output of the convolution module, $h_i$ is the hidden-layer feature output by the $i$-th layer of the neural network, $N$ is the total number of layers of the deep neural network, $T$ is the input speech length, $f$ is the hidden-layer feature dimension, and $\mathbb{R}^{1 \times T \times f}$ denotes a feature of dimension $(1, T, f)$;
wherein, each hidden layer adopts an Encoder structure of a Transformer.
S23, an output module, which stacks the outputs of the last $d$ hidden layers of the context extraction module as the output of the model, serving as the input of the downstream speech recognition classifier and all auxiliary task classifiers so that each classifier obtains usable information from more dimensions, recorded as:

$Y = \mathrm{Stack}(h_{N-d+1}, \dots, h_N) \in \mathbb{R}^{d \times T \times f}$

where $Y$ is the multi-layer feature output of the encoder, composed of the stacked outputs of the last $d$ hidden layers; $d$ is the number of stacked layers, $N$ the total number of layers of the deep neural network, $h$ the hidden-layer features output by each layer, $T$ the input speech length, and $f$ the hidden-layer feature dimension.
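A minimal PyTorch sketch of the context extraction and output modules under the stated design (Transformer encoder layers; the last d hidden layers stacked); the layer count and head count are placeholders, and d defaults to 6 as in experiments A4-A8 below:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Context extraction module (N Transformer encoder layers) plus the
    output module, which stacks the outputs of the last d hidden layers.
    """
    def __init__(self, num_layers: int = 12, d_model: int = 512,
                 n_heads: int = 8, d_stack: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(num_layers))
        self.d_stack = d_stack

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        # c: (batch, T, f), the output of the convolution module
        hidden, h = [], c
        for layer in self.layers:
            h = layer(h)
            hidden.append(h)  # keep every layer's hidden features
        # stack the last d hidden layers: (batch, d, T, f)
        return torch.stack(hidden[-self.d_stack:], dim=1)
```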
S3, inputting the first-stage pre-training data set to pre-train the air traffic control speech coding model. The model is trained in a self-supervised manner to extract the common characteristics of air traffic control speech data, so that the speech coding model learns better speech feature representations, improving the generalization capability of the model and the accuracy of downstream tasks. Meanwhile, good results can then be achieved with only a few labeled air traffic control speech samples.
Specifically, the pre-training method may refer to wav2vec 2.0, the pre-training is loop iteration training, and the steps executed in the single loop iteration training process are as follows:
s31, in a pre-training data set at the first stage, selecting a group of training samples to be input into an empty pipe speech coding model, and extracting hidden layer characteristics of the training samples by a convolution module of the empty pipe speech coding model;
s32, mapping the hidden layer characteristics obtained in the S31 into quantized hidden layer characteristics through a Gumbel softmax quantization module;
s33, randomly masking the hidden layer characteristics obtained in the S31, inputting the masked hidden layer characteristics to a context extraction module and outputting the masked hidden layer characteristics;
s34, constructing contrast learning loss, wherein negative samples are context features generated by positions added with masks in the output of S33, and positive samples are quantization features of the same positions in the quantization hidden layer features obtained in S32;
and S35, updating the parameters through back propagation.
Further, the pre-training method can also adopt wav2vec, vq-wav2vec and the like.
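For illustration, an InfoNCE-style contrastive loss in the spirit of S34 might look like the following sketch; the simplified negative sampling (all other masked frames serve as distractors) and the temperature value are assumptions, not the patent's exact formulation:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context: torch.Tensor, quantized: torch.Tensor,
                     mask: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """`context` and `quantized` are (T, f); `mask` is a boolean (T,)
    marking the masked frames. All names here are illustrative."""
    anchors = context[mask]    # context features at masked positions
    targets = quantized[mask]  # quantized features at the same positions
    # cosine similarity between each anchor and every candidate target
    sim = F.cosine_similarity(anchors.unsqueeze(1), targets.unsqueeze(0), dim=-1)
    logits = sim / temperature
    # the true target of anchor i is candidate i; the rest act as negatives
    labels = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, labels)
```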
S4, constructing a multi-task air traffic control speech recognition model based on the air traffic control speech coding model;
The structure of the multi-task air traffic control speech recognition model is shown in fig. 3. It consists of the air traffic control speech coding model, a plurality of auxiliary task classifiers, and a speech recognition classifier; the auxiliary task classifiers and the speech recognition classifier share the air traffic control speech coding model, whose output serves as the input of each classifier. The specific steps are as follows:
A multiple attention module (LTFAtt) is constructed, as shown in fig. 2, by the following steps:
firstly, a level attention module is constructed: an attention operation is performed over the level dimension of the multi-layer feature output of the encoder to obtain a level-dimension attention matrix, which is multiplied with the multi-layer feature output of the encoder, recorded as:

$LR = \mathrm{Att}_{level}(Y) \cdot Y \in \mathbb{R}^{1 \times T \times f}$

where $LR$ is the output of the level attention module, $Y \in \mathbb{R}^{d \times T \times f}$ is the multi-layer feature output of the encoder, and $\mathrm{Att}_{level}(\cdot)$ is the level-dimension attention operation, implemented as a neural network consisting of two fully connected layers and a Sigmoid activation function; $d$ is the number of stacked layers, $T$ the input speech length, and $f$ the hidden-layer feature dimension.
Secondly, a time-sequence and frequency-dimension attention module is constructed: attention operations are performed on the time-sequence dimension and the frequency dimension of the level attention output, yielding a time-sequence attention matrix and a frequency attention matrix, which are multiplied with the level attention output, recorded as:

$LTFR = \mathrm{Att}_{T}(LR) \cdot \mathrm{Att}_{F}(LR) \cdot LR \in \mathbb{R}^{1 \times T \times f}$

where $LTFR$ is the output of the time and frequency attention module, $LR$ is the output of the level attention module, $\mathrm{Att}_{T}(\cdot)$ is the time-sequence attention operation, implemented as a neural network consisting of a global average pooling layer, two fully connected layers, and a Sigmoid activation function, and $\mathrm{Att}_{F}(\cdot)$ is the frequency-dimension attention operation, implemented as a neural network consisting of two fully connected layers and a Sigmoid activation function; $T$ is the input speech length and $f$ the hidden-layer feature dimension.
By adopting the multi-attention module, the hidden-layer feature information of different levels in speech coding is fully utilized, and time-sequence and frequency-dimension information is captured at the same time, providing more effective speech representation information for downstream tasks.
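A hedged PyTorch sketch of the multi-attention module (LTFAtt) described above; the bottleneck width and the exact pooling used to form each attention input are assumptions:

```python
import torch
import torch.nn as nn

class LTFAtt(nn.Module):
    """Multi-attention module sketch: level attention over the d stacked
    layers, followed by time-sequence and frequency attention."""
    def __init__(self, d: int, f: int, bottleneck: int = 64):
        super().__init__()
        # level attention: two fully connected layers + Sigmoid over d layers
        self.att_level = nn.Sequential(
            nn.Linear(d, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, d), nn.Sigmoid())
        # time attention: pooling over frequency + two FC layers + Sigmoid
        self.att_time = nn.Sequential(
            nn.Linear(f, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, 1), nn.Sigmoid())
        # frequency attention: two FC layers + Sigmoid
        self.att_freq = nn.Sequential(
            nn.Linear(f, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, f), nn.Sigmoid())

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, d, T, f), the stacked encoder hidden layers
        w_level = self.att_level(y.mean(dim=(2, 3)))       # (batch, d)
        lr = (y * w_level[:, :, None, None]).sum(dim=1)    # (batch, T, f)
        w_time = self.att_time(lr)                         # (batch, T, 1)
        w_freq = self.att_freq(lr.mean(dim=1, keepdim=True))  # (batch, 1, f)
        return lr * w_time * w_freq                        # (batch, T, f)
```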
S41, after the trained air traffic control speech coding model, constructing an auxiliary task classifier based on the multi-attention module, including a speaker role classifier, an instruction language classifier, a speaker gender classifier, and an instruction intention classifier; all auxiliary task classifiers have the same structure but independent parameters, specifically:
firstly, a multi-attention module is constructed, learning information more important for the recognition result from the combined level, time-sequence, and frequency attention mechanisms, with the attention parameters optimized by learning; secondly, the attention results of the multi-attention modules are input into the speech recognition classifier, providing internal representations of the various tasks for speech recognition; furthermore, the attention result of the multi-attention module is input into a fully connected layer, and the category with the highest probability is taken as the auxiliary task classification result.
S42, after the trained air traffic control speech coding model, constructing a speech recognition classifier based on the multi-attention module, specifically:
similarly, a multi-attention module is first constructed, learning information more important for the recognition result from the combined level, time-sequence, and frequency attention mechanisms, with the attention parameters optimized by learning; secondly, the attention result of this multi-attention module is added to the multi-attention outputs of all auxiliary task classifiers to obtain speech features containing multiple kinds of speech information, recorded as:

$X_{ASR} = LTFR_{ASR} + \sum_{i=1}^{n} LTFR_{aux\_i} \in \mathbb{R}^{1 \times T \times f}$

where $X_{ASR}$ is the speech feature containing multiple kinds of speech information, $LTFR_{ASR}$ is the multi-attention output of the speech recognition classifier, $LTFR_{aux\_i}$ is the multi-attention output of the $i$-th auxiliary task classifier, $n$ is the number of auxiliary task classifiers, $T$ the input speech length, and $f$ the hidden-layer feature dimension;
and finally, inputting the voice characteristics of the multiple voice information into the full-connection layer to obtain a text recognition result corresponding to each voice frame.
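A minimal sketch of this speech recognition head, assuming the feature summation and fully connected layer described above; the vocabulary size follows Example 2 below, and in practice a CTC blank token would also be added:

```python
from typing import List

import torch
import torch.nn as nn

class SpeechRecognitionHead(nn.Module):
    """Sums the classifier's own multi-attention output with those of all
    auxiliary task classifiers (X_ASR above), then maps each frame to
    character logits via a fully connected layer."""
    def __init__(self, f: int = 512, vocab_size: int = 668):
        super().__init__()
        self.fc = nn.Linear(f, vocab_size)

    def forward(self, ltfr_asr: torch.Tensor,
                ltfr_aux: List[torch.Tensor]) -> torch.Tensor:
        x_asr = ltfr_asr + sum(ltfr_aux)  # (batch, T, f)
        return self.fc(x_asr)             # per-frame logits: (batch, T, vocab)
```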
S5, establishing a loss function for the multi-task air traffic control speech recognition model that accounts for the speech recognition and auxiliary classification tasks simultaneously;
Specifically, the loss function of the speech recognition classifier uses the connectionist temporal classification loss (CTCLoss), and the loss functions of the auxiliary task classifiers all use the cross-entropy loss (CrossEntropyLoss). The loss function of the multi-task air traffic control speech recognition model is constructed as a weighted sum of the loss function of the speech recognition classifier and the loss functions of the auxiliary task classifiers, with the weight of each task loss adjustable as a parameter during model training. The loss function $L$ is recorded as:

$L = \lambda_{ASR} L_{ASR} + \sum_{i=1}^{n} \lambda_{aux\_i} L_{aux\_i}$

where $L_{ASR}$ and $L_{aux\_i}$ denote the loss values of the speech recognition classifier and the $i$-th auxiliary task classifier respectively, $\lambda_{ASR}$ and $\lambda_{aux\_i}$ denote the weights of the speech recognition loss and the $i$-th auxiliary task loss respectively, and $n$ is the number of auxiliary task classifiers; the weights of the speech recognition loss and all auxiliary task losses are determined by learning and optimized together during model training.
Further, each weight is determined by a corresponding uncertainty variable $\sigma_i$. Each uncertainty variable is a scalar that is updated and optimized during training, and the uncertainty variables determine the corresponding loss weights in the standard uncertainty-weighting manner:

$\lambda_i = \dfrac{1}{2\sigma_i^2}$

with a regularization term $\log \sigma_i$ added to the total loss for each task, so that tasks with higher uncertainty receive smaller weights.
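A sketch of the multi-task loss under these assumptions: CTCLoss for speech recognition, CrossEntropyLoss for the auxiliary tasks, and learnable log-variance parameters implementing the 1/(2σ²)-plus-log σ uncertainty weighting; parameterizing log(σ²) for numerical stability is an implementation choice, not something the patent specifies:

```python
import torch
import torch.nn as nn

class MultiTaskLoss(nn.Module):
    """Weighted sum of a CTC speech-recognition loss and cross-entropy
    auxiliary losses; each task contributes 0.5*exp(-log_var)*loss
    + 0.5*log_var, with log_var learned jointly with the model."""
    def __init__(self, num_aux: int):
        super().__init__()
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)
        self.ce = nn.CrossEntropyLoss()
        # index 0 = speech recognition task, 1..num_aux = auxiliary tasks
        self.log_var = nn.Parameter(torch.zeros(num_aux + 1))

    def forward(self, log_probs, targets, input_lens, target_lens,
                aux_logits, aux_labels):
        # log_probs: (T, batch, vocab) log-softmax output, as CTCLoss expects
        losses = [self.ctc(log_probs, targets, input_lens, target_lens)]
        # aux_logits[i]: (batch, classes_i); aux_labels[i]: (batch,)
        losses += [self.ce(lg, lb) for lg, lb in zip(aux_logits, aux_labels)]
        total = torch.zeros((), device=log_probs.device)
        for loss, lv in zip(losses, self.log_var):
            total = total + 0.5 * torch.exp(-lv) * loss + 0.5 * lv
        return total
```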
S6, training the multi-task air traffic control voice recognition model until the network converges on the basis of a loss function of the multi-task air traffic control voice recognition model and a second-stage training data set; adopting a loop iteration training mode, and executing the following operations in a single loop iteration training process:
s61, randomly selecting a group of training samples from the second-stage training data set;
s62, inputting the training sample into a multi-task air traffic control voice recognition model, and outputting a voice recognition classification result and each auxiliary task classification result;
and S63, adjusting the relevant parameters of the multi-task air traffic control speech recognition model using its loss function.
S7, acquiring a Chinese-English speech signal without text labels in the real-time ground-air communication environment, segmenting it by sentence into air traffic control voice data, and inputting the data into the trained multi-task air traffic control speech recognition model to obtain the text recognition result and multi-task attribute classification result of the air traffic control voice data.
Specifically, the air traffic control voice data is input into the trained multi-task speech recognition model, which outputs multi-task label probabilities; the category with the highest probability is taken as the auxiliary task classification result. Further, the model predicts, from its output, the text label probability corresponding to each speech frame, and the instruction text is decoded and output according to the maximum probability.
In conclusion, the invention simultaneously introduces self-supervised pre-training, multi-attention, and multi-task learning mechanisms, and designs an end-to-end, deep-learning-based multi-task air traffic control speech recognition method and model for mixed Chinese-English speech, improving speech recognition accuracy in the air traffic control scenario while performing multi-task attribute classification in real time and providing more usable information for air traffic control post-hoc analysis and other downstream applications.
Example 2
The feasibility and the performance of the technical scheme of the embodiment 1 are verified:
firstly, data preparation is carried out, a data acquisition scheme provided in embodiment 1 is adopted, and a Chinese and English voice signal without a text label under a ground-air communication environment is obtained through an air traffic control intercom system, so that a first-stage pre-training data set and a second-stage training data set are obtained, and a training set, a verification set or a test set is formed by a random selection strategy.
Wherein, the pre-training data set in the first stage is as follows:
the training set comprises 774083 data in total for 640.40 hours, and the verification set comprises 7749 data in total for 6.40 hours;
the second stage training data set is:
the training set comprises 58432 pieces of data for 53.56 hours, wherein 43178 pieces of Chinese data comprise 37.00 hours, 15254 pieces of English data comprise 16.56 hours; the test set comprises 1603 pieces of data for 1.45 hours, wherein the Chinese data are 1202 pieces of data for 1.01 hours, and the English data are 401 pieces of data for 0.44 hours; in the second stage of training, the vocabulary has a total of 668 characters, including 641 Chinese characters, 26 English letters and spaces.
The test results of example 2 were obtained by performing speech recognition on the test set.
Secondly, a baseline model is established: this embodiment uses a wav2vec 2.0 model as the air traffic control speech coding model, followed by a speech recognition classifier consisting only of a fully connected layer, to verify effectiveness; the model input is the raw waveform of the speech file.
The baseline model and the technical method described in embodiment 1 are implemented using the PyTorch framework; the hyper-parameter configuration of model training is as follows:
learning rate: the initial learning rate is set to 1e-5, using a tri-stage learning rate schedule: warm-up over the first 10% of updates, a constant learning rate for the next 40%, and linear decay for the rest (see the sketch after this list);
batch training size: 8.
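The tri-stage schedule can be sketched as a plain function of the update step; the decay floor is an assumption:

```python
def tri_stage_lr(step: int, total_steps: int, peak_lr: float = 1e-5,
                 warmup_frac: float = 0.1, hold_frac: float = 0.4,
                 floor: float = 0.05) -> float:
    """Linear warm-up over the first 10% of updates, hold for the next 40%,
    then linear decay toward floor * peak_lr over the remaining updates."""
    warmup = int(total_steps * warmup_frac)
    hold = int(total_steps * hold_frac)
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    if step < warmup + hold:
        return peak_lr
    decay_steps = max(1, total_steps - warmup - hold)
    frac = (step - warmup - hold) / decay_steps
    return peak_lr * (1.0 - (1.0 - floor) * frac)
```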
the hardware environment adopted for the experiments is: CPU 2 × Intel Xeon E5-2680 v4; GPU 1 × NVIDIA GeForce RTX 2080Ti with 11 GB of video memory; 128 GB of RAM; operating system Ubuntu 16.04;
in the above training data and configuration, a total of 8 experiments A1-A8 were performed, as follows:
a1: training the baseline model on the second stage training data set only to complete the speech recognition task;
a2: a pre-training learning mechanism is added during the training of the baseline model, firstly, the voice coding model part of the baseline model is pre-trained, and then, the baseline model is trained in a second stage so as to complete a voice recognition task;
a3: a multi-task learning mechanism is added during the training of the baseline model, and training is only carried out on the training data set of the second stage so as to complete the voice recognition and multi-attribute classification tasks;
a4: adding multiple attention modules during the training of the baseline model, and only training on a second-stage training data set to complete a voice recognition task, wherein the number of hidden layers used in the voice coding model is 6;
a5: a multi-task learning mechanism and a multi-attention module are added during the training of the baseline model, and training is only carried out on a second-stage training data set to complete the voice recognition and multi-attribute classification tasks, wherein the number of hidden layers used in the voice coding model is 6;
a6: pre-training and multi-task learning mechanisms are added during base line model training, firstly, pre-training is carried out on a voice coding model part of a base line model, and then second-stage training is carried out on the base line model so as to complete voice recognition and multi-attribute classification tasks;
a7: a pre-training learning mechanism and a multi-attention module are added during the training of the baseline model, firstly, the voice coding model part of the baseline model is pre-trained, and then the baseline model is trained in the second stage to complete the voice recognition task, wherein the number of hidden layers used in the voice coding model is 6;
a8: when the baseline model is trained, a pre-training mechanism, a multi-task learning mechanism and a multi-attention module are added at the same time, firstly, the voice coding model part of the baseline model is pre-trained, and then, the baseline model is trained at the second stage to complete the voice recognition and multi-attribute classification tasks, wherein the number of hidden layers used in the voice coding model is 6;
The auxiliary task results in the experiments are measured by accuracy, i.e., the proportion of correctly classified samples among all samples, and speech recognition accuracy is measured by the character error rate (CER) over Chinese characters and English letters, as follows:

$CER = \dfrac{I + D + S}{N}$

where $N$ is the length of the reference text label, and $I$, $D$, $S$ denote the numbers of insertion, deletion, and substitution operations required to convert the predicted text label into the reference label.
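A self-contained sketch of the CER computation via Levenshtein distance:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: (I + D + S) / N, computed per character."""
    n, m = len(reference), len(hypothesis)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                 # deletions
    for j in range(m + 1):
        dist[0][j] = j                 # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[n][m] / max(1, n)
```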
In summary, the verification of the technical scheme of the invention considers only the performance of the acoustic model and does not involve language model processing or optimization; the final results are compared in table 1.
TABLE 1
(results table reproduced as an image in the original document)
According to the experimental results, compared with the baseline model, the pre-training mechanism, multi-task learning mechanism, and multi-attention module proposed in this scheme each improve the performance of the speech recognition model on the data set of this embodiment. Compared with methods that do not introduce a pre-training mechanism, introducing pre-training yields a large performance improvement on this data set, showing that on air traffic control data, pre-training learns more robust speech feature representations that ultimately support air traffic control speech recognition research. Furthermore, introducing the multi-task learning and multi-attention modules improves speech recognition performance to a certain degree; and by introducing the pre-training mechanism, the multi-task learning mechanism, and the multi-attention module together, the baseline model achieves the best speech recognition performance on this data set.
In conclusion, the method adopts a pre-training and multi-task learning mechanism and a multi-attention module, plays a great role in promoting the performance of the air traffic control speech recognition model, and can improve the convergence efficiency of the model.
The above description is intended to be illustrative of the preferred embodiment of the present invention and should not be taken as limiting the invention, but rather, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims (10)

1. A multi-task air traffic control voice recognition method based on pre-training is characterized by comprising the following steps:
S1, acquiring air traffic control voice data and preprocessing it to obtain a training sample data set, which comprises a first-stage pre-training data set and a second-stage training data set carrying manual text labels and auxiliary-task attribute labels;
s2, constructing an air traffic control speech coding model based on pre-training;
s3, inputting the pre-training data set of the first stage into the air traffic control speech coding model for pre-training;
s4, constructing a multi-task air traffic control voice recognition model based on the pre-trained air traffic control voice coding model;
s5, establishing a loss function of the multitask empty pipe voice recognition model;
s6, training the multi-task air traffic control voice recognition model based on a loss function of the multi-task air traffic control voice recognition model and a second-stage training data set;
s7, inputting the real-time ground-air communication voice data segmented according to sentences into a trained multi-task air traffic control voice recognition model to obtain a text recognition result and an auxiliary task recognition result;
the air traffic control speech coding model consists of 1 convolution module, 1 context extraction module, and 1 output module;
the multi-task air traffic control voice recognition model is composed of an air traffic control voice coding model, a plurality of auxiliary task classifiers and a voice recognition classifier, wherein the plurality of auxiliary task classifiers and the voice recognition classifier share the air traffic control voice coding model, and the output of the air traffic control voice coding model is used as the input of each classifier.
2. The pre-training-based multi-task air traffic control speech recognition method according to claim 1, wherein the air traffic control voice data in step S1 is a Chinese-English speech signal without text labels, and the preprocessing comprises the following steps:
s11, after voice emphasis and framing pretreatment are carried out on the air traffic control voice data, the pretreated air traffic control voice data is segmented according to sentences;
s12, taking all the segmented air traffic control voice data as a pre-training data set in a first stage, wherein each training sample only comprises a single-sentence voice audio file;
and S13, selecting part of the segmented air traffic control voice data for manual text labeling and auxiliary-task attribute labeling as the second-stage training data set, wherein each training sample comprises a single-sentence speech audio file, a corresponding text label, and attribute classification labels.
3. The method for multi-task air traffic control speech recognition based on pre-training according to claim 1, wherein the step S2 of constructing the pre-training-based air traffic control speech coding model comprises:
s21, establishing a convolution module consisting of a one-dimensional convolution layer and an activation function layer, and extracting the voice characteristics of the training sample by using the convolution module;
s22, establishing a context extraction module consisting of a deep neural network, extracting context information of the voice features by using the context extraction module, and recording the context information as:
$h_i = \mathrm{Encoder}_i(h_{i-1}), \quad h_0 = c, \quad h_i \in \mathbb{R}^{1 \times T \times f}, \quad i = 1, \dots, N$

where $c$ is the output of the convolution module, $h_i$ is the hidden-layer feature output by the $i$-th layer of the neural network, $N$ is the total number of layers of the deep neural network, $T$ is the input speech length, $f$ is the hidden-layer feature dimension, and $\mathbb{R}^{1 \times T \times f}$ denotes a feature of dimension $(1, T, f)$;
s23, establishing an output module, and extracting the last context in the moduledAnd stacking the output of the layer hidden layer as the output of the empty pipe speech coding model, and recording as:
Figure DEST_PATH_IMAGE004A
wherein, the first and the second end of the pipe are connected with each other,Yfor the multi-layer feature output of the encoder,hfor each layer of neural network output hidden layer features,dthe number of layers of the deep neural network,Nis the total number of layers of the deep neural network,Tin order to input the length of the speech,fin order to be a dimension of the hidden layer feature,Rthe number is a real number set,R d × T × f represents a characteristic dimension of (d, T, f)。
4. The pre-training-based multi-task air traffic control speech recognition method according to claim 1, wherein the construction of the multi-task air traffic control speech recognition model in step S4 comprises:
s41, constructing a multi-attention module, and constructing an auxiliary task classifier based on the multi-attention module, wherein the auxiliary task classifier comprises a speaker role classifier, an instruction language classifier, a speaker gender classifier and an instruction intention classifier;
and S42, constructing a multi-attention module, and constructing a voice recognition classifier based on the multi-attention module.
5. The pre-training-based multi-task air traffic control speech recognition method according to claim 4, wherein the construction of the multi-attention module comprises the following steps:
constructing a level attention module: an attention operation is performed over the level dimension of the multi-layer feature output of the encoder to obtain a level-dimension attention matrix, which is multiplied with the multi-layer feature output of the encoder, recorded as:

$LR = \mathrm{Att}_{level}(Y) \cdot Y \in \mathbb{R}^{1 \times T \times f}$

where $LR$ is the output of the level attention module, $Y \in \mathbb{R}^{d \times T \times f}$ is the multi-layer feature output of the encoder, $\mathrm{Att}_{level}(\cdot)$ is the level-dimension attention operation, $d$ is the number of layers of the deep neural network, $T$ the input speech length, and $f$ the hidden-layer feature dimension;
constructing an attention module for the time-sequence and frequency dimensions: attention operations are performed on the time-sequence dimension and the frequency dimension of the level attention output, yielding a time-sequence attention matrix and a frequency attention matrix, which are multiplied with the level attention output, recorded as:

$LTFR = \mathrm{Att}_{T}(LR) \cdot \mathrm{Att}_{F}(LR) \cdot LR \in \mathbb{R}^{1 \times T \times f}$

where $LTFR$ is the output of the time-sequence and frequency attention module, $LR$ is the output of the level attention module, $\mathrm{Att}_{T}(\cdot)$ is the time-sequence attention operation, $\mathrm{Att}_{F}(\cdot)$ is the frequency-dimension attention operation, $T$ the input speech length, and $f$ the hidden-layer feature dimension.
6. The pre-training-based multi-task air traffic control speech recognition method according to claim 4, wherein the construction of the auxiliary task classifier in S41 further comprises:
s411, inputting output results of the multiple attention modules into the voice recognition classifier;
and S412, inputting the output result of the multi-attention module into a full connection layer to obtain an auxiliary task classification result.
7. The method according to claim 4, wherein the constructing of the speech recognition classifier in S42 further comprises:
s421, adding the output result of the multi-attention module and the output of the multi-attention module of the auxiliary task classifier to obtain the voice characteristics including various voice information, and recording as:
Figure DEST_PATH_IMAGE016A
wherein, the first and the second end of the pipe are connected with each other,X ASR in order to include the speech characteristics of a variety of speech information,LTFR ASR for the multi-attention module output of the speech recognition classifier,LTFR aux_i is a firstiThe multi-attention module output of each auxiliary task classifier,iis shown asiAn auxiliary task classifier for classifying the task of the task,nindicates the number of the auxiliary task classifiers,Tin order to input the length of the speech,fis the dimension of the hidden layer feature(s),Rthe number is a real number set,R 1 × T × f the representation feature dimension is (1,T, f);
s422, inputting the voice characteristics containing various voice information into the full connection layer to obtain a text recognition result.
8. The pre-training-based multi-task air traffic control voice recognition method according to claim 1, wherein in step S5 the loss function of the multi-task air traffic control voice recognition model is constructed as a weighted sum of the loss function of the voice recognition classifier and the loss functions of the auxiliary task classifiers, the weight of each task loss being adjusted as a parameter during model training; the voice recognition classifier adopts the connectionist temporal classification (CTC) loss, each auxiliary task classifier adopts the cross-entropy loss, and the loss function of the multi-task air traffic control voice recognition model is denoted $L$:

$L = \lambda_{ASR} L_{ASR} + \sum_{i=1}^{n} \lambda_{aux\_i} L_{aux\_i}$

wherein $L_{ASR}$ and $L_{aux\_i}$ denote the loss values of the voice recognition classifier and the $i$-th auxiliary task classifier respectively, $\lambda_{ASR}$ and $\lambda_{aux\_i}$ denote the weights of the voice recognition loss and the $i$-th auxiliary task loss respectively, and $n$ is the number of auxiliary task classifiers.
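A sketch of this combined loss. CTC for the recognition head and cross-entropy for the auxiliary heads follow the claim; realizing the task weights as trainable nn.Parameter values is one possible reading of "adjusted as a parameter in the model training process":

```python
import torch
import torch.nn as nn

class MultiTaskLoss(nn.Module):
    """Weighted sum of a CTC loss and n auxiliary cross-entropy losses,
    with the weights themselves learned during training (an assumption)."""
    def __init__(self, num_aux: int):
        super().__init__()
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)
        self.ce = nn.CrossEntropyLoss()
        self.weights = nn.Parameter(torch.ones(num_aux + 1))  # [ASR, aux_1..aux_n]

    def forward(self, log_probs, targets, in_lens, tgt_lens, aux_logits, aux_labels):
        # log_probs: (T, N, vocab) log-softmax output required by nn.CTCLoss
        loss = self.weights[0] * self.ctc(log_probs, targets, in_lens, tgt_lens)
        for i, (logit, label) in enumerate(zip(aux_logits, aux_labels), start=1):
            loss = loss + self.weights[i] * self.ce(logit, label)
        return loss
```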
9. The pre-training-based multi-task air traffic control voice recognition method according to claim 1, wherein the training in step S6 is iterative loop training, and a single training iteration proceeds as follows:
S61, selecting a group of training samples from the second-stage training data set;
S62, inputting the training samples into the multi-task air traffic control voice recognition model, and outputting a text recognition result and an auxiliary task classification result;
S63, adjusting the parameters of the multi-task air traffic control voice recognition model based on the loss function of the multi-task air traffic control voice recognition model.
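One possible shape of this single-iteration loop; the model's output signature, the batch layout, and the optimizer choice are assumptions for the sketch:

```python
import torch

def train_step(model, criterion, optimizer, batch):
    """One loop iteration: forward pass (S62), then parameter update from
    the multi-task loss (S63)."""
    feats, targets, in_lens, tgt_lens, aux_labels = batch  # S61: one sample group
    log_probs, aux_logits = model(feats)
    loss = criterion(log_probs, targets, in_lens, tgt_lens, aux_logits, aux_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```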
10. A pre-training-based multi-task air traffic control voice recognition device, comprising at least one processor and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
CN202211118845.1A 2022-09-15 2022-09-15 Multi-task air traffic control voice recognition method and device based on pre-training Active CN115206293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211118845.1A CN115206293B (en) 2022-09-15 2022-09-15 Multi-task air traffic control voice recognition method and device based on pre-training

Publications (2)

Publication Number Publication Date
CN115206293A CN115206293A (en) 2022-10-18
CN115206293B true CN115206293B (en) 2022-11-29

Family

ID=83572350

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211118845.1A Active CN115206293B (en) 2022-09-15 2022-09-15 Multi-task air traffic control voice recognition method and device based on pre-training

Country Status (1)

Country Link
CN (1) CN115206293B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116168690B (en) * 2023-04-19 2023-08-01 易方信息科技股份有限公司 Real-time voice desensitization method, system, equipment and medium based on deep learning
CN116504234B (en) * 2023-05-29 2023-10-13 镁佳(北京)科技有限公司 Method, device, equipment and medium for generating voice awakening and detecting model
CN116453514B (en) * 2023-06-08 2023-08-25 四川大学 Multi-view-based voice keyword detection and positioning method and device
CN117577116B (en) * 2024-01-17 2024-03-19 清华大学 Training method, device, equipment and medium for continuously learning voice identification model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11222627B1 (en) * 2017-11-22 2022-01-11 Educational Testing Service Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system
US11257481B2 (en) * 2018-10-24 2022-02-22 Tencent America LLC Multi-task training architecture and strategy for attention-based speech recognition system
US11238845B2 (en) * 2018-11-21 2022-02-01 Google Llc Multi-dialect and multilingual speech recognition
CN113889090A (en) * 2021-09-29 2022-01-04 北京中科智加科技有限公司 Multi-language recognition model construction and training method based on multi-task learning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2405422A1 (en) * 2010-07-08 2012-01-11 Honeywell International, Inc. Speech recognition and voice training data storage and access method and apparatus
EP2874133A1 (en) * 2013-11-14 2015-05-20 Honeywell International Inc. Aircraft systems and methods for reducing and detecting read-back and hear-back errors
CN112420024A (en) * 2020-10-23 2021-02-26 四川大学 Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device
CN113160798A (en) * 2021-04-28 2021-07-23 厦门大学 Chinese civil aviation air traffic control voice recognition method and system
CN113284485A (en) * 2021-07-09 2021-08-20 中国科学院自动化研究所 End-to-end framework for unified Chinese and English mixed text generation and speech recognition
CN114582330A (en) * 2022-03-11 2022-06-03 中国科学技术大学 Training method of voice recognition model, voice recognition method and electronic equipment
CN114596845A (en) * 2022-04-13 2022-06-07 马上消费金融股份有限公司 Training method of voice recognition model, voice recognition method and device
CN114944150A (en) * 2022-05-07 2022-08-26 深圳职业技术学院 Dual-task-based Conformer land-air communication acoustic model construction method
CN114648982A (en) * 2022-05-24 2022-06-21 四川大学 Controller voice recognition method and device based on comparative learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Unified Framework for Multilingual Speech Recognition in Air Traffic Control Systems; Yi Lin; IEEE Transactions on Neural Networks and Learning Systems; 2020-08-24; full text *
Automatic segmentation of air-ground communication based on CGRU multi-input features; Lin Yi; Journal of Sichuan University; 2020-08-28; full text *
Research and application of speech recognition technology for civil aviation air-ground communication; Zhou Kai; China Masters' Theses Full-text Database; 2021-07-15 (No. 7); full text *

Also Published As

Publication number Publication date
CN115206293A (en) 2022-10-18

Similar Documents

Publication Publication Date Title
CN115206293B (en) Multi-task air traffic control voice recognition method and device based on pre-training
Schuller et al. The INTERSPEECH 2021 computational paralinguistics challenge: COVID-19 cough, COVID-19 speech, escalation & primates
CN111837178B (en) Speech processing system and method for processing speech signal
Chen et al. End-to-end neural network based automated speech scoring
Ferrer et al. Study of senone-based deep neural network approaches for spoken language recognition
CN112233646B (en) Voice cloning method, system, equipment and storage medium based on neural network
CN112017644A (en) Sound transformation system, method and application
CN111400469A (en) Intelligent generation system and method for voice question answering
CN107408384A (en) Deployed end-to-end speech recognition
CN110827801A (en) Automatic voice recognition method and system based on artificial intelligence
Pokorny et al. Detection of negative emotions in speech signals using bags-of-audio-words
CN109887484A (en) Speech recognition and speech synthesis method and device based on dual learning
CN109741732A (en) Named entity recognition method, device, equipment and medium
GB2326320A (en) Text to speech synthesis using neural network
CN109559736A (en) Automatic dubbing method for film actors based on adversarial networks
CN109754790A (en) Speech recognition system and method based on a hybrid acoustic model
CN109671423A (en) Non-parallel text compressing method under the limited situation of training data
Zhao et al. End-to-end-based Tibetan multitask speech recognition
CN112184859A (en) End-to-end virtual object animation generation method and device, storage medium and terminal
Soliman et al. Isolated word speech recognition using convolutional neural network
CN116229932A (en) Voice cloning method and system based on cross-domain consistency loss
Nagano et al. Data augmentation based on vowel stretch for improving children's speech recognition
CN111090726A (en) NLP-based electric power industry character customer service interaction method
CN115455136A (en) Intelligent digital human marketing interaction method and device, computer equipment and storage medium
CN114944150A (en) Dual-task-based Conformer land-air communication acoustic model construction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant