CN115206293A - Multi-task air traffic control voice recognition method and device based on pre-training - Google Patents

Multi-task air traffic control voice recognition method and device based on pre-training

Info

Publication number
CN115206293A
Authority
CN
China
Prior art keywords
training
task
dimension
traffic control
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211118845.1A
Other languages
Chinese (zh)
Other versions
CN115206293B (en)
Inventor
张子宸
林毅
张建伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202211118845.1A priority Critical patent/CN115206293B/en
Publication of CN115206293A publication Critical patent/CN115206293A/en
Application granted granted Critical
Publication of CN115206293B publication Critical patent/CN115206293B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 — Segmentation; Word boundary detection
    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 — Training
    • G10L 15/08 — Speech classification or search
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/26 — Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a multi-task air traffic control (ATC) voice recognition method and device based on pre-training. The method first acquires and preprocesses air traffic control voice data to obtain a training sample data set, which is divided into a first-stage pre-training data set and a second-stage training data set. It then constructs an air traffic control speech coding model; inputs the first-stage pre-training data set into the speech coding model for pre-training; constructs a multi-task air traffic control voice recognition model on top of the pre-trained speech coding model; establishes a loss function for the multi-task model; and trains the multi-task model with this loss function and the second-stage training data set. Finally, sentence-segmented air traffic control voice data is fed into the trained multi-task model and the results are output. By training on fewer labeled samples, the invention achieves faster and more accurate speech recognition.

Description

Multi-task air traffic control voice recognition method and device based on pre-training
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a multi-task air traffic control voice recognition method and device based on pre-training.
Background
In reality, related problems are often interconnected, and multi-task learning exploits the information hidden in multiple related tasks to improve a model's generalization ability, letting the model learn better feature representations and improving the performance of every task. Moreover, because network parameters are shared among tasks, the results of multiple tasks can be obtained in a single inference pass, which significantly reduces the amount of training data and model parameters required and makes inference more efficient.
In recent years, more and more fields of artificial intelligence have turned to unsupervised pre-training. Unsupervised pre-training uses large amounts of unlabeled data to train a general network model with strong generalization ability; the model is then fine-tuned on a small amount of labeled data for different downstream tasks, so that excellent performance is ultimately obtained from relatively few labeled samples.
In the field of intelligent air traffic control, ATC speech annotated with multiple attribute labels can supply more information sources for ATC safety assistance measures and more information for post-hoc analysis. At present there is no good way to simultaneously provide text transcription and multi-attribute classification for air traffic control speech, so this application proposes a pre-training-based multi-task air traffic control voice recognition method and device to improve recognition performance in the ATC domain and to classify multiple attributes of ATC speech.
Disclosure of Invention
The invention aims to address the lack, in the prior art, of a good way to perform text recognition and multi-attribute classification on ground-air communication in real time, by providing a multi-task air traffic control voice recognition method and device based on pre-training.
To achieve this aim, the invention adopts the following technical scheme:
A multi-task air traffic control voice recognition method based on pre-training comprises the following steps:
S1, acquiring and preprocessing air traffic control voice data to obtain a training sample data set, which comprises a first-stage pre-training data set and a second-stage training data set that is manually annotated with text labels and auxiliary task attribute labels;
S2, constructing a pre-training-based air traffic control speech coding model;
S3, inputting the first-stage pre-training data set into the air traffic control speech coding model for pre-training;
S4, constructing a multi-task air traffic control voice recognition model based on the pre-trained air traffic control speech coding model;
S5, establishing a loss function of the multi-task air traffic control voice recognition model;
S6, training the multi-task air traffic control voice recognition model based on its loss function and the second-stage training data set;
and S7, inputting air traffic control voice data segmented by sentence into the trained multi-task air traffic control voice recognition model to obtain a text recognition result and auxiliary task recognition results.
As a preferred scheme of the present invention, in step S1 the air traffic control voice data are Chinese-English speech signals without text labels, and the step comprises:
S11, performing speech pre-emphasis and framing preprocessing on the air traffic control voice data, then segmenting the preprocessed data by sentence;
S12, taking all the segmented air traffic control voice data as the first-stage pre-training data set, where each training sample contains only a single-sentence audio file;
and S13, selecting part of the segmented air traffic control voice data for manual text labeling and auxiliary task attribute labeling as the second-stage training data set, where each training sample contains a single-sentence audio file, the corresponding text label and the attribute classification labels.
As a preferred scheme of the invention, the construction of the pre-training-based air traffic control speech coding model in step S2 comprises:
S21, establishing a convolution module consisting of one-dimensional convolution layers and activation function layers, and using it to extract the speech features of the training samples;
S22, establishing a context extraction module consisting of a deep neural network, using it to extract the context information of the speech features, recorded as:
h = {h_1, h_2, …, h_N},  h_i = Encoder_i(h_{i−1}),  h_0 = c,  h_i ∈ R^(1×T×f)

where c is the output of the convolution module, h is the set of hidden-layer features output by the layers of the neural network, h_i is the hidden-layer feature output by the i-th layer, N is the total number of layers of the deep neural network, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f);
S23, establishing an output module that stacks the outputs of the last d hidden layers of the context extraction module as the output of the air traffic control speech coding model, recorded as:

Y = [h_{N−d+1}; h_{N−d+2}; …; h_N],  Y ∈ R^(d×T×f)

where Y is the multi-layer feature output of the encoder, h is the hidden-layer feature output by each layer of the neural network, d is the number of stacked layers, N is the total number of layers of the deep neural network, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(d×T×f) denotes a feature of dimension (d, T, f).
As a preferred scheme of the invention, the step S4 of constructing the multi-task air traffic control voice recognition model comprises:
S41, constructing a multi-attention module and, based on it, the auxiliary task classifiers, including a speaker role classifier, an instruction language classifier, a speaker gender classifier and an instruction intention classifier;
and S42, constructing a multi-attention module and, based on it, the speech recognition classifier.
As a preferred scheme of the invention, the construction of the multi-attention module comprises the following steps:
constructing a hierarchical (level) attention module, performing an attention operation along the level dimension on the multi-layer feature output of the encoder to obtain a level-dimension attention matrix, and multiplying this attention matrix by the multi-layer feature output of the encoder; the result is recorded as:
LR = Att_L(Y) ⊗ Y,  LR ∈ R^(1×T×f)

where LR is the output of the hierarchical attention module, Y is the multi-layer feature output of the encoder, Att_L(·) is the attention operation in the level dimension, d is the number of layers of the deep neural network, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, R^(d×T×f) denotes a feature of dimension (d, T, f), and R^(1×T×f) denotes a feature of dimension (1, T, f);
constructing attention modules for the time and frequency dimensions, performing attention operations on the output of the hierarchical attention module along the time dimension and the frequency dimension respectively to obtain a time-dimension attention matrix and a frequency-dimension attention matrix, multiplying both by the output of the hierarchical attention module, and recording the result as:
LTFR = Att_T(LR) ⊗ Att_F(LR) ⊗ LR,  LTFR ∈ R^(1×T×f)

where LTFR is the output of the time- and frequency-dimension attention modules, LR is the output of the hierarchical attention module, Att_T(·) is the attention operation in the time dimension, Att_F(·) is the attention operation in the frequency dimension, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f).
As a preferred scheme of the present invention, the construction of the auxiliary task classifiers in S41 further comprises:
S411, inputting the output of the multi-attention module into the speech recognition classifier;
and S412, inputting the output of the multi-attention module into a fully connected layer to obtain the auxiliary task classification result.
As a preferred embodiment of the present invention, the constructing of the speech recognition classifier in S42 further includes:
S421, adding the output of the speech recognition classifier's multi-attention module to the multi-attention outputs of all the auxiliary task classifiers to obtain a speech feature containing multiple kinds of speech information, recorded as:
X_ASR = LTFR_ASR + Σ_{i=1}^{n} LTFR_aux_i,  X_ASR ∈ R^(1×T×f)

where X_ASR is the speech feature containing multiple kinds of speech information, LTFR_ASR is the multi-attention module output of the speech recognition classifier, LTFR_aux_i is the multi-attention module output of the i-th auxiliary task classifier, n is the number of auxiliary task classifiers, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f);
S422, inputting the speech feature containing multiple kinds of speech information into the fully connected layer to obtain the text recognition result.
As a preferred scheme of the present invention, in step S5 the loss function of the multi-task air traffic control voice recognition model is constructed as a weighted sum of the loss function of the speech recognition classifier and the loss functions of the auxiliary task classifiers, with the weight of each task loss adjusted as a parameter during model training. The speech recognition classifier uses the connectionist temporal classification (CTC) loss, the auxiliary task classifiers all use the cross-entropy loss, and the loss function L of the multi-task air traffic control voice recognition model is recorded as:
L = λ_ASR · L_ASR + Σ_{i=1}^{n} λ_aux_i · L_aux_i

where L_ASR and L_aux_i denote the loss values of the speech recognition classifier and of the i-th auxiliary task classifier respectively, λ_ASR and λ_aux_i denote the weights of the speech recognition loss and of the i-th auxiliary task loss respectively, and n is the number of auxiliary task classifiers.
As a preferred scheme of the present invention, the training in step S6 is loop iteration training, and the single loop iteration training process is:
S61, selecting a group of training samples from the second-stage training data set;
S62, inputting the training samples into the multi-task air traffic control voice recognition model and outputting the text recognition result and the auxiliary task classification results;
and S63, adjusting the parameters of the multi-task air traffic control voice recognition model based on its loss function.
A device based on the pre-training multi-task air traffic control voice recognition method comprises at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor so that the at least one processor can perform any of the methods described above.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. By exploiting the shared information among multiple speech-related tasks, the neural network learns more shared speech features, improving the performance of every task. Meanwhile, because multi-task learning shares network parameters among related tasks, training and prediction are more efficient. Finally, the losses of the individual tasks are balanced in a learned manner, which speeds up training and further improves the accuracy of each task.
2. The method pre-trains the air traffic control speech coding model by self-supervised learning on as much speech data as possible, extracting the common characteristics of air traffic control speech, so that the speech coding model learns better speech feature representations, improving the model's generalization ability and the accuracy of downstream tasks. At the same time, good performance can be achieved with only a few labeled air traffic control speech samples.
3. By adopting the multi-attention module, hidden layer characteristic information of different layers in the voice coding process can be fully utilized, more important time sequence and frequency dimension information can be captured, more effective voice representation information is provided for downstream tasks, and the performance of each task is improved.
In conclusion, the method recognizes air traffic control speech faster and more accurately, needs fewer labeled samples during training, and can provide in real time classification information such as speaker role, language and instruction intention for the corresponding speech, supplying more information sources for ATC safety assistance measures and more information for post-hoc analysis.
Drawings
FIG. 1 is a model structure diagram of the air traffic control speech coding method of the present invention.
FIG. 2 is a block diagram of the multi-attention module according to the present invention.
FIG. 3 is a model structure diagram of the multi-task air traffic control voice recognition method of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1
To perform text transcription and multi-attribute classification of ground-air communication in real time, this embodiment provides a pre-training-based multi-task air traffic control voice recognition method and device, where the device comprises at least one processor and a memory communicatively connected to the at least one processor.
The multi-task air traffic control voice recognition method based on pre-training comprises the following steps:
s1, acquiring air traffic control voice data, and preprocessing to obtain a training sample data set, wherein the training sample data set comprises a pre-training data set in a first stage and a training data set in a second stage, wherein text labeling and auxiliary task attribute labeling are carried out manually;
specifically, firstly, a Chinese-English voice signal without a text label under the ground-air communication environment is obtained through an air traffic control internal call system, chinese-English air traffic control ground-air communication voice is recorded in real time from a ground-air communication voice recorder by using a multi-channel voice signal acquisition device, and the voice is filtered, sampled and PCM encoded to form air traffic control voice data with 8K sampling rate and 16bit sampling precision.
Next, the acquired air traffic control voice data is preprocessed in real time, including speech pre-emphasis and framing, and the preprocessed data is manually segmented by sentence into instruction speech segments, each containing the instructions of a single speaker, which are stored in memory as wav files. All these speech files form the first-stage pre-training data set, in which each training sample contains only a single-sentence audio file;
finally, about 50 hours of air traffic control voice data are randomly selected from the first-stage pre-training data set and manually annotated with text labels and the attribute labels of several auxiliary tasks; the annotations are stored as json files, and the speech and label files are organized into the second-stage training data set, in which each training sample contains a single-sentence audio file, the corresponding text label and the multi-task classification labels;
the auxiliary tasks comprise speaker role classification, instruction language classification, speaker gender classification and instruction intention classification. The speaker roles are air traffic controller and pilot, the instruction languages are Chinese and English, the speaker genders are male and female, and the instruction intentions are altitude or heading change instructions such as climb, descend, turn left and turn right.
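As an illustration of this preprocessing stage, the following sketch applies pre-emphasis and framing to a mono waveform in Python; the 0.97 pre-emphasis coefficient and the 25 ms / 10 ms frame parameters are common defaults assumed here, not values fixed by the patent.

import numpy as np

def pre_emphasis(signal: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    # Boost high frequencies: y[t] = x[t] - coeff * x[t-1]
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def frame_signal(signal: np.ndarray, sample_rate: int = 8000,
                 frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    # Split a waveform into overlapping frames of shape (num_frames, frame_len)
    frame_len = int(sample_rate * frame_ms / 1000)   # 200 samples at 8 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 80 samples at 8 kHz
    num_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    idx = (np.arange(frame_len)[None, :]
           + hop_len * np.arange(num_frames)[:, None])
    return signal[idx]

# Example: a stand-in for a 3-second, 8 kHz, 16-bit PCM clip scaled to [-1, 1]
wav = np.random.randn(8000 * 3).astype(np.float32)
frames = frame_signal(pre_emphasis(wav))
print(frames.shape)   # (num_frames, 200)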
S2, constructing an air traffic control speech coding model based on pre-training;
Specifically, the structure of the air traffic control speech coding model is shown in fig. 1; it consists of 1 convolution module, 1 context extraction module and 1 output module, as follows:
S21, a convolution module consisting of 7 one-dimensional convolution layers (Conv1d Layer) with activation function layers (GELU), used to extract the speech features of the input training samples, where each convolution layer uses 512 convolution kernels of size 1×3 with a stride of 2;
S22, a context extraction module composed of a deep neural network, used to extract the context information of the speech features, recorded as:

h = {h_1, h_2, …, h_N},  h_i = Encoder_i(h_{i−1}),  h_0 = c,  h_i ∈ R^(1×T×f)

where c is the output of the convolution module, h is the set of hidden-layer features output by the layers of the neural network, h_i is the hidden-layer feature output by the i-th layer, N is the total number of layers of the deep neural network, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f);
where each hidden layer adopts the Transformer encoder structure.
S23, an output module that stacks the outputs of the last d hidden layers of the context extraction module as the model output, serving as the input to the downstream speech recognition classifier and all auxiliary task classifiers so that each classifier can draw on information from more dimensions; it is recorded as:

Y = [h_{N−d+1}; h_{N−d+2}; …; h_N],  Y ∈ R^(d×T×f)

where Y is the multi-layer feature output of the encoder, composed of the stacked outputs of the last d hidden layers; d is the number of stacked layers, N is the total number of layers of the deep neural network, h is the hidden-layer feature output by each layer, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(d×T×f) denotes a feature of dimension (d, T, f).
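As a sketch of this encoder (not the patent's exact implementation), the PyTorch module below chains seven Conv1d+GELU blocks into a stack of Transformer encoder layers and returns the stacked outputs of the last d hidden layers; the 12-layer depth, the 768-dimensional hidden size and the 8 attention heads are illustrative assumptions beyond the stated 7 convolution layers with 512 kernels of size 3 and stride 2.

import torch
import torch.nn as nn

class ATCSpeechEncoder(nn.Module):
    # Conv feature extractor + Transformer context network; the forward pass
    # returns the stacked outputs of the last d hidden layers: (B, d, T, f).
    def __init__(self, n_layers: int = 12, d_last: int = 6,
                 conv_dim: int = 512, hidden_dim: int = 768):
        super().__init__()
        convs, in_ch = [], 1
        for _ in range(7):                       # 7 one-dimensional conv layers
            convs += [nn.Conv1d(in_ch, conv_dim, kernel_size=3, stride=2),
                      nn.GELU()]
            in_ch = conv_dim
        self.conv = nn.Sequential(*convs)
        self.proj = nn.Linear(conv_dim, hidden_dim)
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8,
                                        batch_first=True)
             for _ in range(n_layers)])
        self.d_last = d_last

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        c = self.conv(wav.unsqueeze(1)).transpose(1, 2)  # (B, T, conv_dim)
        h = self.proj(c)
        hiddens = []
        for layer in self.layers:                # h_i = Encoder_i(h_{i-1})
            h = layer(h)
            hiddens.append(h)
        return torch.stack(hiddens[-self.d_last:], dim=1)  # Y: (B, d, T, f)

encoder = ATCSpeechEncoder()
print(encoder(torch.randn(2, 16000)).shape)   # torch.Size([2, 6, T, 768])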
And S3, inputting the first-stage pre-training data set to pre-train the air traffic control speech coding model in a self-supervised learning manner and extract the common characteristics of air traffic control voice data, so that the speech coding model learns better speech feature representations, improving the model's generalization ability and the accuracy of downstream tasks, while requiring only a few labeled air traffic control speech samples for good performance.
Specifically, the pre-training method may follow wav2vec 2.0; pre-training is a loop of iterations, and a single iteration performs the following steps:
S31, selecting a group of training samples from the first-stage pre-training data set and inputting them into the air traffic control speech coding model, whose convolution module extracts the hidden-layer features of the training samples;
S32, mapping the hidden-layer features obtained in S31 into quantized hidden-layer features through a Gumbel-softmax quantization module;
S33, randomly masking the hidden-layer features obtained in S31 and feeding the masked features into the context extraction module to produce context features;
S34, constructing a contrastive learning loss over the context features output in S33 at the masked positions, where the positive sample for each masked position is the quantized feature at the same position from S32 and the negative samples are quantized features drawn from other masked positions;
and S35, updating the parameters through back propagation.
Further, the pre-training method can also adopt wav2vec, vq-wav2vec and the like.
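For concreteness, a single iteration of this pre-training loop might look like the simplified sketch below. It only approximates the wav2vec 2.0 objective — the diversity loss, the codebook details and proper within-utterance negative sampling are omitted — and model.extract_features, model.context_network and quantizer are assumed interfaces rather than the patent's actual code.

import torch
import torch.nn.functional as F

def contrastive_pretrain_step(model, quantizer, wav_batch, optimizer,
                              mask_prob=0.065, num_negatives=100, temp=0.1):
    # One simplified wav2vec 2.0-style update: mask latent frames, then
    # identify the quantized target at each masked position among negatives.
    z = model.extract_features(wav_batch)          # (B, T, C) conv latents
    q = quantizer(z)                               # (B, T, C) quantized targets

    mask = torch.rand(z.shape[:2], device=z.device) < mask_prob    # (B, T)
    z_masked = z.masked_fill(mask.unsqueeze(-1), 0.0)  # crude mask embedding
    ctx = model.context_network(z_masked)          # (B, T, C) context features

    anchors, positives = ctx[mask], q[mask]        # (M, C) each
    flat_q = q.reshape(-1, q.size(-1))             # candidate pool for negatives
    neg_idx = torch.randint(0, flat_q.size(0),
                            (anchors.size(0), num_negatives), device=z.device)
    negatives = flat_q[neg_idx]                    # (M, K, C)

    cand = torch.cat([positives.unsqueeze(1), negatives], dim=1)  # (M, 1+K, C)
    logits = F.cosine_similarity(anchors.unsqueeze(1), cand, dim=-1) / temp
    target = torch.zeros(logits.size(0), dtype=torch.long, device=z.device)
    loss = F.cross_entropy(logits, target)         # positive sits at index 0

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                               # S35: back-propagation update
    return loss.item()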
S4, constructing a multi-task air traffic control voice recognition model based on the air traffic control speech coding model;
The structure of the multi-task air traffic control voice recognition model is shown in fig. 2: it consists of the air traffic control speech coding model, several auxiliary task classifiers and a speech recognition classifier. The auxiliary task classifiers and the speech recognition classifier share the speech coding model, whose output serves as the input of every classifier. The specific steps are as follows:
Constructing the multiple attention module (LTFAtt), shown in fig. 3, as follows:
First, a hierarchical (level) attention module is constructed: an attention operation is performed along the level dimension on the multi-layer feature output of the encoder, yielding a level-dimension attention matrix, which is multiplied by the multi-layer feature output of the encoder; the result is recorded as:
LR = Att_L(Y) ⊗ Y,  LR ∈ R^(1×T×f)

where LR is the output of the hierarchical attention module, Y is the multi-layer feature output of the encoder, Att_L(·) is the attention operation in the level dimension, d is the number of layers of the deep neural network, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, R^(d×T×f) denotes a feature of dimension (d, T, f), and R^(1×T×f) denotes a feature of dimension (1, T, f). Att_L(·) adopts a neural network structure comprising two fully connected layers and a Sigmoid activation function.
Second, attention modules for the time and frequency dimensions are constructed: attention operations are performed on the output of the hierarchical attention module along the time dimension and the frequency dimension respectively, yielding a time-dimension attention matrix and a frequency-dimension attention matrix, which are multiplied by the output of the hierarchical attention module; the result is recorded as:
LTFR = Att_T(LR) ⊗ Att_F(LR) ⊗ LR,  LTFR ∈ R^(1×T×f)

where LTFR is the output of the time- and frequency-dimension attention modules, LR is the output of the hierarchical attention module, Att_T(·) is the attention operation in the time dimension, Att_F(·) is the attention operation in the frequency dimension, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f). Att_T(·) adopts a neural network structure comprising a global average pooling layer, two fully connected layers and a Sigmoid activation function; Att_F(·) adopts a neural network structure comprising two fully connected layers and a Sigmoid activation function.
By adopting the multi-attention module, hidden-layer feature information from different levels of the speech coding process is fully exploited, and time- and frequency-dimension information is captured at the same time, providing more effective speech representation information for the downstream tasks.
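Under the structures stated above (Att_L: two fully connected layers + Sigmoid; Att_T: global average pooling + two fully connected layers + Sigmoid; Att_F: two fully connected layers + Sigmoid), a minimal PyTorch sketch of the LTFAtt module might look as follows; how each pooling is placed, the level-attention input and the hidden sizes are illustrative assumptions rather than details fixed by the patent.

import torch
import torch.nn as nn

class LTFAtt(nn.Module):
    # Level + time + frequency attention over stacked encoder features
    # Y in (B, d, T, f), producing LTFR in (B, 1, T, f).
    def __init__(self, d_layers: int, feat_dim: int, reduction: int = 8):
        super().__init__()
        # Att_L: two FC layers + Sigmoid over the level (layer) dimension
        self.att_level = nn.Sequential(
            nn.Linear(d_layers, d_layers), nn.ReLU(),
            nn.Linear(d_layers, d_layers), nn.Sigmoid())
        # Att_T: average-pool over f, then two FC layers + Sigmoid per frame
        self.att_time = nn.Sequential(
            nn.Linear(1, 16), nn.ReLU(),
            nn.Linear(16, 1), nn.Sigmoid())
        # Att_F: two FC layers + Sigmoid over the feature (frequency) dimension
        self.att_freq = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // reduction), nn.ReLU(),
            nn.Linear(feat_dim // reduction, feat_dim), nn.Sigmoid())

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (B, d, T, f) -- stacked outputs of the last d encoder layers
        w_lvl = self.att_level(y.mean(dim=(2, 3)))               # (B, d)
        lr = (y * w_lvl[:, :, None, None]).sum(1, keepdim=True)  # (B, 1, T, f)
        w_t = self.att_time(lr.mean(dim=3, keepdim=True))        # (B, 1, T, 1)
        w_f = self.att_freq(lr.mean(dim=2, keepdim=True))        # (B, 1, 1, f)
        return lr * w_t * w_f                                    # LTFR

att = LTFAtt(d_layers=6, feat_dim=768)
print(att(torch.randn(2, 6, 50, 768)).shape)   # torch.Size([2, 1, 50, 768])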
S41, after the trained air traffic control speech coding model, constructing the auxiliary task classifiers based on the multi-attention module, including a speaker role classifier, an instruction language classifier, a speaker gender classifier and an instruction intention classifier; all auxiliary task classifiers share the same structure but have independent parameters. Specifically:
First, a multi-attention module is constructed that learns, from the combined level, time and frequency attention mechanisms, the information most important to the recognition result, optimizing the attention parameters by learning. Second, the attention output of the multi-attention module is fed into the speech recognition classifier, providing it with internal representations of the various tasks. Furthermore, the attention output of the multi-attention module is fed into a fully connected layer, and the class with the highest probability is taken as the auxiliary task classification result.
S42, after the trained air traffic control speech coding model, constructing the speech recognition classifier based on the multi-attention module, specifically as follows:
Similarly, a multi-attention module is first constructed that learns the information most important to the recognition result from the combined level, time and frequency attention mechanisms, optimizing the attention parameters by learning; then the attention output of this multi-attention module is added to the multi-attention outputs of all the auxiliary task classifiers, yielding a speech feature containing multiple kinds of speech information, recorded as:
X_ASR = LTFR_ASR + Σ_{i=1}^{n} LTFR_aux_i,  X_ASR ∈ R^(1×T×f)

where X_ASR is the speech feature containing multiple kinds of speech information, LTFR_ASR is the multi-attention module output of the speech recognition classifier, LTFR_aux_i is the multi-attention module output of the i-th auxiliary task classifier, n is the number of auxiliary task classifiers, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f);
and finally, the speech feature containing multiple kinds of speech information is fed into a fully connected layer to obtain the text recognition result corresponding to each speech frame.
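Reusing the LTFAtt sketch above, the fusion X_ASR = LTFR_ASR + Σ LTFR_aux_i and the fully connected heads could be wired together as below; the vocabulary size of 668 mirrors embodiment 2, while the auxiliary class counts and the mean-pooling over time before each auxiliary head are assumptions.

import torch
import torch.nn as nn

class MultiTaskATCModel(nn.Module):
    # Shared encoder + per-task LTFAtt modules + fully connected heads.
    def __init__(self, encoder, d_layers=6, feat_dim=768, vocab_size=668,
                 aux_classes=(2, 2, 2, 4)):   # role, language, gender, intent
        super().__init__()
        self.encoder = encoder
        self.asr_att = LTFAtt(d_layers, feat_dim)
        self.aux_atts = nn.ModuleList(
            [LTFAtt(d_layers, feat_dim) for _ in aux_classes])
        self.asr_head = nn.Linear(feat_dim, vocab_size)
        self.aux_heads = nn.ModuleList(
            [nn.Linear(feat_dim, c) for c in aux_classes])

    def forward(self, wav: torch.Tensor):
        y = self.encoder(wav)                             # (B, d, T, f)
        aux_feats = [att(y) for att in self.aux_atts]     # each (B, 1, T, f)
        x_asr = self.asr_att(y) + sum(aux_feats)          # X_ASR fusion
        asr_logits = self.asr_head(x_asr.squeeze(1))      # (B, T, vocab), per frame
        aux_logits = [head(f.squeeze(1).mean(dim=1))      # pool over time
                      for head, f in zip(self.aux_heads, aux_feats)]
        return asr_logits, aux_logits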
S5, establishing a loss function for the multi-task air traffic control voice recognition model that accounts for speech recognition and the auxiliary classification tasks simultaneously;
Specifically, the loss function of the speech recognition classifier uses the connectionist temporal classification loss (CTCLoss), and the loss functions of the auxiliary task classifiers all use the cross-entropy loss (CrossEntropyLoss). The loss function of the multi-task air traffic control voice recognition model is constructed as a weighted sum of the speech recognition loss and the auxiliary task losses, with the weight of each task loss adjusted as a parameter during model training; the loss function L of the multi-task model is recorded as:
L = λ_ASR · L_ASR + Σ_{i=1}^{n} λ_aux_i · L_aux_i

where L_ASR and L_aux_i denote the loss values of the speech recognition classifier and of the i-th auxiliary task classifier respectively, λ_ASR and λ_aux_i denote the weights of the speech recognition loss and of the i-th auxiliary task loss respectively, and n is the number of auxiliary task classifiers. The weights of the speech recognition loss and of all auxiliary task losses are determined in a learned manner and optimized together during model training.
Further, each weight is determined by a corresponding uncertainty variable σ_i. Each uncertainty variable is a scalar that is updated and optimized during training and determines the corresponding loss weight, e.g. in the standard homoscedastic-uncertainty form

λ_i = 1 / (2σ_i²)

with an additional log σ_i term added to the total loss so that the uncertainty variables cannot grow without bound.
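A minimal sketch of this learned weighting, assuming the common homoscedastic-uncertainty form in which each weight is exp(−s_i) for a learnable s_i = log σ_i² and s_i itself is added as a regularizer; the exact formula used by the patent may differ, so treat this as one plausible instantiation.

import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    # L = sum_i exp(-s_i) * L_i + s_i, with s_i = log(sigma_i^2) learnable.
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = torch.zeros((), device=self.log_vars.device)
        for s, loss in zip(self.log_vars, task_losses):
            total = total + torch.exp(-s) * loss + s
        return total

# Usage with one CTC loss and four cross-entropy auxiliary losses:
criterion = UncertaintyWeightedLoss(num_tasks=5)
dummy_losses = [torch.tensor(1.0, requires_grad=True) for _ in range(5)]
print(criterion(dummy_losses))   # scalar total loss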
S6, training the multi-task air traffic control voice recognition model until the network converges, based on its loss function and the second-stage training data set, using loop-iteration training; a single iteration performs the following operations:
S61, randomly selecting a group of training samples from the second-stage training data set;
S62, inputting the training samples into the multi-task air traffic control voice recognition model and outputting the speech recognition result and each auxiliary task classification result;
and S63, adjusting the relevant parameters of the multi-task air traffic control voice recognition model using its loss function.
And S7, acquiring Chinese-English speech signals without text labels from the real-time ground-air communication environment, segmenting them by sentence into air traffic control voice data, and inputting the data into the trained multi-task air traffic control voice recognition model to obtain the text recognition result and the multi-task attribute classification results.
Specifically, the air traffic control voice data is fed into the trained multi-task voice recognition model, which outputs the multi-task label probabilities; the class with the highest probability is taken as each auxiliary task classification result. The model further predicts the text label probabilities for each speech frame, and the instruction text is decoded and output according to the maximum probability.
In conclusion, the invention combines self-supervised pre-training, multi-attention and multi-task learning mechanisms, and designs an end-to-end, Chinese-English mixed, deep-learning-based multi-task air traffic control voice recognition method and model, improving speech recognition accuracy in the ATC scenario while performing multi-task attribute classification in real time, and providing more usable information for ATC post-hoc analysis and other downstream applications.
Example 2
The feasibility and performance of the technical scheme of embodiment 1 are verified as follows:
First, data are prepared with the acquisition scheme of embodiment 1: Chinese-English speech signals without text labels from the ground-air communication environment are obtained through the ATC internal communication system, the first-stage pre-training data set and second-stage training data set are built, and training, validation and test sets are formed by random selection.
The first-stage pre-training data set is as follows:
the training set comprises 774,083 utterances totaling 640.40 hours, and the validation set comprises 7,749 utterances totaling 6.40 hours;
the second stage training dataset is:
the training set comprises 58,432 utterances totaling 53.56 hours, of which 43,178 Chinese utterances total 37.00 hours and 15,254 English utterances total 16.56 hours; the test set comprises 1,603 utterances totaling 1.45 hours, of which 1,202 Chinese utterances total 1.01 hours and 401 English utterances total 0.44 hours. In the second training stage the vocabulary contains 668 characters in total, including 641 Chinese characters, 26 English letters and the space.
The test results of embodiment 2 are obtained by performing speech recognition on the test set.
Second, a baseline model is established: this embodiment uses a wav2vec 2.0 model as the air traffic control speech coding model, followed by a speech recognition classifier consisting only of a fully connected layer, to verify effectiveness; the model input is the raw waveform of the speech file.
The baseline model and the method of embodiment 1 are implemented with the PyTorch framework; the hyper-parameter configuration for model training is as follows:
learning rate: the initial learning rate is set to 1e-5 with a tri-stage learning-rate schedule (tri-stage lr schedule): warm-up over the first 10% of updates, a constant rate for the next 40%, and linear decay over the remainder (a sketch of this schedule follows the list);
batch size: 8.
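The tri-stage schedule above (10% warm-up, 40% hold, linear decay on the rest, base learning rate 1e-5) could be realized with a LambdaLR multiplier as in this sketch; the decay floor final_scale is an assumption.

import torch

def tri_stage_lambda(total_steps, warmup_frac=0.10, hold_frac=0.40,
                     final_scale=0.05):
    # Multiplier on the base LR: linear warm-up, constant hold, linear decay.
    warmup = int(total_steps * warmup_frac)
    hold = int(total_steps * hold_frac)

    def fn(step):
        if step < warmup:
            return step / max(1, warmup)
        if step < warmup + hold:
            return 1.0
        progress = (step - warmup - hold) / max(1, total_steps - warmup - hold)
        return max(final_scale, 1.0 - (1.0 - final_scale) * progress)
    return fn

model = torch.nn.Linear(10, 10)                      # stand-in model
opt = torch.optim.Adam(model.parameters(), lr=1e-5)  # initial LR as above
sched = torch.optim.lr_scheduler.LambdaLR(opt, tri_stage_lambda(100000))
# call opt.step() then sched.step() once per update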
The experimental hardware environment is: CPU 2× Intel Xeon E5-2680 v4, GPU 1× NVIDIA GeForce RTX 2080 Ti with 11 GB of memory, 128 GB of RAM, running Ubuntu 16.04.
Under the above training data and configuration, 8 experiments A1-A8 were performed, as follows:
a1: training the baseline model only on the second stage training data set to complete the speech recognition task;
a2: a pre-training learning mechanism is added during the training of the baseline model, firstly, the voice coding model part of the baseline model is pre-trained, and then, the baseline model is trained in the second stage so as to complete the voice recognition task;
a3: a multi-task learning mechanism is added during the training of the baseline model, and training is only carried out on the training data set of the second stage so as to complete the voice recognition and multi-attribute classification tasks;
a4: when the baseline model is trained, a multi-attention module is added, and only the training is carried out on the second-stage training data set to complete the voice recognition task, wherein the number of hidden layers used in the voice coding model is 6;
a5: a multi-task learning mechanism and a multi-attention module are added during the training of the baseline model, and training is only carried out on a second-stage training data set to complete the voice recognition and multi-attribute classification tasks, wherein the number of hidden layers used in the voice coding model is 6;
a6: pre-training and multi-task learning mechanisms are added when training the baseline model: the speech coding part of the baseline model is first pre-trained, and the baseline model then undergoes second-stage training to complete the speech recognition and multi-attribute classification tasks;
a7: a pre-training learning mechanism and the multi-attention module are added when training the baseline model: the speech coding part is first pre-trained, and the baseline model then undergoes second-stage training to complete the speech recognition task, with 6 hidden layers used in the speech coding model;
a8: pre-training, the multi-task learning mechanism and the multi-attention module are all added when training the baseline model: the speech coding part is first pre-trained, and the baseline model then undergoes second-stage training to complete the speech recognition and multi-attribute classification tasks, with 6 hidden layers used in the speech coding model;
the auxiliary task result in the experimental result is measured by accuracy, namely the proportion of the correctly classified samples to the total samples, and the correctness of the speech recognition is the character error rate based on Chinese characters and English lettersCER(charcter error rate) as follows:
Figure DEST_PATH_IMAGE032
wherein the content of the first and second substances,Nfor the length of the real-text label,I、D、Srepresenting the insertion, deletion and replacement operands required to convert the predictive text label to a real label, respectively.
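The CER defined above can be computed with a standard Levenshtein dynamic program over characters, as in this sketch:

def cer(reference: str, hypothesis: str) -> float:
    # Character error rate: (I + D + S) / N via edit distance over characters.
    n, m = len(reference), len(hypothesis)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                              # deletions
    for j in range(m + 1):
        dp[0][j] = j                              # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[n][m] / max(1, n)

print(cer("climb to eight thousand", "climb to eight thousand"))  # 0.0
print(cer("turn left heading 210", "turn left heading 120"))      # ~0.095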
In summary, the verification of the technical scheme considers only the performance of the acoustic model, without language model processing or optimization; the final results are compared in Table 1.
TABLE 1 (comparison of results for experiments A1-A8; table image not reproduced)
The experimental results show that, compared with the baseline model, the pre-training learning mechanism, the multi-task learning mechanism and the multi-attention module proposed here each improve the performance of the speech recognition model on this embodiment's data set. Compared with methods without it, introducing the pre-training learning mechanism yields the largest performance gain, indicating that on the air traffic control data set pre-training learns a more robust speech feature representation that ultimately supports ATC speech recognition research. Introducing the multi-task learning and multi-attention modules further improves speech recognition performance to a certain degree, and with all three mechanisms introduced, the baseline model achieves the best speech recognition performance on this data set.
In conclusion, the pre-training and multi-task learning mechanisms and the multi-attention module adopted by the method greatly promote the performance of the air traffic control speech recognition model and improve its convergence efficiency.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A multi-task air traffic control voice recognition method based on pre-training, characterized by comprising the following steps:
S1, acquiring and preprocessing air traffic control voice data to obtain a training sample data set, which comprises a first-stage pre-training data set and a second-stage training data set that is manually annotated with text labels and auxiliary task attribute labels;
S2, constructing a pre-training-based air traffic control speech coding model;
S3, inputting the first-stage pre-training data set into the air traffic control speech coding model for pre-training;
S4, constructing a multi-task air traffic control voice recognition model based on the pre-trained air traffic control speech coding model;
S5, establishing a loss function of the multi-task air traffic control voice recognition model;
S6, training the multi-task air traffic control voice recognition model based on its loss function and the second-stage training data set;
and S7, inputting real-time ground-air communication voice data segmented by sentence into the trained multi-task air traffic control voice recognition model to obtain a text recognition result and auxiliary task recognition results.
2. The pre-training-based multi-task air traffic control voice recognition method according to claim 1, wherein the air traffic control voice data in step S1 are Chinese-English speech signals without text labels, and step S1 comprises:
S11, performing speech pre-emphasis and framing preprocessing on the air traffic control voice data, then segmenting the preprocessed data by sentence;
S12, taking all the segmented air traffic control voice data as the first-stage pre-training data set, where each training sample contains only a single-sentence audio file;
and S13, selecting part of the segmented air traffic control voice data for manual text labeling and auxiliary task attribute labeling as the second-stage training data set, where each training sample contains a single-sentence audio file, the corresponding text label and the attribute classification labels.
3. The method for multi-task air traffic control speech recognition based on pre-training according to claim 1, wherein the step S2 of constructing the pre-training-based air traffic control speech coding model comprises:
S21, establishing a convolution module consisting of one-dimensional convolution layers and activation function layers, and using it to extract the speech features of the training samples;
S22, establishing a context extraction module consisting of a deep neural network, using it to extract the context information of the speech features, recorded as:
h = {h_1, h_2, …, h_N},  h_i = Encoder_i(h_{i−1}),  h_0 = c,  h_i ∈ R^(1×T×f)

where c is the output of the convolution module, h is the set of hidden-layer features output by the layers of the neural network, h_i is the hidden-layer feature output by the i-th layer, N is the total number of layers of the deep neural network, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f);
S23, establishing an output module that stacks the outputs of the last d hidden layers of the context extraction module as the output of the air traffic control speech coding model, recorded as:

Y = [h_{N−d+1}; h_{N−d+2}; …; h_N],  Y ∈ R^(d×T×f)

where Y is the multi-layer feature output of the encoder, h is the hidden-layer feature output by each layer of the neural network, d is the number of stacked layers, N is the total number of layers of the deep neural network, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(d×T×f) denotes a feature of dimension (d, T, f).
4. The multi-task air traffic control voice recognition method based on pre-training according to claim 1, wherein the step S4 of constructing the multi-task air traffic control voice recognition model comprises:
S41, constructing a multi-attention module and, based on it, the auxiliary task classifiers, including a speaker role classifier, an instruction language classifier, a speaker gender classifier and an instruction intention classifier;
and S42, constructing a multi-attention module and, based on it, the speech recognition classifier.
5. The pre-training-based multi-task air traffic control voice recognition method according to claim 4, wherein the construction of the multiple attention module comprises the following steps:
constructing a hierarchical (level) attention module, performing an attention operation along the level dimension on the multi-layer feature output of the encoder to obtain a level-dimension attention matrix, and multiplying this attention matrix by the multi-layer feature output of the encoder; the result is recorded as:
LR = Att_L(Y) ⊗ Y,  LR ∈ R^(1×T×f)

where LR is the output of the hierarchical attention module, Y is the multi-layer feature output of the encoder, Att_L(·) is the attention operation in the level dimension, d is the number of layers of the deep neural network, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, R^(d×T×f) denotes a feature of dimension (d, T, f), and R^(1×T×f) denotes a feature of dimension (1, T, f);
constructing attention modules for the time and frequency dimensions, performing attention operations on the output of the hierarchical attention module along the time dimension and the frequency dimension respectively to obtain a time-dimension attention matrix and a frequency-dimension attention matrix, multiplying both by the output of the hierarchical attention module, and recording the result as:
LTFR = Att_T(LR) ⊗ Att_F(LR) ⊗ LR,  LTFR ∈ R^(1×T×f)

where LTFR is the output of the time- and frequency-dimension attention modules, LR is the output of the hierarchical attention module, Att_T(·) is the attention operation in the time dimension, Att_F(·) is the attention operation in the frequency dimension, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f).
6. The pre-training-based multi-task air traffic control voice recognition method according to claim 4, wherein the construction of the auxiliary task classifiers in S41 further comprises:
S411, inputting the output of the multi-attention module into the speech recognition classifier;
and S412, inputting the output of the multi-attention module into a fully connected layer to obtain the auxiliary task classification result.
7. The method according to claim 4, wherein the constructing of the speech recognition classifier in S42 further comprises:
and S421, adding the output of the speech recognition classifier's multi-attention module to the multi-attention outputs of all the auxiliary task classifiers to obtain a speech feature containing multiple kinds of speech information, recorded as:
X_ASR = LTFR_ASR + Σ_{i=1}^{n} LTFR_aux_i,  X_ASR ∈ R^(1×T×f)

where X_ASR is the speech feature containing multiple kinds of speech information, LTFR_ASR is the multi-attention module output of the speech recognition classifier, LTFR_aux_i is the multi-attention module output of the i-th auxiliary task classifier, n is the number of auxiliary task classifiers, T is the input speech length, f is the dimension of the hidden-layer features, R is the set of real numbers, and R^(1×T×f) denotes a feature of dimension (1, T, f);
S422, inputting the speech feature containing multiple kinds of speech information into the fully connected layer to obtain the text recognition result.
8. The method according to claim 1, wherein in step S5 the loss function of the multi-task air traffic control voice recognition model is constructed as a weighted sum of the loss function of the speech recognition classifier and the loss functions of the auxiliary task classifiers, with the weight of each task loss adjusted as a parameter during model training; the speech recognition classifier uses the connectionist temporal classification (CTC) loss, the auxiliary task classifiers all use the cross-entropy loss, and the loss function L of the multi-task air traffic control voice recognition model is recorded as:
Figure DEST_PATH_IMAGE009
wherein the content of the first and second substances,
Figure 155433DEST_PATH_IMAGE010
and
Figure DEST_PATH_IMAGE011
respectively representing a speech recognition classifier andithe loss value of each of the secondary task classifiers,
Figure 368109DEST_PATH_IMAGE012
and
Figure DEST_PATH_IMAGE013
respectively representing the speech recognition loss and the secondiThe weight that each secondary task takes to lose,nindicating the number of auxiliary task classifiers.
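A PyTorch sketch of this multi-task loss: nn.CTCLoss for the recognition head, nn.CrossEntropyLoss for each auxiliary head, and the task weights held as learnable parameters, as the claim describes. The softplus constraint keeping the weights positive is an assumption the patent does not state.

```python
import torch
import torch.nn as nn

class MultiTaskLoss(nn.Module):
    """Weighted sum of a CTC loss (ASR head) and cross-entropy losses
    (auxiliary heads), with the weights trained as parameters (claim 8)."""

    def __init__(self, num_aux_tasks: int):
        super().__init__()
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)
        self.ce = nn.CrossEntropyLoss()
        # one learnable weight for the ASR loss plus one per auxiliary task
        self.raw_weights = nn.Parameter(torch.zeros(num_aux_tasks + 1))

    def forward(self, log_probs, targets, input_lens, target_lens,
                aux_logits, aux_labels):
        w = nn.functional.softplus(self.raw_weights)  # keep weights positive
        # log_probs must be (T, N, vocab) log-softmax, as nn.CTCLoss expects
        loss = w[0] * self.ctc(log_probs, targets, input_lens, target_lens)
        for i, (logits, labels) in enumerate(zip(aux_logits, aux_labels)):
            loss = loss + w[i + 1] * self.ce(logits, labels)
        return loss
```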
9. The pre-training-based multi-task air traffic control voice recognition method according to claim 1, wherein the training in step S6 is loop-iteration training, and a single iteration proceeds as follows:
S61, selecting a group of training samples from the second-stage training data set;
S62, inputting the training samples into the multi-task air traffic control voice recognition model and outputting the text recognition result and the auxiliary task classification results;
S63, adjusting the parameters of the multi-task air traffic control voice recognition model based on its loss function.
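A sketch of one such iteration (S61-S63), assuming the model returns ASR log-probabilities plus per-task auxiliary logits and that the MultiTaskLoss above serves as the criterion; the batch layout and names are illustrative.

```python
import torch

def train_step(model, criterion, optimizer, batch):
    # S61: `batch` is one group of samples from the second-stage data set
    feats, targets, input_lens, target_lens, aux_labels = batch
    # S62: forward pass yields the text and auxiliary predictions
    log_probs, aux_logits = model(feats)
    # S63: update the model parameters (including the learnable
    # task-loss weights) from the multi-task loss
    loss = criterion(log_probs, targets, input_lens, target_lens,
                     aux_logits, aux_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```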
10. A pre-training-based multi-task air traffic control voice recognition device, comprising at least one processor and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
CN202211118845.1A 2022-09-15 2022-09-15 Multi-task air traffic control voice recognition method and device based on pre-training Active CN115206293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211118845.1A CN115206293B (en) 2022-09-15 2022-09-15 Multi-task air traffic control voice recognition method and device based on pre-training

Publications (2)

Publication Number Publication Date
CN115206293A 2022-10-18
CN115206293B (en) 2022-11-29

Family

ID=83572350

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211118845.1A Active CN115206293B (en) 2022-09-15 2022-09-15 Multi-task air traffic control voice recognition method and device based on pre-training

Country Status (1)

Country Link
CN (1) CN115206293B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2405422A1 (en) * 2010-07-08 2012-01-11 Honeywell International, Inc. Speech recognition and voice training data storage and access method and apparatus
EP2874133A1 (en) * 2013-11-14 2015-05-20 Honeywell International Inc. Aircraft systems and methods for reducing and detecting read-back and hear-back errors
US11222627B1 (en) * 2017-11-22 2022-01-11 Educational Testing Service Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system
US20200135174A1 (en) * 2018-10-24 2020-04-30 Tencent America LLC Multi-task training architecture and strategy for attention-based speech recognition system
US20200160836A1 (en) * 2018-11-21 2020-05-21 Google Llc Multi-dialect and multilingual speech recognition
CN112420024A (en) * 2020-10-23 2021-02-26 四川大学 Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device
CN113160798A (en) * 2021-04-28 2021-07-23 厦门大学 Chinese civil aviation air traffic control voice recognition method and system
CN113284485A (en) * 2021-07-09 2021-08-20 中国科学院自动化研究所 End-to-end framework for unified Chinese and English mixed text generation and speech recognition
CN113889090A (en) * 2021-09-29 2022-01-04 北京中科智加科技有限公司 Multi-language recognition model construction and training method based on multi-task learning
CN114582330A (en) * 2022-03-11 2022-06-03 中国科学技术大学 Training method of voice recognition model, voice recognition method and electronic equipment
CN114596845A (en) * 2022-04-13 2022-06-07 马上消费金融股份有限公司 Training method of voice recognition model, voice recognition method and device
CN114944150A (en) * 2022-05-07 2022-08-26 深圳职业技术学院 Dual-task-based Conformer land-air communication acoustic model construction method
CN114648982A (en) * 2022-05-24 2022-06-21 四川大学 Controller voice recognition method and device based on comparative learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YI LIN: "A Unified Framework for Multilingual Speech Recognition in Air Traffic Control Systems", IEEE Transactions on Neural Networks and Learning Systems *
ZHOU KAI: "Research and Application of Speech Recognition Technology for Civil Aviation Ground-Air Communication", China Masters' Theses Full-text Database *
LIN YI: "Automatic Segmentation of Ground-Air Communication Based on CGRU Multi-Input Features", Journal of Sichuan University *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116168690A (en) * 2023-04-19 2023-05-26 易方信息科技股份有限公司 Method, device, equipment and storage medium for real-time voice desensitization based on deep learning
CN116504234A (en) * 2023-05-29 2023-07-28 镁佳(北京)科技有限公司 Method, device, equipment and medium for generating voice awakening and detecting model
CN116504234B (en) * 2023-05-29 2023-10-13 镁佳(北京)科技有限公司 Method, device, equipment and medium for generating voice awakening and detecting model
CN116453514A (en) * 2023-06-08 2023-07-18 四川大学 Multi-view-based voice keyword detection and positioning method and device
CN116453514B (en) * 2023-06-08 2023-08-25 四川大学 Multi-view-based voice keyword detection and positioning method and device
CN117577116A (en) * 2024-01-17 2024-02-20 清华大学 Training method, device, equipment and medium for continuously learning voice identification model
CN117577116B (en) * 2024-01-17 2024-03-19 清华大学 Training method, device, equipment and medium for continuously learning voice identification model

Also Published As

Publication number Publication date
CN115206293B (en) 2022-11-29

Similar Documents

Publication Publication Date Title
CN115206293B (en) Multi-task air traffic control voice recognition method and device based on pre-training
Schuller et al. The INTERSPEECH 2021 computational paralinguistics challenge: COVID-19 cough, COVID-19 speech, escalation & primates
Chen et al. End-to-end neural network based automated speech scoring
Ferrer et al. Study of senone-based deep neural network approaches for spoken language recognition
CN112017644B (en) Sound transformation system, method and application
CN111837178A (en) Speech processing system and method for processing speech signal
CN111989742A (en) Speech recognition system and method for using speech recognition system
CN107408384A (en) The end-to-end speech recognition of deployment
CN110600047A (en) Perceptual STARGAN-based many-to-many speaker conversion method
GB2326320A (en) Text to speech synthesis using neural network
CN111400469A (en) Intelligent generation system and method for voice question answering
Pokorny et al. Detection of negative emotions in speech signals using bags-of-audio-words
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
Zhang et al. Improving end-to-end single-channel multi-talker speech recognition
CN110070855A (en) A kind of speech recognition system and method based on migration neural network acoustic model
CN109671423A (en) Non-parallel text compressing method under the limited situation of training data
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
Zhao et al. End-to-end-based Tibetan multitask speech recognition
Soliman et al. Isolated word speech recognition using convolutional neural network
Thai et al. Fully convolutional ASR for less-resourced endangered languages
CN114944150A (en) Dual-task-based Conformer land-air communication acoustic model construction method
CN111090726A (en) NLP-based electric power industry character customer service interaction method
Ng et al. Teacher-student training for text-independent speaker recognition
Rouhe et al. Low resource comparison of attention-based and hybrid ASR exploiting wav2vec 2.0
CN113327585A (en) Automatic voice recognition method based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant