CN115240648A - Controller voice enhancement method and device for voice recognition

Controller voice enhancement method and device for voice recognition

Info

Publication number
CN115240648A
Authority
CN
China
Prior art keywords
voice
controller
speech
data
loss function
Prior art date
Legal status
Granted
Application number
CN202210841871.0A
Other languages
Chinese (zh)
Other versions
CN115240648B (en)
Inventor
余欣乘
林毅
张建伟
Current Assignee
Sichuan University
Original Assignee
Sichuan University
Priority date
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202210841871.0A
Publication of CN115240648A
Application granted
Publication of CN115240648B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/26 - Speech to text systems
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders, using predictive techniques
    • G10L19/16 - Vocoder architecture
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 - Processing in the frequency domain
    • G10L2021/02082 - Noise filtering, the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention relates to the field of civil aviation air traffic control and to the field of speech enhancement, and in particular to a voice recognition-oriented controller voice enhancement method and device. The method uses preprocessed clean speech-noisy speech pairs of air traffic control calls collected in a real scene as the data set, builds a controller voice enhancement preliminary model containing SASC and CSSAtt modules, and trains the neural network with a multitask loss function oriented simultaneously to the speech enhancement task and the speech recognition task. The trained model enhances existing noisy air traffic control controller speech, eliminating the echo influence, improving the clarity and intelligibility of the controller speech, and effectively increasing the accuracy of controller speech recognition.

Description

Controller voice enhancement method and device for voice recognition
Technical Field
The invention relates to the field of civil aviation air traffic control and the field of voice enhancement, in particular to a voice recognition-oriented controller voice enhancement method and device.
Background
In the field of Air Traffic Control (ATC), the main means of communication between a controller and a pilot is voice, and the voice signals are transmitted over Very High Frequency (VHF) radio. The controller issues voice commands to the pilot, and the pilot reads the commands back to the controller. This transmission-and-readback confirmation mechanism between controller and pilot keeps the air traffic control system running in an orderly way. Fig. 1 describes the generation and transmission process of air traffic control speech, which is as follows:
(1) The controller speaks into a microphone; the voice travels uplink through the ground-air communication intercom system and the communication server to the radio station, and is sent to the pilot;
(2) So that the controller knows whether a voice command has been safely delivered to the pilot, the air traffic control radio system uses a dedicated "return mechanism": when the voice command reaches the radio station, the station transmits it back to the controller's headset on the same radio frequency, so that the controller can hear the voice command just sent;
(3) After receiving the controller's voice command, the pilot reads it back, and the readback travels downlink through the radio station, the communication server and the ground-air communication intercom system to the controller, completing the command interaction;
(4) To give the control working positions a uniform voice interface, the intercom system merges the controller's uplink and downlink voice signals and the pilot's downlink voice signal by superposition and splicing; the merged voice can then be used for subsequent speech processing tasks such as speech recognition and voiceprint recognition.
The controller's voice command is transmitted on both the uplink and the downlink, and a time delay appears when the intercom system superimposes them, so the resulting controller voice signal is a "control echo" superimposed signal peculiar to the air traffic control voice system. Figs. 2 and 3 show the waveforms and corresponding spectrograms of the voices on the different transmission lines: the uppermost trace is the uplink voice, the middle trace the downlink voice, and the lowermost trace the mixed voice; the time delay between them is marked in Fig. 2. As shown in Figs. 2 and 3, when the controller voice signals are superimposed, the voice waveform and the corresponding spectrogram carry additional "echo" noise (see the boxes in Fig. 3). In addition, owing to the complexity of the air traffic environment, the voice signal is also affected during propagation by factors such as the collection equipment, the transmission devices, the weather and the speaker characteristics, which add further noise pollution. Noise in the speech segment reduces its intelligibility and clarity and corrupts the representational characteristics of the signal, thereby degrading the subsequent speech recognition task.
Reduced intelligibility and clarity of the controller voice signal impairs auditory perception in subsequent voice analysis tasks, lowers the accuracy with which the voice content can be obtained, and hinders analysis of the voice information. Moreover, current speech recognition methods show that the recognition accuracy on controller speech containing echo is markedly lower than on echo-free pilot speech. As the initiator of ground-air information exchange, the controller's voice information is essential to the orderly operation of the air traffic control system, and low recognition accuracy on controller speech severely affects subsequent speech processing tasks. There is therefore a need for a controller voice enhancement method and apparatus that can eliminate the echo effect, improve voice quality, and improve the accuracy of speech recognition.
Disclosure of Invention
The invention aims to solve the problems, existing in the prior art, of poor voice quality and a low speech recognition rate for controller speech carrying control echo in complex air traffic control radio communication scenes, and provides a voice recognition-oriented controller voice enhancement method and device.
In order to achieve the above purpose, the invention provides the following technical scheme:
A controller voice enhancement method for voice recognition, comprising the following steps:
S1: acquiring original clean speech-noisy speech data pairs of ground-air calls to form an original data set, and outputting an effective data set after preprocessing and labeling the original data set;
S2: building a controller voice enhancement preliminary model based on a neural network structure;
S3: establishing a multitask loss function of the controller voice enhancement preliminary model based on a controller voice enhancement task and a controller voice recognition task;
S4: iteratively updating the model parameters of the controller voice enhancement preliminary model by a gradient-descent neural network training algorithm based on the multitask loss function and the effective data set, and outputting the controller voice enhancement model;
S5: inputting the controller voice to be enhanced into the controller voice enhancement model and outputting the corresponding enhanced voice. The method uses preprocessed clean speech-noisy speech pairs of air traffic control calls collected in a real scene as the data set, builds a controller voice enhancement preliminary model containing SASC and CSSAtt modules, and trains the neural network with a multitask loss function oriented simultaneously to the speech enhancement task and the speech recognition task; the trained model enhances existing noisy air traffic control controller speech, eliminating the echo influence, improving the clarity and intelligibility of the controller speech, and effectively increasing the accuracy of controller speech recognition.
As a preferable embodiment of the present invention, the step S1 includes the steps of:
S1-1: acquiring original clean speech-noisy speech data pairs of ground-air calls to form an original data set;
the method for acquiring an original clean speech-noisy speech data pair is as follows:
on the basis of the existing intercom system, an auxiliary intercom system is added at each air traffic control working position, and the controller speech is collected simultaneously through the auxiliary intercom system and the existing intercom system, yielding the original clean speech-noisy speech data pair;
S1-2: preprocessing the original clean speech-noisy speech data pairs in the original data set and outputting the preprocessed pairs; the preprocessing comprises voice activity detection, speaker role classification, redundant data screening and time-sequence alignment;
S1-3: randomly dividing the preprocessed clean speech-noisy speech data pairs into an effective training set, an effective verification set and an effective test set, manually labeling the data pairs of the effective test set, and outputting the effective training set, the effective verification set and the labeled effective test set as the effective data set; the manually labeled content is the instruction text corresponding to each clean speech-noisy speech data pair. The invention designs this preprocessing method for the raw speech according to the speech generation and collection mechanism of real air traffic control scenes, which effectively improves the efficiency of the processing operations and the accuracy of speech recognition after training, testing and speech enhancement with the model of the invention.
As a preferred scheme of the present invention, the controller speech enhancement preliminary model comprises a first SCN module, a second SCN module, a plurality of encoder units and corresponding decoder units; the first SCN module is arranged between the input end of the preliminary model and the input end of the encoder units; the second SCN module is arranged between the output end of the preliminary model and the output end of the decoder units; each encoder unit comprises a CNN module and a CSSAtt module; each decoder unit comprises a CNN module and a CSSAtt module; the encoder units and the corresponding decoder units are connected through a BiLSTM module and SASC modules;
the first SCN module is used for feature upsampling of the speech data in the effective data set;
the second SCN module is used for downsampling the speech feature map output by the decoder units;
the CNN module is used for extracting a preliminary speech feature map of the speech data and outputting it to the CSSAtt module; that is, the CNN module extracts local features of the speech signal and combines them in the deep layers of the network to obtain global features of the speech signal;
the BiLSTM module is used for capturing the dependency of the speech data over time and mining the temporal correlation between signal frames of the speech data;
the SASC module is bridged between peer (same-level) layers of the encoder and decoder and passes the same-dimension features of the speech data from the shallow network to the deep network by skip connection;
the CSSAtt module is used for guiding the preliminary model to mine features separately from the channel dimension and the spatial dimension of the preliminary speech feature map and to optimize the channel-space split attention parameters.
The invention designs a fully end-to-end controller voice enhancement model: the input and output of the model are raw speech waveforms and no other speech transformation steps are involved, so the controller voice enhancement model can be applied directly, without retraining the existing speech recognition model, to enhance and optimize the speech data fed to the existing speech recognition model in real, complex air traffic environments.
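The overall layout can be pictured with the PyTorch-style skeleton below. It is only a structural sketch: the channel sizes, kernel sizes and the plain up/down-sampling and concatenation placeholders standing in for the SCN, CSSAtt and SASC modules are illustrative assumptions, not the patented implementation.

    import torch
    import torch.nn as nn

    class EncoderUnit(nn.Module):
        """CNN + CSSAtt encoder unit (the CSSAtt is replaced by an identity here)."""
        def __init__(self, c_in, c_out):
            super().__init__()
            self.cnn = nn.Sequential(nn.Conv1d(c_in, c_out, kernel_size=8, stride=4), nn.ReLU())
            self.cssatt = nn.Identity()  # placeholder for the CSSAtt module
        def forward(self, x):
            return self.cssatt(self.cnn(x))

    class DecoderUnit(nn.Module):
        """CNN + CSSAtt decoder unit (mirror of the encoder unit)."""
        def __init__(self, c_in, c_out):
            super().__init__()
            self.cssatt = nn.Identity()  # placeholder for the CSSAtt module
            self.cnn = nn.Sequential(nn.ConvTranspose1d(c_in, c_out, kernel_size=8, stride=4), nn.ReLU())
        def forward(self, x):
            return self.cnn(self.cssatt(x))

    class EnhancerSkeleton(nn.Module):
        """Waveform-in / waveform-out skeleton: input up-sampling (first SCN), encoder
        stack, BiLSTM bottleneck, decoder stack with skip connections (SASC in the
        patent), output down-sampling (second SCN)."""
        def __init__(self, channels=(1, 32, 64)):
            super().__init__()
            self.scn_in = nn.Upsample(scale_factor=2, mode="linear", align_corners=False)
            self.encoders = nn.ModuleList(
                [EncoderUnit(channels[i], channels[i + 1]) for i in range(len(channels) - 1)])
            self.bottleneck = nn.LSTM(channels[-1], channels[-1] // 2,
                                      batch_first=True, bidirectional=True)
            self.decoders = nn.ModuleList(
                [DecoderUnit(2 * channels[i + 1], channels[i])
                 for i in reversed(range(len(channels) - 1))])
            self.scn_out = nn.Upsample(scale_factor=0.5, mode="linear", align_corners=False)

        def forward(self, wav):                              # wav: (B, 1, L) raw waveform
            x = self.scn_in(wav)
            skips = []
            for enc in self.encoders:
                x = enc(x)
                skips.append(x)
            x, _ = self.bottleneck(x.transpose(1, 2))        # BiLSTM over time frames
            x = x.transpose(1, 2)
            for dec in self.decoders:
                skip = skips.pop()
                n = min(x.shape[-1], skip.shape[-1])         # align lengths before splicing
                x = dec(torch.cat([x[..., :n], skip[..., :n]], dim=1))
            return self.scn_out(x)                           # enhanced waveform

A plain concatenation skip is used here purely so the skeleton runs; the SASC steps described next explain how the patented skip connection weights that splice instead.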
As a preferred scheme of the present invention, the SASC module includes the following operation steps:
S2-1-1: passing the voice data through the i-th encoder unit to obtain the encoding feature map F_e^i, of shape B×C×L, wherein B denotes the batch size, C the number of channels and L the data length;
S2-1-2: passing the voice data through the i-th decoder unit to obtain the decoding feature map F_d^i, of shape B×C×L;
S2-1-3: performing a self-attention operation on the encoding feature map F_e^i and on the decoding feature map F_d^i respectively to obtain their initial self-attention weights, then splicing and activating them to obtain the fused self-attention weight; the operation formula is:
W_e = SA(F_e^i), W_d = SA(F_d^i), W_f = ReLU(Concat(W_e, W_d)),
wherein SA(·) denotes the self-attention operation, W_e and W_d denote the initial self-attention weights, Concat(·) denotes the splicing operation in the channel dimension, ReLU denotes the first activation function, used to enhance the ability of the neural network to fit non-linear functions, and W_f denotes the self-attention weight fused from the encoder and decoder peer layers;
S2-1-4: performing a self-attention operation and activation on the fused self-attention weight to obtain the skip attention weight coefficient of the encoder-decoder peer layers; the operation formula is:
α = σ(SA(W_f)),
wherein σ denotes the second activation function and α denotes the skip attention weight coefficient;
S2-1-5: adjusting the weight of every feature point of the encoding feature map F_e^i by the skip attention weight coefficient during the skip connection, splicing the result with the decoding feature map F_d^i, and outputting the skip-connected voice feature map processed by the SASC module; the operation formula is:
F_sasc = Concat(α ⊙ F_e^i, F_d^i),
wherein ⊙ denotes element-wise multiplication and F_sasc denotes the skip-connected voice feature map.
According to the encoder-decoder structure of the model, the invention designs an SASC module for speech processing, which uses a self-attention mechanism to mine and analyse the useful features of the speech feature maps between peer layers of the model and to suppress redundant features, guiding the model to focus on the encoding-decoding regularities of the data features and helping the model converge better.
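A minimal PyTorch sketch of such a skip connection is given below; 1x1 convolutions stand in for the self-attention operation SA(·) (as in the embodiment described later), and the ReLU/Sigmoid pair is one admissible choice of the two activation functions. It is an illustration of the scheme above, not the patented implementation.

    import torch
    import torch.nn as nn

    class SASC(nn.Module):
        """Self-attention skip connection between encoder/decoder peer layers (sketch)."""
        def __init__(self, channels: int):
            super().__init__()
            self.sa_enc = nn.Conv1d(channels, channels, kernel_size=1)       # SA(.) on F_e
            self.sa_dec = nn.Conv1d(channels, channels, kernel_size=1)       # SA(.) on F_d
            self.sa_fuse = nn.Conv1d(2 * channels, channels, kernel_size=1)  # SA(.) on W_f
            self.act1 = nn.ReLU()      # first activation function
            self.act2 = nn.Sigmoid()   # second activation function

        def forward(self, f_enc: torch.Tensor, f_dec: torch.Tensor) -> torch.Tensor:
            w_enc = self.sa_enc(f_enc)                              # initial weight of F_e
            w_dec = self.sa_dec(f_dec)                              # initial weight of F_d
            w_fused = self.act1(torch.cat([w_enc, w_dec], dim=1))   # fused self-attention weight
            alpha = self.act2(self.sa_fuse(w_fused))                # skip attention coefficient
            return torch.cat([alpha * f_enc, f_dec], dim=1)         # F_sasc, shape (B, 2C, L)

For example, SASC(channels=64)(torch.randn(2, 64, 256), torch.randn(2, 64, 256)) returns a (2, 128, 256) feature map for the next decoder unit.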
As a preferred scheme of the invention, the CSSAtt module comprises the following operating steps:
S2-2-1: inputting a batch of the preliminary voice feature maps, dividing the batch into G groups of sub-feature maps, and splitting each group of sub-feature maps into two branch sub-feature maps X_c and X_s, each of shape B×(C/2G)×L; wherein X_c denotes the channel-branch sub-feature map, X_s denotes the spatial-branch sub-feature map, B denotes the batch size, C the number of channels, L the data length, and G the preset number of groups;
S2-2-2: based on X_c, generating the initialized channel attention weight through an adaptive average pooling operation over the channel dimension; the operation formula is:
W_c = AvgPool(X_c),
wherein AvgPool(·) denotes the adaptive average pooling operation, the subscript c denotes the channel dimension, and W_c denotes the initialized channel attention weight;
S2-2-3: based on X_s, generating the initialized spatial attention weight through a group normalization operation over the spatial dimension; the operation formula is:
W_s = GN(X_s),
wherein GN(·) denotes the group normalization operation, the subscript s denotes the spatial dimension, and W_s denotes the initialized spatial attention weight;
S2-2-4: mining, through learnable parameters, the feature dependency of X_c in the channel dimension and of X_s in the spatial dimension, and generating the attention weight coefficients in the channel dimension and the spatial dimension after activation by an activation function; the operation formula is:
A_c = σ(w_1 ⊙ W_c), A_s = σ(w_2 ⊙ W_s),
wherein w_1 and w_2 denote the learnable parameters, σ denotes the second activation function, ⊙ denotes element-wise multiplication, and A_c and A_s denote the channel attention weight coefficient and the spatial attention weight coefficient respectively;
S2-2-5: adjusting the weight of every feature point of X_c and of X_s by the channel attention weight coefficient and the spatial attention weight coefficient respectively, splicing the adjusted X_c and X_s sub-feature maps, enabling information exchange between different groups of sub-feature maps by a channel shuffle operation, and outputting the voice feature map processed by the CSSAtt module; the operation formula is:
X_out = Shuffle(Concat(A_c ⊙ X_c, A_s ⊙ X_s)),
wherein Concat(·) denotes the feature-map splicing operation in the channel dimension, Shuffle(·) denotes the channel shuffle operation, and X_out denotes the voice feature map processed by the CSSAtt module; the channel shuffle operation specifically scrambles the channel order between the different sub-feature maps within the same batch of feature maps, so that the features of different channels are linked, information is exchanged between different sub-feature maps, and common features are easier to learn.
The invention designs the CSSAtt module for speech processing, which groups a batch of speech feature maps, extracts channel-dimension and spatial-dimension features from each group of sub-feature maps separately, and finally fuses the dimension features of the different sub-feature maps to realize feature exchange between sub-feature maps, facilitating feature exchange between different samples and improving the robustness of the model to the data features.
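The following PyTorch sketch illustrates the grouping, the two attention branches and the channel shuffle of steps S2-2-1 to S2-2-5; the pooling axis, the shapes of the learnable parameters and the Sigmoid activation are assumptions made for the example.

    import torch
    import torch.nn as nn

    class CSSAtt(nn.Module):
        """Channel-space split attention over a (B, C, L) speech feature map (sketch)."""
        def __init__(self, channels: int, groups: int = 8):
            super().__init__()
            assert channels % (2 * groups) == 0
            self.groups = groups
            sub = channels // (2 * groups)                       # channels per branch sub-map
            self.avg_pool = nn.AdaptiveAvgPool1d(1)              # channel-branch pooling
            self.w_c = nn.Parameter(torch.ones(1, sub, 1))       # learnable channel parameter
            self.gn = nn.GroupNorm(sub, sub)                     # spatial-branch normalization
            self.w_s = nn.Parameter(torch.ones(1, sub, 1))       # learnable spatial parameter
            self.sigmoid = nn.Sigmoid()                          # second activation function

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, l = x.shape
            x = x.reshape(b * self.groups, c // self.groups, l)  # G groups of sub-feature maps
            x_c, x_s = x.chunk(2, dim=1)                         # channel / spatial branches
            a_c = self.sigmoid(self.w_c * self.avg_pool(x_c))    # channel attention coefficient
            a_s = self.sigmoid(self.w_s * self.gn(x_s))          # spatial attention coefficient
            out = torch.cat([a_c * x_c, a_s * x_s], dim=1)       # splice the weighted branches
            out = out.reshape(b, c, l)
            # channel shuffle: interleave channels across groups so sub-maps exchange information
            return out.reshape(b, self.groups, c // self.groups, l).transpose(1, 2).reshape(b, c, l)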
As a preferable embodiment of the present invention, the step S3 includes the steps of:
S3-1: constructing a loss function for the speech enhancement task based on the LAE, which directly measures, in the time domain, the error between the output waveform of the controller voice enhancement preliminary model and the real waveform, denoted the LAE loss function L_LAE;
S3-2: constructing a frequency-domain loss function for the speech enhancement task based on multi-resolution STFT magnitude spectra, which measures the error between the STFT magnitude spectra of the output speech of the controller voice enhancement preliminary model and of the real speech, denoted the STFT loss function L_STFT;
S3-3: constructing a multi-resolution feature loss function for the speech recognition task, denoted the feature loss function L_feat;
S3-4: constructing the multitask loss function of the controller voice enhancement preliminary model by weighted summation; the calculation formula is:
L_total = λ_1 · L_LAE + λ_2 · L_STFT + λ_3 · L_feat,
wherein L_total is the multitask loss function and λ_1, λ_2 and λ_3 denote the preset weights of the LAE loss function L_LAE, the STFT loss function L_STFT and the feature loss function L_feat respectively.
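By way of example, the time-domain LAE term and the weighted combination of step S3-4 could be written as follows; the weight values shown are placeholders, not the preset weights of the invention.

    import torch

    def lae_loss(enhanced: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
        """Least-absolute-error term: mean absolute waveform error in the time domain."""
        return torch.mean(torch.abs(enhanced - clean))

    def multitask_loss(l_lae, l_stft, l_feat, weights=(1.0, 1.0, 1.0)):
        """Weighted sum of the enhancement-oriented and recognition-oriented loss terms."""
        return weights[0] * l_lae + weights[1] * l_stft + weights[2] * l_feat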
As a preferred embodiment of the present invention, the step S3-2 includes the steps of:
S3-2-1: constructing triplet parameters for the STFT operation, of the form:
[sampling points, frame shift, window frame];
the sampling-point parameter is the number N of sampling points that form one speech frame out of all the sampling points of the speech signal, the frame-shift parameter is the time difference between the start positions of two adjacent frames, and the window-frame parameter is the type of window function applied to the speech signal;
S3-2-2: constructing the STFT magnitude spectrum of the speech with each group of triplet parameters; the STFT loss computed on the magnitude spectrum constructed with the i-th group of triplet parameters is denoted L_STFT^(i);
S3-2-3: constructing the STFT loss function by combining the single-resolution losses L_STFT^(i) over all the groups of triplet parameters, wherein M denotes the number of groups of triplet parameters.
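A possible PyTorch rendering of steps S3-2-1 to S3-2-3 is sketched below, using the three triplet values given later in embodiment 2; averaging the per-resolution terms is an assumption, since the text only states that the M single-resolution losses are combined.

    import torch

    def stft_magnitude(wav: torch.Tensor, n_fft: int, hop: int) -> torch.Tensor:
        """STFT magnitude spectrum of a (B, L) waveform with a Hamming window frame."""
        window = torch.hamming_window(n_fft, device=wav.device)
        spec = torch.stft(wav, n_fft=n_fft, hop_length=hop, window=window, return_complex=True)
        return spec.abs()

    def multires_stft_loss(enhanced: torch.Tensor, clean: torch.Tensor,
                           triplets=((512, 100), (1024, 200), (256, 50))) -> torch.Tensor:
        """F-norm error between STFT magnitude spectra, averaged over the M resolutions."""
        total = 0.0
        for n_fft, hop in triplets:
            diff = stft_magnitude(enhanced, n_fft, hop) - stft_magnitude(clean, n_fft, hop)
            total = total + torch.linalg.norm(diff, ord="fro", dim=(-2, -1)).mean()
        return total / len(triplets)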
As a preferred embodiment of the present invention, the step S3-3 comprises the steps of:
S3-3-1: performing critical-band integration, loudness pre-emphasis, cubic-root compression, inverse Fourier transform and linear prediction on the STFT magnitude spectrum to obtain perceptual linear prediction acoustic features, and establishing a loss function based on the perceptual linear prediction features that measures the error between the perceptual linear prediction acoustic features of the output speech of the controller voice enhancement preliminary model and of the real speech, denoted the PLP loss function L_PLP;
S3-3-2: performing Mel filtering and logarithmic transformation on the STFT magnitude spectrum to obtain filter-bank acoustic features, and establishing a loss function based on the filter-bank features that measures the error between the filter-bank features of the output speech of the controller voice enhancement preliminary model and of the real speech, denoted the FBANK loss function L_FBANK;
S3-3-3: performing a discrete cosine transform on the filter-bank acoustic features to obtain Mel cepstral coefficient acoustic features, and establishing a loss function based on the Mel cepstral coefficient features that measures the error between the Mel cepstral coefficient features of the model output speech and of the real speech, denoted the MFCC loss function L_MFCC;
S3-3-4: constructing the feature loss function L_feat for the speech recognition task from the PLP loss function, the FBANK loss function and the MFCC loss function;
wherein the STFT loss function L_STFT and the feature loss function L_feat have the same operational form:
L_φ = ||φ(s) - φ(ŝ)||_F,
wherein φ(·) denotes the STFT magnitude spectrum, the PLP feature, the FBANK feature or the MFCC feature, s denotes the clean speech signal, ŝ denotes the enhanced speech signal, and ||·||_F denotes the F-norm.
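As an indication of how the recognition-oriented terms could be computed, the sketch below uses torchaudio for the FBANK and MFCC features; the PLP term is omitted because torchaudio has no standard PLP transform, and the 8 kHz sample rate and filter-bank sizes are assumptions, not values taken from the patent.

    import torch
    import torchaudio

    class FeatureLoss(torch.nn.Module):
        """FBANK + MFCC feature loss between enhanced and clean (B, L) waveforms (sketch)."""
        def __init__(self, sample_rate=8000, n_fft=512, hop=100, n_mels=40, n_mfcc=13):
            super().__init__()
            self.fbank = torchaudio.transforms.MelSpectrogram(
                sample_rate=sample_rate, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
            self.mfcc = torchaudio.transforms.MFCC(
                sample_rate=sample_rate, n_mfcc=n_mfcc,
                melkwargs={"n_fft": n_fft, "hop_length": hop, "n_mels": n_mels})

        @staticmethod
        def _fro(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
            return torch.linalg.norm(a - b, ord="fro", dim=(-2, -1)).mean()

        def forward(self, enhanced: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
            l_fbank = self._fro(torch.log(self.fbank(enhanced) + 1e-8),
                                torch.log(self.fbank(clean) + 1e-8))   # log-Mel filter bank term
            l_mfcc = self._fro(self.mfcc(enhanced), self.mfcc(clean))  # Mel cepstral term
            return l_fbank + l_mfcc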
as a preferable embodiment of the present invention, the step S4 includes the steps of:
s4-1: randomly acquiring a plurality of original clean speech-noisy speech data pairs from the effective training set as a training set
Figure DEST_PATH_IMAGE072
And extracting a pure noise data waveform according to the difference between the noisy speech waveform and the clean speech waveform, wherein the operation formula is as follows:
Figure DEST_PATH_IMAGE073
Figure DEST_PATH_IMAGE074
wherein the content of the first and second substances,
Figure DEST_PATH_IMAGE075
a clean speech waveform representing the original clean speech data in the original clean speech-noisy speech data pair,
Figure DEST_PATH_IMAGE076
a noisy speech waveform representing noisy speech data in the original clean speech-noisy speech data pair,
Figure DEST_PATH_IMAGE077
showing pure noise waveforms, all three shapes
Figure DEST_PATH_IMAGE078
B denotes a batch size, C denotes the number of lanes, L denotes a data length,
Figure DEST_PATH_IMAGE079
representing a subtraction operation by eigenvalue;
s4-2: randomly disturbing the distribution of the pure noise waveforms in the training set, and adding the pure noise waveforms and the clean speech waveforms to obtain enhanced noisy speech waveforms, wherein the operation formula is as follows:
Figure DEST_PATH_IMAGE080
Figure DEST_PATH_IMAGE081
wherein
Figure DEST_PATH_IMAGE082
Indicating that the data is being shuffled through the operation,
Figure DEST_PATH_IMAGE083
representing clean speech waveform-pure noise waveform data pairs,
Figure DEST_PATH_IMAGE084
representing the waveform of said enhanced noisy speech,
Figure DEST_PATH_IMAGE085
indicating an addition operation by a characteristic value;
s4-3: combining the clean speech waveform and the enhanced noisy speech waveform into a new training set, and recording as a second training set
Figure DEST_PATH_IMAGE086
S4-4: iteratively updating model parameters of the controller voice enhancement preliminary model through a gradient descent algorithm based on the second training set and the multitask loss function, verifying whether the controller voice enhancement preliminary model is converged through the effective verification set in a training process, and outputting the current controller voice enhancement preliminary model as the controller voice enhancement model after model training is converged;
the basis for judging the model training convergence is as follows: calculating the multitask loss function of the primary model through the effective verification set every m iteration rounds, and when the multitask loss function does not fall any more after the calculation for n times, considering the multitask loss function as model training convergence; m and n are preset values;
the invention designs a data enhancement method for a model training stage, which randomly redistributes sample noise characteristics, achieves the purpose of expanding the data volume of a training set, effectively increases the noise robustness of a model and generalizes different noise distributions.
S4-5: testing the model with the labeled valid test set.
S5: and inputting the voice of the controller to be enhanced into the controller voice enhancement model, and outputting corresponding enhanced voice.
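The noise-redistribution augmentation of steps S4-1 to S4-3 referred to above can be sketched as follows; tensors follow the (B, C, L) shape convention of the description, and the permutation-based shuffle is one straightforward reading of the shuffle operation.

    import torch

    def shuffle_noise_augment(clean: torch.Tensor, noisy: torch.Tensor) -> torch.Tensor:
        """Extract per-sample noise (noisy - clean), shuffle it across the batch and
        add it back to the clean waveforms to obtain enhanced noisy speech (S4-1..S4-3)."""
        noise = noisy - clean                      # pure-noise waveforms, shape (B, C, L)
        perm = torch.randperm(clean.shape[0])      # random redistribution over the training set
        return clean + noise[perm]                 # enhanced noisy speech waveforms

    # clean, noisy: (B, C, L) tensors from the effective training set;
    # the second training set pairs `clean` with shuffle_noise_augment(clean, noisy).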
A controller speech enhancement device oriented to speech recognition comprising at least one processor and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any of the methods described above.
Compared with the prior art, the invention has the following beneficial effects:
1. The method uses preprocessed clean speech-noisy speech pairs of air traffic control calls collected in a real scene as the data set, builds a controller voice enhancement preliminary model containing SASC and CSSAtt modules, and trains the neural network with a multitask loss function oriented simultaneously to the speech enhancement task and the speech recognition task; the trained model enhances existing noisy air traffic control controller speech, eliminating the echo influence, improving the clarity and intelligibility of the controller speech, and effectively increasing the accuracy of controller speech recognition.
2. The invention designs a preprocessing method for the raw speech according to the speech generation and collection mechanism of real air traffic control scenes, which effectively improves the efficiency of the processing operations and the accuracy of speech recognition after training, testing and speech enhancement with the model of the invention.
3. The invention designs a fully end-to-end controller speech enhancement model whose input and output are raw speech waveforms, with no other speech transformation steps involved, so it can be applied directly, without retraining the existing speech recognition model, to enhance and optimize the speech data fed to the existing speech recognition model in real, complex air traffic environments.
4. According to the encoder-decoder structure of the model, the invention designs the SASC module for speech processing, which uses a self-attention mechanism to mine and analyse the useful features of the speech feature maps between peer layers of the model and to suppress redundant features, guiding the model to focus on the encoding-decoding regularities of the data features and helping the model converge better.
5. The invention designs the CSSAtt module for speech processing, which groups a batch of speech feature maps, extracts channel-dimension and spatial-dimension features from each group of sub-feature maps separately, and finally fuses the dimension features of the different sub-feature maps to realize feature exchange between sub-feature maps, facilitating feature exchange between different samples and improving the robustness of the model to the data features.
6. The invention designs a data augmentation method for the model training stage, which randomly redistributes the noise characteristics of the samples, thereby expanding the amount of training data, effectively increasing the noise robustness of the model and generalizing over different noise distributions.
Drawings
FIG. 1 is a schematic diagram of the generation and transmission of air traffic control speech according to the background art of the present invention.
Fig. 2 shows the air traffic control speech signals collected on different transmission lines and the corresponding waveform diagrams according to the background art of the present invention.
Fig. 3 shows the air traffic control speech signals collected on different transmission lines and the corresponding spectrograms according to the background art of the present invention.
Fig. 4 is a schematic flowchart of a speech recognition-oriented controller speech enhancement method according to embodiment 1 of the present invention.
Fig. 5 is a model structure diagram of a controller voice enhancement preliminary model in a voice recognition-oriented controller voice enhancement method according to embodiment 2 of the present invention.
Fig. 6 is a schematic structural diagram of each module of the controller speech enhancement preliminary model in the speech recognition-oriented controller speech enhancement method according to embodiment 2 of the present invention.
Fig. 7 is an experimental description table of a comparison experiment in the controller voice enhancement method for voice recognition according to embodiment 3 of the present invention.
Fig. 8 is a schematic diagram of the experimental results on controller voice enhancement indicators of the voice recognition-oriented controller voice enhancement method according to embodiment 3 of the present invention.
Fig. 9 is a schematic diagram of the experimental results on controller voice recognition indicators of the voice recognition-oriented controller voice enhancement method according to embodiment 3 of the present invention.
Fig. 10 is a schematic structural diagram of a controller voice enhancement apparatus for voice recognition according to embodiment 4 of the present invention, which uses the controller voice enhancement method for voice recognition according to embodiment 1.
Detailed Description
The present invention will be described in further detail with reference to test examples and specific embodiments. It should be understood that the scope of the above-described subject matter is not limited to the following examples, and any techniques implemented based on the disclosure of the present invention are within the scope of the present invention.
Example 1
As shown in fig. 4, a controller voice enhancement method for voice recognition includes the following steps:
S1: acquiring original clean speech-noisy speech data pairs of ground-air calls to form an original data set, and outputting an effective data set after preprocessing and labeling the original data set;
S2: building a controller voice enhancement preliminary model based on a neural network structure;
S3: establishing a multitask loss function of the controller voice enhancement preliminary model based on a controller voice enhancement task and a controller voice recognition task;
S4: iteratively updating the model parameters of the controller voice enhancement preliminary model by a gradient-descent neural network training algorithm based on the multitask loss function and the effective data set, and outputting the controller voice enhancement model;
S5: inputting the controller voice to be enhanced into the controller voice enhancement model and outputting the corresponding enhanced voice.
Example 2
This embodiment is a specific implementation manner of the method described in embodiment 1, and includes the following steps:
S1: acquiring original clean speech-noisy speech data pairs of ground-air calls to form an original data set, preprocessing and labeling the original data set, and outputting an effective data set.
S1-1: acquiring original clean speech-noisy speech data pairs of ground-air calls to form the original data set;
the method for acquiring an original clean speech-noisy speech data pair is as follows:
on the basis of the existing intercom system, an auxiliary intercom system is added at each air traffic control working position, and the controller speech is collected simultaneously through the auxiliary intercom system and the existing intercom system, yielding the original clean speech-noisy speech data pair.
S1-2: preprocessing the original clean speech-noisy speech data pairs in the original data set and outputting the preprocessed pairs; the preprocessing comprises voice activity detection, speaker role classification, redundant data screening and time-sequence alignment.
S1-2-1: analyzing the collected data of the original data set and dividing the continuous speech signal into instruction speech segments by voice activity detection, the duration of the segmented speech instruction segments being between 0.1 s and 10 s.
S1-2-2: since the intercom system superimposes and merges the controller's uplink and downlink speech with the pilot's downlink speech, the collected speech data is a mixture containing both controller speech and pilot speech. A speaker role classification model is used to classify the segmented speech segments into three classes: controller speech, pilot speech, and controller-pilot mixed speech. Pilot speech and controller-pilot mixed speech are discarded; the invention uses only the controller speech as experimental samples for subsequent processing.
S1-2-3: from the obtained controller speech, screening out silence, noise and data shorter than 1 s, and aligning the time sequences so that the clean speech and the noisy speech of the same voice instruction text have the same duration;
the effective original clean speech-noisy speech data pairs obtained through the above steps have the following characteristics:
(1) The speech data pairs cover the languages used in the recognition scene.
(2) The speech data pairs contain speech in various pronunciation states; the pronunciation state comprises one or more of slow speech rate, normal speech rate, fast speech rate, relaxed emotion, tense emotion and accent.
(3) The speech data pairs contain the specialised control phraseology of the air traffic control field.
S1-3: randomly dividing the preprocessed clean speech-noisy speech data pairs into an effective training set, an effective verification set and an effective test set, manually labeling the data pairs of the effective test set, and outputting the effective training set, the effective verification set and the labeled effective test set as the effective data set; the manually labeled content is the instruction text corresponding to each clean speech-noisy speech data pair.
S1-3-1: the obtained effective original clean speech-noisy speech data pairs are randomly divided into a training set, a verification set and a test set at a ratio of 8:1:1.
S1-3-2: taking the clean speech data of the test set as reference, the test set data are manually labeled and data with unclear semantics are screened out, yielding the instruction text for each pair of test-set speech.
S1-3-3: the training set data and the verification set data are stored in pairs, each pair comprising two utterances, the clean speech and the noisy speech respectively; the test set data and the corresponding instruction texts are organised and stored to form the labeled effective test set.
S2: as shown in fig. 5, a controller voice enhancement preliminary model is built based on a neural network structure; the controller voice enhancement preliminary model comprises a first SCN module, a second SCN module, a plurality of encoder units and corresponding decoder units; the first SCN module is arranged between the input end of the preliminary model and the input end of the encoder units; the second SCN module is arranged between the output end of the preliminary model and the output end of the decoder units; each encoder unit comprises a CNN module and a CSSAtt module; each decoder unit comprises a CNN module and a CSSAtt module; the encoder units and the corresponding decoder units are connected through a BiLSTM module and SASC modules.
The SCN module utilizes the characteristics of a sinc filter in signal processing and adopts a sinc interpolation convolution network to extract the characteristics of the voice signal, so that data points of the voice signal which are lost due to sampling can be reconstructed, and the data integrity of the voice signal is ensured.
The first SCN module is used for performing feature upsampling on the voice data in the effective data set.
And the second SCN module is used for down-sampling the voice characteristic diagram output by the decoder unit.
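To make the sinc-interpolation idea concrete, the sketch below up-samples a (B, C, L) signal by zero-stuffing and convolving with a windowed-sinc low-pass kernel; the kernel length, the Hamming window and the fixed (non-learned) kernel are assumptions of the example rather than the SCN module itself.

    import math
    import torch
    import torch.nn.functional as F

    def sinc_upsample(x: torch.Tensor, factor: int = 2, half_width: int = 16) -> torch.Tensor:
        """Windowed-sinc interpolation up-sampling of a (B, C, L) signal (sketch)."""
        b, c, l = x.shape
        up = torch.zeros(b, c, l * factor, device=x.device, dtype=x.dtype)
        up[..., ::factor] = x                                   # insert zeros between samples
        t = torch.arange(-half_width * factor, half_width * factor + 1,
                         device=x.device, dtype=x.dtype)
        kernel = torch.sinc(t / factor)                         # ideal low-pass interpolation kernel
        kernel = kernel * torch.hamming_window(kernel.numel(), periodic=False,
                                               device=x.device, dtype=x.dtype)
        kernel = kernel / kernel.sum() * factor                 # preserve the signal amplitude
        kernel = kernel.view(1, 1, -1).repeat(c, 1, 1)          # one identical kernel per channel
        return F.conv1d(up, kernel, padding=half_width * factor, groups=c)

Down-sampling, the role of the second SCN module, could reuse the same kernel with a strided convolution.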
The CNN module is used for extracting a preliminary speech feature map of the speech data and outputting it to the CSSAtt module. The CNN module can extract local features of the speech signal and combine them in the deep layers of the network to obtain global features of the speech signal. The CNN module specifically comprises: a convolutional (or deconvolutional) network whose kernel size is k and whose stride is s, a ReLU activation function, a further convolutional network, and a GLU activation function.
The BiLSTM module is used for capturing the dependency of the speech data over time and mining the temporal correlation between the signal frames of the speech data.
The SASC module is bridged between peer layers of the encoder and the decoder; it guides the speech features to attend to their useful components, suppresses redundant features, and passes the same-dimension features of the speech data from the shallow network to the deep network by skip connection, so that the deep network can learn the shallow features and recover details from the shallow network. The module structure is shown in fig. 6 and specifically comprises the following processing steps:
S2-1-1: passing the voice data through the i-th encoder unit to obtain the encoding feature map F_e^i, of shape B×C×L, wherein B denotes the batch size, C the number of channels and L the data length;
S2-1-2: passing the voice data through the i-th decoder unit to obtain the decoding feature map F_d^i, of shape B×C×L;
S2-1-3: performing a self-attention operation on the encoding feature map F_e^i and on the decoding feature map F_d^i respectively to obtain their initial self-attention weights, then splicing and activating them to obtain the fused self-attention weight; the operation formula is:
W_e = SA(F_e^i), W_d = SA(F_d^i), W_f = ReLU(Concat(W_e, W_d)),
wherein SA(·) denotes the self-attention operation, W_e and W_d denote the initial self-attention weights, Concat(·) denotes the splicing operation in the channel dimension, ReLU denotes the first activation function, used to improve the ability of the neural network to fit non-linear functions, and W_f denotes the self-attention weight fused from the encoder and decoder peer layers;
S2-1-4: performing a self-attention operation and activation on the fused self-attention weight to obtain the skip attention weight coefficient of the encoder-decoder peer layers; the operation formula is:
α = σ(SA(W_f)),
wherein σ denotes the second activation function and α denotes the skip attention weight coefficient; in this embodiment the self-attention operations share the same neural network structure, namely 1x1 convolutional networks, and the first activation function and the second activation function are any two different activation functions (for example the Sigmoid, tanh, ReLU, Leaky ReLU, ELU, Mish, Swish or SiLU activation functions);
S2-1-5: adjusting the weight of every feature point of the encoding feature map F_e^i by the skip attention weight coefficient during the skip connection, splicing the result with the decoding feature map F_d^i, and outputting the skip-connected voice feature map processed by the SASC module; the operation formula is:
F_sasc = Concat(α ⊙ F_e^i, F_d^i),
wherein ⊙ denotes element-wise multiplication and F_sasc denotes the skip-connected voice feature map.
The CSSAtt module of the controller voice enhancement preliminary model is used to guide the model to attend to useful information in the channel dimension and the spatial dimension of the speech feature map respectively, mining features and optimizing the channel-space split attention parameters. The module structure is shown in fig. 6 and specifically comprises the following processing steps:
S2-2-1: inputting a batch of the preliminary voice feature maps, dividing the batch into G groups of sub-feature maps, and splitting each group of sub-feature maps into two branch sub-feature maps X_c and X_s, each of shape B×(C/2G)×L; wherein X_c denotes the channel-branch sub-feature map, X_s denotes the spatial-branch sub-feature map, B denotes the batch size, C the number of channels, L the data length, and G the preset number of groups;
S2-2-2: based on X_c, generating the initialized channel attention weight through an adaptive average pooling operation over the channel dimension; the operation formula is:
W_c = AvgPool(X_c),
wherein AvgPool(·) denotes the adaptive average pooling operation, the subscript c denotes the channel dimension, and W_c denotes the initialized channel attention weight;
S2-2-3: based on X_s, generating the initialized spatial attention weight through a group normalization operation over the spatial dimension; the operation formula is:
W_s = GN(X_s),
wherein GN(·) denotes the group normalization operation, the subscript s denotes the spatial dimension, and W_s denotes the initialized spatial attention weight;
S2-2-4: mining, through learnable parameters, the feature dependency of X_c in the channel dimension and of X_s in the spatial dimension, and generating the attention weight coefficients in the channel dimension and the spatial dimension after activation by an activation function; the operation formula is:
A_c = σ(w_1 ⊙ W_c), A_s = σ(w_2 ⊙ W_s),
wherein w_1 and w_2 denote the learnable parameters, σ denotes the second activation function, ⊙ denotes element-wise multiplication, and A_c and A_s denote the channel attention weight coefficient and the spatial attention weight coefficient respectively;
S2-2-5: adjusting the weight of every feature point of X_c and of X_s by the channel attention weight coefficient and the spatial attention weight coefficient respectively, splicing the adjusted X_c and X_s sub-feature maps, enabling information exchange between different groups of sub-feature maps by a channel shuffle operation, and outputting the voice feature map processed by the CSSAtt module; the operation formula is:
X_out = Shuffle(Concat(A_c ⊙ X_c, A_s ⊙ X_s)),
wherein Concat(·) denotes the feature-map splicing operation in the channel dimension, Shuffle(·) denotes the channel shuffle operation, and X_out denotes the voice feature map processed by the CSSAtt module; the channel shuffle operation specifically scrambles the channel order between the different sub-feature maps within the same batch of feature maps, so that the features of different channels are linked, information is exchanged between different sub-feature maps, and common features are easier to learn.
S3: establishing a multitask loss function of the controller voice enhancement preliminary model based on a controller voice enhancement task and a controller voice recognition task;
the step S3 includes the steps of:
s3-1: constructing a loss function facing a voice enhancement task based on LAE (Least Absolute Error), directly measuring the Error between the output waveform and the real waveform of the controller voice enhancement preliminary model in the time domain, and recording as the LAE loss function
Figure 242510DEST_PATH_IMAGE052
S3-2: constructing a frequency-domain loss function for the speech enhancement task based on multi-resolution STFT (Short-Time Fourier Transform) magnitude spectra, which measures the error between the STFT magnitude spectra of the output speech of the controller voice enhancement preliminary model and of the real speech, denoted the STFT loss function L_STFT;
S3-2-1: constructing triplet parameters for the STFT operation, of the form:
[sampling points, frame shift, window frame];
the sampling-point parameter is the number N of sampling points that form one speech frame out of all the sampling points of the speech signal, the frame-shift parameter is the time difference between the start positions of two adjacent frames, and the window-frame parameter is the type of window function applied to the speech signal;
in the present invention, the triplet values are [512, 100, Hamming window], [1024, 200, Hamming window] and [256, 50, Hamming window].
Taking the first triplet as an example, the sampling-point parameter is 512, i.e. 512 sampling points form one frame; the frame-shift parameter is set to 100, i.e. the frame shift is 100 sampling points, and if a frame has fewer than 512 sampling points it is padded with zeros; the window-frame parameter selects the Hamming window function.
S3-2-2: constructing the STFT magnitude spectrum of the speech with the i-th group of triplet parameters, and denoting the STFT loss function of the STFT magnitude spectrum constructed with the i-th group of triplet parameters as L_STFT^(i);
S3-2-3: constructing the STFT loss function as:
L_STFT = (1/M) · Σ_{i=1}^{M} L_STFT^(i)
wherein M denotes the number of groups of triplet parameters; in the present invention M is taken as 3.
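A minimal PyTorch sketch of a multi-resolution STFT magnitude loss of this kind is given below for illustration; it assumes that the sampling-point and frame-shift entries of the three Hamming-window triplets above map onto n_fft/win_length and hop_length, and that each per-resolution error is the F-norm of the magnitude-spectrum difference averaged over the M = 3 resolutions. The function names are illustrative, not part of the invention.

import torch

# (sampling points, frame shift) of the three Hamming-window triplets
RESOLUTIONS = [(512, 100), (1024, 200), (256, 50)]

def stft_magnitude(x, n_fft, hop):
    # x: [B, L] waveform -> STFT magnitude spectrum [B, F, T]
    window = torch.hamming_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, win_length=n_fft,
                      window=window, return_complex=True)
    return spec.abs()

def multi_resolution_stft_loss(clean, enhanced):
    # average of the per-resolution magnitude-spectrum errors (M = 3)
    losses = []
    for n_fft, hop in RESOLUTIONS:
        mag_clean = stft_magnitude(clean, n_fft, hop)
        mag_enh = stft_magnitude(enhanced, n_fft, hop)
        losses.append(torch.norm(mag_clean - mag_enh, p="fro"))
    return sum(losses) / len(losses)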
S3-3: constructing a multi-resolution feature loss function for the speech recognition task, denoted as the feature loss function L_Feat;
S3-3-1: performing critical-band integration, loudness pre-emphasis, cubic-root compression, inverse Fourier transform and linear prediction on the STFT magnitude spectrum to obtain Perceptual Linear Prediction (PLP) acoustic features, and establishing a loss function based on the perceptual linear prediction features that measures the error between the PLP acoustic features of the output speech of the controller voice enhancement preliminary model and those of the real speech, denoted as the PLP loss function L_PLP;
S3-3-2: performing Mel filtering and logarithmic transformation on the STFT magnitude spectrum to obtain Filter Bank (FBANK) acoustic features, and establishing a loss function based on the filter bank features that measures the error between the filter bank features of the output speech of the controller voice enhancement preliminary model and those of the real speech, denoted as the FBANK loss function L_FBANK;
S3-3-3: performing a discrete cosine transform on the filter bank acoustic features to obtain Mel cepstral coefficient acoustic features, and establishing a loss function based on the Mel-Frequency Cepstral Coefficient (MFCC) features that measures the error between the MFCC features of the model output speech and those of the real speech, denoted as the MFCC loss function L_MFCC;
S3-3-4: constructing the feature loss function L_Feat for the speech recognition task as:
L_Feat = L_PLP + L_FBANK + L_MFCC
wherein the STFT loss function L_STFT and the feature loss function L_Feat have the same operation form:
L_φ = || φ(s) - φ(ŝ) ||_F
where φ(·) denotes the STFT magnitude spectrum, the PLP feature, the FBANK feature or the MFCC feature, s denotes the clean speech signal, ŝ denotes the enhanced speech signal, and || · ||_F denotes the F-norm;
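As an illustration of this common form for the recognition-oriented features, the sketch below evaluates ||φ(s) - φ(ŝ)||_F with torchaudio's MelSpectrogram and MFCC transforms standing in for the FBANK and MFCC pipelines of S3-3; the PLP branch is omitted, and the filter-bank settings (n_fft, hop length, number of Mel bands) are assumptions of this sketch, not values given by the invention.

import torch
import torchaudio

SAMPLE_RATE = 8000   # sampling rate of the ground-air call data in the embodiment below

# FBANK stand-in: STFT magnitude -> Mel filtering -> logarithm
fbank = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=512, hop_length=100, n_mels=40)
# MFCC stand-in: discrete cosine transform of the log Mel filter-bank features
mfcc = torchaudio.transforms.MFCC(
    sample_rate=SAMPLE_RATE, n_mfcc=13,
    melkwargs={"n_fft": 512, "hop_length": 100, "n_mels": 40})

def feature_loss(clean, enhanced, eps=1e-8):
    # ||phi(clean) - phi(enhanced)||_F accumulated over the FBANK and MFCC features
    fb_clean = (fbank(clean) + eps).log()
    fb_enh = (fbank(enhanced) + eps).log()
    loss = torch.norm(fb_clean - fb_enh, p="fro")
    loss = loss + torch.norm(mfcc(clean) - mfcc(enhanced), p="fro")
    return loss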
S3-4: constructing the multitask loss function of the controller voice enhancement preliminary model by weighted summation, with the calculation formula:
L_total = λ_1 · L_LAE + λ_2 · L_STFT + λ_3 · L_Feat
wherein L_total is the multitask loss function, and λ_1, λ_2 and λ_3 respectively denote the preset weights of the LAE loss function L_LAE, the STFT loss function L_STFT and the feature loss function L_Feat; in the present invention all three preset weights are set to 1.
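Purely as an illustrative sketch that reuses the loss sketches given above, the weighted summation of S3-4 with all three weights preset to 1 might be written as follows; interpreting the LAE term as the mean absolute time-domain error is an assumption of the sketch.

import torch

def multitask_loss(clean, enhanced, w_lae=1.0, w_stft=1.0, w_feat=1.0):
    # L_total = w_lae * L_LAE + w_stft * L_STFT + w_feat * L_Feat
    l_lae = torch.mean(torch.abs(clean - enhanced))       # time-domain least absolute error (S3-1)
    l_stft = multi_resolution_stft_loss(clean, enhanced)  # frequency-domain STFT loss (S3-2)
    l_feat = feature_loss(clean, enhanced)                # recognition-oriented feature loss (S3-3)
    return w_lae * l_lae + w_stft * l_stft + w_feat * l_feat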
S4: iteratively updating model parameters of the controller voice enhancement preliminary model through a gradient descent neural network training algorithm based on the multitask loss function and the effective data set, and outputting a controller voice enhancement model;
S4-1: randomly acquiring a plurality of original clean speech-noisy speech data pairs from the effective training set as a training set D = {(s, x)}, and extracting the pure noise data waveform as the difference between the noisy speech waveform and the clean speech waveform, with the operation formula:
n = x ⊖ s
wherein s denotes the clean speech waveform of the original clean speech data in the original clean speech-noisy speech data pair, x denotes the noisy speech waveform of the noisy speech data in the original clean speech-noisy speech data pair, and n denotes the pure noise waveform; all three have the shape [B, C, L], where B denotes the batch size, C denotes the number of channels, L denotes the data length, and ⊖ denotes element-wise subtraction;
S4-2: randomly shuffling the distribution of the pure noise waveforms within the training set, and adding them to the clean speech waveforms to obtain enhanced noisy speech waveforms, with the operation formulas:
n' = Shuffle(n)
x' = s ⊕ n'
wherein Shuffle(·) denotes the data shuffling operation applied to the clean speech waveform-pure noise waveform data pairs D_n = {(s, n)}, x' denotes the enhanced noisy speech waveform, and ⊕ denotes element-wise addition;
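A minimal PyTorch sketch of this noise-recombination augmentation (S4-1 to S4-3) is given below; the tensor shape [B, C, L] follows the description above, while the function name and the use of a random batch permutation are assumptions of the sketch.

import torch

def remix_noise(clean, noisy):
    # clean, noisy: [B, C, L] waveforms of the original clean speech-noisy speech pairs
    noise = noisy - clean                      # S4-1: extract the pure noise waveforms
    perm = torch.randperm(clean.shape[0])      # S4-2: shuffle the noise distribution across the batch
    augmented_noisy = clean + noise[perm]      # re-mix to obtain enhanced noisy speech
    return clean, augmented_noisy              # pairs of the second training set (S4-3)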
S4-3: combining the clean speech waveforms and the enhanced noisy speech waveforms into a new training set, denoted as the second training set D' = {(s, x')};
S4-4: iteratively updating the model parameters of the controller voice enhancement preliminary model through a gradient descent algorithm based on the second training set and the multitask loss function, verifying during training whether the controller voice enhancement preliminary model has converged on the effective verification set, and outputting the current controller voice enhancement preliminary model as the controller voice enhancement model after the model training converges;
the basis for judging model training convergence is as follows: the multitask loss function of the preliminary model is calculated on the effective verification set every m iteration rounds, and when the multitask loss function no longer decreases after n consecutive calculations, the model training is considered to have converged; in the invention, m is set to 10 and n is set to 5;
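This convergence criterion can be tracked with a small counter, as in the sketch below; the surrounding training loop and the exact definition of "no longer decreases" (strictly lower than the best value seen so far) are assumptions of this sketch.

class ConvergenceMonitor:
    # Declares convergence after n consecutive validation checks without improvement.
    def __init__(self, n: int = 5):
        self.n = n
        self.best = float("inf")
        self.stale = 0

    def update(self, validation_loss: float) -> bool:
        if validation_loss < self.best:
            self.best = validation_loss
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.n   # True once training is considered converged

# usage: call monitor.update(validation_multitask_loss) every m = 10 iteration rounds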
S4-5: testing the model with the labeled effective test set.
S5: inputting the controller voice to be enhanced into the controller voice enhancement model, and outputting the corresponding enhanced voice.
Example 3
The embodiment is an actual operation analysis of the method of the present invention under the following data conditions, and is used for verifying the feasibility and performance of the technical scheme of the present invention, and specifically includes the following steps:
1. Data preparation: speech data are collected in a real control scene and preprocessed according to the preprocessing scheme provided by the present invention to form the effective data set required by the speech enhancement method, and a training set, a verification set and a test set are formed according to the data set division step of the method. The data sets are described as follows:
Training set: 47253 pieces of data (42.83 hours) in total, including 42189 pieces of Chinese data (37.28 hours) and 5064 pieces of English data (5.55 hours);
Verification set: 4764 pieces of data (4.31 hours) in total, including 4188 pieces of Chinese data (3.69 hours) and 558 pieces of English data (0.62 hours);
Test set: 6514 pieces of data (5.62 hours) in total, including 6012 pieces of Chinese data (5.08 hours) and 502 pieces of English data (0.54 hours);
The training set and the verification set are taken from speech data of the same date, the test set is taken from speech data of dates different from those of the training and verification sets, and the sampling rate of all data is 8 kHz. The test results of this embodiment are the speech enhancement and speech recognition results obtained on the test set.
2. Speech enhancement baseline model: in this embodiment, the model formed by the SCN modules, the CNN modules and the BiLSTM modules in step S2 is used as the baseline model; the loss function is the non-multiresolution loss function of steps S3-1 and S3-2 oriented only to speech enhancement, and the data enhancement mechanism of the present invention is used in the model training phase. The model input and output are both raw waveforms.
The baseline model and the model of the present invention are implemented with the PyTorch framework. The hyper-parameter configuration for model training is described below, followed by a minimal optimizer sketch:
(1) Optimizer: the Adam optimizer is adopted, with an initial learning rate of 0.0003 and a learning-rate decay rate of 0.999;
(2) Model hyper-parameters: the number of feature channels is 48, the convolution kernel size is 8, and the stride is 4;
(3) Batch training size: 32.
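For illustration, this optimizer configuration might be set up in PyTorch as follows; interpreting the decay rate 0.999 as a per-epoch exponential learning-rate factor is an assumption of the sketch.

import torch

def make_optimizer(model: torch.nn.Module):
    # Adam optimizer, initial learning rate 0.0003, exponential learning-rate decay 0.999
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)
    return optimizer, scheduler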
The hardware environment adopted for the experiments is as follows: Ubuntu Linux 16.04 operating system, 2 × Intel Core i7-6800K CPUs, 2 × NVIDIA GeForce RTX 2080Ti graphics cards with 2 × 11 GB of video memory, and 64 GB of system memory.
Under the above training data and configuration conditions, a total of 5 sets of experiments were performed to respectively prove the advantages of the model and the loss function proposed by the present invention, which are specifically as follows:
a1: training a baseline model on the effective training data;
a2: on the basis of A1, changing a non-multiresolution loss function into a multiresolution loss function, and training on the effective training data;
a3: on the basis of A2, adding an SASC module and a CSSAtt module, and training on the training data;
a4: on the basis of A2, adopting a multi-task loss function facing to a voice enhancement task and a voice recognition task at the same time, and training on the training data;
a5: on the basis of A2, adding the module A3 and adopting the multitask loss function A4, and training on the training data;
the experimental description is shown in fig. 7.
3. Speech enhancement evaluation indexes:
The experiments adopt objective speech quality evaluation indexes to measure the speech enhancement effect of the models, as follows:
(1) PESQ (Perceptual Evaluation of Speech Quality): perceptual speech quality evaluation, valued between 0 and 5; a higher value represents a better enhancement effect;
(2) CSIG (MOS prediction of the signal distortion attending only to the speech signal): mean opinion score of speech distortion, valued between 1 and 5; a higher value represents a better enhancement effect;
(3) CBAK (MOS prediction of the intrusiveness of background noise): mean opinion score of background noise intrusiveness, valued between 1 and 5; a higher value represents a better enhancement effect;
(4) COVL (MOS prediction of the overall effect): mean opinion score of the overall effect, valued between 1 and 5; a higher value represents a better enhancement effect;
(5) STOI (Short-Time Objective Intelligibility): short-time objective intelligibility, a percentage between 0 and 100; a higher value represents a better enhancement effect.
Taking the clean speech as the reference, the noisy speech and the enhanced speech are each scored against the clean speech in pairs to obtain the above evaluation indexes; a higher score indicates that the speech waveform characteristics are closer to those of the clean speech, i.e. better speech quality.
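For reference, the PESQ and STOI scores of the noisy-vs-clean and enhanced-vs-clean pairs can be computed with the third-party pesq and pystoi packages; their availability and APIs are an assumption of this sketch rather than part of the invention, and the narrowband PESQ mode is chosen because the data in this embodiment are sampled at 8 kHz.

from pesq import pesq     # pip install pesq   (assumed third-party package)
from pystoi import stoi   # pip install pystoi (assumed third-party package)

def score_pair(clean, degraded, fs=8000):
    # clean / degraded: 1-D numpy waveforms of the reference and the evaluated speech
    return {
        "PESQ": pesq(fs, clean, degraded, "nb"),      # narrowband mode for 8 kHz audio
        "STOI": 100.0 * stoi(clean, degraded, fs),    # reported as a percentage in this embodiment
    }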
4. Speech recognition model: in the experiments, the existing DeepSpeech2 acoustic model (DS2) is used as the speech recognition verification model, and the DS2 model does not need to be retrained. The experimental results obtained in the 5 groups of experiments of this embodiment are each passed through the DS2 model to obtain instruction texts, the clean/noisy test data are passed through the DS2 model to obtain clean/noisy instruction texts, and all instruction text results are analyzed and compared with the clean instruction texts as the reference;
5. Speech recognition evaluation index:
The Character Error Rate (CER) based on Chinese characters and English letters is adopted to measure the speech recognition effect; a lower value represents a better recognition effect. The CER is calculated as:
CER = (I + D + S) / N
where N is the length of the real instruction text, and I, D and S respectively denote the numbers of insertion, deletion and substitution operations required to convert the predicted instruction text into the real instruction text.
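A minimal sketch of this calculation is shown below: the insertion, deletion and substitution counts follow from the Levenshtein alignment between the predicted and real instruction texts, and the total is divided by the length of the real text (the function name is illustrative).

def character_error_rate(reference: str, hypothesis: str) -> float:
    # dp[i][j] = minimum edit operations turning hypothesis[:j] into reference[:i]
    n, m = len(reference), len(hypothesis)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                  # deletions
    for j in range(m + 1):
        dp[0][j] = j                  # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1   # substitution
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[n][m] / max(n, 1)       # (I + D + S) / N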
6. The experimental results are as follows:
The experimental results of the present invention are shown in fig. 8 and fig. 9. The results show that both the modules and the loss function mechanisms proposed by the present invention improve the speech enhancement and speech recognition effects on the data set of this embodiment. Specifically:
(1) Comparison of experiments A1 and A2 shows that, after the multi-resolution mechanism of the loss function is introduced, both the speech enhancement effect and the speech recognition effect are improved over the baseline model. This indicates that setting multiple groups of Fourier transform triplet parameters constructs the speech magnitude spectrum from multiple aspects and at multiple scales, which helps the model mine effective speech information in depth and thereby supports both the speech enhancement task and the speech recognition task.
(2) Experiment A3 shows that introducing the SASC module and the CSSAtt module proposed by the present invention on the basis of experiment A2 to enhance the noisy data yields objective evaluation scores superior to those of the reference model A2 and of model A4 using the multitask loss. This indicates that the proposed modules benefit the long-distance transmission and reconstruction of speech features when the model network is deep, and help the model capture speech information from multiple dimensions when analyzing the speech feature map, thereby improving the robustness of the model and the enhancement effect.
(3) Experiment A4 shows that introducing the multitask loss proposed by the present invention on the basis of experiment A2 yields a better speech recognition effect than the other models that do not adopt the multitask loss, while all experiments are tested with the same existing speech recognition model. This indicates that, without retraining the speech recognition model, directly taking the characteristics of the speech recognition task into account in the front-end enhancement task and using the model to learn the common recognition-oriented feature representation of noisy and clean speech makes the optimized speech data easier to recognize.
(4) Experiment A5 shows that simultaneously introducing the multi-resolution mechanism, the proposed model modules and the multitask loss function enables the baseline model to achieve the best speech enhancement and speech recognition performance on the test set of this embodiment, which proves the effectiveness of the method provided by the present invention.
Example 4
As shown in fig. 10, a controller voice enhancement apparatus oriented to voice recognition comprises at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, so that the at least one processor can perform the speech recognition-oriented controller voice enhancement method described in the foregoing embodiments. The input/output interface may comprise a display, a keyboard, a mouse and a USB interface for inputting and outputting data; the power supply is used for supplying electric energy to the electronic device.
Those skilled in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as a removable Memory device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.
When the integrated unit of the present invention is implemented in the form of a software functional unit and sold or used as a separate product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A controller voice enhancement method facing voice recognition is characterized by comprising the following steps:
s1: acquiring an original clean voice-voice with noise data pair of a ground-air call to form an original data set, and outputting an effective data set after preprocessing and labeling the original data set;
s2: building a controller voice enhancement preliminary model based on a neural network structure;
s3: establishing a multitask loss function of the controller voice enhancement preliminary model based on a controller voice enhancement task and a controller voice recognition task;
s4: iteratively updating model parameters of the controller voice enhancement preliminary model through a gradient descent neural network training algorithm based on the multitask loss function and the effective data set, and outputting a controller voice enhancement model;
s5: and inputting the voice of the controller to be enhanced into the controller voice enhancement model, and outputting corresponding enhanced voice.
2. The method for speech enhancement of controller for speech recognition according to claim 1, wherein the step S1 comprises the steps of:
s1-1: acquiring an original clean voice-voice data pair with noise of a ground-air communication to form an original data set;
the method for acquiring the original clean voice-noisy voice data pair comprises the following steps:
on the basis of the existing internal speech system, adding an auxiliary internal speech system to each empty pipe seat, and simultaneously acquiring the speech of a controller through the auxiliary internal speech system and the existing internal speech system to obtain the original clean speech-noisy speech data pair;
s1-2: preprocessing an original clean voice-voice data pair with noise in the original data set, and outputting the preprocessed original clean voice-voice data pair with noise; the preprocessing comprises voice activity detection, speaker role classification, redundant data screening and time sequence alignment;
s1-3: randomly dividing the preprocessed original clean voice-voice data pairs with noise into an effective training set, an effective verification set and an effective test set, manually labeling the data pairs of the effective test set, and outputting the effective training set, the effective verification set and the labeled effective test set as effective data sets; and the manual labeling content is an instruction text corresponding to the original clean voice-voice data with noise.
3. The controller voice enhancement method facing voice recognition according to claim 2, wherein the controller voice enhancement preliminary model comprises a first SCN module, a second SCN module, a plurality of encoder units and corresponding decoder units; the first SCN module is arranged between the input end of the preliminary model and the input end of the encoder unit; the second SCN module is arranged between the output end of the preliminary model and the output end of the decoder unit; the encoder unit comprises a CNN module and a CSSAtt module; the decoder unit comprises a CNN module and a CSSAtt module; the encoder unit and the corresponding decoder unit are connected through a BiLSTM module and an SASC module;
the first SCN module is used for performing feature upsampling on the voice data in the effective data set;
the second SCN module is used for down-sampling the voice feature map output by the decoder unit;
the CNN module is used for extracting a preliminary voice feature map of the voice data and outputting the preliminary voice feature map to the CSSAtt module;
the BiLSTM module is used for capturing the dependency relationship of the time sequence change of the voice data and mining the time sequence correlation between the signal frames of the voice data;
the SASC module is erected between peer layers of the encoder and the decoder and transfers the same-dimension characteristics of the voice data from a shallow network to a deep network in a skipping mode;
and the CSSAtt module is used for guiding the preliminary model to respectively mine features from the channel dimension and the space dimension of the preliminary voice feature map and optimizing the segmentation attention parameter of the channel space.
4. The speech enhancement method for controller facing speech recognition according to claim 3, wherein the SASC module comprises the following operation steps:
S2-1-1: obtaining a coding feature map E_i after the voice data is encoded by the i-th encoder unit, E_i ∈ R^(B×C×L), wherein B represents the batch size, C represents the number of channels, and L represents the data length;
S2-1-2: obtaining a decoding feature map D_i after the voice data is decoded by the i-th decoder unit, D_i ∈ R^(B×C×L);
S2-1-3: respectively performing self-attention operations on the coding feature map E_i and the decoding feature map D_i to obtain the initial self-attention weights of the coding feature map E_i and the decoding feature map D_i, and splicing and activating them to obtain a fused self-attention weight, with the operation formulas:
A_E = SA(E_i)
A_D = SA(D_i)
A_f = δ(Concat(A_E, A_D))
wherein SA(·) represents the self-attention operation, A_E and A_D represent the initial self-attention weights, Concat(·) represents the splicing operation in the channel dimension, δ represents the first activation function, and A_f represents the self-attention weight of the encoder-to-decoder peer-layer fusion;
S2-1-4: performing a self-attention operation and activation processing on the fused self-attention weight to obtain the skip attention weight coefficient of the encoder and decoder peer layers, with the operation formula:
A_skip = σ(SA(A_f))
wherein σ represents the second activation function and A_skip represents the skip attention weight coefficient;
S2-1-5: adjusting the weight of each feature point of the coding feature map E_i during the skip connection according to the skip attention weight coefficient, splicing it with the decoding feature map D_i, and outputting the skip-connection voice feature map processed by the SASC module, with the operation formulas:
E_i' = A_skip ⊙ E_i
F_skip = Concat(E_i', D_i)
wherein ⊙ represents element-wise multiplication and F_skip represents the skip-connection voice feature map.
5. The speech enhancement method for controllers facing speech recognition according to claim 3, wherein the CSSAtt module comprises the following operation steps:
S2-2-1: inputting a batch of the preliminary voice feature maps, dividing the batch of preliminary voice feature maps into G groups of sub-feature maps, and dividing each group of sub-feature maps into two branch sub-feature maps X_c and X_s along the channel dimension; wherein X_c represents the channel branch sub-feature map, X_s represents the spatial branch sub-feature map, B represents the batch size, C represents the number of channels, L represents the data length, and G represents the preset number of groups;
S2-2-2: generating an initialized channel attention weight from X_c through an adaptive average pooling operation in the channel dimension, with the operation formula:
s_c = AvgPool_c(X_c)
wherein AvgPool(·) represents the adaptive average pooling operation, the subscript c represents the channel dimension, and s_c represents the initialized channel attention weight;
S2-2-3: generating an initialized spatial attention weight from X_s through a group normalization operation in the spatial dimension, with the operation formula:
g_s = GN_s(X_s)
wherein GN(·) represents the group normalization operation, the subscript s represents the spatial dimension, and g_s represents the initialized spatial attention weight;
S2-2-4: mining the feature dependency of X_c in the channel dimension and of X_s in the spatial dimension through learnable parameters, and generating the attention weight coefficients in the channel dimension and the spatial dimension after activation by an activation function, with the operation formulas:
A_c = σ(W_1 · s_c + b_1)
X_c' = A_c ⊙ X_c
A_s = σ(W_2 · g_s + b_2)
X_s' = A_s ⊙ X_s
wherein W_1, b_1, W_2 and b_2 represent learnable parameters, σ represents the second activation function, ⊙ represents element-wise multiplication, and A_c and A_s respectively represent the channel attention weight coefficient and the spatial attention weight coefficient;
S2-2-5: adjusting the weight of each feature point of X_c and of X_s according to the channel attention weight coefficient and the spatial attention weight coefficient respectively, splicing the adjusted X_c and X_s into a sub-feature map, enabling information communication among different groups of the sub-feature maps by using a channel shuffle operation, and outputting the voice feature map processed by the CSSAtt module, with the operation formulas:
X' = Concat(X_c', X_s')
Y = Shuffle(X')
wherein Concat(·) represents the feature map splicing operation in the channel dimension, Shuffle(·) represents the channel shuffle operation, and Y represents the voice feature map processed by the CSSAtt module.
6. The speech recognition-oriented controller speech enhancement method according to claim 1, wherein the step S3 comprises the steps of:
S3-1: constructing a time-domain loss function for the voice enhancement task based on the LAE, which directly measures the error between the output waveform of the controller voice enhancement preliminary model and the real waveform in the time domain, denoted as the LAE loss function L_LAE;
S3-2: constructing a frequency-domain loss function for the voice enhancement task based on the multi-resolution STFT magnitude spectra, which measures the error between the STFT magnitude spectra of the output voice of the controller voice enhancement preliminary model and of the real voice, denoted as the STFT loss function L_STFT;
S3-3: constructing a multi-resolution feature loss function for the voice recognition task, denoted as the feature loss function L_Feat;
S3-4: constructing the multitask loss function of the controller voice enhancement preliminary model by weighted summation, with the calculation formula:
L_total = λ_1 · L_LAE + λ_2 · L_STFT + λ_3 · L_Feat
wherein L_total is the multitask loss function, and λ_1, λ_2 and λ_3 respectively represent the preset weights of the LAE loss function L_LAE, the STFT loss function L_STFT and the feature loss function L_Feat.
7. The method for speech enhancement of controller for speech recognition according to claim 6, wherein the step S3-2 comprises the steps of:
S3-2-1: constructing triplet parameters for the STFT operation, wherein each triplet has the form:
[sampling points, frame shift, window function];
S3-2-2: constructing the STFT magnitude spectrum of the voice with the i-th group of triplet parameters, the STFT loss function of the STFT magnitude spectrum constructed with the i-th group of triplet parameters being L_STFT^(i);
S3-2-3: constructing the STFT loss function as:
L_STFT = (1/M) · Σ_{i=1}^{M} L_STFT^(i)
wherein M represents the number of groups of the triplet parameters.
8. The speech recognition-oriented controller speech enhancement method according to claim 6, wherein the step S3-3 comprises the following steps:
S3-3-1: performing critical-band integration, loudness pre-emphasis, cubic-root compression, inverse Fourier transform and linear prediction on the STFT magnitude spectrum to obtain perceptual linear prediction acoustic features, and establishing a loss function based on the perceptual linear prediction features that measures the error between the perceptual linear prediction acoustic features of the output voice of the controller voice enhancement preliminary model and those of the real voice, denoted as the PLP loss function L_PLP;
S3-3-2: performing Mel filtering and logarithmic transformation on the STFT magnitude spectrum to obtain filter bank acoustic features, and establishing a loss function based on the filter bank features that measures the error between the filter bank features of the output voice of the controller voice enhancement preliminary model and those of the real voice, denoted as the FBANK loss function L_FBANK;
S3-3-3: performing a discrete cosine transform on the filter bank acoustic features to obtain Mel cepstral coefficient acoustic features, and establishing a loss function based on the Mel cepstral coefficient features that measures the error between the Mel cepstral coefficient features of the model output voice and those of the real voice, denoted as the MFCC loss function L_MFCC;
S3-3-4: constructing the feature loss function L_Feat for the voice recognition task as:
L_Feat = L_PLP + L_FBANK + L_MFCC.
9. The method for speech enhancement of controller for speech recognition according to claim 2, wherein the step S4 comprises the steps of:
S4-1: randomly acquiring a plurality of original clean speech-noisy speech data pairs from the effective training set as a training set D = {(s, x)}, and extracting the pure noise data waveform as the difference between the noisy speech waveform and the clean speech waveform, with the operation formula:
n = x ⊖ s
wherein s represents the clean speech waveform of the original clean speech data in the original clean speech-noisy speech data pair, x represents the noisy speech waveform of the noisy speech data in the original clean speech-noisy speech data pair, and n represents the pure noise waveform; all three have the shape [B, C, L], wherein B represents the batch size, C represents the number of channels, L represents the data length, and ⊖ represents element-wise subtraction;
S4-2: randomly shuffling the distribution of the pure noise waveforms within the training set, and adding them to the clean speech waveforms to obtain enhanced noisy speech waveforms, with the operation formulas:
n' = Shuffle(n)
x' = s ⊕ n'
wherein Shuffle(·) represents the data shuffling operation applied to the clean speech waveform-pure noise waveform data pairs D_n = {(s, n)}, x' represents the enhanced noisy speech waveform, and ⊕ represents element-wise addition;
S4-3: combining the clean speech waveforms and the enhanced noisy speech waveforms into a new training set, denoted as the second training set D' = {(s, x')};
S4-4: iteratively updating the model parameters of the controller voice enhancement preliminary model through a gradient descent algorithm based on the second training set and the multitask loss function, verifying during training whether the controller voice enhancement preliminary model has converged on the effective verification set, and outputting the current controller voice enhancement preliminary model as the controller voice enhancement model after the model training converges;
the basis for judging model training convergence is as follows: the multitask loss function of the preliminary model is calculated on the effective verification set every m iteration rounds, and when the multitask loss function no longer decreases after n consecutive calculations, the model training is considered to have converged; m and n are preset values;
S4-5: testing the model with the labeled effective test set.
10. A controller voice enhancement apparatus oriented to voice recognition, comprising at least one processor, and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 9.
CN202210841871.0A 2022-07-18 2022-07-18 Controller voice enhancement method and device facing voice recognition Active CN115240648B (en)

Publications (2)

Publication Number Publication Date
CN115240648A true CN115240648A (en) 2022-10-25
CN115240648B CN115240648B (en) 2023-04-07



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220199095A1 (en) * 2019-06-21 2022-06-23 Industry-University Cooperation Foundation Hanyang University Method and apparatus for combined learning using feature enhancement based on deep neural network and modified loss function for speaker recognition robust to noisy environments
CN113096646A (en) * 2019-12-20 2021-07-09 北京世纪好未来教育科技有限公司 Audio recognition method and device, electronic equipment and storage medium
KR20220030120A (en) * 2020-09-02 2022-03-10 네이버 주식회사 Method and system for training speech recognition models using augmented consistency regularization
WO2022094293A1 (en) * 2020-10-29 2022-05-05 Dolby Laboratories Licensing Corporation Deep-learning based speech enhancement
CN112927709A (en) * 2021-02-04 2021-06-08 武汉大学 Voice enhancement method based on time-frequency domain joint loss function
CN113707134A (en) * 2021-08-17 2021-11-26 北京搜狗科技发展有限公司 Model training method and device for model training
CN114141238A (en) * 2021-11-26 2022-03-04 中国人民解放军陆军工程大学 Voice enhancement method fusing Transformer and U-net network
CN114694670A (en) * 2022-04-06 2022-07-01 华南理工大学 Multi-task network-based microphone array speech enhancement system and method
CN114648982A (en) * 2022-05-24 2022-06-21 四川大学 Controller voice recognition method and device based on comparative learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YI LIN ET AL.: "A Deep Learning Framework of Autonomous Pilot Agent for Air Traffic Controller Training" *
YI LIN ET AL.: "A Unified Framework for Multilingual Speech Recognition in Air Traffic Control Systems" *
WU XIANGYANG ET AL.: "Air traffic control speech recognition based on deep learning" *
GAO DENGFENG ET AL.: "Ground-air call speech enhancement method with a multi-feature fully convolutional network" *

Also Published As

Publication number Publication date
CN115240648B (en) 2023-04-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant