CN115240648A - Controller voice enhancement method and device for voice recognition

Controller voice enhancement method and device for voice recognition

Info

Publication number
CN115240648A
Authority
CN
China
Prior art keywords
voice
controller
speech
data
loss function
Prior art date
Legal status
Granted
Application number
CN202210841871.0A
Other languages
Chinese (zh)
Other versions
CN115240648B (en)
Inventor
余欣乘
林毅
张建伟
Current Assignee
Sichuan University
Original Assignee
Sichuan University
Priority date
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202210841871.0A
Publication of CN115240648A
Application granted
Publication of CN115240648B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/26 - Speech to text systems
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders, using predictive techniques
    • G10L19/16 - Vocoder architecture
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 - Processing in the frequency domain
    • G10L2021/02082 - Noise filtering, the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention relates to the field of civil aviation air traffic control and to the field of speech enhancement, and in particular to a voice recognition-oriented controller voice enhancement method and device. The method uses preprocessed clean speech-noisy speech pairs of air traffic control calls collected in a real scene as the data set, builds a controller voice enhancement preliminary model containing SASC and CSSAtt modules, and trains the neural network with a multitask loss function oriented simultaneously to the speech enhancement task and the speech recognition task. The trained model enhances existing noisy air traffic control controller speech, eliminating the echo influence, improving the clarity and intelligibility of the controller speech, and effectively increasing the accuracy of controller speech recognition.

Description

Controller voice enhancement method and device for voice recognition
Technical Field
The invention relates to the field of civil aviation air traffic control and the field of voice enhancement, in particular to a voice recognition-oriented controller voice enhancement method and device.
Background
In the field of Air Traffic Control (ATC), the main means of communication between a controller and a pilot is voice, and the voice signals are transmitted over Very High Frequency (VHF) radio. The controller issues voice commands to the pilot, and the pilot reads the commands back to the controller. This transmission-and-readback confirmation mechanism between controller and pilot keeps the air traffic control system running in an orderly way. Fig. 1 describes the generation and transmission process of air traffic control speech, which is as follows:
(1) The controller speaks into a microphone; the voice travels uplink through the ground-air communication intercom system and the communication server to the radio station, and is sent to the pilot;
(2) So that the controller knows whether a voice command has been safely delivered to the pilot, the air traffic control radio system uses a dedicated "return mechanism": when the voice command reaches the radio station, the station transmits it back to the controller's headset on the same radio frequency, so that the controller can hear the voice command just sent;
(3) After receiving the controller's voice command, the pilot reads it back, and the readback travels downlink through the radio station, the communication server and the ground-air communication intercom system to the controller, completing the command interaction;
(4) To give the control working positions a uniform voice interface, the intercom system merges the controller's uplink and downlink voice signals and the pilot's downlink voice signal by superposition and splicing; the merged voice can then be used for subsequent speech processing tasks such as speech recognition and voiceprint recognition.
The controller's voice command is transmitted on both the uplink and the downlink, and a time delay appears when the intercom system superimposes them, so the resulting controller voice signal is a "control echo" superimposed signal peculiar to the air traffic control voice system. Figs. 2 and 3 show the waveforms and corresponding spectrograms of the voices on the different transmission lines: the uppermost trace is the uplink voice, the middle trace the downlink voice, and the lowermost trace the mixed voice; the time delay between them is marked in Fig. 2. As shown in Figs. 2 and 3, when the controller voice signals are superimposed, the voice waveform and the corresponding spectrogram carry additional "echo" noise (see the boxes in Fig. 3). In addition, owing to the complexity of the air traffic environment, the voice signal is also affected during propagation by factors such as the collection equipment, the transmission devices, the weather and the speaker characteristics, which add further noise pollution. Noise in the speech segment reduces its intelligibility and clarity and corrupts the representational characteristics of the signal, thereby degrading the subsequent speech recognition task.
Reduced intelligibility and clarity of the controller voice signal impairs auditory perception in subsequent voice analysis tasks, lowers the accuracy with which the voice content can be obtained, and hinders analysis of the voice information. Moreover, current speech recognition methods show that the recognition accuracy on controller speech containing echo is markedly lower than on echo-free pilot speech. As the initiator of ground-air information exchange, the controller's voice information is essential to the orderly operation of the air traffic control system, and low recognition accuracy on controller speech severely affects subsequent speech processing tasks. There is therefore a need for a controller voice enhancement method and apparatus that can eliminate the echo effect, improve voice quality, and improve the accuracy of speech recognition.
Disclosure of Invention
The invention aims to solve the problems, existing in the prior art, of poor voice quality and a low speech recognition rate for controller speech carrying control echo in complex air traffic control radio communication scenes, and provides a voice recognition-oriented controller voice enhancement method and device.
In order to achieve the above purpose, the invention provides the following technical scheme:
A controller voice enhancement method for voice recognition, comprising the following steps:
S1: acquiring original clean speech-noisy speech data pairs of ground-air calls to form an original data set, and outputting an effective data set after preprocessing and labeling the original data set;
S2: building a controller voice enhancement preliminary model based on a neural network structure;
S3: establishing a multitask loss function of the controller voice enhancement preliminary model based on a controller voice enhancement task and a controller voice recognition task;
S4: iteratively updating the model parameters of the controller voice enhancement preliminary model by a gradient-descent neural network training algorithm based on the multitask loss function and the effective data set, and outputting the controller voice enhancement model;
S5: inputting the controller voice to be enhanced into the controller voice enhancement model and outputting the corresponding enhanced voice. The method uses preprocessed clean speech-noisy speech pairs of air traffic control calls collected in a real scene as the data set, builds a controller voice enhancement preliminary model containing SASC and CSSAtt modules, and trains the neural network with a multitask loss function oriented simultaneously to the speech enhancement task and the speech recognition task; the trained model enhances existing noisy air traffic control controller speech, eliminating the echo influence, improving the clarity and intelligibility of the controller speech, and effectively increasing the accuracy of controller speech recognition.
As a preferable embodiment of the present invention, the step S1 includes the steps of:
S1-1: acquiring original clean speech-noisy speech data pairs of ground-air calls to form an original data set;
the method for acquiring an original clean speech-noisy speech data pair is as follows:
on the basis of the existing intercom system, an auxiliary intercom system is added at each air traffic control working position, and the controller speech is collected simultaneously through the auxiliary intercom system and the existing intercom system, yielding the original clean speech-noisy speech data pair;
S1-2: preprocessing the original clean speech-noisy speech data pairs in the original data set and outputting the preprocessed pairs; the preprocessing comprises voice activity detection, speaker role classification, redundant data screening and time-sequence alignment;
S1-3: randomly dividing the preprocessed clean speech-noisy speech data pairs into an effective training set, an effective verification set and an effective test set, manually labeling the data pairs of the effective test set, and outputting the effective training set, the effective verification set and the labeled effective test set as the effective data set; the manually labeled content is the instruction text corresponding to each clean speech-noisy speech data pair. The invention designs this preprocessing method for the raw speech according to the speech generation and collection mechanism of real air traffic control scenes, which effectively improves the efficiency of the processing operations and the accuracy of speech recognition after training, testing and speech enhancement with the model of the invention.
As a preferred scheme of the present invention, the controller speech enhancement preliminary model comprises a first SCN module, a second SCN module, a plurality of encoder units and corresponding decoder units; the first SCN module is arranged between the input end of the preliminary model and the input end of the encoder units; the second SCN module is arranged between the output end of the preliminary model and the output end of the decoder units; each encoder unit comprises a CNN module and a CSSAtt module; each decoder unit comprises a CNN module and a CSSAtt module; the encoder units and the corresponding decoder units are connected through a BiLSTM module and SASC modules;
the first SCN module is used for feature upsampling of the speech data in the effective data set;
the second SCN module is used for downsampling the speech feature map output by the decoder units;
the CNN module is used for extracting a preliminary speech feature map of the speech data and outputting it to the CSSAtt module; that is, the CNN module extracts local features of the speech signal and combines them in the deep layers of the network to obtain global features of the speech signal;
the BiLSTM module is used for capturing the dependency of the speech data over time and mining the temporal correlation between signal frames of the speech data;
the SASC module is bridged between peer (same-level) layers of the encoder and decoder and passes the same-dimension features of the speech data from the shallow network to the deep network by skip connection;
the CSSAtt module is used for guiding the preliminary model to mine features separately from the channel dimension and the spatial dimension of the preliminary speech feature map and to optimize the channel-space split attention parameters.
The invention designs a fully end-to-end controller voice enhancement model: the input and output of the model are raw speech waveforms and no other speech transformation steps are involved, so the controller voice enhancement model can be applied directly, without retraining the existing speech recognition model, to enhance and optimize the speech data fed to the existing speech recognition model in real, complex air traffic environments.
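The overall layout can be pictured with the PyTorch-style skeleton below. It is only a structural sketch: the channel sizes, kernel sizes and the plain up/down-sampling and concatenation placeholders standing in for the SCN, CSSAtt and SASC modules are illustrative assumptions, not the patented implementation.

    import torch
    import torch.nn as nn

    class EncoderUnit(nn.Module):
        """CNN + CSSAtt encoder unit (the CSSAtt is replaced by an identity here)."""
        def __init__(self, c_in, c_out):
            super().__init__()
            self.cnn = nn.Sequential(nn.Conv1d(c_in, c_out, kernel_size=8, stride=4), nn.ReLU())
            self.cssatt = nn.Identity()  # placeholder for the CSSAtt module
        def forward(self, x):
            return self.cssatt(self.cnn(x))

    class DecoderUnit(nn.Module):
        """CNN + CSSAtt decoder unit (mirror of the encoder unit)."""
        def __init__(self, c_in, c_out):
            super().__init__()
            self.cssatt = nn.Identity()  # placeholder for the CSSAtt module
            self.cnn = nn.Sequential(nn.ConvTranspose1d(c_in, c_out, kernel_size=8, stride=4), nn.ReLU())
        def forward(self, x):
            return self.cnn(self.cssatt(x))

    class EnhancerSkeleton(nn.Module):
        """Waveform-in / waveform-out skeleton: input up-sampling (first SCN), encoder
        stack, BiLSTM bottleneck, decoder stack with skip connections (SASC in the
        patent), output down-sampling (second SCN)."""
        def __init__(self, channels=(1, 32, 64)):
            super().__init__()
            self.scn_in = nn.Upsample(scale_factor=2, mode="linear", align_corners=False)
            self.encoders = nn.ModuleList(
                [EncoderUnit(channels[i], channels[i + 1]) for i in range(len(channels) - 1)])
            self.bottleneck = nn.LSTM(channels[-1], channels[-1] // 2,
                                      batch_first=True, bidirectional=True)
            self.decoders = nn.ModuleList(
                [DecoderUnit(2 * channels[i + 1], channels[i])
                 for i in reversed(range(len(channels) - 1))])
            self.scn_out = nn.Upsample(scale_factor=0.5, mode="linear", align_corners=False)

        def forward(self, wav):                              # wav: (B, 1, L) raw waveform
            x = self.scn_in(wav)
            skips = []
            for enc in self.encoders:
                x = enc(x)
                skips.append(x)
            x, _ = self.bottleneck(x.transpose(1, 2))        # BiLSTM over time frames
            x = x.transpose(1, 2)
            for dec in self.decoders:
                skip = skips.pop()
                n = min(x.shape[-1], skip.shape[-1])         # align lengths before splicing
                x = dec(torch.cat([x[..., :n], skip[..., :n]], dim=1))
            return self.scn_out(x)                           # enhanced waveform

A plain concatenation skip is used here purely so the skeleton runs; the SASC steps described next explain how the patented skip connection weights that splice instead.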
As a preferred scheme of the present invention, the SASC module includes the following operation steps:
S2-1-1: passing the voice data through the i-th encoder unit to obtain the encoding feature map F_e^i, of shape B×C×L, wherein B denotes the batch size, C the number of channels and L the data length;
S2-1-2: passing the voice data through the i-th decoder unit to obtain the decoding feature map F_d^i, of shape B×C×L;
S2-1-3: performing a self-attention operation on the encoding feature map F_e^i and on the decoding feature map F_d^i respectively to obtain their initial self-attention weights, then splicing and activating them to obtain the fused self-attention weight; the operation formula is:
W_e = SA(F_e^i), W_d = SA(F_d^i), W_f = ReLU(Concat(W_e, W_d)),
wherein SA(·) denotes the self-attention operation, W_e and W_d denote the initial self-attention weights, Concat(·) denotes the splicing operation in the channel dimension, ReLU denotes the first activation function, used to enhance the ability of the neural network to fit non-linear functions, and W_f denotes the self-attention weight fused from the encoder and decoder peer layers;
S2-1-4: performing a self-attention operation and activation on the fused self-attention weight to obtain the skip attention weight coefficient of the encoder-decoder peer layers; the operation formula is:
α = σ(SA(W_f)),
wherein σ denotes the second activation function and α denotes the skip attention weight coefficient;
S2-1-5: adjusting the weight of every feature point of the encoding feature map F_e^i by the skip attention weight coefficient during the skip connection, splicing the result with the decoding feature map F_d^i, and outputting the skip-connected voice feature map processed by the SASC module; the operation formula is:
F_sasc = Concat(α ⊙ F_e^i, F_d^i),
wherein ⊙ denotes element-wise multiplication and F_sasc denotes the skip-connected voice feature map.
According to the encoder-decoder structure of the model, the invention designs an SASC module for speech processing, which uses a self-attention mechanism to mine and analyse the useful features of the speech feature maps between peer layers of the model and to suppress redundant features, guiding the model to focus on the encoding-decoding regularities of the data features and helping the model converge better.
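A minimal PyTorch sketch of such a skip connection is given below; 1x1 convolutions stand in for the self-attention operation SA(·) (as in the embodiment described later), and the ReLU/Sigmoid pair is one admissible choice of the two activation functions. It is an illustration of the scheme above, not the patented implementation.

    import torch
    import torch.nn as nn

    class SASC(nn.Module):
        """Self-attention skip connection between encoder/decoder peer layers (sketch)."""
        def __init__(self, channels: int):
            super().__init__()
            self.sa_enc = nn.Conv1d(channels, channels, kernel_size=1)       # SA(.) on F_e
            self.sa_dec = nn.Conv1d(channels, channels, kernel_size=1)       # SA(.) on F_d
            self.sa_fuse = nn.Conv1d(2 * channels, channels, kernel_size=1)  # SA(.) on W_f
            self.act1 = nn.ReLU()      # first activation function
            self.act2 = nn.Sigmoid()   # second activation function

        def forward(self, f_enc: torch.Tensor, f_dec: torch.Tensor) -> torch.Tensor:
            w_enc = self.sa_enc(f_enc)                              # initial weight of F_e
            w_dec = self.sa_dec(f_dec)                              # initial weight of F_d
            w_fused = self.act1(torch.cat([w_enc, w_dec], dim=1))   # fused self-attention weight
            alpha = self.act2(self.sa_fuse(w_fused))                # skip attention coefficient
            return torch.cat([alpha * f_enc, f_dec], dim=1)         # F_sasc, shape (B, 2C, L)

For example, SASC(channels=64)(torch.randn(2, 64, 256), torch.randn(2, 64, 256)) returns a (2, 128, 256) feature map for the next decoder unit.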
As a preferred scheme of the invention, the CSSAtt module comprises the following operating steps:
S2-2-1: inputting a batch of the preliminary voice feature maps, dividing the batch into G groups of sub-feature maps, and splitting each group of sub-feature maps into two branch sub-feature maps X_c and X_s, each of shape B×(C/2G)×L; wherein X_c denotes the channel-branch sub-feature map, X_s denotes the spatial-branch sub-feature map, B denotes the batch size, C the number of channels, L the data length, and G the preset number of groups;
S2-2-2: based on X_c, generating the initialized channel attention weight through an adaptive average pooling operation over the channel dimension; the operation formula is:
W_c = AvgPool(X_c),
wherein AvgPool(·) denotes the adaptive average pooling operation, the subscript c denotes the channel dimension, and W_c denotes the initialized channel attention weight;
S2-2-3: based on X_s, generating the initialized spatial attention weight through a group normalization operation over the spatial dimension; the operation formula is:
W_s = GN(X_s),
wherein GN(·) denotes the group normalization operation, the subscript s denotes the spatial dimension, and W_s denotes the initialized spatial attention weight;
S2-2-4: mining, through learnable parameters, the feature dependency of X_c in the channel dimension and of X_s in the spatial dimension, and generating the attention weight coefficients in the channel dimension and the spatial dimension after activation by an activation function; the operation formula is:
A_c = σ(w_1 ⊙ W_c), A_s = σ(w_2 ⊙ W_s),
wherein w_1 and w_2 denote the learnable parameters, σ denotes the second activation function, ⊙ denotes element-wise multiplication, and A_c and A_s denote the channel attention weight coefficient and the spatial attention weight coefficient respectively;
S2-2-5: adjusting the weight of every feature point of X_c and of X_s by the channel attention weight coefficient and the spatial attention weight coefficient respectively, splicing the adjusted X_c and X_s sub-feature maps, enabling information exchange between different groups of sub-feature maps by a channel shuffle operation, and outputting the voice feature map processed by the CSSAtt module; the operation formula is:
X_out = Shuffle(Concat(A_c ⊙ X_c, A_s ⊙ X_s)),
wherein Concat(·) denotes the feature-map splicing operation in the channel dimension, Shuffle(·) denotes the channel shuffle operation, and X_out denotes the voice feature map processed by the CSSAtt module; the channel shuffle operation specifically scrambles the channel order between the different sub-feature maps within the same batch of feature maps, so that the features of different channels are linked, information is exchanged between different sub-feature maps, and common features are easier to learn.
The invention designs the CSSAtt module for speech processing, which groups a batch of speech feature maps, extracts channel-dimension and spatial-dimension features from each group of sub-feature maps separately, and finally fuses the dimension features of the different sub-feature maps to realize feature exchange between sub-feature maps, facilitating feature exchange between different samples and improving the robustness of the model to the data features.
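The following PyTorch sketch illustrates the grouping, the two attention branches and the channel shuffle of steps S2-2-1 to S2-2-5; the pooling axis, the shapes of the learnable parameters and the Sigmoid activation are assumptions made for the example.

    import torch
    import torch.nn as nn

    class CSSAtt(nn.Module):
        """Channel-space split attention over a (B, C, L) speech feature map (sketch)."""
        def __init__(self, channels: int, groups: int = 8):
            super().__init__()
            assert channels % (2 * groups) == 0
            self.groups = groups
            sub = channels // (2 * groups)                       # channels per branch sub-map
            self.avg_pool = nn.AdaptiveAvgPool1d(1)              # channel-branch pooling
            self.w_c = nn.Parameter(torch.ones(1, sub, 1))       # learnable channel parameter
            self.gn = nn.GroupNorm(sub, sub)                     # spatial-branch normalization
            self.w_s = nn.Parameter(torch.ones(1, sub, 1))       # learnable spatial parameter
            self.sigmoid = nn.Sigmoid()                          # second activation function

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, l = x.shape
            x = x.reshape(b * self.groups, c // self.groups, l)  # G groups of sub-feature maps
            x_c, x_s = x.chunk(2, dim=1)                         # channel / spatial branches
            a_c = self.sigmoid(self.w_c * self.avg_pool(x_c))    # channel attention coefficient
            a_s = self.sigmoid(self.w_s * self.gn(x_s))          # spatial attention coefficient
            out = torch.cat([a_c * x_c, a_s * x_s], dim=1)       # splice the weighted branches
            out = out.reshape(b, c, l)
            # channel shuffle: interleave channels across groups so sub-maps exchange information
            return out.reshape(b, self.groups, c // self.groups, l).transpose(1, 2).reshape(b, c, l)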
As a preferable embodiment of the present invention, the step S3 includes the steps of:
S3-1: constructing a loss function for the speech enhancement task based on the LAE, which directly measures, in the time domain, the error between the output waveform of the controller voice enhancement preliminary model and the real waveform, denoted the LAE loss function L_LAE;
S3-2: constructing a frequency-domain loss function for the speech enhancement task based on multi-resolution STFT magnitude spectra, which measures the error between the STFT magnitude spectra of the output speech of the controller voice enhancement preliminary model and of the real speech, denoted the STFT loss function L_STFT;
S3-3: constructing a multi-resolution feature loss function for the speech recognition task, denoted the feature loss function L_feat;
S3-4: constructing the multitask loss function of the controller voice enhancement preliminary model by weighted summation; the calculation formula is:
L_total = λ_1 · L_LAE + λ_2 · L_STFT + λ_3 · L_feat,
wherein L_total is the multitask loss function and λ_1, λ_2 and λ_3 denote the preset weights of the LAE loss function L_LAE, the STFT loss function L_STFT and the feature loss function L_feat respectively.
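By way of example, the time-domain LAE term and the weighted combination of step S3-4 could be written as follows; the weight values shown are placeholders, not the preset weights of the invention.

    import torch

    def lae_loss(enhanced: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
        """Least-absolute-error term: mean absolute waveform error in the time domain."""
        return torch.mean(torch.abs(enhanced - clean))

    def multitask_loss(l_lae, l_stft, l_feat, weights=(1.0, 1.0, 1.0)):
        """Weighted sum of the enhancement-oriented and recognition-oriented loss terms."""
        return weights[0] * l_lae + weights[1] * l_stft + weights[2] * l_feat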
As a preferred embodiment of the present invention, the step S3-2 includes the steps of:
S3-2-1: constructing triplet parameters for the STFT operation, of the form:
[sampling points, frame shift, window frame];
the sampling-point parameter is the number N of sampling points that form one speech frame out of all the sampling points of the speech signal, the frame-shift parameter is the time difference between the start positions of two adjacent frames, and the window-frame parameter is the type of window function applied to the speech signal;
S3-2-2: constructing the STFT magnitude spectrum of the speech with each group of triplet parameters; the STFT loss computed on the magnitude spectrum constructed with the i-th group of triplet parameters is denoted L_STFT^(i);
S3-2-3: constructing the STFT loss function by combining the single-resolution losses L_STFT^(i) over all the groups of triplet parameters, wherein M denotes the number of groups of triplet parameters.
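A possible PyTorch rendering of steps S3-2-1 to S3-2-3 is sketched below, using the three triplet values given later in embodiment 2; averaging the per-resolution terms is an assumption, since the text only states that the M single-resolution losses are combined.

    import torch

    def stft_magnitude(wav: torch.Tensor, n_fft: int, hop: int) -> torch.Tensor:
        """STFT magnitude spectrum of a (B, L) waveform with a Hamming window frame."""
        window = torch.hamming_window(n_fft, device=wav.device)
        spec = torch.stft(wav, n_fft=n_fft, hop_length=hop, window=window, return_complex=True)
        return spec.abs()

    def multires_stft_loss(enhanced: torch.Tensor, clean: torch.Tensor,
                           triplets=((512, 100), (1024, 200), (256, 50))) -> torch.Tensor:
        """F-norm error between STFT magnitude spectra, averaged over the M resolutions."""
        total = 0.0
        for n_fft, hop in triplets:
            diff = stft_magnitude(enhanced, n_fft, hop) - stft_magnitude(clean, n_fft, hop)
            total = total + torch.linalg.norm(diff, ord="fro", dim=(-2, -1)).mean()
        return total / len(triplets)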
As a preferred embodiment of the present invention, the step S3-3 comprises the steps of:
S3-3-1: performing critical-band integration, loudness pre-emphasis, cubic-root compression, inverse Fourier transform and linear prediction on the STFT magnitude spectrum to obtain perceptual linear prediction acoustic features, and establishing a loss function based on the perceptual linear prediction features that measures the error between the perceptual linear prediction acoustic features of the output speech of the controller voice enhancement preliminary model and of the real speech, denoted the PLP loss function L_PLP;
S3-3-2: performing Mel filtering and logarithmic transformation on the STFT magnitude spectrum to obtain filter-bank acoustic features, and establishing a loss function based on the filter-bank features that measures the error between the filter-bank features of the output speech of the controller voice enhancement preliminary model and of the real speech, denoted the FBANK loss function L_FBANK;
S3-3-3: performing a discrete cosine transform on the filter-bank acoustic features to obtain Mel cepstral coefficient acoustic features, and establishing a loss function based on the Mel cepstral coefficient features that measures the error between the Mel cepstral coefficient features of the model output speech and of the real speech, denoted the MFCC loss function L_MFCC;
S3-3-4: constructing the feature loss function L_feat for the speech recognition task from the PLP loss function, the FBANK loss function and the MFCC loss function;
wherein the STFT loss function L_STFT and the feature loss function L_feat have the same operational form:
L_φ = ||φ(s) - φ(ŝ)||_F,
wherein φ(·) denotes the STFT magnitude spectrum, the PLP feature, the FBANK feature or the MFCC feature, s denotes the clean speech signal, ŝ denotes the enhanced speech signal, and ||·||_F denotes the F-norm.
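As an indication of how the recognition-oriented terms could be computed, the sketch below uses torchaudio for the FBANK and MFCC features; the PLP term is omitted because torchaudio has no standard PLP transform, and the 8 kHz sample rate and filter-bank sizes are assumptions, not values taken from the patent.

    import torch
    import torchaudio

    class FeatureLoss(torch.nn.Module):
        """FBANK + MFCC feature loss between enhanced and clean (B, L) waveforms (sketch)."""
        def __init__(self, sample_rate=8000, n_fft=512, hop=100, n_mels=40, n_mfcc=13):
            super().__init__()
            self.fbank = torchaudio.transforms.MelSpectrogram(
                sample_rate=sample_rate, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
            self.mfcc = torchaudio.transforms.MFCC(
                sample_rate=sample_rate, n_mfcc=n_mfcc,
                melkwargs={"n_fft": n_fft, "hop_length": hop, "n_mels": n_mels})

        @staticmethod
        def _fro(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
            return torch.linalg.norm(a - b, ord="fro", dim=(-2, -1)).mean()

        def forward(self, enhanced: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
            l_fbank = self._fro(torch.log(self.fbank(enhanced) + 1e-8),
                                torch.log(self.fbank(clean) + 1e-8))   # log-Mel filter bank term
            l_mfcc = self._fro(self.mfcc(enhanced), self.mfcc(clean))  # Mel cepstral term
            return l_fbank + l_mfcc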
as a preferable embodiment of the present invention, the step S4 includes the steps of:
s4-1: randomly acquiring a plurality of original clean speech-noisy speech data pairs from the effective training set as a training set
Figure DEST_PATH_IMAGE072
And extracting a pure noise data waveform according to the difference between the noisy speech waveform and the clean speech waveform, wherein the operation formula is as follows:
Figure DEST_PATH_IMAGE073
Figure DEST_PATH_IMAGE074
wherein the content of the first and second substances,
Figure DEST_PATH_IMAGE075
a clean speech waveform representing the original clean speech data in the original clean speech-noisy speech data pair,
Figure DEST_PATH_IMAGE076
a noisy speech waveform representing noisy speech data in the original clean speech-noisy speech data pair,
Figure DEST_PATH_IMAGE077
showing pure noise waveforms, all three shapes
Figure DEST_PATH_IMAGE078
B denotes a batch size, C denotes the number of lanes, L denotes a data length,
Figure DEST_PATH_IMAGE079
representing a subtraction operation by eigenvalue;
s4-2: randomly disturbing the distribution of the pure noise waveforms in the training set, and adding the pure noise waveforms and the clean speech waveforms to obtain enhanced noisy speech waveforms, wherein the operation formula is as follows:
Figure DEST_PATH_IMAGE080
Figure DEST_PATH_IMAGE081
wherein
Figure DEST_PATH_IMAGE082
Indicating that the data is being shuffled through the operation,
Figure DEST_PATH_IMAGE083
representing clean speech waveform-pure noise waveform data pairs,
Figure DEST_PATH_IMAGE084
representing the waveform of said enhanced noisy speech,
Figure DEST_PATH_IMAGE085
indicating an addition operation by a characteristic value;
s4-3: combining the clean speech waveform and the enhanced noisy speech waveform into a new training set, and recording as a second training set
Figure DEST_PATH_IMAGE086
S4-4: iteratively updating model parameters of the controller voice enhancement preliminary model through a gradient descent algorithm based on the second training set and the multitask loss function, verifying whether the controller voice enhancement preliminary model is converged through the effective verification set in a training process, and outputting the current controller voice enhancement preliminary model as the controller voice enhancement model after model training is converged;
the basis for judging the model training convergence is as follows: calculating the multitask loss function of the primary model through the effective verification set every m iteration rounds, and when the multitask loss function does not fall any more after the calculation for n times, considering the multitask loss function as model training convergence; m and n are preset values;
the invention designs a data enhancement method for a model training stage, which randomly redistributes sample noise characteristics, achieves the purpose of expanding the data volume of a training set, effectively increases the noise robustness of a model and generalizes different noise distributions.
S4-5: testing the model with the labeled valid test set.
S5: and inputting the voice of the controller to be enhanced into the controller voice enhancement model, and outputting corresponding enhanced voice.
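The noise-redistribution augmentation of steps S4-1 to S4-3 referred to above can be sketched as follows; tensors follow the (B, C, L) shape convention of the description, and the permutation-based shuffle is one straightforward reading of the shuffle operation.

    import torch

    def shuffle_noise_augment(clean: torch.Tensor, noisy: torch.Tensor) -> torch.Tensor:
        """Extract per-sample noise (noisy - clean), shuffle it across the batch and
        add it back to the clean waveforms to obtain enhanced noisy speech (S4-1..S4-3)."""
        noise = noisy - clean                      # pure-noise waveforms, shape (B, C, L)
        perm = torch.randperm(clean.shape[0])      # random redistribution over the training set
        return clean + noise[perm]                 # enhanced noisy speech waveforms

    # clean, noisy: (B, C, L) tensors from the effective training set;
    # the second training set pairs `clean` with shuffle_noise_augment(clean, noisy).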
A controller speech enhancement device oriented to speech recognition comprising at least one processor and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any of the methods described above.
Compared with the prior art, the invention has the following beneficial effects:
1. The method uses preprocessed clean speech-noisy speech pairs of air traffic control calls collected in a real scene as the data set, builds a controller voice enhancement preliminary model containing SASC and CSSAtt modules, and trains the neural network with a multitask loss function oriented simultaneously to the speech enhancement task and the speech recognition task; the trained model enhances existing noisy air traffic control controller speech, eliminating the echo influence, improving the clarity and intelligibility of the controller speech, and effectively increasing the accuracy of controller speech recognition.
2. The invention designs a preprocessing method for the raw speech according to the speech generation and collection mechanism of real air traffic control scenes, which effectively improves the efficiency of the processing operations and the accuracy of speech recognition after training, testing and speech enhancement with the model of the invention.
3. The invention designs a fully end-to-end controller speech enhancement model whose input and output are raw speech waveforms, with no other speech transformation steps involved, so it can be applied directly, without retraining the existing speech recognition model, to enhance and optimize the speech data fed to the existing speech recognition model in real, complex air traffic environments.
4. According to the encoder-decoder structure of the model, the invention designs the SASC module for speech processing, which uses a self-attention mechanism to mine and analyse the useful features of the speech feature maps between peer layers of the model and to suppress redundant features, guiding the model to focus on the encoding-decoding regularities of the data features and helping the model converge better.
5. The invention designs the CSSAtt module for speech processing, which groups a batch of speech feature maps, extracts channel-dimension and spatial-dimension features from each group of sub-feature maps separately, and finally fuses the dimension features of the different sub-feature maps to realize feature exchange between sub-feature maps, facilitating feature exchange between different samples and improving the robustness of the model to the data features.
6. The invention designs a data augmentation method for the model training stage, which randomly redistributes the noise characteristics of the samples, thereby expanding the amount of training data, effectively increasing the noise robustness of the model and generalizing over different noise distributions.
Drawings
FIG. 1 is a schematic diagram of the generation and transmission of air traffic control speech according to the background art of the present invention.
Fig. 2 shows the air traffic control speech signals collected on different transmission lines and the corresponding waveform diagrams according to the background art of the present invention.
Fig. 3 shows the air traffic control speech signals collected on different transmission lines and the corresponding spectrograms according to the background art of the present invention.
Fig. 4 is a schematic flowchart of a speech recognition-oriented controller speech enhancement method according to embodiment 1 of the present invention.
Fig. 5 is a model structure diagram of a controller voice enhancement preliminary model in a voice recognition-oriented controller voice enhancement method according to embodiment 2 of the present invention.
Fig. 6 is a schematic structural diagram of each module of the controller speech enhancement preliminary model in the speech recognition-oriented controller speech enhancement method according to embodiment 2 of the present invention.
Fig. 7 is an experimental description table of a comparison experiment in the controller voice enhancement method for voice recognition according to embodiment 3 of the present invention.
Fig. 8 is a schematic diagram of the experimental results on controller voice enhancement indicators of the voice recognition-oriented controller voice enhancement method according to embodiment 3 of the present invention.
Fig. 9 is a schematic diagram of the experimental results on controller voice recognition indicators of the voice recognition-oriented controller voice enhancement method according to embodiment 3 of the present invention.
Fig. 10 is a schematic structural diagram of a controller voice enhancement apparatus for voice recognition according to embodiment 4 of the present invention, which uses the controller voice enhancement method for voice recognition according to embodiment 1.
Detailed Description
The present invention will be described in further detail with reference to test examples and specific embodiments. It should be understood that the scope of the above-described subject matter is not limited to the following examples, and any techniques implemented based on the disclosure of the present invention are within the scope of the present invention.
Example 1
As shown in fig. 4, a controller voice enhancement method for voice recognition includes the following steps:
S1: acquiring original clean speech-noisy speech data pairs of ground-air calls to form an original data set, and outputting an effective data set after preprocessing and labeling the original data set;
S2: building a controller voice enhancement preliminary model based on a neural network structure;
S3: establishing a multitask loss function of the controller voice enhancement preliminary model based on a controller voice enhancement task and a controller voice recognition task;
S4: iteratively updating the model parameters of the controller voice enhancement preliminary model by a gradient-descent neural network training algorithm based on the multitask loss function and the effective data set, and outputting the controller voice enhancement model;
S5: inputting the controller voice to be enhanced into the controller voice enhancement model and outputting the corresponding enhanced voice.
Example 2
This embodiment is a specific implementation manner of the method described in embodiment 1, and includes the following steps:
S1: acquiring original clean speech-noisy speech data pairs of ground-air calls to form an original data set, preprocessing and labeling the original data set, and outputting an effective data set.
S1-1: acquiring original clean speech-noisy speech data pairs of ground-air calls to form the original data set;
the method for acquiring an original clean speech-noisy speech data pair is as follows:
on the basis of the existing intercom system, an auxiliary intercom system is added at each air traffic control working position, and the controller speech is collected simultaneously through the auxiliary intercom system and the existing intercom system, yielding the original clean speech-noisy speech data pair.
S1-2: preprocessing the original clean speech-noisy speech data pairs in the original data set and outputting the preprocessed pairs; the preprocessing comprises voice activity detection, speaker role classification, redundant data screening and time-sequence alignment.
S1-2-1: analyzing the collected data of the original data set and dividing the continuous speech signal into instruction speech segments by voice activity detection, the duration of the segmented speech instruction segments being between 0.1 s and 10 s.
S1-2-2: since the intercom system superimposes and merges the controller's uplink and downlink speech with the pilot's downlink speech, the collected speech data is a mixture containing both controller speech and pilot speech. A speaker role classification model is used to classify the segmented speech segments into three classes: controller speech, pilot speech, and controller-pilot mixed speech. Pilot speech and controller-pilot mixed speech are discarded; the invention uses only the controller speech as experimental samples for subsequent processing.
S1-2-3: from the obtained controller speech, screening out silence, noise and data shorter than 1 s, and aligning the time sequences so that the clean speech and the noisy speech of the same voice instruction text have the same duration;
the effective original clean speech-noisy speech data pairs obtained through the above steps have the following characteristics:
(1) The speech data pairs cover the languages used in the recognition scene.
(2) The speech data pairs contain speech in various pronunciation states; the pronunciation state comprises one or more of slow speech rate, normal speech rate, fast speech rate, relaxed emotion, tense emotion and accent.
(3) The speech data pairs contain the specialised control phraseology of the air traffic control field.
S1-3: randomly dividing the preprocessed clean speech-noisy speech data pairs into an effective training set, an effective verification set and an effective test set, manually labeling the data pairs of the effective test set, and outputting the effective training set, the effective verification set and the labeled effective test set as the effective data set; the manually labeled content is the instruction text corresponding to each clean speech-noisy speech data pair.
S1-3-1: the obtained effective original clean speech-noisy speech data pairs are randomly divided into a training set, a verification set and a test set at a ratio of 8:1:1.
S1-3-2: taking the clean speech data of the test set as reference, the test set data are manually labeled and data with unclear semantics are screened out, yielding the instruction text for each pair of test-set speech.
S1-3-3: the training set data and the verification set data are stored in pairs, each pair comprising two utterances, the clean speech and the noisy speech respectively; the test set data and the corresponding instruction texts are organised and stored to form the labeled effective test set.
S2: as shown in fig. 5, a controller voice enhancement preliminary model is built based on a neural network structure; the controller voice enhancement preliminary model comprises a first SCN module, a second SCN module, a plurality of encoder units and corresponding decoder units; the first SCN module is arranged between the input end of the preliminary model and the input end of the encoder units; the second SCN module is arranged between the output end of the preliminary model and the output end of the decoder units; each encoder unit comprises a CNN module and a CSSAtt module; each decoder unit comprises a CNN module and a CSSAtt module; the encoder units and the corresponding decoder units are connected through a BiLSTM module and SASC modules.
The SCN module utilizes the characteristics of a sinc filter in signal processing and adopts a sinc interpolation convolution network to extract the characteristics of the voice signal, so that data points of the voice signal which are lost due to sampling can be reconstructed, and the data integrity of the voice signal is ensured.
The first SCN module is used for performing feature upsampling on the voice data in the effective data set.
And the second SCN module is used for down-sampling the voice characteristic diagram output by the decoder unit.
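To make the sinc-interpolation idea concrete, the sketch below up-samples a (B, C, L) signal by zero-stuffing and convolving with a windowed-sinc low-pass kernel; the kernel length, the Hamming window and the fixed (non-learned) kernel are assumptions of the example rather than the SCN module itself.

    import math
    import torch
    import torch.nn.functional as F

    def sinc_upsample(x: torch.Tensor, factor: int = 2, half_width: int = 16) -> torch.Tensor:
        """Windowed-sinc interpolation up-sampling of a (B, C, L) signal (sketch)."""
        b, c, l = x.shape
        up = torch.zeros(b, c, l * factor, device=x.device, dtype=x.dtype)
        up[..., ::factor] = x                                   # insert zeros between samples
        t = torch.arange(-half_width * factor, half_width * factor + 1,
                         device=x.device, dtype=x.dtype)
        kernel = torch.sinc(t / factor)                         # ideal low-pass interpolation kernel
        kernel = kernel * torch.hamming_window(kernel.numel(), periodic=False,
                                               device=x.device, dtype=x.dtype)
        kernel = kernel / kernel.sum() * factor                 # preserve the signal amplitude
        kernel = kernel.view(1, 1, -1).repeat(c, 1, 1)          # one identical kernel per channel
        return F.conv1d(up, kernel, padding=half_width * factor, groups=c)

Down-sampling, the role of the second SCN module, could reuse the same kernel with a strided convolution.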
The CNN module is used for extracting a preliminary speech feature map of the speech data and outputting it to the CSSAtt module. The CNN module can extract local features of the speech signal and combine them in the deep layers of the network to obtain global features of the speech signal. The CNN module specifically comprises: a convolutional (or deconvolutional) network whose kernel size is k and whose stride is s, a ReLU activation function, a further convolutional network, and a GLU activation function.
The BiLSTM module is used for capturing the dependency of the speech data over time and mining the temporal correlation between the signal frames of the speech data.
The SASC module is bridged between peer layers of the encoder and the decoder; it guides the speech features to attend to their useful components, suppresses redundant features, and passes the same-dimension features of the speech data from the shallow network to the deep network by skip connection, so that the deep network can learn the shallow features and recover details from the shallow network. The module structure is shown in fig. 6 and specifically comprises the following processing steps:
S2-1-1: passing the voice data through the i-th encoder unit to obtain the encoding feature map F_e^i, of shape B×C×L, wherein B denotes the batch size, C the number of channels and L the data length;
S2-1-2: passing the voice data through the i-th decoder unit to obtain the decoding feature map F_d^i, of shape B×C×L;
S2-1-3: performing a self-attention operation on the encoding feature map F_e^i and on the decoding feature map F_d^i respectively to obtain their initial self-attention weights, then splicing and activating them to obtain the fused self-attention weight; the operation formula is:
W_e = SA(F_e^i), W_d = SA(F_d^i), W_f = ReLU(Concat(W_e, W_d)),
wherein SA(·) denotes the self-attention operation, W_e and W_d denote the initial self-attention weights, Concat(·) denotes the splicing operation in the channel dimension, ReLU denotes the first activation function, used to improve the ability of the neural network to fit non-linear functions, and W_f denotes the self-attention weight fused from the encoder and decoder peer layers;
S2-1-4: performing a self-attention operation and activation on the fused self-attention weight to obtain the skip attention weight coefficient of the encoder-decoder peer layers; the operation formula is:
α = σ(SA(W_f)),
wherein σ denotes the second activation function and α denotes the skip attention weight coefficient; in this embodiment the self-attention operations share the same neural network structure, namely 1x1 convolutional networks, and the first activation function and the second activation function are any two different activation functions (for example the Sigmoid, tanh, ReLU, Leaky ReLU, ELU, Mish, Swish or SiLU activation functions);
S2-1-5: adjusting the weight of every feature point of the encoding feature map F_e^i by the skip attention weight coefficient during the skip connection, splicing the result with the decoding feature map F_d^i, and outputting the skip-connected voice feature map processed by the SASC module; the operation formula is:
F_sasc = Concat(α ⊙ F_e^i, F_d^i),
wherein ⊙ denotes element-wise multiplication and F_sasc denotes the skip-connected voice feature map.
The CSSAtt module of the controller voice enhancement preliminary model is used to guide the model to attend to useful information in the channel dimension and the spatial dimension of the speech feature map respectively, mining features and optimizing the channel-space split attention parameters. The module structure is shown in fig. 6 and specifically comprises the following processing steps:
S2-2-1: inputting a batch of the preliminary voice feature maps, dividing the batch into G groups of sub-feature maps, and splitting each group of sub-feature maps into two branch sub-feature maps X_c and X_s, each of shape B×(C/2G)×L; wherein X_c denotes the channel-branch sub-feature map, X_s denotes the spatial-branch sub-feature map, B denotes the batch size, C the number of channels, L the data length, and G the preset number of groups;
S2-2-2: based on X_c, generating the initialized channel attention weight through an adaptive average pooling operation over the channel dimension; the operation formula is:
W_c = AvgPool(X_c),
wherein AvgPool(·) denotes the adaptive average pooling operation, the subscript c denotes the channel dimension, and W_c denotes the initialized channel attention weight;
S2-2-3: based on X_s, generating the initialized spatial attention weight through a group normalization operation over the spatial dimension; the operation formula is:
W_s = GN(X_s),
wherein GN(·) denotes the group normalization operation, the subscript s denotes the spatial dimension, and W_s denotes the initialized spatial attention weight;
S2-2-4: mining, through learnable parameters, the feature dependency of X_c in the channel dimension and of X_s in the spatial dimension, and generating the attention weight coefficients in the channel dimension and the spatial dimension after activation by an activation function; the operation formula is:
A_c = σ(w_1 ⊙ W_c), A_s = σ(w_2 ⊙ W_s),
wherein w_1 and w_2 denote the learnable parameters, σ denotes the second activation function, ⊙ denotes element-wise multiplication, and A_c and A_s denote the channel attention weight coefficient and the spatial attention weight coefficient respectively;
S2-2-5: adjusting the weight of every feature point of X_c and of X_s by the channel attention weight coefficient and the spatial attention weight coefficient respectively, splicing the adjusted X_c and X_s sub-feature maps, enabling information exchange between different groups of sub-feature maps by a channel shuffle operation, and outputting the voice feature map processed by the CSSAtt module; the operation formula is:
X_out = Shuffle(Concat(A_c ⊙ X_c, A_s ⊙ X_s)),
wherein Concat(·) denotes the feature-map splicing operation in the channel dimension, Shuffle(·) denotes the channel shuffle operation, and X_out denotes the voice feature map processed by the CSSAtt module; the channel shuffle operation specifically scrambles the channel order between the different sub-feature maps within the same batch of feature maps, so that the features of different channels are linked, information is exchanged between different sub-feature maps, and common features are easier to learn.
S3: establishing a multitask loss function of the controller voice enhancement preliminary model based on a controller voice enhancement task and a controller voice recognition task;
the step S3 includes the steps of:
s3-1: constructing a loss function facing a voice enhancement task based on LAE (Least Absolute Error), directly measuring the Error between the output waveform and the real waveform of the controller voice enhancement preliminary model in the time domain, and recording as the LAE loss function
Figure 242510DEST_PATH_IMAGE052
S3-2: constructing a frequency-domain loss function for the speech enhancement task based on multi-resolution STFT (Short-Time Fourier Transform) magnitude spectra, which measures the error between the STFT magnitude spectra of the output speech of the controller voice enhancement preliminary model and of the real speech, denoted the STFT loss function L_STFT;
S3-2-1: constructing triplet parameters for the STFT operation, of the form:
[sampling points, frame shift, window frame];
the sampling-point parameter is the number N of sampling points that form one speech frame out of all the sampling points of the speech signal, the frame-shift parameter is the time difference between the start positions of two adjacent frames, and the window-frame parameter is the type of window function applied to the speech signal;
in the present invention, the triplet values are [512, 100, Hamming window], [1024, 200, Hamming window] and [256, 50, Hamming window].
Taking the first triplet as an example, the sampling-point parameter is 512, i.e. 512 sampling points form one frame; the frame-shift parameter is set to 100, i.e. the frame shift is 100 sampling points, and if a frame has fewer than 512 sampling points it is padded with zeros; the window-frame parameter selects the Hamming window function.
S3-2-2: constructing the STFT magnitude spectrum of the speech with the i-th group of triplet parameters, and denoting the STFT loss function of the STFT magnitude spectrum constructed with the i-th group of triplet parameters as L_STFT^(i);
S3-2-3: constructing the STFT loss function as:
L_STFT = (1/M) · Σ_{i=1}^{M} L_STFT^(i)
wherein M denotes the number of groups of triplet parameters; in the present invention M is taken as 3.
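A minimal PyTorch sketch of a multi-resolution STFT magnitude loss of this kind is given below for illustration; it assumes that the sampling-point and frame-shift entries of the three Hamming-window triplets above map onto n_fft/win_length and hop_length, and that each per-resolution error is the F-norm of the magnitude-spectrum difference averaged over the M = 3 resolutions. The function names are illustrative, not part of the invention.

import torch

# (sampling points, frame shift) of the three Hamming-window triplets
RESOLUTIONS = [(512, 100), (1024, 200), (256, 50)]

def stft_magnitude(x, n_fft, hop):
    # x: [B, L] waveform -> STFT magnitude spectrum [B, F, T]
    window = torch.hamming_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, win_length=n_fft,
                      window=window, return_complex=True)
    return spec.abs()

def multi_resolution_stft_loss(clean, enhanced):
    # average of the per-resolution magnitude-spectrum errors (M = 3)
    losses = []
    for n_fft, hop in RESOLUTIONS:
        mag_clean = stft_magnitude(clean, n_fft, hop)
        mag_enh = stft_magnitude(enhanced, n_fft, hop)
        losses.append(torch.norm(mag_clean - mag_enh, p="fro"))
    return sum(losses) / len(losses)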
S3-3: constructing a multi-resolution feature loss function for the speech recognition task, denoted as the feature loss function L_Feat;
S3-3-1: performing critical-band integration, loudness pre-emphasis, cubic-root compression, inverse Fourier transform and linear prediction on the STFT magnitude spectrum to obtain Perceptual Linear Prediction (PLP) acoustic features, and establishing a loss function based on the perceptual linear prediction features that measures the error between the PLP acoustic features of the output speech of the controller voice enhancement preliminary model and those of the real speech, denoted as the PLP loss function L_PLP;
S3-3-2: performing Mel filtering and logarithmic transformation on the STFT magnitude spectrum to obtain Filter Bank (FBANK) acoustic features, and establishing a loss function based on the filter bank features that measures the error between the filter bank features of the output speech of the controller voice enhancement preliminary model and those of the real speech, denoted as the FBANK loss function L_FBANK;
S3-3-3: performing a discrete cosine transform on the filter bank acoustic features to obtain Mel cepstral coefficient acoustic features, and establishing a loss function based on the Mel-Frequency Cepstral Coefficient (MFCC) features that measures the error between the MFCC features of the model output speech and those of the real speech, denoted as the MFCC loss function L_MFCC;
S3-3-4: constructing the feature loss function L_Feat for the speech recognition task as:
L_Feat = L_PLP + L_FBANK + L_MFCC
wherein the STFT loss function L_STFT and the feature loss function L_Feat have the same operation form:
L_φ = || φ(s) - φ(ŝ) ||_F
where φ(·) denotes the STFT magnitude spectrum, the PLP feature, the FBANK feature or the MFCC feature, s denotes the clean speech signal, ŝ denotes the enhanced speech signal, and || · ||_F denotes the F-norm;
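As an illustration of this common form for the recognition-oriented features, the sketch below evaluates ||φ(s) - φ(ŝ)||_F with torchaudio's MelSpectrogram and MFCC transforms standing in for the FBANK and MFCC pipelines of S3-3; the PLP branch is omitted, and the filter-bank settings (n_fft, hop length, number of Mel bands) are assumptions of this sketch, not values given by the invention.

import torch
import torchaudio

SAMPLE_RATE = 8000   # sampling rate of the ground-air call data in the embodiment below

# FBANK stand-in: STFT magnitude -> Mel filtering -> logarithm
fbank = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=512, hop_length=100, n_mels=40)
# MFCC stand-in: discrete cosine transform of the log Mel filter-bank features
mfcc = torchaudio.transforms.MFCC(
    sample_rate=SAMPLE_RATE, n_mfcc=13,
    melkwargs={"n_fft": 512, "hop_length": 100, "n_mels": 40})

def feature_loss(clean, enhanced, eps=1e-8):
    # ||phi(clean) - phi(enhanced)||_F accumulated over the FBANK and MFCC features
    fb_clean = (fbank(clean) + eps).log()
    fb_enh = (fbank(enhanced) + eps).log()
    loss = torch.norm(fb_clean - fb_enh, p="fro")
    loss = loss + torch.norm(mfcc(clean) - mfcc(enhanced), p="fro")
    return loss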
S3-4: constructing the multitask loss function of the controller voice enhancement preliminary model by weighted summation, with the calculation formula:
L_total = λ_1 · L_LAE + λ_2 · L_STFT + λ_3 · L_Feat
wherein L_total is the multitask loss function, and λ_1, λ_2 and λ_3 respectively denote the preset weights of the LAE loss function L_LAE, the STFT loss function L_STFT and the feature loss function L_Feat; in the present invention all three preset weights are set to 1.
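Purely as an illustrative sketch that reuses the loss sketches given above, the weighted summation of S3-4 with all three weights preset to 1 might be written as follows; interpreting the LAE term as the mean absolute time-domain error is an assumption of the sketch.

import torch

def multitask_loss(clean, enhanced, w_lae=1.0, w_stft=1.0, w_feat=1.0):
    # L_total = w_lae * L_LAE + w_stft * L_STFT + w_feat * L_Feat
    l_lae = torch.mean(torch.abs(clean - enhanced))       # time-domain least absolute error (S3-1)
    l_stft = multi_resolution_stft_loss(clean, enhanced)  # frequency-domain STFT loss (S3-2)
    l_feat = feature_loss(clean, enhanced)                # recognition-oriented feature loss (S3-3)
    return w_lae * l_lae + w_stft * l_stft + w_feat * l_feat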
S4: iteratively updating model parameters of the controller voice enhancement preliminary model through a gradient descent neural network training algorithm based on the multitask loss function and the effective data set, and outputting a controller voice enhancement model;
S4-1: randomly acquiring a plurality of original clean speech-noisy speech data pairs from the effective training set as a training set D = {(s, x)}, and extracting the pure noise data waveform as the difference between the noisy speech waveform and the clean speech waveform, with the operation formula:
n = x ⊖ s
wherein s denotes the clean speech waveform of the original clean speech data in the original clean speech-noisy speech data pair, x denotes the noisy speech waveform of the noisy speech data in the original clean speech-noisy speech data pair, and n denotes the pure noise waveform; all three have the shape [B, C, L], where B denotes the batch size, C denotes the number of channels, L denotes the data length, and ⊖ denotes element-wise subtraction;
S4-2: randomly shuffling the distribution of the pure noise waveforms within the training set, and adding them to the clean speech waveforms to obtain enhanced noisy speech waveforms, with the operation formulas:
n' = Shuffle(n)
x' = s ⊕ n'
wherein Shuffle(·) denotes the data shuffling operation applied to the clean speech waveform-pure noise waveform data pairs D_n = {(s, n)}, x' denotes the enhanced noisy speech waveform, and ⊕ denotes element-wise addition;
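A minimal PyTorch sketch of this noise-recombination augmentation (S4-1 to S4-3) is given below; the tensor shape [B, C, L] follows the description above, while the function name and the use of a random batch permutation are assumptions of the sketch.

import torch

def remix_noise(clean, noisy):
    # clean, noisy: [B, C, L] waveforms of the original clean speech-noisy speech pairs
    noise = noisy - clean                      # S4-1: extract the pure noise waveforms
    perm = torch.randperm(clean.shape[0])      # S4-2: shuffle the noise distribution across the batch
    augmented_noisy = clean + noise[perm]      # re-mix to obtain enhanced noisy speech
    return clean, augmented_noisy              # pairs of the second training set (S4-3)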
S4-3: combining the clean speech waveforms and the enhanced noisy speech waveforms into a new training set, denoted as the second training set D' = {(s, x')};
S4-4: iteratively updating the model parameters of the controller voice enhancement preliminary model through a gradient descent algorithm based on the second training set and the multitask loss function, verifying during training whether the controller voice enhancement preliminary model has converged on the effective verification set, and outputting the current controller voice enhancement preliminary model as the controller voice enhancement model after the model training converges;
the basis for judging model training convergence is as follows: the multitask loss function of the preliminary model is calculated on the effective verification set every m iteration rounds, and when the multitask loss function no longer decreases after n consecutive calculations, the model training is considered to have converged; in the invention, m is set to 10 and n is set to 5;
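This convergence criterion can be tracked with a small counter, as in the sketch below; the surrounding training loop and the exact definition of "no longer decreases" (strictly lower than the best value seen so far) are assumptions of this sketch.

class ConvergenceMonitor:
    # Declares convergence after n consecutive validation checks without improvement.
    def __init__(self, n: int = 5):
        self.n = n
        self.best = float("inf")
        self.stale = 0

    def update(self, validation_loss: float) -> bool:
        if validation_loss < self.best:
            self.best = validation_loss
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.n   # True once training is considered converged

# usage: call monitor.update(validation_multitask_loss) every m = 10 iteration rounds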
S4-5: testing the model with the labeled effective test set.
S5: inputting the controller voice to be enhanced into the controller voice enhancement model, and outputting the corresponding enhanced voice.
Example 3
The embodiment is an actual operation analysis of the method of the present invention under the following data conditions, and is used for verifying the feasibility and performance of the technical scheme of the present invention, and specifically includes the following steps:
1. Data preparation: speech data are collected in a real control scene and preprocessed according to the preprocessing scheme provided by the present invention to form the effective data set required by the speech enhancement method, and a training set, a verification set and a test set are formed according to the data set division step of the method. The data sets are described as follows:
Training set: 47253 pieces of data (42.83 hours) in total, including 42189 pieces of Chinese data (37.28 hours) and 5064 pieces of English data (5.55 hours);
Verification set: 4764 pieces of data (4.31 hours) in total, including 4188 pieces of Chinese data (3.69 hours) and 558 pieces of English data (0.62 hours);
Test set: 6514 pieces of data (5.62 hours) in total, including 6012 pieces of Chinese data (5.08 hours) and 502 pieces of English data (0.54 hours);
The training set and the verification set are taken from speech data of the same date, the test set is taken from speech data of dates different from those of the training and verification sets, and the sampling rate of all data is 8 kHz. The test results of this embodiment are the speech enhancement and speech recognition results obtained on the test set.
2. Speech enhancement baseline model: in this embodiment, the model formed by the SCN modules, the CNN modules and the BiLSTM modules in step S2 is used as the baseline model; the loss function is the non-multiresolution loss function of steps S3-1 and S3-2 oriented only to speech enhancement, and the data enhancement mechanism of the present invention is used in the model training phase. The model input and output are both raw waveforms.
The baseline model and the model of the present invention are implemented with the PyTorch framework. The hyper-parameter configuration for model training is described below, followed by a minimal optimizer sketch:
(1) Optimizer: the Adam optimizer is adopted, with an initial learning rate of 0.0003 and a learning-rate decay rate of 0.999;
(2) Model hyper-parameters: the number of feature channels is 48, the convolution kernel size is 8, and the stride is 4;
(3) Batch training size: 32.
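For illustration, this optimizer configuration might be set up in PyTorch as follows; interpreting the decay rate 0.999 as a per-epoch exponential learning-rate factor is an assumption of the sketch.

import torch

def make_optimizer(model: torch.nn.Module):
    # Adam optimizer, initial learning rate 0.0003, exponential learning-rate decay 0.999
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)
    return optimizer, scheduler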
The hardware environment adopted for the experiments is as follows: Ubuntu Linux 16.04 operating system, 2 × Intel Core i7-6800K CPUs, 2 × NVIDIA GeForce RTX 2080Ti graphics cards with 2 × 11 GB of video memory, and 64 GB of system memory.
Under the above training data and configuration conditions, a total of 5 sets of experiments were performed to respectively prove the advantages of the model and the loss function proposed by the present invention, which are specifically as follows:
a1: training a baseline model on the effective training data;
a2: on the basis of A1, changing a non-multiresolution loss function into a multiresolution loss function, and training on the effective training data;
a3: on the basis of A2, adding an SASC module and a CSSAtt module, and training on the training data;
a4: on the basis of A2, adopting a multi-task loss function facing to a voice enhancement task and a voice recognition task at the same time, and training on the training data;
a5: on the basis of A2, adding the module A3 and adopting the multitask loss function A4, and training on the training data;
the experimental description is shown in fig. 7.
3. Speech enhancement evaluation indexes:
The experiments adopt objective speech quality evaluation indexes to measure the speech enhancement effect of the models, as follows:
(1) PESQ (Perceptual Evaluation of Speech Quality): perceptual speech quality evaluation, valued between 0 and 5; a higher value represents a better enhancement effect;
(2) CSIG (MOS prediction of the signal distortion attending only to the speech signal): mean opinion score of speech distortion, valued between 1 and 5; a higher value represents a better enhancement effect;
(3) CBAK (MOS prediction of the intrusiveness of background noise): mean opinion score of background noise intrusiveness, valued between 1 and 5; a higher value represents a better enhancement effect;
(4) COVL (MOS prediction of the overall effect): mean opinion score of the overall effect, valued between 1 and 5; a higher value represents a better enhancement effect;
(5) STOI (Short-Time Objective Intelligibility): short-time objective intelligibility, a percentage between 0 and 100; a higher value represents a better enhancement effect.
Taking the clean speech as the reference, the noisy speech and the enhanced speech are each scored against the clean speech in pairs to obtain the above evaluation indexes; a higher score indicates that the speech waveform characteristics are closer to those of the clean speech, i.e. better speech quality.
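For reference, the PESQ and STOI scores of the noisy-vs-clean and enhanced-vs-clean pairs can be computed with the third-party pesq and pystoi packages; their availability and APIs are an assumption of this sketch rather than part of the invention, and the narrowband PESQ mode is chosen because the data in this embodiment are sampled at 8 kHz.

from pesq import pesq     # pip install pesq   (assumed third-party package)
from pystoi import stoi   # pip install pystoi (assumed third-party package)

def score_pair(clean, degraded, fs=8000):
    # clean / degraded: 1-D numpy waveforms of the reference and the evaluated speech
    return {
        "PESQ": pesq(fs, clean, degraded, "nb"),      # narrowband mode for 8 kHz audio
        "STOI": 100.0 * stoi(clean, degraded, fs),    # reported as a percentage in this embodiment
    }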
4. Speech recognition model: in the experiments, the existing DeepSpeech2 acoustic model (DS2) is used as the speech recognition verification model, and the DS2 model does not need to be retrained. The experimental results obtained in the 5 groups of experiments of this embodiment are each passed through the DS2 model to obtain instruction texts, the clean/noisy test data are passed through the DS2 model to obtain clean/noisy instruction texts, and all instruction text results are analyzed and compared with the clean instruction texts as the reference;
5. Speech recognition evaluation index:
The Character Error Rate (CER) based on Chinese characters and English letters is adopted to measure the speech recognition effect; a lower value represents a better recognition effect. The CER is calculated as:
CER = (I + D + S) / N
where N is the length of the real instruction text, and I, D and S respectively denote the numbers of insertion, deletion and substitution operations required to convert the predicted instruction text into the real instruction text.
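A minimal sketch of this calculation is shown below: the insertion, deletion and substitution counts follow from the Levenshtein alignment between the predicted and real instruction texts, and the total is divided by the length of the real text (the function name is illustrative).

def character_error_rate(reference: str, hypothesis: str) -> float:
    # dp[i][j] = minimum edit operations turning hypothesis[:j] into reference[:i]
    n, m = len(reference), len(hypothesis)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                  # deletions
    for j in range(m + 1):
        dp[0][j] = j                  # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1   # substitution
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[n][m] / max(n, 1)       # (I + D + S) / N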
6. The experimental results are as follows:
The experimental results of the present invention are shown in fig. 8 and fig. 9. The results show that both the modules and the loss function mechanisms proposed by the present invention improve the speech enhancement and speech recognition effects on the data set of this embodiment. Specifically:
(1) Comparison of experiments A1 and A2 shows that, after the multi-resolution mechanism of the loss function is introduced, both the speech enhancement effect and the speech recognition effect are improved over the baseline model. This indicates that setting multiple groups of Fourier transform triplet parameters constructs the speech magnitude spectrum from multiple aspects and at multiple scales, which helps the model mine effective speech information in depth and thereby supports both the speech enhancement task and the speech recognition task.
(2) Experiment A3 shows that introducing the SASC module and the CSSAtt module proposed by the present invention on the basis of experiment A2 to enhance the noisy data yields objective evaluation scores superior to those of the reference model A2 and of model A4 using the multitask loss. This indicates that the proposed modules benefit the long-distance transmission and reconstruction of speech features when the model network is deep, and help the model capture speech information from multiple dimensions when analyzing the speech feature map, thereby improving the robustness of the model and the enhancement effect.
(3) Experiment A4 shows that introducing the multitask loss proposed by the present invention on the basis of experiment A2 yields a better speech recognition effect than the other models that do not adopt the multitask loss, while all experiments are tested with the same existing speech recognition model. This indicates that, without retraining the speech recognition model, directly taking the characteristics of the speech recognition task into account in the front-end enhancement task and using the model to learn the common recognition-oriented feature representation of noisy and clean speech makes the optimized speech data easier to recognize.
(4) Experiment A5 shows that simultaneously introducing the multi-resolution mechanism, the proposed model modules and the multitask loss function enables the baseline model to achieve the best speech enhancement and speech recognition performance on the test set of this embodiment, which proves the effectiveness of the method provided by the present invention.
Example 4
As shown in fig. 10, a controller voice enhancement apparatus oriented to voice recognition comprises at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, so that the at least one processor can perform the speech recognition-oriented controller voice enhancement method described in the foregoing embodiments. The input/output interface may comprise a display, a keyboard, a mouse and a USB interface for inputting and outputting data; the power supply is used for supplying electric energy to the electronic device.
Those skilled in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as a removable Memory device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.
When the integrated unit of the present invention is implemented in the form of a software functional unit and sold or used as a separate product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A controller voice enhancement method facing voice recognition is characterized by comprising the following steps:
s1: acquiring an original clean voice-voice with noise data pair of a ground-air call to form an original data set, and outputting an effective data set after preprocessing and labeling the original data set;
s2: building a controller voice enhancement preliminary model based on a neural network structure;
s3: establishing a multitask loss function of the controller voice enhancement preliminary model based on a controller voice enhancement task and a controller voice recognition task;
s4: iteratively updating model parameters of the controller voice enhancement preliminary model through a gradient descent neural network training algorithm based on the multitask loss function and the effective data set, and outputting a controller voice enhancement model;
s5: and inputting the voice of the controller to be enhanced into the controller voice enhancement model, and outputting corresponding enhanced voice.
2. The method for speech enhancement of controller for speech recognition according to claim 1, wherein the step S1 comprises the steps of:
s1-1: acquiring an original clean voice-voice data pair with noise of a ground-air communication to form an original data set;
the method for acquiring the original clean voice-noisy voice data pair comprises the following steps:
on the basis of the existing internal speech system, adding an auxiliary internal speech system to each empty pipe seat, and simultaneously acquiring the speech of a controller through the auxiliary internal speech system and the existing internal speech system to obtain the original clean speech-noisy speech data pair;
s1-2: preprocessing an original clean voice-voice data pair with noise in the original data set, and outputting the preprocessed original clean voice-voice data pair with noise; the preprocessing comprises voice activity detection, speaker role classification, redundant data screening and time sequence alignment;
s1-3: randomly dividing the preprocessed original clean voice-voice data pairs with noise into an effective training set, an effective verification set and an effective test set, manually labeling the data pairs of the effective test set, and outputting the effective training set, the effective verification set and the labeled effective test set as effective data sets; and the manual labeling content is an instruction text corresponding to the original clean voice-voice data with noise.
3. The controller voice enhancement method facing voice recognition according to claim 2, wherein the controller voice enhancement preliminary model comprises a first SCN module, a second SCN module, a plurality of encoder units and corresponding decoder units; the first SCN module is arranged between the input end of the preliminary model and the input end of the encoder unit; the second SCN module is arranged between the output end of the preliminary model and the output end of the decoder unit; the encoder unit comprises a CNN module and a CSSAtt module; the decoder unit comprises a CNN module and a CSSAtt module; the encoder unit and the corresponding decoder unit are connected through a BiLSTM module and an SASC module;
the first SCN module is used for performing feature upsampling on the voice data in the effective data set;
the second SCN module is used for down-sampling the voice feature map output by the decoder unit;
the CNN module is used for extracting a preliminary voice feature map of the voice data and outputting the preliminary voice feature map to the CSSAtt module;
the BiLSTM module is used for capturing the dependency relationship of the time sequence change of the voice data and mining the time sequence correlation between the signal frames of the voice data;
the SASC module is erected between peer layers of the encoder and the decoder and transfers the same-dimension characteristics of the voice data from a shallow network to a deep network in a skipping mode;
and the CSSAtt module is used for guiding the preliminary model to respectively mine features from the channel dimension and the space dimension of the preliminary voice feature map and optimizing the segmentation attention parameter of the channel space.
4. The speech enhancement method for controller facing speech recognition according to claim 3, wherein the SASC module comprises the following operation steps:
S2-1-1: obtaining a coding feature map E_i after the voice data is encoded by the i-th encoder unit, E_i ∈ R^(B×C×L), wherein B represents the batch size, C represents the number of channels, and L represents the data length;
S2-1-2: obtaining a decoding feature map D_i after the voice data is decoded by the i-th decoder unit, D_i ∈ R^(B×C×L);
S2-1-3: respectively performing self-attention operations on the coding feature map E_i and the decoding feature map D_i to obtain the initial self-attention weights of the coding feature map E_i and the decoding feature map D_i, and splicing and activating them to obtain a fused self-attention weight, with the operation formulas:
A_E = SA(E_i)
A_D = SA(D_i)
A_f = δ(Concat(A_E, A_D))
wherein SA(·) represents the self-attention operation, A_E and A_D represent the initial self-attention weights, Concat(·) represents the splicing operation in the channel dimension, δ represents the first activation function, and A_f represents the self-attention weight of the encoder-to-decoder peer-layer fusion;
S2-1-4: performing a self-attention operation and activation processing on the fused self-attention weight to obtain the skip attention weight coefficient of the encoder and decoder peer layers, with the operation formula:
A_skip = σ(SA(A_f))
wherein σ represents the second activation function and A_skip represents the skip attention weight coefficient;
S2-1-5: adjusting the weight of each feature point of the coding feature map E_i during the skip connection according to the skip attention weight coefficient, splicing it with the decoding feature map D_i, and outputting the skip-connection voice feature map processed by the SASC module, with the operation formulas:
E_i' = A_skip ⊙ E_i
F_skip = Concat(E_i', D_i)
wherein ⊙ represents element-wise multiplication and F_skip represents the skip-connection voice feature map.
5. The speech enhancement method for controllers facing speech recognition according to claim 3, wherein the CSSAtt module comprises the following operation steps:
S2-2-1: inputting a batch of the preliminary voice feature maps, dividing the batch of preliminary voice feature maps into G groups of sub-feature maps, and dividing each group of sub-feature maps into two branch sub-feature maps X_c and X_s along the channel dimension; wherein X_c represents the channel branch sub-feature map, X_s represents the spatial branch sub-feature map, B represents the batch size, C represents the number of channels, L represents the data length, and G represents the preset number of groups;
S2-2-2: generating an initialized channel attention weight from X_c through an adaptive average pooling operation in the channel dimension, with the operation formula:
s_c = AvgPool_c(X_c)
wherein AvgPool(·) represents the adaptive average pooling operation, the subscript c represents the channel dimension, and s_c represents the initialized channel attention weight;
S2-2-3: generating an initialized spatial attention weight from X_s through a group normalization operation in the spatial dimension, with the operation formula:
g_s = GN_s(X_s)
wherein GN(·) represents the group normalization operation, the subscript s represents the spatial dimension, and g_s represents the initialized spatial attention weight;
S2-2-4: mining the feature dependency of X_c in the channel dimension and of X_s in the spatial dimension through learnable parameters, and generating the attention weight coefficients in the channel dimension and the spatial dimension after activation by an activation function, with the operation formulas:
A_c = σ(W_1 · s_c + b_1)
X_c' = A_c ⊙ X_c
A_s = σ(W_2 · g_s + b_2)
X_s' = A_s ⊙ X_s
wherein W_1, b_1, W_2 and b_2 represent learnable parameters, σ represents the second activation function, ⊙ represents element-wise multiplication, and A_c and A_s respectively represent the channel attention weight coefficient and the spatial attention weight coefficient;
S2-2-5: adjusting the weight of each feature point of X_c and of X_s according to the channel attention weight coefficient and the spatial attention weight coefficient respectively, splicing the adjusted X_c and X_s into a sub-feature map, enabling information communication among different groups of the sub-feature maps by using a channel shuffle operation, and outputting the voice feature map processed by the CSSAtt module, with the operation formulas:
X' = Concat(X_c', X_s')
Y = Shuffle(X')
wherein Concat(·) represents the feature map splicing operation in the channel dimension, Shuffle(·) represents the channel shuffle operation, and Y represents the voice feature map processed by the CSSAtt module.
6. The speech recognition-oriented controller speech enhancement method according to claim 1, wherein the step S3 comprises the steps of:
S3-1: constructing a time-domain loss function for the voice enhancement task based on the LAE, which directly measures the error between the output waveform of the controller voice enhancement preliminary model and the real waveform in the time domain, denoted as the LAE loss function L_LAE;
S3-2: constructing a frequency-domain loss function for the voice enhancement task based on the multi-resolution STFT magnitude spectra, which measures the error between the STFT magnitude spectra of the output voice of the controller voice enhancement preliminary model and of the real voice, denoted as the STFT loss function L_STFT;
S3-3: constructing a multi-resolution feature loss function for the voice recognition task, denoted as the feature loss function L_Feat;
S3-4: constructing the multitask loss function of the controller voice enhancement preliminary model by weighted summation, with the calculation formula:
L_total = λ_1 · L_LAE + λ_2 · L_STFT + λ_3 · L_Feat
wherein L_total is the multitask loss function, and λ_1, λ_2 and λ_3 respectively represent the preset weights of the LAE loss function L_LAE, the STFT loss function L_STFT and the feature loss function L_Feat.
7. The method for speech enhancement of controller for speech recognition according to claim 6, wherein the step S3-2 comprises the steps of:
S3-2-1: constructing triplet parameters for the STFT operation, wherein each triplet has the form:
[sampling points, frame shift, window function];
S3-2-2: constructing the STFT magnitude spectrum of the voice with the i-th group of triplet parameters, the STFT loss function of the STFT magnitude spectrum constructed with the i-th group of triplet parameters being L_STFT^(i);
S3-2-3: constructing the STFT loss function as:
L_STFT = (1/M) · Σ_{i=1}^{M} L_STFT^(i)
wherein M represents the number of groups of the triplet parameters.
8. The speech recognition-oriented controller speech enhancement method according to claim 6, wherein the step S3-3 comprises the following steps:
S3-3-1: performing critical-band integration, loudness pre-emphasis, cubic-root compression, inverse Fourier transform and linear prediction on the STFT magnitude spectrum to obtain perceptual linear prediction acoustic features, and establishing a loss function based on the perceptual linear prediction features that measures the error between the perceptual linear prediction acoustic features of the output voice of the controller voice enhancement preliminary model and those of the real voice, denoted as the PLP loss function L_PLP;
S3-3-2: performing Mel filtering and logarithmic transformation on the STFT magnitude spectrum to obtain filter bank acoustic features, and establishing a loss function based on the filter bank features that measures the error between the filter bank features of the output voice of the controller voice enhancement preliminary model and those of the real voice, denoted as the FBANK loss function L_FBANK;
S3-3-3: performing a discrete cosine transform on the filter bank acoustic features to obtain Mel cepstral coefficient acoustic features, and establishing a loss function based on the Mel cepstral coefficient features that measures the error between the Mel cepstral coefficient features of the model output voice and those of the real voice, denoted as the MFCC loss function L_MFCC;
S3-3-4: constructing the feature loss function L_Feat for the voice recognition task as:
L_Feat = L_PLP + L_FBANK + L_MFCC.
9. The method for speech enhancement of controller for speech recognition according to claim 2, wherein the step S4 comprises the steps of:
S4-1: randomly acquiring a plurality of original clean speech-noisy speech data pairs from the effective training set as a training set D = {(s, x)}, and extracting the pure noise data waveform as the difference between the noisy speech waveform and the clean speech waveform, with the operation formula:
n = x ⊖ s
wherein s represents the clean speech waveform of the original clean speech data in the original clean speech-noisy speech data pair, x represents the noisy speech waveform of the noisy speech data in the original clean speech-noisy speech data pair, and n represents the pure noise waveform; all three have the shape [B, C, L], wherein B represents the batch size, C represents the number of channels, L represents the data length, and ⊖ represents element-wise subtraction;
S4-2: randomly shuffling the distribution of the pure noise waveforms within the training set, and adding them to the clean speech waveforms to obtain enhanced noisy speech waveforms, with the operation formulas:
n' = Shuffle(n)
x' = s ⊕ n'
wherein Shuffle(·) represents the data shuffling operation applied to the clean speech waveform-pure noise waveform data pairs D_n = {(s, n)}, x' represents the enhanced noisy speech waveform, and ⊕ represents element-wise addition;
S4-3: combining the clean speech waveforms and the enhanced noisy speech waveforms into a new training set, denoted as the second training set D' = {(s, x')};
S4-4: iteratively updating the model parameters of the controller voice enhancement preliminary model through a gradient descent algorithm based on the second training set and the multitask loss function, verifying during training whether the controller voice enhancement preliminary model has converged on the effective verification set, and outputting the current controller voice enhancement preliminary model as the controller voice enhancement model after the model training converges;
the basis for judging model training convergence is as follows: the multitask loss function of the preliminary model is calculated on the effective verification set every m iteration rounds, and when the multitask loss function no longer decreases after n consecutive calculations, the model training is considered to have converged; m and n are preset values;
S4-5: testing the model with the labeled effective test set.
10. A controller voice enhancement apparatus oriented to voice recognition, comprising at least one processor, and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 9.
CN202210841871.0A 2022-07-18 2022-07-18 Controller voice enhancement method and device facing voice recognition Active CN115240648B (en)

Publications (2)

Publication Number Publication Date
CN115240648A true CN115240648A (en) 2022-10-25
CN115240648B CN115240648B (en) 2023-04-07



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220199095A1 (en) * 2019-06-21 2022-06-23 Industry-University Cooperation Foundation Hanyang University Method and apparatus for combined learning using feature enhancement based on deep neural network and modified loss function for speaker recognition robust to noisy environments
CN113096646A (en) * 2019-12-20 2021-07-09 北京世纪好未来教育科技有限公司 Audio recognition method and device, electronic equipment and storage medium
KR20220030120A (en) * 2020-09-02 2022-03-10 네이버 주식회사 Method and system for training speech recognition models using augmented consistency regularization
WO2022094293A1 (en) * 2020-10-29 2022-05-05 Dolby Laboratories Licensing Corporation Deep-learning based speech enhancement
CN112927709A (en) * 2021-02-04 2021-06-08 武汉大学 Voice enhancement method based on time-frequency domain joint loss function
CN113707134A (en) * 2021-08-17 2021-11-26 北京搜狗科技发展有限公司 Model training method and device for model training
CN114141238A (en) * 2021-11-26 2022-03-04 中国人民解放军陆军工程大学 Voice enhancement method fusing Transformer and U-net network
CN114694670A (en) * 2022-04-06 2022-07-01 华南理工大学 Multi-task network-based microphone array speech enhancement system and method
CN114648982A (en) * 2022-05-24 2022-06-21 四川大学 Controller voice recognition method and device based on comparative learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YI LIN ET AL.: "A Deep Learning Framework of Autonomous Pilot Agent for Air Traffic Controller Training" *
YI LIN ET AL.: "A Unified Framework for Multilingual Speech Recognition in Air Traffic Control Systems" *
WU XIANGYANG ET AL.: "Air traffic control speech recognition based on deep learning" *
GAO DENGFENG ET AL.: "Ground-air call speech enhancement method with a multi-feature fully convolutional network" *

Also Published As

Publication number Publication date
CN115240648B (en) 2023-04-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant