CN114495909B - End-to-end bone-air conduction speech joint recognition method - Google Patents
End-to-end bone-air conduction speech joint recognition method
- Publication number: CN114495909B (application CN202210153909.5A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—Physics; G10—Musical instruments; acoustics; G10L—Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
- G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
- G10L15/063 — Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L15/20 — Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise
- G10L15/26 — Speech-to-text systems
Abstract
The invention discloses an end-to-end bone-air conduction speech joint recognition method. Synchronous air-conduction and bone-conduction speech data are first acquired to build a data set whose output is the corresponding text; data enhancement and acoustic feature extraction are applied to the air- and bone-conduction speech signals; an end-to-end Conformer-based deep neural network model is then built, consisting of three parts: two branch networks that process air- and bone-conduction speech, and a fusion network based on a multi-modal Transducer; the neural network is trained, and the trained network finally produces the corresponding recognition result. Compared with the conventional approach of performing speech recognition with the air-conduction signal alone, the proposed joint recognition method significantly reduces the speech recognition error rate and improves the overall recognition performance of the system.
Description
Technical Field
The invention belongs to the technical field of speech recognition, and in particular relates to a bone-air conduction speech joint recognition method.
Background
In recent decades, thanks to the rise and progress of deep learning, robust automatic speech recognition has developed rapidly and has been applied in many fields such as smartphones, smart home appliances, and automobiles. Deep-learning-based robust speech recognition algorithms fall mainly into two categories: one removes noise at the front end of the system, e.g. by speech enhancement or by extracting noise-robust features; the other designs, at the back end of the system, a robust recognition model that can adapt to different noise scenarios. To date, however, these deep-learning-based speech recognition methods have all been built on air-conducted speech. Because speech propagates through the air, it is susceptible to interference from ambient noise, which severely degrades the recognition performance of the system at low signal-to-noise ratios, especially in the presence of non-stationary noise such as wind noise. In such cases, other modalities can be introduced for joint recognition to improve system performance.
Bone-conducted speech is the speech signal obtained by picking up the vibrations of the human skull and skin with a bone-conduction microphone. Compared with conventional air-conducted speech, bone-conducted speech is far less easily contaminated by noise in the surrounding environment, so it resists environmental noise at the sound source and preserves speech information well in low signal-to-noise-ratio environments. However, bone-conducted speech has several drawbacks of its own. First, its high-frequency content is severely attenuated, because human tissue strongly attenuates the high frequencies of the vibration signal. Although the frequency responses of bone-conduction microphones differ across manufacturers, the collected speech is typically heavily attenuated, or even entirely absent, above roughly 600 Hz. This loss of high-frequency content poses a serious challenge for speech recognition systems. Second, friction between the skin and the bone-conduction microphone and the motion of the human body introduce a certain amount of self-noise into bone-conducted speech, further increasing the difficulty of recognizing it. Finally, bone-conducted speech tends to lose unvoiced sounds, fricatives, and similar components of speech, which also reduces the performance of a speech recognition system.
Because of these characteristics of bone-conducted speech, speech recognition using bone-conducted speech alone still faces many challenges. Bone-conducted speech is, however, complementary to air-conducted speech in several respects. This patent therefore uses air-conducted and bone-conducted speech simultaneously and performs joint speech recognition with a deep learning model. Since no large-scale bone-air conduction speech database suitable for deep-learning speech recognition had previously been published, there has been no prior work on deep-learning-based, end-to-end bone-air joint speech recognition.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides an end-to-end bone-air conduction speech joint recognition method. Synchronous air-conduction and bone-conduction speech data are first acquired to build a data set whose output is the corresponding text; data enhancement and acoustic feature extraction are applied to the air- and bone-conduction speech signals; an end-to-end Conformer-based deep neural network model is then built, consisting of three parts: two branch networks that process air- and bone-conduction speech, and a fusion network based on a multi-modal Transducer; the neural network is trained, and the trained network finally produces the corresponding recognition result. Compared with the conventional approach of performing speech recognition with the air-conduction signal alone, the proposed joint recognition method significantly reduces the speech recognition error rate and improves the overall recognition performance of the system.
The technical solution adopted by the invention to solve the technical problem comprises the following steps:
Step 1: acquire synchronous air-conduction and bone-conduction speech data (x_a, x_b) to construct a data set, where x_a is clean air-conducted speech and x_b is the synchronously recorded bone-conducted speech; the output is the corresponding text y;
Noise is added to the air-conducted speech to obtain the noisy air-conducted speech x̃_a = x_a + n_a, where n_a is environmental noise; the final data set is (x̃_a, x_b, y), which is further divided into a training set, a verification set, and a test set;
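For illustration, mixing noise into the clean air-conducted waveform at a target signal-to-noise ratio can be sketched in Python as follows; the function name and the plain-list waveform representation are assumptions for readability, not the patent's implementation:

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the clean/noise power ratio equals `snr_db`
    decibels, then add it to `clean` sample by sample. A pure-Python
    sketch; real pipelines operate on sampled waveforms (e.g. 16 kHz PCM)."""
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(s * s for s in noise) / len(noise)
    # Target noise power for the requested SNR (in dB).
    target = p_clean / (10 ** (snr_db / 10))
    scale = math.sqrt(target / p_noise)
    return [s + scale * n for s, n in zip(clean, noise)]
```

At 0 dB the scaled noise carries the same power as the clean signal; at 20 dB it carries one hundredth of it.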
Step 2: data enhancement and feature extraction;
Step 2-1: perform preliminary data enhancement by changing the speaking rate of the air- and bone-conduction speech signals;
Step 2-2: extract acoustic features from the speed-perturbed air- and bone-conduction speech signals, respectively;
Step 2-3: apply the SpecAugment method to the acoustic features extracted in step 2-2 for a second round of data enhancement;
Step 3: build an end-to-end Conformer-based deep neural network model; the model consists of three parts: two branch networks that process air- and bone-conduction speech, and a fusion network based on a multi-modal Transducer;
Step 3-1: the two branch networks for air- and bone-conducted speech share the same Conformer architecture, comprising a Conformer encoder and a Transformer decoder;
the Conformer encoder is composed of several blocks, each containing two FFN modules, a multi-head self-attention module, and a convolution module; the Transformer decoder is composed of several blocks, each containing a multi-head self-attention module, a masked multi-head self-attention module, and an FFN module;
the enhanced acoustic features of the air- and bone-conducted speech from step 2-3 are passed through the Conformer encoder and the Transformer decoder in turn, yielding the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l, respectively;
Step 3-2: the inputs to the multi-modal Transducer fusion network are the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l obtained from the branch networks;
A linear feature transformation is first applied to c_l to obtain the key and value matrices, denoted K and V, and to g_l to obtain the query matrix, denoted Q:
Q = g_l·W_Q, K = c_l·W_K, V = c_l·W_V, where W_Q, W_K, and W_V are learnable linear transformation matrices;
Q and K are fed into the Scaling Sparsemax module to obtain the weighting coefficients [z_a, z_b] of the air- and bone-conduction features:
[z_a, z_b] = SSP(QK^T/√D_m, s)
where SSP(·, s) is the scaling Sparsemax operation, D_m is the model dimension, and s is a scale factor computed as s = 1 + ReLU(Linear(‖x‖₂)), where Linear denotes a linear transformation, ‖x‖₂ is the two-norm of the input vector, and ReLU(·) is the activation function; l ∈ {a, b};
the features fused with V are:
r_l = (z_l·V)^T + FFN(LayerNorm((z_l·V)^T))
the fused feature r_l passes through an output layer to obtain the final attention-based probability p_att(w), where w is the predicted text sequence, i.e. the output of the multi-modal Transducer fusion network;
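The patent names Scaling Sparsemax but does not spell out its projection step; the sketch below implements standard sparsemax (a projection onto the probability simplex that, unlike softmax, can assign exactly zero weight to a branch) and, as an assumption, applies the learned factor s as a temperature on the logits before projecting:

```python
def sparsemax(z):
    """Sparsemax (Martins & Astudillo, 2016): Euclidean projection of a
    logit vector onto the probability simplex. Can return exact zeros."""
    zs = sorted(z, reverse=True)
    k, cum_k, cum = 0, 0.0, 0.0
    for i, v in enumerate(zs, 1):
        cum += v
        if 1 + i * v > cum:      # support condition of the projection
            k, cum_k = i, cum
    tau = (cum_k - 1) / k        # threshold subtracted from each logit
    return [max(v - tau, 0.0) for v in z]

def scaling_sparsemax(z, s):
    """One plausible reading of the patent's Scaling Sparsemax:
    temperature-scale the logits by the learned factor s before the
    projection. This is an assumption, not the verified formula."""
    return sparsemax([v / s for v in z])
```

With only two branches (air and bone), a sufficiently dominant logit lets the module shut the other branch off completely, which is the behavior the channel-weighting description suggests.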
Step 4: train the neural network;
the network is trained in two steps: first, using the training set and verification set data, the two branch networks for air- and bone-conducted speech are trained separately with the CTC loss function; the multi-modal Transducer fusion network is then added and the whole network is trained, again with the CTC loss function;
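The CTC objective used in both training stages can be illustrated with a toy forward-algorithm implementation that computes the negative log-likelihood of a label sequence given per-frame log-probabilities; this is pure Python for exposition, not the patent's training code:

```python
import math

def ctc_loss(log_probs, target, blank=0):
    """Toy CTC forward pass. log_probs: T x V rows of per-frame
    log-probabilities; target: label indices without blanks.
    Returns the CTC negative log-likelihood."""
    ext = [blank]
    for c in target:
        ext += [c, blank]          # blank-extended label sequence
    S, T = len(ext), len(log_probs)
    NEG = float("-inf")

    def logadd(a, b):
        if a == NEG: return b
        if b == NEG: return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

    alpha = [NEG] * S
    alpha[0] = log_probs[0][blank]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        new = [NEG] * S
        for s in range(S):
            a = alpha[s]                       # stay
            if s > 0:
                a = logadd(a, alpha[s - 1])    # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logadd(a, alpha[s - 2])    # skip a blank
            new[s] = a + log_probs[t][ext[s]]
        alpha = new
    return -logadd(alpha[S - 1], alpha[S - 2] if S > 1 else NEG)
```

In practice a framework implementation (e.g. a built-in CTC loss with GPU support) would be used; the recursion above is the underlying computation.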
Step 5: model testing;
the test set data are fed into the trained network obtained in step 4 to obtain the corresponding recognition results.
Preferably, in step 2-1 the speaking rate of the air- and bone-conduction speech signals is changed to 0.9 and 1.1 times the original rate.
Preferably, the acoustic features extracted in step 2-2 are 80-dimensional Mel-bank features.
Preferably, the Conformer encoder consists of 12 blocks and the Transformer decoder consists of 6 blocks.
The beneficial effects of the invention are as follows:
The invention achieves end-to-end joint speech recognition by using noisy air-conducted speech and bone-conducted speech simultaneously. Compared with the conventional approach of performing speech recognition with the air-conduction signal alone, the joint recognition method significantly reduces the speech recognition error rate, especially at low signal-to-noise ratios. Compared with simply concatenating the air- and bone-conduction features, the multi-modal Transducer used in the invention adaptively assigns channel weights to fuse the two signals according to the characteristics of the air- and bone-conduction inputs, thereby improving the overall recognition performance of the system.
Drawings
FIG. 1 is a system framework diagram of the method of the invention.
FIG. 2 is a diagram of the multi-modal Transducer fusion network.
Detailed Description
The invention will be further described with reference to the drawings and examples.
The invention aims to provide an end-to-end, deep-learning-based multi-sensor joint speech recognition method, specifically a bone-air conduction joint speech recognition method, which takes time-synchronized bone- and air-conduction speech signals directly as the system input and directly outputs the corresponding speech recognition result.
An end-to-end bone-air conduction speech joint recognition method comprises the following steps:
Step 1: acquire synchronous air-conduction and bone-conduction speech data (x_a, x_b) to construct a data set, where x_a is clean air-conducted speech recorded in an anechoic chamber or a relatively quiet environment and x_b is the synchronously recorded bone-conducted speech; the output is the corresponding text y;
Noise is added to the air-conducted speech at a given signal-to-noise ratio to obtain the noisy air-conducted speech x̃_a = x_a + n_a, where n_a is environmental noise; the final data set is (x̃_a, x_b, y), which is further divided into a training set, a verification set, and a test set;
Step 2: data enhancement and feature extraction;
Step 2-1: perform preliminary data enhancement by changing the speaking rate of the air- and bone-conduction speech signals;
Step 2-2: extract acoustic features from the speed-perturbed air- and bone-conduction speech signals, respectively;
Step 2-3: apply the SpecAugment method to the acoustic features extracted in step 2-2 for a second round of data enhancement;
Step 3: build an end-to-end Conformer-based deep neural network model; the model consists of three parts: two branch networks that process air- and bone-conduction speech, and a fusion network based on a multi-modal Transducer;
Step 3-1: the two branch networks for air- and bone-conducted speech share the same Conformer architecture, comprising a Conformer encoder and a Transformer decoder;
the Conformer encoder is composed of several blocks, each containing two position-wise feed-forward (FFN) modules, a multi-head self-attention (MHA) module, and a convolution module; the Transformer decoder is composed of several blocks, each containing a multi-head self-attention module, a masked multi-head self-attention (masked MHA) module, and an FFN module;
the enhanced acoustic features of the air- and bone-conducted speech from step 2-3 are passed through the Conformer encoder and the Transformer decoder in turn, yielding the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l, respectively;
Step 3-2: the inputs to the multi-modal Transducer fusion network are the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l obtained from the branch networks;
A linear feature transformation is first applied to c_l to obtain the key and value matrices, denoted K and V, and to g_l to obtain the query matrix, denoted Q:
Q = g_l·W_Q, K = c_l·W_K, V = c_l·W_V, where W_Q, W_K, and W_V are learnable linear transformation matrices;
Q and K are fed into the Scaling Sparsemax module to obtain the weighting coefficients [z_a, z_b] of the air- and bone-conduction features:
[z_a, z_b] = SSP(QK^T/√D_m, s)
where SSP(·, s) is the scaling Sparsemax operation, D_m is the model dimension, and s is a scale factor computed as s = 1 + ReLU(Linear(‖x‖₂)), where Linear denotes a linear transformation, ‖x‖₂ is the two-norm of the input vector, and ReLU(·) is the activation function; l ∈ {a, b};
the features fused with V are:
r_l = (z_l·V)^T + FFN(LayerNorm((z_l·V)^T))
the fused feature r_l passes through an output layer to obtain the final attention-based probability p_att(w), where w is the predicted text sequence, i.e. the output of the multi-modal Transducer fusion network;
Step 4: train the neural network;
the network is trained in two steps: first, using the training set and verification set data, the two branch networks for air- and bone-conducted speech are trained separately with the CTC loss function; the multi-modal Transducer fusion network is then added and the whole network is trained, again with the CTC loss function;
Step 5: model testing;
the test set data are fed into the trained network obtained in step 4 to obtain the corresponding recognition results.
Specific examples:
S1: the data set is constructed by acquiring synchronous bone conduction and air conduction voice data (x a,xb), wherein x a is pure air conduction voice recorded in a sound attenuation laboratory or in a quieter environment, and x b is bone conduction voice recorded synchronously. All speech is downsampled to 16khz,16bit quantization. The input data of the model are air guide and bone conduction voice with noise, and a text y corresponding to the voice is output. Because the bone conduction voice does not introduce environmental noise, only the air conduction voice is added with noise according to a certain range of signal to noise ratio, namely Wherein/>N a is ambient noise, which is noisy air-conduction speech. The final dataset is/>Then further set 84% of the dataset as training set, 8% as validation set, and the remaining 8% as test set.
S2: data enhancement and feature extraction
S21: perform preliminary data enhancement by changing the speaking rate of the speech signals to 0.9 and 1.1 times the original rate.
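A minimal speed-perturbation sketch using linear-interpolation resampling follows; real recipes use proper polyphase resampling, so this is only illustrative:

```python
def change_speed(samples, factor):
    """Resample a waveform (list of floats) by linear interpolation so
    that playback is `factor` times faster (0.9 / 1.1 in the text)."""
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor                 # fractional source position
        j = int(pos)
        frac = pos - j
        a = samples[min(j, len(samples) - 1)]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(a * (1 - frac) + b * frac)
    return out
```

A factor below 1 slows speech down and lengthens the signal; a factor above 1 speeds it up and shortens it.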
S22: extract 80-dimensional Mel-bank features from the air-conducted and bone-conducted speech, respectively.
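Constructing the triangular mel filter matrix behind 80-dimensional Mel-bank features can be sketched as follows; the FFT size of 512 is an assumption, since the text fixes only 80 mel bins and 16 kHz audio:

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=512, sr=16000):
    """Triangular mel filter matrix (n_mels x (n_fft//2 + 1)); applying
    it to a power spectrum and taking logs yields Mel-bank features."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sr / 2)
    # n_mels + 2 equally spaced points on the mel scale.
    mels = [i * top / (n_mels + 1) for i in range(n_mels + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sr) for m in mels]
    fb = [[0.0] * n_bins for _ in range(n_mels)]
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):            # rising slope of the triangle
            fb[m - 1][k] = (k - l) / (c - l)
        for k in range(c, r):            # falling slope of the triangle
            fb[m - 1][k] = (r - k) / (r - c)
    return fb
```

With 80 filters over 257 bins, several low-frequency triangles collapse onto single FFT bins; production extractors handle this with fractional bin edges.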
S23: apply the SpecAugment method to the Mel-bank features for a second round of data enhancement.
S3: build the end-to-end Conformer-based deep neural network model. As shown in fig. 1, the model consists of three modules: two branch networks that process air- and bone-conducted speech, respectively, and a fusion network based on a multi-modal Transducer.
S31: the air- and bone-conduction branch networks are identical Conformer architectures, each comprising a Conformer encoder and a Transformer decoder. The encoder is composed of several blocks, each containing two position-wise feed-forward (FFN) modules, a multi-head self-attention (MHA) module, and a convolution module. Specifically, the encoder consists of 12 blocks; the convolution kernel size of the convolution module is 15, the number of self-attention heads is 8, the number of hidden nodes in the FFN is 2048, and the output dimension D_m is 256. The Transformer decoder is likewise composed of several blocks, each containing a multi-head self-attention module, a masked multi-head self-attention (masked MHA) module, and an FFN module. Specifically, the decoder consists of 6 blocks, with the other parameters configured as in the encoder. Through these two branch networks, the acoustic features of air- and bone-conducted speech are converted into the two feature vectors c_l and g_l in fig. 1.
S32: the structure of the multi-modal Transducer is shown in fig. 2; its main body is similar to a Transformer, and its inputs are the feature vectors c_l and g_l of air- and bone-conducted speech after transformation by the branch networks. First, linear feature transformations are applied to g_l and c_l to obtain the query, key, and value matrices, corresponding to Q, K, and V in fig. 2: Q = g_l·W_Q, K = c_l·W_K, V = c_l·W_V, where W_Q, W_K, and W_V are learnable linear transformation matrices. Q and K are fed into the Scaling Sparsemax module to obtain the weighting coefficients [z_a, z_b] of the air- and bone-conduction features:
[z_a, z_b] = SSP(QK^T/√D_m, s)
where SSP(x, s) is the scaling Sparsemax operation and s is a scale factor computed as s = 1 + ReLU(Linear(‖x‖₂)), where Linear is a linear transformation, ‖x‖₂ is the two-norm of the input vector, and ReLU is the activation function. The features fused with V are:
r_l = (z_l·V)^T + FFN(LayerNorm((z_l·V)^T))
The fused features pass through an output layer to obtain the final attention-based probability p_att(w), where w is the predicted text sequence, i.e. the output of the whole multi-modal Transducer.
S4: neural network optimization. Training of the whole network proceeds in two steps: the individual branch networks for air- and bone-conducted speech are optimized first, and then the whole network, including the multi-modal Transducer, is optimized jointly. The loss function for both the branch networks and the overall network is the CTC loss. The network is optimized with the Adam optimizer, and training is run for 50 epochs.
S5: model testing. The test data are fed into the trained network obtained in S4 to obtain the corresponding recognition results.
Claims (4)
1. An end-to-end bone-air conduction speech joint recognition method, characterized by comprising the following steps:
Step 1: acquire synchronous air-conduction and bone-conduction speech data (x_a, x_b) to construct a data set, where x_a is clean air-conducted speech and x_b is the synchronously recorded bone-conducted speech; the output is the corresponding text y;
Noise is added to the air-conducted speech to obtain the noisy air-conducted speech x̃_a = x_a + n_a, where n_a is environmental noise; the final data set is (x̃_a, x_b, y), which is further divided into a training set, a verification set, and a test set;
Step 2: data enhancement and feature extraction;
Step 2-1: perform preliminary data enhancement by changing the speaking rate of the air- and bone-conduction speech signals;
Step 2-2: extract acoustic features from the speed-perturbed air- and bone-conduction speech signals, respectively;
Step 2-3: apply the SpecAugment method to the acoustic features extracted in step 2-2 for a second round of data enhancement;
Step 3: build an end-to-end Conformer-based deep neural network model; the model consists of three parts: two branch networks that process air- and bone-conduction speech, and a fusion network based on a multi-modal Transducer;
Step 3-1: the two branch networks for air- and bone-conducted speech share the same Conformer architecture, comprising a Conformer encoder and a Transformer decoder;
the Conformer encoder is composed of several blocks, each containing two FFN modules, a multi-head self-attention module, and a convolution module; the Transformer decoder is composed of several blocks, each containing a multi-head self-attention module, a masked multi-head self-attention module, and an FFN module;
the enhanced acoustic features of the air- and bone-conducted speech from step 2-3 are passed through the Conformer encoder and the Transformer decoder in turn, yielding the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l, respectively;
Step 3-2: the inputs to the multi-modal Transducer fusion network are the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l obtained from the branch networks;
A linear feature transformation is first applied to c_l to obtain the key and value matrices, denoted K and V, and to g_l to obtain the query matrix, denoted Q:
Q = g_l·W_Q, K = c_l·W_K, V = c_l·W_V, where W_Q, W_K, and W_V are learnable linear transformation matrices;
Q and K are fed into the Scaling Sparsemax module to obtain the weighting coefficients [z_a, z_b] of the air- and bone-conduction features:
[z_a, z_b] = SSP(QK^T/√D_m, s)
where SSP(·, s) is the scaling Sparsemax operation, D_m is the model dimension, and s is a scale factor computed as s = 1 + ReLU(Linear(‖x‖₂)), where Linear denotes a linear transformation, ‖x‖₂ is the two-norm of the input vector, and ReLU(·) is the activation function; l ∈ {a, b};
the features fused with V are:
r_l = (z_l·V)^T + FFN(LayerNorm((z_l·V)^T))
the fused feature r_l passes through an output layer to obtain the final attention-based probability p_att(w), where w is the predicted text sequence, i.e. the output of the multi-modal Transducer fusion network;
Step 4: train the neural network;
the network is trained in two steps: first, using the training set and verification set data, the two branch networks for air- and bone-conducted speech are trained separately with the CTC loss function; the multi-modal Transducer fusion network is then added and the whole network is trained, again with the CTC loss function;
Step 5: model testing;
the test set data are fed into the trained network obtained in step 4 to obtain the corresponding recognition results.
2. The end-to-end bone-air conduction speech joint recognition method according to claim 1, characterized in that in step 2-1 the speaking rate of the air- and bone-conduction speech signals is changed to 0.9 and 1.1 times the original rate.
3. The end-to-end bone-air conduction speech joint recognition method according to claim 1, characterized in that the acoustic features extracted in step 2-2 are 80-dimensional Mel-bank features.
4. The end-to-end bone-air conduction speech joint recognition method according to claim 1, characterized in that the Conformer encoder consists of 12 blocks and the Transformer decoder consists of 6 blocks.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210153909.5A | 2022-02-20 | 2022-02-20 | End-to-end bone-air conduction speech joint recognition method |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN114495909A | 2022-05-13 |
| CN114495909B | 2024-04-30 |
Family: ID=81483047

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210153909.5A | End-to-end bone-air conduction speech joint recognition method | 2022-02-20 | 2022-02-20 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN114495909B |
Families Citing this family (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116030823B | 2023-03-30 | 2023-06-16 | 北京探境科技有限公司 | Voice signal processing method and device, computer equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007251354A (en) * | 2006-03-14 | 2007-09-27 | Saitama Univ | Microphone and sound generation method |
CN108986834A (en) * | 2018-08-22 | 2018-12-11 | 中国人民解放军陆军工程大学 | The blind Enhancement Method of bone conduction voice based on codec framework and recurrent neural network |
CN112786064A (en) * | 2020-12-30 | 2021-05-11 | 西北工业大学 | End-to-end bone-qi-conduction speech joint enhancement method |
Non-Patent Citations (2)
Title |
---|
End-to-end speech recognition based on gated convolutional networks and CTC; Yang Deju; Ma Liangli; Tan Linshan; Pei Jingjing; Computer Engineering and Design; 2020-09-16 (No. 09); full text *
Research status and prospects of blind enhancement for bone-conduction microphone speech; Zhang Xiongwei; Zheng Changyan; Cao Tieyong; Yang Jibin; Xing Yibo; Journal of Data Acquisition and Processing; 2018-09-15 (No. 05); full text *
Similar Documents
Publication | Title |
---|---|
CN110600017B (en) | Training method of voice processing model, voice recognition method, system and device | |
CN109949821B (en) | Method for removing reverberation of far-field voice by using U-NET structure of CNN | |
CN110085245B (en) | Voice definition enhancing method based on acoustic feature conversion | |
CN109427328B (en) | Multichannel voice recognition method based on filter network acoustic model | |
CN109887489B (en) | Speech dereverberation method based on depth features for generating countermeasure network | |
CN103258533B (en) | Novel model domain compensation method in remote voice recognition | |
CN105741849A (en) | Voice enhancement method for fusing phase estimation and human ear hearing characteristics in digital hearing aid | |
CN1494712A (en) | Distributed voice recognition system using acoustic feature vector modification | |
CN103229238A (en) | System and method for producing an audio signal | |
CN111833896A (en) | Voice enhancement method, system, device and storage medium for fusing feedback signals | |
CN105448302B (en) | A kind of the speech reverberation removing method and system of environment self-adaption | |
CN110047478B (en) | Multi-channel speech recognition acoustic modeling method and device based on spatial feature compensation | |
WO2022012206A1 (en) | Audio signal processing method, device, equipment, and storage medium | |
CN112786064B (en) | End-to-end bone-air conduction speech joint enhancement method | |
Yuliani et al. | Speech enhancement using deep learning methods: A review | |
CN103208291A (en) | Speech enhancement method and device applicable to strong noise environments | |
KR20080064557A (en) | Apparatus and method for improving speech intelligibility | |
CN114495909B (en) | End-to-end bone-air conduction speech joint recognition method | |
CN109243429A (en) | A kind of pronunciation modeling method and device | |
CN111142066A (en) | Direction-of-arrival estimation method, server, and computer-readable storage medium | |
CN110867178B (en) | Multi-channel far-field speech recognition method | |
CN113823273A (en) | Audio signal processing method, audio signal processing device, electronic equipment and storage medium | |
CN112185405B (en) | Bone conduction voice enhancement method based on differential operation and combined dictionary learning | |
CN116030823B (en) | Voice signal processing method and device, computer equipment and storage medium | |
CN203165457U (en) | Voice acquisition device used for noisy environment |
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant