CN114495909B - End-to-end bone-air conduction speech joint recognition method - Google Patents
End-to-end bone-air conduction speech joint recognition method
- Publication number: CN114495909B (application CN202210153909.5A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—Physics; G10—Musical instruments; acoustics; G10L—Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
- G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
- G10L15/063 — Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L15/20 — Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise
- G10L15/26 — Speech-to-text systems
Abstract
The invention discloses an end-to-end bone-air conduction speech joint recognition method. Synchronous air-conduction and bone-conduction speech data are first acquired to build a data set whose output is the corresponding text; data enhancement and acoustic feature extraction are applied to the air- and bone-conduction speech signals; an end-to-end Conformer-based deep neural network model is then built, consisting of three parts: two branch networks that process air- and bone-conduction speech, and a fusion network based on a multi-modal Transducer; the neural network is trained, and the trained network finally produces the corresponding recognition result. Compared with the conventional approach of performing speech recognition with the air-conduction signal alone, the proposed joint recognition method significantly reduces the speech recognition error rate and improves the overall recognition performance of the system.
Description
Technical Field
The invention belongs to the technical field of speech recognition, and in particular relates to a bone-air conduction speech joint recognition method.
Background
In recent decades, thanks to the rise and progress of deep learning, robust automatic speech recognition has developed rapidly and has been applied in many fields such as smartphones, smart home appliances, and automobiles. Deep-learning-based robust speech recognition algorithms fall mainly into two categories: one removes noise at the front end of the system, e.g. by speech enhancement or by extracting noise-robust features; the other designs, at the back end of the system, a robust recognition model that can adapt to different noise scenarios. To date, however, these deep-learning-based speech recognition methods have all been built on air-conducted speech. Because speech propagates through the air, it is susceptible to interference from ambient noise, which severely degrades the recognition performance of the system at low signal-to-noise ratios, especially in the presence of non-stationary noise such as wind noise. In such cases, other modalities can be introduced for joint recognition to improve system performance.
Bone-conducted speech is the speech signal obtained by picking up the vibrations of the human skull and skin with a bone-conduction microphone. Compared with conventional air-conducted speech, bone-conducted speech is far less easily contaminated by noise in the surrounding environment, so it resists environmental noise at the sound source and preserves speech information well in low signal-to-noise-ratio environments. However, bone-conducted speech has several drawbacks of its own. First, its high-frequency content is severely attenuated, because human tissue strongly attenuates the high frequencies of the vibration signal. Although the frequency responses of bone-conduction microphones differ across manufacturers, the collected speech is typically heavily attenuated, or even entirely absent, above roughly 600 Hz. This loss of high-frequency content poses a serious challenge for speech recognition systems. Second, friction between the skin and the bone-conduction microphone and the motion of the human body introduce a certain amount of self-noise into bone-conducted speech, further increasing the difficulty of recognizing it. Finally, bone-conducted speech tends to lose unvoiced sounds, fricatives, and similar components of speech, which also reduces the performance of a speech recognition system.
Because of these characteristics of bone-conducted speech, speech recognition using bone-conducted speech alone still faces many challenges. Bone-conducted speech is, however, complementary to air-conducted speech in several respects. This patent therefore uses air-conducted and bone-conducted speech simultaneously and performs joint speech recognition with a deep learning model. Since no large-scale bone-air conduction speech database suitable for deep-learning speech recognition had previously been published, there has been no prior work on deep-learning-based, end-to-end bone-air joint speech recognition.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides an end-to-end bone-air conduction speech joint recognition method. Synchronous air-conduction and bone-conduction speech data are first acquired to build a data set whose output is the corresponding text; data enhancement and acoustic feature extraction are applied to the air- and bone-conduction speech signals; an end-to-end Conformer-based deep neural network model is then built, consisting of three parts: two branch networks that process air- and bone-conduction speech, and a fusion network based on a multi-modal Transducer; the neural network is trained, and the trained network finally produces the corresponding recognition result. Compared with the conventional approach of performing speech recognition with the air-conduction signal alone, the proposed joint recognition method significantly reduces the speech recognition error rate and improves the overall recognition performance of the system.
The technical solution adopted by the invention to solve the technical problem comprises the following steps:
Step 1: acquire synchronous air-conduction and bone-conduction speech data (x_a, x_b) to construct a data set, where x_a is clean air-conducted speech and x_b is the synchronously recorded bone-conducted speech; the output is the corresponding text y;
Noise is added to the air-conducted speech to obtain the noisy air-conducted speech x̃_a = x_a + n_a, where n_a is environmental noise; the final data set is (x̃_a, x_b, y), which is further divided into a training set, a verification set, and a test set;
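For illustration, mixing noise into the clean air-conducted waveform at a target signal-to-noise ratio can be sketched in Python as follows; the function name and the plain-list waveform representation are assumptions for readability, not the patent's implementation:

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the clean/noise power ratio equals `snr_db`
    decibels, then add it to `clean` sample by sample. A pure-Python
    sketch; real pipelines operate on sampled waveforms (e.g. 16 kHz PCM)."""
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(s * s for s in noise) / len(noise)
    # Target noise power for the requested SNR (in dB).
    target = p_clean / (10 ** (snr_db / 10))
    scale = math.sqrt(target / p_noise)
    return [s + scale * n for s, n in zip(clean, noise)]
```

At 0 dB the scaled noise carries the same power as the clean signal; at 20 dB it carries one hundredth of it.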
Step 2: data enhancement and feature extraction;
Step 2-1: perform preliminary data enhancement by changing the speaking rate of the air- and bone-conduction speech signals;
Step 2-2: extract acoustic features from the speed-perturbed air- and bone-conduction speech signals, respectively;
Step 2-3: apply the SpecAugment method to the acoustic features extracted in step 2-2 for a second round of data enhancement;
Step 3: build an end-to-end Conformer-based deep neural network model; the model consists of three parts: two branch networks that process air- and bone-conduction speech, and a fusion network based on a multi-modal Transducer;
Step 3-1: the two branch networks for air- and bone-conducted speech share the same Conformer architecture, comprising a Conformer encoder and a Transformer decoder;
the Conformer encoder is composed of several blocks, each containing two FFN modules, a multi-head self-attention module, and a convolution module; the Transformer decoder is composed of several blocks, each containing a multi-head self-attention module, a masked multi-head self-attention module, and an FFN module;
the enhanced acoustic features of the air- and bone-conducted speech from step 2-3 are passed through the Conformer encoder and the Transformer decoder in turn, yielding the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l, respectively;
Step 3-2: the inputs to the multi-modal Transducer fusion network are the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l obtained from the branch networks;
A linear feature transformation is first applied to c_l to obtain the key and value matrices, denoted K and V, and to g_l to obtain the query matrix, denoted Q:
Q = g_l·W_Q, K = c_l·W_K, V = c_l·W_V, where W_Q, W_K, and W_V are learnable linear transformation matrices;
Q and K are fed into the Scaling Sparsemax module to obtain the weighting coefficients [z_a, z_b] of the air- and bone-conduction features:
[z_a, z_b] = SSP(QK^T/√D_m, s)
where SSP(·, s) is the scaling Sparsemax operation, D_m is the model dimension, and s is a scale factor computed as s = 1 + ReLU(Linear(‖x‖₂)), where Linear denotes a linear transformation, ‖x‖₂ is the two-norm of the input vector, and ReLU(·) is the activation function; l ∈ {a, b};
the features fused with V are:
r_l = (z_l·V)^T + FFN(LayerNorm((z_l·V)^T))
the fused feature r_l passes through an output layer to obtain the final attention-based probability p_att(w), where w is the predicted text sequence, i.e. the output of the multi-modal Transducer fusion network;
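The patent names Scaling Sparsemax but does not spell out its projection step; the sketch below implements standard sparsemax (a projection onto the probability simplex that, unlike softmax, can assign exactly zero weight to a branch) and, as an assumption, applies the learned factor s as a temperature on the logits before projecting:

```python
def sparsemax(z):
    """Sparsemax (Martins & Astudillo, 2016): Euclidean projection of a
    logit vector onto the probability simplex. Can return exact zeros."""
    zs = sorted(z, reverse=True)
    k, cum_k, cum = 0, 0.0, 0.0
    for i, v in enumerate(zs, 1):
        cum += v
        if 1 + i * v > cum:      # support condition of the projection
            k, cum_k = i, cum
    tau = (cum_k - 1) / k        # threshold subtracted from each logit
    return [max(v - tau, 0.0) for v in z]

def scaling_sparsemax(z, s):
    """One plausible reading of the patent's Scaling Sparsemax:
    temperature-scale the logits by the learned factor s before the
    projection. This is an assumption, not the verified formula."""
    return sparsemax([v / s for v in z])
```

With only two branches (air and bone), a sufficiently dominant logit lets the module shut the other branch off completely, which is the behavior the channel-weighting description suggests.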
Step 4: train the neural network;
the network is trained in two steps: first, using the training set and verification set data, the two branch networks for air- and bone-conducted speech are trained separately with the CTC loss function; the multi-modal Transducer fusion network is then added and the whole network is trained, again with the CTC loss function;
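The CTC objective used in both training stages can be illustrated with a toy forward-algorithm implementation that computes the negative log-likelihood of a label sequence given per-frame log-probabilities; this is pure Python for exposition, not the patent's training code:

```python
import math

def ctc_loss(log_probs, target, blank=0):
    """Toy CTC forward pass. log_probs: T x V rows of per-frame
    log-probabilities; target: label indices without blanks.
    Returns the CTC negative log-likelihood."""
    ext = [blank]
    for c in target:
        ext += [c, blank]          # blank-extended label sequence
    S, T = len(ext), len(log_probs)
    NEG = float("-inf")

    def logadd(a, b):
        if a == NEG: return b
        if b == NEG: return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

    alpha = [NEG] * S
    alpha[0] = log_probs[0][blank]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        new = [NEG] * S
        for s in range(S):
            a = alpha[s]                       # stay
            if s > 0:
                a = logadd(a, alpha[s - 1])    # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logadd(a, alpha[s - 2])    # skip a blank
            new[s] = a + log_probs[t][ext[s]]
        alpha = new
    return -logadd(alpha[S - 1], alpha[S - 2] if S > 1 else NEG)
```

In practice a framework implementation (e.g. a built-in CTC loss with GPU support) would be used; the recursion above is the underlying computation.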
Step 5: model testing;
the test set data are fed into the trained network obtained in step 4 to obtain the corresponding recognition results.
Preferably, in step 2-1 the speaking rate of the air- and bone-conduction speech signals is changed to 0.9 and 1.1 times the original rate.
Preferably, the acoustic features extracted in step 2-2 are 80-dimensional Mel-bank features.
Preferably, the Conformer encoder consists of 12 blocks and the Transformer decoder consists of 6 blocks.
The beneficial effects of the invention are as follows:
The invention achieves end-to-end joint speech recognition by using noisy air-conducted speech and bone-conducted speech simultaneously. Compared with the conventional approach of performing speech recognition with the air-conduction signal alone, the joint recognition method significantly reduces the speech recognition error rate, especially at low signal-to-noise ratios. Compared with simply concatenating the air- and bone-conduction features, the multi-modal Transducer used in the invention adaptively assigns channel weights to fuse the two signals according to the characteristics of the air- and bone-conduction inputs, thereby improving the overall recognition performance of the system.
Drawings
FIG. 1 is a system framework diagram of the method of the invention.
FIG. 2 is a diagram of the multi-modal Transducer fusion network.
Detailed Description
The invention will be further described with reference to the drawings and examples.
The invention aims to provide an end-to-end, deep-learning-based multi-sensor joint speech recognition method, specifically a bone-air conduction joint speech recognition method, which takes time-synchronized bone- and air-conduction speech signals directly as the system input and directly outputs the corresponding speech recognition result.
An end-to-end bone-air conduction speech joint recognition method comprises the following steps:
Step 1: acquire synchronous air-conduction and bone-conduction speech data (x_a, x_b) to construct a data set, where x_a is clean air-conducted speech recorded in an anechoic chamber or a relatively quiet environment and x_b is the synchronously recorded bone-conducted speech; the output is the corresponding text y;
Noise is added to the air-conducted speech at a given signal-to-noise ratio to obtain the noisy air-conducted speech x̃_a = x_a + n_a, where n_a is environmental noise; the final data set is (x̃_a, x_b, y), which is further divided into a training set, a verification set, and a test set;
Step 2: data enhancement and feature extraction;
Step 2-1: perform preliminary data enhancement by changing the speaking rate of the air- and bone-conduction speech signals;
Step 2-2: extract acoustic features from the speed-perturbed air- and bone-conduction speech signals, respectively;
Step 2-3: apply the SpecAugment method to the acoustic features extracted in step 2-2 for a second round of data enhancement;
Step 3: build an end-to-end Conformer-based deep neural network model; the model consists of three parts: two branch networks that process air- and bone-conduction speech, and a fusion network based on a multi-modal Transducer;
Step 3-1: the two branch networks for air- and bone-conducted speech share the same Conformer architecture, comprising a Conformer encoder and a Transformer decoder;
the Conformer encoder is composed of several blocks, each containing two position-wise feed-forward (FFN) modules, a multi-head self-attention (MHA) module, and a convolution module; the Transformer decoder is composed of several blocks, each containing a multi-head self-attention module, a masked multi-head self-attention (masked MHA) module, and an FFN module;
the enhanced acoustic features of the air- and bone-conducted speech from step 2-3 are passed through the Conformer encoder and the Transformer decoder in turn, yielding the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l, respectively;
Step 3-2: the inputs to the multi-modal Transducer fusion network are the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l obtained from the branch networks;
A linear feature transformation is first applied to c_l to obtain the key and value matrices, denoted K and V, and to g_l to obtain the query matrix, denoted Q:
Q = g_l·W_Q, K = c_l·W_K, V = c_l·W_V, where W_Q, W_K, and W_V are learnable linear transformation matrices;
Q and K are fed into the Scaling Sparsemax module to obtain the weighting coefficients [z_a, z_b] of the air- and bone-conduction features:
[z_a, z_b] = SSP(QK^T/√D_m, s)
where SSP(·, s) is the scaling Sparsemax operation, D_m is the model dimension, and s is a scale factor computed as s = 1 + ReLU(Linear(‖x‖₂)), where Linear denotes a linear transformation, ‖x‖₂ is the two-norm of the input vector, and ReLU(·) is the activation function; l ∈ {a, b};
the features fused with V are:
r_l = (z_l·V)^T + FFN(LayerNorm((z_l·V)^T))
the fused feature r_l passes through an output layer to obtain the final attention-based probability p_att(w), where w is the predicted text sequence, i.e. the output of the multi-modal Transducer fusion network;
Step 4: train the neural network;
the network is trained in two steps: first, using the training set and verification set data, the two branch networks for air- and bone-conducted speech are trained separately with the CTC loss function; the multi-modal Transducer fusion network is then added and the whole network is trained, again with the CTC loss function;
Step 5: model testing;
the test set data are fed into the trained network obtained in step 4 to obtain the corresponding recognition results.
Specific examples:
S1: the data set is constructed by acquiring synchronous bone conduction and air conduction voice data (x a,xb), wherein x a is pure air conduction voice recorded in a sound attenuation laboratory or in a quieter environment, and x b is bone conduction voice recorded synchronously. All speech is downsampled to 16khz,16bit quantization. The input data of the model are air guide and bone conduction voice with noise, and a text y corresponding to the voice is output. Because the bone conduction voice does not introduce environmental noise, only the air conduction voice is added with noise according to a certain range of signal to noise ratio, namely Wherein/>N a is ambient noise, which is noisy air-conduction speech. The final dataset is/>Then further set 84% of the dataset as training set, 8% as validation set, and the remaining 8% as test set.
S2: data enhancement and feature extraction
S21: perform preliminary data enhancement by changing the speaking rate of the speech signals to 0.9 and 1.1 times the original rate.
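A minimal speed-perturbation sketch using linear-interpolation resampling follows; real recipes use proper polyphase resampling, so this is only illustrative:

```python
def change_speed(samples, factor):
    """Resample a waveform (list of floats) by linear interpolation so
    that playback is `factor` times faster (0.9 / 1.1 in the text)."""
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor                 # fractional source position
        j = int(pos)
        frac = pos - j
        a = samples[min(j, len(samples) - 1)]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(a * (1 - frac) + b * frac)
    return out
```

A factor below 1 slows speech down and lengthens the signal; a factor above 1 speeds it up and shortens it.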
S22: extract 80-dimensional Mel-bank features from the air-conducted and bone-conducted speech, respectively.
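Constructing the triangular mel filter matrix behind 80-dimensional Mel-bank features can be sketched as follows; the FFT size of 512 is an assumption, since the text fixes only 80 mel bins and 16 kHz audio:

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=512, sr=16000):
    """Triangular mel filter matrix (n_mels x (n_fft//2 + 1)); applying
    it to a power spectrum and taking logs yields Mel-bank features."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sr / 2)
    # n_mels + 2 equally spaced points on the mel scale.
    mels = [i * top / (n_mels + 1) for i in range(n_mels + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sr) for m in mels]
    fb = [[0.0] * n_bins for _ in range(n_mels)]
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):            # rising slope of the triangle
            fb[m - 1][k] = (k - l) / (c - l)
        for k in range(c, r):            # falling slope of the triangle
            fb[m - 1][k] = (r - k) / (r - c)
    return fb
```

With 80 filters over 257 bins, several low-frequency triangles collapse onto single FFT bins; production extractors handle this with fractional bin edges.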
S23: apply the SpecAugment method to the Mel-bank features for a second round of data enhancement.
S3: build the end-to-end Conformer-based deep neural network model. As shown in fig. 1, the model consists of three modules: two branch networks that process air- and bone-conducted speech, respectively, and a fusion network based on a multi-modal Transducer.
S31: the air- and bone-conduction branch networks are identical Conformer architectures, each comprising a Conformer encoder and a Transformer decoder. The encoder is composed of several blocks, each containing two position-wise feed-forward (FFN) modules, a multi-head self-attention (MHA) module, and a convolution module. Specifically, the encoder consists of 12 blocks; the convolution kernel size of the convolution module is 15, the number of self-attention heads is 8, the number of hidden nodes in the FFN is 2048, and the output dimension D_m is 256. The Transformer decoder is likewise composed of several blocks, each containing a multi-head self-attention module, a masked multi-head self-attention (masked MHA) module, and an FFN module. Specifically, the decoder consists of 6 blocks, with the other parameters configured as in the encoder. Through these two branch networks, the acoustic features of air- and bone-conducted speech are converted into the two feature vectors c_l and g_l in fig. 1.
S32: the structure of the multi-modal Transducer is shown in fig. 2; its main body is similar to a Transformer, and its inputs are the feature vectors c_l and g_l of air- and bone-conducted speech after transformation by the branch networks. First, linear feature transformations are applied to g_l and c_l to obtain the query, key, and value matrices, corresponding to Q, K, and V in fig. 2: Q = g_l·W_Q, K = c_l·W_K, V = c_l·W_V, where W_Q, W_K, and W_V are learnable linear transformation matrices. Q and K are fed into the Scaling Sparsemax module to obtain the weighting coefficients [z_a, z_b] of the air- and bone-conduction features:
[z_a, z_b] = SSP(QK^T/√D_m, s)
where SSP(x, s) is the scaling Sparsemax operation and s is a scale factor computed as s = 1 + ReLU(Linear(‖x‖₂)), where Linear is a linear transformation, ‖x‖₂ is the two-norm of the input vector, and ReLU is the activation function. The features fused with V are:
r_l = (z_l·V)^T + FFN(LayerNorm((z_l·V)^T))
The fused features pass through an output layer to obtain the final attention-based probability p_att(w), where w is the predicted text sequence, i.e. the output of the whole multi-modal Transducer.
S4: neural network optimization. Training of the whole network proceeds in two steps: the individual branch networks for air- and bone-conducted speech are optimized first, and then the whole network, including the multi-modal Transducer, is optimized jointly. The loss function for both the branch networks and the overall network is the CTC loss. The network is optimized with the Adam optimizer, and training is run for 50 epochs.
S5: model testing. The test data are fed into the trained network obtained in S4 to obtain the corresponding recognition results.
Claims (4)
1. An end-to-end bone-air conduction speech joint recognition method, characterized by comprising the following steps:
Step 1: acquire synchronous air-conduction and bone-conduction speech data (x_a, x_b) to construct a data set, where x_a is clean air-conducted speech and x_b is the synchronously recorded bone-conducted speech; the output is the corresponding text y;
Noise is added to the air-conducted speech to obtain the noisy air-conducted speech x̃_a = x_a + n_a, where n_a is environmental noise; the final data set is (x̃_a, x_b, y), which is further divided into a training set, a verification set, and a test set;
Step 2: data enhancement and feature extraction;
Step 2-1: perform preliminary data enhancement by changing the speaking rate of the air- and bone-conduction speech signals;
Step 2-2: extract acoustic features from the speed-perturbed air- and bone-conduction speech signals, respectively;
Step 2-3: apply the SpecAugment method to the acoustic features extracted in step 2-2 for a second round of data enhancement;
Step 3: build an end-to-end Conformer-based deep neural network model; the model consists of three parts: two branch networks that process air- and bone-conduction speech, and a fusion network based on a multi-modal Transducer;
Step 3-1: the two branch networks for air- and bone-conducted speech share the same Conformer architecture, comprising a Conformer encoder and a Transformer decoder;
the Conformer encoder is composed of several blocks, each containing two FFN modules, a multi-head self-attention module, and a convolution module; the Transformer decoder is composed of several blocks, each containing a multi-head self-attention module, a masked multi-head self-attention module, and an FFN module;
the enhanced acoustic features of the air- and bone-conducted speech from step 2-3 are passed through the Conformer encoder and the Transformer decoder in turn, yielding the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l, respectively;
Step 3-2: the inputs to the multi-modal Transducer fusion network are the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l obtained from the branch networks;
A linear feature transformation is first applied to c_l to obtain the key and value matrices, denoted K and V, and to g_l to obtain the query matrix, denoted Q:
Q = g_l·W_Q, K = c_l·W_K, V = c_l·W_V, where W_Q, W_K, and W_V are learnable linear transformation matrices;
Q and K are fed into the Scaling Sparsemax module to obtain the weighting coefficients [z_a, z_b] of the air- and bone-conduction features:
[z_a, z_b] = SSP(QK^T/√D_m, s)
where SSP(·, s) is the scaling Sparsemax operation, D_m is the model dimension, and s is a scale factor computed as s = 1 + ReLU(Linear(‖x‖₂)), where Linear denotes a linear transformation, ‖x‖₂ is the two-norm of the input vector, and ReLU(·) is the activation function; l ∈ {a, b};
the features fused with V are:
r_l = (z_l·V)^T + FFN(LayerNorm((z_l·V)^T))
the fused feature r_l passes through an output layer to obtain the final attention-based probability p_att(w), where w is the predicted text sequence, i.e. the output of the multi-modal Transducer fusion network;
Step 4: train the neural network;
the network is trained in two steps: first, using the training set and verification set data, the two branch networks for air- and bone-conducted speech are trained separately with the CTC loss function; the multi-modal Transducer fusion network is then added and the whole network is trained, again with the CTC loss function;
Step 5: model testing;
the test set data are fed into the trained network obtained in step 4 to obtain the corresponding recognition results.
2. The end-to-end bone-air conduction speech joint recognition method according to claim 1, characterized in that in step 2-1 the speaking rate of the air- and bone-conduction speech signals is changed to 0.9 and 1.1 times the original rate.
3. The end-to-end bone-air conduction speech joint recognition method according to claim 1, characterized in that the acoustic features extracted in step 2-2 are 80-dimensional Mel-bank features.
4. The end-to-end bone-air conduction speech joint recognition method according to claim 1, characterized in that the Conformer encoder consists of 12 blocks and the Transformer decoder consists of 6 blocks.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210153909.5A | 2022-02-20 | 2022-02-20 | End-to-end bone-air conduction speech joint recognition method |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN114495909A | 2022-05-13 |
| CN114495909B | 2024-04-30 |
Family: ID=81483047

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210153909.5A | End-to-end bone-air conduction speech joint recognition method | 2022-02-20 | 2022-02-20 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN114495909B |
Families Citing this family (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116030823B | 2023-03-30 | 2023-06-16 | 北京探境科技有限公司 | Voice signal processing method and device, computer equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007251354A (en) * | 2006-03-14 | 2007-09-27 | Saitama Univ | Microphone and sound generation method |
CN108986834A (en) * | 2018-08-22 | 2018-12-11 | 中国人民解放军陆军工程大学 | The blind Enhancement Method of bone conduction voice based on codec framework and recurrent neural network |
CN112786064A (en) * | 2020-12-30 | 2021-05-11 | 西北工业大学 | End-to-end bone-qi-conduction speech joint enhancement method |
Non-Patent Citations (2)
Title |
---|
End-to-end speech recognition based on gated convolutional networks and CTC; Yang Deju; Ma Liangli; Tan Linshan; Pei Jingjing; Computer Engineering and Design; 2020-09-16 (No. 09); full text *
Research status and prospects of blind enhancement for bone-conduction microphone speech; Zhang Xiongwei; Zheng Changyan; Cao Tieyong; Yang Jibin; Xing Yibo; Journal of Data Acquisition and Processing; 2018-09-15 (No. 05); full text *
Similar Documents
Publication | Title |
---|---|
CN110600017B (en) | Training method of voice processing model, voice recognition method, system and device | |
CN109949821B (en) | Method for removing reverberation of far-field voice by using U-NET structure of CNN | |
CN110085245B (en) | Voice definition enhancing method based on acoustic feature conversion | |
CN109427328B (en) | Multichannel voice recognition method based on filter network acoustic model | |
CN109887489B (en) | Speech dereverberation method based on depth features for generating countermeasure network | |
CN103258533B (en) | Novel model domain compensation method in remote voice recognition | |
CN105741849A (en) | Voice enhancement method for fusing phase estimation and human ear hearing characteristics in digital hearing aid | |
CN1494712A (en) | Distributed voice recognition system using acoustic feature vector modification | |
CN103229238A (en) | System and method for producing an audio signal | |
CN111833896A (en) | Voice enhancement method, system, device and storage medium for fusing feedback signals | |
CN105448302B (en) | A kind of the speech reverberation removing method and system of environment self-adaption | |
CN110047478B (en) | Multi-channel speech recognition acoustic modeling method and device based on spatial feature compensation | |
WO2022012206A1 (en) | Audio signal processing method, device, equipment, and storage medium | |
CN112786064B (en) | End-to-end bone-air conduction speech joint enhancement method | |
Yuliani et al. | Speech enhancement using deep learning methods: A review | |
CN103208291A (en) | Speech enhancement method and device applicable to strong noise environments | |
KR20080064557A (en) | Apparatus and method for improving speech intelligibility | |
CN114495909B (en) | End-to-end bone-air conduction speech joint recognition method | |
CN109243429A (en) | A kind of pronunciation modeling method and device | |
CN111142066A (en) | Direction-of-arrival estimation method, server, and computer-readable storage medium | |
CN110867178B (en) | Multi-channel far-field speech recognition method | |
CN113823273A (en) | Audio signal processing method, audio signal processing device, electronic equipment and storage medium | |
CN112185405B (en) | Bone conduction voice enhancement method based on differential operation and combined dictionary learning | |
CN116030823B (en) | Voice signal processing method and device, computer equipment and storage medium | |
CN203165457U (en) | Voice acquisition device used for noisy environment |
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant