CN114495909B - End-to-end bone-air conduction speech joint recognition method - Google Patents

End-to-end bone-air conduction speech joint recognition method

Info

Publication number
CN114495909B
Authority
CN
China
Prior art keywords
voice
bone
air conduction
bone conduction
air
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210153909.5A
Other languages
Chinese (zh)
Other versions
CN114495909A (en)
Inventor
王谋
陈俊淇
张晓雷
王逸平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202210153909.5A
Publication of CN114495909A
Application granted
Publication of CN114495909B
Status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephone Function (AREA)
  • Details Of Audible-Bandwidth Transducers (AREA)

Abstract

The invention discloses an end-to-end bone-air conduction speech joint recognition method. Synchronized air-conducted and bone-conducted speech data are first acquired to construct a dataset whose output is the corresponding text; data augmentation and acoustic feature extraction are applied to the air-conducted and bone-conducted speech signals; an end-to-end Conformer-based deep neural network model is then built, consisting of three parts: two branch networks that process the air-conducted and bone-conducted speech, and a fusion network based on a multi-modal Transducer; the neural network is trained, and the trained network finally produces the corresponding recognition result. Compared with the conventional approach of performing speech recognition with the air-conducted signal alone, the joint recognition method of the invention significantly reduces the speech recognition error rate and improves the overall recognition performance of the system.

Description

End-to-end bone-air conduction speech joint recognition method
Technical Field
The invention belongs to the technical field of speech recognition, and in particular relates to a bone-air conduction speech joint recognition method.
Background
In recent decades, thanks to the rise and progress of deep learning, robust automatic speech recognition has developed significantly and has been applied in many fields such as smartphones, smart home appliances, and automobiles. Deep-learning-based robust speech recognition algorithms fall mainly into two categories: one removes noise at the front end of the system, including speech enhancement and the extraction of noise-robust features; the other designs, at the back end of the system, a robust recognition model that can adapt to different noise scenarios. To date, however, these deep-learning-based speech recognition methods have all been built on air-conducted speech. Because air-conducted speech propagates through the air, it is susceptible to interference from ambient noise, which severely degrades recognition performance at low signal-to-noise ratios, especially in the presence of non-stationary noise such as wind noise. In such cases, other modalities can be introduced for joint recognition to improve system performance.
Bone-conducted speech is a speech signal obtained by picking up the vibrations of the human skull and skin with a bone conduction microphone. Compared with conventional air-conducted speech, bone-conducted speech is not easily contaminated by ambient noise, so it resists environmental noise at the source and preserves speech information well in low signal-to-noise-ratio environments. Bone-conducted speech nevertheless has several drawbacks. First, its high-frequency content is severely attenuated because human tissue attenuates the high frequencies of the vibration signal; although the frequency responses of bone conduction microphones differ between manufacturers, the collected speech is severely attenuated, or even entirely missing, above roughly 600 Hz, and this loss of high-frequency content poses a serious challenge for speech recognition systems. Second, friction between the skin and the bone conduction microphone and the motion of the human body introduce a certain amount of self-noise into bone-conducted speech, which further increases the difficulty of recognizing it. Finally, bone-conducted speech tends to lose unvoiced sounds, fricatives, and similar components, which also reduces the performance of a speech recognition system.
Because of these characteristics, speech recognition using bone-conducted speech alone still faces many challenges, yet bone-conducted speech is complementary to air-conducted speech in several respects. This patent therefore uses air-conducted and bone-conducted speech simultaneously and performs joint speech recognition with a deep learning model. Because no large-scale bone-air conduction speech database suitable for deep-learning speech recognition had previously been published, there has so far been no work on end-to-end, deep-learning-based joint bone-air conduction speech recognition.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an end-to-end bone-air conduction speech joint recognition method. Synchronized air-conducted and bone-conducted speech data are first acquired to construct a dataset whose output is the corresponding text; data augmentation and acoustic feature extraction are applied to the air-conducted and bone-conducted speech signals; an end-to-end Conformer-based deep neural network model is then built, consisting of three parts: two branch networks that process the air-conducted and bone-conducted speech, and a fusion network based on a multi-modal Transducer; the neural network is trained, and the trained network finally produces the corresponding recognition result. Compared with the conventional approach of performing speech recognition with the air-conducted signal alone, the joint recognition method of the invention significantly reduces the speech recognition error rate and improves the overall recognition performance of the system.
The technical scheme adopted by the invention for solving the technical problems comprises the following steps:
Step 1: acquiring synchronized air-conducted and bone-conducted speech data (x_a, x_b) to construct a dataset, wherein x_a is clean air-conducted speech and x_b is the synchronously recorded bone-conducted speech, and the output is the corresponding text y;
Noise is added to the air-conducted speech to obtain the noisy air-conducted speech x̃_a = x_a + n_a, wherein n_a is environmental noise; the final dataset is (x̃_a, x_b, y), which is further divided into a training set, a validation set, and a test set;
Step 2: data augmentation and feature extraction;
Step 2-1: performing preliminary data augmentation by changing the speaking rate of the air-conducted and bone-conducted speech signals;
Step 2-2: extracting acoustic features from the speed-perturbed air-conducted and bone-conducted speech signals respectively;
Step 2-3: applying the SpecAugment method to the acoustic features extracted in step 2-2 for a second round of data augmentation;
Step 3: building an end-to-end Conformer-based deep neural network model; the model consists of three parts: two branch networks that process the air-conducted and bone-conducted speech, and a fusion network based on a multi-modal Transducer;
Step 3-1: the two branch networks for air-conducted and bone-conducted speech share the same Conformer architecture, comprising a Conformer encoder and a Transformer decoder;
The Conformer encoder is composed of several blocks, each containing two FFN modules, a multi-head self-attention module, and a convolution module; the Transformer decoder is composed of several blocks, each containing a multi-head self-attention module, a masked multi-head self-attention module, and an FFN module;
The augmented acoustic features of the air-conducted and bone-conducted speech from step 2-3 are passed through the Conformer encoder and the Transformer decoder in turn and converted into the air-conduction feature vector c_l and the bone-conduction feature vector g_l, respectively;
Step 3-2: the inputs of the multi-modal Transducer fusion network are the air-conduction feature vector c_l and the bone-conduction feature vector g_l obtained from the air-conducted and bone-conducted speech through the branch networks;
First, a linear feature transformation of c_l yields the key and value matrices, denoted K and V respectively; a linear feature transformation of g_l yields the query matrix, denoted Q;
Q = g_l W_Q, K = c_l W_K, V = c_l W_V, where W_Q, W_K, W_V are learnable linear transformation matrices;
Q and K are fed into the Scaling Sparsemax module to obtain the fusion weights [z_a, z_b] of the air-conduction and bone-conduction features,
where SSP(·, s) is the Scaling Sparsemax operation and s is a scale factor computed as s = 1 + ReLU(Linear(||x||_2)), in which Linear denotes a linear transformation, ||x||_2 is the two-norm of the input vector, ReLU(·) is the activation function, and l ∈ {a, b};
The features fused with V are:
r_l = (z_l V)^T + FFN(LayerNorm((z_l V)^T))
The fused feature r_l is passed through an output layer to obtain the final attention-based probability p_att(w), where w is the predicted text sequence, i.e. the output of the multi-modal Transducer fusion network;
Step 4: training the neural network;
Training is divided into two steps: the two branch networks for air-conducted and bone-conducted speech are first trained separately with the CTC loss function using the training and validation data; the multi-modal Transducer fusion network is then added and the whole network is trained with the CTC loss function;
Step 5: testing a model;
The test set data are fed into the trained network obtained in step 4 to obtain the corresponding recognition results.
Preferably, in step 2-1 the speaking rate of the air-conducted and bone-conducted speech signals is changed to 0.9 and 1.1 times the original rate.
Preferably, the acoustic features extracted in step 2-2 are 80-dimensional Mel filter-bank (Mel-bank) features.
Preferably, the Conformer encoder consists of 12 blocks and the Transformer decoder consists of 6 blocks.
The beneficial effects of the invention are as follows:
The invention achieves end-to-end joint speech recognition by using noisy air-conducted speech and bone-conducted speech simultaneously. Compared with the conventional approach of recognizing speech from the air-conducted signal alone, the joint recognition method significantly reduces the speech recognition error rate, especially at low signal-to-noise ratios. Compared with simply concatenating the features of air-conducted and bone-conducted speech, the multi-modal Transducer used in the invention adaptively assigns channel weights to fuse the two signals according to the characteristics of the air-conduction and bone-conduction inputs, thereby improving the overall recognition performance of the system.
Drawings
FIG. 1 is a system framework diagram of the method of the invention.
FIG. 2 is a diagram of the multi-modal Transducer fusion network.
Detailed Description
The invention will be further described with reference to the drawings and examples.
The invention aims to provide an end-to-end, deep-learning-based multi-sensor joint speech recognition method, specifically a joint bone-air conduction speech recognition method, which takes time-synchronized bone-conducted and air-conducted speech signals directly as system input and directly outputs the corresponding speech recognition result.
An end-to-end bone-air conduction speech joint recognition method comprises the following steps:
Step 1: acquiring synchronized air-conducted and bone-conducted speech data (x_a, x_b) to construct a dataset, wherein x_a is clean air-conducted speech recorded in an anechoic chamber or a relatively quiet environment and x_b is the synchronously recorded bone-conducted speech, and the output is the corresponding text y;
Noise is added to the air-conducted speech at a given signal-to-noise ratio to obtain the noisy air-conducted speech x̃_a = x_a + n_a, wherein n_a is environmental noise; the final dataset is (x̃_a, x_b, y), which is further divided into a training set, a validation set, and a test set;
Step 2: data augmentation and feature extraction;
Step 2-1: performing preliminary data augmentation by changing the speaking rate of the air-conducted and bone-conducted speech signals;
Step 2-2: extracting acoustic features from the speed-perturbed air-conducted and bone-conducted speech signals respectively;
Step 2-3: applying the SpecAugment method to the acoustic features extracted in step 2-2 for a second round of data augmentation;
Step 3: building an end-to-end Conformer-based deep neural network model; the model consists of three parts: two branch networks that process the air-conducted and bone-conducted speech, and a fusion network based on a multi-modal Transducer;
Step 3-1: the two branch networks for air-conducted and bone-conducted speech share the same Conformer architecture, comprising a Conformer encoder and a Transformer decoder;
The Conformer encoder is composed of several blocks, each containing two position-wise feed-forward (FFN) modules, a multi-head self-attention (MHA) module, and a convolution module; the Transformer decoder is composed of several blocks, each containing a multi-head self-attention module, a masked multi-head self-attention (masked MHA) module, and an FFN module;
The augmented acoustic features of the air-conducted and bone-conducted speech from step 2-3 are passed through the Conformer encoder and the Transformer decoder in turn and converted into the air-conduction feature vector c_l and the bone-conduction feature vector g_l, respectively;
Step 3-2: the inputs of the multi-modal Transducer fusion network are the air-conduction feature vector c_l and the bone-conduction feature vector g_l obtained from the air-conducted and bone-conducted speech through the branch networks;
First, a linear feature transformation of c_l yields the key and value matrices, denoted K and V respectively; a linear feature transformation of g_l yields the query matrix, denoted Q;
Q = g_l W_Q, K = c_l W_K, V = c_l W_V, where W_Q, W_K, W_V are learnable linear transformation matrices;
Q and K are fed into the Scaling Sparsemax module to obtain the fusion weights [z_a, z_b] of the air-conduction and bone-conduction features,
where SSP(·, s) is the Scaling Sparsemax operation and s is a scale factor computed as s = 1 + ReLU(Linear(||x||_2)), in which Linear denotes a linear transformation, ||x||_2 is the two-norm of the input vector, ReLU(·) is the activation function, and l ∈ {a, b};
The features fused with V are:
r_l = (z_l V)^T + FFN(LayerNorm((z_l V)^T))
The fused feature r_l is passed through an output layer to obtain the final attention-based probability p_att(w), where w is the predicted text sequence, i.e. the output of the multi-modal Transducer fusion network;
Step 4: training the neural network;
Training is divided into two steps: the two branch networks for air-conducted and bone-conducted speech are first trained separately with the CTC loss function using the training and validation data; the multi-modal Transducer fusion network is then added and the whole network is trained with the CTC loss function;
Step 5: testing a model;
The test set data are fed into the trained network obtained in step 4 to obtain the corresponding recognition results.
Specific examples:
S1: the data set is constructed by acquiring synchronous bone conduction and air conduction voice data (x a,xb), wherein x a is pure air conduction voice recorded in a sound attenuation laboratory or in a quieter environment, and x b is bone conduction voice recorded synchronously. All speech is downsampled to 16khz,16bit quantization. The input data of the model are air guide and bone conduction voice with noise, and a text y corresponding to the voice is output. Because the bone conduction voice does not introduce environmental noise, only the air conduction voice is added with noise according to a certain range of signal to noise ratio, namely Wherein/>N a is ambient noise, which is noisy air-conduction speech. The final dataset is/>Then further set 84% of the dataset as training set, 8% as validation set, and the remaining 8% as test set.
S2: data enhancement and feature extraction
S21: the speech speed of the voice signal is changed to perform preliminary data enhancement, namely, the speech speed of the original voice is changed to be 0.9 times and 1.1 times of the original speech speed.
S22, extracting 80-dimensional Mel-bank characteristics from the air conduction voice and the bone conduction voice respectively.
S23, carrying out data enhancement again on the Mel-bank characteristics by using a SpecAugment method.
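S21 to S23 can be sketched with PyTorch/torchaudio as below. The 16 kHz front-end settings and the SpecAugment mask widths are illustrative assumptions; the patent itself only specifies speed factors of 0.9/1.1, 80-dimensional Mel-bank features, and the use of SpecAugment.

```python
import torch
import torchaudio

def extract_features(waveform: torch.Tensor, sample_rate: int = 16000,
                     speed: float = 1.0, train: bool = True) -> torch.Tensor:
    # S21: speed perturbation (0.9x / 1.0x / 1.1x) via sox effects.
    if speed != 1.0:
        waveform, sample_rate = torchaudio.sox_effects.apply_effects_tensor(
            waveform, sample_rate,
            [["speed", f"{speed}"], ["rate", f"{sample_rate}"]])

    # S22: 80-dimensional Mel filter-bank (Mel-bank) features.
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=80)(waveform)
    fbank = torch.log(mel + 1e-6).squeeze(0).transpose(0, 1)  # (frames, 80)

    # S23: SpecAugment-style frequency/time masking (mask widths are illustrative).
    if train:
        spec = fbank.transpose(0, 1).unsqueeze(0)              # (1, 80, frames)
        spec = torchaudio.transforms.FrequencyMasking(freq_mask_param=27)(spec)
        spec = torchaudio.transforms.TimeMasking(time_mask_param=40)(spec)
        fbank = spec.squeeze(0).transpose(0, 1)
    return fbank
```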
S3: and constructing an end-to-end deep neural network model based on Conformer. As shown in fig. 1, the model consists of three modules, namely two branch networks that handle air-guided and bone-guided speech, respectively, and a multi-modal Transducer-based fusion network.
S31: the branched networks of air conduction and bone conduction speech are similar, and are Conformer network architectures, including Conformer encoder and guided decoder. The encoder is made up of a plurality of blocks, each block containing two position-wise feed-forward (FFN) modules, a multi-headed self-attention Module (MHA) and a convolution module. Specifically, the encoder is composed of 12 blocks, wherein the convolution kernel size of the convolution module is 15, the number of heads of the multi-head self-attention is 8, the number of intermediate layer nodes of the FFN is 2048, and the output dimension D m is 256. The Truncated decoder is also made up of a plurality of blocks, each block containing a multi-headed self-attention module, a multi-headed self-attention module (MASKED MHA) for a mask, and an FFN module. Specifically, the decoder is made up of 6 blocks, and other parameter configurations are consistent with the encoder. Through the two branched networks, the acoustic features of air-guided and bone-conducted speech are converted into two feature vectors, namely c l and g l in fig. 1.
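A compact sketch of one branch network with the hyper-parameters stated in S31 (12 Conformer blocks, convolution kernel 15, 8 attention heads, FFN size 2048, D_m = 256, 6 decoder blocks) is given below. It uses torchaudio's Conformer module and PyTorch's generic Transformer decoder as stand-ins for the modules described above, omits the usual convolutional subsampling front-end, and the vocabulary size is an illustrative assumption, so it is a structural sketch rather than the exact model of the patent.

```python
import torch
import torch.nn as nn
import torchaudio

class BranchNetwork(nn.Module):
    """One branch (air- or bone-conduction): Conformer encoder + Transformer decoder."""
    def __init__(self, feat_dim: int = 80, d_model: int = 256, vocab_size: int = 5000):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)         # simple front-end projection
        self.encoder = torchaudio.models.Conformer(
            input_dim=d_model, num_heads=8, ffn_dim=2048,
            num_layers=12, depthwise_conv_kernel_size=15)
        self.embed = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=8, dim_feedforward=2048, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.ctc_head = nn.Linear(d_model, vocab_size)         # used for CTC training in S4

    def forward(self, feats, feat_lengths, tokens):
        # feats: (B, T, 80), feat_lengths: (B,), tokens: (B, S)
        enc, enc_lengths = self.encoder(self.input_proj(feats), feat_lengths)
        tgt = self.embed(tokens)
        causal_mask = torch.triu(
            torch.full((tokens.size(1), tokens.size(1)), float("-inf")),
            diagonal=1).to(tokens.device)
        dec = self.decoder(tgt, enc, tgt_mask=causal_mask)     # feature vector c_l or g_l
        return enc, enc_lengths, dec
```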
S32: the structure of the multi-modal Transducer is shown in FIG. 2; its main structure is similar to that of a Transformer, and its inputs are the feature vectors c_l and g_l of the air-conducted and bone-conducted speech produced by the branch networks. First, linear feature transformations of c_l and g_l yield the query, key and value matrices, corresponding to Q, K and V in FIG. 2; specifically, Q = g_l W_Q, K = c_l W_K, V = c_l W_V, where W_Q, W_K, W_V are learnable linear transformation matrices. Q and K are fed into the Scaling Sparsemax module to obtain the fusion weights [z_a, z_b] of the air-conduction and bone-conduction features, where SSP(x, s) is the Scaling Sparsemax operation and s is a scale factor computed as s = 1 + ReLU(Linear(||x||_2)), in which Linear is a linear transformation, ||x||_2 is the two-norm of the input vector, and ReLU is the activation function. The features fused with V are:
r_l = (z_l V)^T + FFN(LayerNorm((z_l V)^T)).
The fused feature is passed through an output layer to obtain the final attention-based probability p_att(w), where w is the predicted text sequence, i.e. the output of the whole multi-modal Transducer.
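The fusion step of S32 can be sketched as follows. The query is taken from the bone-conduction branch and the key/value from the air-conduction branch, as described above; an ordinary softmax stands in for the Scaling Sparsemax operation, the exact form of the weighting is an assumption, and the transposes in the patent's formula are absorbed into the batch-first tensor layout, so this is a structural illustration only.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Structural sketch of the S32 fusion; softmax is a stand-in for Scaling Sparsemax."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)   # query from bone-conduction g_l
        self.w_k = nn.Linear(d_model, d_model, bias=False)   # key from air-conduction c_l
        self.w_v = nn.Linear(d_model, d_model, bias=False)   # value from air-conduction c_l
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
        self.scale_proj = nn.Linear(1, 1)                    # for s = 1 + ReLU(Linear(||x||_2))

    def forward(self, c_l: torch.Tensor, g_l: torch.Tensor) -> torch.Tensor:
        q, k, v = self.w_q(g_l), self.w_k(c_l), self.w_v(c_l)
        # Scale factor s = 1 + ReLU(Linear(||x||_2)); taking x to be the query is an assumption.
        s = 1.0 + torch.relu(self.scale_proj(q.norm(dim=-1, keepdim=True)))
        scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
        z = torch.softmax(scores * s, dim=-1)                # stand-in for SSP(., s)
        fused = torch.matmul(z, v)                           # z_l V
        return fused + self.ffn(self.norm(fused))            # r_l = z_l V + FFN(LayerNorm(z_l V))
```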
S4: optimize the neural network. Training of the whole network is divided into two steps: the branch networks for air-conducted and bone-conducted speech are each optimized first, and the whole network is then optimized together with the multi-modal Transducer. The loss function for both the branch networks and the overall network is the CTC loss. The network is optimized with the Adam optimizer, and training is run for 50 epochs.
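A hedged outline of the two-stage optimization in S4 is given below, assuming the branch-network sketch above and the standard PyTorch CTC loss; the learning rate, blank index and batching details are illustrative assumptions, and schedulers, label smoothing and decoding are omitted.

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)  # blank index 0 is an assumption

def ctc_step(branch, feats, feat_lens, tokens, token_lens, optimizer):
    """One CTC training step for a single branch network (stage 1 of S4)."""
    enc, enc_lens, _ = branch(feats, feat_lens, tokens)
    log_probs = branch.ctc_head(enc).log_softmax(dim=-1).transpose(0, 1)  # (T, B, vocab)
    loss = ctc_loss(log_probs, tokens, enc_lens, token_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 1: train the air- and bone-conduction branches separately with CTC.
# air_opt  = torch.optim.Adam(air_branch.parameters(),  lr=1e-3)   # lr is illustrative
# bone_opt = torch.optim.Adam(bone_branch.parameters(), lr=1e-3)
# Stage 2: add the fusion network and train the whole model with CTC for about 50 epochs.
# full_opt = torch.optim.Adam(full_model.parameters(), lr=1e-3)
```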
S5: model testing. The test data are fed into the trained network obtained in S4 to obtain the corresponding recognition results.

Claims (4)

1. An end-to-end bone-air conduction speech joint recognition method, characterized by comprising the following steps:
Step 1: acquiring synchronized air-conducted and bone-conducted speech data (x_a, x_b) to construct a dataset, wherein x_a is clean air-conducted speech and x_b is the synchronously recorded bone-conducted speech, and the output is the corresponding text y;
adding noise to the air-conducted speech to obtain the noisy air-conducted speech x̃_a = x_a + n_a, wherein n_a is environmental noise; the final dataset is (x̃_a, x_b, y), which is further divided into a training set, a validation set, and a test set;
Step 2: data augmentation and feature extraction;
Step 2-1: performing preliminary data augmentation by changing the speaking rate of the air-conducted and bone-conducted speech signals;
Step 2-2: extracting acoustic features from the speed-perturbed air-conducted and bone-conducted speech signals respectively;
Step 2-3: applying the SpecAugment method to the acoustic features extracted in step 2-2 for a second round of data augmentation;
Step 3: building an end-to-end Conformer-based deep neural network model, the model consisting of three parts: two branch networks that process the air-conducted and bone-conducted speech, and a fusion network based on a multi-modal Transducer;
Step 3-1: the two branch networks for air-conducted and bone-conducted speech share the same Conformer architecture, comprising a Conformer encoder and a Transformer decoder;
the Conformer encoder is composed of several blocks, each containing two FFN modules, a multi-head self-attention module, and a convolution module; the Transformer decoder is composed of several blocks, each containing a multi-head self-attention module, a masked multi-head self-attention module, and an FFN module;
the augmented acoustic features of the air-conducted and bone-conducted speech from step 2-3 are passed through the Conformer encoder and the Transformer decoder in turn and converted into the air-conduction feature vector c_l and the bone-conduction feature vector g_l, respectively;
Step 3-2: the inputs of the multi-modal Transducer fusion network are the air-conduction feature vector c_l and the bone-conduction feature vector g_l obtained from the air-conducted and bone-conducted speech through the branch networks;
first, a linear feature transformation of c_l yields the key and value matrices, denoted K and V respectively; a linear feature transformation of g_l yields the query matrix, denoted Q;
Q = g_l W_Q, K = c_l W_K, V = c_l W_V, wherein W_Q, W_K, W_V are learnable linear transformation matrices;
Q and K are fed into the Scaling Sparsemax module to obtain the fusion weights [z_a, z_b] of the air-conduction and bone-conduction features,
wherein SSP(·, s) is the Scaling Sparsemax operation and s is a scale factor computed as s = 1 + ReLU(Linear(||x||_2)), in which Linear denotes a linear transformation, ||x||_2 is the two-norm of the input vector, ReLU(·) is the activation function, and l ∈ {a, b};
the features fused with V are:
r_l = (z_l V)^T + FFN(LayerNorm((z_l V)^T))
the fused feature r_l is passed through an output layer to obtain the final attention-based probability p_att(w), wherein w is the predicted text sequence, i.e. the output of the multi-modal Transducer fusion network;
Step 4: training the neural network;
training is divided into two steps: the two branch networks for air-conducted and bone-conducted speech are first trained separately with the CTC loss function using the training and validation data, and the multi-modal Transducer fusion network is then added and the whole network is trained with the CTC loss function;
Step 5: testing a model;
the test set data are fed into the trained network obtained in step 4 to obtain the corresponding recognition results.
2. The end-to-end bone-air conduction speech joint recognition method according to claim 1, wherein in step 2-1 the speaking rate of the air-conducted and bone-conducted speech signals is changed to 0.9 and 1.1 times the original rate.
3. The end-to-end bone-air conduction speech joint recognition method according to claim 1, wherein the acoustic features extracted in step 2-2 are 80-dimensional Mel filter-bank (Mel-bank) features.
4. The end-to-end bone-air conduction speech joint recognition method according to claim 1, wherein the Conformer encoder consists of 12 blocks and the Transformer decoder consists of 6 blocks.
CN202210153909.5A 2022-02-20 2022-02-20 End-to-end bone-air conduction speech joint recognition method Active CN114495909B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210153909.5A CN114495909B (en) 2022-02-20 2022-02-20 End-to-end bone-air conduction speech joint recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210153909.5A CN114495909B (en) 2022-02-20 2022-02-20 End-to-end bone-air conduction speech joint recognition method

Publications (2)

Publication Number Publication Date
CN114495909A CN114495909A (en) 2022-05-13
CN114495909B true CN114495909B (en) 2024-04-30

Family

ID=81483047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210153909.5A Active CN114495909B (en) 2022-02-20 2022-02-20 End-to-end bone-air conduction speech joint recognition method

Country Status (1)

Country Link
CN (1) CN114495909B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030823B (en) * 2023-03-30 2023-06-16 北京探境科技有限公司 Voice signal processing method and device, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007251354A (en) * 2006-03-14 2007-09-27 Saitama Univ Microphone and sound generation method
CN108986834A (en) * 2018-08-22 2018-12-11 中国人民解放军陆军工程大学 The blind Enhancement Method of bone conduction voice based on codec framework and recurrent neural network
CN112786064A (en) * 2020-12-30 2021-05-11 西北工业大学 End-to-end bone-qi-conduction speech joint enhancement method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007251354A (en) * 2006-03-14 2007-09-27 Saitama Univ Microphone and sound generation method
CN108986834A (en) * 2018-08-22 2018-12-11 中国人民解放军陆军工程大学 The blind Enhancement Method of bone conduction voice based on codec framework and recurrent neural network
CN112786064A (en) * 2020-12-30 2021-05-11 西北工业大学 End-to-end bone-qi-conduction speech joint enhancement method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
End-to-end speech recognition based on gated convolutional networks and CTC; 杨德举; 马良荔; 谭琳珊; 裴晶晶; Computer Engineering and Design; 2020-09-16 (No. 09); full text *
Research status and prospects of blind enhancement technology for bone-conducted microphone speech; 张雄伟; 郑昌艳; 曹铁勇; 杨吉斌; 邢益搏; Journal of Data Acquisition and Processing; 2018-09-15 (No. 05); full text *

Also Published As

Publication number Publication date
CN114495909A (en) 2022-05-13

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN109949821B (en) Method for removing reverberation of far-field voice by using U-NET structure of CNN
CN110085245B (en) Voice definition enhancing method based on acoustic feature conversion
CN109427328B (en) Multichannel voice recognition method based on filter network acoustic model
CN109887489B (en) Speech dereverberation method based on depth features for generating countermeasure network
CN103258533B (en) Novel model domain compensation method in remote voice recognition
CN105741849A (en) Voice enhancement method for fusing phase estimation and human ear hearing characteristics in digital hearing aid
CN1494712A (en) Distributed voice recognition system using acoustic feature vector modification
CN103229238A (en) System and method for producing an audio signal
CN111833896A (en) Voice enhancement method, system, device and storage medium for fusing feedback signals
CN105448302B (en) A kind of the speech reverberation removing method and system of environment self-adaption
CN110047478B (en) Multi-channel speech recognition acoustic modeling method and device based on spatial feature compensation
WO2022012206A1 (en) Audio signal processing method, device, equipment, and storage medium
CN112786064B (en) End-to-end bone qi conduction voice joint enhancement method
Yuliani et al. Speech enhancement using deep learning methods: A review
CN103208291A (en) Speech enhancement method and device applicable to strong noise environments
KR20080064557A (en) Apparatus and method for improving speech intelligibility
CN114495909B (en) End-to-end bone-air conduction speech joint recognition method
CN109243429A (en) A kind of pronunciation modeling method and device
CN111142066A (en) Direction-of-arrival estimation method, server, and computer-readable storage medium
CN110867178B (en) Multi-channel far-field speech recognition method
CN113823273A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN112185405B (en) Bone conduction voice enhancement method based on differential operation and combined dictionary learning
CN116030823B (en) Voice signal processing method and device, computer equipment and storage medium
CN203165457U (en) Voice acquisition device used for noisy environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant