CN114495909A - End-to-end bone-air conduction speech joint recognition method - Google Patents

End-to-end bone-air conduction speech joint recognition method

Info

Publication number
CN114495909A
CN114495909A (application CN202210153909.5A)
Authority
CN
China
Prior art keywords
voice
speech
bone
conduction
air conduction
Prior art date
Legal status: Granted
Application number
CN202210153909.5A
Other languages
Chinese (zh)
Other versions
CN114495909B
Inventor
王谋
陈俊淇
张晓雷
王逸平
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202210153909.5A
Publication of CN114495909A
Application granted
Publication of CN114495909B
Active legal status
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/26 - Speech to text systems

Abstract

The invention discloses an end-to-end bone-air conduction speech joint recognition method. The method first acquires synchronized air-conduction and bone-conduction speech data to construct a data set, with the corresponding text as the output; data augmentation and acoustic feature extraction are then applied to the air-conduction and bone-conduction speech signals; next, a Conformer-based end-to-end deep neural network model is constructed, consisting of three parts: two branch networks that process air-conduction and bone-conduction speech, and a fusion network based on a multi-modal Transducer; the neural network is then trained, and the corresponding recognition results are finally obtained through the trained network. Compared with the traditional approach of performing speech recognition using only the air-conduction speech signal, this joint recognition method can significantly reduce the speech recognition error rate and improve the overall recognition performance of the system.

Description

End-to-end bone-air conduction speech joint recognition method
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a bone-air conduction speech joint recognition method.
Background
In recent decades, thanks to the rise and progress of deep learning, robust automatic speech recognition has developed remarkably and has been applied in many fields such as smartphones, smart home appliances, and automobiles. Deep-learning-based robust speech recognition algorithms can be broadly divided into two categories: one removes noise at the front end of the system, including speech enhancement and the extraction of noise-robust features; the other designs, at the back end of the system, a robust recognition model that can adapt to different noise scenarios. However, to date these deep-learning-based speech recognition methods have all relied on air-conduction speech. Because of the way speech propagates through air, it is easily corrupted by environmental noise, so recognition performance degrades severely at low signal-to-noise ratios, especially in the presence of non-stationary noise such as wind noise. In such cases, other modalities can be introduced for joint recognition to improve system performance.
Bone-conduction speech is a speech signal obtained by picking up the vibrations of the skull and skin with a bone-conduction microphone. Compared with traditional air-conduction speech, bone-conduction speech is not easily affected by noise in the surrounding environment, so it resists environmental noise at the source and preserves speech information well at low signal-to-noise ratios. However, bone-conduction speech has several drawbacks of its own. First, its high-frequency content is severely attenuated, because human tissue attenuates the high-frequency components of the vibration signal. Although the frequency characteristics of bone-conduction microphones vary across manufacturers, the collected speech is generally severely attenuated, or even completely lost, above roughly 600 Hz. The absence of high-frequency content poses a serious challenge for speech recognition systems. Second, during acquisition, friction between the skin and the bone-conduction microphone, body movement, and similar factors introduce a certain amount of self-noise into the bone-conduction speech, which further increases the difficulty of recognizing it. Finally, bone-conduction speech often loses unvoiced sounds, fricatives, and similar components of speech, which also degrades the performance of a speech recognition system.
Because of these characteristics, speech recognition using bone-conduction speech alone still faces numerous challenges. However, bone-conduction and air-conduction speech are to some extent complementary. This patent therefore uses air-conduction and bone-conduction speech simultaneously and performs joint speech recognition with a deep learning model. Since no large-scale bone-air conduction speech database suitable for deep-learning speech recognition had previously been published, there has so far been no work on deep-learning-based end-to-end bone-air conduction speech recognition.
Disclosure of Invention
In order to overcome the shortcomings of the prior art, the invention provides an end-to-end bone-air conduction speech joint recognition method. The method first acquires synchronized air-conduction and bone-conduction speech data to construct a data set, with the corresponding text as the output; data augmentation and acoustic feature extraction are then applied to the air-conduction and bone-conduction speech signals; next, a Conformer-based end-to-end deep neural network model is constructed, consisting of three parts: two branch networks that process air-conduction and bone-conduction speech, and a fusion network based on a multi-modal Transducer; the neural network is then trained, and the corresponding recognition results are finally obtained through the trained network. Compared with the traditional approach of performing speech recognition using only the air-conduction speech signal, this joint recognition method can significantly reduce the speech recognition error rate and improve the overall recognition performance of the system.
The technical solution adopted by the invention to solve this problem comprises the following steps:
Step 1: acquire synchronized air-conduction and bone-conduction speech data (x_a, x_b) to construct a data set, where x_a is clean air-conduction speech and x_b is the synchronously recorded bone-conduction speech; the output is the corresponding text y.
Noise is added to the air-conduction speech, i.e.
x̃_a = x_a + n_a
where x̃_a is the noisy air-conduction speech and n_a is the ambient noise. The final data set is
{(x̃_a, x_b, y)}.
The data set is further divided into a training set, a validation set and a test set.
Step 2: data augmentation and feature extraction.
Step 2-1: perform preliminary data augmentation by changing the speaking rate of the air-conduction and bone-conduction speech signals;
Step 2-2: extract acoustic features from the speed-perturbed air-conduction and bone-conduction speech signals respectively;
Step 2-3: apply the SpecAugment method to the acoustic features extracted in step 2-2 for a further round of data augmentation.
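As a purely illustrative sketch of this step-2 pipeline (not part of the claimed method), the Python fragment below applies a crude resampling-based speed perturbation, extracts the 80-dimensional Mel-bank features preferred later in the description with torchaudio's Kaldi-compatible fbank routine (an assumed tooling choice), and applies a hand-rolled SpecAugment-style masking with arbitrary mask sizes:

```python
import numpy as np
import torch
import torchaudio

def speed_perturb(x: np.ndarray, factor: float) -> np.ndarray:
    """Crude speed perturbation by linear-interpolation resampling (illustrative only)."""
    idx = np.arange(0, len(x), factor)
    return np.interp(idx, np.arange(len(x)), x).astype(np.float32)

def melbank_80(x: np.ndarray, sr: int = 16000) -> torch.Tensor:
    """80-dimensional Mel-bank (fbank) features, shape (frames, 80)."""
    wav = torch.from_numpy(x).unsqueeze(0)                      # (1, samples)
    return torchaudio.compliance.kaldi.fbank(wav, num_mel_bins=80, sample_frequency=sr)

def spec_augment(feat: torch.Tensor, n_freq_masks: int = 2, max_f: int = 10,
                 n_time_masks: int = 2, max_t: int = 40) -> torch.Tensor:
    """Hand-rolled SpecAugment: zero out random frequency bands and time spans."""
    feat = feat.clone()
    frames, bins = feat.shape
    for _ in range(n_freq_masks):
        f0 = np.random.randint(0, bins - max_f)
        feat[:, f0:f0 + np.random.randint(0, max_f + 1)] = 0.0
    for _ in range(n_time_masks):
        t0 = np.random.randint(0, max(1, frames - max_t))
        feat[t0:t0 + np.random.randint(0, max_t + 1), :] = 0.0
    return feat

# One utterance, using one of the speed factors named by the method (0.9, 1.0 or 1.1):
x = np.random.randn(16000).astype(np.float32)                    # placeholder 16 kHz waveform
features = spec_augment(melbank_80(speed_perturb(x, 0.9)))       # (frames, 80)
```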
Step 3: construct a Conformer-based end-to-end deep neural network model; the model consists of three parts: two branch networks that process air-conduction and bone-conduction speech, and a fusion network based on a multi-modal Transducer.
Step 3-1: the air-conduction and bone-conduction branch networks both use the Conformer architecture, each comprising a Conformer encoder and a Truncated decoder.
The Conformer encoder is composed of several blocks, each containing two FFN modules, a multi-head self-attention module and a convolution module; the Truncated decoder is composed of several blocks, each containing a multi-head self-attention module, a masked multi-head self-attention module and an FFN module.
The acoustic features of the air-conduction and bone-conduction speech augmented in step 2-3 are passed in turn through a Conformer encoder and a Truncated decoder and converted into the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l respectively.
Step 3-2: the inputs of the multi-modal Transducer fusion network are the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l produced by the two branch networks.
First, a linear feature transformation is applied to c_l to obtain the key and value matrices, denoted K and V, and to g_l to obtain the query matrix, denoted Q:
Q = g_l·W_Q, K = c_l·W_K, V = c_l·W_V
where W_Q, W_K and W_V are learnable linear transformation matrices.
Q and K are fed into the Scaling Sparsemax module to obtain the weighting coefficients [z_a, z_b] of the air-conduction and bone-conduction features:
[z_a, z_b] = SSP(Q·K^T, s)
where SSP(·) is the Scaling Sparsemax operation and s is a scale factor computed as s = 1 + ReLU(Linear(||x||, 2)), where Linear denotes a linear transformation, ||x|| is the two-norm of the input vector, ReLU(·) is the activation function, and l ∈ {a, b}.
The features after fusion with V are:
r_l = (z_l·V)^T + FFN(LayerNorm((z_l·V)^T))
The fused feature r_l then passes through an output layer to obtain the final attention-based probability p_att(w), where w is the predicted text sequence, i.e., the output of the multi-modal Transducer fusion network.
Step 4: train the neural network.
Network training is divided into two steps: using the training-set and validation-set data, the air-conduction and bone-conduction branch networks are first trained separately with the CTC loss function, and the multi-modal Transducer fusion network is then added and the whole network is trained with the CTC loss function.
Step 5: test the model.
The test-set data are fed into the trained network obtained in step 4 to obtain the corresponding recognition results.
Preferably, in step 2-1 the speaking rate of the air-conduction and bone-conduction speech signals is changed to 0.9 times and 1.1 times the original rate.
Preferably, the acoustic features extracted in step 2-2 are 80-dimensional Mel-bank features.
Preferably, the Conformer encoder consists of 12 blocks and the Truncated decoder consists of 6 blocks.
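For orientation only, the preferred 12-block encoder could be instantiated with the off-the-shelf Conformer encoder shipped in torchaudio (assuming a torchaudio version that provides torchaudio.models.Conformer), using the block sizes given later in the embodiment (8 attention heads, FFN width 2048, convolution kernel 15, model dimension 256); the 80-to-256 input projection and the omission of the subsampling front-end and the Truncated decoder are simplifications of this sketch rather than features of the patented model:

```python
import torch
import torch.nn as nn
import torchaudio

d_model = 256                                   # output dimension D_m from the embodiment
frontend = nn.Linear(80, d_model)               # project 80-dim Mel-bank features to D_m
encoder = torchaudio.models.Conformer(
    input_dim=d_model,
    num_heads=8,                                # self-attention heads
    ffn_dim=2048,                               # FFN intermediate size
    num_layers=12,                              # 12 encoder blocks (preferred configuration)
    depthwise_conv_kernel_size=15,              # convolution-module kernel size
)

feats = torch.randn(2, 120, 80)                 # (batch, frames, Mel-bank bins)
lengths = torch.tensor([120, 120])
enc_out, enc_lens = encoder(frontend(feats), lengths)
print(enc_out.shape)                            # torch.Size([2, 120, 256])
```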
The invention has the following beneficial effects:
The invention can use noisy air-conduction speech and bone-conduction speech simultaneously to achieve end-to-end joint speech recognition. Compared with the traditional approach of performing speech recognition using only the air-conduction speech signal, this joint recognition method can significantly reduce the speech recognition error rate, especially at low signal-to-noise ratios. Compared with simply concatenating the features of the air-conduction and bone-conduction speech, the multi-modal Transducer used in the invention can adaptively assign channel weights according to the characteristics of the air-conduction and bone-conduction signals when fusing the two, thereby improving the overall recognition performance of the system.
Drawings
FIG. 1 is a system block diagram of the method of the present invention.
FIG. 2 is a diagram of the multi-modal Transducer fusion network architecture.
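FIG. 2 only names the blocks of the fusion network, so the following is a minimal, non-authoritative PyTorch reading of it: bone-conduction queries attend over air-conduction keys and values with sparsemax attention weights divided by a learned scale s, followed by the residual FFN(LayerNorm(·)) combination. The standard sparsemax projection, the single-output shape of the scale layer, and the omission of the transposes and of the explicit two-channel weight vector [z_a, z_b] are all simplifying assumptions of this sketch.

```python
import torch
import torch.nn as nn

def sparsemax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Sparsemax (Martins & Astudillo, 2016): softmax-like projection that can give exact zeros."""
    z_sorted, _ = torch.sort(z, dim=dim, descending=True)
    k = torch.arange(1, z.size(dim) + 1, device=z.device, dtype=z.dtype)
    view = [1] * z.dim()
    view[dim] = -1
    k = k.view(view)
    cumsum = z_sorted.cumsum(dim)
    support = 1 + k * z_sorted > cumsum                      # prefix of entries kept in the support
    k_z = support.to(z.dtype).sum(dim=dim, keepdim=True)
    tau = ((z_sorted * support).sum(dim=dim, keepdim=True) - 1) / k_z
    return torch.clamp(z - tau, min=0.0)

class ScalingSparsemaxFusion(nn.Module):
    """One possible reading of the multi-modal Transducer fusion shown in FIG. 2."""
    def __init__(self, d_model: int = 256, d_ff: int = 1024):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)   # Q = g W_Q (bone-conduction side)
        self.w_k = nn.Linear(d_model, d_model, bias=False)   # K = c W_K (air-conduction side)
        self.w_v = nn.Linear(d_model, d_model, bias=False)   # V = c W_V (air-conduction side)
        self.scale = nn.Linear(1, 1)                          # learned part of s (assumed shape)
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, c: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        q, k, v = self.w_q(g), self.w_k(c), self.w_v(c)       # (batch, frames, d_model)
        logits = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
        # s = 1 + ReLU(Linear(||x||)): one scale per query row, from the norm of its logit row
        s = 1.0 + torch.relu(self.scale(logits.norm(dim=-1, keepdim=True)))
        z = sparsemax(logits / s, dim=-1)                      # sparse attention weights
        r = z @ v                                              # fused representation
        return r + self.ffn(self.norm(r))                      # residual + FFN(LayerNorm(.))

# Example: fuse 120-frame branch outputs with model dimension 256.
fusion = ScalingSparsemaxFusion()
r = fusion(torch.randn(2, 120, 256), torch.randn(2, 120, 256))  # -> (2, 120, 256)
```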
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The invention aims to provide an end-to-end multi-sensor speech joint recognition method based on deep learning, in particular a bone-air conduction joint speech recognition method, which can take time-synchronized bone-conduction and air-conduction speech signals directly as the input of the system and directly output the corresponding speech recognition results.
An end-to-end bone-air conduction speech joint recognition method comprises the following steps:
Step 1: acquire synchronized air-conduction and bone-conduction speech data (x_a, x_b) to construct a data set, where x_a is clean air-conduction speech recorded in an anechoic chamber or another relatively quiet environment and x_b is the synchronously recorded bone-conduction speech; the output is the corresponding text y.
Noise is added to the air-conduction speech at a given signal-to-noise ratio, i.e.
x̃_a = x_a + n_a
where x̃_a is the noisy air-conduction speech and n_a is the ambient noise. The final data set is
{(x̃_a, x_b, y)}.
The data set is further divided into a training set, a validation set and a test set.
Step 2: data augmentation and feature extraction.
Step 2-1: perform preliminary data augmentation by changing the speaking rate of the air-conduction and bone-conduction speech signals;
Step 2-2: extract acoustic features from the speed-perturbed air-conduction and bone-conduction speech signals respectively;
Step 2-3: apply the SpecAugment method to the acoustic features extracted in step 2-2 for a further round of data augmentation.
Step 3: construct a Conformer-based end-to-end deep neural network model; the model consists of three parts: two branch networks that process air-conduction and bone-conduction speech, and a fusion network based on a multi-modal Transducer.
Step 3-1: the air-conduction and bone-conduction branch networks both use the Conformer architecture, each comprising a Conformer encoder and a Truncated decoder.
The Conformer encoder is composed of several blocks, each containing two position-wise feed-forward (FFN) modules, a multi-head self-attention (MHA) module and a convolution module; the Truncated decoder is composed of several blocks, each containing a multi-head self-attention module, a masked multi-head self-attention (masked MHA) module and an FFN module.
The acoustic features of the air-conduction and bone-conduction speech augmented in step 2-3 are passed in turn through a Conformer encoder and a Truncated decoder and converted into the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l respectively.
Step 3-2: the inputs of the multi-modal Transducer fusion network are the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l produced by the two branch networks.
First, a linear feature transformation is applied to c_l to obtain the key and value matrices, denoted K and V, and to g_l to obtain the query matrix, denoted Q:
Q = g_l·W_Q, K = c_l·W_K, V = c_l·W_V
where W_Q, W_K and W_V are learnable linear transformation matrices.
Q and K are fed into the Scaling Sparsemax module to obtain the weighting coefficients [z_a, z_b] of the air-conduction and bone-conduction features:
[z_a, z_b] = SSP(Q·K^T, s)
where SSP(·) is the Scaling Sparsemax operation and s is a scale factor computed as s = 1 + ReLU(Linear(||x||, 2)), where Linear denotes a linear transformation, ||x|| is the two-norm of the input vector, ReLU(·) is the activation function, and l ∈ {a, b}.
The features after fusion with V are:
r_l = (z_l·V)^T + FFN(LayerNorm((z_l·V)^T))
The fused feature r_l then passes through an output layer to obtain the final attention-based probability p_att(w), where w is the predicted text sequence, i.e., the output of the multi-modal Transducer fusion network.
Step 4: train the neural network.
Network training is divided into two steps: using the training-set and validation-set data, the air-conduction and bone-conduction branch networks are first trained separately with the CTC loss function, and the multi-modal Transducer fusion network is then added and the whole network is trained with the CTC loss function.
Step 5: test the model.
The test-set data are fed into the trained network obtained in step 4 to obtain the corresponding recognition results.
The specific embodiment is as follows:
s1: obtaining synchronized bone conduction and air conduction speech data (x)a,xb) Constructing a data set, wherein xaFor pure air-conduction speech, x, recorded in anechoic laboratories or in quieter environmentsbFor synchronously recorded bone conduction speech. All speech is down-sampled to 16kHz with 16bit quantization. The input data of the model are noisy air conduction and bone conduction speech, and a text y corresponding to the speech is output. Because the bone conduction voice does not introduce environmental noise, the noise is added to the air conduction voice according to the signal-to-noise ratio in a certain range, namely the noise is added to the air conduction voice
Figure BDA0003511685510000061
Wherein
Figure BDA0003511685510000062
For noisy air conduction speech, naIs ambient noise. The final data set is
Figure BDA0003511685510000063
Then 84% of the data set is further set as training set, 8% as validation set, and the remaining 8% as test set.
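A small Python sketch of this data construction step is given below; the helper names, the uniform SNR range and the shuffling seed are assumptions for illustration, since the patent only requires mixing at signal-to-noise ratios within a certain range and an 84/8/8 split.

```python
import random
import numpy as np

def mix_at_snr(x_a: np.ndarray, n_a: np.ndarray, snr_db: float) -> np.ndarray:
    """Add ambient noise n_a to clean air-conduction speech x_a at snr_db dB."""
    n_a = n_a[: len(x_a)]                                   # trim noise to the utterance length
    p_speech = np.mean(x_a ** 2) + 1e-12
    p_noise = np.mean(n_a ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return x_a + scale * n_a                                # noisy air-conduction speech

def split_84_8_8(samples, seed=0):
    """Shuffle and split (noisy-air, bone, text) triples into 84% / 8% / 8%."""
    samples = samples[:]
    random.Random(seed).shuffle(samples)
    n_train = int(0.84 * len(samples))
    n_val = int(0.08 * len(samples))
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

# Toy example: 1000 synchronized utterance pairs with SNRs drawn from an assumed range.
rng = np.random.default_rng(0)
dataset = []
for i in range(1000):
    x_a = rng.standard_normal(16000).astype(np.float32)    # placeholder clean air-conduction speech
    x_b = rng.standard_normal(16000).astype(np.float32)    # placeholder bone-conduction speech
    n_a = rng.standard_normal(16000).astype(np.float32)    # placeholder ambient noise
    dataset.append((mix_at_snr(x_a, n_a, rng.uniform(-5, 15)), x_b, f"transcript {i}"))
train_set, val_set, test_set = split_84_8_8(dataset)
print(len(train_set), len(val_set), len(test_set))          # 840 80 80
```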
S2: data augmentation and feature extraction.
S21: the speaking rate of the speech signals is changed for preliminary data augmentation, i.e., the original speech is perturbed to 0.9 times and 1.1 times its original rate.
S22: 80-dimensional Mel-bank features are extracted from the air-conduction and bone-conduction speech respectively.
S23: data augmentation is applied again to the Mel-bank features using the SpecAugment method.
S3: construct the Conformer-based end-to-end deep neural network model. As shown in FIG. 1, the model consists of three modules: two branch networks that process air-conduction and bone-conduction speech respectively, and a fusion network based on the multi-modal Transducer.
S31: the similar branch networks for air-conduction and bone-conduction speech both use the Conformer architecture, comprising a Conformer encoder and a Truncated decoder. The encoder is composed of several blocks, each containing two position-wise feed-forward (FFN) modules, a multi-head self-attention (MHA) module and a convolution module. Specifically, the encoder consists of 12 blocks, the convolution kernel size of the convolution module is 15, the number of self-attention heads is 8, the number of intermediate FFN nodes is 2048, and the output dimension D_m is 256. The Truncated decoder is likewise composed of several blocks, each containing a multi-head self-attention module, a masked multi-head self-attention (masked MHA) module and an FFN module. Specifically, the decoder consists of 6 blocks, with the other parameters configured identically to the encoder. Through these two branch networks, the acoustic features of the air-conduction and bone-conduction speech are converted into two feature vectors, i.e., c_l and g_l in FIG. 1.
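To show how one block of such an encoder is composed, the following is a from-scratch PyTorch sketch of a single Conformer-style block using the sizes stated above (model dimension 256, 8 attention heads, FFN width 2048, convolution kernel 15); the macaron half-step FFN weighting, the activation choices and the normalization placement are common Conformer conventions assumed here, and the Truncated decoder is omitted.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward (FFN) module."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_ff),
                                 nn.SiLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class ConformerBlock(nn.Module):
    """One encoder block: FFN -> multi-head self-attention -> convolution -> FFN."""
    def __init__(self, d_model: int = 256, n_heads: int = 8, d_ff: int = 2048, kernel: int = 15):
        super().__init__()
        self.ffn1 = FeedForward(d_model, d_ff)
        self.norm_mha = nn.LayerNorm(d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(                               # depthwise convolution module
            nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2, groups=d_model),
            nn.BatchNorm1d(d_model), nn.SiLU(),
            nn.Conv1d(d_model, d_model, 1))
        self.ffn2 = FeedForward(d_model, d_ff)
        self.norm_out = nn.LayerNorm(d_model)

    def forward(self, x):                                        # x: (batch, frames, d_model)
        x = x + 0.5 * self.ffn1(x)                               # first half-step FFN
        h = self.norm_mha(x)
        x = x + self.mha(h, h, h, need_weights=False)[0]         # multi-head self-attention
        h = self.norm_conv(x).transpose(1, 2)                    # (batch, d_model, frames)
        x = x + self.conv(h).transpose(1, 2)                     # convolution module
        x = x + 0.5 * self.ffn2(x)                               # second half-step FFN
        return self.norm_out(x)

# A toy 12-block stack applied to already-projected features:
encoder = nn.Sequential(*[ConformerBlock() for _ in range(12)])
out = encoder(torch.randn(2, 120, 256))                          # (batch, frames, d_model)
```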
S32: the structure of the multi-modal Transducer is shown in FIG. 2. Its main structure resembles the Transducer, and its inputs are the feature vectors c_l and g_l of the air-conduction and bone-conduction speech produced by the branch networks. First, linear feature transformations are applied to c_l and g_l to obtain the query, key and value matrices, corresponding to Q, K and V in FIG. 2; specifically,
Q = g_l·W_Q, K = c_l·W_K, V = c_l·W_V
where W_Q, W_K and W_V are learnable linear transformation matrices. Q and K are fed into the Scaling Sparsemax module to obtain the weighting coefficients [z_a, z_b] of the air-conduction and bone-conduction features:
[z_a, z_b] = SSP(Q·K^T, s)
where SSP(x, s) is the Scaling Sparsemax operation and s is a scale factor computed as s = 1 + ReLU(Linear(||x||, 2)), where Linear is a linear transformation, ||x|| is the two-norm of the input vector, and ReLU is the activation function. The fused features are
r_l = (z_l·V)^T + FFN(LayerNorm((z_l·V)^T)).
The fused features pass through an output layer to obtain the final attention-based probability p_att(w), where w is the predicted text sequence, i.e., the output of the entire multi-modal Transducer.
S4: optimize the neural network. Training of the whole network is divided into two steps: the air-conduction and bone-conduction branch networks are first optimized separately, and the whole network is then optimized jointly together with the multi-modal Transducer. The loss function for both the branch networks and the overall network is the CTC loss. The network is optimized with the Adam optimizer, and the number of training epochs is set to 50.
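The two-stage optimization described in S4 could look roughly like the sketch below, where toy linear models stand in for the Conformer branch networks, the output layer is assumed to produce frame-level log-probabilities for the CTC loss, and all shapes, the learning rate and the vocabulary size are illustrative.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 5000, 256
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_step(model, feats, feat_lens, targets, target_lens, optimizer):
    """One CTC training step; model maps (batch, frames, feat_dim) -> (batch, frames, vocab)."""
    log_probs = model(feats).log_softmax(-1).transpose(0, 1)   # CTCLoss expects (frames, batch, vocab)
    loss = ctc_loss(log_probs, targets, feat_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 1: pretrain the air-conduction and bone-conduction branches separately.
air_branch = nn.Sequential(nn.Linear(80, d_model), nn.ReLU(), nn.Linear(d_model, vocab_size))
bone_branch = nn.Sequential(nn.Linear(80, d_model), nn.ReLU(), nn.Linear(d_model, vocab_size))
opt_a = torch.optim.Adam(air_branch.parameters(), lr=1e-3)
opt_b = torch.optim.Adam(bone_branch.parameters(), lr=1e-3)

feats = torch.randn(4, 120, 80)                          # (batch, frames, 80-dim Mel-bank)
feat_lens = torch.full((4,), 120, dtype=torch.long)
targets = torch.randint(1, vocab_size, (4, 20))          # padded label id sequences (no blanks)
target_lens = torch.full((4,), 20, dtype=torch.long)

for epoch in range(50):                                   # 50 epochs, as in the embodiment
    ctc_step(air_branch, feats, feat_lens, targets, target_lens, opt_a)
    ctc_step(bone_branch, feats, feat_lens, targets, target_lens, opt_b)

# Stage 2: add the multi-modal Transducer fusion network on top of both branches and
# optimize the whole model jointly with the same CTC loss (same ctc_step pattern, omitted).
```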
S5: test the model. The test data are fed into the trained network obtained in S4 to obtain the corresponding recognition results.
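Finally, a hedged sketch of how a recognition result could be read out at test time, assuming for illustration that the trained network exposes frame-level log-probabilities that are decoded greedily under the CTC convention (the patent itself does not fix the decoding strategy):

```python
import torch

def greedy_ctc_decode(log_probs: torch.Tensor, blank: int = 0) -> list:
    """Greedy CTC decoding: pick the best token per frame, merge repeats, drop blanks.

    log_probs: (frames, vocab) log-probabilities for one utterance.
    Returns a list of token ids (mapped back to characters/words by the vocabulary).
    """
    best = log_probs.argmax(dim=-1).tolist()     # frame-wise best token ids
    decoded, prev = [], blank
    for t in best:
        if t != blank and t != prev:             # collapse repeats, remove blanks
            decoded.append(t)
        prev = t
    return decoded

# Example on random scores for a 50-frame utterance with a 10-symbol vocabulary:
scores = torch.randn(50, 10).log_softmax(-1)
print(greedy_ctc_decode(scores))
```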

Claims (4)

1. An end-to-end bone-air conduction speech joint recognition method, characterized by comprising the following steps:
Step 1: acquire synchronized air-conduction and bone-conduction speech data (x_a, x_b) to construct a data set, where x_a is clean air-conduction speech and x_b is the synchronously recorded bone-conduction speech; the output is the corresponding text y;
noise is added to the air-conduction speech, i.e.
x̃_a = x_a + n_a
where x̃_a is the noisy air-conduction speech and n_a is the ambient noise; the final data set is
{(x̃_a, x_b, y)};
the data set is further divided into a training set, a validation set and a test set;
Step 2: data augmentation and feature extraction;
Step 2-1: perform preliminary data augmentation by changing the speaking rate of the air-conduction and bone-conduction speech signals;
Step 2-2: extract acoustic features from the speed-perturbed air-conduction and bone-conduction speech signals respectively;
Step 2-3: apply the SpecAugment method to the acoustic features extracted in step 2-2 for a further round of data augmentation;
Step 3: construct a Conformer-based end-to-end deep neural network model; the model consists of three parts: two branch networks that process air-conduction and bone-conduction speech, and a fusion network based on a multi-modal Transducer;
Step 3-1: the air-conduction and bone-conduction branch networks both use the Conformer architecture, each comprising a Conformer encoder and a Truncated decoder;
the Conformer encoder is composed of several blocks, each containing two FFN modules, a multi-head self-attention module and a convolution module; the Truncated decoder is composed of several blocks, each containing a multi-head self-attention module, a masked multi-head self-attention module and an FFN module;
the acoustic features of the air-conduction and bone-conduction speech augmented in step 2-3 are passed in turn through a Conformer encoder and a Truncated decoder and converted into the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l respectively;
Step 3-2: the inputs of the multi-modal Transducer fusion network are the air-conduction speech feature vector c_l and the bone-conduction speech feature vector g_l produced by the two branch networks;
first, a linear feature transformation is applied to c_l to obtain the key and value matrices, denoted K and V, and to g_l to obtain the query matrix, denoted Q:
Q = g_l·W_Q, K = c_l·W_K, V = c_l·W_V
where W_Q, W_K and W_V are learnable linear transformation matrices;
Q and K are fed into the Scaling Sparsemax module to obtain the weighting coefficients [z_a, z_b] of the air-conduction and bone-conduction features:
[z_a, z_b] = SSP(Q·K^T, s)
where SSP(·) is the Scaling Sparsemax operation and s is a scale factor computed as s = 1 + ReLU(Linear(||x||, 2)), where Linear denotes a linear transformation, ||x|| is the two-norm of the input vector, ReLU(·) is the activation function, and l ∈ {a, b};
the features after fusion with V are:
r_l = (z_l·V)^T + FFN(LayerNorm((z_l·V)^T))
the fused feature r_l is then passed through an output layer to obtain the final attention-based probability p_att(w), where w is the predicted text sequence, i.e., the output of the multi-modal Transducer fusion network;
Step 4: train the neural network;
network training is divided into two steps: using the training-set and validation-set data, the air-conduction and bone-conduction branch networks are first trained separately with the CTC loss function, and the multi-modal Transducer fusion network is then added and the whole network is trained with the CTC loss function;
Step 5: test the model;
the test-set data are fed into the trained network obtained in step 4 to obtain the corresponding recognition results.
2. The end-to-end bone-air conduction speech joint recognition method of claim 1, characterized in that in step 2-1 the speaking rate of the air-conduction and bone-conduction speech signals is changed to 0.9 times and 1.1 times the original rate.
3. The end-to-end bone-air conduction speech joint recognition method of claim 1, characterized in that the acoustic features extracted in step 2-2 are 80-dimensional Mel-bank features.
4. The end-to-end bone-air conduction speech joint recognition method of claim 1, characterized in that the Conformer encoder consists of 12 blocks and the Truncated decoder consists of 6 blocks.
CN202210153909.5A 2022-02-20 End-to-end bone-air conduction speech joint recognition method Active CN114495909B

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210153909.5A CN114495909B 2022-02-20 End-to-end bone-air conduction speech joint recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210153909.5A CN114495909B 2022-02-20 End-to-end bone-air conduction speech joint recognition method

Publications (2)

Publication Number Publication Date
CN114495909A 2022-05-13
CN114495909B 2024-04-30


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030823A (en) * 2023-03-30 2023-04-28 北京探境科技有限公司 Voice signal processing method and device, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007251354A (en) * 2006-03-14 2007-09-27 Saitama Univ Microphone and sound generation method
CN108986834A (en) * 2018-08-22 2018-12-11 中国人民解放军陆军工程大学 The blind Enhancement Method of bone conduction voice based on codec framework and recurrent neural network
CN112786064A (en) * 2020-12-30 2021-05-11 西北工业大学 End-to-end bone-qi-conduction speech joint enhancement method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007251354A (en) * 2006-03-14 2007-09-27 Saitama Univ Microphone and sound generation method
CN108986834A (en) * 2018-08-22 2018-12-11 中国人民解放军陆军工程大学 The blind Enhancement Method of bone conduction voice based on codec framework and recurrent neural network
CN112786064A (en) * 2020-12-30 2021-05-11 西北工业大学 End-to-end bone-qi-conduction speech joint enhancement method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张雄伟; 郑昌艳; 曹铁勇; 杨吉斌; 邢益搏: "Research status and prospects of blind enhancement techniques for bone-conduction microphone speech" (骨导麦克风语音盲增强技术研究现状及展望), Journal of Data Acquisition and Processing (数据采集与处理), no. 05, 15 September 2018 (2018-09-15) *
杨德举; 马良荔; 谭琳珊; 裴晶晶: "End-to-end speech recognition based on gated convolutional networks and CTC" (基于门控卷积网络与CTC的端到端语音识别), Computer Engineering and Design (计算机工程与设计), no. 09, 16 September 2020 (2020-09-16) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030823A (en) * 2023-03-30 2023-04-28 北京探境科技有限公司 Voice signal processing method and device, computer equipment and storage medium
CN116030823B (en) * 2023-03-30 2023-06-16 北京探境科技有限公司 Voice signal processing method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109949821B (en) Method for removing reverberation of far-field voice by using U-NET structure of CNN
CN109887489B (en) Speech dereverberation method based on depth features for generating countermeasure network
CN112151059A (en) Microphone array-oriented channel attention weighted speech enhancement method
CN103258533B (en) Novel model domain compensation method in remote voice recognition
CN109427328B (en) Multichannel voice recognition method based on filter network acoustic model
WO2022012206A1 (en) Audio signal processing method, device, equipment, and storage medium
CN107993670A (en) Microphone array voice enhancement method based on statistical model
CN110047478B (en) Multi-channel speech recognition acoustic modeling method and device based on spatial feature compensation
CN105448302A (en) Environment adaptive type voice reverberation elimination method and system
CN111142066A (en) Direction-of-arrival estimation method, server, and computer-readable storage medium
WO2020170907A1 (en) Signal processing device, learning device, signal processing method, learning method, and program
CN110867178B (en) Multi-channel far-field speech recognition method
Huang et al. A wearable bone-conducted speech enhancement system for strong background noises
Qi et al. Exploring deep hybrid tensor-to-vector network architectures for regression based speech enhancement
CN114360571A (en) Reference-based speech enhancement method
CN112185405B (en) Bone conduction voice enhancement method based on differential operation and combined dictionary learning
CN116030823B (en) Voice signal processing method and device, computer equipment and storage medium
CN113823273A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN111681649B (en) Speech recognition method, interaction system and achievement management system comprising system
CN113409804A (en) Multichannel frequency domain speech enhancement algorithm based on variable-span generalized subspace
CN114495909B End-to-end bone-air conduction speech joint recognition method
CN107592600B (en) Pickup screening method and pickup device based on distributed microphones
CN114495909A End-to-end bone-air conduction speech joint recognition method
CN110544485A (en) method for performing far-field speech dereverberation by using SE-ED network of CNN
Ye et al. Efficient gated convolutional recurrent neural networks for real-time speech enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant