WO2023044962A1 - Feature extraction method and device based on time domain and frequency domain of speech signal, and echo cancellation method and device - Google Patents

Feature extraction method and device based on time domain and frequency domain of speech signal, and echo cancellation method and device

Info

Publication number: WO2023044962A1
Authority: WO — WIPO (PCT)
Prior art keywords: time, domain, frequency, weight vector, feature
Application number: PCT/CN2021/122350
Priority date: 2021-09-24 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2021-09-30
Other languages: English (en), French (fr)
Inventors: 涂卫平, 韩畅, 刘雅洁, 肖立, 杨玉红, 刘陈建树
Original assignee: 武汉大学 (Wuhan University)
Application filed by 武汉大学
Publication of WO2023044962A1 (zh)

Classifications

    • G10L21/0224 — Speech enhancement (G10L21/02): noise filtering characterised by the method used for estimating noise (G10L21/0216), processing in the time domain
    • G10L21/0232 — Speech enhancement (G10L21/02): noise filtering characterised by the method used for estimating noise (G10L21/0216), processing in the frequency domain
    • G10L25/30 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00, characterised by the analysis technique: using neural networks
    • G10L2021/02082 — Noise filtering: the noise being echo, reverberation of the speech



Abstract

A feature extraction method and device based on the time domain and frequency domain of a speech signal, and an echo cancellation method and device. The feature extraction method based on the time domain and frequency domain of the speech signal comprises: first, obtaining time-frequency-domain features of the speech signal by means of a short-time Fourier transform, and then obtaining intermediate mapping features by means of a multi-layer convolutional neural network; next, obtaining a time weight vector by means of a time-domain attention module, expanding it to the same dimensions as the intermediate mapping features and taking the Hadamard product to obtain time-domain-weighted mapping features; then obtaining a frequency weight vector by means of a frequency-domain attention module, expanding it to the same dimensions as the time-weighted mapping features and taking the Hadamard product to obtain the final mapping features weighted in the time and frequency domains. The time-domain and frequency-domain attention modules can easily be embedded into a convolutional-neural-network-based acoustic echo cancellation model, enabling the model to adaptively learn the weights of the time-frequency-domain features and thereby improving model performance.

Description

Feature extraction method and device based on time domain and frequency domain of speech signal, and echo cancellation method and device
Technical Field
The present invention relates to the field of audio signal processing, and in particular to a feature extraction method and device based on the time domain and frequency domain of speech signals, and an echo cancellation method and device.
Background
In two-way voice communication, an acoustic echo is produced when the far-end signal played by the near-end loudspeaker is picked up by the near-end microphone and sent back to the far end. Acoustic echo greatly degrades the user's call experience and downstream speech processing such as speech recognition, so cancelling acoustic echo as completely as possible without introducing near-end speech distortion has become a research focus in speech front-end processing worldwide. In recent years, deep learning methods have surpassed traditional adaptive filtering methods and achieved great success in the field of echo cancellation.
In the course of implementing the present invention, the inventors of this application found the following technical problem in the prior art:
In current convolutional-neural-network-based acoustic echo cancellation models operating in the time-frequency domain, one of the most common approaches is the convolutional recurrent network. Its drawback is that such models mainly model the long-term dependencies of features along the time axis and do not take the influence of the frequency distribution into account, so the extracted feature information is not comprehensive enough and the final echo cancellation performance is poor.
Summary of the Invention
The present invention proposes a feature extraction method and device based on the time domain and frequency domain of speech signals, and an echo cancellation method and device, to solve, or at least partially solve, the technical problem that the feature information extracted by existing methods is not comprehensive enough and the final echo cancellation performance is poor. The feature extraction device based on the time domain and frequency domain of speech signals (i.e., the attention module based on the time domain and frequency domain of the speech signal) can be conveniently embedded into the echo cancellation device (i.e., the acoustic echo cancellation model based on a convolutional neural network), enabling the model to adaptively learn the weights of the time-frequency features and thereby improving model performance.
To solve the above technical problem, a first aspect of the present invention provides a feature extraction method based on the time domain and frequency domain of speech signals, comprising:
S1: computing a time weight vector from intermediate mapping features, and expanding the time weight vector to the same dimensions as the intermediate mapping features, wherein the intermediate mapping features are obtained by transforming the time-frequency features of the speech signal through a multi-layer convolutional neural network, and the time weight vector contains the important time-frame information in the speech features;
S2: taking the Hadamard product of the intermediate mapping features and the time weight vector to obtain time-domain-weighted mapping features;
S3: computing a frequency weight vector from the time-domain-weighted mapping features, and expanding the frequency weight vector to the same dimensions as the time-domain-weighted mapping features, wherein the frequency weight vector contains the important frequency information in the speech features;
S4: taking the Hadamard product of the frequency weight vector and the time-domain-weighted mapping features to obtain mapping features weighted in the time and frequency domains.
In one embodiment, step S1 comprises:
S1.1: performing global max pooling and average pooling on the intermediate mapping features over the channel and frequency dimensions to obtain a max-pooled first weight vector and an average-pooled second weight vector; the two weight vectors are of equal size and retain the important channel- and frequency-dimension information of each time frame of the speech features;
S1.2: feeding the max-pooled first weight vector and the average-pooled second weight vector separately into a first long short-term memory network, to learn the weights of the temporal features while preserving the causal dependence of the time series, obtaining two updated weight vectors;
S1.3: adding the two updated weight vectors pointwise and passing the result through a sigmoid activation function to obtain the time weight vector.
In one embodiment, step S3 comprises:
S3.1: performing global max pooling and average pooling on the time-domain-weighted mapping features over the channel dimension to obtain a max-pooled third weight vector and an average-pooled fourth weight vector; the two weight vectors are of equal size and retain the important channel-dimension information of the time-domain-weighted mapping features;
S3.2: stacking the third weight vector and the fourth weight vector along the channel dimension, and then obtaining a fused weight vector with a one-dimensional convolutional neural network and a batch normalization layer, to learn the importance of each frequency of the features;
S3.3: passing the fused weight vector through a sigmoid activation function to obtain the frequency weight vector.
Based on the same inventive concept, a second aspect of the present invention provides a feature extraction device based on the time domain and frequency domain of speech signals; the device is an attention module comprising:
a time-domain attention module for computing a time weight vector from intermediate mapping features, and expanding the time weight vector to the same dimensions as the intermediate mapping features, wherein the intermediate mapping features are obtained by transforming the time-frequency features of the speech signal through a multi-layer convolutional neural network, and the time weight vector contains the important time-frame information in the speech features;
a time-domain weighting module for taking the Hadamard product of the intermediate mapping features and the time weight vector to obtain time-domain-weighted mapping features;
a frequency-domain attention module for computing a frequency weight vector from the time-domain-weighted mapping features, and expanding the frequency weight vector to the same dimensions as the time-domain-weighted mapping features, wherein the frequency weight vector contains the important frequency information in the speech features;
a frequency-domain weighting module for taking the Hadamard product of the frequency weight vector and the time-domain-weighted mapping features to obtain the final mapping features weighted in the time domain and frequency domain.
Based on the same inventive concept, a third aspect of the present invention provides an echo cancellation method, comprising:
computing the real and imaginary parts of the far-end reference signal and the near-end microphone signal with a short-time Fourier transform, and stacking the real and imaginary parts of the far-end reference signal and the near-end microphone signal along the channel dimension to form initial acoustic features with four input channels;
applying complex-domain two-dimensional convolution to the initial acoustic features to obtain intermediate mapping features;
performing feature extraction on the intermediate mapping features with the feature extraction method of claim 1 to obtain mapping features weighted in the time domain and frequency domain;
performing temporal feature learning on the intermediate mapping features to obtain time-modeled features;
obtaining a complex-domain ratio mask from the time-modeled features and the mapping features weighted in the time domain and frequency domain;
masking the real and imaginary parts of the near-end microphone signal with the complex-domain ratio mask, and applying an inverse short-time Fourier transform to the masked real and imaginary parts to obtain the echo-cancelled signal.
Based on the same inventive concept, a fourth aspect of the present invention provides an echo cancellation device; the device is an acoustic echo cancellation model based on a convolutional neural network, the model comprising:
a preprocessing module for computing the real and imaginary parts of the far-end reference signal and the near-end microphone signal with a short-time Fourier transform, and stacking the real and imaginary parts of the far-end reference signal and the near-end microphone signal along the channel dimension to form initial acoustic features with four input channels;
an encoder based on complex-domain two-dimensional convolution, for applying complex-domain two-dimensional convolution to the initial acoustic features to obtain intermediate mapping features;
an attention module for performing feature extraction on the intermediate mapping features to obtain mapping features weighted in the time domain and frequency domain;
a second long short-term memory network for performing temporal feature learning on the intermediate mapping features to obtain time-modeled features;
a decoder based on complex-domain two-dimensional transposed convolution, for obtaining a complex-domain ratio mask from the time-modeled features and the mapping features weighted in the time domain and frequency domain;
a transform module for masking the real and imaginary parts of the near-end microphone signal with the complex-domain ratio mask, and applying an inverse short-time Fourier transform to the masked real and imaginary parts to obtain the echo-cancelled signal.
In one embodiment, the encoder based on complex-domain two-dimensional convolution comprises six layers of complex-domain two-dimensional convolution modules, wherein each complex-domain two-dimensional convolution block comprises a complex convolution layer, a complex batch normalization layer and an activation function.
In one embodiment, the decoder based on complex-domain two-dimensional transposed convolution comprises six complex-domain two-dimensional transposed convolution blocks, each comprising a complex transposed convolution layer, a complex batch normalization layer and an activation function.
The one or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
The feature extraction method based on the time domain and frequency domain of speech signals provided by the present invention can adaptively weight the speech features in the time domain and frequency domain and can fully retain the feature information of both domains, so that the extracted feature information is more comprehensive.
The echo cancellation method and device provided by the present invention can conveniently embed the attention module into a convolutional-neural-network-based acoustic echo cancellation task and adaptively weight the speech features in the time domain and frequency domain, thereby improving the acoustic echo cancellation performance.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art could derive other drawings from them without creative effort.
Fig. 1 is a framework diagram of the acoustic echo cancellation model based on a convolutional neural network in an implementation of the present invention;
Fig. 2 is a flow chart of the encoder based on complex-domain two-dimensional convolution modules in an implementation of the present invention;
Fig. 3 is a flow chart of a complex-domain two-dimensional convolution block in an implementation of the present invention;
Fig. 4 is a flow chart of the attention module based on time-domain and frequency-domain weighting in an implementation of the present invention;
Fig. 5 is a flow chart of the time-domain attention module in an implementation of the present invention;
Fig. 6 is a flow chart of the frequency-domain attention module in an implementation of the present invention;
Fig. 7 is a flow chart of the decoder based on complex-domain two-dimensional transposed convolution modules in an implementation of the present invention.
Detailed Description of the Embodiments
Through extensive research and practice, the inventors of this application found the following:
According to the theory of auditory dynamic attention, humans tend to adaptively adjust their attention with dynamic neuronal circuits to perceive complex environments. For example, if acoustic echo dominates during a voice call, the user has to concentrate more attention to overcome the echo interference and understand the meaning of the other party's speech. In addition, the spectrum of a speech signal contains rich frequency components: formants are densely distributed in the low-frequency region and sparsely distributed in the high-frequency region, so different spectral regions need to be distinguished with different weights. Inspired by this, the present invention uses an attention module to adaptively weight the speech features in the time domain and frequency domain, thereby improving the performance of the convolutional-neural-network-based acoustic echo cancellation model.
The main idea of the present invention is as follows:
First, the real and imaginary parts of the far-end reference signal and the near-end microphone signal are computed with a short-time Fourier transform; intermediate mapping features are then computed by the encoder module based on complex-domain two-dimensional convolution, and the temporal dependencies of the intermediate mapping features are modeled by a long short-term memory network. In addition, the encoder and the decoder are connected through attention modules weighted over the time domain and frequency domain of the speech signal, so that the features are adaptively weighted in the two dimensions of time and frequency. Finally, the decoder module based on complex-domain two-dimensional transposed convolution outputs a complex-domain ratio mask, which is used to mask the real and imaginary parts of the near-end microphone signal; the masked real and imaginary parts are passed through an inverse short-time Fourier transform to obtain the estimated near-end clean speech.
From the above method provided by the present invention, it can be seen that the attention module based on time-domain and frequency-domain weighting of the speech signal can easily be embedded into a convolutional-neural-network-based acoustic echo cancellation task and adaptively weights the speech features in the time domain and frequency domain, thereby improving the acoustic echo cancellation performance.
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment 1
An embodiment of the present invention provides a feature extraction method based on the time domain and frequency domain of speech signals, comprising:
S1: computing a time weight vector from intermediate mapping features, and expanding the time weight vector to the same dimensions as the intermediate mapping features, wherein the intermediate mapping features are obtained by transforming the time-frequency features of the speech signal through a multi-layer convolutional neural network, and the time weight vector contains the important time-frame information in the speech features;
S2: taking the Hadamard product of the intermediate mapping features and the time weight vector to obtain time-domain-weighted mapping features;
S3: computing a frequency weight vector from the time-domain-weighted mapping features, and expanding the frequency weight vector to the same dimensions as the time-domain-weighted mapping features, wherein the frequency weight vector contains the important frequency information in the speech features;
S4: taking the Hadamard product of the frequency weight vector and the time-domain-weighted mapping features to obtain mapping features weighted in the time and frequency domains.
In a specific implementation, the time-frequency features of the speech signal can be computed with a short-time Fourier transform and then transformed by a multi-layer convolutional neural network to obtain the intermediate mapping features, whose layout is (batch size, time dimension, channel dimension, frequency dimension).
In one embodiment, step S1 comprises:
S1.1: performing global max pooling and average pooling on the intermediate mapping features over the channel and frequency dimensions to obtain a max-pooled first weight vector and an average-pooled second weight vector; the two weight vectors are of equal size and retain the important channel- and frequency-dimension information of each time frame of the speech features;
S1.2: feeding the max-pooled first weight vector and the average-pooled second weight vector separately into a first long short-term memory network, to learn the weights of the temporal features while preserving the causal dependence of the time series, obtaining two updated weight vectors;
S1.3: adding the two updated weight vectors pointwise and passing the result through a sigmoid activation function to obtain the time weight vector.
Specifically, the first weight vector and the second weight vector retain the important channel- and frequency-dimension information of each time frame of the speech features; the time frames of the features can then be weighted according to this information to highlight the important time frames.
Here, max pooling of the intermediate mapping features retains the most salient channel and frequency information, and the time axis is then weighted according to the retained information, so that time steps rich in channel- and frequency-dimension information receive larger weights. However, if only max pooling were used, all the secondary information of the channel and frequency dimensions would be discarded and too much information would be lost, so the information retained by average pooling is used as a complement.
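For illustration only, the following is a minimal PyTorch-style sketch of the time-domain attention of steps S1.1–S1.3 together with the weighting of step S2. PyTorch itself, the hidden size and all identifiers are assumptions of the sketch, not part of the patent; the feature layout (batch size, time, channel, frequency) follows the description above.

```python
import torch
import torch.nn as nn

class TimeDomainAttention(nn.Module):
    """Sketch of steps S1.1-S1.3 and the Hadamard weighting of step S2."""
    def __init__(self, hidden: int = 16):
        super().__init__()
        # A unidirectional LSTM preserves the causal dependence of the time
        # series; both pooled vectors pass through this "first" LSTM (S1.2).
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channel, freq)
        # S1.1: global max/avg pooling over the channel and frequency
        # dimensions, one value per time frame -> (batch, time, 1)
        v_max = x.amax(dim=(2, 3)).unsqueeze(-1)
        v_avg = x.mean(dim=(2, 3)).unsqueeze(-1)
        # S1.2: update both weight vectors with the LSTM
        u_max = self.proj(self.lstm(v_max)[0])
        u_avg = self.proj(self.lstm(v_avg)[0])
        # S1.3: pointwise sum + sigmoid -> time weight vector, then expand
        w = torch.sigmoid(u_max + u_avg).unsqueeze(-1)  # (batch, time, 1, 1)
        # S2: Hadamard product, broadcast over channel and frequency
        return x * w
```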
In one embodiment, step S3 comprises:
S3.1: performing global max pooling and average pooling on the time-domain-weighted mapping features over the channel dimension to obtain a max-pooled third weight vector and an average-pooled fourth weight vector; the two weight vectors are of equal size and retain the important channel-dimension information of the time-domain-weighted mapping features;
S3.2: stacking the third weight vector and the fourth weight vector along the channel dimension, and then obtaining a fused weight vector with a one-dimensional convolutional neural network and a batch normalization layer, to learn the importance of each frequency of the features;
S3.3: passing the fused weight vector through a sigmoid activation function to obtain the frequency weight vector.
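Likewise, a minimal sketch of the frequency-domain attention of steps S3.1–S3.3 together with the weighting of step S4 might look as follows; the kernel size and all identifiers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrequencyDomainAttention(nn.Module):
    """Sketch of steps S3.1-S3.3 and the Hadamard weighting of step S4."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # Two input channels (max- and avg-pooled vectors), one fused output
        self.conv = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)
        self.bn = nn.BatchNorm1d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: time-domain-weighted features, (batch, time, channel, freq)
        b, t, c, f = x.shape
        # S3.1: global max/avg pooling over the channel dimension
        v_max = x.amax(dim=2)                    # (batch, time, freq)
        v_avg = x.mean(dim=2)
        # S3.2: stack along a channel dimension, fuse with Conv1d + BatchNorm
        v = torch.stack((v_max, v_avg), dim=2)   # (batch, time, 2, freq)
        v = self.bn(self.conv(v.reshape(b * t, 2, f)))
        # S3.3: sigmoid -> frequency weight vector, expanded over channels
        w = torch.sigmoid(v).reshape(b, t, 1, f)
        # S4: Hadamard product, broadcast over the channel dimension
        return x * w
```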
Embodiment 2
Based on the same inventive concept, this embodiment provides a feature extraction device based on the time domain and frequency domain of speech signals; the device is an attention module comprising:
a time-domain attention module for computing a time weight vector from intermediate mapping features, and expanding the time weight vector to the same dimensions as the intermediate mapping features, wherein the intermediate mapping features are obtained by transforming the time-frequency features of the speech signal through a multi-layer convolutional neural network, and the time weight vector contains the important time-frame information in the speech features;
a time-domain weighting module for taking the Hadamard product of the intermediate mapping features and the time weight vector to obtain time-domain-weighted mapping features;
a frequency-domain attention module for computing a frequency weight vector from the time-domain-weighted mapping features, and expanding the frequency weight vector to the same dimensions as the time-domain-weighted mapping features, wherein the frequency weight vector contains the important frequency information in the speech features;
a frequency-domain weighting module for taking the Hadamard product of the frequency weight vector and the time-domain-weighted mapping features to obtain the final mapping features weighted in the time domain and frequency domain.
The attention module based on time-domain and frequency-domain weighting is shown in Fig. 4.
Since the device introduced in Embodiment 2 of the present invention is the device used to implement the feature extraction method based on the time domain and frequency domain of speech signals of Embodiment 1, a person skilled in the art can understand the specific structure and variations of this device from the method introduced in Embodiment 1, so details are not repeated here. All devices used by the method of Embodiment 1 of the present invention fall within the intended protection scope of the present invention.
As the attention module provided by the present invention shows, the invention can easily be embedded into a convolutional-neural-network-based acoustic echo cancellation task and adaptively weights the speech features in the time domain and frequency domain, thereby improving the acoustic echo cancellation performance.
Embodiment 3
Based on the same inventive concept, this embodiment provides an echo cancellation method, comprising:
S101: computing the real and imaginary parts of the far-end reference signal and the near-end microphone signal with a short-time Fourier transform, and stacking the real and imaginary parts of the far-end reference signal and the near-end microphone signal along the channel dimension to form initial acoustic features with four input channels;
S102: applying complex-domain two-dimensional convolution to the initial acoustic features to obtain intermediate mapping features;
S103: performing feature extraction on the intermediate mapping features to obtain mapping features weighted in the time domain and frequency domain;
S104: performing temporal feature learning on the intermediate mapping features to obtain time-modeled features;
S105: obtaining a complex-domain ratio mask from the time-modeled features and the mapping features weighted in the time domain and frequency domain;
S106: masking the real and imaginary parts of the near-end microphone signal with the complex-domain ratio mask, and applying an inverse short-time Fourier transform to the masked real and imaginary parts to obtain the echo-cancelled signal.
In a specific implementation, the layout of the initial acoustic features in step S101 is (batch size, 4, frequency dimension, time dimension).
Specifically, the frame length, the frame shift and the length of the short-time Fourier transform can be adjusted as required. As one implementation, the far-end reference signal and the near-end microphone signal can be divided into time frames of 25 milliseconds each, with a 15-millisecond overlap between every two adjacent time frames; a 512-point short-time Fourier transform is then applied to the far-end reference signal and the near-end microphone signal, which yields 257 frequency bins.
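As an illustration of this framing, the sketch below builds the (batch size, 4, frequency, time) initial acoustic features of step S101 with torch.stft. The 16 kHz sampling rate is an assumption (the text does not specify one); with it, 25 ms frames are 400 samples and the 10 ms hop (15 ms overlap) is 160 samples.

```python
import torch

def initial_acoustic_features(far_end: torch.Tensor,
                              near_end: torch.Tensor,
                              sr: int = 16000) -> torch.Tensor:
    """Sketch of step S101; far_end/near_end: (batch, samples) waveforms."""
    win_length = int(0.025 * sr)           # 25 ms frame
    hop_length = int(0.010 * sr)           # 15 ms overlap -> 10 ms shift
    window = torch.hann_window(win_length)

    def stft(x: torch.Tensor) -> torch.Tensor:
        # 512-point STFT -> 257 frequency bins: (batch, 257, time)
        return torch.stft(x, n_fft=512, hop_length=hop_length,
                          win_length=win_length, window=window,
                          return_complex=True)

    far, near = stft(far_end), stft(near_end)
    # Stack the real/imag parts of both signals along the channel dimension
    return torch.stack((far.real, far.imag, near.real, near.imag), dim=1)
    # -> (batch, 4, frequency, time)
```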
Step S102: pass the initial acoustic features of step S101 through the encoder composed of complex-domain two-dimensional convolution modules, where the dimensions of the intermediate mapping features output by each layer of complex-domain two-dimensional convolution modules differ.
Step S103: feed the features output in step S102 into the six attention modules based on time-domain and frequency-domain weighting, respectively.
Step S1031: input the intermediate mapping features of step S102 into the time-domain attention module shown in Fig. 5 to obtain a time weight vector, and expand it to the same dimensions as the intermediate mapping features of step S102. Specifically, the time-domain attention module performs global max pooling and average pooling on the intermediate mapping features of step S102 over the channel and frequency dimensions to obtain two weight vectors of equal size, one from max pooling and the other from average pooling; the two weight vectors are then fed separately into the long short-term memory network to be updated, and finally the two updated weight vectors are added pointwise and passed through a sigmoid activation function to obtain the time weight vector.
Step S1032: take the Hadamard product of the intermediate mapping features of step S102 and the time weight vector of step S1031 to obtain the time-domain-weighted mapping features.
Step S1033: input the time-domain-weighted mapping features of step S1032 into the frequency-domain attention module shown in Fig. 6 to obtain a frequency weight vector, and expand it to the same dimensions as the time-domain-weighted mapping features of step S1032. Specifically, the frequency-domain attention module performs max pooling and average pooling on the time-domain-weighted mapping features of step S1032 over the channel dimension to obtain two weight vectors of equal size, one from max pooling and the other from average pooling; the two weight vectors are then stacked along the channel dimension, a fused weight vector is obtained with a one-dimensional convolutional network and a batch normalization layer, and finally the fused weight vector is passed through a sigmoid activation function to obtain the frequency weight vector.
Step S1034: take the Hadamard product of the frequency weight vector of step S1033 and the time-domain-weighted mapping features of step S1032 to obtain the final mapping features weighted in the time domain and frequency domain.
Step S104: input the output features of the encoder of step S102 into the second long short-term memory network, and output the time-modeled features.
The parameters of the second long short-term memory network can be adjusted as required. As an implementation, the present invention uses a two-layer long short-term memory network with 800 hidden units per layer; the output layer is a fully connected network composed of 257 neurons.
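A minimal sketch of this temporal model is shown below; flattening the encoder output into (batch, time, features) and the LSTM input size are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class TemporalModel(nn.Module):
    """Step S104 sketch: two LSTM layers of 800 units + a 257-unit FC layer."""
    def __init__(self, input_size: int = 257):   # input size is illustrative
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size=800, num_layers=2,
                            batch_first=True)
        self.fc = nn.Linear(800, 257)             # fully connected output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: encoder output flattened to (batch, time, input_size)
        out, _ = self.lstm(x)
        return self.fc(out)                       # (batch, time, 257)
```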
Step S105: feed the output of step S104 into the decoder based on complex-domain two-dimensional transposed convolution, and at the same time feed the outputs of the six attention modules based on time-domain and frequency-domain weighting of step S103 into the decoder's six complex transposed convolution modules respectively, thereby obtaining the complex-domain ratio mask.
Step S106: mask the real and imaginary parts of the near-end microphone signal with the complex-domain ratio mask of step S105, and apply an inverse short-time Fourier transform to the masked real and imaginary parts to obtain the echo-cancelled signal.
Since the method introduced in Embodiment 3 of the present invention is implemented based on the feature extraction method based on the time domain and frequency domain of speech signals of Embodiment 1, a person skilled in the art can understand the specific implementation steps of this method from the method introduced in Embodiment 1, so details are not repeated here.
Embodiment 4
Based on the same inventive concept, this embodiment provides an echo cancellation device; the device is an acoustic echo cancellation model based on a convolutional neural network, the model comprising:
a preprocessing module for computing the real and imaginary parts of the far-end reference signal and the near-end microphone signal with a short-time Fourier transform, and stacking the real and imaginary parts of the far-end reference signal and the near-end microphone signal along the channel dimension to form initial acoustic features with four input channels;
an encoder based on complex-domain two-dimensional convolution, for applying complex-domain two-dimensional convolution to the initial acoustic features to obtain intermediate mapping features;
an attention module for performing feature extraction on the intermediate mapping features to obtain mapping features weighted in the time domain and frequency domain;
a second long short-term memory network for performing temporal feature learning on the intermediate mapping features to obtain time-modeled features;
a decoder based on complex-domain two-dimensional transposed convolution, for obtaining a complex-domain ratio mask from the time-modeled features and the mapping features weighted in the time domain and frequency domain;
a transform module for masking the real and imaginary parts of the near-end microphone signal with the complex-domain ratio mask, and applying an inverse short-time Fourier transform to the masked real and imaginary parts to obtain the echo-cancelled signal.
Fig. 1 shows the framework of the acoustic echo cancellation model based on a convolutional neural network in an implementation of the present invention.
In one embodiment, the encoder based on complex-domain two-dimensional convolution comprises six layers of complex-domain two-dimensional convolution modules, wherein each complex-domain two-dimensional convolution block comprises a complex convolution layer, a complex batch normalization layer and an activation function.
Specifically, the encoder composed of six layers of complex-domain two-dimensional convolution modules is shown in Fig. 2.
A convolutional-neural-network-based encoder is used, in which network parameters such as the number of convolutional layers, the number of input and output channels of each layer, the kernel size and the stride can be adjusted as required. As one implementation, the encoder consists of six complex-domain two-dimensional convolution blocks; as shown in Fig. 3, each complex-domain two-dimensional convolution block contains a complex convolution layer, a complex batch normalization layer and an activation function. The numbers of input channels of the two-dimensional convolution blocks are {4, 32, 64, 128, 256, 256}; the kernel size of each convolutional layer is (3, 2) along the time and frequency dimensions, and the stride is (2, 1). The convolution kernel W of a complex convolution layer can be expressed as W = W_r + jW_i, where W_r and W_i are the kernels simulating the real part and the imaginary part respectively, r denotes the real part of a complex number, i denotes the imaginary part, and j denotes the imaginary unit. The intermediate speech features are defined as X = X_r + jX_i, where X_r and X_i denote the real and imaginary parts of the features respectively, so the output Y of each complex convolution layer can be expressed as
Y = (X_r * W_r − X_i * W_i) + j·(X_r * W_i + X_i * W_r)
where * denotes the conventional two-dimensional convolution operation; it follows that a complex convolution layer contains four conventional two-dimensional convolutions. Complex batch normalization can be viewed as the problem of whitening two-dimensional vectors. The activation function is PReLU, whose formula is:
PReLU(x) = x,   if x > 0
PReLU(x) = a·x, if x ≤ 0
where x denotes the input variable of the activation function and a denotes the slope parameter.
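The sketch below illustrates one such block following the formula above: a complex convolution realized as four real two-dimensional convolutions, followed by batch normalization and PReLU. The kernel size (3, 2) and stride (2, 1) follow the text, while the padding and the use of two per-part BatchNorm2d layers in place of a true complex (whitening) batch normalization are simplifying assumptions of the sketch. With the input channel counts {4, 32, 64, 128, 256, 256} listed above, six such blocks would be stacked to form the encoder of Fig. 2.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Y = (Xr*Wr - Xi*Wi) + j(Xr*Wi + Xi*Wr): four real 2-D convolutions,
    realized with two Conv2d modules each applied to both feature parts."""
    def __init__(self, in_ch: int, out_ch: int,
                 kernel=(3, 2), stride=(2, 1), padding=(1, 0)):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel, stride, padding)
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel, stride, padding)

    def forward(self, xr, xi):
        yr = self.conv_r(xr) - self.conv_i(xi)   # real part of the output
        yi = self.conv_i(xr) + self.conv_r(xi)   # imaginary part
        return yr, yi

class ComplexConvBlock(nn.Module):
    """One encoder block: complex conv + batch norm + PReLU (simplified)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = ComplexConv2d(in_ch, out_ch)
        self.bn_r = nn.BatchNorm2d(out_ch)
        self.bn_i = nn.BatchNorm2d(out_ch)
        self.act = nn.PReLU()

    def forward(self, xr, xi):
        yr, yi = self.conv(xr, xi)
        return self.act(self.bn_r(yr)), self.act(self.bn_i(yi))
```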
In one embodiment, the decoder based on complex-domain two-dimensional transposed convolution comprises six complex-domain two-dimensional transposed convolution blocks, each comprising a complex transposed convolution layer, a complex batch normalization layer and an activation function.
Specifically, the output of the second long short-term memory network is fed into the decoder based on complex-domain two-dimensional transposed convolution, while the outputs of the six attention modules based on time-domain and frequency-domain weighting are fed into the decoder's six complex transposed convolution modules respectively, thereby obtaining the complex-domain ratio mask.
Specifically, the decoder and the encoder have symmetric structures. As shown in Fig. 7, the decoder based on complex-domain two-dimensional transposed convolution consists of six complex-domain two-dimensional transposed convolution blocks, each containing a complex transposed convolution layer, a complex batch normalization layer and an activation function. The complex transposed convolution layer is similar to the complex convolution layer, except that the convolution operation is replaced with a transposed convolution operation. The numbers of input channels of the two-dimensional transposed convolution blocks are {512, 512, 256, 128, 64, 4}. The input of each of the six complex-domain two-dimensional transposed convolution blocks is formed by stacking the output of the previous layer and the corresponding time-frequency-weighted intermediate mapping features along the channel dimension, and the final output of the decoder is the complex-domain ratio mask. The complex-domain ratio mask (CRM) is defined as follows:
CRM = (Y_r·S_r + Y_i·S_i) / (Y_r^2 + Y_i^2) + j·(Y_r·S_i − Y_i·S_r) / (Y_r^2 + Y_i^2)
where Y_r and Y_i denote the real and imaginary parts of the near-end microphone signal, and S_r and S_i denote the real and imaginary parts of the near-end clean speech, respectively.
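For illustration, the CRM training target can be computed directly from the complex STFTs as in the sketch below; the small epsilon is a numerical-stability assumption not mentioned in the text.

```python
import torch

def complex_ratio_mask(Y: torch.Tensor, S: torch.Tensor,
                       eps: float = 1e-8) -> torch.Tensor:
    """CRM per the definition above. Y: complex STFT of the near-end
    microphone signal; S: complex STFT of the near-end clean speech."""
    denom = Y.real ** 2 + Y.imag ** 2 + eps
    crm_r = (Y.real * S.real + Y.imag * S.imag) / denom
    crm_i = (Y.real * S.imag - Y.imag * S.real) / denom
    return torch.complex(crm_r, crm_i)    # equivalently S * conj(Y) / |Y|^2
```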
The complex-domain ratio mask obtained by the decoder based on complex-domain two-dimensional transposed convolution is used to mask the real and imaginary parts of the near-end microphone signal, and the masked real and imaginary parts are passed through an inverse short-time Fourier transform to obtain the echo-cancelled signal.
Specifically, the complex-domain ratio mask estimated by the decoder yields the complex representation of the near-end clean speech through the following formula:
Ŝ = Ŝ_r + j·Ŝ_i = (Y_r·M_r − Y_i·M_i) + j·(Y_r·M_i + Y_i·M_r)
where M_r and M_i denote the real and imaginary parts of the mask output by the decoder, respectively. The real and imaginary parts of the estimated near-end speech Ŝ are then transformed with an inverse discrete Fourier transform to obtain the clean near-end time-domain representation.
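A sketch of this final masking and inverse-transform step follows, reusing the assumed analysis parameters of the preprocessing sketch; the complex multiplication Y * M expands to exactly the formula above.

```python
import torch

def estimate_near_end(Y: torch.Tensor, M: torch.Tensor,
                      sr: int = 16000) -> torch.Tensor:
    """Step S106 sketch. Y: complex STFT of the near-end microphone signal,
    (batch, 257, time); M: complex-domain ratio mask from the decoder."""
    S_hat = Y * M                          # (Yr*Mr - Yi*Mi) + j(Yr*Mi + Yi*Mr)
    win_length = int(0.025 * sr)
    window = torch.hann_window(win_length)
    # Inverse STFT back to the estimated clean near-end waveform
    return torch.istft(S_hat, n_fft=512, hop_length=int(0.010 * sr),
                       win_length=win_length, window=window)
```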
Since the device introduced in Embodiment 4 of the present invention is the device used to implement the echo cancellation method of Embodiment 3, a person skilled in the art can understand the specific structure and variations of this device from the method introduced in Embodiment 3, so details are not repeated here. All devices used by the method of Embodiment 3 of the present invention fall within the intended protection scope of the present invention.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

  1. A feature extraction method based on the time domain and frequency domain of speech signals, characterized by comprising:
    S1: computing a time weight vector from intermediate mapping features, and expanding the time weight vector to the same dimensions as the intermediate mapping features, wherein the intermediate mapping features are obtained by transforming the time-frequency features of the speech signal through a multi-layer convolutional neural network, and the time weight vector contains the important time-frame information in the speech features;
    S2: taking the Hadamard product of the intermediate mapping features and the time weight vector to obtain time-domain-weighted mapping features;
    S3: computing a frequency weight vector from the time-domain-weighted mapping features, and expanding the frequency weight vector to the same dimensions as the time-domain-weighted mapping features, wherein the frequency weight vector contains the important frequency information in the speech features;
    S4: taking the Hadamard product of the frequency weight vector and the time-domain-weighted mapping features to obtain mapping features weighted in the time and frequency domains.
  2. The feature extraction method according to claim 1, characterized in that step S1 comprises:
    S1.1: performing global max pooling and average pooling on the intermediate mapping features over the channel and frequency dimensions to obtain a max-pooled first weight vector and an average-pooled second weight vector; the two weight vectors are of equal size and retain the important channel- and frequency-dimension information of each time frame of the speech features;
    S1.2: feeding the max-pooled first weight vector and the average-pooled second weight vector separately into a first long short-term memory network, to learn the weights of the temporal features while preserving the causal dependence of the time series, obtaining two updated weight vectors;
    S1.3: adding the two updated weight vectors pointwise and passing the result through a sigmoid activation function to obtain the time weight vector.
  3. The feature extraction method according to claim 1, characterized in that step S3 comprises:
    S3.1: performing global max pooling and average pooling on the time-domain-weighted mapping features over the channel dimension to obtain a max-pooled third weight vector and an average-pooled fourth weight vector; the two weight vectors are of equal size and retain the important channel-dimension information of the time-domain-weighted mapping features;
    S3.2: stacking the third weight vector and the fourth weight vector along the channel dimension, and then obtaining a fused weight vector with a one-dimensional convolutional neural network and a batch normalization layer, to learn the importance of each frequency of the features;
    S3.3: passing the fused weight vector through a sigmoid activation function to obtain the frequency weight vector.
  4. A feature extraction device based on the time domain and frequency domain of speech signals, characterized in that the device is an attention module comprising:
    a time-domain attention module for computing a time weight vector from intermediate mapping features, and expanding the time weight vector to the same dimensions as the intermediate mapping features, wherein the intermediate mapping features are obtained by transforming the time-frequency features of the speech signal through a multi-layer convolutional neural network, and the time weight vector contains the important time-frame information in the speech features;
    a time-domain weighting module for taking the Hadamard product of the intermediate mapping features and the time weight vector to obtain time-domain-weighted mapping features;
    a frequency-domain attention module for computing a frequency weight vector from the time-domain-weighted mapping features, and expanding the frequency weight vector to the same dimensions as the time-domain-weighted mapping features, wherein the frequency weight vector contains the important frequency information in the speech features;
    a frequency-domain weighting module for taking the Hadamard product of the frequency weight vector and the time-domain-weighted mapping features to obtain the final mapping features weighted in the time domain and frequency domain.
  5. An echo cancellation method, characterized by comprising:
    computing the real and imaginary parts of the far-end reference signal and the near-end microphone signal with a short-time Fourier transform, and stacking the real and imaginary parts of the far-end reference signal and the near-end microphone signal along the channel dimension to form initial acoustic features with four input channels;
    applying complex-domain two-dimensional convolution to the initial acoustic features to obtain intermediate mapping features;
    performing feature extraction on the intermediate mapping features with the feature extraction method of claim 1 to obtain mapping features weighted in the time domain and frequency domain;
    performing temporal feature learning on the intermediate mapping features to obtain time-modeled features;
    obtaining a complex-domain ratio mask from the time-modeled features and the mapping features weighted in the time domain and frequency domain;
    masking the real and imaginary parts of the near-end microphone signal with the complex-domain ratio mask, and applying an inverse short-time Fourier transform to the masked real and imaginary parts to obtain the echo-cancelled signal.
  6. An echo cancellation device, characterized in that the device is an acoustic echo cancellation model based on a convolutional neural network, the model comprising:
    a preprocessing module for computing the real and imaginary parts of the far-end reference signal and the near-end microphone signal with a short-time Fourier transform, and stacking the real and imaginary parts of the far-end reference signal and the near-end microphone signal along the channel dimension to form initial acoustic features with four input channels;
    an encoder based on complex-domain two-dimensional convolution, for applying complex-domain two-dimensional convolution to the initial acoustic features to obtain intermediate mapping features;
    an attention module for performing feature extraction on the intermediate mapping features to obtain mapping features weighted in the time domain and frequency domain;
    a second long short-term memory network for performing temporal feature learning on the intermediate mapping features to obtain time-modeled features;
    a decoder based on complex-domain two-dimensional transposed convolution, for obtaining a complex-domain ratio mask from the time-modeled features and the mapping features weighted in the time domain and frequency domain;
    a transform module for masking the real and imaginary parts of the near-end microphone signal with the complex-domain ratio mask, and applying an inverse short-time Fourier transform to the masked real and imaginary parts to obtain the echo-cancelled signal.
  7. The echo cancellation device according to claim 6, characterized in that the encoder based on complex-domain two-dimensional convolution comprises six layers of complex-domain two-dimensional convolution modules, wherein each complex-domain two-dimensional convolution block comprises a complex convolution layer, a complex batch normalization layer and an activation function.
  8. The echo cancellation device according to claim 6, characterized in that the decoder based on complex-domain two-dimensional transposed convolution comprises six complex-domain two-dimensional transposed convolution blocks, each comprising a complex transposed convolution layer, a complex batch normalization layer and an activation function.
PCT/CN2021/122350 2021-09-24 2021-09-30 Feature extraction method and device based on time domain and frequency domain of speech signal, and echo cancellation method and device WO2023044962A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111119961.0 2021-09-24
CN202111119961.0A CN113870888A (zh) 2021-09-24 Feature extraction method and device based on time domain and frequency domain of speech signal, and echo cancellation method and device

Publications (1)

Publication Number Publication Date
WO2023044962A1 (zh)

Family

ID=78993692

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/122350 WO2023044962A1 (zh) 2021-09-24 2021-09-30 Feature extraction method and device based on time domain and frequency domain of speech signal, and echo cancellation method and device

Country Status (2)

Country Link
CN (1) CN113870888A (zh)
WO (1) WO2023044962A1 (zh)


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067826B * 2022-01-18 2022-06-07 深圳市友杰智新科技有限公司 Speech noise reduction method, apparatus, device and storage medium
CN114722334B * 2022-04-11 2022-12-27 哈尔滨工程大学 STFT-based online identification method for gas injection timing characteristics of high-pressure in-cylinder direct-injection natural gas engines
CN114495958B * 2022-04-14 2022-07-05 齐鲁工业大学 Speech enhancement system based on a temporal-modeling generative adversarial network
CN115116471B * 2022-04-28 2024-02-13 腾讯科技(深圳)有限公司 Audio signal processing method and apparatus, training method, device and medium
CN114974292A * 2022-05-23 2022-08-30 维沃移动通信有限公司 Audio enhancement method and apparatus, electronic device and readable storage medium
CN115359771B * 2022-07-22 2023-07-07 中国人民解放军国防科技大学 Underwater acoustic signal noise reduction method, system, device and storage medium


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201602382D0 * 2016-02-10 2016-03-23 Cedar Audio Ltd Acoustic source seperation systems
CN109063820A * 2018-06-07 2018-12-21 中国科学技术大学 Data processing method using a time-frequency joint long-term recurrent neural network
CN111081268A * 2019-12-18 2020-04-28 浙江大学 Phase-correlated shared deep convolutional neural network speech enhancement method
CN111261146A * 2020-01-16 2020-06-09 腾讯科技(深圳)有限公司 Speech recognition and model training method, apparatus and computer-readable storage medium
CN112750465A * 2020-12-29 2021-05-04 昆山杜克大学 Cloud-based language ability evaluation system and wearable recording terminal

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230245673A1 (en) * 2022-02-03 2023-08-03 GM Global Technology Operations LLC System and method for processing an audio input signal
US11823703B2 (en) * 2022-02-03 2023-11-21 GM Global Technology Operations LLC System and method for processing an audio input signal
CN116580428A * 2023-07-11 2023-08-11 中国民用航空总局第二研究所 Pedestrian re-identification method based on multi-scale channel attention mechanism
CN116994587A * 2023-09-26 2023-11-03 成都航空职业技术学院 Training supervision system
CN116994587B * 2023-09-26 2023-12-08 成都航空职业技术学院 Training supervision system

Also Published As

Publication number Publication date
CN113870888A (zh) 2021-12-31

Similar Documents

Publication Publication Date Title
WO2023044962A1 (zh) Feature extraction method and device based on time domain and frequency domain of speech signal, and echo cancellation method and device
CN111292759A (zh) Neural-network-based stereo echo cancellation method and system
CN112863535B (zh) Residual echo and noise cancellation method and device
CN111768796A (zh) Acoustic echo cancellation and dereverberation method and device
CN111755020B (zh) Stereo echo cancellation method
TWI559297B (zh) Echo cancellation method and system thereof
CN115132215A (zh) Single-channel speech enhancement method
CN106161820B (zh) Inter-channel decorrelation method for stereophonic acoustic echo cancellation
Zhang et al. A complex spectral mapping with inplace convolution recurrent neural networks for acoustic echo cancellation
CN114242043A (zh) Speech processing method, device, storage medium and program product
CN111370016B (zh) Echo cancellation method and electronic device
Ma et al. Multi-scale attention neural network for acoustic echo cancellation
US11984110B2 (en) Heterogeneous computing for hybrid acoustic echo cancellation
CN113411456B (zh) Voice quality assessment method and device based on speech recognition
Togami et al. Acoustic echo suppressor with multichannel semi-blind non-negative matrix factorization
CN113763978B (zh) Speech signal processing method and apparatus, electronic device, and storage medium
Silva-Rodríguez et al. Acoustic echo cancellation using residual U-Nets
CN114373473A (zh) Simultaneous noise reduction and dereverberation through low-latency deep learning
Zhang et al. Neural Multi-Channel and Multi-Microphone Acoustic Echo Cancellation
Seidel et al. Efficient Deep Acoustic Echo Suppression with Condition-Aware Training
Kaur et al. Performance and convergence analysis of LMS algorithm
Bekrani et al. Neural network based adaptive echo cancellation for stereophonic teleconferencing application
Pathrose et al. MASTER: Microphone Array Source Time Difference Eco canceller via Reconstructed Spiking Neural Network
Yen et al. Artificial Neural Network Algorithm for Acoustic Echo Cancellation Applications
Yoshioka et al. Speech dereverberation and denoising based on time varying speech model and autoregressive reverberation model

Legal Events

Date Code Title Description
121 — EP: the EPO has been informed by WIPO that EP was designated in this application (ref document number: 21958077; country of ref document: EP; kind code of ref document: A1)
NENP — Non-entry into the national phase (ref country code: DE)