WO2021012403A1 - Dual sensor speech enhancement method and implementation device - Google Patents

Dual sensor speech enhancement method and implementation device Download PDF

Info

Publication number
WO2021012403A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
air conduction
air
dual
channel
Prior art date
Application number
PCT/CN2019/110290
Other languages
French (fr)
Chinese (zh)
Inventor
张军
李�学
宁更新
冯义志
余华
季飞
Original Assignee
华南理工大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华南理工大学 filed Critical 华南理工大学
Publication of WO2021012403A1 publication Critical patent/WO2021012403A1/en

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0264Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02165Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/06Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being correlation coefficients
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Definitions

  • the invention relates to the technical field of speech signal processing, in particular to a dual-sensor speech enhancement method and implementation device based on dual-channel Wiener filtering.
  • Speech enhancement technology is an important branch of speech signal processing. The purpose is to extract as much pure original speech as possible from noisy speech, and it is widely used in speech communication, speech compression coding and speech recognition in noisy environments.
  • since human ears perceive sound through the vibration of air, most current speech enhancement algorithms target air-conducted (air-conduction for short) speech, that is, speech collected with air conduction sensors (such as microphones); the enhancement effect is strongly affected by the various acoustic noises in the environment, so performance is usually poor in noisy conditions.
  • to reduce the impact of environmental noise on speech quality, non-air-conduction sensors such as throat microphones and bone conduction microphones are often used for voice collection in noisy environments.
  • non-air-conduction voice sensors use the vibration of the speaker’s vocal cords, jawbone and other parts to deform the reed or carbon film inside the sensor, changing its resistance and hence the voltage across it; the vibration signal is thereby converted into an electrical signal, that is, a voice signal. Since sound waves conducted through the air cannot deform the reed or carbon film of a non-air-conduction sensor, it is unaffected by air-conducted sound and is highly resistant to acoustic noise.
  • because the non-air-conduction sensor collects speech transmitted through the vibration of the jawbone, muscles, skin and other parts, the high-frequency content is severely lost: the sound is muffled and indistinct, and speech intelligibility is poor.
  • Chinese invention patent 201610025390.7 discloses a dual-sensor speech enhancement method and device based on a statistical model.
  • that invention first combines non-air-conducted and air-conducted speech to construct a joint statistical model for classification and endpoint detection, uses the joint model to compute the current optimal air-conduction speech filter and filter-enhance the air-conducted speech, then converts the non-air-conducted speech into air-conducted speech with a non-air-conduction-to-air-conduction mapping model and fuses it with the filter-enhanced speech by weighting
  • the weighted fusion partially resolves the failure to fully exploit the correlation and prior knowledge between the air-conducted speech recovered by the non-air-conduction sensor and the air-conducted speech, but the second fusion step still relies on air-conducted speech recovered from non-air-conducted speech
  • deficiencies therefore remain, such as high-frequency and silence noise, and the failure to use the air-conducted speech information when recovering air-conducted speech from non-air-conducted speech.
  • the purpose of the present invention is to solve the above-mentioned defects in the prior art and provide a dual-sensor speech enhancement method and implementation device based on dual-channel Wiener filtering.
  • the method first exploits the complementarity between air-conducted and non-air-conducted speech to establish a dual-channel speech joint classification model for frame classification of the dual-channel input signals of air-conduction and non-air-conduction sensors, uses this model to classify the speech frames collected by the two channels, and finally constructs a dual-channel Wiener filter from the classification results to filter and enhance the speech signals collected by the two channels.
  • the present invention more fully fuses the information contained in air-conducted and non-air-conducted speech and introduces prior knowledge of the speech signal through a statistical model, which can effectively improve the enhancement effect of a speech enhancement system in a noisy environment.
  • the invention can be widely used in various occasions such as video calls, car phones, multimedia classrooms, and military communications.
  • a dual-sensor voice enhancement method based on dual-channel Wiener filtering includes the following steps:
  • step S3 Use the statistical model of air conduction noise and the dual-channel speech joint classification model in step S1 to classify the synchronized input air conduction test speech frames and non-air conduction test speech frames;
  • step S4: construct a dual-channel Wiener filter according to the classification result of step S3 and the power spectrum mean Φ_vv(ω), and filter the air-conduction test speech frames and non-air-conduction test speech frames to obtain the enhanced air-conducted speech.
  • step S1 is as follows:
  • step S1.2 Use the air-guided speech and non-air-guided speech features obtained in step S1.1 to train a dual-channel speech joint classification model;
  • the dual-channel voice joint classification model adopts a multi-data-stream Gaussian Mixture Model (GMM)
  • N(o,μ,σ) is a Gaussian function
  • o_x(k) and o_b(k) are the feature vectors extracted from the k-th frame of air-conduction test speech and non-air-conduction test speech; μ_l^x and μ_l^b are the means of the l-th Gaussian component of the air-conducted and non-air-conducted voice data streams in the multi-stream GMM; σ_l^x and σ_l^b are the corresponding variances; c_l is the weight of the l-th Gaussian component in the multi-stream GMM; w_x and w_b are the stream weights of the air-conducted and non-air-conducted voice data streams; L is the number of Gaussian components.
  • each Gaussian component in the dual-channel speech joint classification model represents a category, and for each pair of synchronized air-conduction training speech frames and non-air-conduction speech frames, a score q(k,l) for each category is computed
  • the current air-conduction and non-air-conduction training speech frames belong to the category with the highest score; once the category of all training frame pairs is determined, the frames contained in each category are used to compute the air-conducted speech power spectrum mean Φ_ss(ω,l), the non-air-conducted speech power spectrum mean Φ_bb(ω,l), and the cross-spectrum mean Φ_bs(ω,l).
  • the statistical model of air conduction noise is the power spectrum mean of the air-conduction noise Φ_vv(ω), which is calculated as follows:
  • step S2.3 Use the time corresponding to the endpoint of the non-air conduction test voice signal detected in step S2.2 as the endpoint of the air conduction test voice, and extract the pure noise segment in the air conduction test voice;
  • vector Taylor series (VTS) model compensation is first applied: the statistical model of the air-conduction noise is used to correct the parameters of the air-conducted speech data stream in the dual-channel speech joint classification model, and the input air-conduction and non-air-conduction test speech frames are then classified
  • the mean of each Gaussian component of the air-conducted speech data stream in the dual-channel speech joint classification model is corrected as described below.
  • in step S4, for the air-conduction test speech and non-air-conduction test speech synchronously collected at the k-th frame, the enhanced air-conducted speech spectrum Y(ω,k) is computed from the two channels
  • Y(ω,k), X(ω,k), B(ω,k) are the spectra of the enhanced air-conducted speech, the air-conduction test speech and the non-air-conduction test speech at the k-th frame, respectively
  • q(k,l) is the classification score of the k-th frame air-conduction and non-air-conduction test speech for the l-th category of the dual-channel speech joint classification model
  • H_a(ω,k,l) is the Wiener filter frequency response of the k-th frame of air-conduction test speech for the l-th category of the dual-channel speech joint classification model
  • H_na(ω,k,l) is the Wiener filter frequency response of the k-th frame of non-air-conduction test speech for the l-th category of the dual-channel speech joint classification model.
  • a device for implementing a dual-sensor speech enhancement method based on dual-channel Wiener filtering includes an air-conduction speech sensor, a non-air-conduction speech sensor, a noise model estimation module, a dual-channel speech joint classification model, a model compensation module, a frame classification module, a filter coefficient generation module and a dual-channel filter, wherein,
  • the air-conduction and non-air-conduction speech sensors are each connected to the noise model estimation module, the frame classification module and the dual-channel filter; the dual-channel speech joint classification model, model compensation module, frame classification module, filter coefficient generation module and dual-channel filter are connected in sequence; the noise model estimation module is connected to the model compensation module and the filter coefficient generation module; and the dual-channel speech joint classification model is connected to the filter coefficient generation module;
  • the air conduction speech sensor and the non-air conduction speech sensor are used to collect air conduction and non-air conduction speech signals, respectively, and the noise model estimation module is used to estimate the current air conduction noise model and power spectrum.
  • the dual-channel voice joint classification model is built from synchronously collected clean air-conduction training speech and non-air-conduction training speech frames; for each category of the model:
  • the mean of the air-conducted speech power spectrum is Φ_ss(ω,l)
  • the mean of the power spectrum of non-air-conducted speech is Φ_bb(ω,l)
  • the mean of the cross-spectrum between air-conducted and non-air-conducted speech is Φ_bs(ω,l)
  • the model compensation module uses the statistical model of air-conduction noise to correct the parameters of the dual-channel speech joint classification model, and the frame classification module classifies the currently synchronized input air-conduction and non-air-conduction test speech frames
  • the air-conducted speech sensor is a microphone
  • the non-air-conducted speech sensor is a throat microphone
  • the present invention has the following advantages and effects:
  • the present invention uses the information of both the air-conduction test speech and the non-air-conduction test speech when enhancing, and can achieve a better enhancement effect.
  • the present invention adopts a dual-channel speech joint classification model to fuse information of air conduction test speech and non-air conduction test speech, which can make frame classification more accurate and make full use of the correlation and prior knowledge of the two.
  • the present invention uses a dual-channel Wiener filter to recover the air-conducted speech; compared with Chinese invention patent 201610025390.7, the computation is simpler, and the high-frequency and silence noise introduced when recovering air-conducted speech from non-air-conducted speech, as well as the failure to use the air-conducted speech information, are avoided, giving better performance.
  • the present invention uses a dual-channel Wiener filter to recover air-conducted speech, avoiding the assumption that non-air-conducted speech and air-conducted speech are mutually independent.
  • Figure 1 is a structural block diagram of a device for implementing a dual-sensor voice enhancement method based on dual-channel Wiener filtering disclosed in an embodiment of the present invention
  • Fig. 2 is a flowchart of a dual-sensor speech enhancement method based on dual-channel Wiener filtering disclosed in an embodiment of the present invention.
  • This embodiment discloses a structural block diagram of a device for implementing a dual-sensor speech enhancement method based on dual-channel Wiener filtering.
  • the device is composed of an air-conduction speech sensor, a non-air-conduction speech sensor, a noise model estimation module, a dual-channel speech joint classification model, a model compensation module, a frame classification module, a filter coefficient generation module and a dual-channel filter.
  • the air-conduction and non-air-conduction speech sensors are each connected to the noise model estimation module, the frame classification module and the dual-channel filter; the dual-channel speech joint classification model, model compensation module, frame classification module, filter coefficient generation module and dual-channel filter are connected in sequence; the noise model estimation module is connected to the model compensation module and the filter coefficient generation module; and the dual-channel speech joint classification model is connected to the filter coefficient generation module.
  • the air-conducted speech sensor is a microphone
  • the non-air-conducted speech sensor is a throat microphone; the two are used to collect the air-conduction and non-air-conduction speech signals
  • the noise model estimation module is used to estimate the model and power spectrum of the current air-conduction noise.
  • the dual-channel voice joint classification model is built from the synchronously collected clean air-conduction training speech and non-air-conduction training speech frames.
  • for each category of the dual-channel speech joint classification model, the air-conducted speech power spectrum mean is Φ_ss(ω,l), the non-air-conducted speech power spectrum mean is Φ_bb(ω,l), and the cross-spectrum mean between air-conducted and non-air-conducted speech is Φ_bs(ω,l).
  • the model compensation module uses the statistical model of air conduction noise to correct the parameters of the dual-channel speech joint classification model.
  • the frame classification module classifies the air conduction test speech and non-air conduction test speech frames input simultaneously.
  • the filter coefficient generation module constructs a dual-channel Wiener filter based on the classification result and the power spectrum of air conduction noise.
  • the dual-channel filter filters air conduction test speech frames and non-air conduction test speech frames to obtain enhanced air conduction speech.
  • this embodiment discloses a dual-sensor speech enhancement method based on dual-channel Wiener filtering; with the implementation device disclosed in the above embodiment, the enhanced air-conducted speech is computed from the input air-conduction and non-air-conduction test speech through the following steps, whose flow is shown in Figure 2:
  • Step S1: synchronously collect clean air-conduction training speech and non-air-conduction training speech, establish the dual-channel speech joint classification model of air-conducted and non-air-conducted speech frames, and compute, for each category of the model, the air-conducted speech power spectrum mean Φ_ss(ω,l), the non-air-conducted speech power spectrum mean Φ_bb(ω,l), and the cross-spectrum mean between air-conducted and non-air-conducted speech Φ_bs(ω,l)
  • ω is the frequency
  • l is the serial number of the category.
  • the synchronously collected clean air-conduction training speech and non-air-conduction training speech are divided into frames with a frame length of 30 ms and a frame shift of 10 ms.
  • each frame of clean air-conduction and non-air-conduction training speech is windowed with a Hamming window and pre-emphasized, and its power spectrum is then computed.
  • the power spectra of the air-conduction and non-air-conduction training speech are passed through a 24-dimensional mel filter bank; the logarithm of the filter bank output is DCT-transformed to obtain two sets of 12-dimensional mel-frequency cepstral coefficients, which serve as the training features of the dual-channel speech joint classification model.
  • step S1.2 Use the air-guided speech and non-air-guided speech features obtained in step S1.1 to train a dual-channel speech joint classification model.
  • the dual-channel voice joint classification model adopts a multi-data-stream GMM
  • N(o,μ,σ) is a Gaussian function
  • o_x(k) and o_b(k) are the feature vectors extracted from the k-th frame of air-conduction test speech and non-air-conduction test speech; μ_l^x and μ_l^b are the means of the l-th Gaussian component of the air-conducted and non-air-conducted voice data streams in the multi-stream GMM; σ_l^x and σ_l^b are the corresponding variances; c_l is the weight of the l-th Gaussian component; w_x and w_b are the stream weights of the two voice data streams; L is the number of Gaussian components.
  • each Gaussian component in the dual-channel speech joint classification model represents a category.
  • for each pair of synchronized training frames, a score q(k,l) for each category is computed
  • the current air-conduction training speech frame and non-air-conduction speech frame belong to the category with the highest score. Once the category of all training frame pairs is determined, the frames contained in each category are used to compute the air-conducted speech power spectrum mean Φ_ss(ω,l), the non-air-conducted speech power spectrum mean Φ_bb(ω,l), and the cross-spectrum mean between air-conducted and non-air-conducted speech Φ_bs(ω,l).
  • Step S2: collect air-conduction test speech and non-air-conduction test speech synchronously, use the pure-noise segments of the air-conduction test speech to establish a statistical model of the air-conduction noise, and compute the power spectrum mean Φ_vv(ω) of the air-conduction noise.
  • the statistical model of air-conduction noise is the power spectrum mean of the air-conduction noise Φ_vv(ω), which is calculated as follows:
  • step S2.3 Use the time corresponding to the endpoint of the non-air conduction test voice signal detected in step S2.2 as the endpoint of the air conduction test voice, and extract the pure noise segment in the air conduction test voice;
  • the statistical model of air conduction noise is Gaussian function, GMM model or HMM model.
  • Step S3 Use the statistical model of air conduction noise and the two-channel speech joint classification model in step S1 to classify the air conduction test speech frames and non-air conduction test speech frames input simultaneously.
  • VTS model compensation is first applied: the statistical model of the air-conduction noise is used to correct the parameters of the air-conducted speech data stream in the dual-channel speech joint classification model, and the input air-conduction and non-air-conduction test speech frames are then classified
  • the specific method is to correct the mean of each Gaussian component of the air-conducted speech data stream in the dual-channel speech joint classification model using the VTS compensation formula
  • m_l^s and m_l^v are the means of the logarithms of the power spectra of the clean air-conduction training speech belonging to the l-th class and of the noise after passing through the 24-dimensional mel filter bank, and C is the discrete cosine transform (DCT) matrix.
  • Step S4: construct a dual-channel Wiener filter according to the classification result of step S3 and Φ_vv(ω), and filter the air-conduction test speech frames and non-air-conduction test speech frames to obtain the enhanced air-conducted speech.
  • Y(ω,k), X(ω,k), B(ω,k) are the spectra of the enhanced air-conducted speech, the air-conduction test speech and the non-air-conduction test speech at the k-th frame, respectively.
  • q(k,l) is the classification score of the k-th frame air-conduction and non-air-conduction test speech for the l-th category of the dual-channel speech joint classification model.
  • H_a(ω,k,l) is the Wiener filter frequency response of the k-th frame air-conduction test speech for the l-th category of the dual-channel speech joint classification model.
  • H_na(ω,k,l) is the Wiener filter frequency response of the k-th frame non-air-conduction test speech for the l-th category of the dual-channel speech joint classification model.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A dual-sensor speech enhancement method and an implementation device based on dual-channel Wiener filtering. The method comprises: firstly, building a dual-channel speech joint classification model for performing frame classification on the dual-channel input signals of an air conduction sensor and a non-air conduction sensor by using the complementarity between air-conducted speech and non-air-conducted speech; secondly, classifying the speech frames collected by the two channels by using the model; and finally, constructing a dual-channel Wiener filter according to the classification result, and performing filtering enhancement on the speech signals collected by the two channels. The information contained in the air-conducted speech and the non-air-conducted speech is fused more fully, and prior knowledge of the speech signals is introduced by means of a statistical model, so that the enhancement effect of a speech enhancement system in a noisy environment can be effectively improved. The dual-sensor speech enhancement method can be widely applied to many occasions, such as video calls, vehicle calls, multimedia classrooms and military communications.

Description

Dual-sensor speech enhancement method and implementation device

Technical Field
The invention relates to the technical field of speech signal processing, and in particular to a dual-sensor speech enhancement method and implementation device based on dual-channel Wiener filtering.
Background Art
In actual voice communication, the speech signal is often disturbed by environmental noise, which degrades the quality of the received speech. Speech enhancement technology is an important branch of speech signal processing; its purpose is to extract the pure original speech from noisy speech as far as possible, and it is widely used in speech communication, speech compression coding and speech recognition in noisy environments.
Since human ears perceive sound through the vibration of air, most current speech enhancement algorithms target air-conducted (air-conduction for short) speech, that is, speech collected with air conduction sensors such as microphones. Their enhancement effect is strongly affected by the various acoustic noises in the environment, and performance is usually poor in noisy surroundings. To reduce the impact of environmental noise on speech quality, non-air-conduction sensors such as throat microphones and bone conduction microphones are often used for speech collection in noisy environments. Unlike air conduction sensors, non-air-conduction speech sensors use the vibration of the speaker's vocal cords, jawbone and other parts to deform the reed or carbon film inside the sensor, changing its resistance and hence the voltage across it; the vibration signal is thereby converted into an electrical signal, i.e. a speech signal. Since sound waves conducted through the air cannot deform the reed or carbon film, a non-air-conduction sensor is unaffected by air-conducted sound and is highly resistant to acoustic noise. However, because a non-air-conduction sensor collects speech transmitted through the vibration of the jawbone, muscles, skin and other parts, the high-frequency content is severely lost: the sound is muffled and indistinct, and intelligibility is poor.
Given that air-conduction and non-air-conduction sensors each have shortcomings when used alone, speech enhancement methods combining the advantages of both have appeared in recent years. These methods exploit the complementarity of air-conducted and non-air-conducted speech and use multi-sensor fusion to achieve enhancement, usually obtaining better results than single-sensor systems. Existing dual-sensor speech enhancement mainly takes two forms: one first recovers air-conducted speech from the non-air-conducted speech and then fuses it with the noisy air-conducted speech; the other recovers air-conducted speech from the non-air-conducted speech, enhances the noisy air-conducted speech using the air-conduction and non-air-conduction sensor signals, and then fuses the two. These techniques have the following shortcomings: (1) recovering air-conducted speech from non-air-conducted speech introduces additional noise at high frequencies or in silence, degrading the enhancement effect; (2) the information of the current air-conducted speech is not used during recovery; (3) when the recovered air-conducted speech is fused with the air-conducted speech, the correlation and prior knowledge of the two are not fully exploited; (4) fusion usually assumes that non-air-conducted and air-conducted speech are mutually independent, an assumption that does not hold in practice.
Chinese invention patent 201610025390.7 discloses a dual-sensor speech enhancement method and device based on a statistical model. That invention first combines non-air-conducted and air-conducted speech to construct a joint statistical model for classification and endpoint detection, uses the joint model to compute the current optimal air-conduction speech filter and filter-enhance the air-conducted speech, then converts the non-air-conducted speech into air-conducted speech with a non-air-conduction-to-air-conduction mapping model and fuses it with the filter-enhanced air-conducted speech by weighting. This partially resolves the failure to fully exploit the correlation and prior knowledge between the recovered air-conducted speech and the air-conducted speech during fusion, but the second fusion step still relies on air-conducted speech recovered from non-air-conducted speech, so the same deficiencies remain: high-frequency and silence noise, and the failure to use the air-conducted speech information when recovering air-conducted speech from non-air-conducted speech.
Summary of the Invention
The purpose of the present invention is to overcome the above defects of the prior art and provide a dual-sensor speech enhancement method and implementation device based on dual-channel Wiener filtering. The method first exploits the complementarity between air-conducted and non-air-conducted speech to establish a dual-channel speech joint classification model for frame classification of the dual-channel input signals of the air-conduction and non-air-conduction sensors, uses this model to classify the speech frames collected by the two channels, and finally constructs a dual-channel Wiener filter from the classification results to filter and enhance the speech signals collected by the two channels. Compared with the prior art, the present invention more fully fuses the information contained in air-conducted and non-air-conducted speech and introduces prior knowledge of the speech signal through a statistical model, which can effectively improve the enhancement effect of a speech enhancement system in a noisy environment. The invention can be widely applied in many settings, such as video calls, car phones, multimedia classrooms and military communications.
The first objective of the present invention can be achieved by adopting the following technical solution:

A dual-sensor speech enhancement method based on dual-channel Wiener filtering, the method comprising the following steps:
S1. Synchronously collect clean air-conduction training speech and non-air-conduction training speech, establish a dual-channel speech joint classification model of air-conducted speech frames and non-air-conducted speech frames, and compute, for each category of the model, the air-conducted speech power spectrum mean Φ_ss(ω,l), the non-air-conducted speech power spectrum mean Φ_bb(ω,l), and the cross-spectrum mean between air-conducted and non-air-conducted speech Φ_bs(ω,l), where ω is the frequency and l is the serial number of the category;

S2. Synchronously collect air-conduction test speech and non-air-conduction test speech, use the pure-noise segments of the air-conduction test speech to establish a statistical model of the air-conduction noise, and compute the power spectrum mean Φ_vv(ω) of the air-conduction noise;

S3. Use the statistical model of the air-conduction noise and the dual-channel speech joint classification model of step S1 to classify the synchronously input air-conduction test speech frames and non-air-conduction test speech frames;

S4. Construct a dual-channel Wiener filter according to the classification result of step S3 and the power spectrum mean Φ_vv(ω), and filter the air-conduction test speech frames and non-air-conduction test speech frames to obtain the enhanced air-conducted speech.
Further, the process of step S1 is as follows:

S1.1. Divide the synchronously collected clean air-conduction training speech and non-air-conduction training speech into frames, preprocess them, and extract the feature parameters of each frame, namely the mel-frequency cepstral coefficients;

S1.2. Use the air-conducted and non-air-conducted speech features obtained in step S1.1 to train the dual-channel speech joint classification model;

S1.3. Use the trained dual-channel speech joint classification model to classify all air-conduction training speech frames and non-air-conduction speech frames, and then compute, for the frames contained in each category, the air-conducted speech power spectrum mean Φ_ss(ω,l), the non-air-conducted speech power spectrum mean Φ_bb(ω,l), and the cross-spectrum mean between air-conducted and non-air-conducted speech Φ_bs(ω,l).
Further, in step S1.2, the dual-channel speech joint classification model adopts a multi-data-stream Gaussian mixture model (GMM), i.e.

$$p\big(o_x(k),o_b(k)\big) = \sum_{l=1}^{L} c_l \left[ N\big(o_x(k),\mu_l^x,\sigma_l^x\big) \right]^{w_x} \left[ N\big(o_b(k),\mu_l^b,\sigma_l^b\big) \right]^{w_b}$$

where N(o,μ,σ) is a Gaussian function; o_x(k) and o_b(k) are the feature vectors extracted from the k-th frame of air-conduction test speech and non-air-conduction test speech; μ_l^x and μ_l^b are the means of the l-th Gaussian component of the air-conducted speech data stream and the non-air-conducted speech data stream in the multi-stream GMM; σ_l^x and σ_l^b are the variances of the l-th Gaussian component of the two streams; c_l is the weight of the l-th Gaussian component in the multi-stream GMM; w_x and w_b are the weights of the air-conducted and non-air-conducted speech data streams; and L is the number of Gaussian components.
Further, in step S1.3, each Gaussian component of the dual-channel speech joint classification model represents one category. For each pair of synchronized air-conduction training speech frames and non-air-conduction speech frames, the score for each category is computed as

$$q(k,l) = c_l \left[ N\big(o_x(k),\mu_l^x,\sigma_l^x\big) \right]^{w_x} \left[ N\big(o_b(k),\mu_l^b,\sigma_l^b\big) \right]^{w_b}$$

The current air-conduction training speech frame and non-air-conduction speech frame belong to the category with the highest score. Once the category of every pair of training frames has been determined, the frames contained in each category are used to compute the air-conducted speech power spectrum mean Φ_ss(ω,l), the non-air-conducted speech power spectrum mean Φ_bb(ω,l), and the cross-spectrum mean between air-conducted and non-air-conducted speech Φ_bs(ω,l).
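As an illustration of the scoring rule above, the following Python sketch evaluates q(k,l) for diagonal-covariance streams in the log domain; the array shapes, default stream weights and function names are assumptions made for the example, not part of the patent.

```python
import numpy as np

def log_gauss_diag(o, mu, var):
    # log N(o; mu, var) for a diagonal covariance, summed over feature dims
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var, axis=-1)

def stream_gmm_scores(o_x, o_b, c, mu_x, var_x, mu_b, var_b, w_x=0.5, w_b=0.5):
    """q[k, l] = c_l * N(o_x(k))^w_x * N(o_b(k))^w_b, computed in the log domain.
    o_x, o_b: (K, D) feature matrices; mu_*, var_*: (L, D); c: (L,) weights."""
    log_q = (np.log(c)[None, :]
             + w_x * log_gauss_diag(o_x[:, None, :], mu_x[None], var_x[None])
             + w_b * log_gauss_diag(o_b[:, None, :], mu_b[None], var_b[None]))
    return np.exp(log_q)          # (K, L) classification scores q(k, l)

# the category of each frame pair is the highest-scoring component:
# labels = np.argmax(stream_gmm_scores(...), axis=1)
```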
Further, the statistical model of the air-conduction noise is the power spectrum mean of the air-conduction noise Φ_vv(ω), which is computed as follows (a runnable sketch follows this list):

S2.1. Synchronously collect the air-conduction test speech and non-air-conduction test speech and divide them into frames;

S2.2. From the short-time autocorrelation function R_b(m) and the short-time energy E_b of each non-air-conduction test speech frame, compute its short-time average threshold-crossing rate C_b:

$$C_b = \frac{1}{2M} \sum_{m=1}^{M-1} \Big( \big|\operatorname{sgn}[R_b(m) - \alpha T] - \operatorname{sgn}[R_b(m-1) - \alpha T]\big| + \big|\operatorname{sgn}[R_b(m) + \alpha T] - \operatorname{sgn}[R_b(m-1) + \alpha T]\big| \Big)$$

where sgn[·] is the sign operation, α is an adjustment factor, T is the initial threshold, and M is the frame length. When C_b is greater than a preset threshold, the frame is judged to be a speech signal; otherwise it is noise. The endpoint positions of the non-air-conduction test speech signal are obtained from the per-frame decisions;

S2.3. Take the instants corresponding to the detected endpoints of the non-air-conduction test speech signal as the endpoints of the air-conduction test speech, and extract the pure-noise segments of the air-conduction test speech;

S2.4. Compute the power spectrum mean Φ_vv(ω) of the pure-noise segments of the air-conduction test speech.
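A minimal sketch of the S2.1–S2.4 noise estimation under the formula above; the FFT size, the default thresholds, and tying the adjustment factor to the frame energy E_b are illustrative assumptions.

```python
import numpy as np

def threshold_crossing_rate(frame, alpha=1.0, T=0.01):
    """C_b of one non-air-conduction frame, from its short-time autocorrelation
    R_b(m); scaling the threshold alpha*T by the short-time energy E_b is an
    assumed form of the adjustment."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # R_b(m)
    th = alpha * T * max(np.sum(frame ** 2), 1e-12)                # E_b-scaled
    s_pos, s_neg = np.sign(r - th), np.sign(r + th)
    return (np.abs(np.diff(s_pos)).sum()
            + np.abs(np.diff(s_neg)).sum()) / (2 * len(r))

def noise_psd_mean(air_frames, nonair_frames, c_thresh=0.05, nfft=512):
    """Phi_vv(w): mean power spectrum of the air-conduction frames whose
    synchronous non-air-conduction frame is judged to be pure noise."""
    noise = [np.abs(np.fft.rfft(a, nfft)) ** 2
             for a, b in zip(air_frames, nonair_frames)
             if threshold_crossing_rate(b) <= c_thresh]   # small C_b -> noise
    return np.mean(noise, axis=0) if noise else np.zeros(nfft // 2 + 1)
```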
Further, step S3 first applies vector Taylor series (VTS) model compensation: the statistical model of the air-conduction noise is used to correct the parameters of the air-conducted speech data stream in the dual-channel speech joint classification model, and the input air-conduction test speech frames and non-air-conduction test speech frames are then classified. The mean of each Gaussian component of the air-conducted speech data stream is corrected as

$$\hat{\mu}_l^x = C \ln\!\big( \exp(m_l^s) + \exp(m_l^v) \big)$$

where m_l^s and m_l^v are, respectively, the means of the logarithms of the power spectra of the clean air-conduction training speech belonging to the l-th class and of the noise after passing through the 24-dimensional mel filter bank, and C is the DCT transform matrix. The other parameters of the dual-channel speech joint classification model remain unchanged. The corrected model is used to classify the synchronously input air-conduction test speech frames and non-air-conduction test speech frames, yielding the classification score q(k,l) of the current frames for each category.
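The corrected mean can be computed in a few lines. The sketch below assumes the embodiment's 24 mel channels and a 12×24 type-II DCT matrix; the log-sum-exp arrangement is only a numerical-stability choice.

```python
import numpy as np
from scipy.fftpack import dct

# 12 x 24 DCT matrix taking 24 log-mel energies to 12 cepstral coefficients
C = dct(np.eye(24), type=2, norm="ortho", axis=0)[:12]

def vts_compensated_mean(m_s, m_v):
    """mu_hat_l^x = C * ln(exp(m_l^s) + exp(m_l^v)) for one class l;
    m_s, m_v: 24-dim log-mel power-spectrum means of clean speech and noise."""
    m = np.maximum(m_s, m_v)                       # stabilized log-sum-exp
    return C @ (m + np.log(np.exp(m_s - m) + np.exp(m_v - m)))
```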
Further, in step S4, for the air-conduction test speech and non-air-conduction test speech synchronously collected at the k-th frame, the enhanced air-conducted speech spectrum is computed as

$$Y(\omega,k) = \bar{H}_a(\omega,k)\, X(\omega,k) + \bar{H}_{na}(\omega,k)\, B(\omega,k)$$

where Y(ω,k), X(ω,k) and B(ω,k) are the spectra of the enhanced air-conducted speech, the air-conduction test speech and the non-air-conduction test speech at the k-th frame, and H̄_a(ω,k) and H̄_na(ω,k) are the frequency responses of the Wiener filters applied to the k-th frame of air-conduction and non-air-conduction test speech, computed as

$$\bar{H}_a(\omega,k) = \frac{\sum_{l=1}^{L} q(k,l)\, H_a(\omega,k,l)}{\sum_{l=1}^{L} q(k,l)}, \qquad \bar{H}_{na}(\omega,k) = \frac{\sum_{l=1}^{L} q(k,l)\, H_{na}(\omega,k,l)}{\sum_{l=1}^{L} q(k,l)}$$

In these expressions q(k,l) is the classification score of the k-th frame air-conduction and non-air-conduction test speech for the l-th category of the dual-channel speech joint classification model; H_a(ω,k,l) is the Wiener filter frequency response of the k-th frame of air-conduction test speech for the l-th category, computed as

$$H_a(\omega,k,l) = \frac{\Phi_{ss}(\omega,l)\,\Phi_{bb}(\omega,l) - |\Phi_{bs}(\omega,l)|^2}{\big[\Phi_{ss}(\omega,l) + \Phi_{vv}(\omega)\big]\,\Phi_{bb}(\omega,l) - |\Phi_{bs}(\omega,l)|^2}$$

and H_na(ω,k,l) is the Wiener filter frequency response of the k-th frame of non-air-conduction test speech for the l-th category, computed as

$$H_{na}(\omega,k,l) = \frac{\Phi_{vv}(\omega)\,\Phi_{bs}^{*}(\omega,l)}{\big[\Phi_{ss}(\omega,l) + \Phi_{vv}(\omega)\big]\,\Phi_{bb}(\omega,l) - |\Phi_{bs}(\omega,l)|^2}$$
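The following sketch assembles the per-class filters and the score-weighted fusion for a single frame, directly from the quantities Φ_ss(ω,l), Φ_bb(ω,l), Φ_bs(ω,l) and Φ_vv(ω) defined above; the array shapes and the small regularizer eps are assumptions.

```python
import numpy as np

def dual_channel_wiener(X, B, q_k, phi_ss, phi_bb, phi_bs, phi_vv, eps=1e-12):
    """One-frame enhancement. X, B: (F,) spectra of the air and non-air channels;
    q_k: (L,) classification scores q(k, l); phi_ss/phi_bb/phi_bs: (L, F)
    per-class spectral means; phi_vv: (F,) noise power-spectrum mean."""
    det = (phi_ss + phi_vv) * phi_bb - np.abs(phi_bs) ** 2 + eps   # shared denominator
    H_a = (phi_ss * phi_bb - np.abs(phi_bs) ** 2) / det            # (L, F) per-class H_a
    H_na = phi_vv * np.conj(phi_bs) / det                          # (L, F) per-class H_na
    w = q_k / max(q_k.sum(), eps)                                  # normalized q(k, l)
    return (w @ H_a) * X + (w @ H_na) * B                          # Y(w, k)
```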
Further, H̄_a(ω,k) and H̄_na(ω,k) may instead be computed as

$$\bar{H}_a(\omega,k) = H_a(\omega,k,l^*), \qquad \bar{H}_{na}(\omega,k) = H_{na}(\omega,k,l^*), \qquad l^* = \arg\max_l q(k,l)$$

that is, using the Wiener filters of the highest-scoring category.
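A hard-decision counterpart of the sketch above, assuming the per-class filter banks H_a and H_na have already been computed as (L, F) arrays.

```python
import numpy as np

def dual_channel_wiener_hard(X, B, q_k, H_a, H_na):
    """Use only the filters of the best-scoring category l* = argmax_l q(k,l)."""
    l_star = int(np.argmax(q_k))
    return H_a[l_star] * X + H_na[l_star] * B
```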
Another objective of the present invention is achieved through the following technical solution:

A device for implementing the dual-sensor speech enhancement method based on dual-channel Wiener filtering, the device comprising an air-conduction speech sensor, a non-air-conduction speech sensor, a noise model estimation module, a dual-channel speech joint classification model, a model compensation module, a frame classification module, a filter coefficient generation module and a dual-channel filter, wherein,

the air-conduction and non-air-conduction speech sensors are each connected to the noise model estimation module, the frame classification module and the dual-channel filter; the dual-channel speech joint classification model, the model compensation module, the frame classification module, the filter coefficient generation module and the dual-channel filter are connected in sequence; the noise model estimation module is connected to the model compensation module and the filter coefficient generation module; and the dual-channel speech joint classification model is connected to the filter coefficient generation module;

the air-conduction and non-air-conduction speech sensors are used to collect the air-conduction and non-air-conduction speech signals respectively; the noise model estimation module is used to estimate the model and power spectrum of the current air-conduction noise; the dual-channel speech joint classification model is built from synchronously collected clean air-conduction training speech and non-air-conduction training speech frames, and for each category of the model the air-conducted speech power spectrum mean is Φ_ss(ω,l), the non-air-conducted speech power spectrum mean is Φ_bb(ω,l), and the cross-spectrum mean between air-conducted and non-air-conducted speech is Φ_bs(ω,l); the model compensation module uses the statistical model of the air-conduction noise to correct the parameters of the dual-channel speech joint classification model; the frame classification module classifies the currently synchronized input air-conduction and non-air-conduction test speech frames; the filter coefficient generation module constructs a dual-channel Wiener filter from the classification result and the power spectrum of the air-conduction noise; and the dual-channel filter filters the air-conduction and non-air-conduction test speech frames to obtain the enhanced air-conducted speech.

Further, the air-conduction speech sensor is a microphone and the non-air-conduction speech sensor is a throat microphone.
Compared with the prior art, the present invention has the following advantages and effects:

(1) Compared with speech enhancement techniques based only on air-conduction test speech or only on non-air-conduction test speech, the present invention uses the information of both during enhancement and can achieve a better enhancement effect.

(2) The present invention uses a dual-channel speech joint classification model to fuse the information of the air-conduction and non-air-conduction test speech, which makes frame classification more accurate and fully exploits the correlation and prior knowledge of the two.

(3) The present invention uses a dual-channel Wiener filter to recover the air-conducted speech; compared with Chinese invention patent 201610025390.7, the computation is simpler, and the high-frequency and silence noise introduced when recovering air-conducted speech from non-air-conducted speech, as well as the failure to use the air-conducted speech information, are avoided, giving better performance.

(4) The present invention uses a dual-channel Wiener filter to recover the air-conducted speech, avoiding the assumption that non-air-conducted and air-conducted speech are mutually independent.
Description of the Drawings

Figure 1 is a structural block diagram of the implementation device of the dual-sensor speech enhancement method based on dual-channel Wiener filtering disclosed in an embodiment of the present invention;

Figure 2 is a flowchart of the dual-sensor speech enhancement method based on dual-channel Wiener filtering disclosed in an embodiment of the present invention.
Detailed Description of the Embodiments

To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative work fall within the protection scope of the present invention.
Embodiment 1
本实施例公开了一种基于双通道维纳滤波的双传感器语音增强方法的实现装置的结构框图,如图1所示,由气导语音传感器、非气导语音传感器、噪声模型估计模块、双通道语音联合分类模型、模型补偿模块、帧分类模块、滤波器系数生成模块、双通道滤波器共同构成,其中,气导语音传感器和非气导语音传感器分别与噪声模型估计模块、帧分类模块、双通道滤波器连接,双通道语音联合分类模型、模型补偿模块、帧分类模块、滤波器系数生成模块、双通道滤波器顺次连接,噪声模型估计模块与模型补偿模块、滤波器系数生成模块连接,双通道语音联合分类模型与滤波器系数生成模块连接。This embodiment discloses a structural block diagram of a device for implementing a dual-sensor speech enhancement method based on dual-channel Wiener filtering. As shown in FIG. 1, the air-conducted speech sensor, non-air-conducted speech sensor, noise model estimation module, and dual Channel speech joint classification model, model compensation module, frame classification module, filter coefficient generation module, and dual-channel filter are jointly constituted. Among them, air-conducted speech sensor and non-air-conducted speech sensor are respectively combined with noise model estimation module, frame classification module, Dual-channel filter connection, dual-channel speech joint classification model, model compensation module, frame classification module, filter coefficient generation module, dual-channel filter are connected in sequence, noise model estimation module is connected with model compensation module, filter coefficient generation module , The dual-channel speech joint classification model is connected to the filter coefficient generation module.
本实施例中,气导语音传感器为麦克风,非气导语音传感器为喉部送话器,两者用于采集气导和非气导语音信号;噪声模型估计模块用于估计当前气导噪声的模型和功率谱。双通道语音联合分类模型采用同步采集的干净气导训练语音和非气导训练语音建立气导语音帧和非气导语音帧,上述双通道语音联合分类模型中每个分类的气导语音功率谱均值Φ ss(ω,l)、非气导语音功率谱均值Φ bb(ω,l)、气导语音和非气导语音之间的互谱均值Φ bs(ω,l)。模型补偿模块利用气导噪声的统计模型对双通道语音联合分类模型的参数进行修正。帧分类模块对当前同步输入的气导测试语音和非气导测试语音帧进行分类。滤波器系数生成模块根据分类结果和气导噪声的功率谱构建双通道维纳滤波器。双通道滤波器对气导测试语音帧和非气导测试语音帧进行滤波,得到增强后的气导语音。 In this embodiment, the air-conducted speech sensor is a microphone, and the non-air-conducted speech sensor is a larynx microphone, both of which are used to collect air-conducted and non-air-conducted speech signals; the noise model estimation module is used to estimate the current air conduction noise Model and power spectrum. The dual-channel voice joint classification model uses the synchronously collected clean air conduction training speech and non-air conduction training speech to establish air conduction speech frames and non-air conduction speech frames. The air conduction speech power spectrum of each category in the above two-channel speech joint classification model The mean value Φ ss (ω,l), the mean value of the power spectrum of non-air-guided speech Φ bb (ω,l), the mean value of cross-spectrum between air-guided speech and non-air-guided speech Φ bs (ω,l). The model compensation module uses the statistical model of air conduction noise to correct the parameters of the dual-channel speech joint classification model. The frame classification module classifies the air conduction test speech and non-air conduction test speech frames input simultaneously. The filter coefficient generation module constructs a dual-channel Wiener filter based on the classification result and the power spectrum of air conduction noise. The dual-channel filter filters air conduction test speech frames and non-air conduction test speech frames to obtain enhanced air conduction speech.
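To make the dataflow of Figure 1 concrete, the following minimal Python sketch wires the modules together in the order just described. All class and attribute names are illustrative assumptions introduced for this document, not part of the patent; each module is modeled as a plain callable.

class DualSensorEnhancer:
    def __init__(self, noise_estimator, joint_model, compensator,
                 frame_classifier, coeff_generator, dual_filter):
        self.noise_estimator = noise_estimator    # noise model estimation module
        self.joint_model = joint_model            # dual-channel joint classification model
        self.compensator = compensator            # model compensation module
        self.frame_classifier = frame_classifier  # frame classification module
        self.coeff_generator = coeff_generator    # filter coefficient generation module
        self.dual_filter = dual_filter            # dual-channel Wiener filter

    def enhance(self, air_frames, nonair_frames):
        # Estimate the air-conduction noise power spectrum from the air channel,
        # with the non-air channel available for endpoint detection (S2.1 to S2.4).
        phi_vv = self.noise_estimator(air_frames, nonair_frames)
        # Correct the air-conducted stream of the joint model for the noise (S3).
        model_c = self.compensator(self.joint_model, phi_vv)
        # Classify each synchronous frame pair, build the per-frame filters,
        # and filter both channels to get the enhanced air-conducted speech (S4).
        scores = self.frame_classifier(model_c, air_frames, nonair_frames)
        coeffs = self.coeff_generator(scores, phi_vv)
        return self.dual_filter(air_frames, nonair_frames, coeffs)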
Embodiment 2
This embodiment discloses a dual-sensor speech enhancement method based on dual-channel Wiener filtering. Using the implementation device disclosed in the above embodiment, the following steps compute the enhanced air-conducted speech from the input air-conducted and non-air-conducted test speech; the flow is shown in Figure 2:
Step S1: Synchronously collect clean air-conducted and non-air-conducted training speech, build a dual-channel joint classification model of air-conducted and non-air-conducted speech frames, and compute, for each class of the model, the mean air-conducted speech power spectrum Φ_ss(ω,l), the mean non-air-conducted speech power spectrum Φ_bb(ω,l), and the mean cross-spectrum between the air-conducted and non-air-conducted speech Φ_bs(ω,l), where ω is the frequency and l is the index of the class.
In this embodiment, this is done in the following steps:
S1.1. Divide the synchronously collected clean air-conducted and non-air-conducted training speech into frames, preprocess them, and extract the feature parameters of each frame.
In this embodiment, the clean air-conducted and non-air-conducted training speech is divided into frames with a frame length of 30 ms and a frame shift of 10 ms. Each frame of both signals is windowed with a Hamming window and pre-emphasized, and its power spectrum is computed. The power spectra of the air-conducted and non-air-conducted training speech are passed through a 24-band mel filter bank, the logarithm of the filter-bank outputs is taken, and a DCT is applied, yielding two sets of 12-dimensional mel-frequency cepstral coefficients that serve as the training features of the dual-channel joint speech classification model.
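As an illustration of S1.1, the following Python sketch computes the per-frame power spectra and 12-dimensional MFCCs with the stated parameters (30 ms frames, 10 ms shift, Hamming window, 24 mel filters, DCT). The sampling rate, FFT size, and 0.97 pre-emphasis coefficient are assumptions not fixed by the patent.

import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filt=24, n_fft=512, sr=8000):
    # Triangular filters spaced uniformly on the mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filt + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):
        lo, ce, hi = bins[i], bins[i + 1], bins[i + 2]
        fb[i, lo:ce] = (np.arange(lo, ce) - lo) / max(ce - lo, 1)
        fb[i, ce:hi] = (hi - np.arange(ce, hi)) / max(hi - ce, 1)
    return fb

def mfcc_frames(x, sr=8000, frame_ms=30, shift_ms=10, n_fft=512, n_ceps=12):
    n, s = sr * frame_ms // 1000, sr * shift_ms // 1000
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])               # pre-emphasis
    starts = np.arange(0, len(x) - n + 1, s)
    frames = np.stack([x[i:i + n] * np.hamming(n) for i in starts])
    psd = np.abs(np.fft.rfft(frames, n_fft)) ** 2            # per-frame power spectrum
    logmel = np.log(psd @ mel_filterbank(24, n_fft, sr).T + 1e-10)
    ceps = dct(logmel, type=2, axis=1, norm="ortho")[:, :n_ceps]
    return ceps, psd                                         # features and power spectra

The sketch keeps the first 12 DCT coefficients; whether the 0th coefficient is kept or dropped is a common implementation choice the patent does not specify.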
S1.2. Using the air-conducted and non-air-conducted speech features obtained in step S1.1, train the dual-channel joint speech classification model. In this embodiment, the dual-channel joint speech classification model is a multi-stream GMM, namely

p(o_x(k), o_b(k)) = Σ_{l=1}^{L} c_l · [N(o_x(k); μ_{x,l}, σ_{x,l})]^{w_x} · [N(o_b(k); μ_{b,l}, σ_{b,l})]^{w_b}

where N(o; μ, σ) is a Gaussian function, o_x(k) and o_b(k) are the feature vectors extracted from the k-th frame of the air-conducted and non-air-conducted speech, μ_{x,l} and μ_{b,l} are the means of the l-th Gaussian component of the air-conducted and non-air-conducted speech streams of the multi-stream GMM, σ_{x,l} and σ_{b,l} are the variances of the l-th Gaussian component of the two streams, c_l is the weight of the l-th Gaussian component, w_x and w_b are the weights of the air-conducted and non-air-conducted speech streams, and L is the number of Gaussian components.

The parameters c_l, w_x, w_b, μ_{x,l}, μ_{b,l}, σ_{x,l}, and σ_{b,l} of the dual-channel joint speech classification model are estimated with the expectation-maximization (EM) algorithm.
S1.3. Use the trained dual-channel joint speech classification model to classify all pairs of air-conducted and non-air-conducted training speech frames, then compute, for the frames of each class, the mean air-conducted speech power spectrum Φ_ss(ω,l), the mean non-air-conducted speech power spectrum Φ_bb(ω,l), and the mean cross-spectrum between the air-conducted and non-air-conducted speech Φ_bs(ω,l).
In this embodiment, each Gaussian component of the dual-channel joint speech classification model represents one class. For each pair of synchronous air-conducted and non-air-conducted training speech frames, the score for each class is computed as

q(k,l) = c_l · [N(o_x(k); μ_{x,l}, σ_{x,l})]^{w_x} · [N(o_b(k); μ_{b,l}, σ_{b,l})]^{w_b}

and the frame pair is assigned to the class with the highest score. After the class of every air-conducted and non-air-conducted training frame pair has been determined, the mean air-conducted speech power spectrum Φ_ss(ω,l), the mean non-air-conducted speech power spectrum Φ_bb(ω,l), and the mean cross-spectrum Φ_bs(ω,l) are computed over the frame pairs belonging to each class.
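The per-class score and the per-class spectral statistics of S1.3 can be sketched in Python as follows, working in the log domain for numerical stability. Diagonal covariances and the array layout are assumptions for illustration; spec_air and spec_bone denote the complex frame spectra used to form the cross-spectrum.

import numpy as np

def log_gauss_diag(o, mu, var):
    # log N(o; mu, var) with diagonal covariance; o: (K, D), mu/var: (L, D).
    d = o[:, None, :] - mu[None, :, :]                       # (K, L, D)
    return -0.5 * (np.log(2 * np.pi * var)[None] + d * d / var[None]).sum(-1)

def frame_scores(ox, ob, c, mu_x, var_x, mu_b, var_b, wx=0.5, wb=0.5):
    # log q(k,l) = log c_l + w_x log N(o_x) + w_b log N(o_b); result: (K, L).
    return (np.log(c)[None]
            + wx * log_gauss_diag(ox, mu_x, var_x)
            + wb * log_gauss_diag(ob, mu_b, var_b))

def class_spectra(psd_air, psd_bone, spec_air, spec_bone, labels, L):
    # Per-class means Phi_ss, Phi_bb (power spectra) and Phi_bs (cross-spectrum).
    # Classes with no assigned frames are left as NaN; a real implementation
    # would merge or re-estimate them.
    phi_ss = np.stack([psd_air[labels == l].mean(0) for l in range(L)])
    phi_bb = np.stack([psd_bone[labels == l].mean(0) for l in range(L)])
    phi_bs = np.stack([(spec_bone[labels == l]
                        * spec_air[labels == l].conj()).mean(0) for l in range(L)])
    return phi_ss, phi_bb, phi_bs

# Usage: labels = frame_scores(...).argmax(axis=1) assigns each frame pair
# to its highest-scoring class, as described above.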
Step S2: Synchronously collect the air-conducted and non-air-conducted test speech, build a statistical model of the air-conduction noise from the pure-noise segments of the air-conducted test speech, and compute the mean power spectrum Φ_vv(ω) of the air-conduction noise.
In this embodiment, the statistical model of the air-conduction noise is simply its mean power spectrum Φ_vv(ω), computed as follows:
S2.1. Synchronously collect the air-conducted and non-air-conducted test speech and divide both into frames.
S2.2. From the short-time autocorrelation function R_b(m) and the short-time energy E_b of each non-air-conducted test speech frame, compute its short-time average threshold-crossing rate C_b:

C_b = (1/2) Σ_{m=1}^{M−1} ( |sgn[R_b(m) − αT] − sgn[R_b(m−1) − αT]| + |sgn[R_b(m) + αT] − sgn[R_b(m−1) + αT]| )

where sgn[·] is the sign operation, α is an adjustment factor used to adapt the threshold according to the short-time energy E_b, T is the initial threshold, and M is the frame length. When C_b exceeds a preset threshold, the frame is judged to be speech; otherwise it is noise. The endpoint positions of the non-air-conducted test speech are obtained from the per-frame decisions.
S2.3. Take the instants corresponding to the endpoints of the non-air-conducted test speech detected in step S2.2 as the endpoints of the air-conducted test speech, and extract the pure-noise segments of the air-conducted test speech.
S2.4. Compute the mean power spectrum Φ_vv(ω) of the pure-noise segments of the air-conducted test speech.
Alternatively, the statistical model of the air-conduction noise may be a Gaussian function, a GMM, or an HMM.
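A possible Python sketch of the noise model estimation in S2.1 to S2.4 is given below: the threshold-crossing rate of the non-air-conducted (bone) channel marks the speech frames, and Φ_vv(ω) is averaged over the air-channel frames marked as noise. The threshold adaptation (α times the short-time energy) and the decision threshold are illustrative assumptions; the patent fixes only the overall structure.

import numpy as np

def threshold_crossing_rate(rb, thr):
    # Counts crossings of the levels +thr and -thr by R_b(m), per the formula above.
    sp, sn = np.sign(rb - thr), np.sign(rb + thr)
    return 0.5 * (np.abs(np.diff(sp)).sum() + np.abs(np.diff(sn)).sum())

def estimate_noise_psd(air_psd, bone_frames, alpha=0.5, rate_thresh=4.0):
    # air_psd: (K, F) air-channel power spectra; bone_frames: (K, N) time frames.
    is_speech = np.zeros(len(bone_frames), dtype=bool)
    for k, frame in enumerate(bone_frames):
        rb = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        eb = np.sum(frame ** 2)              # short-time energy E_b
        thr = alpha * eb                     # energy-adapted threshold (assumed form)
        is_speech[k] = threshold_crossing_rate(rb, thr) > rate_thresh
    if np.all(is_speech):                    # guard: no pure-noise frames found
        return np.zeros(air_psd.shape[1])
    return air_psd[~is_speech].mean(axis=0)  # Phi_vv(w)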
Step S3: Classify the synchronously input air-conducted and non-air-conducted test speech frames using the statistical model of the air-conduction noise and the dual-channel joint classification model from step S1.
In this embodiment, vector Taylor series (VTS) model compensation is applied first: the statistical model of the air-conduction noise is used to correct the parameters of the air-conducted speech stream in the dual-channel joint classification model, and the input air-conducted and non-air-conducted test speech frames are then classified. Specifically, the mean of each Gaussian component of the air-conducted speech stream is corrected as

μ̂_{x,l} = μ_{x,l} + C · log(1 + exp(μ_v − μ_{s,l}))

where μ_{s,l} and μ_v are the means obtained by passing the power spectra of the clean air-conducted training speech of the l-th class and of the noise, respectively, through the 24-band mel filter bank and taking the logarithm, and C is the discrete cosine transform (DCT) matrix. The other parameters of the dual-channel joint speech classification model remain unchanged. The corrected model is then used to classify the synchronously input air-conducted and non-air-conducted test speech frames, giving the classification score q(k,l) of the current frame pair for each class.
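The mean correction above can be sketched in Python as follows; mu_x holds the cepstral means of the air-conducted stream, and mu_s_logmel and mu_v_logmel the per-class clean-speech and noise log-mel means. Treating the DCT matrix C as the orthonormal type-II DCT truncated to the cepstral dimension is an assumption consistent with the feature extraction in S1.1.

import numpy as np
from scipy.fftpack import dct

def vts_compensate_means(mu_x, mu_s_logmel, mu_v_logmel):
    # mu_x: (L, n_ceps); mu_s_logmel: (L, n_mel); mu_v_logmel broadcastable to it.
    n_ceps = mu_x.shape[1]
    shift_logmel = np.log1p(np.exp(mu_v_logmel - mu_s_logmel))  # log(1 + e^(mu_v - mu_s))
    # Apply the DCT matrix C and keep the first n_ceps coefficients.
    shift_ceps = dct(shift_logmel, type=2, axis=1, norm="ortho")[:, :n_ceps]
    return mu_x + shift_ceps                                    # corrected Gaussian means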
Step S4: Construct the dual-channel Wiener filter from the classification result of step S3 and Φ_vv(ω), and filter the air-conducted and non-air-conducted test speech frames to obtain the enhanced air-conducted speech.
In this embodiment, for the k-th pair of synchronously collected air-conducted and non-air-conducted test speech frames, the enhanced air-conducted speech spectrum is computed as

Y(ω,k) = H_a(ω,k) · X(ω,k) + H_na(ω,k) · B(ω,k)

where Y(ω,k), X(ω,k), and B(ω,k) are the spectra of the enhanced air-conducted speech, the air-conducted test speech, and the non-air-conducted test speech of the k-th frame, and H_a(ω,k) and H_na(ω,k) are the frequency responses of the Wiener filters applied to the air-conducted and non-air-conducted test speech of the k-th frame, computed as

H_a(ω,k) = Σ_{l=1}^{L} q(k,l) · H_a(ω,k,l) / Σ_{l=1}^{L} q(k,l)

H_na(ω,k) = Σ_{l=1}^{L} q(k,l) · H_na(ω,k,l) / Σ_{l=1}^{L} q(k,l)

where q(k,l) is the classification score of the k-th air-conducted and non-air-conducted test speech frame pair for class l of the dual-channel joint classification model. H_a(ω,k,l), the Wiener filter frequency response of the air-conducted test speech for class l, is computed as

H_a(ω,k,l) = [Φ_ss(ω,l) · Φ_bb(ω,l) − |Φ_bs(ω,l)|²] / [(Φ_ss(ω,l) + Φ_vv(ω)) · Φ_bb(ω,l) − |Φ_bs(ω,l)|²]

and H_na(ω,k,l), the Wiener filter frequency response of the non-air-conducted test speech for class l, is computed as

H_na(ω,k,l) = Φ_bs(ω,l) · Φ_vv(ω) / [(Φ_ss(ω,l) + Φ_vv(ω)) · Φ_bb(ω,l) − |Φ_bs(ω,l)|²]
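The per-class filters and their score-weighted combination can be sketched in Python as below. The shapes are assumptions for illustration: phi_ss, phi_bb, phi_bs are (L, F) per-class spectra, phi_vv is (F,), q is the (K, L) matrix of non-negative classification scores, and X, B are the (K, F) complex spectra of the two channels.

import numpy as np

def dual_channel_wiener(X, B, q, phi_ss, phi_bb, phi_bs, phi_vv, eps=1e-12):
    # Per-class responses H_a(w,l) and H_na(w,l) from the formulas above.
    denom = (phi_ss + phi_vv[None, :]) * phi_bb - np.abs(phi_bs) ** 2 + eps
    Ha = (phi_ss * phi_bb - np.abs(phi_bs) ** 2) / denom     # (L, F)
    Hna = phi_bs * phi_vv[None, :] / denom                   # (L, F)
    # Score-weighted combination over classes, per frame.
    w = q / (q.sum(axis=1, keepdims=True) + eps)             # (K, L)
    return (w @ Ha) * X + (w @ Hna) * B                      # Y(w, k), shape (K, F)

If the scores come from the log-domain sketch after S1.3, exponentiate them first (for example with a per-frame max-subtracted softmax) so that q is non-negative before the weighted combination.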
In another embodiment, H_a(ω,k) and H_na(ω,k) above are instead computed by hard decision:

H_a(ω,k) = H_a(ω,k,l*), H_na(ω,k) = H_na(ω,k,l*), where l* = argmax_l q(k,l),

i.e. only the filter of the highest-scoring class is used, instead of averaging the per-class filters weighted by the classification scores.
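Under the same assumed shapes as the previous sketch, this hard-decision variant replaces the weighted average with a per-frame selection of the winning class:

import numpy as np

def dual_channel_wiener_hard(X, B, q, Ha, Hna):
    # Ha, Hna: (L, F) per-class responses; q: (K, L) scores; X, B: (K, F) spectra.
    l_star = q.argmax(axis=1)                # l* = argmax_l q(k,l), per frame
    return Ha[l_star] * X + Hna[l_star] * B  # Y(w, k)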
The above embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited thereto; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and is included in the protection scope of the present invention.

Claims (10)

  1. A dual-sensor speech enhancement method based on dual-channel Wiener filtering, characterized in that the dual-sensor speech enhancement method comprises the following steps:
    S1. synchronously collecting clean air-conducted and non-air-conducted training speech, building a dual-channel joint classification model of air-conducted and non-air-conducted speech frames, and computing, for each class of the model, the mean air-conducted speech power spectrum Φ_ss(ω,l), the mean non-air-conducted speech power spectrum Φ_bb(ω,l), and the mean cross-spectrum between the air-conducted and non-air-conducted speech Φ_bs(ω,l), where ω is the frequency and l is the index of the class;
    S2. synchronously collecting air-conducted and non-air-conducted test speech, building a statistical model of the air-conduction noise from the pure-noise segments of the air-conducted test speech, and computing the mean power spectrum Φ_vv(ω) of the air-conduction noise;
    S3. classifying the synchronously input air-conducted and non-air-conducted test speech frames using the statistical model of the air-conduction noise and the dual-channel joint classification model from step S1;
    S4. constructing a dual-channel Wiener filter from the classification result of step S3 and the mean power spectrum Φ_vv(ω), and filtering the air-conducted and non-air-conducted test speech frames to obtain the enhanced air-conducted speech.
  2. The dual-sensor speech enhancement method according to claim 1, characterized in that step S1 proceeds as follows:
    S1.1. dividing the synchronously collected clean air-conducted and non-air-conducted training speech into frames, preprocessing them, and extracting the feature parameters of each frame, wherein the feature parameters are mel-frequency cepstral coefficients;
    S1.2. training the dual-channel joint speech classification model using the air-conducted and non-air-conducted speech features obtained in step S1.1;
    S1.3. classifying all air-conducted and non-air-conducted training speech frames using the trained dual-channel joint speech classification model, then computing, for the frames of each class, the mean air-conducted speech power spectrum Φ_ss(ω,l), the mean non-air-conducted speech power spectrum Φ_bb(ω,l), and the mean cross-spectrum between the air-conducted and non-air-conducted speech Φ_bs(ω,l).
  3. The dual-sensor speech enhancement method according to claim 2, characterized in that in step S1.2 the dual-channel joint speech classification model is a multi-stream GMM, where GMM denotes a Gaussian mixture model, namely

    p(o_x(k), o_b(k)) = Σ_{l=1}^{L} c_l · [N(o_x(k); μ_{x,l}, σ_{x,l})]^{w_x} · [N(o_b(k); μ_{b,l}, σ_{b,l})]^{w_b}

    where N(o; μ, σ) is a Gaussian function, o_x(k) and o_b(k) are the feature vectors extracted from the k-th frame of the air-conducted and non-air-conducted test speech, μ_{x,l} and μ_{b,l} are the means of the l-th Gaussian component of the air-conducted and non-air-conducted speech streams of the multi-stream GMM, σ_{x,l} and σ_{b,l} are the variances of the l-th Gaussian component of the two streams, c_l is the weight of the l-th Gaussian component, w_x and w_b are the weights of the air-conducted and non-air-conducted speech streams, and L is the number of Gaussian components.
  4. The dual-sensor speech enhancement method according to claim 3, characterized in that in step S1.3 each Gaussian component of the dual-channel joint speech classification model represents one class, and for each pair of synchronous air-conducted and non-air-conducted training speech frames the score for each class is computed as

    q(k,l) = c_l · [N(o_x(k); μ_{x,l}, σ_{x,l})]^{w_x} · [N(o_b(k); μ_{b,l}, σ_{b,l})]^{w_b}

    the current frame pair belonging to the class with the highest score; the class of every air-conducted and non-air-conducted training speech frame pair is computed, and then the mean air-conducted speech power spectrum Φ_ss(ω,l), the mean non-air-conducted speech power spectrum Φ_bb(ω,l), and the mean cross-spectrum between the air-conducted and non-air-conducted speech Φ_bs(ω,l) are computed over the frame pairs of each class.
  5. The dual-sensor speech enhancement method according to claim 1, characterized in that the statistical model of the air-conduction noise is its mean power spectrum Φ_vv(ω), computed as follows:
    S2.1. synchronously collecting the air-conducted and non-air-conducted test speech and dividing both into frames;
    S2.2. from the short-time autocorrelation function R_b(m) and the short-time energy E_b of each non-air-conducted test speech frame, computing its short-time average threshold-crossing rate C_b:

    C_b = (1/2) Σ_{m=1}^{M−1} ( |sgn[R_b(m) − αT] − sgn[R_b(m−1) − αT]| + |sgn[R_b(m) + αT] − sgn[R_b(m−1) + αT]| )

    where sgn[·] is the sign operation, α is an adjustment factor used to adapt the threshold according to the short-time energy E_b, T is the initial threshold, and M is the frame length; when C_b exceeds a preset threshold the frame is judged to be speech, otherwise noise, and the endpoint positions of the non-air-conducted test speech are obtained from the per-frame decisions;
    S2.3. taking the instants corresponding to the endpoints of the non-air-conducted test speech detected in step S2.2 as the endpoints of the air-conducted test speech, and extracting the pure-noise segments of the air-conducted test speech;
    S2.4. computing the mean power spectrum Φ_vv(ω) of the pure-noise segments of the air-conducted test speech.
  6. The dual-sensor speech enhancement method according to claim 1, characterized in that in step S3 vector Taylor series model compensation is applied first: the statistical model of the air-conduction noise is used to correct the parameters of the air-conducted speech stream in the dual-channel joint classification model, and the input air-conducted and non-air-conducted test speech frames are then classified, wherein the mean of each Gaussian component of the air-conducted speech stream is corrected as

    μ̂_{x,l} = μ_{x,l} + C · log(1 + exp(μ_v − μ_{s,l}))

    where μ_{s,l} and μ_v are the means obtained by passing the power spectra of the clean air-conducted training speech of the l-th class and of the noise, respectively, through the 24-band mel filter bank and taking the logarithm, and C is the discrete cosine transform matrix; the other parameters of the dual-channel joint speech classification model remain unchanged, and the corrected model is used to classify the synchronously input air-conducted and non-air-conducted test speech frames, giving the classification score q(k,l) of the current frame pair for each class.
  7. The dual-sensor speech enhancement method according to claim 2, characterized in that in step S4, for the k-th pair of synchronously collected air-conducted and non-air-conducted test speech frames, the enhanced air-conducted speech spectrum is computed as

    Y(ω,k) = H_a(ω,k) · X(ω,k) + H_na(ω,k) · B(ω,k)

    where Y(ω,k), X(ω,k), and B(ω,k) are the spectra of the enhanced air-conducted speech, the air-conducted test speech, and the non-air-conducted test speech of the k-th frame, and H_a(ω,k) and H_na(ω,k) are the frequency responses of the Wiener filters of the air-conducted and non-air-conducted test speech of the k-th frame, computed as

    H_a(ω,k) = Σ_{l=1}^{L} q(k,l) · H_a(ω,k,l) / Σ_{l=1}^{L} q(k,l)

    H_na(ω,k) = Σ_{l=1}^{L} q(k,l) · H_na(ω,k,l) / Σ_{l=1}^{L} q(k,l)

    where q(k,l) is the classification score of the k-th air-conducted and non-air-conducted test speech frame pair for class l of the dual-channel joint classification model; H_a(ω,k,l) is the Wiener filter frequency response of the air-conducted test speech for class l, computed as

    H_a(ω,k,l) = [Φ_ss(ω,l) · Φ_bb(ω,l) − |Φ_bs(ω,l)|²] / [(Φ_ss(ω,l) + Φ_vv(ω)) · Φ_bb(ω,l) − |Φ_bs(ω,l)|²]

    and H_na(ω,k,l) is the Wiener filter frequency response of the non-air-conducted test speech for class l, computed as

    H_na(ω,k,l) = Φ_bs(ω,l) · Φ_vv(ω) / [(Φ_ss(ω,l) + Φ_vv(ω)) · Φ_bb(ω,l) − |Φ_bs(ω,l)|²]
  8. The dual-sensor speech enhancement method according to claim 7, characterized in that H_a(ω,k) and H_na(ω,k) are computed as

    H_a(ω,k) = H_a(ω,k,l*), H_na(ω,k) = H_na(ω,k,l*), where l* = argmax_l q(k,l).
  9. A device implementing the dual-sensor speech enhancement method based on dual-channel Wiener filtering, characterized in that the device comprises an air-conducted speech sensor, a non-air-conducted speech sensor, a noise model estimation module, a dual-channel joint speech classification model, a model compensation module, a frame classification module, a filter coefficient generation module, and a dual-channel filter, wherein
    the air-conducted and non-air-conducted speech sensors are each connected to the noise model estimation module, the frame classification module, and the dual-channel filter; the dual-channel joint speech classification model, the model compensation module, the frame classification module, the filter coefficient generation module, and the dual-channel filter are connected in sequence; the noise model estimation module is connected to the model compensation module and the filter coefficient generation module; and the dual-channel joint speech classification model is connected to the filter coefficient generation module;
    the air-conducted and non-air-conducted speech sensors collect the air-conducted and non-air-conducted speech signals, respectively; the noise model estimation module estimates the model and power spectrum of the current air-conduction noise; the dual-channel joint speech classification model is built from synchronously collected clean air-conducted and non-air-conducted training speech frames, the mean air-conducted speech power spectrum of each class of the model being Φ_ss(ω,l), the mean non-air-conducted speech power spectrum being Φ_bb(ω,l), and the mean cross-spectrum between the air-conducted and non-air-conducted speech being Φ_bs(ω,l); the model compensation module corrects the parameters of the dual-channel joint speech classification model using the statistical model of the air-conduction noise; the frame classification module classifies the synchronously input air-conducted and non-air-conducted test speech frames; the filter coefficient generation module constructs the dual-channel Wiener filter from the classification result and the power spectrum of the air-conduction noise; and the dual-channel filter filters the air-conducted and non-air-conducted test speech frames to obtain the enhanced air-conducted speech.
  10. The device implementing the dual-sensor speech enhancement method according to claim 9, characterized in that the air-conducted speech sensor is a microphone and the non-air-conducted speech sensor is a throat microphone.
PCT/CN2019/110290, filed 2019-10-10 with priority of 2019-07-25, published as WO2021012403A1: Dual-sensor speech enhancement method and implementation device

Applications Claiming Priority (2)

CN201910678398.7A (published as CN110390945B), priority date 2019-07-25: Dual-sensor voice enhancement method and implementation device
CN201910678398.7, priority date 2019-07-25

Publications (1)

WO2021012403A1



Citations (8)

JP2004279768A (Mitsubishi Heavy Industries, published 2004-10-07): Device and method for estimating air-conducted sound
CN103208291A (South China University of Technology, published 2013-07-17): Speech enhancement method and device applicable to strong noise environments
CN105513605A (Nanjing Normal University, published 2016-04-20): Voice enhancement system and method for cellphone microphone
CN105632512A (South China University of Technology, published 2016-06-01): Dual-sensor voice enhancement method based on statistics model and device
US20170294179A1 (Bitwave Pte Ltd, published 2017-10-12): Multi-sensor signal optimization for speech communication
JP2018063400A (Fujitsu, published 2018-04-19): Audio processing apparatus and audio processing program
CN107886967A (Army Engineering University of PLA, published 2018-04-06): Bone-conducted speech enhancement method based on a deep bidirectional gated recurrent neural network
CN108986834A (Army Engineering University of PLA, published 2018-12-11): Blind enhancement method for bone-conducted speech based on an encoder-decoder framework and recurrent neural network



Also Published As

CN110390945A, published 2019-10-29
CN110390945B, published 2021-09-21


Legal Events

121: EPO informed by WIPO that EP was designated in this application (ref. document 19938708, country EP, kind code A1)
NENP: Non-entry into the national phase (country DE)
122: PCT application non-entry in the European phase (ref. document 19938708, country EP, kind code A1)
32PN: Public notification in the EP bulletin, as the address of the addressee could not be established; noting of loss of rights pursuant to Rule 112(1) EPC (EPO Form 1205A dated 29.09.2022)