CN113707171B

CN113707171B - Spatial filtering speech enhancement system and method

Info

Publication number: CN113707171B
Application number: CN202111004913.7A
Authority: CN
Inventors: 王笑楠; 李光; 刘云飞; 周瑜; 冯杰
Original assignee: Third Research Institute Of China Electronics Technology Group Corp
Current assignee: Third Research Institute Of China Electronics Technology Group Corp
Priority date: 2021-08-30
Filing date: 2021-08-30
Publication date: 2024-05-14
Anticipated expiration: 2041-08-30
Also published as: CN113707171A

Abstract

The invention discloses a spatial filtering speech enhancement system and method. While the sound pressure sensing unit measures the sound pressure component of the sound field, the orthogonally placed two-dimensional particle velocity sensing units synchronously and independently measure the spatial particle velocity components. Practice shows that this structure localizes sound sources with high precision, that its directivity does not depend on the array aperture, and that the array is compact, with an aperture that can be at the millimeter level. The structure adapts to a variety of environments and noise types and suppresses room reverberation to some extent, thereby achieving speech enhancement and effectively easing the design and application-scenario limits of existing microphone arrays.

Description

Spatial filtering speech enhancement system and method

Technical Field

The invention relates to the technical field of signal processing, and in particular to a spatial filtering speech enhancement system and method.

Background

When a speech signal is acquired with no ambient noise, no room reverberation, and a very short distance to the sound source, a single microphone can capture it with high quality. In practice, however, factors such as a non-fixed source position and the acoustic environment degrade the quality of the speech picked up by the microphone and thereby reduce speech intelligibility. To overcome the limits of single-microphone pickup, speech processing with microphone arrays was introduced; it improves pickup quality and has driven the development of speech enhancement technology. Unlike a single microphone, a microphone array is spatially selective: it captures the target signal from a specific direction by electronic steering while suppressing surrounding noise. However, a microphone array is constrained by the half-wavelength theory: the more microphones it has, the larger its aperture, and it suffers from spatial aliasing and high computational complexity, so its design freedom and application scenarios are severely limited.

Disclosure of Invention

The invention provides a spatial filtering speech enhancement system and method based on a particle velocity sensor microarray, to solve the prior-art problem that the design and application scenarios of microphone arrays are limited.

In a first aspect, the present invention provides a spatial filtering speech enhancement system based on a particle velocity sensor microarray, comprising: a sound pressure sensing unit, two-dimensional particle velocity sensing units, and a processor, wherein the particle velocity sensing units are arranged orthogonally on two sides of the sound pressure sensing unit;

the sound pressure sensing unit is configured to measure the sound pressure component of the sound field;

the particle velocity sensing units are configured to measure the spatial particle velocity components;

the processor is configured to estimate the target speech direction from the sound pressure component and the spatial particle velocity components, set to zero the energy of the frequency points outside the target speech region, obtain the time-frequency point data corresponding to the target speech azimuth, and obtain the target speech signal from that time-frequency point data.

Optionally, the sound pressure sensing unit is further configured to collect a one-channel sound pressure signal p(t);

the particle velocity sensing units are further configured to collect time-frequency data v_x(t) and v_y(t) of the two-channel particle velocity signals;

where x₁(t) and x₂(t) are the sound pressure signals of the target speech and the interfering speech respectively, θ₁ and θ₂ are the horizontal azimuth angles of the target speech and the interfering speech respectively, with the positive x-axis direction taken as 0°, and n_p(t), n_x(t) and n_y(t) are the noise signals received by the sound pressure sensing unit and the particle velocity sensing units.
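The expression that these symbols belong to is not reproduced in this text. A commonly used signal model that is consistent with the definitions above is sketched below. It assumes two plane-wave sources and particle velocity channels that have been normalized to pressure units, neither of which the text states explicitly, so it should be read as an illustration and not as the formula of the invention.

```latex
\begin{aligned}
p(t)   &= x_1(t) + x_2(t) + n_p(t) \\
v_x(t) &= x_1(t)\cos\theta_1 + x_2(t)\cos\theta_2 + n_x(t) \\
v_y(t) &= x_1(t)\sin\theta_1 + x_2(t)\sin\theta_2 + n_y(t)
\end{aligned}
```

Under this model a single dominant source at azimuth θ gives v_x ≈ p cos θ and v_y ≈ p sin θ, which is the kind of trigonometric relationship between channels that the later angle estimation relies on.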

Optionally, the processor is further configured to pre-process the sound pressure signal and the time-frequency data to obtain corresponding channel time-frequency spectrum data.

Optionally, the processor is further configured to frame and window the sound pressure signal and the time-frequency data to obtain the corresponding single-frame time-domain signals p_win(l), v_xwin(l) and v_ywin(l), where l = 1, 2, …, L and L is the length of the single-frame time-domain signal, and then to apply a short-time Fourier transform that converts each single-frame time-domain signal into a frequency-domain signal:

P_win(l,k) = fft(p_win(l)), V_xwin(l,k) = fft(v_xwin(l)), V_ywin(l,k) = fft(v_ywin(l)).

Optionally, the processor is further configured to estimate the angular interval of the directional sound sources in a single frame of the speech signal and obtain the angular distribution of energy, the direction-of-arrival estimate of each time-frequency point being obtained from the trigonometric relationship among the time-frequency spectrum data of the channels.

Optionally, the processor is further configured to apply a preset window function so that the energy of the frequency points outside the target speech region is set to zero, obtaining the time-frequency point data corresponding to the target speech azimuth.

Optionally, the processor is further configured to convolve a rectangular window function, a Gaussian window, a Hanning window or a Hamming window with the obtained energy distribution over the full-angle space, so as to set to zero the energy of the frequency points outside the target speech region and obtain the time-frequency point data corresponding to the target speech azimuth;

Where θ is the target azimuth and Δθ is the width of the rectangular window.
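The window expression itself is not reproduced in this text. One form consistent with the definitions of θ and Δθ is sketched below. It assumes that Δθ is the full width of the window, and it introduces two hypothetical symbols: \hat{\theta}(l,k) for the estimated arrival angle of a time-frequency point and Y(l,k) for the filtered spectrum.

```latex
W\bigl(\hat{\theta}\bigr) =
\begin{cases}
1, & \bigl|\hat{\theta} - \theta\bigr| \le \Delta\theta / 2 \\
0, & \text{otherwise}
\end{cases}
\qquad
Y(l,k) = W\bigl(\hat{\theta}(l,k)\bigr)\, P_{win}(l,k)
```

With this form, a time-frequency point is kept unchanged when its estimated angle lies inside the window around the target azimuth and is set to zero otherwise.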

Optionally, the processor is further configured to apply an IFFT to the time-frequency point data corresponding to the target speech azimuth to obtain the corresponding time-domain signal, and to splice the spatially filtered target speech signal using the overlap-add method.

In a second aspect, the present invention provides a method for spatial filtering speech enhancement using the system of any one of the preceding claims, the method comprising:

Based on the measured azimuth information of the target speech and the interfering speech, spatial filtering is applied to the time-frequency spectrum data of each channel of the particle velocity sensor microarray to obtain the time-frequency point data corresponding to the target speech azimuth; an IFFT is applied to that data to obtain the corresponding time-domain signal; and the spatially filtered target speech signal is spliced using the overlap-add method.

In a third aspect, the present invention provides a computer readable storage medium storing a computer program of signal mapping, which when executed by at least one processor, implements a method of spatial filtering speech enhancement as described in any of the above.

The invention has the following beneficial effects:

The invention consists of a sound pressure sensing unit and two-dimensional particle velocity sensing units. While the sound pressure sensing unit measures the sound pressure component of the sound field, the orthogonally placed two-dimensional particle velocity sensing units synchronously and independently measure the spatial particle velocity components. Practice shows that this structure localizes sound sources with high precision, that its directivity does not depend on the array aperture, and that the array is compact, with an aperture that can be at the millimeter level. The structure adapts to a variety of environments and noise types and suppresses room reverberation to some extent, thereby achieving speech enhancement and effectively easing the design and application-scenario limits of existing microphone arrays.

The foregoing is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments are set out below.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:

FIG. 1 is a schematic diagram of a spatial filtering speech enhancement system based on a particle velocity sensor microarray according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for spatially filtered speech enhancement based on a particle velocity sensor microarray according to an embodiment of the present invention.

Detailed Description

To address the problems that existing microphone arrays grow in aperture as the number of microphones increases and suffer from spatial aliasing and high computational complexity, the embodiment of the invention provides a spatial filtering speech enhancement system based on a particle velocity sensor microarray. The system consists of a sound pressure sensing unit and two-dimensional particle velocity sensing units: while the sound pressure sensing unit measures the sound pressure component of the sound field, the orthogonally arranged two-dimensional particle velocity sensing units synchronously and independently measure the spatial particle velocity components. Practice shows that this structure localizes sound sources with high precision, that its directivity does not depend on the array aperture, that the array is compact, with an aperture that can be at the millimeter level, and that it adapts to a variety of environments and noise types and suppresses room reverberation to some extent. The invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein serve only to illustrate the invention and do not limit it.

A first embodiment of the present invention provides a spatial filtering speech enhancement system based on a particle velocity sensor microarray which, referring to FIG. 1, comprises: a sound pressure sensing unit, two-dimensional particle velocity sensing units, and a processor, wherein the particle velocity sensing units are arranged orthogonally on two sides of the sound pressure sensing unit.

That is, because existing microphone arrays are constrained by the half-wavelength theory, so that more microphones mean a larger aperture together with spatial aliasing and high computational complexity, the embodiment of the invention provides a spatial filtering speech enhancement system based on a particle velocity sensor microarray. The system consists of a sound pressure sensing unit and two-dimensional particle velocity sensing units: while the sound pressure sensing unit measures the sound pressure component of the sound field, the orthogonally arranged two-dimensional particle velocity sensing units synchronously and independently measure the spatial particle velocity components. As a result, high-precision sound source localization is ensured, the directivity does not depend on the array aperture, the array is compact, and the structure adapts to a variety of environments and noise types and suppresses room reverberation to some extent.

In a specific implementation, the sound pressure sensing unit in the embodiment of the present invention is further configured to collect a one-channel sound pressure signal p(t), and the particle velocity sensing units are further configured to collect time-frequency data v_x(t) and v_y(t) of the two-channel particle velocity signals;

The processor is used for preprocessing the sound pressure signal and the time-frequency data to obtain corresponding channel time-frequency spectrum data.

Specifically, the processor in the embodiment of the present invention frames and windows the sound pressure signal and the time-frequency data to obtain the corresponding single-frame time-domain signals p_win(l), v_xwin(l) and v_ywin(l), where l = 1, 2, …, L and L is the length of the single-frame time-domain signal, and then applies a short-time Fourier transform to convert each single-frame time-domain signal into a frequency-domain signal: P_win(l,k) = fft(p_win(l)), V_xwin(l,k) = fft(v_xwin(l)), V_ywin(l,k) = fft(v_ywin(l)).

Further, the processor in the embodiment of the present invention is configured to estimate the angular interval of the directional sound sources in a single frame of the speech signal and obtain the angular distribution of energy, the direction-of-arrival estimate of each time-frequency point being obtained from the trigonometric relationship among the time-frequency spectrum data of the channels. A preset window function is then applied so that the energy of the frequency points outside the target speech region is set to zero, which yields the time-frequency point data corresponding to the target speech azimuth.

The processor convolves a rectangular window function, a Gaussian window, a Hanning window or a Hamming window with the obtained energy distribution over the full-angle space, so as to set to zero the energy of the frequency points outside the target speech region and obtain the time-frequency point data corresponding to the target speech azimuth;

Where θ is the target azimuth and Δθ is the width of the rectangular window.

Finally, the processor applies an IFFT to the time-frequency point data corresponding to the target speech azimuth to obtain the corresponding time-domain signal, and splices the spatially filtered target speech signal using the overlap-add method.

Correspondingly, the embodiment of the invention also provides a method for performing spatial filtering voice enhancement by applying the system, which comprises the following steps:

To explain the method of the present invention more clearly, it is described in detail below by way of a specific example:

The embodiment of the invention develops a speech enhancement algorithm based on spatial filtering that exploits the trigonometric relationship among the components received by the particle velocity sensor microarray and the time-frequency sparsity of speech signals. As shown in FIG. 2, the workflow of the method comprises the following steps:

Step one, arranging a particle vibration velocity sensor microarray and collecting voice signals;

In a specific implementation, the particle velocity sensor microarray is arranged as shown in FIG. 1 and the sampling rate is 16 kHz, which yields a one-channel sound pressure signal p and time-frequency data v_x and v_y of the two-channel particle velocity signals.

Here x₁(t) and x₂(t) are the sound pressure signals of the target speech and the interfering speech respectively, θ₁ and θ₂ are the horizontal azimuth angles of the target speech and the interfering speech respectively, with the positive x-axis direction taken as 0°, and n_p(t), n_x(t) and n_y(t) are the noise signals received by the sound pressure sensing unit and the two-dimensional particle velocity sensing units respectively.
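For experimentation, the three channels described in step one can be simulated. The sketch below is illustrative only: it assumes the common plane-wave mixing model (pressure is the sum of the sources, and the velocity channels weight each source by the cosine and sine of its azimuth, in pressure-normalized units), which this text does not spell out. The function name `simulate_avs` and the sine-tone test signals are hypothetical.

```python
import numpy as np

def simulate_avs(x1, x2, theta1_deg, theta2_deg, noise_std=0.0, seed=0):
    """Mix a target x1 and an interferer x2 into the channels (p, v_x, v_y).

    Assumed model (not stated in the source text):
      p   = x1 + x2 + n_p
      v_x = x1*cos(theta1) + x2*cos(theta2) + n_x
      v_y = x1*sin(theta1) + x2*sin(theta2) + n_y
    Azimuths are measured from the positive x-axis (0 degrees).
    """
    rng = np.random.default_rng(seed)
    t1, t2 = np.deg2rad(theta1_deg), np.deg2rad(theta2_deg)
    # Independent white noise for each of the three channels.
    n_p, n_x, n_y = noise_std * rng.standard_normal((3, len(x1)))
    p = x1 + x2 + n_p
    v_x = x1 * np.cos(t1) + x2 * np.cos(t2) + n_x
    v_y = x1 * np.sin(t1) + x2 * np.sin(t2) + n_y
    return p, v_x, v_y

fs = 16000                                   # sampling rate given in step one
t = np.arange(fs) / fs                       # one second of samples
x1 = np.sin(2 * np.pi * 440.0 * t)           # stand-in for the target speech
x2 = 0.5 * np.sin(2 * np.pi * 1000.0 * t)    # stand-in for the interfering speech
p, v_x, v_y = simulate_avs(x1, x2, theta1_deg=30.0, theta2_deg=120.0)
```

Real speech recordings can be substituted for the two tones; only the mixing model matters for the later steps.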

Step two, framing and windowing the output data of the particle velocity sensor microarray, and obtaining time-frequency data of the one-channel sound pressure signal and the two-channel particle velocity signals through a short-time Fourier transform;

To obtain approximately stationary speech segments, the p, v_x and v_y signals from the previous step are framed and windowed to obtain the corresponding single-frame time-domain signals p_win(l), v_xwin(l) and v_ywin(l) (l = 1, 2, …, L, where L is the length of the single-frame time-domain signal). Each single-frame time-domain signal is then converted into a frequency-domain signal by a short-time Fourier transform. The parameters are chosen as follows: window function: Hanning window; window length: K = 1024 samples; frame shift: 50%; number of Fourier transform points: K. This yields the spectrum data of the corresponding channels:

P_win(l,k) = fft(p_win(l))

V_xwin(l,k) = fft(v_xwin(l))

V_ywin(l,k) = fft(v_ywin(l))
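The framing, windowing and transform of step two can be sketched with NumPy as follows. The parameters (Hanning window, 1024 samples, 50% frame shift, 16 kHz) come from the text; the function name `stft_frames`, the use of a periodic Hann window, the one-sided `rfft`, and the 1 kHz test tone are choices made for this illustration.

```python
import numpy as np

def stft_frames(x, win_len=1024, hop=512):
    """Frame x, apply a Hann window and FFT each frame.

    Returns an array of shape (n_frames, win_len // 2 + 1): one row per frame,
    one column per non-negative frequency bin. Trailing samples that do not
    fill a whole frame are dropped.
    """
    # Periodic Hann window, chosen here because it sums to one at 50% overlap.
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(win_len) / win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1)

fs = 16000
t = np.arange(fs) / fs
p = np.sin(2 * np.pi * 1000.0 * t)   # hypothetical test tone in place of speech
P = stft_frames(p)                   # same call applies to v_x and v_y
peak_bin = int(np.argmax(np.abs(P[0])))
peak_hz = peak_bin * fs / 1024       # centre frequency of the strongest bin
```

The same function is applied to each of the three channels so that their time-frequency grids line up bin for bin.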

Step three, obtaining a high signal-to-noise-ratio angle estimate for each time-frequency point by using the trigonometric relationship between the one-channel sound pressure signal and the two-channel particle velocity signals of the particle velocity sensor microarray, and obtaining the full-angle spatial energy distribution;

The purpose of this step is to estimate the angular interval of the directional sound sources in a single frame of the speech signal and to obtain the angular distribution of energy. The inputs are the single-frame frequency-domain noisy speech signals P_win(l,k), V_xwin(l,k) and V_ywin(l,k), which follow from the signal model of step one and are obtained in step two.

The direction-of-arrival estimate of each time-frequency point is obtained from the trigonometric relationship among the spectrum data of the channels of the particle velocity sensor microarray.

Speech signals are known to be highly sparse in the time-frequency domain: when several speech sources in different directions are present, the different speech signals occupy separate, discrete regions of the time-frequency domain. The full-angle spatial energy distribution of the speech signal is therefore obtained from the direction-of-arrival estimates of the individual time-frequency points.

In other words, in step three, when directional interfering speech is present in the environment, the time-frequency sparsity of speech signals makes it possible to obtain a high signal-to-noise-ratio angle estimate for each time-frequency point from the trigonometric relationship among the channels of the particle velocity sensor microarray, and from these estimates the full-angle spatial energy distribution.
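The estimator used in step three is not written out in this text. A minimal sketch of one common choice is given below: the angle is taken as the arctangent of the two real cross-spectra between pressure and velocity, and the energy distribution is a histogram of those angles weighted by |P|². This estimator, the function names `doa_per_bin` and `angular_energy`, and the 1° histogram resolution are assumptions of the example.

```python
import numpy as np

def doa_per_bin(P, Vx, Vy):
    """Azimuth estimate in degrees, in [0, 360), for every time-frequency point.

    Assumed estimator: theta = atan2(Re(conj(P)*Vy), Re(conj(P)*Vx)), which
    returns the source azimuth when Vx = P*cos(theta) and Vy = P*sin(theta).
    """
    ix = np.real(np.conj(P) * Vx)
    iy = np.real(np.conj(P) * Vy)
    return np.rad2deg(np.arctan2(iy, ix)) % 360.0

def angular_energy(P, theta_deg, n_bins=360):
    """Accumulate the energy |P|^2 of each point into azimuth bins over 0-360."""
    hist, _ = np.histogram(theta_deg, bins=n_bins, range=(0.0, 360.0),
                           weights=np.abs(P) ** 2)
    return hist

# Toy spectra for a single noise-free source at 30.5 degrees (hypothetical data).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 513)) + 1j * rng.standard_normal((4, 513))
azimuth = np.deg2rad(30.5)
P, Vx, Vy = X, X * np.cos(azimuth), X * np.sin(azimuth)

theta = doa_per_bin(P, Vx, Vy)       # one angle per (frame, frequency) point
energy = angular_energy(P, theta)    # energy as a function of azimuth
```

With several sources and sparse speech, each time-frequency point is dominated by one source, so the histogram shows one peak per source direction.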

Step four, convolving a rectangular window function (a Gaussian, Hanning or Hamming window may also be used) with the full-angle spatial energy distribution of the current frame of the particle velocity sensor microarray signal to obtain the time-frequency point data corresponding to the target speech azimuth;

The preceding steps give the energy distribution of the current frame over the full-angle space. A rectangular window function (or a Gaussian, Hanning or Hamming window) is convolved with this distribution so that the energy of the frequency points outside the target speech region is set to zero, which yields the time-frequency point data corresponding to the target speech azimuth.

Here θ is the target azimuth and Δθ is the width of the rectangular window, typically Δθ = 5°.

That is, step four uses rectangular-window spatial filtering (a Gaussian, Hanning or Hamming window may also be used) to set to zero the energy of the frequency points outside the target speech region and obtain the time-frequency point data corresponding to the target speech azimuth.
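One way to realize the rectangular-window filtering of step four is as a binary mask over the time-frequency points, sketched below. The sketch assumes that a per-point angle estimate is already available and that Δθ is the full width of the window, so a point is kept when its angle lies within Δθ/2 of the target azimuth. The function names and the toy angle values are hypothetical, and the text's convolution with the energy distribution is replaced here by this equivalent per-point selection.

```python
import numpy as np

def rectangular_spatial_mask(theta_deg, target_deg, width_deg=5.0):
    """True where the estimated angle lies inside the window around the target."""
    # Signed angular difference wrapped into [-180, 180) so 359 and 1 are close.
    diff = (theta_deg - target_deg + 180.0) % 360.0 - 180.0
    return np.abs(diff) <= width_deg / 2.0

def spatial_filter(P, theta_deg, target_deg, width_deg=5.0):
    """Zero every time-frequency point whose angle falls outside the window."""
    return P * rectangular_spatial_mask(theta_deg, target_deg, width_deg)

# Toy example: 2 frames x 3 frequency points with hand-picked angle estimates.
theta = np.array([[29.0, 31.0, 120.0],
                  [358.5, 30.0, 200.0]])
P = np.ones_like(theta, dtype=complex)
Y = spatial_filter(P, theta, target_deg=30.0)   # keeps 29, 31 and 30 degrees
```

A Gaussian, Hanning or Hamming window would replace the hard 0/1 mask with a weight that decays smoothly with angular distance from the target.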

Step five, applying an inverse Fourier transform to the time-frequency point data corresponding to the target speech azimuth and splicing the spatially filtered target speech signal using the overlap-add method.

The preceding steps give the final filtered data. Applying an IFFT to this data yields the corresponding time-domain signals, and the spatially filtered target speech signal is spliced from them using the overlap-add method.
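The inverse transform and overlap-add splicing of step five can be sketched as below. The example assumes a periodic Hann analysis window with 50% frame shift and no synthesis window, a combination for which the overlapping windows sum to one, so an unmodified spectrum reconstructs the interior of the signal exactly. The function names and the random test signal are hypothetical.

```python
import numpy as np

def hann_periodic(n):
    """Periodic Hann window; shifted copies at 50% overlap sum to one."""
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n) / n)

def stft(x, win_len=1024, hop=512):
    """Windowed one-sided FFT of each frame, shape (n_frames, win_len//2 + 1)."""
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] for i in range(n_frames)])
    return np.fft.rfft(frames * hann_periodic(win_len), axis=1)

def istft_overlap_add(S, win_len=1024, hop=512):
    """Inverse FFT of each frame, then add the frames back at their offsets."""
    frames = np.fft.irfft(S, n=win_len, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + win_len)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + win_len] += frame
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(16 * 512)    # hypothetical input, 8192 samples
y = istft_overlap_add(stft(x))       # no filtering applied in between
# Only the first and last half-frame lack an overlapping partner.
interior_ok = np.allclose(y[512:-512], x[512:-512])
```

In the full pipeline the spatially filtered spectrum from step four would be passed to `istft_overlap_add` in place of the unmodified `stft(x)`.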

In summary, the spatial filtering speech enhancement algorithm of the embodiments of the present invention is built on a dedicated particle velocity sensor microarray consisting of a sound pressure sensing unit and two-dimensional particle velocity sensing units. The particle velocity sensing units are arranged orthogonally on two sides of the sound pressure sensing unit, as shown in FIG. 1. This structure localizes sound sources with high precision, its directivity does not depend on the array aperture, and the array is compact, with an aperture that can be at the millimeter level.

In addition, the embodiment of the invention localizes the target speech and the interfering speech using the trigonometric relationship between the one-channel sound pressure signal and the two-channel particle velocity signals. The spatial filtering speech enhancement algorithm based on rectangular-window filtering then solves the problem of suppressing interference and noise from all directions when the number of spatial sound sources and their directions are unknown, and thereby enhances the target speech. Practice shows that the method is robust, reliable and practical.

A second embodiment of the present invention provides a computer-readable storage medium storing a computer program of signal mapping, which when executed by at least one processor, implements the method of spatial domain filtered speech enhancement according to any of the first embodiments of the present invention.

The relevant content of the embodiments of the present invention can be understood with reference to the first embodiment of the present invention, and will not be discussed in detail herein.

Although the preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions are possible, and accordingly the scope of the invention is not limited to the embodiments described above.

Claims

1. A spatial filtering speech enhancement system based on a particle velocity sensor microarray, comprising: a sound pressure sensing unit, two-dimensional particle velocity sensing units, and a processor, wherein the particle velocity sensing units are arranged orthogonally on two sides of the sound pressure sensing unit;

the sound pressure sensing unit is configured to measure the sound pressure component of the sound field and to collect a one-channel sound pressure signal p(t);

the particle velocity sensing units are configured to measure the spatial particle velocity components and to collect time-frequency data v_x(t) and v_y(t) of the two-channel particle velocity signals;

wherein x₁(t) and x₂(t) are the sound pressure signals of the target speech and the interfering speech respectively, θ₁ and θ₂ are the horizontal azimuth angles of the target speech and the interfering speech respectively, with the positive x-axis direction taken as 0°, and n_p(t), n_x(t) and n_y(t) are the noise signals received by the sound pressure sensing unit and the particle velocity sensing units respectively;

the processor is used for preprocessing the sound pressure signal and the time-frequency data to obtain corresponding channel time-frequency spectrum data; the method specifically comprises the following steps:

framing and windowing the sound pressure signal and the time-frequency data to obtain the corresponding single-frame time-domain signals p_win(l), v_xwin(l) and v_ywin(l), wherein l = 1, 2, …, L and L is the length of the single-frame time-domain signal, and then applying a short-time Fourier transform to convert each single-frame time-domain signal into a frequency-domain signal:

P_win(l,k) = fft(p_win(l)), V_xwin(l,k) = fft(v_xwin(l)), V_ywin(l,k) = fft(v_ywin(l));

estimating the angular interval of the directional sound sources in a single frame of the speech signal and obtaining the angular distribution of energy, the direction-of-arrival estimate of each time-frequency point being obtained from the trigonometric relationship among the time-frequency spectrum data;

applying a preset window function so that the energy of the frequency points outside the target speech region is set to zero, and obtaining the time-frequency point data corresponding to the target speech azimuth;

convolving a rectangular window function, a Gaussian window, a Hanning window or a Hamming window with the obtained energy distribution over the full-angle space, so as to set to zero the energy of the frequency points outside the target speech region and obtain the time-frequency point data corresponding to the target speech azimuth;

wherein θ is the target azimuth and Δθ is the width of the rectangular window; and estimating the target speech direction from the sound pressure component and the spatial particle velocity components, setting to zero the energy of the frequency points outside the target speech region, obtaining the time-frequency point data corresponding to the target speech azimuth, and obtaining the target speech signal from that time-frequency point data.

2. The system of claim 1, wherein

the processor is further configured to apply an IFFT to the time-frequency point data corresponding to the target speech azimuth to obtain the corresponding time-domain signal, and to splice the spatially filtered target speech signal using the overlap-add method.

3. A method of spatial filtered speech enhancement using the system of any of claims 1-2, the method comprising:

4. A computer readable storage medium storing a computer program of signal mapping, which when executed by at least one processor, implements the method of spatial filtered speech enhancement of claim 3.