CN110444220B - Multi-mode remote voice perception method and device - Google Patents
- Publication number: CN110444220B (application CN201910705872.0A)
- Authority
- CN
- China
- Prior art keywords
- signal
- foreground
- sound source
- angle
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/24 — Speech recognition using non-acoustical features
- G10L21/0216 — Noise filtering characterised by the method used for estimating noise
- G10L21/0224 — Noise filtering: processing in the time domain
- G10L21/0232 — Noise filtering: processing in the frequency domain
- G10L21/028 — Voice signal separating using properties of sound source
- G10L21/055 — Time compression or expansion for synchronising with other signals, e.g. video signals
- G10L25/45 — Speech or voice analysis characterised by the type of analysis window
- G10L25/57 — Speech or voice analysis specially adapted for processing of video signals
- G10L2015/223 — Execution procedure of a spoken command
- G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166 — Microphone arrays; Beamforming
- H04N7/147 — Videophone communication arrangements
- H04N7/183 — Closed-circuit television [CCTV] for receiving images from a single remote source
Abstract
The invention discloses a multi-modal remote voice perception method and device. The perception method comprises the following steps: collect voice and video signals with a rectangular microphone array and a camera; perform a preliminary angle-of-arrival estimation on the target voice signal by beamforming to obtain a coarse sound-source direction; use this preliminary position information to drive the camera to face the sound source; establish a background model from the initial video data and perform foreground detection and background updating; and transmit the high-precision azimuth corresponding to the foreground to a beamforming module, whose output steered to that azimuth is the enhanced voice signal.
Description
Technical Field
The invention relates to the field of multi-modal joint sensor acquisition and speech enhancement, and in particular to a multi-modal remote voice perception method and device based on joint acquisition by a rectangular microphone planar array and a camera.
Background
In recent years, remote video monitoring technology has become ever more widely used in daily life: red-light cameras on the street, surveillance cameras in offices, infrared detectors, thermal imaging, and so on. In remote monitoring in particular, a single camera lets people check a remote scene from a mobile phone or other smart device at any time and place, which is a great convenience. Processing audio signals with microphones has likewise found applications in mobile phones and personal computers, usually with systems of one or two microphones. In recent years Amazon, Microsoft, Google, and others have released products based on microphone-array technology abroad; in China, companies such as iFlytek, Unisound, and SoundAI also offer mature microphone hardware solutions. The pickup and operating range of these products is within 10 m, and they mainly target near-field voice scenarios. However, conventional near-field voice applications increasingly fail to meet users' needs: when the scene shifts outdoors or to robotics, vehicles, or surveillance, more complex voice-controlled intelligent devices are required, and microphone-array technology therefore becomes the core of far-field voice perception.
However, remote video can only process images; it cannot perceive sound, which remains a significant gap. Meanwhile, in conventional voice perception technology, speech-recognition accuracy at short range has reached a practical level, but performance degrades sharply at long range because the received speech signal has a low signal-to-noise ratio and interference signals are present.
Existing remote voice localization technology has the following problems:
(1) Compressive-sensing direction estimation can improve azimuth accuracy, but requires a high signal-to-noise ratio;
(2) Convolutional beamforming for small sensor arrays improves azimuth estimation accuracy, but likewise needs a high signal-to-noise ratio;
(3) A large-scale microphone array can achieve both high signal-to-noise ratio and narrow beams, but is cumbersome in engineering practice: it occupies a large space, and its multi-channel data processing demands a powerful signal processor.
To address inaccurate remote voice localization, researchers have proposed exploiting the high spatial resolution of images: first obtain an accurate sound-source position from the image, then combine it with a microphone array and a beamforming algorithm to enhance the speech, suppress noise, and improve speech quality.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a multi-mode remote voice perception method and a multi-mode remote voice perception device.
The purpose of the invention is realized by the following technical scheme: a multimodal remote speech perception method comprising the steps of:
step 1: collecting voice and video signals by using a rectangular microphone array and a camera;
step 2: carrying out primary arrival angle estimation on a target voice signal by utilizing beam forming so as to obtain a rough sound source direction;
step 3: according to the coarse sound-source position, drive the camera to face the sound-source direction;
step 4: establish a background model from the initial data, and perform foreground extraction and adaptive background-model updating;
step 5: map the foreground spatial position to a high-precision azimuth, transmit this azimuth parameter to the beamforming module, and take the beamformer output steered to that azimuth as the enhanced voice signal.
Further, the step 2 specifically includes the following sub-steps:
Step 2.1: frame the speech signal, and record the l-th frame (l = 1, …, L) acquired by the array as $x(l) = [x_1(l), x_2(l), \ldots, x_m(l), \ldots, x_M(l)]$, where M is the number of microphones (each microphone is one channel) and $x_m(l) = [x_m(0,l), x_m(1,l), \ldots, x_m(n,l), \ldots, x_m(N-1,l)]^T$ is the l-th frame signal collected on the m-th channel. Apply a window function to each frame and take the short-time Fourier transform; the frequency-domain representation of the l-th frame of the m-th channel is

$$X_m(k,l) = \sum_{n=0}^{N-1} b_n\, x_m(n,l)\, e^{-j2\pi kn/N} \quad (2.1)$$

where n denotes the time index, k the k-th frequency bin, and $b_n$ a Hann window of length N.

The frequency-domain signal of the M channels is defined as X(k,l):

$$X(k,l) = [X_1(k,l), X_2(k,l), \ldots, X_M(k,l)]^T, \quad 0 \le k \le N-1 \quad (2.2)$$

Step 2.2: define the spatial spectral matrix of the signal as $S_X(k) = E\{X(k,l)X^H(k,l)\}$, where $E\{\cdot\}$ denotes the expectation over the L frames and the matrix elements are the cross-power spectra between channel pairs. Assuming the voice signal is incident from angle θ, the spatial-spectrum estimates of the N frequency bins are weighted and summed into the total beam power P(θ):

$$P(\theta) = \sum_{k=0}^{N-1} w_{DS}^H(\theta,k)\, S_X(k)\, w_{DS}(\theta,k) \quad (2.3)$$

where $w_{DS}(\theta,k) = [w_1(\theta,k), w_2(\theta,k), \ldots, w_M(\theta,k)]^T$ is the phase-aligned weight vector of the k-th bin and $w_{DS}^H(\theta,k)$ its conjugate transpose.

An angle search over the total beam power P(θ) yields the preliminarily estimated coarse sound-source azimuth $\hat{\theta}$.
Further, the step 3 specifically includes the following sub-steps:
Step 3.1: from the azimuth $\hat{\theta}$ obtained in step 2, judge the approximate direction of the sound source, and drive the camera to face that direction.
Further, the step 4 specifically includes the following sub-steps:
Step 4.1: first establish a background model from the initial video data. Record the p-th captured frame as $I_p(x,y)$, where (x,y) are the pixel coordinates of the image matrix. After converting the images to grayscale, average the first S frames as the initial background:

$$B_0(x,y) = \frac{1}{S} \sum_{p=1}^{S} I_p(x,y) \quad (4.1)$$

After background modeling, subtract the background model from the current frame to obtain the foreground Target(x,y):

$$D(x,y) = I_p(x,y) - B_0(x,y) \quad (4.2)$$

where $I_p(x,y)$ is the current frame and D(x,y) the difference image; with a set threshold T, the binary matrix Target(x,y) takes the value 1 at foreground pixels (|D(x,y)| > T) and 0 elsewhere.

Step 4.2: the resulting binary foreground image suffers from broken contours, incomplete foreground regions, and similar defects, so it is post-processed with morphological opening and closing operations to finally obtain the complete foreground image $G_p(x,y)$.

When processing a video stream, the background model must be updated to follow environmental changes such as lighting. The update formula is

$$B_p(x,y) = \alpha\, I_p(x,y) + (1-\alpha)\, B_{p-1}(x,y) \quad (4.3)$$

where $B_p(x,y)$ is the background model adaptively updated with the p-th frame and 0 < α < 1 is an update factor chosen according to how fast the environment changes.

Because the horizontal extent of the target is small relative to its distance from the camera, the image coordinate and the direction angle can be treated as linearly related, so the position of the foreground can be converted to an angle $\hat{\theta}$ and output to the beamforming module.
Further, the step 5 specifically includes the following sub-steps:
Step 5.1: from the accurate angle $\hat{\theta}$ obtained by image processing, the array response vector of the corresponding target signal is

$$a(\hat{\theta},k) = \left[ e^{-j\frac{2\pi}{\lambda_k} p_1^T u(\hat{\theta})}, \ldots, e^{-j\frac{2\pi}{\lambda_k} p_M^T u(\hat{\theta})} \right]^T \quad (5.1)$$

where $[p_1, p_2, \ldots, p_M]$ are the two-dimensional coordinates of the M microphone elements, $u(\hat{\theta})$ is the unit vector toward the source, $\lambda_k = c/f_k$ is the wavelength of the k-th frequency bin, $f_k$ its frequency, and c the propagation speed of the plane wave in the medium.

Step 5.2: linearly constrained minimum-variance beamforming is cast as the optimization problem

$$\min_{w(k,l)} \; w^H(k,l)\, S_x(k,l)\, w(k,l) \quad \text{s.t.} \quad w^H(k,l)\, a(\hat{\theta},k) = 1 \quad (5.2)$$

where $w(k,l) = [w_1(k,l), w_2(k,l), \ldots, w_M(k,l)]^T$ is the weight vector for the l-th frame signal and $S_x(k,l)$ its spatial spectral matrix. Filtering follows the steepest-descent adaptive algorithm

$$w(k,l+1) = J(k)\left[ w(k,l) - \mu X(k,l) Y^*(k,l) \right] + F(k) \quad (5.3)$$

where $J(k) = I - \frac{a(\hat{\theta},k)\,a^H(\hat{\theta},k)}{a^H(\hat{\theta},k)\,a(\hat{\theta},k)}$ and $F(k) = \frac{a(\hat{\theta},k)}{a^H(\hat{\theta},k)\,a(\hat{\theta},k)}$, $Y(k,l) = w^H(k,l) X(k,l)$ is the beamformed output signal, $Y^*(k,l)$ its complex conjugate, μ ≥ 0 the convergence step size, and the initial weight vector is $w(k,0) = F(k)$.

The sub-band signals are spliced into a wideband signal: Y(l) = [Y(0,l), Y(1,l), …, Y(N-1,l)].

Step 5.3: finally, the inverse discrete Fourier transform (IDFT) of Y(l) gives the time-domain output signal of the l-th frame:

$$y(l) = \mathrm{IDFT}[Y(l)] \quad (5.4)$$

The L frames of voice signal are then concatenated into the time-domain output

$$y(t) = [y(1), y(2), \ldots, y(l), \ldots, y(L)] \quad (5.5)$$

and y(t) is the enhanced speech signal.
It is another object of the present invention to provide a multimodal remote speech perception apparatus, comprising:
the rectangular microphone array is 8-10 m away from the sound source;
the camera is arranged on the end edge of the rectangular microphone array and rotates synchronously with the microphone array;
- the lower computer, connected to the rectangular microphone array, handles control-command reception, signal acquisition, and data transmission; after receiving a "start" control instruction from the upper computer, it acquires voice signals through the rectangular microphone array and uploads the data to the upper computer in real time; it stops uploading after receiving a "stop" control instruction;
- and the upper computer, connected to the camera, receives the video signal and the voice signal sent by the lower computer, performs an initial angle estimation on the target voice signal, and uses this angle to drive the camera to rotate until it faces the sound source. It extracts the target foreground from the video image, maps the foreground coordinates to an accurate position, transmits the high-precision azimuth parameter to the beamforming module, and takes the beamformer output steered to that azimuth as the enhanced voice signal.
Further, the connection and data transmission of the lower computer and the upper computer are as follows:
a, determining data ports and connection interfaces of an upper computer, a lower computer, a microphone array and a camera, and establishing connection;
b, the upper computer issues a control command 'start' to start collecting audio and video data;
c, performing parallel-serial conversion on the sampling data of all channels of the rectangular microphone array, and sending an uplink data packet to an upper computer by a lower computer;
d, the upper computer sends a control command 'stop', the lower computer stops collecting data and waits for the upper computer to send the control command 'start' again;
and e, after acquisition finishes, the audio data are automatically saved to a .dat file and the video data to an .avi file.
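The command exchange in steps a–e can be mimicked with a minimal TCP loop. The patent does not specify the transport, port, or packet format, so the loopback socket, the `PKT:`/`ACK:` framing, and the `lower_computer` mock below are all assumptions made for illustration only.

```python
import socket
import threading

def lower_computer(server_sock):
    """Mock lower computer: waits for 'start', sends one mock uplink
    packet of serialized multichannel samples, then acknowledges 'stop'."""
    conn, _ = server_sock.accept()
    with conn:
        while True:
            cmd = conn.recv(16).decode().strip()
            if cmd == "start":
                # parallel-serial converted channel data would go here
                conn.sendall(b"PKT:12ch-samples\n")
            elif cmd == "stop":
                conn.sendall(b"ACK:stopped\n")
                break

# a. establish the connection (loopback stands in for the real data port)
srv = socket.socket()
srv.bind(("127.0.0.1", 0))               # OS-assigned port; real port unspecified
srv.listen(1)
threading.Thread(target=lower_computer, args=(srv,), daemon=True).start()

with socket.create_connection(srv.getsockname()) as upper:
    upper.sendall(b"start")              # b. upper computer issues "start"
    packet = upper.recv(64)              # c. uplink data packet arrives
    upper.sendall(b"stop")               # d. upper computer issues "stop"
    ack = upper.recv(64)
srv.close()
```

The application-level ping-pong (command, then reply, then next command) keeps the two sides in lockstep, which is the simplest way to realize the start/stop handshake the text describes.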
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention uses an audio-video joint sound-source localization method; adding video localization of the speaker makes it easy to obtain an accurate sound-source azimuth, avoiding the low azimuth resolution of conventional beamforming and its inability to clearly distinguish multiple sound sources.
(2) The invention enhances the remote voice signal using the angle returned by image processing together with the microphone array, addressing the energy attenuation and low signal-to-noise ratio of remote voice signals after propagation through space.
(3) The invention uses an adaptive linearly constrained minimum-variance beamformer to suppress incoherent noise and interference signals, addressing the severe noise contamination of distant voice signals.
(4) Based on the three characteristics, the invention can realize the function of outdoor remote voice perception and has better practical value.
Drawings
FIG. 1 is a general flow chart of the multimodal remote speech perception method of the present invention;
FIG. 2 is a flow chart of the preliminary estimation of the azimuth of the sound source according to the present invention;
FIG. 3 is a flow chart of the image processing output accurate sound source azimuth of the present invention;
FIG. 4 is a flow chart of the adaptive beamforming for enhancing speech signals in the present invention;
FIG. 5 is a beam pattern diagram of primary positioning of upper computer beam forming in the present invention;
FIG. 6 is a diagram of the result of obtaining high-precision voice direction by video processing according to the present invention;
FIG. 7 is a waveform diagram of signals before and after speech enhancement according to the present invention;
FIG. 8 is a time-frequency diagram of signals before and after speech enhancement according to the present invention.
Detailed Description
The objects and effects of the present invention will become more apparent from the following detailed description of the present invention when taken in conjunction with the accompanying drawings.
Figure 1 shows the general flow of the invention. The multi-modal remote voice perception method comprises five steps: first, collect voice and video signals with a rectangular microphone array and a camera; perform a preliminary azimuth estimation on the signals; based on the coarse angle-of-arrival estimate, detect the target with adaptive background modeling to obtain an accurate sound-source azimuth; then, using the accurate azimuth from image processing, realize adaptive filtering of the voice signals with linearly constrained minimum-variance beamforming and the steepest-descent algorithm, finally outputting an enhanced, clear voice signal.
The detection method of the invention has the following specific implementation modes:

Step 1: place the rectangular microphone array and the camera at the same angle, and collect the audio and video signals.

Step 2: estimate the angle of arrival of the target voice signal to obtain a coarse sound-source azimuth. The flow chart is shown in fig. 2 and comprises the following sub-steps:

Step 2.1: frame the speech signal, and record the l-th frame (l = 1, …, L) acquired by the array as $x(l) = [x_1(l), x_2(l), \ldots, x_m(l), \ldots, x_M(l)]$, where M is the number of microphones (each microphone is one channel) and $x_m(l) = [x_m(0,l), x_m(1,l), \ldots, x_m(n,l), \ldots, x_m(N-1,l)]^T$ is the l-th frame signal collected on the m-th channel. Apply a window function to each frame and take the short-time Fourier transform; the frequency-domain representation of the l-th frame of the m-th channel is

$$X_m(k,l) = \sum_{n=0}^{N-1} b_n\, x_m(n,l)\, e^{-j2\pi kn/N} \quad (2.1)$$

where n denotes the time index, k the k-th frequency bin, and $b_n$ a Hann window of length N.

The frequency-domain signal of the M channels is defined as X(k,l):

$$X(k,l) = [X_1(k,l), X_2(k,l), \ldots, X_M(k,l)]^T, \quad 0 \le k \le N-1 \quad (2.2)$$

Preferably, in this implementation the sampling frequency is 48 kHz, the short-time Fourier transform length N is 512, and the window function $b_n$ is a Hann window of length 512.

Step 2.2: define the spatial spectral matrix of the signal as $S_X(k) = E\{X(k,l)X^H(k,l)\}$, where $E\{\cdot\}$ denotes the expectation over the L frames and the matrix elements are the cross-power spectra between channel pairs. Assuming the voice signal is incident from angle θ, the spatial-spectrum estimates of the N frequency bins are weighted and summed into the total beam power P(θ):

$$P(\theta) = \sum_{k=0}^{N-1} w_{DS}^H(\theta,k)\, S_X(k)\, w_{DS}(\theta,k) \quad (2.3)$$

where $w_{DS}(\theta,k) = [w_1(\theta,k), w_2(\theta,k), \ldots, w_M(\theta,k)]^T$ is the phase-aligned weight vector of the k-th bin and $w_{DS}^H(\theta,k)$ its conjugate transpose.

An angle search over the total beam power P(θ) yields the preliminarily estimated coarse sound-source azimuth $\hat{\theta}$.

In this implementation, according to the actual conditions, the search range of the angle θ is −90° ≤ θ ≤ +90° with an angle step of 1°.
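The coarse azimuth search of step 2 can be illustrated with a short NumPy sketch. This is a simplified stand-in rather than the patented implementation: the 2 × 6 geometry with 0.05 m spacing, the 48 kHz rate, the length-512 Hann frames, and the −90°…+90° search in 1° steps follow the text above, while the single simulated plane wave, the restriction to speech-band bins, and the helper names (`stft_frames`, `beam_power`) are illustrative assumptions.

```python
import numpy as np

C, FS, N = 343.0, 48000, 512   # speed of sound, sampling rate, frame length

def stft_frames(x, n=N):
    """Frame each channel, apply a length-n Hann window, DFT (Eqs. 2.1-2.2)."""
    M, T = x.shape
    L = T // n
    frames = x[:, :L * n].reshape(M, L, n) * np.hanning(n)
    return np.fft.fft(frames, axis=2).transpose(0, 2, 1)      # (M, K, L)

def beam_power(X, pos, thetas):
    """Total delay-and-sum beam power P(theta) of Eq. (2.3), summed over
    speech-band bins (kept below the array's spatial-aliasing limit)."""
    M, K, L = X.shape
    S = np.einsum('mkl,nkl->kmn', X, X.conj()) / L            # S_X(k)
    kmin, kmax = int(300 * K / FS), int(3000 * K / FS)
    P = np.zeros(len(thetas))
    for i, th in enumerate(thetas):
        u = np.array([np.sin(np.radians(th)), np.cos(np.radians(th))])
        tau = pos @ u / C                                     # element delays
        for k in range(kmin, kmax):
            w = np.exp(-2j * np.pi * (k * FS / K) * tau) / M  # phase alignment
            P[i] += np.real(w.conj() @ S[k] @ w)
    return P

# 2 x 6 rectangular array, 0.05 m pitch (values from the description)
pos = np.column_stack([np.tile(np.arange(6) * 0.05, 2),
                       np.repeat(np.arange(2) * 0.05, 6)])

# simulate a broadband plane wave arriving from +30 degrees
rng = np.random.default_rng(0)
s = rng.standard_normal(FS // 2)                              # 0.5 s of noise
t = np.arange(FS // 2) / FS
u30 = np.array([np.sin(np.radians(30.0)), np.cos(np.radians(30.0))])
x = np.stack([np.interp(t - d, t, s) for d in pos @ u30 / C])

thetas = np.arange(-90, 91)                                   # 1-degree steps
theta_hat = int(thetas[np.argmax(beam_power(stft_frames(x), pos, thetas))])
```

On this simulation the power peak falls within a few degrees of the true 30° azimuth; with so small an aperture the mainlobe is broad, which is exactly the coarse-resolution limitation the video stage is meant to overcome.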
Step 3: using the preliminary sound-source position information, drive the camera to face the direction of the sound source.
Step 4: establish a background model from the initial data, and perform foreground detection and adaptive background-model updating; the flow chart is shown in fig. 3 and comprises the following sub-steps:
Step 4.1: first establish a background model from the initial video data. Record the p-th captured frame as $I_p(x,y)$, where (x,y) are the pixel coordinates of the image matrix. After converting the images to grayscale, average the first S frames as the initial background:

$$B_0(x,y) = \frac{1}{S} \sum_{p=1}^{S} I_p(x,y) \quad (4.1)$$

After background modeling, subtract the background model from the current frame to obtain the foreground Target(x,y):

$$D(x,y) = I_p(x,y) - B_0(x,y) \quad (4.2)$$

where $I_p(x,y)$ is the current frame and D(x,y) the difference image; with a set threshold T, the binary matrix Target(x,y) takes the value 1 at foreground pixels (|D(x,y)| > T) and 0 elsewhere.

Step 4.2: the resulting binary foreground image suffers from broken contours, incomplete foreground regions, and similar defects, so it is post-processed with morphological opening and closing operations to finally obtain the complete foreground image $G_p(x,y)$.

When processing a video stream, the background model must be updated to follow environmental changes such as lighting. The update formula is

$$B_p(x,y) = \alpha\, I_p(x,y) + (1-\alpha)\, B_{p-1}(x,y) \quad (4.3)$$

where $B_p(x,y)$ is the background model adaptively updated with the p-th frame and 0 < α < 1 is an update factor chosen according to how fast the environment changes.

Because the horizontal extent of the target is small relative to its distance from the camera, the image coordinates and the direction angle can be treated as linearly related, so the position of the foreground image $G_p(x,y)$ is converted to an angle $\hat{\theta}$ and output to the beamforming module. The exact angle obtained in the experiment was +27°.
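Step 4 reduces to frame averaging, differencing, thresholding, a running-average update, and a linear pixel-to-angle map, which can be sketched as follows on synthetic frames. The threshold, update factor, number of averaged frames, and the 60° field of view are illustrative values, not parameters given in the patent.

```python
import numpy as np

S_FRAMES = 10   # frames averaged for the initial background (illustrative)
T = 25          # foreground threshold (illustrative)
ALPHA = 0.05    # background update factor, 0 < alpha < 1 (illustrative)

def initial_background(frames):
    """Eq. (4.1): average the first S grayscale frames."""
    return frames[:S_FRAMES].mean(axis=0)

def foreground_mask(frame, background, thresh=T):
    """Eq. (4.2) plus thresholding: 1 marks a foreground pixel."""
    return (np.abs(frame - background) > thresh).astype(np.uint8)

def update_background(background, frame, alpha=ALPHA):
    """Eq. (4.3): adaptive running-average update of the background model."""
    return alpha * frame + (1 - alpha) * background

def pixel_to_angle(col, width, fov_deg=60.0):
    """Linear image-column -> azimuth map (valid when the target is far away).
    The 60-degree horizontal field of view is an assumed camera parameter."""
    return (col / (width - 1) - 0.5) * fov_deg

# synthetic grayscale stream: near-static background, bright block appears
rng = np.random.default_rng(1)
frames = rng.integers(100, 110, size=(S_FRAMES, 120, 160)).astype(float)
B = initial_background(frames)
current = frames[0].copy()
current[40:60, 70:90] = 255.0          # the "target" entering the scene
mask = foreground_mask(current, B)
B = update_background(B, current)      # track slow environmental change
cols = np.where(mask.any(axis=0))[0]
angle = pixel_to_angle(cols.mean(), width=160)
```

A real implementation would insert the morphological opening/closing of step 4.2 between the mask and the centroid computation; on this clean synthetic frame the mask is already a solid block, so the centroid maps directly to an azimuth.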
Step 5: apply the precise angle in the adaptive beamforming algorithm to improve the signal-to-noise ratio of the voice signal; the flow chart is shown in fig. 4 and comprises the following sub-steps:
step 5.1, obtaining accurate angle information according to image processingArray response vector of corresponding target signalComprises the following steps:
wherein[p 1 ,p 2 ,...,p M ]Are the two-dimensional coordinates of the M microphone elements,is the wavelength corresponding to the kth frequency point, f k Is the frequency of the k frequency point, c represents the propagation speed of the plane wave in the medium; in the specific implementation process, the microphone arrays are uniform matrixes of 2 multiplied by 6, the distances among the microphones are all 0.05m, only the horizontal direction angle is considered, and the pitching direction angle is not considered.
Step 5.2, converting the linear constraint minimum variance beam forming into the following optimization problem:
wherein w (k, l) = [ w 1 (k,l),w 2 (k,l),...,w M (k,l)] T Weight vector, S, representing the signal of the l-th frame x (k, l) represents a spatial spectrum matrix of the l-th frame signal. Filtering according to a steepest descent adaptive algorithm:
w(k,l+1)=J(k)[w(k,l)-μX(k,l)Y * (k,l)]+F(k) (5.3)
whereinY(k,l)=w H (k, l) X (k, l) denotes a beamformed output signal, Y * (k, l) represents the complex conjugate of Y (k, l), μ ≧0 is the convergence step size, the initial weight vectorIn the specific implementation process, the selection of mu is changed according to different voice acquisition scenes, and mu is more than or equal to 0.00003 and less than or equal to 0.0001 in the experiment.
Splicing the sub-band signals into a broadband signal: Y(l) = [Y(0, l), Y(1, l), ..., Y(N−1, l)];
step 5.3: finally, inverse Discrete Fourier Transform (IDFT) is performed on Y (l) to obtain a time-domain output signal Y (l) of the l-th frame:
y(l)=IDFT[Y(l)] (5.4)
then, splicing the L frame voice signals to obtain a time domain output y (t):
y(t)=[y(1),y(2),...,y(l),...,y(L)] (5.5)
y (t) is the enhanced speech signal.
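Equations (5.4) and (5.5) amount to a per-frame inverse DFT followed by concatenation, which can be sketched as:

```python
import numpy as np

def reconstruct(Y_frames):
    """Equations (5.4)-(5.5): per-frame inverse DFT of Y(l), then plain
    concatenation of the L frame outputs into y(t).

    Y_frames : array of shape (L, N) holding Y(k, l) for each frame l.
    No overlap-add is applied; the patent text describes simple splicing
    of the frames.
    """
    y_frames = np.fft.ifft(Y_frames, axis=1).real  # y(l) = IDFT[Y(l)]
    return y_frames.reshape(-1)                    # y(t) = [y(1), ..., y(L)]
```

Framing a real signal, transforming, and reconstructing round-trips it exactly under this scheme.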
The multi-modal remote voice perception device comprises the following four modules:
a, a rectangular microphone array is 8-10 m away from a sound source;
b, a camera is arranged on the end edge of the rectangular microphone array and rotates synchronously with the microphone array;
c, the lower computer is connected with the rectangular microphone array and is responsible for control-command reception, signal acquisition, and data transmission; after receiving a 'start' control instruction sent by the upper computer, the lower computer acquires voice signals through the rectangular microphone array and uploads the data to the upper computer in real time; after receiving a 'stop' control instruction sent by the upper computer, the lower computer stops uploading data;
and d, the upper computer is connected with the camera; it receives the video signal and the voice signal sent by the lower computer, performs initial angle estimation on the target voice signal, and uses this angle to drive the camera to rotate so that it faces the sound source. It extracts the target foreground from the video image and maps the foreground coordinates to an accurate position; the high-precision azimuth parameter is transmitted to a beamforming module, whose beamformed output in that direction is the enhanced voice signal.
The connection and data transmission of the lower computer and the upper computer in the detection device are as follows:
a, determining data ports and connection interfaces of an upper computer, a lower computer, a microphone array and a camera, and establishing connection;
b, the upper computer issues a control command 'start' to start collecting audio and video data;
c, performing parallel-serial conversion on the sampling data of all channels of the rectangular microphone array, and sending an uplink data packet to an upper computer by a lower computer;
d, the upper computer sends a control command 'stop', the lower computer stops collecting data and waits for the upper computer to send the control command 'start' again;
and e, after acquisition is finished, the audio data are automatically stored as a .dat file and the video data as an .avi file.
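As an illustration of step c's parallel-to-serial transfer, the sketch below unpacks one uplink packet into per-channel samples; the 16-bit little-endian, sample-interleaved wire format is a hypothetical assumption, not specified by the patent:

```python
import numpy as np

def deinterleave(packet: bytes, n_channels: int = 12):
    """Return an (n_channels, n_samples) array from one uplink packet.

    Assumed layout: samples interleaved across channels,
    [s1_ch1, s1_ch2, ..., s1_chM, s2_ch1, ...], each a 16-bit
    little-endian integer.  The real wire format of the lower computer
    is not given in the patent.
    """
    flat = np.frombuffer(packet, dtype='<i2')   # assumed int16, little-endian
    return flat.reshape(-1, n_channels).T       # one row per microphone channel
```

The transpose yields contiguous per-channel streams, which is the layout the framing and STFT stages of step 2 expect.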
Examples
In this embodiment, the detection method is applied to remote speech sensing; the specific steps are as described above and are not repeated here.
Initial angle-of-arrival estimation of the target speech is performed by beamforming: the total beam power P(θ) is computed from −90° to +90° and plotted with the angle θ on the x-axis and the normalized power P(θ) on the y-axis. The result is shown in fig. 5, giving the rough azimuth angle of the target sound source (female voice); the estimated angle of the interfering sound source is −29°.
Fig. 6 (a) shows the original image and fig. 6 (b) the background-subtraction result; some noise and interference still affect the result. To eliminate this interference, morphological opening-closing is applied, giving the result shown in fig. 6 (c); the final sound-source localization result is shown in fig. 6 (d). The precise sound-source positions are +27° and −25°; according to the rough azimuth obtained by beamforming, the precise azimuth of the target sound source is selected and output to the beamforming module for beamformed output.
The audio processing results of the upper-computer audio-video joint algorithm are shown in figs. 7 and 8. Fig. 7 shows the signal waveforms before and after speech enhancement: after processing, the noise is significantly reduced and the signal-to-noise ratio improved. Fig. 8 shows the time-frequency diagrams before and after enhancement: after beamforming, the noise and the interference (male voice), whose energy is concentrated in the low-frequency part, are suppressed, while the target sound source (female voice) in the high-frequency part is retained and enhanced.
The beamforming results at the coarse angle and at the precise angle are evaluated with the signal-to-noise ratio and the PESQ score to check the performance of the multi-modal joint system on real data. As shown in tables 1 and 2, the SNR gain of the beamformed output at the precise azimuth reaches 12.1704 dB and the PESQ score improves by 0.655 over the single-channel signal; this outperforms beamforming at the coarse azimuth.
TABLE 1 Beamforming SNR comparison

| | Coarse angle | Precise angle |
| --- | --- | --- |
| SNR gain (dB) | 10.0168 | 12.1704 |
TABLE 2 PESQ evaluation score comparison

| | Single-channel signal | Coarse angle | Precise angle |
| --- | --- | --- | --- |
| PESQ score | 1.6458 | 1.9473 | 2.3008 |
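SNR-gain figures like those in table 1 are the difference between output and input SNR. A minimal sketch, assuming separate access to the signal and noise components (which the patent's recordings do not directly provide):

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB: 10*log10 of signal power over noise power."""
    signal = np.asarray(signal, dtype=float)
    noise = np.asarray(noise, dtype=float)
    return 10.0 * np.log10(np.sum(signal**2) / np.sum(noise**2))

def snr_gain_db(sig_in, noise_in, sig_out, noise_out):
    """SNR gain of a beamformer: output SNR minus input SNR, in dB."""
    return snr_db(sig_out, noise_out) - snr_db(sig_in, noise_in)
```

In practice the components are estimated, e.g. from speech-absent segments; PESQ scoring, by contrast, requires the ITU-T P.862 reference algorithm rather than a power ratio.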
The processing method of the invention was tested at Yongquan Square in the Yuquan campus of Zhejiang University, Hangzhou, using a 2×6 microphone planar array at a sound-source distance of 10 m, with a target sound source (+27°) and an interfering sound source (−25°), at a sampling rate of 48 kHz; the test results are good. The invention can jointly acquire remote voice and video and send them to the upper computer for processing and output.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (7)
1. A multimodal remote speech perception method, comprising the steps of:
step 1: collecting voice and video signals with a rectangular microphone array and a camera;
step 2: performing preliminary angle-of-arrival estimation on the target voice signal by beamforming to obtain a rough sound-source direction;
step 3: according to the rough sound-source direction, driving the camera to face the sound source;
step 4: establishing a background model from the initial data, and performing foreground detection and adaptive background-model updating;
step 5: transmitting the high-precision azimuth parameter corresponding to the foreground to a beamforming module, the output of the beamforming module in that direction being the enhanced voice signal.
2. The method according to claim 1, wherein the step 2 comprises the following sub-steps:
step 2.1, framing the speech signal: the l-th frame (l = 1, ..., L) acquired by the array is recorded as x(l) = [x_1(l), x_2(l), ..., x_m(l), ..., x_M(l)], where M is the number of microphones, each microphone being one channel, and x_m(l) = [x_m(0, l), x_m(1, l), ..., x_m(n, l), ..., x_m(N−1, l)]^T is the l-th frame signal collected on the m-th channel; a window function is applied to each frame and a short-time Fourier transform is performed; the Fourier transform of the l-th frame time-domain signal of the m-th channel gives the frequency-domain representation:

X_m(k, l) = Σ_{n=0}^{N−1} b_n x_m(n, l) e^(−j2πkn/N)   (2.1)

where n is the time index, k denotes the k-th frequency point, and b_n is a Hanning window of length N;
the frequency domain signal defining the M channel is X (k, l):
X(k,l)=[X 1 (k,l),X 2 (k,l),...,X M (k,l)] T ,0≤k≤N-1 (2.2)
step 2.2, defining the spatial spectrum matrix of the signal as S_X(k), with elements S_X^(ij)(k) = E[X_i(k, l) X_j*(k, l)]; assuming the incident angle of the voice signal is θ, the spatial-spectrum estimates of the N frequency points are combined by weighted summation into the total beam power P(θ):

P(θ) = Σ_{k=0}^{N−1} w_DS^H(θ, k) S_X(k) w_DS(θ, k)   (2.3)

where w_DS(θ, k) = [w_1(θ, k), w_2(θ, k), ..., w_M(θ, k)]^T is the weight vector that phase-aligns the k-th frequency point, and w_DS^H(θ, k) is the conjugate transpose of w_DS(θ, k);

an angle search over the total beam power P(θ) gives the rough azimuth angle of the sound source: θ̂ = arg max_θ P(θ).
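The coarse-search loop of step 2.2 can be sketched as follows, assuming the per-bin spatial spectrum matrices S_X(k) have already been estimated; the delay-and-sum weights and the 343 m/s sound speed are the only other ingredients, and the helper names are illustrative:

```python
import numpy as np

def beam_power(S_X, freqs, pos, thetas_deg, c=343.0):
    """Total beam power P(theta): for each candidate angle, phase-align
    each frequency point with delay-and-sum weights and accumulate
    w^H S_X w over the frequency points.

    S_X   : list of MxM spatial spectrum matrices, one per frequency
    freqs : centre frequency of each bin (Hz)
    pos   : Mx2 microphone coordinates (m)
    """
    P = np.zeros(len(thetas_deg))
    for i, th in enumerate(thetas_deg):
        u = np.array([np.sin(np.radians(th)), np.cos(np.radians(th))])
        for S, f in zip(S_X, freqs):
            a = np.exp(-2j * np.pi * f * (pos @ u) / c)  # steering vector
            w = a / len(a)                               # delay-and-sum weights
            P[i] += np.real(w.conj() @ S @ w)
    return P
```

Searching P(θ) over a −90°..+90° grid and taking the arg max recovers the rough azimuth.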
4. The method according to claim 3, wherein the step 4 comprises the following sub-steps:
step 4.1, firstly establishing a background model with the initial video data: the acquired p-th frame image is recorded as I_p(x, y), where (x, y) are the pixel coordinates of the image matrix; after conversion to a grey-level image, the first S frames are averaged as the initial background B_0(x, y):

B_0(x, y) = (1/S) Σ_{p=1}^{S} I_p(x, y)   (4.1)

after background modeling is finished, the background model is subtracted from the current frame to obtain the foreground Target(x, y):

D(x, y) = I_p(x, y) − B_0(x, y)   (4.2)

Target(x, y) = 1 if |D(x, y)| > T, else 0,

where I_p(x, y) is the current frame image, D(x, y) is the foreground difference image, T is a set threshold, and a 1 in the Target(x, y) matrix marks a foreground pixel;
step 4.2, post-processing the binarized foreground image by morphological opening and closing, finally obtaining a complete foreground image G_p(x, y);

while the video stream is processed, the background model is updated according to

B_p(x, y) = (1 − α) B_{p−1}(x, y) + α I_p(x, y)   (4.3)

where B_p(x, y) is the background model adaptively updated with the p-th frame image, and 0 < α < 1 is the update factor;
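The background modeling of steps 4.1 and 4.2 can be sketched with a running-average model and a threshold mask; the morphological opening-closing stage is omitted here for brevity, and the threshold and update-factor values are illustrative:

```python
import numpy as np

def init_background(frames):
    """B_0(x, y): mean of the first S grey-level frames (step 4.1)."""
    return np.mean(frames, axis=0)

def foreground_mask(frame, background, T=25.0):
    """Target(x, y): 1 where |I_p - B| exceeds the threshold T, else 0."""
    return (np.abs(frame - background) > T).astype(np.uint8)

def update_background(prev_bg, frame, alpha=0.05):
    """Running-average update with factor 0 < alpha < 1 (step 4.2)."""
    return (1 - alpha) * prev_bg + alpha * frame
```

A small α keeps the model stable against transient foreground objects; a larger α lets it adapt faster to illumination changes.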
5. The method according to claim 4, wherein the step 5 comprises the following sub-steps:
step 5.1, from the accurate angle θ obtained by image processing, the array response vector a(θ, k) of the corresponding target signal is:

a(θ, k) = [e^(−j2π p_1·u(θ)/λ_k), ..., e^(−j2π p_M·u(θ)/λ_k)]^T   (5.1)

where [p_1, p_2, ..., p_M] are the two-dimensional coordinates of the M microphone elements, u(θ) is the unit vector of the incident direction θ, λ_k = c/f_k is the wavelength corresponding to the k-th frequency point, f_k is the frequency of the k-th frequency point, and c represents the sound velocity;
step 5.2, converting the linearly constrained minimum variance beamforming into the optimization problem:

min_w w^H(k, l) S_X(k, l) w(k, l)   subject to   w^H(k, l) a(θ, k) = 1   (5.2)

where w(k, l) = [w_1(k, l), w_2(k, l), ..., w_M(k, l)]^T is the weight vector of the l-th frame signal and S_X(k, l) is the spatial spectrum matrix of the l-th frame signal; filtering according to the steepest-descent adaptive algorithm:
w(k, l+1) = J(k)[w(k, l) − μ X(k, l) Y*(k, l)] + F(k)   (5.3)

where J(k) = I − a(θ, k)a^H(θ, k) / (a^H(θ, k)a(θ, k)), F(k) = a(θ, k) / (a^H(θ, k)a(θ, k)), Y(k, l) = w^H(k, l)X(k, l) is the beamformed output signal, Y*(k, l) is the complex conjugate of Y(k, l), μ ≥ 0 is the convergence step size, and the initial weight vector is w(k, 1) = F(k);
splicing the sub-band signals into a broadband signal: Y(l) = [Y(0, l), Y(1, l), ..., Y(N−1, l)];
step 5.3: finally, inverse Discrete Fourier Transform (IDFT) is performed on Y (l) to obtain a time-domain output signal Y (l) of the l-th frame:
y(l)=IDFT[Y(l)] (5.4)
then, splicing the L frame voice signals to obtain a time domain output y (t):
y(t)=[y(1),y(2),...,y(l),...,y(L)] (5.5)
y (t) is the enhanced speech signal.
6. A multimodal remote speech perception apparatus, the apparatus comprising:
the rectangular microphone array is 8-10 m away from the sound source;
the camera is arranged on the end edge of the rectangular microphone array and rotates synchronously with the microphone array;
the lower computer is connected with the rectangular microphone array and is responsible for control-command reception, signal acquisition, and data transmission; after receiving a 'start' control instruction sent by the upper computer, the lower computer acquires voice signals through the rectangular microphone array and uploads the data to the upper computer in real time; after receiving a 'stop' control instruction sent by the upper computer, the lower computer stops uploading data;
the upper computer is connected with the camera; it receives the video signal and the voice signal sent by the lower computer, performs initial angle estimation on the target voice signal, and uses this angle to drive the camera to rotate so that it faces the sound source; it extracts the target foreground from the video image and maps the foreground coordinates to an accurate angle; the high-precision azimuth parameter is transmitted to a beamforming module, whose beamformed output in that direction is the enhanced voice signal.
7. The multi-modal remote voice perception device according to claim 6, wherein the connection and data transmission between the lower computer and the upper computer are as follows:
a, determining data ports and connection interfaces of an upper computer, a lower computer, a microphone array and a camera, and establishing connection;
b, the upper computer issues a control command 'start' to start collecting audio and video data;
c, performing parallel-serial conversion on the sampling data of all channels of the rectangular microphone array, and sending an uplink data packet to an upper computer by a lower computer;
d, the upper computer sends a control command 'stop', the lower computer stops collecting data and waits for the upper computer to send the control command 'start' again;
and e, after acquisition is finished, the audio data are automatically stored as a .dat file and the video data as an .avi file.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910705872.0A CN110444220B (en) | 2019-08-01 | 2019-08-01 | Multi-mode remote voice perception method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110444220A CN110444220A (en) | 2019-11-12 |
CN110444220B true CN110444220B (en) | 2023-02-10 |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013175869A (en) * | 2012-02-24 | 2013-09-05 | Nippon Telegr & Teleph Corp <Ntt> | Acoustic signal enhancement device, distance determination device, methods for the same, and program |
WO2015196729A1 (en) * | 2014-06-27 | 2015-12-30 | 中兴通讯股份有限公司 | Microphone array speech enhancement method and device |
CN106328156A (en) * | 2016-08-22 | 2017-01-11 | 华南理工大学 | Microphone array voice reinforcing system and microphone array voice reinforcing method with combination of audio information and video information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |