CN110444220B - Multi-mode remote voice perception method and device

Multi-mode remote voice perception method and device

Info

Publication number
CN110444220B
Authority
CN
China
Prior art keywords
signal
foreground
sound source
angle
image
Prior art date
Legal status
Active
Application number
CN201910705872.0A
Other languages
Chinese (zh)
Other versions
CN110444220A (en
Inventor
吴江南
顾冠杰
廉增辉
潘翔
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201910705872.0A
Publication of CN110444220A
Application granted
Publication of CN110444220B
Active legal status
Anticipated expiration

Classifications

    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Noise filtering: processing in the time domain
    • G10L21/0232 Noise filtering: processing in the frequency domain
    • G10L21/028 Voice signal separating using properties of sound source
    • G10L21/055 Time compression or expansion for synchronising with other signals, e.g. video signals
    • G10L25/45 Speech or voice analysis characterised by the type of analysis window
    • G10L25/57 Speech or voice analysis specially adapted for comparison or discrimination, for processing of video signals
    • G10L2015/223 Execution procedure of a spoken command
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming
    • H04N7/147 Communication arrangements for videophone systems, e.g. identifying the communication as a video-communication, intermediate storage of the signals
    • H04N7/183 Closed-circuit television [CCTV] systems for receiving images from a single remote source

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a multi-modal remote voice perception method and device. The perception method comprises the following steps: collecting voice and video signals with a rectangular microphone array and a camera; performing a preliminary angle-of-arrival estimation on the target voice signal by beamforming to obtain a coarse sound-source direction; driving the camera to face the sound source using this preliminary position information; establishing a background model based on the initial video data and performing foreground detection and background updating; and transmitting the high-precision azimuth parameter corresponding to the foreground to a beamforming module, whose beam output in that azimuth is the enhanced voice signal.

Description

Multi-mode remote voice perception method and device
Technical Field
The invention relates to the field of multi-modal joint sensor acquisition and voice enhancement, and in particular to a multi-modal remote voice perception method and device based on joint acquisition by a rectangular microphone area array and a camera.
Background
In recent years, remote video monitoring technology has been applied ever more widely in daily life: red-light cameras on the street, surveillance cameras in offices, various infrared detectors, thermal-imaging systems, and so on. In remote monitoring in particular, a single camera is enough for people to check the remote picture on a smartphone or other intelligent device anytime and anywhere, which brings great convenience. The use of microphones to process audio signals has also found applications in mobile phones and personal computers; these applications generally rely on systems of one or two microphones. In recent years, Amazon, Microsoft, Google and others have released products based on microphone-array technology abroad, while domestic companies such as iFLYTEK, Unisound and SoundAI also offer mature microphone hardware solutions. The pickup range of these products is within about 10 m, and they are mainly aimed at near-field voice application scenarios. However, conventional near-field voice applications increasingly fail to meet users' needs: when the scene switches to outdoor, robotic, vehicle-mounted or surveillance settings, more complex voice-controlled intelligent devices are required, and microphone-array technology therefore becomes the core of far-field voice perception.
However, remote video monitoring only processes images and cannot perceive sound, which is equally unsatisfactory. Meanwhile, with traditional voice perception technology the recognition rate of speech recognition is high enough for practical use at short range, but the performance drops sharply at long range because the received voice signal has a low signal-to-noise ratio and interfering signals are present.
The existing remote voice localization technology has the following problems:
(1) Using compressive-sensing techniques for direction estimation can improve azimuth accuracy, but requires a high signal-to-noise ratio;
(2) Convolutional beamforming methods applied to small sensor arrays also require a rather high signal-to-noise ratio while improving the azimuth-estimation accuracy;
(3) A large-scale microphone array can provide a high signal-to-noise ratio and narrow beams at the same time, but is very cumbersome in engineering use: on the one hand it occupies a large space, and on the other hand its multi-channel data processing requires a powerful signal processor.
To address the inaccuracy of remote voice localization, researchers have proposed exploiting the high spatial resolution of images to obtain an accurate sound-source position, and then combining it with a microphone array and a beamforming algorithm to enhance the voice, suppress noise, and improve voice quality.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a multi-modal remote voice perception method and a multi-modal remote voice perception device.
The purpose of the invention is realized by the following technical scheme: a multi-modal remote voice perception method comprising the following steps:
Step 1: collecting voice and video signals with a rectangular microphone array and a camera;
Step 2: performing a preliminary angle-of-arrival estimation on the target voice signal by beamforming to obtain a coarse sound-source direction;
Step 3: driving the camera to face the sound-source direction according to the coarse sound-source position;
Step 4: establishing a background model based on the initial data, and performing foreground extraction and adaptive background-model updating;
Step 5: mapping the foreground spatial position to a high-precision azimuth, transmitting the high-precision azimuth parameter to a beamforming module, and taking the beamforming output in that azimuth as the enhanced voice signal.
Further, the step 2 specifically comprises the following sub-steps:

Step 2.1: frame the voice signal. The l-th frame (l = 1, ..., L) acquired by the array is written as x(l) = [x_1(l), x_2(l), ..., x_m(l), ..., x_M(l)], where M is the number of microphones, each microphone being one channel, and x_m(l) = [x_m(0,l), x_m(1,l), ..., x_m(n,l), ..., x_m(N-1,l)]^T is the l-th frame collected on the m-th channel. A window function is applied to each frame and a short-time Fourier transform is performed; the frequency-domain representation of the l-th frame of the m-th channel is

$$X_m(k,l)=\sum_{n=0}^{N-1} b_n\,x_m(n,l)\,e^{-j2\pi kn/N} \qquad (2.1)$$

where n denotes the time index, k the k-th frequency bin, and b_n a Hanning window of length N.

The frequency-domain signal of the M channels is defined as X(k,l):

$$X(k,l)=[X_1(k,l),X_2(k,l),\dots,X_M(k,l)]^T,\quad 0\le k\le N-1 \qquad (2.2)$$

Step 2.2: define the spatial spectrum matrix of the signal as S_X(k) = E{X(k,l) X^H(k,l)}, where E{·} denotes the expectation over the L frames, with matrix elements

$$[S_X(k)]_{ij}=\frac{1}{L}\sum_{l=1}^{L}X_i(k,l)\,X_j^{*}(k,l)$$

Assuming the incidence angle of the voice signal is θ, the spatial-spectrum estimates of the N frequency bins are weighted and summed to obtain the total beam power P(θ):

$$P(\theta)=\sum_{k=0}^{N-1}w_{DS}^{H}(\theta,k)\,S_X(k)\,w_{DS}(\theta,k) \qquad (2.3)$$

where w_DS(θ,k) = [w_1(θ,k), w_2(θ,k), ..., w_M(θ,k)]^T is the phase-aligned weight vector of the k-th frequency bin and w_DS^H(θ,k) is the conjugate transpose of w_DS(θ,k).

An angle search over the total beam power P(θ) gives the preliminarily estimated coarse sound-source azimuth:

$$\hat{\theta}_c=\arg\max_{\theta}P(\theta) \qquad (2.4)$$
Further, the step 3 specifically comprises the following sub-step:

Step 3.1: according to the azimuth angle $\hat{\theta}_c$ obtained in step 2, the approximate direction of the sound source is judged and the camera is driven to face the sound-source direction.
Further, the step 4 specifically comprises the following sub-steps:

Step 4.1: first establish a background model from the initial video data. The p-th collected image is denoted I_p(x,y), where (x,y) are the pixel coordinates of the image matrix. After converting the images to grayscale, the first S frames are averaged as the initial background B_0(x,y):

$$B_0(x,y)=\frac{1}{S}\sum_{p=1}^{S}I_p(x,y) \qquad (4.1)$$

After background modelling, the background model is subtracted from the current frame to obtain the foreground Target(x,y):

$$D(x,y)=I_p(x,y)-B_0(x,y) \qquad (4.2)$$

$$\mathrm{Target}(x,y)=\begin{cases}1, & |D(x,y)|>T\\ 0, & |D(x,y)|\le T\end{cases} \qquad (4.3)$$

where I_p(x,y) is the current frame image, D(x,y) the difference image, T a preset threshold, and a value of 1 in the Target(x,y) matrix marks a foreground pixel.

Step 4.2: the obtained binary foreground image may show broken contours, an incomplete foreground and similar defects, so it is post-processed with morphological opening and closing operations, finally giving a complete foreground image G_p(x,y).

When the video stream is processed, the background model must be updated because of environmental changes such as lighting. The update formula is

$$B_p(x,y)=\alpha\,I_p(x,y)+(1-\alpha)\,B_{p-1}(x,y) \qquad (4.4)$$

where B_p(x,y) is the background model adaptively updated with the p-th frame image and 0 < α < 1 is an update factor that varies with the environment.

Because the horizontal extent of the target is small relative to its distance from the camera, the image coordinate and the bearing can be treated as linearly related, so the position of the foreground is converted into an angle $\hat{\theta}_v$ and output to the beamforming module.
Further, the step 5 specifically comprises the following sub-steps:

Step 5.1: from the accurate angle $\hat{\theta}_v$ obtained by the image processing, the array response vector $a(\hat{\theta}_v,k)$ of the target signal is

$$a(\hat{\theta}_v,k)=\left[e^{-j\frac{2\pi}{\lambda_k}\,\mathbf{p}_1\cdot\mathbf{u}(\hat{\theta}_v)},\;e^{-j\frac{2\pi}{\lambda_k}\,\mathbf{p}_2\cdot\mathbf{u}(\hat{\theta}_v)},\;\dots,\;e^{-j\frac{2\pi}{\lambda_k}\,\mathbf{p}_M\cdot\mathbf{u}(\hat{\theta}_v)}\right]^{T} \qquad (5.1)$$

where $\mathbf{u}(\hat{\theta}_v)$ is the unit vector pointing in the source direction, [p_1, p_2, ..., p_M] are the two-dimensional coordinates of the M microphone elements, λ_k = c/f_k is the wavelength corresponding to the k-th frequency bin, f_k is the frequency of the k-th bin, and c is the propagation speed of the plane wave in the medium.

Step 5.2: linearly constrained minimum variance beamforming is converted into the optimization problem

$$\min_{w(k,l)}\;w^{H}(k,l)\,S_X(k,l)\,w(k,l)\quad\text{s.t.}\quad w^{H}(k,l)\,a(\hat{\theta}_v,k)=1 \qquad (5.2)$$

where w(k,l) = [w_1(k,l), w_2(k,l), ..., w_M(k,l)]^T is the weight vector of the l-th frame signal and S_X(k,l) is the spatial spectrum matrix of the l-th frame signal. Filtering follows the steepest-descent adaptive algorithm:

$$w(k,l+1)=J(k)\left[w(k,l)-\mu\,X(k,l)\,Y^{*}(k,l)\right]+F(k) \qquad (5.3)$$

where

$$J(k)=I-\frac{a(\hat{\theta}_v,k)\,a^{H}(\hat{\theta}_v,k)}{a^{H}(\hat{\theta}_v,k)\,a(\hat{\theta}_v,k)},\qquad F(k)=\frac{a(\hat{\theta}_v,k)}{a^{H}(\hat{\theta}_v,k)\,a(\hat{\theta}_v,k)}$$

Y(k,l) = w^H(k,l) X(k,l) denotes the beamformed output signal, Y^*(k,l) is the complex conjugate of Y(k,l), μ ≥ 0 is the convergence step size, and the initial weight vector is w(k,1) = F(k).

The sub-band signals are spliced into a wide-band signal Y(l) = [Y(0,l), Y(1,l), ..., Y(N-1,l)].

Step 5.3: finally, an inverse discrete Fourier transform (IDFT) is applied to Y(l) to obtain the time-domain output signal y(l) of the l-th frame:

$$y(l)=\mathrm{IDFT}[Y(l)] \qquad (5.4)$$

The L frames of voice signal are then spliced to obtain the time-domain output

$$y(t)=[y(1),y(2),\dots,y(l),\dots,y(L)] \qquad (5.5)$$

y(t) is the enhanced voice signal.
It is another object of the present invention to provide a multi-modal remote voice perception apparatus, comprising:

a rectangular microphone array placed 8-10 m from the sound source;

a camera mounted on an end edge of the rectangular microphone array and rotating synchronously with the microphone array;

a lower computer connected to the rectangular microphone array and used for control-command reception, signal acquisition and data transmission; after receiving a "start" control command from the upper computer, the lower computer acquires voice signals through the rectangular microphone array and uploads the data to the upper computer in real time; after receiving a "stop" control command from the upper computer, it stops uploading data;

an upper computer connected to the camera, which receives the video signal and the voice signal sent by the lower computer, performs the preliminary angle estimation on the target voice signal, and drives the camera to face the sound source using this angle; it extracts the target foreground from the video image, maps the foreground coordinates to an accurate bearing, transmits the high-precision azimuth parameter to the beamforming module, and takes the beamforming output in that azimuth as the enhanced voice signal.
Further, the connection and data transmission between the lower computer and the upper computer are as follows:
a. determine the data ports and connection interfaces of the upper computer, the lower computer, the microphone array and the camera, and establish the connections;
b. the upper computer issues the control command "start" to begin collecting audio and video data;
c. the sampled data of all channels of the rectangular microphone array are parallel-to-serial converted, and the lower computer sends the uplink data packets to the upper computer;
d. when the upper computer sends the control command "stop", the lower computer stops collecting data and waits for the upper computer to send "start" again;
e. after acquisition is finished, the audio data are automatically saved to a .dat file and the video data to an .avi file.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention uses an audio-video joint voice localization method; adding video-based localization of the person makes it easy to obtain an accurate azimuth for each sound source, avoiding the low azimuth-estimation resolution of traditional beamforming and its inability to clearly distinguish multiple sound sources.
(2) The invention uses the angle returned by the image processing together with the microphone array to enhance the remote voice signal, addressing the weakened energy and low signal-to-noise ratio of a remote voice signal after propagation through space.
(3) The invention uses an adaptive linearly constrained minimum variance beamformer to suppress incoherent noise and interfering signals, addressing the severe noise interference of distant voice signals.
(4) Based on the three characteristics above, the invention realizes outdoor remote voice perception and has good practical value.
Drawings
FIG. 1 is a general flow chart of the multimodal remote speech perception method of the present invention;
FIG. 2 is a flow chart of the preliminary estimation of the azimuth of the sound source according to the present invention;
FIG. 3 is a flow chart of the image processing output accurate sound source azimuth of the present invention;
FIG. 4 is a flow chart of the adaptive beamforming for enhancing speech signals in the present invention;
FIG. 5 is a beam pattern diagram of primary positioning of upper computer beam forming in the present invention;
FIG. 6 is a diagram of the result of obtaining high-precision voice direction by video processing according to the present invention;
FIG. 7 is a waveform diagram of signals before and after speech enhancement according to the present invention;
FIG. 8 is a time-frequency diagram of signals before and after speech enhancement according to the present invention.
Detailed Description
The objects and effects of the present invention will become more apparent from the following detailed description of the present invention when taken in conjunction with the accompanying drawings.
Figure 1 shows the general flow diagram of the invention. The multi-modal remote voice perception method comprises five steps: first, a rectangular microphone array and a camera collect the voice and video signals; a preliminary azimuth estimate is made from the signals; according to the coarse angle-of-arrival estimate, adaptive background-modelling target detection is used to obtain the accurate azimuth of the sound source; based on the accurate azimuth obtained from the image processing, adaptive filtering of the voice signals is realized with linearly constrained minimum variance beamforming and the steepest-descent algorithm; finally, the enhanced, clear voice signal is output.
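Purely as an illustration of how these five stages chain together (not code from the patent), the following Python sketch drives the hypothetical helper functions introduced alongside steps 2, 4 and 5 below; the mic_stream and camera objects, the steer_camera callback, and the coordinate bookkeeping between the camera and array frames are all assumptions.

```python
def perceive_remote_speech(mic_stream, camera, steer_camera):
    """Hypothetical top-level flow of Fig. 1 (sketch only)."""
    x = mic_stream.read()                                  # step 1: joint audio/video acquisition
    theta_c, _ = coarse_doa(x)                             # step 2: coarse DOA by beamforming
    steer_camera(theta_c)                                  # step 3: point the camera at the source
    theta_v, _ = next(foreground_angle(camera.frames()))   # step 4: video-refined azimuth
    return lcmv_enhance(mic_stream.read(), theta_v)        # step 5: adaptive beamforming output
```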
The detection method of the invention has the following specific implementation modes:
step 1: placing the rectangular microphone array and the camera at the same angle, and collecting audio and video signals;
Step 2: estimate the angle of arrival of the target voice signal to obtain a coarse sound-source azimuth. The flow chart is shown in Fig. 2 and comprises the following sub-steps:
Step 2.1: frame the voice signal. The l-th frame (l = 1, ..., L) acquired by the array is written as x(l) = [x_1(l), x_2(l), ..., x_m(l), ..., x_M(l)], where M is the number of microphones, each microphone being one channel, and x_m(l) = [x_m(0,l), x_m(1,l), ..., x_m(n,l), ..., x_m(N-1,l)]^T is the l-th frame collected on the m-th channel. A window function is applied to each frame and a short-time Fourier transform is performed; the frequency-domain representation of the l-th frame of the m-th channel is

$$X_m(k,l)=\sum_{n=0}^{N-1} b_n\,x_m(n,l)\,e^{-j2\pi kn/N} \qquad (2.1)$$

where n denotes the time index, k the k-th frequency bin, and b_n a Hanning window of length N.

The frequency-domain signal of the M channels is defined as X(k,l):

$$X(k,l)=[X_1(k,l),X_2(k,l),\dots,X_M(k,l)]^T,\quad 0\le k\le N-1 \qquad (2.2)$$

Preferably, in this implementation the sampling frequency is 48 kHz, the length N of the short-time Fourier transform is 512, and the window function b_n is a Hanning window of length 512.

Step 2.2: define the spatial spectrum matrix of the signal as S_X(k) = E{X(k,l) X^H(k,l)}, where E{·} denotes the expectation over the L frames, with matrix elements

$$[S_X(k)]_{ij}=\frac{1}{L}\sum_{l=1}^{L}X_i(k,l)\,X_j^{*}(k,l)$$

Assuming the incidence angle of the voice signal is θ, the spatial-spectrum estimates of the N frequency bins are weighted and summed to obtain the total beam power P(θ):

$$P(\theta)=\sum_{k=0}^{N-1}w_{DS}^{H}(\theta,k)\,S_X(k)\,w_{DS}(\theta,k) \qquad (2.3)$$

where w_DS(θ,k) = [w_1(θ,k), w_2(θ,k), ..., w_M(θ,k)]^T is the phase-aligned weight vector of the k-th frequency bin and w_DS^H(θ,k) is the conjugate transpose of w_DS(θ,k).

An angle search over the total beam power P(θ) gives the preliminarily estimated coarse sound-source azimuth:

$$\hat{\theta}_c=\arg\max_{\theta}P(\theta) \qquad (2.4)$$

In the specific implementation, the search range of the angle θ is chosen according to the actual situation as -90° ≤ θ ≤ +90°, with an angle step of 1°.
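As a concrete illustration of equations (2.1)-(2.4) (a sketch under stated assumptions, not the authors' implementation), the following NumPy function scans the beam power over the same -90° to +90° range with a 1° step; the 2 × 6, 0.05 m element layout anticipates the geometry given in step 5, the framing is simplified to non-overlapping 512-sample frames, and the element layout and look-direction convention are assumptions.

```python
import numpy as np

def coarse_doa(x, fs=48000, nfft=512, d=0.05, n_rows=2, n_cols=6, c=343.0,
               angles=np.arange(-90, 91, 1)):
    """Coarse azimuth estimate by frequency-domain delay-and-sum beamforming.

    x: (num_samples, M) multichannel recording, M = n_rows * n_cols.
    Returns the angle (degrees) maximizing the summed beam power P(theta), and P itself.
    """
    num_samples, M = x.shape
    win = np.hanning(nfft)
    n_frames = num_samples // nfft

    # Short-time Fourier transform of every channel, frame by frame (eq. 2.1).
    X = np.empty((nfft, n_frames, M), dtype=complex)
    for l in range(n_frames):
        seg = x[l * nfft:(l + 1) * nfft, :] * win[:, None]
        X[:, l, :] = np.fft.fft(seg, axis=0)

    # Spatial spectrum matrix per frequency bin, averaged over the frames.
    S = np.einsum('klm,kln->kmn', X, X.conj()) / n_frames    # (nfft, M, M)

    # Microphone coordinates of the 2 x 6 rectangular array (assumed layout).
    px, py = np.meshgrid(np.arange(n_cols) * d, np.arange(n_rows) * d)
    p = np.stack([px.ravel(), py.ravel()], axis=1)            # (M, 2)

    freqs = np.fft.fftfreq(nfft, 1.0 / fs)
    k_bins = np.arange(1, nfft // 2)                          # skip DC, positive bins only

    P = np.zeros(len(angles))
    for i, theta in enumerate(np.deg2rad(angles)):
        u = np.array([np.sin(theta), np.cos(theta)])          # assumed look-direction vector
        delays = p @ u / c                                    # per-element delays (s)
        for k in k_bins:
            w = np.exp(-2j * np.pi * freqs[k] * delays) / M   # phase-aligned weights
            P[i] += np.real(w.conj() @ S[k] @ w)              # w^H S_X(k) w  (eq. 2.3)
    return angles[np.argmax(P)], P
```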
Step 3: using the preliminary sound-source position information, the camera is driven to face the sound-source direction.
Step 4: establish a background model based on the initial data and perform foreground detection and adaptive background-model updating; the flow chart is shown in Fig. 3 and comprises the following sub-steps:
Step 4.1: first establish a background model from the initial video data. The p-th collected image is denoted I_p(x,y), where (x,y) are the pixel coordinates of the image matrix. After converting the images to grayscale, the first S frames are averaged as the initial background B_0(x,y):

$$B_0(x,y)=\frac{1}{S}\sum_{p=1}^{S}I_p(x,y) \qquad (4.1)$$

After background modelling, the background model is subtracted from the current frame to obtain the foreground Target(x,y):

$$D(x,y)=I_p(x,y)-B_0(x,y) \qquad (4.2)$$

$$\mathrm{Target}(x,y)=\begin{cases}1, & |D(x,y)|>T\\ 0, & |D(x,y)|\le T\end{cases} \qquad (4.3)$$

where I_p(x,y) is the current frame image, D(x,y) the difference image, T a preset threshold, and a value of 1 in the Target(x,y) matrix marks a foreground pixel.

Step 4.2: the obtained binary foreground image may show broken contours, an incomplete foreground and similar defects, so it is post-processed with morphological opening and closing operations, finally giving a complete foreground image G_p(x,y).

When the video stream is processed, the background model must be updated because of environmental changes such as lighting. The update formula is

$$B_p(x,y)=\alpha\,I_p(x,y)+(1-\alpha)\,B_{p-1}(x,y) \qquad (4.4)$$

where B_p(x,y) is the background model adaptively updated with the p-th frame image and 0 < α < 1 is an update factor that varies with the environment.

Because the horizontal extent of the target is small relative to its distance from the camera, the image coordinates and the bearing can be treated as linearly related, so the position of the foreground image G_p(x,y) is converted into an angle $\hat{\theta}_v$ and output to the beamforming module. The accurate angle obtained in the experiment was +27°.
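A possible OpenCV realization of this step (a sketch, not the patent's code): the structuring-element size, the 60° field-of-view value and the column-averaging used for the pixel-to-angle mapping are assumptions, while the mean background, frame differencing, opening/closing and running-average update follow equations (4.1)-(4.4).

```python
import cv2
import numpy as np

def foreground_angle(frames, n_init=30, thresh=30, alpha=0.05, fov_deg=60.0):
    """Averaged-background modelling, frame differencing, morphological clean-up,
    and a linear pixel-to-angle mapping; yields (angle, foreground mask) per frame.

    frames: iterable of BGR images from the camera facing the coarse direction.
    fov_deg: assumed horizontal field of view used for the linear mapping.
    """
    frames = iter(frames)
    # Initial background B_0: mean of the first n_init grayscale frames (eq. 4.1).
    first = [cv2.cvtColor(next(frames), cv2.COLOR_BGR2GRAY).astype(np.float32)
             for _ in range(n_init)]
    bg = np.mean(first, axis=0)

    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        diff = cv2.absdiff(gray, bg)                         # D(x, y), eq. (4.2)
        target = (diff > thresh).astype(np.uint8) * 255      # Target(x, y), eq. (4.3)
        # Morphological opening then closing to remove noise and fill holes.
        clean = cv2.morphologyEx(target, cv2.MORPH_OPEN, kernel)
        clean = cv2.morphologyEx(clean, cv2.MORPH_CLOSE, kernel)
        # Adaptive background update (eq. 4.4).
        bg = alpha * gray + (1 - alpha) * bg

        cols = np.where(clean.any(axis=0))[0]
        if cols.size:
            # Linear image-column -> azimuth mapping around the camera axis.
            x_center = cols.mean()
            angle = (x_center / clean.shape[1] - 0.5) * fov_deg
            yield angle, clean
```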
Step 5: the accurate angle is applied in the adaptive beamforming algorithm to improve the signal-to-noise ratio of the voice signal. The flow chart is shown in Fig. 4 and comprises the following sub-steps:
Step 5.1: from the accurate angle $\hat{\theta}_v$ obtained by the image processing, the array response vector $a(\hat{\theta}_v,k)$ of the target signal is

$$a(\hat{\theta}_v,k)=\left[e^{-j\frac{2\pi}{\lambda_k}\,\mathbf{p}_1\cdot\mathbf{u}(\hat{\theta}_v)},\;e^{-j\frac{2\pi}{\lambda_k}\,\mathbf{p}_2\cdot\mathbf{u}(\hat{\theta}_v)},\;\dots,\;e^{-j\frac{2\pi}{\lambda_k}\,\mathbf{p}_M\cdot\mathbf{u}(\hat{\theta}_v)}\right]^{T} \qquad (5.1)$$

where $\mathbf{u}(\hat{\theta}_v)$ is the unit vector pointing in the source direction, [p_1, p_2, ..., p_M] are the two-dimensional coordinates of the M microphone elements, λ_k = c/f_k is the wavelength corresponding to the k-th frequency bin, f_k is the frequency of the k-th bin, and c is the propagation speed of the plane wave in the medium. In the specific implementation, the microphone array is a uniform 2 × 6 matrix with 0.05 m spacing between microphones; only the horizontal angle is considered and the pitch angle is ignored.

Step 5.2: linearly constrained minimum variance beamforming is converted into the optimization problem

$$\min_{w(k,l)}\;w^{H}(k,l)\,S_X(k,l)\,w(k,l)\quad\text{s.t.}\quad w^{H}(k,l)\,a(\hat{\theta}_v,k)=1 \qquad (5.2)$$

where w(k,l) = [w_1(k,l), w_2(k,l), ..., w_M(k,l)]^T is the weight vector of the l-th frame signal and S_X(k,l) is the spatial spectrum matrix of the l-th frame signal. Filtering follows the steepest-descent adaptive algorithm:

$$w(k,l+1)=J(k)\left[w(k,l)-\mu\,X(k,l)\,Y^{*}(k,l)\right]+F(k) \qquad (5.3)$$

where

$$J(k)=I-\frac{a(\hat{\theta}_v,k)\,a^{H}(\hat{\theta}_v,k)}{a^{H}(\hat{\theta}_v,k)\,a(\hat{\theta}_v,k)},\qquad F(k)=\frac{a(\hat{\theta}_v,k)}{a^{H}(\hat{\theta}_v,k)\,a(\hat{\theta}_v,k)}$$

Y(k,l) = w^H(k,l) X(k,l) denotes the beamformed output signal, Y^*(k,l) is the complex conjugate of Y(k,l), μ ≥ 0 is the convergence step size, and the initial weight vector is w(k,1) = F(k). In the specific implementation, the choice of μ varies with the voice-acquisition scene; in this experiment 0.00003 ≤ μ ≤ 0.0001.

The sub-band signals are spliced into a wide-band signal Y(l) = [Y(0,l), Y(1,l), ..., Y(N-1,l)].

Step 5.3: finally, an inverse discrete Fourier transform (IDFT) is applied to Y(l) to obtain the time-domain output signal y(l) of the l-th frame:

$$y(l)=\mathrm{IDFT}[Y(l)] \qquad (5.4)$$

The L frames of voice signal are then spliced to obtain the time-domain output

$$y(t)=[y(1),y(2),\dots,y(l),\dots,y(L)] \qquad (5.5)$$

y(t) is the enhanced voice signal.
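The following NumPy sketch puts equations (5.1)-(5.3) and the frame splicing of (5.4)-(5.5) together under the same assumed 2 × 6 array layout and angle convention as the earlier sketch; it processes non-overlapping frames for simplicity and illustrates the steepest-descent LCMV update rather than reproducing the authors' code.

```python
import numpy as np

def lcmv_enhance(x, theta_v_deg, fs=48000, nfft=512, d=0.05,
                 n_rows=2, n_cols=6, c=343.0, mu=5e-5):
    """Frequency-domain LCMV beamforming with the steepest-descent update (eq. 5.3).

    x: (num_samples, M) multichannel signal; theta_v_deg: accurate azimuth
    from the video processing. Array layout and angle convention are assumed.
    """
    num_samples, M = x.shape
    win = np.hanning(nfft)
    n_frames = num_samples // nfft
    freqs = np.fft.fftfreq(nfft, 1.0 / fs)

    # Steering vectors a(theta_v, k) for every frequency bin (eq. 5.1).
    px, py = np.meshgrid(np.arange(n_cols) * d, np.arange(n_rows) * d)
    p = np.stack([px.ravel(), py.ravel()], axis=1)
    theta = np.deg2rad(theta_v_deg)
    u = np.array([np.sin(theta), np.cos(theta)])
    delays = p @ u / c
    a = np.exp(-2j * np.pi * freqs[:, None] * delays[None, :])     # (nfft, M)

    # Constraint matrices of the steepest-descent LCMV update.
    norm = np.sum(np.abs(a) ** 2, axis=1, keepdims=True)           # a^H a (= M here)
    F = a / norm                                                    # F(k): delay-and-sum start
    J = np.eye(M)[None, :, :] - np.einsum('km,kn->kmn', a, a.conj()) / norm[:, :, None]
    w = F.copy()                                                    # initial weights w(k,1) = F(k)

    y = np.zeros(n_frames * nfft)
    for l in range(n_frames):
        seg = x[l * nfft:(l + 1) * nfft, :] * win[:, None]
        X = np.fft.fft(seg, axis=0)                                 # (nfft, M)
        Y = np.sum(w.conj() * X, axis=1)                            # Y(k,l) = w^H X
        # w(k, l+1) = J(k)[w(k,l) - mu X(k,l) Y*(k,l)] + F(k)   (eq. 5.3)
        grad = w - mu * X * Y.conj()[:, None]
        w = np.einsum('kmn,kn->km', J, grad) + F
        y[l * nfft:(l + 1) * nfft] = np.real(np.fft.ifft(Y))        # eq. (5.4)
    return y                                                        # eq. (5.5): spliced frames
```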
The multi-modal remote voice perception device comprises the following four modules:
a. a rectangular microphone array placed 8-10 m from the sound source;
b. a camera mounted on an end edge of the rectangular microphone array and rotating synchronously with the microphone array;
c. a lower computer connected to the rectangular microphone array and used for control-command reception, signal acquisition and data transmission; after receiving a "start" control command from the upper computer, the lower computer acquires voice signals through the rectangular microphone array and uploads the data to the upper computer in real time; after receiving a "stop" control command from the upper computer, it stops uploading data;
d. an upper computer connected to the camera, which receives the video signal and the voice signal sent by the lower computer, performs the preliminary angle estimation on the target voice signal, and drives the camera to face the sound source using this angle; it extracts the target foreground from the video image, maps the foreground coordinates to an accurate bearing, transmits the high-precision azimuth parameter to the beamforming module, and takes the beamforming output in that azimuth as the enhanced voice signal.
The connection and data transmission between the lower computer and the upper computer in the device are as follows:
a. determine the data ports and connection interfaces of the upper computer, the lower computer, the microphone array and the camera, and establish the connections;
b. the upper computer issues the control command "start" to begin collecting audio and video data;
c. the sampled data of all channels of the rectangular microphone array are parallel-to-serial converted, and the lower computer sends the uplink data packets to the upper computer;
d. when the upper computer sends the control command "stop", the lower computer stops collecting data and waits for the upper computer to send the control command "start" again;
e. after acquisition is finished, the audio data are automatically saved to a .dat file and the video data to an .avi file.
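As an illustration only (the patent does not specify the physical link or the packet format, so the transport object and method names below are hypothetical), the upper-computer side of steps a-e could be organized roughly as follows:

```python
class UpperComputer:
    """Sketch of the upper-computer control flow; `link` is an assumed
    transport object exposing send(bytes) and recv() -> bytes."""

    def __init__(self, link, audio_path="capture.dat"):
        self.link = link
        self.audio_path = audio_path   # video is saved separately as an .avi file

    def start_acquisition(self):
        self.link.send(b"start")       # step b: lower computer begins acquisition

    def receive_audio(self, n_packets):
        # step c: uplink packets carry the parallel-to-serial converted
        # multichannel samples; here they are simply appended to the .dat file.
        with open(self.audio_path, "ab") as f:
            for _ in range(n_packets):
                f.write(self.link.recv())

    def stop_acquisition(self):
        self.link.send(b"stop")        # step d: lower computer stops and waits for "start"
```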
Examples
In this embodiment, the detection method is applied to remote speech sensing, and the specific steps are as described above and will not be described herein again.
A preliminary angle-of-arrival estimate of the target voice is obtained by beamforming: the total beam power P(θ) is computed from -90° to +90° and plotted with the angle θ on the x-axis and the normalized power P(θ) on the y-axis. The result is shown in Fig. 5 and gives the coarse azimuth $\hat{\theta}_c$ of the target sound source (female voice); the estimated angle of the interfering sound source is -29°.
Fig. 6(a) shows the original image and Fig. 6(b) the background-subtraction result; some noise and interference can be seen to affect the result. To eliminate them, morphological opening and closing are applied, giving the result in Fig. 6(c); the final sound-source localization result is shown in Fig. 6(d). The accurate sound-source positions are +27° and -25°; according to the coarse azimuth obtained by beamforming, the accurate azimuth $\hat{\theta}_v$ of the target sound source is selected and output to the beamforming module, which then produces the beamformed output.
The audio-processing results of the upper computer's audio-video joint algorithm are shown in Figs. 7 and 8. Fig. 7 shows the signal waveforms before and after speech enhancement: after processing, the noise is significantly reduced and the signal-to-noise ratio is improved. Fig. 8 shows the time-frequency representations before and after enhancement: after beamforming, the noise and the interference (male voice), whose energy is concentrated in the low-frequency band, are suppressed, while the target sound source (female voice) in the higher-frequency band is retained and enhanced.
The beamforming outputs at the coarse angle and the accurate angle are evaluated with the signal-to-noise ratio and the PESQ score to check the performance of the multi-modal joint system on real data. As shown in Tables 1 and 2, the SNR gain of the beamforming output at the accurate azimuth reaches 12.1704 dB and its PESQ score improves by 0.655 over the single-channel signal; its performance is better than that of the beamforming output at the coarse azimuth.

TABLE 1  Beamforming SNR comparison
                        Coarse angle    Accurate angle
SNR gain (dB)           10.0168         12.1704

TABLE 2  PESQ evaluation score comparison
                Single-channel signal   Coarse angle   Accurate angle
PESQ score      1.6458                  1.9473         2.3008
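For reference, the two reported metrics could be computed as below (a sketch; it assumes a time-aligned clean reference signal is available, which in the outdoor test would have to be recorded separately, and it uses the open-source `pesq` package after resampling the 48 kHz recordings to 16 kHz):

```python
import numpy as np
from pesq import pesq   # pip install pesq (ITU-T P.862 implementation)

def snr_db(reference, signal):
    """SNR of `signal` against the clean `reference` (both 1-D, time-aligned)."""
    noise = signal - reference
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def evaluate(clean, single_channel, enhanced, fs=16000):
    """SNR gain of the beamformer output over one microphone channel,
    plus wideband PESQ scores, in the spirit of Tables 1 and 2."""
    gain = snr_db(clean, enhanced) - snr_db(clean, single_channel)
    scores = {name: pesq(fs, clean, sig, 'wb')
              for name, sig in [("single channel", single_channel),
                                ("enhanced", enhanced)]}
    return gain, scores
```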
The proposed processing method was tested at Yongquan Square on the Yuquan campus of Zhejiang University in Hangzhou, using a 2 × 6 microphone area array with the sound sources 10 m away, comprising a target sound source (27°) and an interfering sound source (25°); the sampling rate was 48 kHz and the test results were good. The invention can thus perform joint acquisition of remote voice and video and send the data to the upper computer for processing and output.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (7)

1. A multimodal remote speech perception method, comprising the steps of:
step 1: collecting voice and video signals with a rectangular microphone array and a camera;
step 2: performing a preliminary angle-of-arrival estimation on a target voice signal by beamforming to obtain a coarse sound-source direction;
step 3: driving the camera to face the sound-source direction according to the coarse sound-source position;
step 4: establishing a background model based on the initial data, and performing foreground detection and adaptive background-model updating;
step 5: transmitting the high-precision azimuth parameter corresponding to the foreground to a beamforming module, the output of the beamforming module in that azimuth being the enhanced voice signal.
2. The method according to claim 1, wherein the step 2 comprises the following sub-steps:

step 2.1: framing the voice signal, the l-th frame (l = 1, ..., L) acquired by the array being written as x(l) = [x_1(l), x_2(l), ..., x_m(l), ..., x_M(l)], where M is the number of microphones, each microphone being one channel, and x_m(l) = [x_m(0,l), x_m(1,l), ..., x_m(n,l), ..., x_m(N-1,l)]^T is the l-th frame collected on the m-th channel; applying a window function to each frame and performing a short-time Fourier transform, the frequency-domain representation of the l-th frame of the m-th channel being

$$X_m(k,l)=\sum_{n=0}^{N-1} b_n\,x_m(n,l)\,e^{-j2\pi kn/N} \qquad (2.1)$$

where n denotes the time index, k the k-th frequency bin, and b_n a Hanning window of length N;

the frequency-domain signal of the M channels being defined as X(k,l):

$$X(k,l)=[X_1(k,l),X_2(k,l),\dots,X_M(k,l)]^T,\quad 0\le k\le N-1 \qquad (2.2)$$

step 2.2: defining the spatial spectrum matrix of the signal as S_X(k), with elements

$$[S_X(k)]_{ij}=\frac{1}{L}\sum_{l=1}^{L}X_i(k,l)\,X_j^{*}(k,l)$$

assuming the incidence angle of the voice signal is θ, the spatial-spectrum estimates of the N frequency bins are weighted and summed to obtain the total beam power P(θ):

$$P(\theta)=\sum_{k=0}^{N-1}w_{DS}^{H}(\theta,k)\,S_X(k)\,w_{DS}(\theta,k) \qquad (2.3)$$

where w_DS(θ,k) = [w_1(θ,k), w_2(θ,k), ..., w_M(θ,k)]^T is the phase-aligned weight vector of the k-th frequency bin and w_DS^H(θ,k) is the conjugate transpose of w_DS(θ,k);

an angle search over the total beam power P(θ) giving the coarse sound-source azimuth:

$$\hat{\theta}_c=\arg\max_{\theta}P(\theta) \qquad (2.4)$$
3. The method according to claim 2, wherein the step 3 comprises the following sub-step:
step 3.1: according to the azimuth angle $\hat{\theta}_c$ obtained in step 2, judging the approximate direction of the sound source and driving the camera to face the sound-source direction.
4. The method according to claim 3, wherein the step 4 comprises the following sub-steps:

step 4.1: first establishing a background model from the initial video data, the p-th collected image being denoted I_p(x,y), where (x,y) are the pixel coordinates of the image matrix; after converting the images to grayscale, averaging the first S frames as the initial background B_0(x,y):

$$B_0(x,y)=\frac{1}{S}\sum_{p=1}^{S}I_p(x,y) \qquad (4.1)$$

after background modelling, subtracting the background model from the current frame to obtain the foreground Target(x,y):

$$D(x,y)=I_p(x,y)-B_0(x,y) \qquad (4.2)$$

$$\mathrm{Target}(x,y)=\begin{cases}1, & |D(x,y)|>T\\ 0, & |D(x,y)|\le T\end{cases} \qquad (4.3)$$

where I_p(x,y) denotes the current frame image, D(x,y) the difference image, T a set threshold, and a value of 1 in the Target(x,y) matrix marks a foreground pixel;

step 4.2: post-processing the obtained binary foreground image with morphological opening and closing operations, finally giving a complete foreground image G_p(x,y);

when the video stream is processed, updating the background model according to

$$B_p(x,y)=\alpha\,I_p(x,y)+(1-\alpha)\,B_{p-1}(x,y) \qquad (4.4)$$

where B_p(x,y) is the background model adaptively updated with the p-th frame image and 0 < α < 1 is an update factor;

the horizontal coordinates of the foreground image G_p(x,y) are mapped to angle coordinates, and the position of the pixels where the foreground is located is converted into an angle $\hat{\theta}_v$ and output to the beamforming module.
5. The method according to claim 4, wherein the step 5 comprises the following sub-steps:

step 5.1: from the accurate angle $\hat{\theta}_v$ obtained by the image processing, the array response vector $a(\hat{\theta}_v,k)$ of the target signal is

$$a(\hat{\theta}_v,k)=\left[e^{-j\frac{2\pi}{\lambda_k}\,\mathbf{p}_1\cdot\mathbf{u}(\hat{\theta}_v)},\;e^{-j\frac{2\pi}{\lambda_k}\,\mathbf{p}_2\cdot\mathbf{u}(\hat{\theta}_v)},\;\dots,\;e^{-j\frac{2\pi}{\lambda_k}\,\mathbf{p}_M\cdot\mathbf{u}(\hat{\theta}_v)}\right]^{T} \qquad (5.1)$$

where $\mathbf{u}(\hat{\theta}_v)$ is the unit vector pointing in the source direction, [p_1, p_2, ..., p_M] are the two-dimensional coordinates of the M microphone elements, λ_k = c/f_k is the wavelength corresponding to the k-th frequency bin, f_k is the frequency of the k-th bin, and c denotes the speed of sound;

step 5.2: converting linearly constrained minimum variance beamforming into the optimization problem

$$\min_{w(k,l)}\;w^{H}(k,l)\,S_X(k,l)\,w(k,l)\quad\text{s.t.}\quad w^{H}(k,l)\,a(\hat{\theta}_v,k)=1 \qquad (5.2)$$

where w(k,l) = [w_1(k,l), w_2(k,l), ..., w_M(k,l)]^T is the weight vector of the l-th frame signal and S_X(k,l) is the spatial spectrum matrix of the l-th frame signal; filtering according to the steepest-descent adaptive algorithm:

$$w(k,l+1)=J(k)\left[w(k,l)-\mu\,X(k,l)\,Y^{*}(k,l)\right]+F(k) \qquad (5.3)$$

where

$$J(k)=I-\frac{a(\hat{\theta}_v,k)\,a^{H}(\hat{\theta}_v,k)}{a^{H}(\hat{\theta}_v,k)\,a(\hat{\theta}_v,k)},\qquad F(k)=\frac{a(\hat{\theta}_v,k)}{a^{H}(\hat{\theta}_v,k)\,a(\hat{\theta}_v,k)}$$

Y(k,l) = w^H(k,l) X(k,l) denotes the beamformed output signal, Y^*(k,l) is the complex conjugate of Y(k,l), μ ≥ 0 is the convergence step size, and the initial weight vector is w(k,1) = F(k);

splicing the sub-band signals into a wide-band signal Y(l) = [Y(0,l), Y(1,l), ..., Y(N-1,l)];

step 5.3: finally, performing an inverse discrete Fourier transform (IDFT) on Y(l) to obtain the time-domain output signal y(l) of the l-th frame:

$$y(l)=\mathrm{IDFT}[Y(l)] \qquad (5.4)$$

then splicing the L frames of voice signal to obtain the time-domain output

$$y(t)=[y(1),y(2),\dots,y(l),\dots,y(L)] \qquad (5.5)$$

y(t) being the enhanced speech signal.
6. A multimodal remote speech perception apparatus, the apparatus comprising:
a rectangular microphone array placed 8-10 m from the sound source;
a camera mounted on an end edge of the rectangular microphone array and rotating synchronously with the microphone array;
a lower computer connected to the rectangular microphone array and used for control-command reception, signal acquisition and data transmission; after receiving a "start" control command sent by the upper computer, the lower computer acquires voice signals through the rectangular microphone array and uploads the data to the upper computer in real time; after receiving a "stop" control command sent by the upper computer, the lower computer stops uploading data;
an upper computer connected to the camera, which receives the video signal and the voice signal sent by the lower computer, performs a preliminary angle estimation on the target voice signal, and drives the camera to face the sound source using this angle; extracts the target foreground from the video image and maps the foreground coordinates to an accurate angular direction; and transmits the high-precision azimuth parameter to a beamforming module, the beamforming output in that azimuth being the enhanced voice signal.
7. The multi-modal remote voice perception device according to claim 6, wherein the connection and data transmission between the lower computer and the upper computer are as follows:
a. determining the data ports and connection interfaces of the upper computer, the lower computer, the microphone array and the camera, and establishing the connections;
b. the upper computer issues the control command "start" to begin collecting audio and video data;
c. the sampled data of all channels of the rectangular microphone array are parallel-to-serial converted, and the lower computer sends the uplink data packets to the upper computer;
d. when the upper computer sends the control command "stop", the lower computer stops collecting data and waits for the upper computer to send the control command "start" again;
e. after acquisition is finished, the audio data are automatically saved to a .dat file and the video data to an .avi file.
CN201910705872.0A 2019-08-01 2019-08-01 Multi-mode remote voice perception method and device Active CN110444220B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910705872.0A CN110444220B (en) 2019-08-01 2019-08-01 Multi-mode remote voice perception method and device


Publications (2)

Publication Number Publication Date
CN110444220A CN110444220A (en) 2019-11-12
CN110444220B true CN110444220B (en) 2023-02-10

Family

ID=68432714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910705872.0A Active CN110444220B (en) 2019-08-01 2019-08-01 Multi-mode remote voice perception method and device

Country Status (1)

Country Link
CN (1) CN110444220B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951273B (en) * 2021-02-02 2024-03-29 郑州大学 Numerical control machine tool cutter abrasion monitoring device based on microphone array and machine vision
CN116504264B (en) * 2023-06-30 2023-10-31 小米汽车科技有限公司 Audio processing method, device, equipment and storage medium
CN116705047B (en) * 2023-07-31 2023-11-14 北京小米移动软件有限公司 Audio acquisition method, device and storage medium
CN117953914A (en) * 2024-03-27 2024-04-30 深圳市西昊智能家具有限公司 Speech data enhancement optimization method for intelligent office

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013175869A (en) * 2012-02-24 2013-09-05 Nippon Telegr & Teleph Corp <Ntt> Acoustic signal enhancement device, distance determination device, methods for the same, and program
WO2015196729A1 (en) * 2014-06-27 2015-12-30 中兴通讯股份有限公司 Microphone array speech enhancement method and device
CN106328156A (en) * 2016-08-22 2017-01-11 华南理工大学 Microphone array voice reinforcing system and microphone array voice reinforcing method with combination of audio information and video information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8577677B2 (en) * 2008-07-21 2013-11-05 Samsung Electronics Co., Ltd. Sound source separation method and system using beamforming technique


Also Published As

Publication number Publication date
CN110444220A (en) 2019-11-12

Similar Documents

Publication Publication Date Title
CN110444220B (en) Multi-mode remote voice perception method and device
CN111044973B (en) MVDR target sound source directional pickup method for microphone matrix
CN106328156B (en) Audio and video information fusion microphone array voice enhancement system and method
CN110992974A (en) Speech recognition method, apparatus, device and computer readable storage medium
CN111239687B (en) Sound source positioning method and system based on deep neural network
US6826284B1 (en) Method and apparatus for passive acoustic source localization for video camera steering applications
CN109490822B (en) Voice DOA estimation method based on ResNet
CN108318862B (en) Sound source positioning method based on neural network
CN108375763B (en) Frequency division positioning method applied to multi-sound-source environment
CN110010147A (en) A kind of method and system of Microphone Array Speech enhancing
CN108877827A (en) Voice-enhanced interaction method and system, storage medium and electronic equipment
CN111429939B (en) Sound signal separation method of double sound sources and pickup
Liu et al. Continuous sound source localization based on microphone array for mobile robots
CN109782231B (en) End-to-end sound source positioning method and system based on multi-task learning
CN107167770B (en) A kind of microphone array sound source locating device under the conditions of reverberation
CN112904279B (en) Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum
Li et al. Reverberant sound localization with a robot head based on direct-path relative transfer function
CN110515034B (en) Acoustic signal azimuth angle measurement system and method
CN113607447A (en) Acoustic-optical combined fan fault positioning device and method
Hu et al. Decoupled direction-of-arrival estimations using relative harmonic coefficients
US11636866B2 (en) Transform ambisonic coefficients using an adaptive network
CN112180318B (en) Sound source direction of arrival estimation model training and sound source direction of arrival estimation method
CN114417908A (en) Multi-mode fusion-based unmanned aerial vehicle detection system and method
CN116559778B (en) Vehicle whistle positioning method and system based on deep learning
CN105372644B (en) One kind is based on the modified Adaptive beamformer method and system of dynamic weight

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant