CN110444220B - Multi-mode remote voice perception method and device - Google Patents
- Publication number: CN110444220B (application CN201910705872.0A)
- Authority
- CN
- China
- Prior art keywords
- signal
- foreground
- sound source
- angle
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/24 — Speech recognition using non-acoustical features
- G10L21/0216 — Noise filtering characterised by the method used for estimating noise
- G10L21/0224 — Noise filtering: processing in the time domain
- G10L21/0232 — Noise filtering: processing in the frequency domain
- G10L21/028 — Voice signal separating using properties of sound source
- G10L21/055 — Time compression or expansion for synchronising with other signals, e.g. video signals
- G10L25/45 — Speech or voice analysis characterised by the type of analysis window
- G10L25/57 — Speech or voice analysis specially adapted for processing of video signals
- G10L2015/223 — Execution procedure of a spoken command
- G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166 — Microphone arrays; Beamforming
- H04N7/147 — Videophone communication arrangements
- H04N7/183 — Closed-circuit television [CCTV] for receiving images from a single remote source
Abstract
The invention discloses a multi-modal remote voice perception method and device. The perception method comprises the following steps: collect voice and video signals with a rectangular microphone array and a camera; perform a preliminary angle-of-arrival estimation on the target voice signal by beamforming to obtain a coarse sound-source direction; use this preliminary position information to drive the camera to face the sound source; establish a background model from the initial video data and perform foreground detection and background updating; and transmit the high-precision azimuth corresponding to the foreground to a beamforming module, whose output steered to that azimuth is the enhanced voice signal.
Description
Technical Field
The invention relates to the field of multi-modal joint sensor acquisition and speech enhancement, and in particular to a multi-modal remote voice perception method and device based on joint acquisition by a rectangular microphone planar array and a camera.
Background
In recent years, remote video monitoring technology has become ever more widely used in daily life: red-light cameras on the street, surveillance cameras in offices, infrared detectors, thermal imaging, and so on. In remote monitoring in particular, a single camera lets people check a remote scene from a mobile phone or other smart device at any time and place, which is a great convenience. Processing audio signals with microphones has likewise found applications in mobile phones and personal computers, usually with systems of one or two microphones. In recent years Amazon, Microsoft, Google, and others have released products based on microphone-array technology abroad; in China, companies such as iFlytek, Unisound, and SoundAI also offer mature microphone hardware solutions. The pickup and operating range of these products is within 10 m, and they mainly target near-field voice scenarios. However, conventional near-field voice applications increasingly fail to meet users' needs: when the scene shifts outdoors or to robotics, vehicles, or surveillance, more complex voice-controlled intelligent devices are required, and microphone-array technology therefore becomes the core of far-field voice perception.
However, remote video can only process images; it cannot perceive sound, which remains a significant gap. Meanwhile, in conventional voice perception technology, speech-recognition accuracy at short range has reached a practical level, but performance degrades sharply at long range because the received speech signal has a low signal-to-noise ratio and interference signals are present.
Existing remote voice localization technology has the following problems:
(1) Compressive-sensing direction estimation can improve azimuth accuracy, but requires a high signal-to-noise ratio;
(2) Convolutional beamforming for small sensor arrays improves azimuth estimation accuracy, but likewise needs a high signal-to-noise ratio;
(3) A large-scale microphone array can achieve both high signal-to-noise ratio and narrow beams, but is cumbersome in engineering practice: it occupies a large space, and its multi-channel data processing demands a powerful signal processor.
To address inaccurate remote voice localization, researchers have proposed exploiting the high spatial resolution of images: first obtain an accurate sound-source position from the image, then combine it with a microphone array and a beamforming algorithm to enhance the speech, suppress noise, and improve speech quality.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a multi-mode remote voice perception method and a multi-mode remote voice perception device.
The purpose of the invention is realized by the following technical scheme: a multimodal remote speech perception method comprising the steps of:
step 1: collecting voice and video signals by using a rectangular microphone array and a camera;
step 2: carrying out primary arrival angle estimation on a target voice signal by utilizing beam forming so as to obtain a rough sound source direction;
step 3: according to the coarse sound-source position, drive the camera to face the sound-source direction;
step 4: establish a background model from the initial data, and perform foreground extraction and adaptive background-model updating;
step 5: map the foreground spatial position to a high-precision azimuth, transmit this azimuth parameter to the beamforming module, and take the beamformer output steered to that azimuth as the enhanced voice signal.
Further, the step 2 specifically includes the following sub-steps:
Step 2.1: frame the speech signal, and record the l-th frame (l = 1, …, L) acquired by the array as $x(l) = [x_1(l), x_2(l), \ldots, x_m(l), \ldots, x_M(l)]$, where M is the number of microphones (each microphone is one channel) and $x_m(l) = [x_m(0,l), x_m(1,l), \ldots, x_m(n,l), \ldots, x_m(N-1,l)]^T$ is the l-th frame signal collected on the m-th channel. Apply a window function to each frame and take the short-time Fourier transform; the frequency-domain representation of the l-th frame of the m-th channel is

$$X_m(k,l) = \sum_{n=0}^{N-1} b_n\, x_m(n,l)\, e^{-j2\pi kn/N} \quad (2.1)$$

where n denotes the time index, k the k-th frequency bin, and $b_n$ a Hann window of length N.

The frequency-domain signal of the M channels is defined as X(k,l):

$$X(k,l) = [X_1(k,l), X_2(k,l), \ldots, X_M(k,l)]^T, \quad 0 \le k \le N-1 \quad (2.2)$$

Step 2.2: define the spatial spectral matrix of the signal as $S_X(k) = E\{X(k,l)X^H(k,l)\}$, where $E\{\cdot\}$ denotes the expectation over the L frames and the matrix elements are the cross-power spectra between channel pairs. Assuming the voice signal is incident from angle θ, the spatial-spectrum estimates of the N frequency bins are weighted and summed into the total beam power P(θ):

$$P(\theta) = \sum_{k=0}^{N-1} w_{DS}^H(\theta,k)\, S_X(k)\, w_{DS}(\theta,k) \quad (2.3)$$

where $w_{DS}(\theta,k) = [w_1(\theta,k), w_2(\theta,k), \ldots, w_M(\theta,k)]^T$ is the phase-aligned weight vector of the k-th bin and $w_{DS}^H(\theta,k)$ its conjugate transpose.

An angle search over the total beam power P(θ) yields the preliminarily estimated coarse sound-source azimuth $\hat{\theta}$.
Further, the step 3 specifically includes the following sub-steps:
Step 3.1: from the azimuth $\hat{\theta}$ obtained in step 2, judge the approximate direction of the sound source, and drive the camera to face that direction.
Further, the step 4 specifically includes the following sub-steps:
Step 4.1: first establish a background model from the initial video data. Record the p-th captured frame as $I_p(x,y)$, where (x,y) are the pixel coordinates of the image matrix. After converting the images to grayscale, average the first S frames as the initial background:

$$B_0(x,y) = \frac{1}{S} \sum_{p=1}^{S} I_p(x,y) \quad (4.1)$$

After background modeling, subtract the background model from the current frame to obtain the foreground Target(x,y):

$$D(x,y) = I_p(x,y) - B_0(x,y) \quad (4.2)$$

where $I_p(x,y)$ is the current frame and D(x,y) the difference image; with a set threshold T, the binary matrix Target(x,y) takes the value 1 at foreground pixels (|D(x,y)| > T) and 0 elsewhere.

Step 4.2: the resulting binary foreground image suffers from broken contours, incomplete foreground regions, and similar defects, so it is post-processed with morphological opening and closing operations to finally obtain the complete foreground image $G_p(x,y)$.

When processing a video stream, the background model must be updated to follow environmental changes such as lighting. The update formula is

$$B_p(x,y) = \alpha\, I_p(x,y) + (1-\alpha)\, B_{p-1}(x,y) \quad (4.3)$$

where $B_p(x,y)$ is the background model adaptively updated with the p-th frame and 0 < α < 1 is an update factor chosen according to how fast the environment changes.

Because the horizontal extent of the target is small relative to its distance from the camera, the image coordinate and the direction angle can be treated as linearly related, so the position of the foreground can be converted to an angle $\hat{\theta}$ and output to the beamforming module.
Further, the step 5 specifically includes the following sub-steps:
Step 5.1: from the accurate angle $\hat{\theta}$ obtained by image processing, the array response vector of the corresponding target signal is

$$a(\hat{\theta},k) = \left[ e^{-j\frac{2\pi}{\lambda_k} p_1^T u(\hat{\theta})}, \ldots, e^{-j\frac{2\pi}{\lambda_k} p_M^T u(\hat{\theta})} \right]^T \quad (5.1)$$

where $[p_1, p_2, \ldots, p_M]$ are the two-dimensional coordinates of the M microphone elements, $u(\hat{\theta})$ is the unit vector toward the source, $\lambda_k = c/f_k$ is the wavelength of the k-th frequency bin, $f_k$ its frequency, and c the propagation speed of the plane wave in the medium.

Step 5.2: linearly constrained minimum-variance beamforming is cast as the optimization problem

$$\min_{w(k,l)} \; w^H(k,l)\, S_x(k,l)\, w(k,l) \quad \text{s.t.} \quad w^H(k,l)\, a(\hat{\theta},k) = 1 \quad (5.2)$$

where $w(k,l) = [w_1(k,l), w_2(k,l), \ldots, w_M(k,l)]^T$ is the weight vector for the l-th frame signal and $S_x(k,l)$ its spatial spectral matrix. Filtering follows the steepest-descent adaptive algorithm

$$w(k,l+1) = J(k)\left[ w(k,l) - \mu X(k,l) Y^*(k,l) \right] + F(k) \quad (5.3)$$

where $J(k) = I - \frac{a(\hat{\theta},k)\,a^H(\hat{\theta},k)}{a^H(\hat{\theta},k)\,a(\hat{\theta},k)}$ and $F(k) = \frac{a(\hat{\theta},k)}{a^H(\hat{\theta},k)\,a(\hat{\theta},k)}$, $Y(k,l) = w^H(k,l) X(k,l)$ is the beamformed output signal, $Y^*(k,l)$ its complex conjugate, μ ≥ 0 the convergence step size, and the initial weight vector is $w(k,0) = F(k)$.

The sub-band signals are spliced into a wideband signal: Y(l) = [Y(0,l), Y(1,l), …, Y(N-1,l)].

Step 5.3: finally, the inverse discrete Fourier transform (IDFT) of Y(l) gives the time-domain output signal of the l-th frame:

$$y(l) = \mathrm{IDFT}[Y(l)] \quad (5.4)$$

The L frames of voice signal are then concatenated into the time-domain output

$$y(t) = [y(1), y(2), \ldots, y(l), \ldots, y(L)] \quad (5.5)$$

and y(t) is the enhanced speech signal.
It is another object of the present invention to provide a multimodal remote speech perception apparatus, comprising:
the rectangular microphone array is 8-10 m away from the sound source;
the camera is arranged on the end edge of the rectangular microphone array and rotates synchronously with the microphone array;
- the lower computer, connected to the rectangular microphone array, handles control-command reception, signal acquisition, and data transmission; after receiving a "start" control instruction from the upper computer, it acquires voice signals through the rectangular microphone array and uploads the data to the upper computer in real time; it stops uploading after receiving a "stop" control instruction;
- and the upper computer, connected to the camera, receives the video signal and the voice signal sent by the lower computer, performs an initial angle estimation on the target voice signal, and uses this angle to drive the camera to rotate until it faces the sound source. It extracts the target foreground from the video image, maps the foreground coordinates to an accurate position, transmits the high-precision azimuth parameter to the beamforming module, and takes the beamformer output steered to that azimuth as the enhanced voice signal.
Further, the connection and data transmission of the lower computer and the upper computer are as follows:
a, determining data ports and connection interfaces of an upper computer, a lower computer, a microphone array and a camera, and establishing connection;
b, the upper computer issues a control command 'start' to start collecting audio and video data;
c, performing parallel-serial conversion on the sampling data of all channels of the rectangular microphone array, and sending an uplink data packet to an upper computer by a lower computer;
d, the upper computer sends a control command 'stop', the lower computer stops collecting data and waits for the upper computer to send the control command 'start' again;
and e, after acquisition finishes, the audio data are automatically saved to a .dat file and the video data to an .avi file.
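The command exchange in steps a–e can be mimicked with a minimal TCP loop. The patent does not specify the transport, port, or packet format, so the loopback socket, the `PKT:`/`ACK:` framing, and the `lower_computer` mock below are all assumptions made for illustration only.

```python
import socket
import threading

def lower_computer(server_sock):
    """Mock lower computer: waits for 'start', sends one mock uplink
    packet of serialized multichannel samples, then acknowledges 'stop'."""
    conn, _ = server_sock.accept()
    with conn:
        while True:
            cmd = conn.recv(16).decode().strip()
            if cmd == "start":
                # parallel-serial converted channel data would go here
                conn.sendall(b"PKT:12ch-samples\n")
            elif cmd == "stop":
                conn.sendall(b"ACK:stopped\n")
                break

# a. establish the connection (loopback stands in for the real data port)
srv = socket.socket()
srv.bind(("127.0.0.1", 0))               # OS-assigned port; real port unspecified
srv.listen(1)
threading.Thread(target=lower_computer, args=(srv,), daemon=True).start()

with socket.create_connection(srv.getsockname()) as upper:
    upper.sendall(b"start")              # b. upper computer issues "start"
    packet = upper.recv(64)              # c. uplink data packet arrives
    upper.sendall(b"stop")               # d. upper computer issues "stop"
    ack = upper.recv(64)
srv.close()
```

The application-level ping-pong (command, then reply, then next command) keeps the two sides in lockstep, which is the simplest way to realize the start/stop handshake the text describes.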
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention uses an audio-video joint sound-source localization method; adding video localization of the speaker makes it easy to obtain an accurate sound-source azimuth, avoiding the low azimuth resolution of conventional beamforming and its inability to clearly distinguish multiple sound sources.
(2) The invention enhances the remote voice signal using the angle returned by image processing together with the microphone array, addressing the energy attenuation and low signal-to-noise ratio of remote voice signals after propagation through space.
(3) The invention uses an adaptive linearly constrained minimum-variance beamformer to suppress incoherent noise and interference signals, addressing the severe noise contamination of distant voice signals.
(4) Based on the three characteristics, the invention can realize the function of outdoor remote voice perception and has better practical value.
Drawings
FIG. 1 is a general flow chart of the multimodal remote speech perception method of the present invention;
FIG. 2 is a flow chart of the preliminary estimation of the azimuth of the sound source according to the present invention;
FIG. 3 is a flow chart of the image processing output accurate sound source azimuth of the present invention;
FIG. 4 is a flow chart of the adaptive beamforming for enhancing speech signals in the present invention;
FIG. 5 is a beam pattern diagram of primary positioning of upper computer beam forming in the present invention;
FIG. 6 is a diagram of the result of obtaining high-precision voice direction by video processing according to the present invention;
FIG. 7 is a waveform diagram of signals before and after speech enhancement according to the present invention;
FIG. 8 is a time-frequency diagram of signals before and after speech enhancement according to the present invention.
Detailed Description
The objects and effects of the present invention will become more apparent from the following detailed description of the present invention when taken in conjunction with the accompanying drawings.
Figure 1 shows the general flow of the invention. The multi-modal remote voice perception method comprises five steps: first, collect voice and video signals with a rectangular microphone array and a camera; perform a preliminary azimuth estimation on the signals; based on the coarse angle-of-arrival estimate, detect the target with adaptive background modeling to obtain an accurate sound-source azimuth; then, using the accurate azimuth from image processing, realize adaptive filtering of the voice signals with linearly constrained minimum-variance beamforming and the steepest-descent algorithm, finally outputting an enhanced, clear voice signal.
The detection method of the invention has the following specific implementation modes:

Step 1: place the rectangular microphone array and the camera at the same angle, and collect the audio and video signals.

Step 2: estimate the angle of arrival of the target voice signal to obtain a coarse sound-source azimuth. The flow chart is shown in fig. 2 and comprises the following sub-steps:

Step 2.1: frame the speech signal, and record the l-th frame (l = 1, …, L) acquired by the array as $x(l) = [x_1(l), x_2(l), \ldots, x_m(l), \ldots, x_M(l)]$, where M is the number of microphones (each microphone is one channel) and $x_m(l) = [x_m(0,l), x_m(1,l), \ldots, x_m(n,l), \ldots, x_m(N-1,l)]^T$ is the l-th frame signal collected on the m-th channel. Apply a window function to each frame and take the short-time Fourier transform; the frequency-domain representation of the l-th frame of the m-th channel is

$$X_m(k,l) = \sum_{n=0}^{N-1} b_n\, x_m(n,l)\, e^{-j2\pi kn/N} \quad (2.1)$$

where n denotes the time index, k the k-th frequency bin, and $b_n$ a Hann window of length N.

The frequency-domain signal of the M channels is defined as X(k,l):

$$X(k,l) = [X_1(k,l), X_2(k,l), \ldots, X_M(k,l)]^T, \quad 0 \le k \le N-1 \quad (2.2)$$

Preferably, in this implementation the sampling frequency is 48 kHz, the short-time Fourier transform length N is 512, and the window function $b_n$ is a Hann window of length 512.

Step 2.2: define the spatial spectral matrix of the signal as $S_X(k) = E\{X(k,l)X^H(k,l)\}$, where $E\{\cdot\}$ denotes the expectation over the L frames and the matrix elements are the cross-power spectra between channel pairs. Assuming the voice signal is incident from angle θ, the spatial-spectrum estimates of the N frequency bins are weighted and summed into the total beam power P(θ):

$$P(\theta) = \sum_{k=0}^{N-1} w_{DS}^H(\theta,k)\, S_X(k)\, w_{DS}(\theta,k) \quad (2.3)$$

where $w_{DS}(\theta,k) = [w_1(\theta,k), w_2(\theta,k), \ldots, w_M(\theta,k)]^T$ is the phase-aligned weight vector of the k-th bin and $w_{DS}^H(\theta,k)$ its conjugate transpose.

An angle search over the total beam power P(θ) yields the preliminarily estimated coarse sound-source azimuth $\hat{\theta}$.

In this implementation, according to the actual conditions, the search range of the angle θ is −90° ≤ θ ≤ +90° with an angle step of 1°.
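The coarse azimuth search of step 2 can be illustrated with a short NumPy sketch. This is a simplified stand-in rather than the patented implementation: the 2 × 6 geometry with 0.05 m spacing, the 48 kHz rate, the length-512 Hann frames, and the −90°…+90° search in 1° steps follow the text above, while the single simulated plane wave, the restriction to speech-band bins, and the helper names (`stft_frames`, `beam_power`) are illustrative assumptions.

```python
import numpy as np

C, FS, N = 343.0, 48000, 512   # speed of sound, sampling rate, frame length

def stft_frames(x, n=N):
    """Frame each channel, apply a length-n Hann window, DFT (Eqs. 2.1-2.2)."""
    M, T = x.shape
    L = T // n
    frames = x[:, :L * n].reshape(M, L, n) * np.hanning(n)
    return np.fft.fft(frames, axis=2).transpose(0, 2, 1)      # (M, K, L)

def beam_power(X, pos, thetas):
    """Total delay-and-sum beam power P(theta) of Eq. (2.3), summed over
    speech-band bins (kept below the array's spatial-aliasing limit)."""
    M, K, L = X.shape
    S = np.einsum('mkl,nkl->kmn', X, X.conj()) / L            # S_X(k)
    kmin, kmax = int(300 * K / FS), int(3000 * K / FS)
    P = np.zeros(len(thetas))
    for i, th in enumerate(thetas):
        u = np.array([np.sin(np.radians(th)), np.cos(np.radians(th))])
        tau = pos @ u / C                                     # element delays
        for k in range(kmin, kmax):
            w = np.exp(-2j * np.pi * (k * FS / K) * tau) / M  # phase alignment
            P[i] += np.real(w.conj() @ S[k] @ w)
    return P

# 2 x 6 rectangular array, 0.05 m pitch (values from the description)
pos = np.column_stack([np.tile(np.arange(6) * 0.05, 2),
                       np.repeat(np.arange(2) * 0.05, 6)])

# simulate a broadband plane wave arriving from +30 degrees
rng = np.random.default_rng(0)
s = rng.standard_normal(FS // 2)                              # 0.5 s of noise
t = np.arange(FS // 2) / FS
u30 = np.array([np.sin(np.radians(30.0)), np.cos(np.radians(30.0))])
x = np.stack([np.interp(t - d, t, s) for d in pos @ u30 / C])

thetas = np.arange(-90, 91)                                   # 1-degree steps
theta_hat = int(thetas[np.argmax(beam_power(stft_frames(x), pos, thetas))])
```

On this simulation the power peak falls within a few degrees of the true 30° azimuth; with so small an aperture the mainlobe is broad, which is exactly the coarse-resolution limitation the video stage is meant to overcome.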
Step 3: using the preliminary sound-source position information, drive the camera to face the direction of the sound source.
Step 4: establish a background model from the initial data, and perform foreground detection and adaptive background-model updating; the flow chart is shown in fig. 3 and comprises the following sub-steps:
Step 4.1: first establish a background model from the initial video data. Record the p-th captured frame as $I_p(x,y)$, where (x,y) are the pixel coordinates of the image matrix. After converting the images to grayscale, average the first S frames as the initial background:

$$B_0(x,y) = \frac{1}{S} \sum_{p=1}^{S} I_p(x,y) \quad (4.1)$$

After background modeling, subtract the background model from the current frame to obtain the foreground Target(x,y):

$$D(x,y) = I_p(x,y) - B_0(x,y) \quad (4.2)$$

where $I_p(x,y)$ is the current frame and D(x,y) the difference image; with a set threshold T, the binary matrix Target(x,y) takes the value 1 at foreground pixels (|D(x,y)| > T) and 0 elsewhere.

Step 4.2: the resulting binary foreground image suffers from broken contours, incomplete foreground regions, and similar defects, so it is post-processed with morphological opening and closing operations to finally obtain the complete foreground image $G_p(x,y)$.

When processing a video stream, the background model must be updated to follow environmental changes such as lighting. The update formula is

$$B_p(x,y) = \alpha\, I_p(x,y) + (1-\alpha)\, B_{p-1}(x,y) \quad (4.3)$$

where $B_p(x,y)$ is the background model adaptively updated with the p-th frame and 0 < α < 1 is an update factor chosen according to how fast the environment changes.

Because the horizontal extent of the target is small relative to its distance from the camera, the image coordinates and the direction angle can be treated as linearly related, so the position of the foreground image $G_p(x,y)$ is converted to an angle $\hat{\theta}$ and output to the beamforming module. The exact angle obtained in the experiment was +27°.
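Step 4 reduces to frame averaging, differencing, thresholding, a running-average update, and a linear pixel-to-angle map, which can be sketched as follows on synthetic frames. The threshold, update factor, number of averaged frames, and the 60° field of view are illustrative values, not parameters given in the patent.

```python
import numpy as np

S_FRAMES = 10   # frames averaged for the initial background (illustrative)
T = 25          # foreground threshold (illustrative)
ALPHA = 0.05    # background update factor, 0 < alpha < 1 (illustrative)

def initial_background(frames):
    """Eq. (4.1): average the first S grayscale frames."""
    return frames[:S_FRAMES].mean(axis=0)

def foreground_mask(frame, background, thresh=T):
    """Eq. (4.2) plus thresholding: 1 marks a foreground pixel."""
    return (np.abs(frame - background) > thresh).astype(np.uint8)

def update_background(background, frame, alpha=ALPHA):
    """Eq. (4.3): adaptive running-average update of the background model."""
    return alpha * frame + (1 - alpha) * background

def pixel_to_angle(col, width, fov_deg=60.0):
    """Linear image-column -> azimuth map (valid when the target is far away).
    The 60-degree horizontal field of view is an assumed camera parameter."""
    return (col / (width - 1) - 0.5) * fov_deg

# synthetic grayscale stream: near-static background, bright block appears
rng = np.random.default_rng(1)
frames = rng.integers(100, 110, size=(S_FRAMES, 120, 160)).astype(float)
B = initial_background(frames)
current = frames[0].copy()
current[40:60, 70:90] = 255.0          # the "target" entering the scene
mask = foreground_mask(current, B)
B = update_background(B, current)      # track slow environmental change
cols = np.where(mask.any(axis=0))[0]
angle = pixel_to_angle(cols.mean(), width=160)
```

A real implementation would insert the morphological opening/closing of step 4.2 between the mask and the centroid computation; on this clean synthetic frame the mask is already a solid block, so the centroid maps directly to an azimuth.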
Step 5: apply the precise angle in the adaptive beamforming algorithm to improve the signal-to-noise ratio of the voice signal; the flow chart is shown in fig. 4 and comprises the following sub-steps:
step 5.1, obtaining accurate angle information according to image processingArray response vector of corresponding target signalComprises the following steps:
wherein[p 1 ,p 2 ,...,p M ]Are the two-dimensional coordinates of the M microphone elements,is the wavelength corresponding to the kth frequency point, f k Is the frequency of the k frequency point, c represents the propagation speed of the plane wave in the medium; in the specific implementation process, the microphone arrays are uniform matrixes of 2 multiplied by 6, the distances among the microphones are all 0.05m, only the horizontal direction angle is considered, and the pitching direction angle is not considered.
Step 5.2, converting the linear constraint minimum variance beam forming into the following optimization problem:
wherein w (k, l) = [ w 1 (k,l),w 2 (k,l),...,w M (k,l)] T Weight vector, S, representing the signal of the l-th frame x (k, l) represents a spatial spectrum matrix of the l-th frame signal. Filtering according to a steepest descent adaptive algorithm:
w(k,l+1)=J(k)[w(k,l)-μX(k,l)Y * (k,l)]+F(k) (5.3)
whereinY(k,l)=w H (k, l) X (k, l) denotes a beamformed output signal, Y * (k, l) represents the complex conjugate of Y (k, l), μ ≧0 is the convergence step size, the initial weight vectorIn the specific implementation process, the selection of mu is changed according to different voice acquisition scenes, and mu is more than or equal to 0.00003 and less than or equal to 0.0001 in the experiment.
Splicing the sub-band signals into a broadband signal: Y(l) = [Y(0, l), Y(1, l), ..., Y(N−1, l)];
step 5.3: finally, inverse Discrete Fourier Transform (IDFT) is performed on Y (l) to obtain a time-domain output signal Y (l) of the l-th frame:
y(l)=IDFT[Y(l)] (5.4)
then, splicing the L frame voice signals to obtain a time domain output y (t):
y(t)=[y(1),y(2),...,y(l),...,y(L)] (5.5)
y (t) is the enhanced speech signal.
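Equations (5.4) and (5.5) amount to a per-frame inverse DFT followed by concatenation, which can be sketched as:

```python
import numpy as np

def reconstruct(Y_frames):
    """Equations (5.4)-(5.5): per-frame inverse DFT of Y(l), then plain
    concatenation of the L frame outputs into y(t).

    Y_frames : array of shape (L, N) holding Y(k, l) for each frame l.
    No overlap-add is applied; the patent text describes simple splicing
    of the frames.
    """
    y_frames = np.fft.ifft(Y_frames, axis=1).real  # y(l) = IDFT[Y(l)]
    return y_frames.reshape(-1)                    # y(t) = [y(1), ..., y(L)]
```

Framing a real signal, transforming, and reconstructing round-trips it exactly under this scheme.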
The multi-modal remote voice perception device comprises the following four modules:
a, a rectangular microphone array is 8-10 m away from a sound source;
b, a camera is arranged on the end edge of the rectangular microphone array and rotates synchronously with the microphone array;
c, the lower computer is connected with the rectangular microphone array and is responsible for control-command reception, signal acquisition, and data transmission; after receiving a 'start' control instruction sent by the upper computer, the lower computer acquires voice signals through the rectangular microphone array and uploads the data to the upper computer in real time; after receiving a 'stop' control instruction sent by the upper computer, the lower computer stops uploading data;
and d, the upper computer is connected with the camera; it receives the video signal and the voice signal sent by the lower computer, performs initial angle estimation on the target voice signal, and uses this angle to drive the camera to rotate so that it faces the sound source. It extracts the target foreground from the video image and maps the foreground coordinates to an accurate position; the high-precision azimuth parameter is transmitted to a beamforming module, whose beamformed output in that direction is the enhanced voice signal.
The connection and data transmission of the lower computer and the upper computer in the detection device are as follows:
a, determining data ports and connection interfaces of an upper computer, a lower computer, a microphone array and a camera, and establishing connection;
b, the upper computer issues a control command 'start' to start collecting audio and video data;
c, performing parallel-serial conversion on the sampling data of all channels of the rectangular microphone array, and sending an uplink data packet to an upper computer by a lower computer;
d, the upper computer sends a control command 'stop', the lower computer stops collecting data and waits for the upper computer to send the control command 'start' again;
and e, after acquisition is finished, the audio data are automatically stored as a .dat file and the video data as an .avi file.
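As an illustration of step c's parallel-to-serial transfer, the sketch below unpacks one uplink packet into per-channel samples; the 16-bit little-endian, sample-interleaved wire format is a hypothetical assumption, not specified by the patent:

```python
import numpy as np

def deinterleave(packet: bytes, n_channels: int = 12):
    """Return an (n_channels, n_samples) array from one uplink packet.

    Assumed layout: samples interleaved across channels,
    [s1_ch1, s1_ch2, ..., s1_chM, s2_ch1, ...], each a 16-bit
    little-endian integer.  The real wire format of the lower computer
    is not given in the patent.
    """
    flat = np.frombuffer(packet, dtype='<i2')   # assumed int16, little-endian
    return flat.reshape(-1, n_channels).T       # one row per microphone channel
```

The transpose yields contiguous per-channel streams, which is the layout the framing and STFT stages of step 2 expect.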
Examples
In this embodiment, the detection method is applied to remote speech sensing; the specific steps are as described above and are not repeated here.
Initial angle-of-arrival estimation of the target speech is performed by beamforming: the total beam power P(θ) is computed from −90° to +90° and plotted with the angle θ on the x-axis and the normalized power P(θ) on the y-axis. The result is shown in fig. 5, giving the rough azimuth angle of the target sound source (female voice); the estimated angle of the interfering sound source is −29°.
Fig. 6 (a) shows the original image and fig. 6 (b) the background-subtraction result; some noise and interference still affect the result. To eliminate this interference, morphological opening-closing is applied, giving the result shown in fig. 6 (c); the final sound-source localization result is shown in fig. 6 (d). The precise sound-source positions are +27° and −25°; according to the rough azimuth obtained by beamforming, the precise azimuth of the target sound source is selected and output to the beamforming module for beamformed output.
The audio processing results of the upper-computer audio-video joint algorithm are shown in figs. 7 and 8. Fig. 7 shows the signal waveforms before and after speech enhancement: after processing, the noise is significantly reduced and the signal-to-noise ratio improved. Fig. 8 shows the time-frequency diagrams before and after enhancement: after beamforming, the noise and the interference (male voice), whose energy is concentrated in the low-frequency part, are suppressed, while the target sound source (female voice) in the high-frequency part is retained and enhanced.
The beamforming results at the coarse angle and at the precise angle are evaluated with the signal-to-noise ratio and the PESQ score to check the performance of the multi-modal joint system on real data. As shown in tables 1 and 2, the SNR gain of the beamformed output at the precise azimuth reaches 12.1704 dB and the PESQ score improves by 0.655 over the single-channel signal; this outperforms beamforming at the coarse azimuth.
TABLE 1 Beamforming SNR comparison

| | Coarse angle | Precise angle |
| --- | --- | --- |
| SNR gain (dB) | 10.0168 | 12.1704 |
TABLE 2 PESQ evaluation score comparison

| | Single-channel signal | Coarse angle | Precise angle |
| --- | --- | --- | --- |
| PESQ score | 1.6458 | 1.9473 | 2.3008 |
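SNR-gain figures like those in table 1 are the difference between output and input SNR. A minimal sketch, assuming separate access to the signal and noise components (which the patent's recordings do not directly provide):

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB: 10*log10 of signal power over noise power."""
    signal = np.asarray(signal, dtype=float)
    noise = np.asarray(noise, dtype=float)
    return 10.0 * np.log10(np.sum(signal**2) / np.sum(noise**2))

def snr_gain_db(sig_in, noise_in, sig_out, noise_out):
    """SNR gain of a beamformer: output SNR minus input SNR, in dB."""
    return snr_db(sig_out, noise_out) - snr_db(sig_in, noise_in)
```

In practice the components are estimated, e.g. from speech-absent segments; PESQ scoring, by contrast, requires the ITU-T P.862 reference algorithm rather than a power ratio.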
The processing method of the invention was tested at Yongquan Square in the Yuquan campus of Zhejiang University, Hangzhou, using a 2×6 microphone planar array at a sound-source distance of 10 m, with a target sound source (+27°) and an interfering sound source (−25°), at a sampling rate of 48 kHz; the test results are good. The invention can jointly acquire remote voice and video and send them to the upper computer for processing and output.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (7)
1. A multimodal remote speech perception method, comprising the steps of:
step 1: collecting voice and video signals with a rectangular microphone array and a camera;
step 2: performing preliminary angle-of-arrival estimation on the target voice signal by beamforming to obtain a rough sound-source direction;
step 3: according to the rough sound-source direction, driving the camera to face the sound source;
step 4: establishing a background model from the initial data, and performing foreground detection and adaptive background-model updating;
step 5: transmitting the high-precision azimuth parameter corresponding to the foreground to a beamforming module, the output of the beamforming module in that direction being the enhanced voice signal.
2. The method according to claim 1, wherein the step 2 comprises the following sub-steps:
step 2.1, framing the speech signal: the l-th frame (l = 1, ..., L) acquired by the array is recorded as x(l) = [x_1(l), x_2(l), ..., x_m(l), ..., x_M(l)], where M is the number of microphones, each microphone being one channel, and x_m(l) = [x_m(0, l), x_m(1, l), ..., x_m(n, l), ..., x_m(N−1, l)]^T is the l-th frame signal collected on the m-th channel; a window function is applied to each frame and a short-time Fourier transform is performed; the Fourier transform of the l-th frame time-domain signal of the m-th channel gives the frequency-domain representation:

X_m(k, l) = Σ_{n=0}^{N−1} b_n x_m(n, l) e^(−j2πkn/N)   (2.1)

where n is the time index, k denotes the k-th frequency point, and b_n is a Hanning window of length N;
the frequency domain signal defining the M channel is X (k, l):
X(k,l)=[X 1 (k,l),X 2 (k,l),...,X M (k,l)] T ,0≤k≤N-1 (2.2)
step 2.2, defining the spatial spectrum matrix of the signal as S_X(k), with elements S_X^(ij)(k) = E[X_i(k, l) X_j*(k, l)]; assuming the incident angle of the voice signal is θ, the spatial-spectrum estimates of the N frequency points are combined by weighted summation into the total beam power P(θ):

P(θ) = Σ_{k=0}^{N−1} w_DS^H(θ, k) S_X(k) w_DS(θ, k)   (2.3)

where w_DS(θ, k) = [w_1(θ, k), w_2(θ, k), ..., w_M(θ, k)]^T is the weight vector that phase-aligns the k-th frequency point, and w_DS^H(θ, k) is the conjugate transpose of w_DS(θ, k);

an angle search over the total beam power P(θ) gives the rough azimuth angle of the sound source: θ̂ = arg max_θ P(θ).
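The coarse-search loop of step 2.2 can be sketched as follows, assuming the per-bin spatial spectrum matrices S_X(k) have already been estimated; the delay-and-sum weights and the 343 m/s sound speed are the only other ingredients, and the helper names are illustrative:

```python
import numpy as np

def beam_power(S_X, freqs, pos, thetas_deg, c=343.0):
    """Total beam power P(theta): for each candidate angle, phase-align
    each frequency point with delay-and-sum weights and accumulate
    w^H S_X w over the frequency points.

    S_X   : list of MxM spatial spectrum matrices, one per frequency
    freqs : centre frequency of each bin (Hz)
    pos   : Mx2 microphone coordinates (m)
    """
    P = np.zeros(len(thetas_deg))
    for i, th in enumerate(thetas_deg):
        u = np.array([np.sin(np.radians(th)), np.cos(np.radians(th))])
        for S, f in zip(S_X, freqs):
            a = np.exp(-2j * np.pi * f * (pos @ u) / c)  # steering vector
            w = a / len(a)                               # delay-and-sum weights
            P[i] += np.real(w.conj() @ S @ w)
    return P
```

Searching P(θ) over a −90°..+90° grid and taking the arg max recovers the rough azimuth.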
4. The method according to claim 3, wherein the step 4 comprises the following sub-steps:
step 4.1, firstly establishing a background model with the initial video data: the acquired p-th frame image is recorded as I_p(x, y), where (x, y) are the pixel coordinates of the image matrix; after conversion to a grey-level image, the first S frames are averaged as the initial background B_0(x, y):

B_0(x, y) = (1/S) Σ_{p=1}^{S} I_p(x, y)   (4.1)

after background modeling is finished, the background model is subtracted from the current frame to obtain the foreground Target(x, y):

D(x, y) = I_p(x, y) − B_0(x, y)   (4.2)

Target(x, y) = 1 if |D(x, y)| > T, else 0,

where I_p(x, y) is the current frame image, D(x, y) is the foreground difference image, T is a set threshold, and a 1 in the Target(x, y) matrix marks a foreground pixel;
step 4.2, post-processing the binarized foreground image by morphological opening and closing, finally obtaining a complete foreground image G_p(x, y);

while the video stream is processed, the background model is updated according to

B_p(x, y) = (1 − α) B_{p−1}(x, y) + α I_p(x, y)   (4.3)

where B_p(x, y) is the background model adaptively updated with the p-th frame image, and 0 < α < 1 is the update factor;
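The background modeling of steps 4.1 and 4.2 can be sketched with a running-average model and a threshold mask; the morphological opening-closing stage is omitted here for brevity, and the threshold and update-factor values are illustrative:

```python
import numpy as np

def init_background(frames):
    """B_0(x, y): mean of the first S grey-level frames (step 4.1)."""
    return np.mean(frames, axis=0)

def foreground_mask(frame, background, T=25.0):
    """Target(x, y): 1 where |I_p - B| exceeds the threshold T, else 0."""
    return (np.abs(frame - background) > T).astype(np.uint8)

def update_background(prev_bg, frame, alpha=0.05):
    """Running-average update with factor 0 < alpha < 1 (step 4.2)."""
    return (1 - alpha) * prev_bg + alpha * frame
```

A small α keeps the model stable against transient foreground objects; a larger α lets it adapt faster to illumination changes.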
5. The method according to claim 4, wherein the step 5 comprises the following sub-steps:
step 5.1, from the accurate angle θ obtained by image processing, the array response vector a(θ, k) of the corresponding target signal is:

a(θ, k) = [e^(−j2π p_1·u(θ)/λ_k), ..., e^(−j2π p_M·u(θ)/λ_k)]^T   (5.1)

where [p_1, p_2, ..., p_M] are the two-dimensional coordinates of the M microphone elements, u(θ) is the unit vector of the incident direction θ, λ_k = c/f_k is the wavelength corresponding to the k-th frequency point, f_k is the frequency of the k-th frequency point, and c represents the sound velocity;
step 5.2, converting the linearly constrained minimum variance beamforming into the optimization problem:

min_w w^H(k, l) S_X(k, l) w(k, l)   subject to   w^H(k, l) a(θ, k) = 1   (5.2)

where w(k, l) = [w_1(k, l), w_2(k, l), ..., w_M(k, l)]^T is the weight vector of the l-th frame signal and S_X(k, l) is the spatial spectrum matrix of the l-th frame signal; filtering according to the steepest-descent adaptive algorithm:
w(k, l+1) = J(k)[w(k, l) − μ X(k, l) Y*(k, l)] + F(k)   (5.3)

where J(k) = I − a(θ, k)a^H(θ, k) / (a^H(θ, k)a(θ, k)), F(k) = a(θ, k) / (a^H(θ, k)a(θ, k)), Y(k, l) = w^H(k, l)X(k, l) is the beamformed output signal, Y*(k, l) is the complex conjugate of Y(k, l), μ ≥ 0 is the convergence step size, and the initial weight vector is w(k, 1) = F(k);
splicing the sub-band signals into a broadband signal: Y(l) = [Y(0, l), Y(1, l), ..., Y(N−1, l)];
step 5.3: finally, inverse Discrete Fourier Transform (IDFT) is performed on Y (l) to obtain a time-domain output signal Y (l) of the l-th frame:
y(l)=IDFT[Y(l)] (5.4)
then, splicing the L frame voice signals to obtain a time domain output y (t):
y(t)=[y(1),y(2),...,y(l),...,y(L)] (5.5)
y (t) is the enhanced speech signal.
6. A multimodal remote speech perception apparatus, the apparatus comprising:
the rectangular microphone array is 8-10 m away from the sound source;
the camera is arranged on the end edge of the rectangular microphone array and rotates synchronously with the microphone array;
the lower computer is connected with the rectangular microphone array and is responsible for control-command reception, signal acquisition, and data transmission; after receiving a 'start' control instruction sent by the upper computer, the lower computer acquires voice signals through the rectangular microphone array and uploads the data to the upper computer in real time; after receiving a 'stop' control instruction sent by the upper computer, the lower computer stops uploading data;
the upper computer is connected with the camera; it receives the video signal and the voice signal sent by the lower computer, performs initial angle estimation on the target voice signal, and uses this angle to drive the camera to rotate so that it faces the sound source; it extracts the target foreground from the video image and maps the foreground coordinates to an accurate angle; the high-precision azimuth parameter is transmitted to a beamforming module, whose beamformed output in that direction is the enhanced voice signal.
7. The multi-modal remote voice perception device according to claim 6, wherein the connection and data transmission between the lower computer and the upper computer are as follows:
a, determining data ports and connection interfaces of an upper computer, a lower computer, a microphone array and a camera, and establishing connection;
b, the upper computer issues a control command 'start' to start collecting audio and video data;
c, performing parallel-serial conversion on the sampling data of all channels of the rectangular microphone array, and sending an uplink data packet to an upper computer by a lower computer;
d, the upper computer sends a control command 'stop', the lower computer stops collecting data and waits for the upper computer to send the control command 'start' again;
and e, after acquisition is finished, the audio data are automatically stored as a .dat file and the video data as an .avi file.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910705872.0A CN110444220B (en) | 2019-08-01 | 2019-08-01 | Multi-mode remote voice perception method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110444220A CN110444220A (en) | 2019-11-12 |
CN110444220B true CN110444220B (en) | 2023-02-10 |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013175869A (en) * | 2012-02-24 | 2013-09-05 | Nippon Telegr & Teleph Corp <Ntt> | Acoustic signal enhancement device, distance determination device, methods for the same, and program |
WO2015196729A1 (en) * | 2014-06-27 | 2015-12-30 | 中兴通讯股份有限公司 | Microphone array speech enhancement method and device |
CN106328156A (en) * | 2016-08-22 | 2017-01-11 | 华南理工大学 | Microphone array voice reinforcing system and microphone array voice reinforcing method with combination of audio information and video information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |