CN110534126B - Sound source positioning and voice enhancement method and system based on fixed beam forming - Google Patents


Info

Publication number
CN110534126B
CN110534126B (application CN201910845095.XA)
Authority
CN
China
Prior art keywords
response power
controllable response
sound source
module
wave beam
Prior art date
Legal status
Active
Application number
CN201910845095.XA
Other languages
Chinese (zh)
Other versions
CN110534126A (en)
Inventor
Liu Fuchun (刘富春)
Yang Yang (杨洋)
Lin Qiguang (林其光)
Current Assignee
Guangzhou Zhi company artificial intelligence technology Co., Ltd.
South China University of Technology SCUT
Original Assignee
Guangzhou Zib Artificial Intelligence Technology Co ltd
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by Guangzhou Zib Artificial Intelligence Technology Co ltd, South China University of Technology SCUT filed Critical Guangzhou Zib Artificial Intelligence Technology Co ltd
Priority to CN201910845095.XA priority Critical patent/CN110534126B/en
Publication of CN110534126A publication Critical patent/CN110534126A/en
Application granted granted Critical
Publication of CN110534126B publication Critical patent/CN110534126B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/20Position of source determined by a plurality of spaced direction-finders
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0264Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • General Physics & Mathematics (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention discloses a sound source positioning and voice enhancement method and system based on fixed beam forming. The system comprises a data acquisition module, a sound source positioning module based on maximum controllable response power, and a voice enhancement module. The data acquisition module comprises an audio file analysis module and a microphone driving module; the sound source positioning module comprises a sub-band time-delay beamformer, a maximum controllable response power calculation module and a maximum controllable response power search module. The audio information stream acquired by the data acquisition module is transmitted to the sound source positioning module, which outputs a sound source position estimation direction to the voice enhancement module; the voice enhancement module, taking this estimated direction as its core, realizes voice enhancement through beam forming and obtains the sound source position information. The invention solves the key technical problems of sound source positioning and voice enhancement that provide support for intelligent terminals.

Description

Sound source positioning and voice enhancement method and system based on fixed beam forming
Technical Field
The invention relates to the technical field of multimedia, in particular to a sound source positioning and voice enhancement method and system based on fixed beam forming.
Background
Research on applying microphone arrays to speech enhancement began in the 1980s and became a research hotspot in the 1990s. Among microphone-array speech enhancement algorithms, the Delay-and-Sum Beamforming (DSB) method proposed by Flanagan applies delay compensation to the data received by different sensors so that the received signals are synchronized in the time domain, and then weights and averages them to obtain an enhanced signal. The method is simple in principle and easy to implement, but its performance is determined by the number of microphones and their spatial distribution, and because its filter coefficients are fixed, it is also called fixed beamforming.
To adapt beamforming-based speech enhancement to more complex noise environments, adaptive beamforming methods have been proposed, of which the Generalized Sidelobe Canceller (GSC) and the Linearly Constrained Minimum Variance (LCMV) beamformer are the most representative. The GSC runs a fixed beamformer and a blocking matrix in parallel: the blocking matrix filters the desired signal out of the noisy input, yielding an estimate of the noise. However, the signal passing through the blocking matrix still contains part of the target speech, so the enhanced signal is distorted. The LCMV algorithm is based on the Minimum Variance Distortionless Response (MVDR) criterion: under the constraint that the signal from the desired direction passes undistorted, the array output power is minimized, thereby suppressing noise.
Sound source localization is a necessary front-end for multi-channel speech enhancement: it must acquire the spatial position of the target sound source during human-computer interaction, and it is a key step of microphone-array speech enhancement. Among microphone-array sound source localization algorithms, methods based on Time Difference of Arrival (TDOA) estimation and the subspace-based Multiple Signal Classification (MUSIC) algorithm are widely used. The most classical time-delay estimation method is Generalized Cross-Correlation (GCC), which weights the received signals in the frequency domain using the correlation function and the Fourier transform, and extracts the relative delay from the peak of the correlation function. However, GCC is very sensitive to noise, so its performance in practical applications is not ideal. MUSIC requires prior knowledge of the number of sound sources and a sufficient number of microphones, and on top of that its computational load is large, so it is difficult to implement.
Therefore, a technical problem urgently needing to be solved by those skilled in the art is how to provide a sound source localization and speech enhancement method that simultaneously satisfies positioning accuracy, algorithm real-time performance and speech enhancement quality.
Disclosure of Invention
In view of this, the present invention provides a sound source localization and speech enhancement method and system based on fixed beam forming, which applies the beam forming theory and method to solve the key technical problem of providing support for the intelligent robot applied in the indoor environment.
The purpose of the invention is realized by at least one of the following technical schemes:
a fixed beamforming based sound source localization and speech enhancement system comprising: the voice recognition system comprises a data acquisition module, a sound source positioning module based on the maximum controllable response power and a voice enhancement module; the data acquisition module comprises an audio file analysis module and a microphone driving module; the sound source positioning module based on the maximum controllable response power comprises a sub-band time delay beam former, a maximum controllable response power calculation module and a maximum controllable response power search module;
the microphone driving module transmits the audio information stream acquired in real time by a microphone array with M microphones to the sub-band time-delay beamformer; the microphone array is a uniform circular array, but may take other geometric shapes;
the sub-band time-delay beam former receives the audio information stream generated by the data acquisition module, carries out delay-sum beam forming on each frame of audio data in the audio information stream according to a specific beam direction, forms a beam in the specific beam direction and transmits the beam to the maximum controllable response power calculation module, and outputs the controllable response power in the specific beam direction; the maximum controllable response power searching module searches a global maximum from the controllable response powers in different beam directions output by the maximum controllable response power calculating module and outputs the corresponding beam direction as a sound source position estimation direction, wherein sound source positioning is realized in the step of searching the global maximum of the controllable response power;
the voice enhancement module is a beam former, the sound source position estimation direction generated by the maximum controllable response power search module is sent to the voice enhancement module, and the voice enhancement module forms a beam in the sound source position estimation direction through delay-sum beam forming and outputs an enhanced voice signal to realize voice enhancement.
Further, the stream of audio information comprises M discrete signal sequences of equal length.
Further, the sub-band time-delay beamformer performs delay-sum beamforming on each frame of audio data in the audio information stream generated by the data acquisition module through the following steps:
s1.1, carrying out Fourier transform of adding a Hamming window on the audio information stream generated by the data acquisition module to obtain a discrete frequency domain signal;
s1.2, applying phase compensation corresponding to a specific beam direction to frequency points within 450 Hz-3000 Hz in the discrete frequency domain signals to obtain discrete frequency domain signals compensated by the specific beam direction;
s1.3, carrying out weighted average on the discrete frequency domain signals compensated in the specific wave beam direction to obtain average discrete frequency domain signals compensated in the specific wave beam direction;
and S1.4, finally, applying inverse Fourier transform to the average discrete frequency domain signal compensated in the specific beam direction to obtain the beam in the specific beam direction.
The phase compensation is determined by the beam direction; the phase compensation corresponding to the beam direction (θ, β) is:

τ_Omic1 = r·sin(β)·cos(θ)/c

F(s(t − τ_Omic1)) = e^(−jω·τ_Omic1)·F(s(t))

wherein τ_Omic1 is the time difference between a sound wave arriving at the point O and arriving at mic1, ω·τ_Omic1 denotes the phase compensation, i.e. the phase difference, F(·) denotes the Fourier transform, s(t) denotes the time-domain signal, O is the origin of the coordinate system, and mic1 is the microphone numbered 1; r is the microphone array radius, θ is the azimuth angle (the angle at the point O between the beam direction and mic1), β is the pitch angle of the beam direction, and c is the speed of sound.
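The sub-band delay-sum beamformer of steps S1.1-S1.4 can be sketched as follows. This is a minimal illustration, assuming a uniform circular array with microphone azimuths φm = 2πm/M, a pitch angle fixed at 90° (so sin β = 1), and one sign convention for the steering delay; all function and parameter names are mine, not the patent's:

```python
import numpy as np

def subband_dsb(frames, fs, theta, r, c=343.0, band=(450.0, 3000.0)):
    """Frequency-domain delay-and-sum beamforming for one frame from each of
    M microphones of a uniform circular array, with phase compensation applied
    only inside `band` (the 450-3000 Hz sub-band described in the patent).

    frames : (M, N) array of time-domain samples, one row per microphone
    theta  : steering azimuth in radians
    r      : array radius in metres
    """
    M, N = frames.shape
    X = np.fft.rfft(frames, axis=1)                 # per-microphone spectra
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)
    phi = 2.0 * np.pi * np.arange(M) / M            # assumed mic azimuths
    tau = r * np.cos(theta - phi) / c               # steering delay per mic
    comp = np.exp(1j * np.outer(tau, 2.0 * np.pi * freqs))
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    X = np.where(in_band, X * comp, X)              # compensate sub-band only
    return np.fft.irfft(X.mean(axis=0), n=N)        # equal weights 1/M
```

Steering toward the true source direction aligns the in-band components so they sum coherently, which is what the later response-power comparison exploits.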
Further, the pitch angle beta of the beam direction is fixed to be 90 degrees, and the azimuth angle theta is selected to be an integer number of degrees between 0-359 degrees.
Further, in step S1.4, the formed beam is a discrete signal sequence comprising at least 512 data points in length, corresponding to a sampling frequency of 16 kHz.
Further, the maximum controllable response power calculation module receives the discrete signal sequence output by the sub-band time-delay beamformer, i.e. the beam in a specific beam direction, and sums its squares point by point to obtain the controllable response power in that beam direction. The controllable response power is calculated as:

P = Σ_{n=0}^{N−1} y²(n), n ∈ ℤ

where y(n) is the discrete time-domain output of the sub-band time-delay beamformer, n is the point index of the discrete sequence, and N is the length of the sequence.
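The point-by-point sum of squares described here is a one-line computation; the function name is illustrative:

```python
import numpy as np

def steered_response_power(y):
    """Controllable (steered) response power of one beamformed frame:
    P = sum over n of y(n)^2, for the discrete time-domain output y."""
    y = np.asarray(y, dtype=float)
    return float(np.sum(y * y))
```

The search module then only needs to store (beam direction, power) pairs and rank them.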
Further, the maximum controllable response power searching module generates the sound source position estimation direction, which comprises the following steps:
S2.1, coarse search stage: between 0° and 359°, select 24 beam directions at equal intervals with 0° as the starting point; calculate the controllable response power of the 24 beam directions through the sub-band time-delay beamformer and the maximum controllable response power calculation module; compare and sort the 24 controllable response powers, and find the beam directions θ10 and θ11 corresponding to the largest and second-largest controllable response powers;
S2.2, first fine search stage: take the acute angle between θ10 and θ11 as the new search range; with θ10 as the starting point, select 5 beam directions at equal intervals; calculate the controllable response power of the 5 beam directions through the sub-band time-delay beamformer and the maximum controllable response power calculation module; compare and sort them, and find the beam directions θ20 and θ21 corresponding to the largest and second-largest controllable response powers;
S2.3, second fine search stage: take the acute angle between θ20 and θ21 as the new search range; with θ20 as the starting point, select 3 beam directions at equal intervals; calculate the controllable response power of the 3 beam directions through the sub-band time-delay beamformer and the maximum controllable response power calculation module; compare and sort them, and find the beam direction θmax corresponding to the largest controllable response power as the sound source position estimation direction;
and S2.4, outputting the sound source position estimation direction.
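The three-stage search of steps S2.1-S2.4 (24 → 5 → 3 directions) can be sketched as below. The acute-arc arithmetic and all names are my own reading of how the stages chain together, not code from the patent; `srp(theta_deg)` stands for the sub-band beamforming plus sum-of-squares evaluation of one direction:

```python
import numpy as np

def top2(dirs, powers):
    """Directions with the largest and second-largest response power."""
    order = np.argsort(powers)[::-1]
    return dirs[order[0]], dirs[order[1]]

def srp_search(srp):
    """Coarse-to-fine search over azimuth (degrees) for the direction of
    maximum steerable response power."""
    # coarse stage: 24 directions, every 15 degrees starting from 0
    dirs = np.arange(24) * 15.0
    t10, t11 = top2(dirs, np.array([srp(d) for d in dirs]))
    # fine stage 1: 5 directions across the acute arc from theta10 to theta11
    span = (t11 - t10) % 360.0
    if span > 180.0:
        span -= 360.0
    dirs = (t10 + span * np.arange(5) / 4.0) % 360.0
    t20, t21 = top2(dirs, np.array([srp(d) for d in dirs]))
    # fine stage 2: 3 directions across the acute arc from theta20 to theta21
    span = (t21 - t20) % 360.0
    if span > 180.0:
        span -= 360.0
    dirs = (t20 + span * np.arange(3) / 2.0) % 360.0
    powers = np.array([srp(d) for d in dirs])
    return float(dirs[int(np.argmax(powers))])
```

This evaluates 24 + 5 + 3 = 32 beam directions instead of all 360, which is the source of the real-time gain claimed later in the document.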
Further, delay-sum beamforming in the speech enhancement module is accomplished by:
s3.1, performing Fourier transform of adding a Hamming window on the audio information stream generated by the data acquisition module to obtain a discrete frequency domain signal;
s3.2, applying to all frequency points in the discrete frequency domain signal the phase compensation corresponding to the beam direction θmax output by the maximum controllable response power search module, to obtain the discrete frequency domain signal compensated toward the sound source position estimation direction;
s3.3, carrying out weighted average on the discrete frequency domain signals after the sound source position estimation direction compensation to obtain average discrete frequency domain signals after the sound source position estimation direction compensation;
and S3.4, finally, applying the inverse Fourier transform to the averaged discrete frequency domain signal compensated toward the sound source position estimation direction to obtain the beam in that direction, which is the enhanced voice signal, thereby realizing voice enhancement.
A sound source localization and speech enhancement method based on fixed beam forming comprises the following steps:
step 1, collecting audio information flow through a data collection module and transmitting the audio information flow to a sound source positioning module based on maximum controllable response power;
step 2, selecting a beam direction needing to calculate controllable response power in a coarse search stage, and performing corresponding time delay compensation on the audio information stream generated by the data acquisition module by using a sub-band time delay beam former, namely performing corresponding phase compensation and weighted average in a frequency domain to obtain a beam in the beam direction;
step 3, utilizing a maximum controllable response power calculation module to perform point-by-point square summation on the wave beam output in the step 2 to obtain the controllable response power of the wave beam, and associating the controllable response power with the corresponding wave beam direction, and storing the controllable response power and the corresponding wave beam direction together for a subsequent maximum controllable response power search module to use;
step 4, coarse search stage: between 0° and 359°, select 24 beam directions at equal intervals with 0° as the starting point; repeat step 2 and step 3 to calculate the controllable response power of the 24 beam directions; compare and sort them, and find the beam directions θ10 and θ11 corresponding to the largest and second-largest controllable response powers;
step 5, first fine search stage: take the acute angle between θ10 and θ11 as the new search range; with θ10 as the starting point, select 5 beam directions at equal intervals; repeat step 2 and step 3 to calculate the controllable response power of the 5 beam directions; compare and sort them, and find the beam directions θ20 and θ21 corresponding to the largest and second-largest controllable response powers;
step 6, second fine search stage: take the acute angle between θ20 and θ21 as the new search range; with θ20 as the starting point, select 3 beam directions at equal intervals; repeat step 2 and step 3 to calculate the controllable response power of the 3 beam directions; compare and sort them, and find the beam direction θmax corresponding to the largest controllable response power as the sound source position estimation direction, and send it to the voice enhancement module;
and step 7, the voice enhancement module receives the sound source position estimation direction, forms a beam in that direction through delay-sum beamforming, and outputs an enhanced voice signal to realize voice enhancement.
Compared with the prior art, the invention has the advantages that:
the invention provides a sound source positioning and voice enhancing method and system based on fixed beam forming by applying a beam forming theory and method, and compared with the traditional controllable beam forming positioning method based on maximum output power, the invention introduces a rapid controllable response power searching method, so that the algorithm real-time performance is improved on the premise that the positioning accuracy is not influenced; only one frequency domain sub-band is subjected to phase compensation in the delay-sum beam former, so that the adverse effect of partial noise on a positioning result is reduced, and the reliability of the algorithm on the positioning of a human sound source is improved; the pitch angle of the wave beam direction is fixed, sound source positioning in a two-dimensional plane is carried out, redundancy is reduced, efficiency is improved, and waste of computing resources is avoided; the method solves the key technical problems of sound source positioning and voice enhancement which are supported by an intelligent robot or other intelligent terminals applied to indoor environments.
Drawings
FIG. 1 is a block diagram of a DSB algorithm in an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the overall steps in an embodiment of the present invention;
FIG. 3 is a flow chart of the operation of the beamformer in an embodiment of the present invention;
FIG. 4 is a flowchart illustrating operation of the maximum controllable response power searching module in an embodiment of the present invention;
FIG. 5 is a flowchart of the controllable response power comparison step in the embodiment of the present invention.
Detailed Description
The following description of specific embodiments of the present invention will be made with reference to the accompanying drawings and examples:
the invention provides a two-dimensional sound source positioning and multi-channel voice enhancement method based on a microphone array and a fixed beam forming technology, and solves the problem that an intelligent robot or other intelligent terminals applied to an indoor environment need to support the sound source positioning and voice enhancement technology. Meanwhile, an optimized controllable response power searching method is adopted, and step searching is carried out in a searching space, so that better algorithm real-time performance is realized on the premise of ensuring positioning accuracy. The basic idea of beam forming is to perform delay compensation on the received signals of each sensor respectively to synchronize the signals of each channel, and then perform weighted average on the synchronized signals to obtain beams. Beamforming with fixed filter coefficients is called fixed beamforming. The delay-sum beam forming (DSB) method is one of the most easily implemented fixed beam forming methods, and its algorithm is simple, but the noise cancellation performance is related to the number of microphone array elements, the more array elements, the stronger the noise cancellation performance, and the higher the sound source signal frequency, the narrower the spatial beam formed by the DSB method, so that the signal will cause the distortion of the broadband signal when deviating from the beam gaze direction. The DSB algorithm block diagram is shown in fig. 1, where the beamforming performs corresponding delay compensation on the received signals of M microphones, that is, after performing corresponding phase compensation in the frequency domain, each microphone is given a weight value, and the compensated M signal sequences are weighted and averaged to obtain an output signal, which is the output signal of the beamformer.
Example (b):
as shown in FIG. 2, the overall system structure of the invention comprises a data acquisition module, a sound source positioning module based on maximum controllable response power and a voice enhancement module.
The data acquisition module comprises an audio file analysis module and a microphone driving module; the sound source positioning module based on the maximum controllable response power comprises a sub-band time delay beam former, a maximum controllable response power calculation module and a maximum controllable response power search module;
the microphone driving module is used for transmitting the audio information stream acquired by the microphone array with the M microphones in real time to the beam former with the sub-band time delay;
the sub-band time-delay beam former receives the audio information stream generated by the data acquisition module, carries out delay-sum beam forming on each frame of audio data in the audio information stream according to a specific beam direction, forms a beam in the specific beam direction and transmits the beam to the maximum controllable response power calculation module, and outputs the controllable response power in the specific beam direction; the maximum controllable response power searching module searches a global maximum from the controllable response powers in different beam directions output by the maximum controllable response power calculating module and outputs the corresponding beam direction as a sound source position estimation direction, wherein sound source positioning is realized in the step of searching the global maximum of the controllable response power;
the voice enhancement module is a beam former, the sound source position estimation direction generated by the maximum controllable response power search module is sent to the voice enhancement module, and the voice enhancement module forms a beam in the sound source position estimation direction through delay-sum beam forming and outputs an enhanced voice signal to realize voice enhancement.
As shown in fig. 4, the sound source positioning module based on the maximum controllable response power operates through the following steps:
step S1: according to the steps of the maximum controllable response power searching module, firstly, controllable response power calculation is carried out on the beam direction in the coarse searching stage, a beam direction needing to be calculated in the coarse searching stage is selected, corresponding time delay compensation is carried out on the audio information stream generated by the data acquisition module by utilizing a sub-band time delay beam former, namely, corresponding phase compensation and weighted average are carried out in a frequency domain to obtain the beam in the beam direction.
However, delay-compensating the acquired audio information stream, i.e. the voice signal, directly in the time domain results in a large error. Suppose the sampling rate of the microphone array is 16000 Hz and a four-microphone circular array of radius 0.0485 m (4.85 cm) is adopted, with the center of the microphone array as the origin of spherical coordinates, the pitch angle of the target sound source position at 90°, and the azimuth angle θ coinciding with microphone array element m. Theoretically, the time delay compensation applied to the signal received by microphone m is:

τ = r/c
where r is the microphone array radius and c is the speed of sound. At this time, the time resolution t0 of the sampled speech signal is:

t0 = 1/fs
where fs is the sampling rate. In this case τ is not an integer multiple of t0, and delay-compensating the voice signal directly in the time domain introduces an error close to 15%. Therefore, as shown in fig. 3, the sub-band time-delay beamformer performs a Hamming-windowed Fourier transform on the voice signal, applies phase compensation to the signal spectrum in the frequency domain to achieve the effect of time-delay compensation in the time domain, and then sequentially performs weighted averaging, the inverse Fourier transform and the window-shift operation to obtain the beam in the specific beam direction. The length of the windowed voice signal is generally between 10 ms and 30 ms, over which speech is short-time stationary. In this embodiment, the Fast Fourier Transform (FFT) size is 512 points, the window length is 512 points, and the window shift is 256 points. If F(·) denotes the Fourier transform and s(t) the time-domain signal, applying the delay τ above to s(t) corresponds to the phase compensation ωτ:

F(s(t − τ)) = e^(−jω·τ)·F(s(t))
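Plugging the embodiment's numbers into these quantities shows why a pure time-domain sample shift is too coarse (c = 343 m/s is an assumed sound speed, and the exact percentage depends on it, so the figure below is indicative rather than the patent's own value):

```python
r, c, fs = 0.0485, 343.0, 16000           # radius (m), sound speed (m/s), rate (Hz)
tau = r / c                               # ideal delay for a mic on the beam axis
samples = tau * fs                        # that delay in (fractional) samples
residual = abs(samples - round(samples))  # part an integer-sample shift cannot apply
rel_error = residual / samples
print(f"tau = {tau * 1e6:.1f} us = {samples:.3f} samples, "
      f"relative rounding error = {rel_error:.1%}")
```

Roughly a tenth of the delay would be lost to sample quantization, which is the motivation for compensating in the frequency domain instead.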
step S2: using the maximum controllable response power calculation module, the beam output in step S1, i.e. the time-domain discrete signal sequence, is squared and summed point by point to obtain the controllable response power of that beam, which is associated with the corresponding beam direction and stored together with it for use by the subsequent maximum controllable response power search module. The controllable response power is calculated as follows:
P = Σₙ y²(n)
where y(n) is the discrete time domain output of the delay-sum beamformer and n, the point index of the discrete sequence, ranges over the set of integers ℤ.
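Step S2 amounts to computing the energy of the beamformer output; a minimal sketch (the per-direction beams below are placeholders, not real beamformer outputs):

```python
import numpy as np

def steered_response_power(beam):
    """Controllable response power: point-by-point square-and-sum of the
    time-domain beam y(n) output by the delay-sum beamformer (step S2)."""
    beam = np.asarray(beam, dtype=float)
    return float(np.sum(beam ** 2))

# Store each SRP value together with its beam direction, as the maximum
# controllable response power search module expects.
beams = {theta: np.sin(2 * np.pi * np.arange(512) * theta / 512)
         for theta in (0, 15, 30)}          # placeholder beams per direction
srp = {theta: steered_response_power(b) for theta, b in beams.items()}
```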
Step S3: coarse search stage, in which steps S1 and S2 are each repeated 24 times; the number of coarse-search directions may be chosen according to accuracy and real-time requirements. In this embodiment, with 0° as the starting point, an angle is taken every 15° as a beam direction, and the controllable response powers of the 24 beam directions of the coarse search stage are obtained through steps S1 and S2. The maximum controllable response power search module then finds, using the flow shown in fig. 5, the largest and second-largest of the 24 controllable response powers, denoted SRP10 and SRP11 respectively, with corresponding beam directions denoted θ10 and θ11. The acute angle subtended by θ10 and θ11 is taken as the search interval of the first fine search stage.
In fig. 5, P1 and P2 denote two pointers to two beam directions in the 0-359° range of the current search stage; N denotes the number of coarse-search repetitions and is a constant; SRP[] denotes an array containing the SRP values of all beam directions, so that SRP[P1] is the SRP value of the beam direction pointed to by P1; P1++ means advancing pointer P1 to the next beam direction of the search stage, in order of beam-direction angle; Max denotes the pointer to the beam direction of the current largest SRP value, and Max2 the pointer to the beam direction of the current second-largest SRP value; MaximumSRP holds the largest SRP value, also given by SRP[Max], and SecondlargeSRP holds the second-largest SRP value, also given by SRP[Max2].
The flow of fig. 5 is as follows: initialize P1, P2, N and Max, and test whether the loop has ended. While the loop has not ended, compare the SRP values pointed to by P1 and P2: if the SRP pointed to by P1 is larger, keep P1 and advance P2; if the SRP pointed to by P2 is larger, store P2 in Max. When the loop ends, Max is the pointer to the beam direction with the largest SRP, and MaximumSRP holds that largest value. The second-largest SRP and its beam direction are then searched for: P1, P2 and Max2 are re-initialized while Max is retained; if Max is 0, P1 and P2 are first advanced before testing whether the loop has ended, otherwise the loop test is performed directly. While the loop has not ended, compare the SRP values pointed to by P1 and P2: if the SRP pointed to by P1 is larger, keep P1 and advance P2; if the SRP pointed to by P2 is larger, test whether P2 is the pointer to the global largest SRP value; if it is, simply advance P2; if it is not, store P2 in Max2 and then advance P2. When the loop ends, the second-largest SRP value is held in SecondlargeSRP, and Max2 is the pointer to its beam direction.
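The pointer-based procedure of fig. 5 amounts to a single-pass scan for the largest and second-largest SRP values and their beam directions; a compact sketch (not the patent's exact pointer implementation) is:

```python
def find_top_two(srp, directions):
    """Return (max_srp, its direction, second_srp, its direction) from
    parallel lists of SRP values and beam directions, in a single pass."""
    assert len(srp) == len(directions) >= 2
    if srp[0] >= srp[1]:
        best, second = 0, 1
    else:
        best, second = 1, 0
    for i in range(2, len(srp)):
        if srp[i] > srp[best]:
            second, best = best, i      # old maximum becomes second-largest
        elif srp[i] > srp[second]:
            second = i
    return srp[best], directions[best], srp[second], directions[second]

# Example: SRP values sampled every 15 degrees
powers = [3.0, 9.5, 7.2, 1.1]
angles = [0, 15, 30, 45]
print(find_top_two(powers, angles))   # -> (9.5, 15, 7.2, 30)
```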
In the ideal case, the SRP curve (abscissa: azimuth angle over 0-359°; ordinate: the SRP of that azimuth) has only one global maximum point and one local maximum point, and the two coincide. In practice, directional noise sources such as computer case fans and air conditioners have fixed directions, and such directional noise causes small local maxima, i.e. false peaks, to appear on the SRP curve. Taking 24 beam directions with a step of 15° in the coarse search effectively avoids the interference of false peaks and ensures that, in a non-noisy environment, the correct interval containing the global maximum is found, namely the acute angle subtended by θ10 and θ11.
Step S4: first fine search stage, in which steps S1 and S2 are each repeated 5 times within the search interval of the first fine search stage. In this embodiment, with θ10 as the starting point, an angle is taken every 3° as a beam direction, and the controllable response powers of the 5 beam directions of the first fine search stage are obtained through steps S1 and S2. The maximum controllable response power search module then finds the largest and second-largest of the 5 controllable response powers, denoted SRP20 and SRP21 respectively, with corresponding beam directions denoted θ20 and θ21. The global maximum is now considered to lie within the acute angle subtended by θ20 and θ21, which is taken as the search interval of the second fine search stage.
Step S5: second fine search stage, in which steps S1 and S2 are each repeated 3 times within the search interval of the second fine search stage. In this embodiment, with θ20 as the starting point, an angle is taken every 1° as a beam direction, and the controllable response powers of the 3 beam directions of the second fine search stage are obtained through steps S1 and S2. The maximum controllable response power search module then finds the largest of these 3 controllable response powers; its corresponding beam direction is the sound source position estimation direction.
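The three-stage 24 → 5 → 3 search of steps S3-S5 can be sketched end to end. Here `srp_of` stands in for the whole beamformer-plus-power pipeline of steps S1 and S2 and is an assumed placeholder; the interval handling (searching from the best direction toward the second-best) is a simplified reading of the acute-angle rule, not the patent's exact procedure:

```python
import math

def top_two(srp_of, start, step, count, sign=1):
    """Evaluate `count` beam directions spaced `step` degrees from `start`
    (moving in direction `sign`) and return the two highest-SRP angles."""
    cands = [(start + sign * i * step) % 360 for i in range(count)]
    ranked = sorted(cands, key=srp_of, reverse=True)
    return ranked[0], ranked[1]

def coarse_to_fine_search(srp_of):
    """Hierarchical SRP search: 24 directions every 15 deg, then 5 every
    3 deg, then 3 every 1 deg inside the shrinking interval (steps S3-S5).
    `srp_of(theta)` must return the SRP of beam direction theta in degrees."""
    t10, t11 = top_two(srp_of, 0, 15, 24)          # coarse stage
    sign = 1 if (t11 - t10) % 360 <= 180 else -1   # search toward 2nd peak
    t20, t21 = top_two(srp_of, t10, 3, 5, sign)    # fine stage 1
    sign = 1 if (t21 - t20) % 360 <= 180 else -1
    best, _ = top_two(srp_of, t20, 1, 3, sign)     # fine stage 2
    return best                                    # estimated azimuth (deg)

# Toy SRP surface with a single peak at 40 degrees (purely illustrative);
# the search needs 24 + 5 + 3 = 32 SRP evaluations instead of 360.
estimate = coarse_to_fine_search(lambda th: math.cos(math.radians(th - 40)))
```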
(II) voice enhancement module
The speech enhancement module performs speech enhancement based on fixed beamforming, using the classical delay-sum beamforming method. Delay-sum beamforming first applies a Hamming-windowed Fourier transform to the audio information stream output by the data acquisition module to obtain discrete frequency domain signals; it then applies, to all frequency points of the discrete frequency domain signals, the phase compensation corresponding to the sound source position estimation direction, yielding discrete frequency domain signals compensated for that direction; next, it takes the weighted average of the compensated discrete frequency domain signals to obtain the averaged compensated discrete frequency domain signal; finally, it applies the inverse Fourier transform to this averaged signal, shifts the window, repeats the above steps, and overlap-adds the time domain discrete signal sequences obtained from each inverse Fourier transform to obtain the beam in the sound source position estimation direction. This beam is the enhanced speech signal, and speech enhancement is thereby achieved.
In this embodiment, in order to reduce production cost while ensuring system reliability, the system is implemented on embedded hardware. A Raspberry Pi 3B is selected as the MPU; its ARM Cortex-A53 includes an FPU and supports floating point operations, runs at up to 1.2 GHz, and has 1 GB of RAM. A ReSpeaker 4-Mics Pi HAT is selected as the audio device, giving an embedded realization of the sound source localization and speech enhancement system. The Raspberry Pi 3B hosts the data acquisition module, the sound source localization module based on maximum controllable response power, and the speech enhancement module.
The above description is only a preferred embodiment of the present invention; the present invention is not limited to this embodiment, and slight structural changes may occur in implementation. Any changes or modifications that do not depart from the spirit and scope of the present invention and that fall within the claims and the equivalent technical scope of the present invention are intended to be included in the present invention.

Claims (9)

1. A fixed beamforming based sound source localization and speech enhancement system, comprising: the voice recognition system comprises a data acquisition module, a sound source positioning module based on the maximum controllable response power and a voice enhancement module; the data acquisition module comprises an audio file analysis module and a microphone driving module; the sound source positioning module based on the maximum controllable response power comprises a sub-band time delay beam former, a maximum controllable response power calculation module and a maximum controllable response power search module;
the microphone driving module is used for transmitting the audio information stream acquired by the microphone array with the M microphones in real time to the beam former with the sub-band time delay;
the sub-band time-delay beam former receives the audio information stream generated by the data acquisition module and carries out delay-sum beam forming on each frame of audio data in the audio information stream according to a specific beam direction, forming a beam in that specific beam direction which it transmits to the maximum controllable response power calculation module, which in turn outputs the controllable response power in the specific beam direction; the maximum controllable response power search module searches for the global maximum among the controllable response powers in different beam directions output by the maximum controllable response power calculation module and outputs the corresponding beam direction as the sound source position estimation direction, sound source localization being realized in this step of searching for the global maximum of the controllable response power;
the voice enhancement module is a beam former, the sound source position estimation direction generated by the maximum controllable response power search module is sent to the voice enhancement module, the voice enhancement module forms a beam in the sound source position estimation direction through delay-sum beam forming, and an enhanced voice signal is output to realize voice enhancement;
the maximum controllable response power searching module generates the sound source position estimation direction and comprises the following steps:
S2.1, coarse search stage: between 0° and 359°, with 0° as the starting point, selecting 24 wave beam directions at equal intervals, calculating the controllable response power of the 24 wave beam directions through the wave beam former with sub-band time delay and the maximum controllable response power calculation module, comparing and sorting the controllable response powers of the 24 wave beam directions, and finding the wave beam directions θ10 and θ11 corresponding to the largest and second-largest controllable response powers;
S2.2, first fine search stage: taking the acute angle subtended by θ10 and θ11 as the new search range, selecting 5 wave beam directions at equal intervals with θ10 as the starting point, calculating the controllable response power of the 5 wave beam directions through the wave beam former with sub-band time delay and the maximum controllable response power calculation module, comparing and sorting the controllable response powers of the 5 wave beam directions, and finding the wave beam directions θ20 and θ21 corresponding to the largest and second-largest controllable response powers;
S2.3, second fine search stage: taking the acute angle subtended by θ20 and θ21 as the new search range, selecting 3 wave beam directions at equal intervals with θ20 as the starting point, calculating the controllable response power of the 3 wave beam directions through the wave beam former with sub-band time delay and the maximum controllable response power calculation module, comparing and sorting the controllable response powers of the 3 wave beam directions, and finding the wave beam direction θmax corresponding to the largest controllable response power as the sound source position estimation direction;
and S2.4, outputting the sound source position estimation direction.
2. The fixed beamforming-based sound source localization and speech enhancement system according to claim 1, wherein the audio information stream comprises M discrete signal sequences of equal length.
3. The fixed beamforming based sound source localization and speech enhancement system according to claim 1, wherein the subband delay beamformer performs delay-sum beamforming on each frame of audio data in the audio information stream generated by the data acquisition module by:
s1.1, carrying out Fourier transform of adding a Hamming window on the audio information stream generated by the data acquisition module to obtain a discrete frequency domain signal;
s1.2, applying phase compensation corresponding to a specific beam direction to frequency points within 450 Hz-3000 Hz in the discrete frequency domain signals to obtain discrete frequency domain signals compensated by the specific beam direction;
s1.3, carrying out weighted average on the discrete frequency domain signals compensated in the specific wave beam direction to obtain average discrete frequency domain signals compensated in the specific wave beam direction;
and S1.4, finally, applying inverse Fourier transform to the average discrete frequency domain signal compensated in the specific beam direction to obtain the beam in the specific beam direction.
4. A sound source localization and speech enhancement system based on fixed beam forming according to claim 3, characterized in that in step S1.2, the phase compensation is determined by the beam direction, and the phase compensation corresponding to the beam direction is:
τ_Omic1 = (r · sin β · cos θ) / c

F(s(t − τ_Omic1)) = e^(−jωτ_Omic1) · F(s(t))
wherein τ_Omic1 is the time difference between the sound wave arriving at point O and arriving at mic1, ωτ_Omic1 denotes the phase compensation, i.e. the phase difference, F() denotes the Fourier transform, s(t) denotes the time domain signal, O is the origin of the coordinate system, and mic1 is the microphone numbered 1; r is the microphone array radius, θ is the azimuth angle, i.e. the angle at point O between the beam direction and mic1, β is the pitch angle of the beam direction, and c is the speed of sound.
5. The system according to claim 4, wherein the pitch angle β of the beam direction is fixed to 90 ° and the azimuth angle θ is selected to be an integer number of degrees between 0-359 °.
6. A sound source localization and speech enhancement system based on fixed beam forming according to claim 3, characterized in that in step S1.4 the formed beam is a discrete signal sequence comprising at least 512 data points in length, corresponding to a sampling frequency of 16 kHz.
7. The sound source localization and speech enhancement system based on fixed beam forming according to claim 1, wherein the maximum controllable response power calculation module receives the discrete signal sequence output by the sub-band time-delayed beam former, i.e. the beam in a specific beam direction, and squares and sums it point by point to obtain the controllable response power in that beam direction, the controllable response power being calculated as follows:
P = Σₙ y²(n)
where y(n) is the discrete time domain output of the sub-band delay beamformer and n, the point index of the discrete sequence, ranges over the set of integers ℤ.
8. The fixed beamforming based sound source localization and speech enhancement system according to claim 1, wherein the delay-sum beamforming in the speech enhancement module is performed by:
s3.1, performing Fourier transform of adding a Hamming window on the audio information stream generated by the data acquisition module to obtain a discrete frequency domain signal;
S3.2, applying, to all frequency points in the discrete frequency domain signals, the phase compensation corresponding to the beam direction θmax output by the maximum controllable response power search module, to obtain discrete frequency domain signals compensated for the sound source position estimation direction;
s3.3, carrying out weighted average on the discrete frequency domain signals after the sound source position estimation direction compensation to obtain average discrete frequency domain signals after the sound source position estimation direction compensation;
and S3.4, finally, applying the inverse Fourier transform to the averaged discrete frequency domain signal compensated for the sound source position estimation direction to obtain the beam in the sound source position estimation direction, which is the enhanced speech signal, thereby realizing speech enhancement.
9. A method for fixed beam forming based sound source localization and speech enhancement using the system of claim 1, comprising the steps of:
step 1, collecting audio information flow through a data collection module and transmitting the audio information flow to a sound source positioning module based on maximum controllable response power;
step 2, selecting a beam direction needing to calculate controllable response power in a coarse search stage, and performing corresponding time delay compensation on the audio information stream generated by the data acquisition module by using a sub-band time delay beam former, namely performing corresponding phase compensation and weighted average in a frequency domain to obtain a beam in the beam direction;
step 3, utilizing a maximum controllable response power calculation module to perform point-by-point square summation on the wave beam output in the step 2 to obtain the controllable response power of the wave beam, and associating the controllable response power with the corresponding wave beam direction, and storing the controllable response power and the corresponding wave beam direction together for a subsequent maximum controllable response power search module to use;
step 4, coarse search stage: between 0° and 359°, with 0° as the starting point, selecting 24 wave beam directions at equal intervals, repeating step 2 and step 3 to calculate the controllable response power of the 24 wave beam directions, comparing and sorting the controllable response powers of the 24 wave beam directions, and finding the wave beam directions θ10 and θ11 corresponding to the largest and second-largest controllable response powers;
step 5, first fine search stage: taking the acute angle subtended by θ10 and θ11 as the new search range, selecting 5 wave beam directions at equal intervals with θ10 as the starting point, repeating step 2 and step 3 to calculate the controllable response power of the 5 wave beam directions, comparing and sorting the controllable response powers of the 5 wave beam directions, and finding the wave beam directions θ20 and θ21 corresponding to the largest and second-largest controllable response powers;
step 6, second fine search stage: taking the acute angle subtended by θ20 and θ21 as the new search range, selecting 3 wave beam directions at equal intervals with θ20 as the starting point, repeating step 2 and step 3 to calculate the controllable response power of the 3 wave beam directions, comparing and sorting the controllable response powers of the 3 wave beam directions, and finding the wave beam direction θmax corresponding to the largest controllable response power as the sound source position estimation direction, which is sent to the speech enhancement module;
and step 7, the speech enhancement module receives the sound source position estimation direction, forms a beam in the sound source position estimation direction through delay-sum beam forming, and outputs the enhanced speech signal, realizing speech enhancement.
CN201910845095.XA 2019-09-07 2019-09-07 Sound source positioning and voice enhancement method and system based on fixed beam forming Active CN110534126B (en)

Publications (2)

Publication Number Publication Date
CN110534126A CN110534126A (en) 2019-12-03
CN110534126B true CN110534126B (en) 2022-03-22





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200116

Address after: 510300 4th floor unit, 1st, 2nd and 3rd floors, west side of No. 1383-5, Guangzhou Avenue South, Haizhu District, Guangzhou City, Guangdong Province

Applicant after: Guangzhou Zhi company artificial intelligence technology Co., Ltd.

Applicant after: South China University of Technology

Address before: 510300 4th floor unit, 1st, 2nd and 3rd floors, west side of No. 1383-5, Guangzhou Avenue South, Haizhu District, Guangzhou City, Guangdong Province

Applicant before: Guangzhou Zhi company artificial intelligence technology Co., Ltd.

GR01 Patent grant