CN112731289B - Binaural sound source positioning method and device based on weighted template matching

Binaural sound source positioning method and device based on weighted template matching

Info

Publication number
CN112731289B
CN112731289B
Authority
CN
China
Prior art keywords
binaural
different
cross
similarity
sound source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011456914.0A
Other languages
Chinese (zh)
Other versions
CN112731289A (en)
Inventor
丁润伟
孙永恒
杨冰
刘宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PKU-HKUST SHENZHEN-HONGKONG INSTITUTION
Peking University Shenzhen Graduate School
Original Assignee
PKU-HKUST SHENZHEN-HONGKONG INSTITUTION
Peking University Shenzhen Graduate School
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PKU-HKUST SHENZHEN-HONGKONG INSTITUTION, Peking University Shenzhen Graduate School filed Critical PKU-HKUST SHENZHEN-HONGKONG INSTITUTION
Priority to CN202011456914.0A priority Critical patent/CN112731289B/en
Publication of CN112731289A publication Critical patent/CN112731289A/en
Application granted granted Critical
Publication of CN112731289B publication Critical patent/CN112731289B/en


Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention discloses a binaural sound source positioning method and device based on weighted template matching. In the training stage, binaural cross-correlation functions and binaural intensity differences are first extracted from training data for different directions, and templates are established from them; the weights for the different directions and frequency bands are then trained by gradient descent. In the online positioning stage, the features of the signal are first extracted and then matched for similarity against the templates of all directions on the different features and frequency bands; finally, the similarities of the different features and frequency bands are fused by weighting into a final similarity per candidate direction, and the direction of maximum similarity is taken as the sound source direction. Experiments in different noise environments show that the invention can resist noise interference to a certain extent and achieves angular localization of the sound source.

Description

Binaural sound source positioning method and device based on weighted template matching
Technical Field
The invention belongs to the technical field of information, relates to a binaural sound source positioning method applied to voice perception and voice enhancement, and in particular relates to a binaural sound source positioning method and device based on weighted template matching.
Background Art
Man-machine interaction plays an increasingly important role in the field of robots, as it makes communication between people and machines more convenient, efficient and friendly. In daily life, people perceive external information mainly through vision, hearing, touch, smell and taste. Humans obtain roughly 70%-80% of their information visually and about 10%-20% audibly. Auditory perception is one of the most natural, convenient and effective ways for people to exchange information with the outside world. In addition, compared with visual signals, auditory signals cover a full 360-degree field, are not affected by illumination, and do not require an unobstructed path between the sound source and the microphone, so robot hearing is one of the important routes to realizing man-machine interaction. Robot hearing mainly comprises sound source localization and tracking, speech denoising, speech enhancement, speech separation, speaker recognition, speech emotion recognition and the like. Sound source localization, as a front-end task of robot hearing, can provide spatial position information of speech to assist other speech tasks, and has become an important component of the robot auditory system.
Speech separation stems from the well-known "cocktail party" problem, i.e., the ability of a person to focus on one voice among numerous conversational sounds and noises, which has long been recognized as a challenging problem in speech separation. Combining sound source localization with speech separation provides the azimuth of the sound source, which helps separate aliased speech and improves the recognition accuracy of speech from directions of interest. In a video conference, the camera position can be adjusted in time according to the microphone-based sound source localization result, so that the camera turns toward the speaker. In video surveillance, the camera angle can be adjusted according to the sound source direction information, enlarging the monitoring range and achieving a better monitoring effect.
Sound source localization techniques can be broadly divided into microphone-array-based localization and binaural localization, depending on the number of microphones and whether a robot head with a pinna (cochlea-like) structure is present. Binaural microphone localization plays an important role in the field of humanoid robots, as it can fully exploit the diffraction of sound around the pinna structure to simulate human auditory characteristics. Robot binaural sound source localization uses only two microphones, mounted on the left and right sides of the robot head. Compared with localization using a plain two-microphone array, binaural localization benefits from the effects of the pinna and artificial-head diffraction on the sound signals, better simulates human auditory characteristics, and is better suited to scenarios such as humanoid robots, hearing-aid speech enhancement and virtual reality. It can also eliminate the front-back ambiguity problem of two-microphone localization.
The binaural sound source localization mainly comprises the following steps:
1. Simulation and recording of binaural signals. The binaural impulse response is convolved with a clean sound signal to obtain a simulated binaural signal, or the binaural signal is recorded directly as a real signal.
2. Analog-to-digital conversion and pre-filtering of the signal. The analog signal is first pre-filtered: a high-pass filter removes the 50 Hz mains-hum component, and a low-pass filter removes the components of the sound signal whose frequency exceeds half the sampling frequency, preventing aliasing interference; the analog sound signal is then sampled and quantized to obtain a digital signal.
3. Pre-emphasis. The signal is passed through a high-frequency emphasis filter with impulse response H(z) = 1 − 0.95z⁻¹ to compensate for the high-frequency attenuation caused by lip radiation.
4. Framing and windowing. The speech signal is time-varying, but the movement of the human mouth muscles is relatively slow, so the speech signal is generally considered stationary over short periods, typically 10 ms-30 ms. The signal is therefore framed at such intervals, for example every 20 ms. To mitigate the artifacts introduced by framing, the framed signal is usually windowed; common windows include the rectangular, Hanning and Hamming windows, with the Hamming window being the most widely used. (A minimal sketch of steps 3 and 4 appears after step 6 below.)
5. Feature extraction. From each frame, binaural features containing sound source azimuth information can be extracted; those commonly used in binaural sound source localization include the interaural cross-correlation function (CCF), the interaural time difference (ITD) and the interaural intensity difference (IID). Since many methods extract the interaural time difference from the cross-correlation function, the present invention uses the binaural cross-correlation function and binaural intensity difference features.
6. Localization. The extracted features are mapped to the corresponding directions so that the posterior probability of the sound source in the true direction is maximized. Many mapping methods exist, such as Gaussian mixture models and neural network models; the present invention uses a method based on weighted template matching.
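For concreteness, the following is a minimal Python/NumPy sketch of steps 3 and 4 (pre-emphasis and framing/windowing); the 16 kHz sampling rate, 20 ms frame length and 10 ms frame shift are illustrative assumptions, not values fixed by the invention.

    import numpy as np

    def pre_emphasis(x, alpha=0.95):
        # Apply H(z) = 1 - alpha * z^-1 to boost high frequencies.
        return np.append(x[0], x[1:] - alpha * x[:-1])

    def frame_and_window(x, fs=16000, frame_ms=20, shift_ms=10):
        # Split into overlapping frames and apply a Hamming window.
        frame_len = int(fs * frame_ms / 1000)   # 320 samples at 16 kHz
        shift = int(fs * shift_ms / 1000)       # 160 samples at 16 kHz
        n_frames = 1 + max(0, (len(x) - frame_len) // shift)
        win = np.hamming(frame_len)
        return np.stack([x[i * shift : i * shift + frame_len] * win
                         for i in range(n_frames)])

    x = np.random.randn(16000)                  # 1 s placeholder signal
    frames = frame_and_window(pre_emphasis(x))  # shape: (n_frames, 320)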
In traditional methods based on Gaussian mixture models or neural network models, the sound source direction is evaluated on each frequency band separately and the final result is obtained by summation, so the reliability of the different frequency bands and of the different features is not taken into account. In addition, neural-network-based methods lack interpretability.
Disclosure of Invention
In view of these problems, the invention aims to provide an interpretable binaural sound source localization method and device based on weighted template matching, which computes the likelihood of the sound source in each direction on each frequency band separately and integrates the results through weights for the different frequency bands and features to obtain the final sound source direction.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A binaural sound source localization method based on weighted template matching comprises the following steps:
Extracting binaural cross-correlation functions and binaural intensity differences in different directions from the training data;
establishing templates for the extracted binaural cross-correlation functions and binaural intensity differences in all directions;
Training weights of different binaural localization features and different frequency bands;
and during on-line positioning, extracting a binaural cross-correlation function and a binaural intensity difference of a sound source signal, performing similarity matching on the binaural cross-correlation function and the binaural intensity difference and templates in all directions, and fusing the similarity of different characteristics and different frequency bands through weights obtained through training to realize sound source positioning.
Further, extracting the binaural localization features in different directions from the training data means convolving the binaural impulse response with a clean speech signal, or directly using recorded speech signals, and calculating the cross-correlation function and binaural intensity difference for all directions; the different directions are divided into different horizontal steering angles, and the steering angles are divided non-uniformly.
Further, the steering angles are divided in the following manner: [-80, -65, -55, -45:5:45, 55, 65, 80].
Further, establishing the templates for the extracted binaural cross-correlation function and binaural intensity difference in each direction means taking, as the template for a direction, the average over multiple frames of the binaural localization features extracted from noise-free speech frames arriving from that direction.
Further, the weights of the different binaural localization features and frequency bands are trained by a back-propagation method, with the loss function set to the squared loss, so that the similarity between templates of the same direction is maximized and the similarity between templates of different directions is as small as possible.
Further, the similarity is calculated using the following formula:

sim(θ) = Σ_i ω_ccf,i · sim_ccf,i(θ) + Σ_i ω_iid,i · sim_iid,i(θ)

wherein sim(θ) represents the weighted similarity, ω_ccf,i represents the weight of the cross-correlation function on the i-th frequency band, and sim_ccf,i(θ) represents the cosine similarity between the cross-correlation function on the i-th frequency band and the template in direction θ; ω_iid,i represents the weight of the binaural intensity difference on the i-th frequency band, and sim_iid,i(θ) represents the similarity between the binaural intensity difference on the i-th frequency band and the template in direction θ.
A binaural sound source localization device based on weighted template matching employing the above method, comprising:
the training module is used for extracting the binaural cross-correlation functions and the binaural intensity differences in different directions from the training data, establishing templates for the extracted binaural cross-correlation functions and the binaural intensity differences in different directions, and then training weights of different binaural positioning features and different frequency bands;
And the on-line positioning module is used for extracting the binaural cross-correlation function and the binaural intensity difference of the sound source signal, matching the binaural cross-correlation function and the binaural intensity difference with templates in all directions, and fusing the similarity of different characteristics and different frequency bands through weights obtained through training to realize sound source positioning.
The beneficial effects of the invention are as follows:
By computing the likelihood of the sound source in each direction on every frequency band and integrating the results through weights for the different frequency bands and features to obtain the final sound source direction, the invention can resist noise interference to a certain extent and achieves angular localization of the sound source.
Drawings
FIG. 1 is an overall flow chart of the method of the present invention.
Fig. 2 is an example of the features extracted by the present invention. Panel (a) shows the extracted binaural cross-correlation function features and panel (b) shows the extracted binaural intensity difference features.
Fig. 3 is an example of similarity calculation between a sound source signal and each direction template according to the present invention. The upper half part represents the similarity between the cross-correlation function of the sound source signal and each direction template, and the lower half part represents the similarity between the binaural intensity difference of the sound source signal and each direction template. The abscissa indicates different directions.
Fig. 4 is the final weight training result of the present invention. The two broken lines total 64 points and respectively represent the weights of the binaural cross-correlation function and the binaural intensity difference of different frequency bands.
Detailed Description
The technical scheme of the present invention will be clearly and completely described in the following with reference to the embodiments and the accompanying drawings.
Fig. 1 is a flowchart of a binaural sound source localization method based on weighted template matching of the present invention, comprising the steps of:
1) Data preparation stage: simulate binaural signals in all directions and provide the original sound source signals.
1.1) Divide the front half-plane of the artificial head into 25 different horizontal steering angles; for example, the steering angles are divided non-uniformly as [-80, -65, -55, -45:5:45, 55, 65, 80], where -45:5:45 means one angle every 5 degrees from -45° to 45°.
1.2) Combine the clean speech signals provided by the TIMIT database with the binaural impulse responses provided by the CIPIC database and the different kinds of noise signals provided by the NOISEX-92 database to construct training and test data. The training data uses no noise signals; the test data uses noise at different signal-to-noise ratios, with test signals from -10 dB to 35 dB used in the experiments.
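A minimal sketch of this data construction is given below, assuming the TIMIT utterance, the CIPIC head-related impulse responses and the NOISEX-92 noise have already been loaded as NumPy arrays (placeholders are used here); the helper names are illustrative, not from the patent.

    import numpy as np

    # The 25 non-uniform horizontal steering angles of step 1.1.
    angles = np.concatenate(([-80, -65, -55], np.arange(-45, 50, 5), [55, 65, 80]))
    assert len(angles) == 25

    def make_binaural(speech, hrir_l, hrir_r):
        # Convolve clean speech with the left/right HRIRs of one direction.
        return np.convolve(speech, hrir_l), np.convolve(speech, hrir_r)

    def add_noise(signal, noise, snr_db):
        # Mix noise into the signal at the requested SNR in dB.
        noise = noise[:len(signal)]
        scale = np.sqrt(np.mean(signal**2) / (np.mean(noise**2) * 10**(snr_db / 10)))
        return signal + scale * noise

    speech = np.random.randn(16000)   # placeholder for a TIMIT utterance
    hrir_l = np.random.randn(200)     # placeholder CIPIC HRIRs
    hrir_r = np.random.randn(200)
    noise = np.random.randn(20000)    # placeholder NOISEX-92 segment
    left, right = make_binaural(speech, hrir_l, hrir_r)
    left_noisy = add_noise(left, noise, snr_db=10)   # test data only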
2) Training stage: extract binaural cross-correlation function and binaural intensity difference data, establish templates for the cross-correlation function (CCF) and the binaural intensity difference (IID), and train the weights corresponding to the different features and frequency bands, so that the similarity between templates of the same sound source direction is maximized while the similarity between templates of different sound source directions is as small as possible. The cross-correlation function and binaural intensity difference templates for all directions may be computed by convolving the head-related transfer functions (HRTFs, i.e., the binaural impulse responses) with the clean speech signal, or directly from recorded sound signals.
2.1) A 4th-order, 32-channel gammatone filterbank is used to split the direction-bearing signal into frequency bands; the maximum frequency is set to 7200 Hz.
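The patent does not give the filter equations, so the following sketch uses one standard formulation of a 4th-order gammatone filterbank with ERB-spaced center frequencies (Glasberg-Moore ERB constants, truncated FIR impulse responses); these modeling choices are assumptions, not taken from the patent.

    import numpy as np

    def erb(f):
        # Equivalent rectangular bandwidth (Glasberg & Moore).
        return 24.7 * (4.37 * f / 1000.0 + 1.0)

    def erb_space(f_low, f_high, n):
        # n center frequencies equally spaced on the ERB-rate scale.
        c = 9.26449 * 24.7
        return np.exp(np.linspace(np.log(f_low + c), np.log(f_high + c), n)) - c

    def gammatone_ir(fc, fs, duration=0.05, order=4, b=1.019):
        # Truncated impulse response of a 4th-order gammatone filter.
        t = np.arange(int(duration * fs)) / fs
        return (t ** (order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t)
                * np.cos(2 * np.pi * fc * t))

    def filterbank(x, fs=16000, n_bands=32, f_low=80.0, f_high=7200.0):
        # Return an (n_bands, len(x)) array of band-limited signals.
        fcs = erb_space(f_low, f_high, n_bands)
        bands = np.stack([np.convolve(x, gammatone_ir(fc, fs))[:len(x)]
                          for fc in fcs])
        return bands, fcs

    bands, fcs = filterbank(np.random.randn(16000))  # bands.shape == (32, 16000)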
2.2) Extract the cross-correlation function (CCF) and binaural intensity difference (IID) from the noise-free training data, and average over multiple frames to establish the templates; i.e., the template for a direction is the mean of the binaural localization features extracted from multiple noise-free speech frames arriving from that direction.
The calculation formula of the cross-correlation function is as follows:

CCF(i, τ) = G_l,r(i, τ) / √( G_l,l(i, τ_0) · G_r,r(i, τ_0) )

where

G_p,q(i, τ) = Σ_n x_p(i, n) x_q(i, n + τ),  p, q ∈ {l, r}

Here l and r denote the left and right ears respectively, i indexes the frequency bands, n indexes the sampling points within a frame, and τ denotes the time delay; x_p and x_q denote the left-ear or right-ear signal according to whether p and q take the value l or r; τ_0 denotes delay 0.
The binaural intensity difference calculation formula is as follows:

IID(i) = 10 log10( Σ_n x_l(i, n)² / Σ_n x_r(i, n)² )

where x_l denotes the signal received by the left ear and x_r denotes the signal received by the right ear.
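A minimal sketch of both feature computations for one frame of one frequency band, following the two formulas above; the ±18-sample lag range (37 values, about ±1.1 ms at 16 kHz) matches the embodiment described later.

    import numpy as np

    def ccf(xl, xr, max_lag=18):
        # Normalized cross-correlation over lags -max_lag..max_lag (37 values).
        g0 = np.sqrt(np.sum(xl * xl) * np.sum(xr * xr))  # G_ll(0) * G_rr(0)
        out = np.empty(2 * max_lag + 1)
        for k, tau in enumerate(range(-max_lag, max_lag + 1)):
            if tau >= 0:
                g = np.sum(xl[:len(xl) - tau] * xr[tau:])
            else:
                g = np.sum(xl[-tau:] * xr[:len(xr) + tau])
            out[k] = g / g0
        return out

    def iid(xl, xr, eps=1e-12):
        # Interaural intensity difference in dB for one band of one frame.
        return 10.0 * np.log10((np.sum(xl**2) + eps) / (np.sum(xr**2) + eps))

    xl = np.random.randn(320)   # one 20 ms frame of one band, left ear
    xr = np.random.randn(320)   # right ear
    print(ccf(xl, xr).shape, iid(xl, xr))   # (37,) and a scalar in dB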
2.3) The signals of the 25 different directions are labeled one-hot; for example, the -80° direction is set to [1, 0, ..., 0], with 24 zeros in total. The similarity between each training frame and the templates is then computed on the different frequency bands and features, yielding a (2×32)×25 similarity matrix (number of feature bands × number of candidate directions); the aim is to weight this matrix so that the similarity of the candidate direction corresponding to the true sound source is maximized. The weight matrix is 1×64, the similarity matrix is 64×25, and the final result is a 1×25 matrix sim(θ), computed as:
sim(θ) = Σ_i ω_ccf,i · sim_ccf,i(θ) + Σ_i ω_iid,i · sim_iid,i(θ)

wherein sim(θ) represents the weighted similarity, ω_ccf,i represents the weight of the cross-correlation function on the i-th frequency band, and sim_ccf,i(θ) represents the cosine similarity between the cross-correlation function on the i-th frequency band and the template in direction θ; ω_iid,i represents the weight of the binaural intensity difference on the i-th frequency band, and sim_iid,i(θ) represents the similarity between the binaural intensity difference on the i-th frequency band and the template in direction θ. The cosine similarity is calculated using the following formula:

sim_ccf,i(θ) = ( Σ_τ R_temp(θ, i, τ) · R_l,r(i, τ) ) / ( √(Σ_τ R_temp(θ, i, τ)²) · √(Σ_τ R_l,r(i, τ)²) )

where R_temp(θ, i, τ) denotes the cross-correlation function template of frequency band i in direction θ, and R_l,r(i, τ) denotes the cross-correlation function calculated from the received signal in frequency band i.
The binaural intensity difference similarity sim_iid,i(θ) compares the currently measured binaural intensity difference with the corresponding template value, where i denotes the band index, temp denotes the template, and θ denotes the direction; iid_temp,θ,i denotes the binaural intensity difference template corresponding to direction θ and the i-th frequency band, and iid_i denotes the binaural intensity difference of the i-th band currently calculated from the test signal.
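Putting the similarity computations together, the sketch below builds the 64×25 similarity matrix and fuses it with the weights; the template arrays are assumed precomputed (per-direction means of noise-free training frames), and since the patent's exact binaural-intensity-difference similarity formula did not survive extraction, a Gaussian kernel on the dB difference is used here purely as an illustrative stand-in.

    import numpy as np

    def cosine_sim(a, b, eps=1e-12):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

    def iid_sim(a, b):
        # Illustrative stand-in for sim_iid,i(theta): Gaussian kernel on dB gap.
        return float(np.exp(-(a - b) ** 2 / 2.0))

    def similarity_matrix(ccf_feat, iid_feat, ccf_tmpl, iid_tmpl):
        # Rows 0..31: CCF cosine similarities; rows 32..63: IID similarities.
        n_bands, n_dirs = ccf_tmpl.shape[0], ccf_tmpl.shape[1]
        S = np.zeros((2 * n_bands, n_dirs))
        for i in range(n_bands):
            for d in range(n_dirs):
                S[i, d] = cosine_sim(ccf_feat[i], ccf_tmpl[i, d])
                S[n_bands + i, d] = iid_sim(iid_feat[i], iid_tmpl[i, d])
        return S

    ccf_feat = np.random.randn(32, 37)      # current frame: CCF per band
    iid_feat = np.random.randn(32)          # current frame: IID per band
    ccf_tmpl = np.random.randn(32, 25, 37)  # templates: band x direction x lag
    iid_tmpl = np.random.randn(32, 25)
    w = np.full(64, 1.0 / 64)               # weights (uniform placeholder)

    S = similarity_matrix(ccf_feat, iid_feat, ccf_tmpl, iid_tmpl)
    sim_theta = w @ S                       # 1 x 25 fused similarity sim(theta)
    print(int(np.argmax(sim_theta)))        # index of the estimated direction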
2.4) The weights ω_ccf,i and ω_iid,i therein are trained by the back-propagation method.
The loss function is set to the squared loss L = Σ_θ ( y(θ) − sim(θ) )², where y is the above-mentioned one-hot real label and sim(θ) is the predicted weighted similarity. The weights are trained simultaneously over the two different binaural features and all frequency bands, and the trained weights have intuitive interpretability.
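The following sketch illustrates such weight training by plain gradient descent on the squared loss, treating each frame's 64×25 similarity matrix S and one-hot label y as given; the learning rate, epoch count and uniform initialization are illustrative assumptions.

    import numpy as np

    def train_weights(S_list, y_list, lr=0.01, epochs=200):
        # Minimize L = sum_theta (y(theta) - (w @ S)(theta))^2 over all frames.
        w = np.full(S_list[0].shape[0], 1.0 / S_list[0].shape[0])
        for _ in range(epochs):
            grad = np.zeros_like(w)
            for S, y in zip(S_list, y_list):
                err = w @ S - y          # (25,) residual per direction
                grad += 2.0 * (S @ err)  # dL/dw = 2 * S @ (w @ S - y)
            w -= lr * grad / len(S_list)
        return w

    rng = np.random.default_rng(0)
    S_list = [rng.random((64, 25)) for _ in range(100)]          # toy similarities
    y_list = [np.eye(25)[rng.integers(25)] for _ in range(100)]  # one-hot labels
    w = train_weights(S_list, y_list)   # learned 64 weights (cf. Fig. 4)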
3) Test stage: the acquired signals are first split into frequency bands by the gammatone filterbank; cross-correlation function and binaural intensity difference features are then extracted from each band signal; the similarity between these features and the templates of all directions is computed on the different features and frequency bands; finally, the weights obtained in the training stage are used to weight the similarities into likelihood values of the sound source coming from each direction, which yields the sound source direction information. That is, the similarities of the different features and frequency bands are fused by weighting into the final direction similarity, and the direction of maximum similarity is taken as the sound source direction.
A specific application example is provided below. The embodiment adopts the binaural impulse responses recorded with an artificial head for subject 003 of the CIPIC database, which divides the horizontal angle into 25 different angles and the pitch angle into 50 different angles, so as to simulate signals from different directions in a real environment. This example uses the 25 binaural impulse responses in the horizontal plane for horizontal-angle localization. The sound source signals are taken from the TIMIT database of real human speech. The sound signal is convolved with the binaural impulse response to faithfully simulate the noise-free signal received by the human ear. Noise from different environments recorded in the NOISEX-92 database is added to the binaural signals, so that the signals received by human ears in different kinds of noise environments can be simulated realistically.
In the training stage, the data prepared above are pre-emphasized, framed and windowed, and passed through the 4th-order, 32-band gammatone filterbank (lowest center frequency 80 Hz, highest 7200 Hz) to obtain signals in 32 different frequency bands. The cross-correlation function is then extracted with the cross-correlation formula; assuming the maximum interaural time difference does not exceed ±1.1 milliseconds, only 37 cross-correlation values are kept at the 16 kHz sampling rate. The binaural intensity difference is extracted at the same time with the IID formula, completing the feature extraction for the frame (as shown in Fig. 2). The template for each direction is the average of the binaural localization features extracted from multiple noise-free speech frames arriving from that direction. Finally, the similarity between the localization features of each frame and each direction template is computed, yielding 64 similarities per candidate direction (as shown in Fig. 3), which are weighted into the final direction similarity. Combined with the given similarity label (i.e., the one-hot label), back-propagation adjusts the weight values (as shown in Fig. 4).
In the test stage, the prepared data are first framed and windowed and passed through the same 4th-order, 32-band gammatone filterbank (center frequencies 80 Hz to 7200 Hz) to obtain 32 band signals. The cross-correlation functions are then extracted with the cross-correlation formula, again keeping 37 values for a maximum interaural time difference of ±1.1 ms at the 16 kHz sampling rate; the binaural intensity differences are extracted with the IID formula, completing the feature extraction for the frame (as shown in Fig. 2). The similarity between the localization features of the test signal and the templates of all directions is then computed, yielding 64 similarities per candidate direction (as shown in Fig. 3), and these are weighted into the final direction similarity. The direction of maximum similarity is selected as the sound source direction.
The training stage uses noise-free signals; the test stage uses noise at signal-to-noise ratios from -10 dB to 35 dB in 5 dB steps.
Experimental results show that the method can resist noise interference to a certain extent and achieves angular localization of the sound source.
Based on the same inventive concept, another embodiment of the present invention provides a binaural sound source localization device based on weighted template matching using the above method, which includes:
the training module is used for extracting the binaural cross-correlation functions and the binaural intensity differences in different directions from the training data, establishing templates for the extracted binaural cross-correlation functions and the binaural intensity differences in different directions, and then training weights of different binaural positioning features and different frequency bands;
And the on-line positioning module is used for extracting the binaural cross-correlation function and the binaural intensity difference of the sound source signal, matching the binaural cross-correlation function and the binaural intensity difference with templates in all directions, and fusing the similarity of different characteristics and different frequency bands through weights obtained through training to realize sound source positioning.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the inventive method.
Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.
It will be appreciated that the embodiments described above are only some, but not all, embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to fall within the scope of the present invention.

Claims (7)

1. A binaural sound source localization method based on weighted template matching is characterized by comprising the following steps:
Extracting binaural cross-correlation functions and binaural intensity differences in different directions from the training data;
establishing templates for the extracted binaural cross-correlation functions and binaural intensity differences in all directions;
Training weights of different binaural localization features and different frequency bands;
During on-line positioning, extracting a binaural cross-correlation function and a binaural intensity difference of a sound source signal, performing similarity matching on the binaural cross-correlation function and the binaural intensity difference and templates in all directions, and fusing the similarity of different characteristics and different frequency bands through weights obtained through training to realize sound source positioning;
training the weights of the different binaural localization features and different frequency bands by a back-propagation method, with the loss function set to the squared loss, so that the similarity between templates of the same direction is maximized and the similarity between templates of different directions is as small as possible;
the similarity is calculated using the following formula:

sim(θ) = Σ_i ω_ccf,i · sim_ccf,i(θ) + Σ_i ω_iid,i · sim_iid,i(θ)

wherein sim(θ) represents the weighted similarity, ω_ccf,i represents the weight of the cross-correlation function on the i-th frequency band, and sim_ccf,i(θ) represents the cosine similarity between the cross-correlation function on the i-th frequency band and the template in direction θ; ω_iid,i represents the weight of the binaural intensity difference on the i-th frequency band, and sim_iid,i(θ) represents the similarity between the binaural intensity difference on the i-th frequency band and the template in direction θ;
wherein the calculation formula of sim_ccf,i(θ) is:

sim_ccf,i(θ) = ( Σ_τ R_temp(θ, i, τ) · R_l,r(i, τ) ) / ( √(Σ_τ R_temp(θ, i, τ)²) · √(Σ_τ R_l,r(i, τ)²) )

where R_temp(θ, i, τ) denotes the cross-correlation function template of frequency band i in direction θ, and R_l,r(i, τ) denotes the cross-correlation function calculated from the received signal in frequency band i;

and sim_iid,i(θ) compares the currently measured binaural intensity difference with the corresponding template value, where i denotes the band index, temp denotes the template, and θ denotes the direction; iid_temp,θ,i denotes the binaural intensity difference template corresponding to direction θ and the i-th frequency band, and iid_i denotes the binaural intensity difference of the i-th band currently calculated from the test signal.
2. The method of claim 1, wherein the binaural localization features in different directions are extracted from the training data by convolving the binaural impulse response with a clean speech signal, or by directly using recorded sound signals, and calculating the cross-correlation function and binaural intensity difference for all directions; wherein the different directions are divided into different horizontal steering angles, and the steering angles are divided non-uniformly.
3. The method according to claim 1, wherein the steering angles are divided in the following manner: [-80, -65, -55, -45:5:45, 55, 65, 80].
4. The method of claim 1, wherein establishing the templates for the extracted binaural cross-correlation function and binaural intensity difference in each direction comprises taking, as the template for a direction, the average over multiple frames of the binaural localization features extracted from noise-free speech frames arriving from that direction.
5. A weighted template matching based binaural sound source localization device employing the method of any one of claims 1-4, comprising:
the training module is used for extracting the binaural cross-correlation functions and the binaural intensity differences in different directions from the training data, establishing templates for the extracted binaural cross-correlation functions and the binaural intensity differences in different directions, and then training weights of different binaural positioning features and different frequency bands;
And the on-line positioning module is used for extracting the binaural cross-correlation function and the binaural intensity difference of the sound source signal, matching the binaural cross-correlation function and the binaural intensity difference with templates in all directions, and fusing the similarity of different characteristics and different frequency bands through weights obtained through training to realize sound source positioning.
6. An electronic device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-4.
7. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any of claims 1-4.
CN202011456914.0A 2020-12-10 2020-12-10 Binaural sound source positioning method and device based on weighted template matching Active CN112731289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011456914.0A CN112731289B (en) 2020-12-10 2020-12-10 Binaural sound source positioning method and device based on weighted template matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011456914.0A CN112731289B (en) 2020-12-10 2020-12-10 Binaural sound source positioning method and device based on weighted template matching

Publications (2)

Publication Number Publication Date
CN112731289A CN112731289A (en) 2021-04-30
CN112731289B true CN112731289B (en) 2024-05-07

Family

ID=75599430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011456914.0A Active CN112731289B (en) 2020-12-10 2020-12-10 Binaural sound source positioning method and device based on weighted template matching

Country Status (1)

Country Link
CN (1) CN112731289B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012058805A1 (en) * 2010-11-03 2012-05-10 Huawei Technologies Co., Ltd. Parametric encoder for encoding a multi-channel audio signal
CN103901401A (en) * 2014-04-10 2014-07-02 北京大学深圳研究生院 Binaural sound source positioning method based on binaural matching filter
CN104965194A (en) * 2015-07-29 2015-10-07 渤海大学 Indoor multi-sound-source positioning device and method simulating binaural effect
CN105075293A (en) * 2013-03-29 2015-11-18 三星电子株式会社 Audio apparatus and audio providing method thereof
CN107144818A (en) * 2017-03-21 2017-09-08 北京大学深圳研究生院 Binaural sound sources localization method based on two-way ears matched filter Weighted Fusion
CN107346664A (en) * 2017-06-22 2017-11-14 河海大学常州校区 A kind of ears speech separating method based on critical band
CN110517705A (en) * 2019-08-29 2019-11-29 北京大学深圳研究生院 A kind of binaural sound sources localization method and system based on deep neural network and convolutional neural networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2820199C (en) * 2008-07-31 2017-02-28 Fraunhofer-Gesellschaft Zur Forderung Der Angewandten Forschung E.V. Signal generation for binaural signals
US9560439B2 (en) * 2013-07-01 2017-01-31 The University of North Carolina at Chapel Hills Methods, systems, and computer readable media for source and listener directivity for interactive wave-based sound propagation


Also Published As

Publication number Publication date
CN112731289A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
CN110517705B (en) Binaural sound source positioning method and system based on deep neural network and convolutional neural network
EP4184927A1 (en) Sound effect adjusting method and apparatus, device, storage medium, and computer program product
CN109887489B (en) Speech dereverberation method based on depth features for generating countermeasure network
CN106782565A (en) A kind of vocal print feature recognition methods and system
JP7326627B2 (en) AUDIO SIGNAL PROCESSING METHOD, APPARATUS, DEVICE AND COMPUTER PROGRAM
WO2019034184A1 (en) Method and system for articulation evaluation by fusing acoustic features and articulatory movement features
CN105575403A (en) Cross-correlation sound source positioning method with combination of auditory masking and double-ear signal frames
WO2021203880A1 (en) Speech enhancement method, neural network training method, and related device
CN107144818A (en) Binaural sound sources localization method based on two-way ears matched filter Weighted Fusion
Youssef et al. A binaural sound source localization method using auditive cues and vision
CN112731289B (en) Binaural sound source positioning method and device based on weighted template matching
Pertilä et al. Time Difference of Arrival Estimation with Deep Learning–From Acoustic Simulations to Recorded Data
CN112731291B (en) Binaural sound source localization method and system for collaborative two-channel time-frequency mask estimation task learning
CN111009259A (en) Audio processing method and device
Youssef et al. From monaural to binaural speaker recognition for humanoid robots
JP4240878B2 (en) Speech recognition method and speech recognition apparatus
Youssef et al. Simultaneous identification and localization of still and mobile speakers based on binaural robot audition
Youssef et al. Binaural speaker recognition for humanoid robots
Nguyen et al. Location Estimation of Receivers in an Audio Room using Deep Learning with a Convolution Neural Network.
Beeston Perceptual compensation for reverberation in human listeners and machines
CN112346013B (en) Binaural sound source positioning method based on deep learning
CN117854533A (en) Deep network driving microphone array voice enhancement method capable of sensing reverberation characteristics
Jiang et al. A DNN parameter mask for the binaural reverberant speech segregation
CN116486829A (en) Speech extraction method, electronic device and storage medium
Magadum et al. An Innovative Method for Improving Speech Intelligibility in Automatic Sound Classification Based on Relative-CNN-RNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant