CN112034424A - Neural network sound source direction finding method and system based on double microphones

Neural network sound source direction finding method and system based on double microphones

Info

Publication number
CN112034424A
CN112034424A
Authority
CN
China
Prior art keywords
sound source
neural network
output
hidden layer
layer
Prior art date: 2020-08-26
Legal status
Pending
Application number
CN202010871213.7A
Other languages
Chinese (zh)
Inventor
刘明
周彦兵
孙冲武
赵学华
高波
Current Assignee
Shenzhen Institute of Information Technology
Original Assignee
Shenzhen Institute of Information Technology
Priority date: 2020-08-26
Filing date: 2020-08-26
Publication date: 2020-12-04
Application filed by Shenzhen Institute of Information Technology
Priority to CN202010871213.7A
Publication of CN112034424A
Legal status: Pending

Classifications

    • G01S5/22: Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements, using ultrasonic, sonic, or infrasonic waves (under G01S5/18)
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045: Combinations of networks
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/084: Backpropagation, e.g. using gradient descent


Abstract

The invention provides a neural network sound source direction finding method and system based on double microphones, wherein the neural network sound source direction finding method comprises the following steps: a two-path sampling signal acquisition step: two microphones are adopted to collect time domain data of two paths of sound source signals, and the collected time domain data are simultaneously transmitted to a neural network analysis module and a correlation characteristic extraction module; an output characteristic obtaining step: the neural network analysis module, which comprises an input layer, a hidden layer 1, a hidden layer 2, a hidden layer 3 and an output layer, receives the collected time domain data, analyzes the time sequence relation between the two paths of sound source signals through a recurrent neural layer, and finally obtains the processed output characteristics at the hidden layer 2; a correlation characteristic extraction step: the correlation characteristic extraction module calculates correlation coefficients of multiple angles of the two paths of sound source signals and cascades them with the output characteristics of the hidden layer 2; a subsequent processing step: the cascaded output characteristics are sent to the hidden layer 3 and an angle classification result is given at the output layer. The invention has the beneficial effects that sound source detection is realized with only two microphones, which reduces the cost and power consumption of a voice product.

Description

Neural network sound source direction finding method and system based on double microphones
Technical Field
The invention relates to the field of data processing, in particular to a neural network sound source direction finding method and system based on double microphones.
Background
Currently, most voice products on the market adopt a single-microphone system to pick up and process speech, and such products generally use a high-performance, highly directional microphone to acquire a high-quality sound source signal. However, a highly directional single-microphone system can pick up only one sound source signal and cannot adjust the direction of the microphone as the sound source moves, which greatly limits flexibility of use. Furthermore, in products where the direction and location of the sound source are required, a single-microphone system has no capability of automatically detecting the azimuth and tracking the sound source. Although some methods enhance the sensitivity of a voice product to the sound source position by forming an array from a plurality of microphones, such products usually need more than 4 microphones, which considerably increases the cost and power consumption of the voice product.
Disclosure of Invention
The invention provides a neural network sound source direction finding method based on double microphones, which comprises the following steps:
acquiring a two-path sampling signal: two microphones are adopted to collect time domain data of two paths of sound source signals, and the collected time domain data are simultaneously transmitted to a neural network analysis module and a correlation characteristic extraction module;
an output characteristic obtaining step: receiving the time domain data collected in the acquisition step of the two-path sampling signals by adopting a neural network analysis module, wherein the neural network analysis module comprises an input layer, a hidden layer 1, a hidden layer 2, a hidden layer 3 and an output layer, analyzing the time sequence relation between the two paths of sound source signals through a recurrent neural layer on the received time domain data, and finally obtaining the processed output characteristics at the hidden layer 2;
a correlation characteristic extraction step: adopting a correlation characteristic extraction module to receive the time domain data collected in the acquisition step of the two-path sampling signals, then calculating, by the correlation characteristic extraction module, correlation coefficients of multiple angles of the two paths of sound source signals in the received time domain data, and cascading the correlation coefficients with the output characteristics of the hidden layer 2 of the neural network analysis module;
a subsequent processing step: sending the cascaded output characteristics to the hidden layer 3 of the neural network analysis module for subsequent processing, and giving an angle classification result at the output layer of the neural network analysis module.
As a further improvement of the present invention, after the subsequent processing is performed, the method further comprises the following step: a statistical judgment step: counting the classification results of the angles of all frames within a set period by adopting a counting and judging module, and finally outputting the angle value with the highest count as the direction finding result for each second.
As a further improvement of the present invention, in the output characteristic acquiring step, the following steps are further performed:
step 1: sampling data is processed in frames by adopting a sampling frequency of 44kHz, the frame length of each frame is 5ms, and each frame has 220 sampling points;
step 2: the input layer inputs two paths of sampling data each time; the hidden layer 1 adopts an LSTM neural network structure with a memory effect on a time sequence; the hidden layer 2, the hidden layer 3 and the output layer adopt a full-connection layer structure.
As a further improvement of the present invention, in step 2, the input layer inputs 440-dimensional data together, the hidden layer 1 uses 256 LSTM neurons together, the hidden layer 2 uses 128 neurons, the hidden layer 3 uses 64 neurons, and the output layer uses 7 neurons.
As a further improvement of the present invention, in the step 2, the operation principle of the LSTM neural network structure is as follows:
the LSTM unit takes the feature x_n of the current frame, the previously retained output result h_{n-1} and the retained state C_{n-1} of the last frame as joint inputs, processes them to generate the output h_n of the current frame and the output state C_n of the current frame, and repeats this recursive operation to capture the timing relationship between the signals; the current-frame output h_n generated by each operation is passed on to the following hidden layers for subsequent operations.
As a further improvement of the invention, the output layer adopts a multi-classification Softmax function as the objective function of the model and takes the cross entropy as the loss function of the training model, calculated as follows:

$$L = -\sum_{i=1}^{m} T_i \log(P_i) \qquad (8)$$

where T_i represents the real classification label of the training data and P_i denotes the Softmax output probability of the i-th class; since 7 direction finding angles are output, the value of m in the calculation formula (8) is 7.
As a further improvement of the present invention, in the step of extracting the correlation characteristics, the correlation characteristics extracting module calculates correlation coefficients of 7 angles of two sound source signals in the received time domain data, where the 7 angles are 0 °, 30 °, 60 °, 90 °, 120 °, 150 °, and 180 °, respectively.
As a further improvement of the invention, the correlation coefficients of the 7 angles are calculated as follows:
Step S1: calculating the time difference of the two paths of sound source signals reaching the two microphones; the method comprises the following specific steps:
the distance d between the two microphones is 15cm, and the time difference of the two sound source signals reaching the two microphones is calculated according to the following formula:
$$\tau = \frac{d\cos\theta}{v_{sound}}$$

where θ is the angle of the sound source, ranging from 0 to 180 degrees, and v_sound represents the propagation speed of sound, which is taken as 340 m/s;
step S2: supposing that the signal received by one microphone is X1(t) and the signal received by the other microphone is X2(t), when the correlation characteristic extraction module performs processing, two frames of data are buffered and respectively denoted the t-1 frame and the t frame; only the correlation coefficients of the t-1 frame are calculated each time, and after the calculation is finished, the t frame data are moved forward by 220 points to become the new t-1 frame, while positions 221-440 are filled with newly sampled data as the new t frame data;
step S3: the correlation characteristic extraction module calculates correlation coefficients from delays corresponding to different angles to obtain 7-dimensional correlation coefficients, specifically:
taking X1(t) as a reference, shifting X2(t) to the right to realize alignment, and calculating the correlation coefficient as follows:
$$R_n = \frac{\mathrm{Cov}\left(X_1(t),\,X_2(t+\tau_n)\right)}{\sqrt{\mathrm{Var}\left(X_1(t)\right)\mathrm{Var}\left(X_2(t+\tau_n)\right)}},\qquad n = 1, 2, 3, 4$$
taking X2(t) as a reference, shifting X1(t) to the right to realize alignment, and calculating the correlation coefficient as follows:
$$R_n = \frac{\mathrm{Cov}\left(X_1(t+\tau_n),\,X_2(t)\right)}{\sqrt{\mathrm{Var}\left(X_1(t+\tau_n)\right)\mathrm{Var}\left(X_2(t)\right)}},\qquad n = 5, 6, 7$$
where Cov(·) represents covariance calculation of two frames of data, Var(·) represents variance calculation, τ_n is the delay corresponding to the n-th angle, n = 1, 2, 3 and 4 represent the sound source incidence conditions at angles of 0 degrees, 30 degrees, 60 degrees and 90 degrees, respectively, and n = 5, 6 and 7 represent the sound source incidence conditions at angles of 120 degrees, 150 degrees and 180 degrees, respectively.
In the statistical judgment step, the statistical judgment module performs count statistics on the classification results of the angles of the 200 frames within 1 s.
The invention also discloses a neural network sound source direction finding system based on the double microphones, which comprises the following components: a memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the neural network sound source direction finding method of the present invention when invoked by the processor.
The invention has the beneficial effects that: 1. the neural network sound source direction finding method adopts only two microphones to realize sound source detection, thereby reducing the cost and power consumption of a voice product; 2. the neural network sound source direction finding method detects the sound source direction with a neural network, which gives higher accuracy and stronger anti-interference capability and realizes real-time tracking of the sound source; 3. the neural network sound source direction finding method combines the correlation characteristics of the two paths of signals, which simplifies the design of the neural network model and reduces the operation complexity of the algorithm.
Drawings
FIG. 1 is a diagram of the direction-finding angle of a dual-microphone sound source of the neural network sound source direction-finding method of the present invention;
FIG. 2 is a block diagram of a neural network sound source direction finding algorithm of a dual microphone of the neural network sound source direction finding method of the present invention;
FIG. 3 is a schematic diagram of the operation of the LSTM unit in the neural network sound source direction-finding method of the present invention;
fig. 4 is an alignment schematic diagram of two paths of microphone sampling data of the neural network sound source direction finding method of the present invention.
Detailed Description
As shown in fig. 2, the invention discloses a neural network sound source direction finding method based on double microphones, which uses two microphones to detect a sound source within the 180-degree range in front of the microphone pair, and the method comprises the following steps:
acquiring a two-path sampling signal: two microphones are adopted to collect time domain data of two paths of sound source signals, and the collected time domain data are simultaneously transmitted to a neural network analysis module 4 and a correlation characteristic extraction module 5;
an output characteristic obtaining step: the method comprises the steps that a neural network analysis module 4 is adopted to receive time domain data collected in the acquisition step of two paths of sampling signals, the neural network analysis module 4 comprises an input layer, a hidden layer 1, a hidden layer 2, a hidden layer 3 and an output layer, the neural network analysis module 4 analyzes the time sequence relation between two paths of sound source signals through a recurrent neural layer on the received time domain data, and finally processed output characteristics are obtained on the hidden layer 2;
a correlation characteristic extraction step: a correlation characteristic extraction module 5 is adopted to receive the time domain data acquired in the acquisition step of the two paths of sampling signals, then the correlation characteristic extraction module 5 calculates correlation coefficients of multiple angles of the two paths of sound source signals in the received time domain data, and the correlation coefficients are cascaded with the output characteristics of the hidden layer 2 of the neural network analysis module 4;
and (3) subsequent processing steps: the cascaded output characteristics are sent to a hidden layer 3 of a neural network analysis module 4 for subsequent processing, and an angle classification result is given at an output layer of the neural network analysis module 4.
In order to further improve the stability of the direction finding output angle value, the method further comprises the following steps after the subsequent processing is executed:
a statistical judgment step: counting the classification results of the angles of all frames within a set period by adopting a counting and judging module 6, and finally outputting the angle value with the highest count as the direction finding result for each second.
In the output characteristic obtaining step, the method further comprises the following steps:
step 1: sampling frequency of 44kHz is adopted, sampling data are processed in frames, the frame length of each frame is 5ms, namely 220 sampling points of each frame;
step 2: the input layer inputs two paths of sampling data each time; the hidden layer 1 adopts a Long Short-Term Memory (LSTM) unit with a Memory effect on a time sequence to ensure the sensitivity of the model to the time sequence; the hidden layer 2, the hidden layer 3 and the output layer adopt a full-connection layer structure.
In step 2, the input layer inputs 440-dimensional data together, the hidden layer 1 adopts 256 LSTM neurons together, the hidden layer 2 adopts 128 neurons, the hidden layer 3 adopts 64 neurons, and the output layer adopts 7 neurons.
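For reference, the architecture and layer sizes described above can be expressed as the following minimal Keras sketch; the sequence handling (feeding a whole sequence of frames together with per-frame correlation features) and the training configuration are assumptions for illustration, not part of the patent text.

```python
from tensorflow.keras import layers, Model

# Two inputs per the text: 440-dim two-path frame samples and 7 correlation coefficients.
frames_in = layers.Input(shape=(None, 440))  # sequence of 440-dim frames
corr_in = layers.Input(shape=(None, 7))      # per-frame 7-angle correlation features

h1 = layers.LSTM(256, return_sequences=True)(frames_in)                # hidden layer 1: 256 LSTM neurons
h2 = layers.TimeDistributed(layers.Dense(128, activation="relu"))(h1)  # hidden layer 2: 128 neurons
cascaded = layers.Concatenate()([h2, corr_in])                         # cascade correlation features
h3 = layers.TimeDistributed(layers.Dense(64, activation="relu"))(cascaded)  # hidden layer 3: 64 neurons
out = layers.TimeDistributed(layers.Dense(7, activation="softmax"))(h3)     # 7 angle classes

model = Model([frames_in, corr_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```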
As shown in fig. 3, in step 2, the operation principle of the LSTM neural network structure is as follows:
the LSTM unit takes the feature x_n of the current frame, the previously retained output result h_{n-1} and the retained state C_{n-1} of the last frame as joint inputs, processes them to generate the output h_n of the current frame and the output state C_n of the current frame, and repeats this recursive operation to capture the timing relationship between the signals; the current-frame output h_n generated by each operation is passed on to the following hidden layers for subsequent operations.
The calculation of each gate and its output is as follows, where σ(·) and tanh(·) represent the sigmoid and hyperbolic tangent activation functions, respectively:

$$\tilde{C}_n = \tanh(W_c[h_{n-1}, x_n] + b_c) \qquad (1)$$

$$f_n = \sigma(W_f[h_{n-1}, x_n] + b_f) \qquad (2)$$

$$u_n = \sigma(W_u[h_{n-1}, x_n] + b_u) \qquad (3)$$

$$O_n = \sigma(W_o[h_{n-1}, x_n] + b_o) \qquad (4)$$

$$C_n = f_n * C_{n-1} + u_n * \tilde{C}_n \qquad (5)$$

$$h_n = O_n * \tanh(C_n) \qquad (6)$$
where f_n represents the output of the forget gate of the current frame, u_n represents the output of the update gate of the current frame, and O_n represents the output of the output gate of the current frame.
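As an illustration, equations (1) to (6) can be reproduced in a short NumPy sketch; the dictionary layout of the gate weights is an assumption made for readability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_n, h_prev, c_prev, W, b):
    """One recursion of equations (1)-(6); W and b hold the four gate parameters."""
    z = np.concatenate([h_prev, x_n])        # [h_{n-1}, x_n]
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate state, eq. (1)
    f_n = sigmoid(W["f"] @ z + b["f"])       # forget gate, eq. (2)
    u_n = sigmoid(W["u"] @ z + b["u"])       # update gate, eq. (3)
    o_n = sigmoid(W["o"] @ z + b["o"])       # output gate, eq. (4)
    c_n = f_n * c_prev + u_n * c_tilde       # new cell state, eq. (5)
    h_n = o_n * np.tanh(c_n)                 # current frame output, eq. (6)
    return h_n, c_n
```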
The hidden layers 2 and 3 in the neural network model are fully connected layers, and after each neuron performs weighted summation, nonlinear activation operation is performed, as shown in the following formula (7):
$$h^{(i)} = g(W \cdot h^{(i-1)} + b) \qquad (7)$$

where W and b are the weight and bias of the neuron, respectively, h^{(i)} represents the output of the i-th hidden layer, and g(·) represents the nonlinear activation operation; here the ReLU activation function is used.
In addition, the output layer of the neural network analysis module 4 adopts a full-connection structure, but only performs linear operation. The output layer adopts a multi-classification Softmax function as an objective function of the model, takes the cross entropy as a loss function of the training model, and has a calculation formula shown as the following formula:
$$L = -\sum_{i=1}^{m} T_i \log(P_i) \qquad (8)$$

where T_i represents the real classification label of the training data and P_i denotes the Softmax output probability of the i-th class; since 7 direction finding angles are output, the value of m in the calculation formula (8) is 7. That is, the output layer of the neural network analysis module 4 gives the output probabilities of the 7 neurons, these probabilities sum to 1, and the angle value corresponding to the neuron with the highest probability is taken as the sound source direction measured by the neural network analysis module 4.
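For illustration, the Softmax output and the cross-entropy loss of formula (8) can be sketched as follows; the example logits are arbitrary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(t, p):
    """Formula (8): L = -sum_i T_i * log(P_i), with m = 7 angle classes."""
    return -np.sum(t * np.log(p + 1e-12))

ANGLES = [0, 30, 60, 90, 120, 150, 180]
logits = np.array([0.1, 2.3, 0.4, -1.0, 0.0, 0.2, -0.5])  # arbitrary example
p = softmax(logits)                  # 7 probabilities summing to 1
print(ANGLES[int(np.argmax(p))])     # angle of the most probable neuron -> 30
```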
In the step of extracting the correlation characteristics, the correlation characteristic extraction module 5 calculates the correlation coefficients of 7 angles of the two sound source signals in the received time domain data, where the 7 angles are 0°, 30°, 60°, 90°, 120°, 150° and 180°, increasing counterclockwise from 0° across the front of the two microphones, as shown in fig. 1.
In addition to the 440-dimensional two-path sampling data input to the neural network analysis module 4, the correlation coefficients of the 7 angles extracted by the correlation characteristic extraction module 5 are also input to the neural network as a group of important features to assist the model in classifying the sound source angle. The correlation coefficients of the 7 angles are calculated as follows:
step S1: calculating the time difference of the two paths of sound source signals reaching the two microphones; the method comprises the following specific steps:
the distance d between the two microphones is 15cm, and the time difference of the two sound source signals reaching the two microphones is calculated according to the following formula:
$$\tau = \frac{d\cos\theta}{v_{sound}}$$

where θ is the angle of the sound source, ranging from 0 to 180 degrees, and v_sound represents the propagation speed of sound, which is taken as 340 m/s. When the sound source is located at 0 degrees or 180 degrees of the dual-microphone system, the time difference between the two signals is the largest, about 0.44 ms; when the sound source is located at 30 degrees or 150 degrees, the time difference is about 0.38 ms; when the sound source is located at 60 degrees or 120 degrees, the time difference is about 0.22 ms; and when the sound source is located directly in front of the dual-microphone system (at the 90-degree position), the two signals arrive at the same time and there is no time difference. From the difference in the arrival times of the two signals, together with the order in which the two microphones receive them, 7 different sound source positions can be effectively distinguished. In order to ensure sufficient time resolution, the designed algorithm adopts a sampling rate of 44 kHz; converted into sampling points, the time difference between the two signals is 19 sampling points when the sound source is located at 0 degrees or 180 degrees, 16 sampling points when the sound source is located at 30 degrees or 150 degrees, 9 sampling points when the sound source is located at 60 degrees or 120 degrees, and 0 when the sound source is located at the 90-degree position. Therefore, the two paths of sound source data can be correspondingly delayed in the time domain, and the correlation coefficients of the 7 different incidence angles calculated to distinguish the sound source positions.
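The angle-to-delay mapping just described can be checked with a short sketch; truncating toward zero reproduces the 19, 16, 9 and 0 sample offsets stated above, with the sign encoding which microphone receives the signal first.

```python
import numpy as np

D = 0.15    # microphone spacing in metres
V = 340.0   # propagation speed of sound in m/s
FS = 44000  # sampling rate in Hz

angles = np.deg2rad([0, 30, 60, 90, 120, 150, 180])
tau = D * np.cos(angles) / V       # time differences per the formula above
delays = (tau * FS).astype(int)    # truncate toward zero
print(delays)                      # -> [19 16 9 0 -9 -16 -19]
```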
Step S2: assuming that a signal received by the left microphone in fig. 4 is X1(t), a signal received by the right microphone is X2(t), when the correlation feature extraction module 5 performs processing, two frames of data are cached and respectively marked as a t-1 frame and a t frame, a correlation coefficient of the t-1 frame is only calculated each time, after the calculation is completed, the t frame data is moved forward by 220 points as a new t-1 frame, and the positions of 221-440 points are filled with the newly sampled data as the t frame data;
step S3: the correlation characteristic extraction module 5 calculates the correlation coefficients at the delays corresponding to the different angles to obtain a 7-dimensional correlation coefficient vector, specifically:
taking X1(t) as a reference, shifting X2(t) to the right to realize alignment, and calculating the correlation coefficient as follows:
$$R_n = \frac{\mathrm{Cov}\left(X_1(t),\,X_2(t+\tau_n)\right)}{\sqrt{\mathrm{Var}\left(X_1(t)\right)\mathrm{Var}\left(X_2(t+\tau_n)\right)}},\qquad n = 1, 2, 3, 4$$
taking X2(t) as a reference, shifting X1(t) to the right to realize alignment, and calculating the correlation coefficient as follows:
$$R_n = \frac{\mathrm{Cov}\left(X_1(t+\tau_n),\,X_2(t)\right)}{\sqrt{\mathrm{Var}\left(X_1(t+\tau_n)\right)\mathrm{Var}\left(X_2(t)\right)}},\qquad n = 5, 6, 7$$
where Cov(·) represents covariance calculation of two frames of data, Var(·) represents variance calculation, τ_n is the delay corresponding to the n-th angle, n = 1, 2, 3 and 4 represent the sound source incidence conditions at angles of 0 degrees, 30 degrees, 60 degrees and 90 degrees, respectively, and n = 5, 6 and 7 represent the sound source incidence conditions at angles of 120 degrees, 150 degrees and 180 degrees, respectively.
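Steps S1 to S3 can be summarized in the following sketch of the 7-dimensional correlation feature; the two-frame buffer indexing follows step S2, while the exact alignment convention for negative delays is an assumption.

```python
import numpy as np

def corr_coeff(a, b):
    """Pearson correlation coefficient: Cov(a, b) / sqrt(Var(a) * Var(b))."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b) + 1e-12))

def seven_angle_corr(x1, x2, delays, frame=220):
    """x1, x2: 440-point buffers (frames t-1 and t); returns the 7-dim feature."""
    feats = []
    for d in delays:                # delays as computed above, e.g. 19, 16, ...
        if d >= 0:                  # X1(t) as reference: shift X2 right by d
            feats.append(corr_coeff(x1[:frame], x2[d:d + frame]))
        else:                       # X2(t) as reference: shift X1 right by |d|
            feats.append(corr_coeff(x1[-d:-d + frame], x2[:frame]))
    return np.array(feats)
```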
In the statistical judgment step, the statistical judgment module 6 performs count statistics and decision on the classification results of the angles of the 200 frames within 1 s. For example, if in the neural network classification results within a certain second the 0-degree angle occurs 10 times, the 30-degree angle occurs 170 times, the 60-degree angle occurs 20 times, and the 90-degree, 120-degree, 150-degree and 180-degree angles occur 0 times, then the angle with the largest number of occurrences is extracted, i.e. 30 degrees is output as the direction finding angle for that second. The count values are then set to zero and a new round of counting starts. The statistical judgment module 6 in effect takes a statistic of the neural network output within each second and updates the direction finding angle once per second, which ensures real-time tracking of the sound source by the system and effectively improves the output stability and anti-interference capability of the whole sound source direction finding system.
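The per-second decision amounts to a majority vote over the roughly 200 per-frame classifications, as in this minimal sketch:

```python
from collections import Counter

def per_second_decision(frame_angles):
    """Return the angle that occurs most often among one second of frame results."""
    return Counter(frame_angles).most_common(1)[0][0]

# Example from the text: 10 frames at 0, 170 at 30, 20 at 60 -> 30 is output.
votes = [0] * 10 + [30] * 170 + [60] * 20
print(per_second_decision(votes))  # 30
```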
In the neural network sound source direction finding method based on the double microphones, dual-microphone hardware with a spacing of 15 cm is first used to record 10 hours of audio for each sound source angle, giving 70 hours of audio data in total; to ensure good generalization capability of the model, the distance between the sound source at each angle and the microphones is varied. Meanwhile, in order to improve the robustness of the model to noise interference, white noise at a 10 dB signal-to-noise ratio is randomly added to the recorded audio to construct the training data set. Then the time domain sampling data and correlation coefficient features of the two signals are extracted, 10% of all training data are set aside as a validation set, model parameters are optimized with the back propagation algorithm, and the model is saved when the loss on the training set and validation set is lowest, yielding a neural network model with sound source direction finding capability.
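The 10 dB white-noise augmentation used to build the training set can be sketched as follows; the helper name and the use of Gaussian white noise are assumptions.

```python
import numpy as np

def add_white_noise(clean, snr_db=10.0):
    """Mix white noise into a clean recording at the given SNR in dB (hypothetical helper)."""
    noise = np.random.randn(*clean.shape)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (10 ** (snr_db / 10.0) * p_noise))  # noise gain for target SNR
    return clean + scale * noise
```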
The invention also discloses a neural network sound source direction finding system based on the double microphones, which comprises the following components: a memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the neural network sound source direction finding method of the present invention when invoked by the processor.
The invention has the beneficial effects that: 1. the neural network sound source direction finding method adopts only two microphones to realize sound source detection, thereby reducing the cost and power consumption of a voice product; 2. the neural network sound source direction finding method detects the sound source direction with a neural network, which gives higher accuracy and stronger anti-interference capability and realizes real-time tracking of the sound source; 3. the neural network sound source direction finding method combines the correlation characteristics of the two paths of signals, which simplifies the design of the neural network model and reduces the operation complexity of the algorithm.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A neural network sound source direction finding method based on two microphones is characterized by comprising the following steps:
acquiring a two-path sampling signal: two microphones are adopted to collect time domain data of two paths of sound source signals, and the collected time domain data are simultaneously transmitted to a neural network analysis module and a correlation characteristic extraction module;
an output characteristic obtaining step: receiving time domain data collected in the acquisition step of the two-way sampling signal by adopting a neural network analysis module, wherein the neural network analysis module comprises an input layer, a hidden layer 1, a hidden layer 2, a hidden layer 3 and an output layer, analyzing the time sequence relation between the two-way sound source signals by the neural network analysis module through a recurrent neural layer by using the received time domain data, and finally obtaining the processed output characteristics at the hidden layer 2;
a correlation characteristic extraction step: a correlation characteristic extraction module is adopted to receive time domain data collected in the acquisition step of the two paths of sampling signals, then the correlation characteristic extraction module calculates correlation coefficients of multiple angles of the two paths of sound source signals in the received time domain data, and the correlation coefficients are cascaded with output characteristics of a hidden layer 2 of a neural network analysis module;
and (3) subsequent processing steps: the cascaded output characteristics are sent to a hidden layer 3 of a neural network analysis module for subsequent processing, and an angle classification result is given out on an output layer of the neural network analysis module.
2. The neural network sound source direction finding method according to claim 1, further comprising performing the following steps after performing the subsequent processing:
a statistical judgment step: counting the classification results of the angles of all frames within a set period by adopting a counting and judging module, and finally outputting the angle value with the highest count as the direction finding result for each second.
3. The neural network sound source direction finding method according to claim 1, further comprising, in the output feature acquiring step, performing the steps of:
step 1: sampling data is processed in frames by adopting a sampling frequency of 44kHz, the frame length of each frame is 5ms, and each frame has 220 sampling points;
step 2: the input layer inputs two paths of sampling data each time; the hidden layer 1 adopts an LSTM neural network structure with a memory effect on a time sequence; the hidden layer 2, the hidden layer 3 and the output layer adopt a full-connection layer structure.
4. The method as claimed in claim 3, wherein in step 2, the input layer inputs 440-dimensional data together, the hidden layer 1 uses 256 LSTM neurons together, the hidden layer 2 uses 128 neurons, the hidden layer 3 uses 64 neurons, and the output layer uses 7 neurons.
5. The neural network sound source direction finding method according to claim 3, wherein in the step 2, the operation principle of the LSTM neural network structure is as follows:
the LSTM unit takes the feature x_n of the current frame, the previously retained output result h_{n-1} and the retained state C_{n-1} of the last frame as joint inputs, processes them to generate the output h_n of the current frame and the output state C_n of the current frame, and repeats this recursive operation to capture the timing relationship between the signals; the current-frame output h_n generated by each operation is passed on to the following hidden layers to perform the subsequent operations.
6. The neural network sound source direction finding method according to claim 3, wherein the output layer adopts a multi-classification Softmax function as an objective function of the model and takes cross entropy as a loss function of the training model, and the calculation formula is as follows:
$$L = -\sum_{i=1}^{m} T_i \log(P_i) \qquad (8)$$

wherein T_i represents the real classification label of the training data and P_i denotes the Softmax output probability of the i-th class, and since 7 direction finding angles are output, the value of m in the calculation formula (8) is 7.
7. The method according to claim 1, wherein in the step of extracting the correlation characteristics, the correlation characteristics extraction module calculates correlation coefficients of 7 angles of two sound source signals in the received time domain data, and the 7 angles are 0 °, 30 °, 60 °, 90 °, 120 °, 150 °, and 180 °, respectively.
8. The neural network sound source direction finding method according to claim 7, wherein the correlation coefficient of the 7 angles is calculated as follows:
step S1: calculating the time difference of the two paths of sound source signals reaching the two microphones; the method comprises the following specific steps:
the distance d between the two microphones is 15cm, and the time difference of the two sound source signals reaching the two microphones is calculated according to the following formula:
$$\tau = \frac{d\cos\theta}{v_{sound}}$$

where θ is the angle of the sound source, ranging from 0 to 180 degrees, and v_sound represents the propagation speed of sound, which is taken as 340 m/s;
step S2: supposing that the signal received by one microphone is X1(t) and the signal received by the other microphone is X2(t), when the correlation characteristic extraction module performs processing, two frames of data are buffered and respectively denoted the t-1 frame and the t frame; only the correlation coefficients of the t-1 frame are calculated each time, and after the calculation is finished, the t frame data are moved forward by 220 points to become the new t-1 frame, while positions 221-440 are filled with newly sampled data as the new t frame data;
step S3: the correlation characteristic extraction module calculates correlation coefficients from delays corresponding to different angles to obtain 7-dimensional correlation coefficients, specifically:
taking X1(t) as a reference, shifting X2(t) to the right to realize alignment, and calculating the correlation coefficient as follows:
$$R_n = \frac{\mathrm{Cov}\left(X_1(t),\,X_2(t+\tau_n)\right)}{\sqrt{\mathrm{Var}\left(X_1(t)\right)\mathrm{Var}\left(X_2(t+\tau_n)\right)}},\qquad n = 1, 2, 3, 4$$
taking X2(t) as a reference, shifting X1(t) to the right to realize alignment, and calculating the correlation coefficient as follows:
$$R_n = \frac{\mathrm{Cov}\left(X_1(t+\tau_n),\,X_2(t)\right)}{\sqrt{\mathrm{Var}\left(X_1(t+\tau_n)\right)\mathrm{Var}\left(X_2(t)\right)}},\qquad n = 5, 6, 7$$
where Cov(·) represents covariance calculation of two frames of data, Var(·) represents variance calculation, τ_n is the delay corresponding to the n-th angle, n = 1, 2, 3 and 4 represent the sound source incidence conditions at angles of 0 degrees, 30 degrees, 60 degrees and 90 degrees, respectively, and n = 5, 6 and 7 represent the sound source incidence conditions at angles of 120 degrees, 150 degrees and 180 degrees, respectively.
9. The neural network sound source direction finding method according to claim 2, wherein in the statistical determination step, the statistical determination module performs count statistics on the classification results of the angles of 200 frames within 1 s.
10. A neural network sound source direction finding system based on two microphones is characterized by comprising: a memory, a processor and a computer program stored on the memory, the computer program being configured to carry out the steps of the neural network sound source direction finding method of any one of claims 1-9 when invoked by the processor.
CN202010871213.7A 2020-08-26 2020-08-26 Neural network sound source direction finding method and system based on double microphones Pending CN112034424A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010871213.7A CN112034424A (en) 2020-08-26 2020-08-26 Neural network sound source direction finding method and system based on double microphones

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010871213.7A CN112034424A (en) 2020-08-26 2020-08-26 Neural network sound source direction finding method and system based on double microphones

Publications (1)

Publication Number Publication Date
CN112034424A 2020-12-04

Family

ID=73581848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010871213.7A Pending CN112034424A (en) 2020-08-26 2020-08-26 Neural network sound source direction finding method and system based on double microphones

Country Status (1)

Country Link
CN (1) CN112034424A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113311391A (en) * 2021-04-25 2021-08-27 普联国际有限公司 Sound source positioning method, device and equipment based on microphone array and storage medium



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination