CN111161757B - Sound source positioning method and device, readable storage medium and electronic equipment

Sound source positioning method and device, readable storage medium and electronic equipment

Info

Publication number
CN111161757B
CN111161757B (application CN201911373874.0A)
Authority
CN
China
Prior art keywords
sound source
audio
frame
audio signals
audio signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911373874.0A
Other languages
Chinese (zh)
Other versions
CN111161757A (en)
Inventor
莫凡
孙珏
刘士杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mgjia Beijing Technology Co ltd
Original Assignee
Mgjia Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mgjia Beijing Technology Co ltd filed Critical Mgjia Beijing Technology Co ltd
Priority to CN201911373874.0A priority Critical patent/CN111161757B/en
Publication of CN111161757A publication Critical patent/CN111161757A/en
Application granted granted Critical
Publication of CN111161757B publication Critical patent/CN111161757B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/24 Position of single direction-finder fixed by determining direction of a plurality of spaced sources of known location
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band

Abstract

The disclosure relates to a sound source positioning method, a sound source positioning device, a readable storage medium, and an electronic device. The method comprises the following steps: acquiring target audio signals from N microphones, wherein each microphone is arranged at a different position and N is an integer greater than or equal to 3; extracting multi-dimensional audio features from the N target audio signals; and determining the sound source position of the target audio according to the multi-dimensional audio features and a pre-trained sound source localization model. Determining the sound source position of the target audio based on a sound source localization model improves the accuracy of the determination, and determining it from multi-dimensional audio features, rather than from the phase difference alone as in the prior art, further improves both the accuracy and the robustness of the determined sound source position.

Description

Sound source positioning method and device, readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of positioning technologies, and in particular, to a sound source positioning method, a sound source positioning device, a readable storage medium, and an electronic device.
Background
Sound source localization refers to the process of determining the position of a sound-emitting object by means of an auditory (acoustic) system; through sound source localization, the position of the sound-emitting object can be accurately identified. Most existing sound source localization methods arrange a plurality of microphones at different positions in space in a certain pattern, process the audio signals received by the microphones, and finally compute the position of the sound source.
In the conventional sound source localization method, the phase differences among at least three audio signals arriving at the microphones (i.e., the time differences between the audio signals reaching different microphones) must be calculated, and the coordinates of the intersection point of the hyperbolas derived from these phase differences are then calculated to complete the localization. However, because each phase difference is obtained from the cross-correlation between the audio signals collected by two microphones, and the cross-correlation differs from one sound source to another, the resulting localization can be inaccurate.
Disclosure of Invention
An object of the present disclosure is to provide a sound source localization method, apparatus, readable storage medium, and electronic device to improve accuracy and robustness of sound source localization.
In order to achieve the above object, a first aspect of the present disclosure provides a sound source localization method, including:
acquiring target audio signals from N microphones, wherein each microphone is arranged at a different position, and N is an integer greater than or equal to 3;
extracting multi-dimensional audio features from the N target audio signals;
and determining the sound source position of the target audio according to the multi-dimensional audio features and a pre-trained sound source positioning model.
Optionally, the determining the sound source position of the target audio according to the multi-dimensional audio features and a pre-trained sound source localization model includes:
inputting the multi-dimensional audio features into a pre-trained sound source localization model to obtain a first position probability for each position, wherein the number of positions is greater than or equal to 2 and the first position probability is used for representing the probability that the corresponding position is the sound source position of the target audio;
and determining, among the plurality of first position probabilities, the position corresponding to the largest first position probability as the sound source position of the target audio.
Optionally, the extracting multi-dimensional audio features from the N target audio signals includes:
for each target audio signal, dividing the target audio signal into M frames of audio signals;
and extracting multi-dimensional audio features from the N M frames of audio signals.
Optionally, the extracting multi-dimensional audio features from the N M-frame audio signals includes:
determining the energy value of each frame of audio signal in each target audio signal;
for the same frame of audio signals, the following steps are performed:
determining the phase difference of the frame of audio signals between every two target audio signals, and determining the energy difference of the frame of audio signals between every two target audio signals according to the energy values of the frame of audio signals in the two target audio signals;
and determining the audio characteristics of the frame of audio signals according to the phase difference, the energy value and the energy difference of the frame of audio signals.
Optionally, the inputting the multi-dimensional audio features into a pre-trained sound source localization model to obtain a first location probability for each location includes:
aiming at the same frame of audio signal, inputting the multi-dimensional audio features of the frame of audio signal into a pre-trained sound source positioning model to obtain a second position probability aiming at each position of the frame of audio signal output by the sound source positioning model;
and determining the first position probability of the target audio aiming at the position according to the M second position probabilities aiming at the same position.
Optionally, the sound source localization model is trained by:
acquiring sample audio signals generated by a plurality of sound sources from K microphones, wherein K is an integer greater than or equal to 3;
extracting multi-dimensional audio sample characteristics aiming at each sample audio signal, wherein the multi-dimensional audio sample characteristics are marked with the sound source position of the sample audio signal;
and training by taking the multi-dimensional audio sample characteristics as a model training sample to obtain the sound source positioning model.
A second aspect of the present disclosure provides a sound source localization apparatus, including:
a first acquisition module, used for acquiring target audio signals from N microphones, wherein each microphone is arranged at a different position and N is an integer greater than or equal to 3;
the first extraction module is used for extracting multi-dimensional audio features from the N target audio signals;
and the determining module is used for determining the sound source position of the target audio according to the multi-dimensional audio features and a pre-trained sound source positioning model.
Optionally, the determining module includes:
the input submodule is used for inputting the multi-dimensional audio features into a pre-trained sound source localization model to obtain a first position probability for each position, wherein the number of positions is greater than or equal to 2 and the first position probability is used for representing the probability that the corresponding position is the sound source position of the target audio;
and the determining submodule is used for determining the position corresponding to the maximum first position probability as the sound source position of the target audio in the plurality of first position probabilities.
Optionally, the first extraction module includes:
a dividing submodule, used for dividing, for each target audio signal, the target audio signal into M frames of audio signals;
and the extraction submodule is used for extracting multi-dimensional audio features from the N M frames of audio signals.
Optionally, the extracting sub-module is configured to determine the energy value of each frame of audio signal in each target audio signal and, for the same frame of audio signals, perform the following steps: determining the phase difference of the frame of audio signals between every two target audio signals, and determining the energy difference of the frame of audio signals between every two target audio signals according to the energy values of the frame of audio signals in the two target audio signals; and determining the audio features of the frame of audio signals according to the phase difference, the energy values, and the energy differences of the frame of audio signals.
Optionally, the input sub-module is configured to, for a same frame of audio signals, input the multidimensional audio features of the frame of audio signals to a pre-trained sound source localization model to obtain second position probabilities, for each position, of the frame of audio signals output by the sound source localization model; and determining the first position probability of the target audio aiming at the position according to the M second position probabilities aiming at the same position.
Optionally, the apparatus further comprises:
the second acquisition module is used for acquiring sample audio signals generated by a plurality of sound sources from K microphones, wherein K is an integer greater than or equal to 3;
the second extraction module is used for extracting multi-dimensional audio sample characteristics aiming at each sample audio signal, wherein the multi-dimensional audio sample characteristics are marked with the sound source position of the sample audio signal;
and the training module is used for performing training by taking the multi-dimensional audio sample features as model training samples to obtain the sound source localization model.
The third aspect of the present disclosure also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method provided by the first aspect of the present disclosure.
The fourth aspect of the present disclosure also provides an electronic device, including:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method provided by the first aspect of the present disclosure.
Through the above technical scheme, the target audio signals are acquired from the N microphones, the multi-dimensional audio features are extracted from the N target audio signals, and the sound source position of the target audio is determined according to the audio features and the pre-trained sound source localization model. Determining the sound source position of the target audio based on a sound source localization model improves the accuracy of the determination, and determining it from multi-dimensional audio features, rather than from the phase difference alone as in the prior art, further improves both the accuracy and the robustness of the determined sound source position.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
fig. 1 is a flow chart illustrating a sound source localization method according to an exemplary embodiment.
Fig. 2 is a schematic diagram illustrating a sound source and microphone according to an exemplary embodiment.
Fig. 3 is a flow chart illustrating a sound source localization method according to another exemplary embodiment.
Fig. 4 is a diagram illustrating division of a target audio signal into multiple frames of audio signals according to an exemplary embodiment.
Fig. 5 is a block diagram illustrating a sound source localization apparatus according to an exemplary embodiment.
FIG. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
Fig. 1 is a flow chart illustrating a sound source localization method according to an exemplary embodiment. As shown in fig. 1, the method may include steps 11 to 13.
In step 11, a target audio signal is obtained from N microphones, where each microphone is disposed at a different position, and N is an integer greater than or equal to 3.
It should be understood by those skilled in the art that if the target audio signals are acquired from only two microphones, only one phase difference can be determined, so the sound source can only be placed somewhere on one hyperbola and its specific position cannot be determined. If the target audio signals are acquired from three or more microphones, at least three phase differences, and hence at least three hyperbolas, can be determined; the three hyperbolas determine a unique intersection point, and that intersection point is the sound source position.
For example, as shown in fig. 2, assume the scene to which the present disclosure is applied includes 1 sound source and 4 microphones, denoted microphone 1, microphone 2, microphone 3, and microphone 4. Each microphone is arranged at a different position, and all 4 microphones can acquire the audio signal emitted by the sound source, so the electronic device performing the sound source localization method can acquire the target audio signal from the 4 microphones; the target audio signal is one of the audio signals emitted by the sound source in fig. 2. It should be noted that, in the present disclosure, the electronic device performing the sound source localization method may also select 3 of the 4 microphones in fig. 2 and acquire the target audio signals collected by those 3 microphones.
In addition, a preferred embodiment of the present disclosure has the N microphones located on different straight lines: when the microphones do not all lie on one line, the differences among the audio signals they acquire from the same sound source are more pronounced, and based on these pronounced differences the sound source position can be determined accurately.
In step 12, multi-dimensional audio features are extracted from the N target audio signals.
In the present disclosure, the sound source position may be determined in units of the entire target audio signal, or in units of each frame of audio signal within a target audio signal. If the entire target audio signal is taken as the unit, the audio features are the multi-dimensional audio features of the entire target audio signal; if each frame of audio signal is taken as the unit, the audio features are the multi-dimensional audio features of each frame. The present disclosure does not specifically limit this. Further, the multi-dimensional audio features may include, but are not limited to, phase differences, energy values, and energy differences.
In step 13, the sound source position of the target audio is determined according to the multi-dimensional audio features and the pre-trained sound source localization model.
Specifically, after the multi-dimensional audio features are obtained, they are first input into a pre-trained sound source localization model to obtain a first position probability for each position. The positions are preset, their number is greater than or equal to 2, and the first position probability represents the probability that the corresponding position is the sound source position of the target audio. For example, according to the seat distribution of a vehicle, a four-seat vehicle may be divided into 4 positions (the main driving position, the co-driving position, the rear left position, and the rear right position) and a five-seat vehicle into 5 positions (the main driving position, the co-driving position, the rear left position, the rear center position, and the rear right position), and so on. The present disclosure does not specifically limit the division of the positions.
Then, among the plurality of first position probabilities, a position corresponding to the largest first position probability is determined as a sound source position of the target audio.
When the entire target audio signal is input to a sound source localization model trained in advance, the first position probability for each position output by the sound source localization model can be obtained. If each frame of audio signal in a target audio signal is input to a pre-trained sound source positioning model as a unit, a second position probability of the frame of audio signal output by the sound source positioning model for each position can be obtained, and the electronic equipment executing the sound source positioning method determines a first position probability for each position based on the second position probability of each frame of audio signal for each position.
As described above, the first position probability for each position can be determined from the multi-dimensional audio features and the pre-trained sound source localization model. If the preset positions are the main driving position, the co-driving position, the rear left position, and the rear right position, a first position probability can be determined for each of these positions. Among these 4 first position probabilities, a larger value indicates a higher probability that the corresponding position is the sound source position of the target audio; therefore, in the present disclosure, the position corresponding to the largest first position probability is determined as the sound source position of the target audio.
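Illustratively, a minimal Python sketch of this selection step is given below; the position names and the placeholder model and features objects are assumptions for the example, not values fixed by the disclosure.

    import torch
    import torch.nn as nn

    positions = ["main driving", "co-driving", "rear left", "rear right"]

    # Placeholder stand-ins: `model` would be the pre-trained sound source
    # localization model (a training sketch appears later in this section) and
    # `features` the multi-dimensional audio feature vector of the entire
    # target audio (16-dimensional in the 4-microphone example used below).
    model = nn.Sequential(nn.Linear(16, 4))
    features = torch.randn(16)

    with torch.no_grad():
        logits = model(features.unsqueeze(0))           # shape (1, 4)
        first_probs = torch.softmax(logits, dim=-1)[0]  # first position probability per position

    # The position with the largest first position probability is taken as
    # the sound source position of the target audio.
    source_position = positions[int(torch.argmax(first_probs))]
    print(source_position)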
By adopting the above technical scheme, the target audio signals are acquired from the N microphones, the multi-dimensional audio features are extracted from the N target audio signals, and the sound source position of the target audio is determined according to the audio features and the pre-trained sound source localization model. Determining the sound source position of the target audio based on a sound source localization model improves the accuracy of the determination, and determining it from multi-dimensional audio features, rather than from the phase difference alone as in the prior art, further improves both the accuracy and the robustness of the determined sound source position.
Furthermore, the sound source localization model may be a DNN (Deep Neural Network) classification model comprising a plurality of layers, and it can be obtained by training in the following way:
first, sample audio signals generated by a plurality of sound sources are acquired from K microphones.
K may be the same as or different from N in step 11; when K is different from N, K is still an integer greater than or equal to 3. In the present disclosure, audio signals generated by different sound sources at different positions are collected in advance by the K microphones, and the electronic device performing the training process can acquire the audio signals generated by the plurality of sound sources from the K microphones and use them as sample audio signals.
It should be noted that, since the audio features are influenced by the voice, the environment, and noise, training the model with audio signals close to the actual usage scene effectively improves the precision of the trained sound source localization model, so that the sound source position can be determined accurately according to this higher-precision model. Thus, in a preferred embodiment, the sample audio signals generated by a plurality of sound sources in a specific scene are acquired from the K microphones. For example, if the position of a sound source inside a vehicle is to be determined while the vehicle is traveling, the specific scene may be a scene in which the vehicle is in a traveling state.
Then, for each sample audio signal, multi-dimensional audio sample features are extracted.
As described above, the sound source position of the target audio may be determined in units of the entire target audio signal or in units of each frame of audio signal within the target audio signal. Similarly, during training, the multi-dimensional audio sample features may be extracted from the entire sample audio signal as a unit, or from each frame of the sample audio signal as a unit. The present disclosure does not specifically limit this.
In addition, during model training, after the multi-dimensional audio sample features are extracted, the sound source position of the sample audio signal can be labeled on the multi-dimensional audio sample features. For example, if the multi-dimensional audio sample features are extracted from an audio signal generated by a sound source located at the main driving position, the sound source position labeled on those features is the main driving position. As another example, if the sound source localization model outputs a first or second position probability for each position, the label may set the first or second position probability for the main driving position to 1 and the first or second position probabilities for the co-driving, rear left, and rear right positions to 0.
And finally, training by taking the multi-dimensional audio sample characteristics as model training samples to obtain a sound source positioning model.
Illustratively, the deep neural network model can be trained through multi-dimensional audio sample characteristics to obtain a sound source localization model. In this way, when the audio features extracted in step 12 are input to the sound source localization model, the first position probability or the second position probability for each position output by the sound source localization model can be obtained.
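For instance, a minimal training sketch in Python with PyTorch follows; the layer sizes, the 16-dimensional input (matching the 4-microphone example later in this section), the 4 position classes, and all hyperparameters and placeholder data are assumptions, not values fixed by the disclosure.

    import torch
    import torch.nn as nn

    # Multi-layer DNN classifier: a per-sample multi-dimensional audio feature
    # vector in, one logit per preset position out.
    model = nn.Sequential(
        nn.Linear(16, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 4),  # main driving, co-driving, rear left, rear right
    )
    loss_fn = nn.CrossEntropyLoss()  # class-index labels, equivalent to the one-hot labeling above
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Placeholder training data: multi-dimensional audio sample features and
    # their labeled sound source positions (class indices 0-3)
    features = torch.randn(256, 16)
    labels = torch.randint(0, 4, (256,))

    for epoch in range(100):
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()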
The following is a detailed description of the above step 12 for extracting multi-dimensional audio features from the N target audio signals.
As shown in fig. 3, step 12 may specifically include step 121 and step 122.
In step 121, for each target audio signal, the target audio signal is divided into M frames of audio signals.
In this disclosure, for a target audio signal, the time lengths corresponding to each divided frame of audio signal may be the same or different.
For example, assuming the duration of the entire target audio signal is 40 milliseconds, the target audio signal may be divided into 4 frames of equal duration, i.e., each frame lasts 10 milliseconds. Alternatively, the target audio signal may be divided into 4 frames of different durations; for example, the first frame may last 20 milliseconds, the second frame 10 milliseconds, the third frame 5 milliseconds, and the fourth frame 5 milliseconds.
In the present disclosure, the same division rule is applied to all N target audio signals. For example, if one target audio signal is divided into 4 frames of equal duration, the other N-1 target audio signals are also divided into 4 frames of equal duration; if one target audio signal is divided into 4 frames of different durations, say 20, 10, 5, and 5 milliseconds, the other N-1 target audio signals are divided according to the same rule.
According to the above scheme, each of the N target audio signals is divided into M frames of audio signals, and the i-th frame of one target audio signal and the i-th frames of the other target audio signals are the same frame of audio signal, where i ranges over [1, M].
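Illustratively, a minimal Python sketch of the equal-duration framing described above; the 40 ms duration, 16 kHz sampling rate, and M = 4 are assumed values taken from the examples in this section.

    import numpy as np

    def split_into_frames(signal: np.ndarray, num_frames: int) -> list:
        """Split one target audio signal into M frames of equal duration.
        The same rule must be applied to all N target audio signals so that
        the i-th frame of every signal covers the same time span."""
        return np.array_split(signal, num_frames)

    # A 40 ms signal sampled at 16 kHz has 640 samples; M = 4 gives four
    # 10 ms frames of 160 samples each.
    signal = np.random.randn(640)
    frames = split_into_frames(signal, num_frames=4)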
In step 122, multi-dimensional audio features are extracted from the N M-frame audio signals.
Specifically, the way of extracting the multi-dimensional audio features is as follows:
first, in each target audio signal, an energy value of each frame audio signal is determined.
In the present disclosure, the energy value of an audio signal is the sum of the squares of the amplitudes, at each time instant in the time domain, of the audio signal collected by the microphone. Illustratively, at a sampling rate of 16000 Hz, a frame of audio signal with a duration of 10 milliseconds contains 160 sampling points, and the energy value of the frame is obtained by summing the squares of the amplitudes at these sampling points. In this way, for each target audio signal, the energy value of each frame of that target audio signal can be calculated.
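A minimal sketch of the per-frame energy computation, under the same assumed 16 kHz sampling rate:

    import numpy as np

    def frame_energy(frame: np.ndarray) -> float:
        """Energy value of one frame: the sum of the squared amplitudes of
        its time-domain samples."""
        return float(np.sum(frame ** 2))

    # A 10 ms frame at 16 kHz has 160 samples (random placeholder data here)
    frame = np.random.randn(160)
    print(frame_energy(frame))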
Then, for the same frame of audio signal, the phase difference of that frame between every two target audio signals is determined, and the energy difference of that frame between every two target audio signals is determined from the frame's energy values in the two target audio signals.
It should be noted that the energy of an audio signal attenuates differently depending on the distance it travels to each microphone: in theory, the closer a microphone is, the smaller the attenuation and the larger the energy value of the audio signal it collects. The distance from the sound source to each microphone can therefore be inferred from the energy values, and the energy differences further highlight this attenuation information. Thus, in the present disclosure, the multi-dimensional audio features may include phase differences, energy values, and energy differences.
As described above, the i-th frame of each target audio signal and the i-th frames of the other target audio signals are the same frame of audio signal. For the i-th frame, the phase difference of that frame is determined between every two target audio signals, and the energy difference of that frame between every two target audio signals is determined from the energy value of the i-th frame in each target audio signal.
For example, as shown in fig. 4, assume N is 4 and M is 3, so that the target audio signals are a first, a second, a third, and a fourth target audio signal, each comprising a first, a second, and a third frame of audio signal. For the first frame, the energy value of that frame in each target audio signal can be calculated, 6 energy differences can be computed from the 4 energy values, and the following 6 phase differences can be computed by the cross-correlation method: the phase difference of the first frame between the first and second target audio signals, between the first and third, between the first and fourth, between the second and third, between the second and fourth, and between the third and fourth target audio signals.
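A minimal sketch of this pairwise computation for the i-th frame follows; the cross-correlation peak is one common way to estimate the phase (time) difference, and the placeholder data, function names, and 16 kHz sampling rate are assumptions for the example.

    import numpy as np
    from itertools import combinations

    def phase_difference(frame_a: np.ndarray, frame_b: np.ndarray, fs: int = 16000) -> float:
        """Time delay between the same frame as captured by two microphones,
        estimated from the peak of the cross-correlation; positive values mean
        frame_a is delayed relative to frame_b (NumPy's convention)."""
        corr = np.correlate(frame_a, frame_b, mode="full")
        lag = int(np.argmax(corr)) - (len(frame_b) - 1)
        return lag / fs

    # The i-th frame as captured by each of the 4 microphones, with its
    # energy values (random placeholders for the example)
    frames_i = [np.random.randn(160) for _ in range(4)]
    energies_i = [float(np.sum(f ** 2)) for f in frames_i]

    # One phase difference and one energy difference per microphone pair:
    # C(4, 2) = 6 values each
    phase_diffs = [phase_difference(frames_i[a], frames_i[b])
                   for a, b in combinations(range(4), 2)]
    energy_diffs = [energies_i[a] - energies_i[b]
                    for a, b in combinations(range(4), 2)]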
And finally, determining the audio characteristics of the frame of audio signals according to the calculated phase difference, energy value and energy difference of the frame of audio signals.
In this disclosure, all the phase differences, energy values, and energy differences of the frame of audio signal calculated as above may be spliced to form the audio feature of the frame of audio signal, or a first preset number (greater than or equal to 3) of phase differences may be selected from all the phase differences of the frame of audio signal, a second preset number of energy values may be selected from all the energy values, and a third preset number of energy differences may be selected from all the energy differences to be spliced to form the audio feature of the frame of audio signal. The first preset number, the second preset number and the third preset number may be the same or different, and the disclosure does not specifically limit this.
Illustratively, taking the above example as an example, the 4 energy values, 6 energy differences, and 6 phase differences may be spliced to form the multi-dimensional audio feature of the frame audio signal. Or 3, 4 or 5 values can be respectively selected from the 6 energy differences and/or the 6 phase differences, and the 4 energy values are spliced to form the multi-dimensional audio feature of the frame audio signal.
It should be noted that before the audio features of the frame of audio signal are formed by splicing, normalization processing may be performed on the calculated energy values, and then, an energy difference is calculated according to the normalized energy values, and the calculated phase difference, the normalized energy values, and the energy difference calculated by using the normalized energy values are spliced to form the audio features of the frame of audio signal.
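A minimal sketch of the splicing with normalization follows; dividing by the total energy is only one possible normalization scheme, which the disclosure does not fix, and the function name is an assumption.

    import numpy as np
    from itertools import combinations

    def frame_feature_vector(energies, phase_diffs) -> np.ndarray:
        """Splice the multi-dimensional audio feature of one frame: normalized
        energy values, energy differences computed from the normalized
        energies, and the phase differences."""
        e = np.asarray(energies, dtype=np.float64)
        e_norm = e / (e.sum() + 1e-12)  # assumed normalization scheme
        e_diffs = [e_norm[a] - e_norm[b] for a, b in combinations(range(len(e)), 2)]
        return np.concatenate([e_norm, e_diffs, phase_diffs])

    # With N = 4 microphones: 4 energies + 6 energy differences + 6 phase
    # differences give a 16-dimensional feature vector for the frame.
    feature = frame_feature_vector([1.0, 0.8, 0.5, 0.3], [0.0] * 6)
    print(feature.shape)  # (16,)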
In the above manner, the audio characteristics of each frame of audio signal can be acquired.
After the audio features of each frame of audio signal are acquired, for the same frame of audio signals, the multi-dimensional audio features of the frame are input into the pre-trained sound source localization model to obtain the second position probabilities, for each position, of the frame output by the sound source localization model.
Illustratively, the multi-dimensional audio features of the i-th frame of audio signal are input into the pre-trained sound source localization model, which outputs the second position probability of the i-th frame for each position. For example, suppose the in-vehicle positions are divided in advance into the main driving position, the co-driving position, the rear left position, and the rear right position; the second position probability of the i-th frame for the main driving position is denoted P_i1, that for the co-driving position P_i2, that for the rear left position P_i3, and that for the rear right position P_i4. In this way, a second position probability for each position can be obtained for every frame of audio signal.
And determining the first position probability of the target audio aiming at the position according to the M second position probabilities aiming at the same position.
By acquiring the second position probabilities for each position of every frame in the above manner, M second position probabilities are obtained for each position. Illustratively, as shown in fig. 4, 3 second position probabilities are obtained for each of the main driving, co-driving, rear left, and rear right positions. Thereafter, to obtain the sound source position of the target audio signal, the 3 second position probabilities for the main driving position are averaged to obtain the first position probability for the main driving position, and the same is done for the co-driving, rear left, and rear right positions to obtain their respective first position probabilities.
After the first position probabilities for the respective positions are acquired, the position corresponding to the maximum first position probability is determined as the sound source position of the target audio.
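Illustratively, a minimal Python sketch of this frame-level aggregation; the position names and the placeholder probabilities are assumptions for the example.

    import numpy as np

    positions = ["main driving", "co-driving", "rear left", "rear right"]

    def localize(second_probs: np.ndarray) -> str:
        """second_probs has shape (M, P): one second position probability per
        frame (M frames) and preset position (P positions). Averaging over the
        frames gives the first position probabilities; the position with the
        largest first position probability is the sound source position."""
        first_probs = second_probs.mean(axis=0)
        return positions[int(np.argmax(first_probs))]

    # Example with M = 3 frames and P = 4 positions, as in fig. 4
    second_probs = np.array([[0.7, 0.1, 0.1, 0.1],
                             [0.6, 0.2, 0.1, 0.1],
                             [0.5, 0.2, 0.2, 0.1]])
    print(localize(second_probs))  # -> main driving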
With the above technical scheme, when the position of the sound source is determined, the energy values and energy differences of the audio arriving at the microphones are taken into account in addition to the phase differences, so the sound source position can be determined more accurately.
Based on the same inventive concept, the present disclosure also provides a sound source positioning device. Fig. 5 is a block diagram illustrating a sound source localization apparatus according to an exemplary embodiment. The apparatus may include:
a first obtaining module 51, configured to obtain a target audio signal from N microphones, where each of the microphones is disposed at a different position, and N is an integer greater than or equal to 3;
a first extraction module 52, configured to extract multi-dimensional audio features from the N target audio signals;
and the determining module 53 is configured to determine a sound source position of the target audio according to the multi-dimensional audio features and a pre-trained sound source positioning model.
Optionally, the determining module 53 may include:
the input submodule is used for inputting the multi-dimensional audio features into a pre-trained sound source localization model to obtain a first position probability for each position, wherein the number of positions is greater than or equal to 2 and the first position probability is used for representing the probability that the corresponding position is the sound source position of the target audio;
and the determining submodule is used for determining the position corresponding to the maximum first position probability as the sound source position of the target audio in the plurality of first position probabilities.
Optionally, the first extraction module 52 may include:
a dividing submodule, used for dividing, for each target audio signal, the target audio signal into M frames of audio signals;
and the extraction submodule is used for extracting multi-dimensional audio features from the N M frames of audio signals.
Optionally, the extracting sub-module may be configured to determine the energy value of each frame of audio signal in each target audio signal and, for the same frame of audio signals, perform the following steps: determining the phase difference of the frame of audio signals between every two target audio signals, and determining the energy difference of the frame of audio signals between every two target audio signals according to the energy values of the frame of audio signals in the two target audio signals; and determining the audio features of the frame of audio signals according to the phase difference, the energy values, and the energy differences of the frame of audio signals.
Optionally, the input sub-module may be configured to, for the same frame of audio signal, input the multidimensional audio features of the frame of audio signal to a pre-trained sound source localization model to obtain a second position probability, for each position, of the frame of audio signal output by the sound source localization model; and determining the first position probability of the target audio aiming at the position according to the M second position probabilities aiming at the same position.
Optionally, the apparatus may further include:
the second acquisition module is used for acquiring sample audio signals generated by a plurality of sound sources from K microphones, wherein K is an integer greater than or equal to 3;
the second extraction module is used for extracting multi-dimensional audio sample characteristics aiming at each sample audio signal, wherein the multi-dimensional audio sample characteristics are marked with the sound source position of the sample audio signal;
and the training module is used for performing training by taking the multi-dimensional audio sample features as model training samples to obtain the sound source localization model.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
FIG. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment. As shown in fig. 6, the electronic device 600 may include: a processor 601 and a memory 602. The electronic device 600 may also include one or more of a multimedia component 603, an input/output (I/O) interface 604, and a communications component 605.
The processor 601 is configured to control the overall operation of the electronic device 600, so as to complete all or part of the steps of the sound source localization method. The memory 602 is used to store various types of data to support operation at the electronic device 600, such as instructions for any application or method operating on the electronic device 600 and application-related data, such as contact data, transmitted and received messages, pictures, audio, video, and so forth. The memory 602 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. The multimedia component 603 may include a screen and an audio component, where the screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals; the received audio signal may further be stored in the memory 602 or transmitted through the communication component 605. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 604 provides an interface between the processor 601 and other interface modules, such as a keyboard, a mouse, or buttons, where the buttons may be virtual or physical. The communication component 605 is used for wired or wireless communication between the electronic device 600 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, NB-IoT, eMTC, 5G, or a combination of one or more of them, which is not limited herein; accordingly, the communication component 605 may include a Wi-Fi module, a Bluetooth module, an NFC module, and the like.
In an exemplary embodiment, the electronic Device 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the sound source localization method described above.
In another exemplary embodiment, a computer readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the sound source localization method described above is also provided. For example, the computer readable storage medium may be the above-described memory 602 comprising program instructions executable by the processor 601 of the electronic device 600 to perform the above-described sound source localization method.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the sound source localization method described above when executed by the programmable apparatus.
The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.
It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, various possible combinations will not be separately described in this disclosure.
In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims (5)

1. A sound source localization method, comprising:
acquiring target audio signals from N microphones, wherein each microphone is arranged at a different position, the N microphones are located on different straight lines, and N is an integer greater than or equal to 3;
for each target audio signal, dividing the target audio signal into M frames of audio signals;
determining the energy value of each frame of audio signal in each target audio signal;
for the same frame of audio signals, the following steps are performed:
determining the phase difference of the frame of audio signals between every two target audio signals, and determining the energy difference of the frame of audio signals between every two target audio signals according to the energy values of the frame of audio signals in the two target audio signals;
determining the audio characteristics of the frame of audio signals according to the phase difference, the energy value and the energy difference of the frame of audio signals;
aiming at the same frame of audio signal, inputting the multi-dimensional audio features of the frame of audio signal into a pre-trained sound source positioning model to obtain a second position probability aiming at each position of the frame of audio signal output by the sound source positioning model;
determining a first position probability of the target audio for the position according to the M second position probabilities for the same position, wherein the number of positions is greater than or equal to 2 and the first position probability is used for representing the probability that the corresponding position is the sound source position of the target audio;
and determining the position corresponding to the maximum first position probability as the sound source position of the target audio in the plurality of first position probabilities.
2. The method of claim 1, wherein the sound source localization model is trained by:
acquiring sample audio signals generated by a plurality of sound sources from K microphones, wherein K is an integer greater than or equal to 3;
extracting multi-dimensional audio sample characteristics aiming at each sample audio signal, wherein the multi-dimensional audio sample characteristics are marked with the sound source position of the sample audio signal;
and training by taking the multi-dimensional audio sample characteristics as a model training sample to obtain the sound source positioning model.
3. A sound source localization apparatus, comprising:
the first acquisition module is used for acquiring target audio signals from N microphones, wherein each microphone is arranged at a different position, the N microphones are located on different straight lines, and N is an integer greater than or equal to 3;
a dividing submodule, used for dividing, for each target audio signal, the target audio signal into M frames of audio signals;
an extraction submodule, used for determining the energy value of each frame of audio signal in each target audio signal and, for the same frame of audio signals, performing the following steps: determining the phase difference of the frame of audio signals between every two target audio signals, and determining the energy difference of the frame of audio signals between every two target audio signals according to the energy values of the frame of audio signals in the two target audio signals; and determining the audio features of the frame of audio signals according to the phase difference, the energy values, and the energy differences of the frame of audio signals;
an input submodule, used for inputting, for the same frame of audio signals, the multi-dimensional audio features of the frame of audio signals into a pre-trained sound source localization model to obtain second position probabilities, for each position, of the frame of audio signals output by the sound source localization model, and for determining a first position probability of the target audio for the position according to the M second position probabilities for the same position, wherein the number of positions is greater than or equal to 2 and the first position probability is used for representing the probability that the corresponding position is the sound source position of the target audio;
and the determining submodule is used for determining the position corresponding to the maximum first position probability as the sound source position of the target audio in the plurality of first position probabilities.
4. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1-2.
5. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1-2.
CN201911373874.0A 2019-12-27 2019-12-27 Sound source positioning method and device, readable storage medium and electronic equipment Active CN111161757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911373874.0A CN111161757B (en) 2019-12-27 2019-12-27 Sound source positioning method and device, readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911373874.0A CN111161757B (en) 2019-12-27 2019-12-27 Sound source positioning method and device, readable storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111161757A CN111161757A (en) 2020-05-15
CN111161757B true CN111161757B (en) 2021-09-03

Family

ID=70558281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911373874.0A Active CN111161757B (en) 2019-12-27 2019-12-27 Sound source positioning method and device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111161757B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2530484A1 (en) * 2011-06-01 2012-12-05 Dolby Laboratories Licensing Corporation Sound source localization apparatus and method
CN109782231A (en) * 2019-01-17 2019-05-21 北京大学 A kind of end-to-end sound localization method and system based on multi-task learning
CN110068795A (en) * 2019-03-31 2019-07-30 天津大学 A kind of indoor microphone array sound localization method based on convolutional neural networks
CN110517705A (en) * 2019-08-29 2019-11-29 北京大学深圳研究生院 A kind of binaural sound sources localization method and system based on deep neural network and convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9788109B2 (en) * 2015-09-09 2017-10-10 Microsoft Technology Licensing, Llc Microphone placement for sound source direction estimation

Also Published As

Publication number Publication date
CN111161757A (en) 2020-05-15

Similar Documents

Publication Publication Date Title
EP3812969A1 (en) Neural network model compression method, corpus translation method and device
EP3825923A1 (en) Hypernetwork training method and device, electronic device and storage medium
CN112149740B (en) Target re-identification method and device, storage medium and equipment
CN109754793A (en) Device and method for recommending the function of vehicle
CN109657539B (en) Face value evaluation method and device, readable storage medium and electronic equipment
CN110543849B (en) Detector configuration method and device, electronic equipment and storage medium
CN111739539A (en) Method, device and storage medium for determining number of speakers
CN110321410B (en) Log extraction method and device, storage medium and electronic equipment
KR20180025634A (en) Voice recognition apparatus and method
CN110930984A (en) Voice processing method and device and electronic equipment
CN111667810B (en) Method and device for acquiring polyphone corpus, readable medium and electronic equipment
KR20180133645A (en) Method and apparatus for searching geographic information using interactive speech recognition
CN110334716B (en) Feature map processing method, image processing method and device
KR20170120645A (en) Method and device for determining interchannel time difference parameter
CN111161757B (en) Sound source positioning method and device, readable storage medium and electronic equipment
JP5949311B2 (en) Estimation program, estimation apparatus, and estimation method
CN110956128A (en) Method, apparatus, electronic device, and medium for generating lane line image
US20200046595A1 (en) Source-of-sound based navigation for a visually-impaired user
JP7340630B2 (en) Multi-speaker diarization of speech input using neural networks
US20230269515A1 (en) Earphone position adjustment method and apparatus, and equipment and storage medium
CN111857366B (en) Method and device for determining double-click action of earphone and earphone
US11482211B2 (en) Method and apparatus for outputting analysis abnormality information in spoken language understanding
Jeon et al. Acoustic surveillance of hazardous situations using nonnegative matrix factorization and hidden Markov model
Frisch et al. A Bayesian stochastic machine for sound source localization
CN114863943B (en) Self-adaptive positioning method and device for environmental noise source based on beam forming

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant