CN111161757B - Sound source positioning method and device, readable storage medium and electronic equipment

Sound source positioning method and device, readable storage medium and electronic equipment

Info

Publication number
CN111161757B
CN111161757B (application CN201911373874.0A)
Authority
CN
China
Prior art keywords
sound source
audio
frame
audio signals
audio signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911373874.0A
Other languages
Chinese (zh)
Other versions
CN111161757A (en)
Inventor
莫凡
孙珏
刘士杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mgjia Beijing Technology Co ltd
Original Assignee
Mgjia Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mgjia Beijing Technology Co ltd filed Critical Mgjia Beijing Technology Co ltd
Priority to CN201911373874.0A priority Critical patent/CN111161757B/en
Publication of CN111161757A publication Critical patent/CN111161757A/en
Application granted granted Critical
Publication of CN111161757B publication Critical patent/CN111161757B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/24 Position of single direction-finder fixed by determining direction of a plurality of spaced sources of known location
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band

Abstract

The disclosure relates to a sound source positioning method, a sound source positioning device, a readable storage medium, and an electronic device. The method comprises the following steps: acquiring target audio signals from N microphones, wherein each microphone is arranged at a different position and N is an integer greater than or equal to 3; extracting multi-dimensional audio features from the N target audio signals; and determining the sound source position of the target audio according to the multi-dimensional audio features and a pre-trained sound source localization model. Determining the sound source position of the target audio based on a sound source localization model improves the accuracy of the determination, and determining it from multi-dimensional audio features, rather than from the phase difference alone as in the prior art, further improves both the accuracy and the robustness of the determined sound source position.

Description

Sound source positioning method and device, readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of positioning technologies, and in particular, to a sound source positioning method, a sound source positioning device, a readable storage medium, and an electronic device.
Background
Sound source localization refers to the process of determining the position of a sound-emitting object by means of an auditory (acoustic) system; through sound source localization, the position of the sound-emitting object can be accurately identified. Most existing sound source localization methods arrange a plurality of microphones at different positions in space in a certain pattern, process the audio signals received by the microphones, and finally compute the position of the sound source.
In the conventional sound source localization method, the phase differences among at least three audio signals arriving at the microphones (i.e., the time differences between the audio signals reaching different microphones) must be calculated, and the coordinates of the intersection point of the hyperbolas derived from these phase differences are then calculated to complete the localization. However, because each phase difference is obtained from the cross-correlation between the audio signals collected by two microphones, and the cross-correlation differs from one sound source to another, the resulting localization can be inaccurate.
Disclosure of Invention
An object of the present disclosure is to provide a sound source localization method, apparatus, readable storage medium, and electronic device to improve accuracy and robustness of sound source localization.
In order to achieve the above object, a first aspect of the present disclosure provides a sound source localization method, including:
acquiring target audio signals from N microphones, wherein each microphone is arranged at a different position, and N is an integer greater than or equal to 3;
extracting multi-dimensional audio features from the N target audio signals;
and determining the sound source position of the target audio according to the multi-dimensional audio features and a pre-trained sound source positioning model.
Optionally, the determining the sound source position of the target audio according to the multi-dimensional audio features and a pre-trained sound source localization model includes:
inputting the multi-dimensional audio features into a pre-trained sound source localization model to obtain a first position probability for each position, wherein the number of positions is greater than or equal to 2 and the first position probability is used for representing the probability that the corresponding position is the sound source position of the target audio;
and determining, among the plurality of first position probabilities, the position corresponding to the largest first position probability as the sound source position of the target audio.
Optionally, the extracting multi-dimensional audio features from the N target audio signals includes:
for each target audio signal, dividing the target audio signal into M frames of audio signals;
and extracting multi-dimensional audio features from the N M frames of audio signals.
Optionally, the extracting multi-dimensional audio features from the N M-frame audio signals includes:
determining the energy value of each frame of audio signal in each target audio signal;
for the same frame of audio signals, the following steps are performed:
determining the phase difference of the frame of audio signals between every two target audio signals, and determining the energy difference of the frame of audio signals between every two target audio signals according to the energy values of the frame of audio signals in the two target audio signals;
and determining the audio characteristics of the frame of audio signals according to the phase difference, the energy value and the energy difference of the frame of audio signals.
Optionally, the inputting the multi-dimensional audio features into a pre-trained sound source localization model to obtain a first location probability for each location includes:
aiming at the same frame of audio signal, inputting the multi-dimensional audio features of the frame of audio signal into a pre-trained sound source positioning model to obtain a second position probability aiming at each position of the frame of audio signal output by the sound source positioning model;
and determining the first position probability of the target audio aiming at the position according to the M second position probabilities aiming at the same position.
Optionally, the sound source localization model is trained by:
acquiring sample audio signals generated by a plurality of sound sources from K microphones, wherein K is an integer greater than or equal to 3;
extracting multi-dimensional audio sample characteristics aiming at each sample audio signal, wherein the multi-dimensional audio sample characteristics are marked with the sound source position of the sample audio signal;
and training by taking the multi-dimensional audio sample characteristics as a model training sample to obtain the sound source positioning model.
A second aspect of the present disclosure provides a sound source localization apparatus, including:
a first acquisition module, used for acquiring target audio signals from N microphones, wherein each microphone is arranged at a different position and N is an integer greater than or equal to 3;
the first extraction module is used for extracting multi-dimensional audio features from the N target audio signals;
and the determining module is used for determining the sound source position of the target audio according to the multi-dimensional audio features and a pre-trained sound source positioning model.
Optionally, the determining module includes:
the input submodule is used for inputting the multi-dimensional audio features into a pre-trained sound source localization model to obtain a first position probability for each position, wherein the number of positions is greater than or equal to 2 and the first position probability is used for representing the probability that the corresponding position is the sound source position of the target audio;
and the determining submodule is used for determining the position corresponding to the maximum first position probability as the sound source position of the target audio in the plurality of first position probabilities.
Optionally, the first extraction module includes:
a dividing submodule, used for dividing, for each target audio signal, the target audio signal into M frames of audio signals;
and the extraction submodule is used for extracting multi-dimensional audio features from the N M frames of audio signals.
Optionally, the extracting sub-module is configured to determine the energy value of each frame of audio signal in each target audio signal and, for the same frame of audio signals, perform the following steps: determining the phase difference of the frame of audio signals between every two target audio signals, and determining the energy difference of the frame of audio signals between every two target audio signals according to the energy values of the frame of audio signals in the two target audio signals; and determining the audio features of the frame of audio signals according to the phase difference, the energy values, and the energy differences of the frame of audio signals.
Optionally, the input sub-module is configured to, for a same frame of audio signals, input the multidimensional audio features of the frame of audio signals to a pre-trained sound source localization model to obtain second position probabilities, for each position, of the frame of audio signals output by the sound source localization model; and determining the first position probability of the target audio aiming at the position according to the M second position probabilities aiming at the same position.
Optionally, the apparatus further comprises:
the second acquisition module is used for acquiring sample audio signals generated by a plurality of sound sources from K microphones, wherein K is an integer greater than or equal to 3;
the second extraction module is used for extracting multi-dimensional audio sample characteristics aiming at each sample audio signal, wherein the multi-dimensional audio sample characteristics are marked with the sound source position of the sample audio signal;
and the training module is used for performing training by taking the multi-dimensional audio sample features as model training samples to obtain the sound source localization model.
The third aspect of the present disclosure also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method provided by the first aspect of the present disclosure.
The fourth aspect of the present disclosure also provides an electronic device, including:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method provided by the first aspect of the present disclosure.
Through the above technical scheme, the target audio signals are acquired from the N microphones, the multi-dimensional audio features are extracted from the N target audio signals, and the sound source position of the target audio is determined according to the audio features and the pre-trained sound source localization model. Determining the sound source position of the target audio based on a sound source localization model improves the accuracy of the determination, and determining it from multi-dimensional audio features, rather than from the phase difference alone as in the prior art, further improves both the accuracy and the robustness of the determined sound source position.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
fig. 1 is a flow chart illustrating a sound source localization method according to an exemplary embodiment.
Fig. 2 is a schematic diagram illustrating a sound source and microphone according to an exemplary embodiment.
Fig. 3 is a flow chart illustrating a sound source localization method according to another exemplary embodiment.
Fig. 4 is a diagram illustrating division of a target audio signal into multiple frames of audio signals according to an exemplary embodiment.
Fig. 5 is a block diagram illustrating a sound source localization apparatus according to an exemplary embodiment.
FIG. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
Fig. 1 is a flow chart illustrating a sound source localization method according to an exemplary embodiment. As shown in fig. 1, the method may include steps 11 to 13.
In step 11, a target audio signal is obtained from N microphones, where each microphone is disposed at a different position, and N is an integer greater than or equal to 3.
It should be understood by those skilled in the art that if the target audio signals are acquired from only two microphones, only one phase difference can be determined, so the sound source can only be placed somewhere on one hyperbola and its specific position cannot be determined. If the target audio signals are acquired from three or more microphones, at least three phase differences, and hence at least three hyperbolas, can be determined; the three hyperbolas determine a unique intersection point, and that intersection point is the sound source position.
For example, as shown in fig. 2, assume the scene to which the present disclosure is applied includes 1 sound source and 4 microphones, denoted microphone 1, microphone 2, microphone 3, and microphone 4. Each microphone is arranged at a different position, and all 4 microphones can acquire the audio signal emitted by the sound source, so the electronic device performing the sound source localization method can acquire the target audio signal from the 4 microphones; the target audio signal is one of the audio signals emitted by the sound source in fig. 2. It should be noted that, in the present disclosure, the electronic device performing the sound source localization method may also select 3 of the 4 microphones in fig. 2 and acquire the target audio signals collected by those 3 microphones.
In addition, a preferred embodiment of the present disclosure has the N microphones located on different straight lines: when the microphones do not all lie on one line, the differences among the audio signals they acquire from the same sound source are more pronounced, and based on these pronounced differences the sound source position can be determined accurately.
In step 12, multi-dimensional audio features are extracted from the N target audio signals.
In the present disclosure, the sound source position may be determined in units of the entire target audio signal, or in units of each frame of audio signal within a target audio signal. If the entire target audio signal is taken as the unit, the audio features are the multi-dimensional audio features of the entire target audio signal; if each frame of audio signal is taken as the unit, the audio features are the multi-dimensional audio features of each frame. The present disclosure does not specifically limit this. Further, the multi-dimensional audio features may include, but are not limited to, phase differences, energy values, and energy differences.
In step 13, the sound source position of the target audio is determined according to the multi-dimensional audio features and the pre-trained sound source localization model.
Specifically, after the multi-dimensional audio features are obtained, they are first input into a pre-trained sound source localization model to obtain a first position probability for each position. The positions are preset, their number is greater than or equal to 2, and the first position probability represents the probability that the corresponding position is the sound source position of the target audio. For example, according to the seat distribution of a vehicle, a four-seat vehicle may be divided into 4 positions (the main driving position, the co-driving position, the rear left position, and the rear right position) and a five-seat vehicle into 5 positions (the main driving position, the co-driving position, the rear left position, the rear center position, and the rear right position), and so on. The present disclosure does not specifically limit the division of the positions.
Then, among the plurality of first position probabilities, a position corresponding to the largest first position probability is determined as a sound source position of the target audio.
When the entire target audio signal is input to a sound source localization model trained in advance, the first position probability for each position output by the sound source localization model can be obtained. If each frame of audio signal in a target audio signal is input to a pre-trained sound source positioning model as a unit, a second position probability of the frame of audio signal output by the sound source positioning model for each position can be obtained, and the electronic equipment executing the sound source positioning method determines a first position probability for each position based on the second position probability of each frame of audio signal for each position.
As described above, the first position probability for each position can be determined from the multi-dimensional audio features and the pre-trained sound source localization model. If the preset positions are the main driving position, the co-driving position, the rear left position, and the rear right position, a first position probability can be determined for each of these positions. Among these 4 first position probabilities, a larger value indicates a higher probability that the corresponding position is the sound source position of the target audio; therefore, in the present disclosure, the position corresponding to the largest first position probability is determined as the sound source position of the target audio.
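Illustratively, a minimal Python sketch of this selection step is given below; the position names and the placeholder model and features objects are assumptions for the example, not values fixed by the disclosure.

    import torch
    import torch.nn as nn

    positions = ["main driving", "co-driving", "rear left", "rear right"]

    # Placeholder stand-ins: `model` would be the pre-trained sound source
    # localization model (a training sketch appears later in this section) and
    # `features` the multi-dimensional audio feature vector of the entire
    # target audio (16-dimensional in the 4-microphone example used below).
    model = nn.Sequential(nn.Linear(16, 4))
    features = torch.randn(16)

    with torch.no_grad():
        logits = model(features.unsqueeze(0))           # shape (1, 4)
        first_probs = torch.softmax(logits, dim=-1)[0]  # first position probability per position

    # The position with the largest first position probability is taken as
    # the sound source position of the target audio.
    source_position = positions[int(torch.argmax(first_probs))]
    print(source_position)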
By adopting the above technical scheme, the target audio signals are acquired from the N microphones, the multi-dimensional audio features are extracted from the N target audio signals, and the sound source position of the target audio is determined according to the audio features and the pre-trained sound source localization model. Determining the sound source position of the target audio based on a sound source localization model improves the accuracy of the determination, and determining it from multi-dimensional audio features, rather than from the phase difference alone as in the prior art, further improves both the accuracy and the robustness of the determined sound source position.
Furthermore, the sound source localization model may be a DNN (Deep Neural Network) classification model comprising a plurality of layers, and it can be obtained by training in the following way:
first, sample audio signals generated by a plurality of sound sources are acquired from K microphones.
K may be the same as or different from N in step 11; when K is different from N, K is still an integer greater than or equal to 3. In the present disclosure, audio signals generated by different sound sources at different positions are collected in advance by the K microphones, and the electronic device performing the training process can acquire the audio signals generated by the plurality of sound sources from the K microphones and use them as sample audio signals.
It should be noted that, since the audio features are influenced by the voice, the environment, and noise, training the model with audio signals close to the actual usage scene effectively improves the precision of the trained sound source localization model, so that the sound source position can be determined accurately according to this higher-precision model. Thus, in a preferred embodiment, the sample audio signals generated by a plurality of sound sources in a specific scene are acquired from the K microphones. For example, if the position of a sound source inside a vehicle is to be determined while the vehicle is traveling, the specific scene may be a scene in which the vehicle is in a traveling state.
Then, for each sample audio signal, multi-dimensional audio sample features are extracted.
As described above, the sound source position of the target audio may be determined in units of the entire target audio signal or in units of each frame of audio signal within the target audio signal. Similarly, during training, the multi-dimensional audio sample features may be extracted from the entire sample audio signal as a unit, or from each frame of the sample audio signal as a unit. The present disclosure does not specifically limit this.
In addition, during model training, after the multi-dimensional audio sample features are extracted, the sound source position of the sample audio signal can be labeled on the multi-dimensional audio sample features. For example, if the multi-dimensional audio sample features are extracted from an audio signal generated by a sound source located at the main driving position, the sound source position labeled on those features is the main driving position. As another example, if the sound source localization model outputs a first or second position probability for each position, the label may set the first or second position probability for the main driving position to 1 and the first or second position probabilities for the co-driving, rear left, and rear right positions to 0.
And finally, training by taking the multi-dimensional audio sample characteristics as model training samples to obtain a sound source positioning model.
Illustratively, the deep neural network model can be trained through multi-dimensional audio sample characteristics to obtain a sound source localization model. In this way, when the audio features extracted in step 12 are input to the sound source localization model, the first position probability or the second position probability for each position output by the sound source localization model can be obtained.
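For instance, a minimal training sketch in Python with PyTorch follows; the layer sizes, the 16-dimensional input (matching the 4-microphone example later in this section), the 4 position classes, and all hyperparameters and placeholder data are assumptions, not values fixed by the disclosure.

    import torch
    import torch.nn as nn

    # Multi-layer DNN classifier: a per-sample multi-dimensional audio feature
    # vector in, one logit per preset position out.
    model = nn.Sequential(
        nn.Linear(16, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 4),  # main driving, co-driving, rear left, rear right
    )
    loss_fn = nn.CrossEntropyLoss()  # class-index labels, equivalent to the one-hot labeling above
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Placeholder training data: multi-dimensional audio sample features and
    # their labeled sound source positions (class indices 0-3)
    features = torch.randn(256, 16)
    labels = torch.randint(0, 4, (256,))

    for epoch in range(100):
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()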
The following is a detailed description of the above step 12 for extracting multi-dimensional audio features from the N target audio signals.
As shown in fig. 3, step 12 may specifically include step 121 and step 122.
In step 121, for each target audio signal, the target audio signal is divided into M frames of audio signals.
In this disclosure, for a target audio signal, the time lengths corresponding to each divided frame of audio signal may be the same or different.
For example, assuming the duration of the entire target audio signal is 40 milliseconds, the target audio signal may be divided into 4 frames of equal duration, i.e., each frame lasts 10 milliseconds. Alternatively, the target audio signal may be divided into 4 frames of different durations; for example, the first frame may last 20 milliseconds, the second frame 10 milliseconds, the third frame 5 milliseconds, and the fourth frame 5 milliseconds.
In the present disclosure, the same division rule is applied to all N target audio signals. For example, if one target audio signal is divided into 4 frames of equal duration, the other N-1 target audio signals are also divided into 4 frames of equal duration; if one target audio signal is divided into 4 frames of different durations, say 20, 10, 5, and 5 milliseconds, the other N-1 target audio signals are divided according to the same rule.
According to the above scheme, each of the N target audio signals is divided into M frames of audio signals, and the i-th frame of one target audio signal and the i-th frames of the other target audio signals are the same frame of audio signal, where i ranges over [1, M].
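Illustratively, a minimal Python sketch of the equal-duration framing described above; the 40 ms duration, 16 kHz sampling rate, and M = 4 are assumed values taken from the examples in this section.

    import numpy as np

    def split_into_frames(signal: np.ndarray, num_frames: int) -> list:
        """Split one target audio signal into M frames of equal duration.
        The same rule must be applied to all N target audio signals so that
        the i-th frame of every signal covers the same time span."""
        return np.array_split(signal, num_frames)

    # A 40 ms signal sampled at 16 kHz has 640 samples; M = 4 gives four
    # 10 ms frames of 160 samples each.
    signal = np.random.randn(640)
    frames = split_into_frames(signal, num_frames=4)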
In step 122, multi-dimensional audio features are extracted from the N M-frame audio signals.
Specifically, the way of extracting the multi-dimensional audio features is as follows:
first, in each target audio signal, an energy value of each frame audio signal is determined.
In the present disclosure, the energy value of an audio signal is the sum of the squares of the amplitudes, at each time instant in the time domain, of the audio signal collected by the microphone. Illustratively, at a sampling rate of 16000 Hz, a frame of audio signal with a duration of 10 milliseconds contains 160 sampling points, and the energy value of the frame is obtained by summing the squares of the amplitudes at these sampling points. In this way, for each target audio signal, the energy value of each frame of that target audio signal can be calculated.
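A minimal sketch of the per-frame energy computation, under the same assumed 16 kHz sampling rate:

    import numpy as np

    def frame_energy(frame: np.ndarray) -> float:
        """Energy value of one frame: the sum of the squared amplitudes of
        its time-domain samples."""
        return float(np.sum(frame ** 2))

    # A 10 ms frame at 16 kHz has 160 samples (random placeholder data here)
    frame = np.random.randn(160)
    print(frame_energy(frame))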
Then, for the same frame of audio signal, the phase difference of that frame between every two target audio signals is determined, and the energy difference of that frame between every two target audio signals is determined from the frame's energy values in the two target audio signals.
It should be noted that the energy of an audio signal attenuates differently depending on the distance it travels to each microphone: in theory, the closer a microphone is, the smaller the attenuation and the larger the energy value of the audio signal it collects. The distance from the sound source to each microphone can therefore be inferred from the energy values, and the energy differences further highlight this attenuation information. Thus, in the present disclosure, the multi-dimensional audio features may include phase differences, energy values, and energy differences.
As described above, the i-th frame of each target audio signal and the i-th frames of the other target audio signals are the same frame of audio signal. For the i-th frame, the phase difference of that frame is determined between every two target audio signals, and the energy difference of that frame between every two target audio signals is determined from the energy value of the i-th frame in each target audio signal.
For example, as shown in fig. 4, assume N is 4 and M is 3, so that the target audio signals are a first, a second, a third, and a fourth target audio signal, each comprising a first, a second, and a third frame of audio signal. For the first frame, the energy value of that frame in each target audio signal can be calculated, 6 energy differences can be computed from the 4 energy values, and the following 6 phase differences can be computed by the cross-correlation method: the phase difference of the first frame between the first and second target audio signals, between the first and third, between the first and fourth, between the second and third, between the second and fourth, and between the third and fourth target audio signals.
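A minimal sketch of this pairwise computation for the i-th frame follows; the cross-correlation peak is one common way to estimate the phase (time) difference, and the placeholder data, function names, and 16 kHz sampling rate are assumptions for the example.

    import numpy as np
    from itertools import combinations

    def phase_difference(frame_a: np.ndarray, frame_b: np.ndarray, fs: int = 16000) -> float:
        """Time delay between the same frame as captured by two microphones,
        estimated from the peak of the cross-correlation; positive values mean
        frame_a is delayed relative to frame_b (NumPy's convention)."""
        corr = np.correlate(frame_a, frame_b, mode="full")
        lag = int(np.argmax(corr)) - (len(frame_b) - 1)
        return lag / fs

    # The i-th frame as captured by each of the 4 microphones, with its
    # energy values (random placeholders for the example)
    frames_i = [np.random.randn(160) for _ in range(4)]
    energies_i = [float(np.sum(f ** 2)) for f in frames_i]

    # One phase difference and one energy difference per microphone pair:
    # C(4, 2) = 6 values each
    phase_diffs = [phase_difference(frames_i[a], frames_i[b])
                   for a, b in combinations(range(4), 2)]
    energy_diffs = [energies_i[a] - energies_i[b]
                    for a, b in combinations(range(4), 2)]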
And finally, determining the audio characteristics of the frame of audio signals according to the calculated phase difference, energy value and energy difference of the frame of audio signals.
In this disclosure, all the phase differences, energy values, and energy differences of the frame of audio signal calculated as above may be spliced to form the audio feature of the frame of audio signal, or a first preset number (greater than or equal to 3) of phase differences may be selected from all the phase differences of the frame of audio signal, a second preset number of energy values may be selected from all the energy values, and a third preset number of energy differences may be selected from all the energy differences to be spliced to form the audio feature of the frame of audio signal. The first preset number, the second preset number and the third preset number may be the same or different, and the disclosure does not specifically limit this.
Illustratively, taking the above example as an example, the 4 energy values, 6 energy differences, and 6 phase differences may be spliced to form the multi-dimensional audio feature of the frame audio signal. Or 3, 4 or 5 values can be respectively selected from the 6 energy differences and/or the 6 phase differences, and the 4 energy values are spliced to form the multi-dimensional audio feature of the frame audio signal.
It should be noted that before the audio features of the frame of audio signal are formed by splicing, normalization processing may be performed on the calculated energy values, and then, an energy difference is calculated according to the normalized energy values, and the calculated phase difference, the normalized energy values, and the energy difference calculated by using the normalized energy values are spliced to form the audio features of the frame of audio signal.
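A minimal sketch of the splicing with normalization follows; dividing by the total energy is only one possible normalization scheme, which the disclosure does not fix, and the function name is an assumption.

    import numpy as np
    from itertools import combinations

    def frame_feature_vector(energies, phase_diffs) -> np.ndarray:
        """Splice the multi-dimensional audio feature of one frame: normalized
        energy values, energy differences computed from the normalized
        energies, and the phase differences."""
        e = np.asarray(energies, dtype=np.float64)
        e_norm = e / (e.sum() + 1e-12)  # assumed normalization scheme
        e_diffs = [e_norm[a] - e_norm[b] for a, b in combinations(range(len(e)), 2)]
        return np.concatenate([e_norm, e_diffs, phase_diffs])

    # With N = 4 microphones: 4 energies + 6 energy differences + 6 phase
    # differences give a 16-dimensional feature vector for the frame.
    feature = frame_feature_vector([1.0, 0.8, 0.5, 0.3], [0.0] * 6)
    print(feature.shape)  # (16,)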
In the above manner, the audio characteristics of each frame of audio signal can be acquired.
After the audio features of each frame of audio signal are acquired, for the same frame of audio signals, the multi-dimensional audio features of the frame are input into the pre-trained sound source localization model to obtain the second position probabilities, for each position, of the frame output by the sound source localization model.
Illustratively, the multi-dimensional audio features of the i-th frame of audio signal are input into the pre-trained sound source localization model, which outputs the second position probability of the i-th frame for each position. For example, suppose the in-vehicle positions are divided in advance into the main driving position, the co-driving position, the rear left position, and the rear right position; the second position probability of the i-th frame for the main driving position is denoted P_i1, that for the co-driving position P_i2, that for the rear left position P_i3, and that for the rear right position P_i4. In this way, a second position probability for each position can be obtained for every frame of audio signal.
And determining the first position probability of the target audio aiming at the position according to the M second position probabilities aiming at the same position.
By acquiring the second position probabilities for each position of every frame in the above manner, M second position probabilities are obtained for each position. Illustratively, as shown in fig. 4, 3 second position probabilities are obtained for each of the main driving, co-driving, rear left, and rear right positions. Thereafter, to obtain the sound source position of the target audio signal, the 3 second position probabilities for the main driving position are averaged to obtain the first position probability for the main driving position, and the same is done for the co-driving, rear left, and rear right positions to obtain their respective first position probabilities.
After the first position probabilities for the respective positions are acquired, the position corresponding to the maximum first position probability is determined as the sound source position of the target audio.
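Illustratively, a minimal Python sketch of this frame-level aggregation; the position names and the placeholder probabilities are assumptions for the example.

    import numpy as np

    positions = ["main driving", "co-driving", "rear left", "rear right"]

    def localize(second_probs: np.ndarray) -> str:
        """second_probs has shape (M, P): one second position probability per
        frame (M frames) and preset position (P positions). Averaging over the
        frames gives the first position probabilities; the position with the
        largest first position probability is the sound source position."""
        first_probs = second_probs.mean(axis=0)
        return positions[int(np.argmax(first_probs))]

    # Example with M = 3 frames and P = 4 positions, as in fig. 4
    second_probs = np.array([[0.7, 0.1, 0.1, 0.1],
                             [0.6, 0.2, 0.1, 0.1],
                             [0.5, 0.2, 0.2, 0.1]])
    print(localize(second_probs))  # -> main driving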
With the above technical scheme, when the position of the sound source is determined, the energy values and energy differences of the audio arriving at the microphones are taken into account in addition to the phase differences, so the sound source position can be determined more accurately.
Based on the same inventive concept, the present disclosure also provides a sound source positioning device. Fig. 5 is a block diagram illustrating a sound source localization apparatus according to an exemplary embodiment. The apparatus may include:
a first obtaining module 51, configured to obtain a target audio signal from N microphones, where each of the microphones is disposed at a different position, and N is an integer greater than or equal to 3;
a first extraction module 52, configured to extract multi-dimensional audio features from the N target audio signals;
and the determining module 53 is configured to determine a sound source position of the target audio according to the multi-dimensional audio features and a pre-trained sound source positioning model.
Optionally, the determining module 53 may include:
the input submodule is used for inputting the multi-dimensional audio features into a pre-trained sound source localization model to obtain a first position probability for each position, wherein the number of positions is greater than or equal to 2 and the first position probability is used for representing the probability that the corresponding position is the sound source position of the target audio;
and the determining submodule is used for determining the position corresponding to the maximum first position probability as the sound source position of the target audio in the plurality of first position probabilities.
Optionally, the first extraction module 52 may include:
a dividing submodule, used for dividing, for each target audio signal, the target audio signal into M frames of audio signals;
and the extraction submodule is used for extracting multi-dimensional audio features from the N M frames of audio signals.
Optionally, the extracting sub-module may be configured to determine the energy value of each frame of audio signal in each target audio signal and, for the same frame of audio signals, perform the following steps: determining the phase difference of the frame of audio signals between every two target audio signals, and determining the energy difference of the frame of audio signals between every two target audio signals according to the energy values of the frame of audio signals in the two target audio signals; and determining the audio features of the frame of audio signals according to the phase difference, the energy values, and the energy differences of the frame of audio signals.
Optionally, the input sub-module may be configured to, for the same frame of audio signal, input the multidimensional audio features of the frame of audio signal to a pre-trained sound source localization model to obtain a second position probability, for each position, of the frame of audio signal output by the sound source localization model; and determining the first position probability of the target audio aiming at the position according to the M second position probabilities aiming at the same position.
Optionally, the apparatus may further include:
the second acquisition module is used for acquiring sample audio signals generated by a plurality of sound sources from K microphones, wherein K is an integer greater than or equal to 3;
the second extraction module is used for extracting multi-dimensional audio sample characteristics aiming at each sample audio signal, wherein the multi-dimensional audio sample characteristics are marked with the sound source position of the sample audio signal;
and the training module is used for performing training by taking the multi-dimensional audio sample features as model training samples to obtain the sound source localization model.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
FIG. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment. As shown in fig. 6, the electronic device 600 may include: a processor 601 and a memory 602. The electronic device 600 may also include one or more of a multimedia component 603, an input/output (I/O) interface 604, and a communications component 605.
The processor 601 is configured to control the overall operation of the electronic device 600, so as to complete all or part of the steps of the sound source localization method. The memory 602 is used to store various types of data to support operation at the electronic device 600, such as instructions for any application or method operating on the electronic device 600 and application-related data, such as contact data, transmitted and received messages, pictures, audio, video, and so forth. The memory 602 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. The multimedia component 603 may include a screen and an audio component, where the screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals; the received audio signal may further be stored in the memory 602 or transmitted through the communication component 605. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 604 provides an interface between the processor 601 and other interface modules, such as a keyboard, a mouse, or buttons, where the buttons may be virtual or physical. The communication component 605 is used for wired or wireless communication between the electronic device 600 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, NB-IoT, eMTC, 5G, or a combination of one or more of them, which is not limited herein; accordingly, the communication component 605 may include a Wi-Fi module, a Bluetooth module, an NFC module, and the like.
In an exemplary embodiment, the electronic Device 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the sound source localization method described above.
In another exemplary embodiment, a computer readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the sound source localization method described above is also provided. For example, the computer readable storage medium may be the above-described memory 602 comprising program instructions executable by the processor 601 of the electronic device 600 to perform the above-described sound source localization method.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the sound source localization method described above when executed by the programmable apparatus.
The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.
It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, various possible combinations will not be separately described in this disclosure.
In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims (5)

1. A sound source localization method, comprising:
acquiring target audio signals from N microphones, wherein each microphone is arranged at a different position, the N microphones are located on different straight lines, and N is an integer greater than or equal to 3;
for each target audio signal, dividing the target audio signal into M frames of audio signals;
determining the energy value of each frame of audio signal in each target audio signal;
for the same frame of audio signals, the following steps are performed:
determining the phase difference of the frame of audio signals between every two target audio signals, and determining the energy difference of the frame of audio signals between every two target audio signals according to the energy values of the frame of audio signals in the two target audio signals;
determining the audio characteristics of the frame of audio signals according to the phase difference, the energy value and the energy difference of the frame of audio signals;
aiming at the same frame of audio signal, inputting the multi-dimensional audio features of the frame of audio signal into a pre-trained sound source positioning model to obtain a second position probability aiming at each position of the frame of audio signal output by the sound source positioning model;
determining a first position probability of the target audio for the position according to the M second position probabilities for the same position, wherein the number of positions is greater than or equal to 2 and the first position probability is used for representing the probability that the corresponding position is the sound source position of the target audio;
and determining the position corresponding to the maximum first position probability as the sound source position of the target audio in the plurality of first position probabilities.
2. The method of claim 1, wherein the sound source localization model is trained by:
acquiring sample audio signals generated by a plurality of sound sources from K microphones, wherein K is an integer greater than or equal to 3;
extracting multi-dimensional audio sample characteristics aiming at each sample audio signal, wherein the multi-dimensional audio sample characteristics are marked with the sound source position of the sample audio signal;
and training by taking the multi-dimensional audio sample characteristics as a model training sample to obtain the sound source positioning model.
3. A sound source localization apparatus, comprising:
the first acquisition module is used for acquiring target audio signals from N microphones, wherein each microphone is arranged at a different position, the N microphones are located on different straight lines, and N is an integer greater than or equal to 3;
a dividing submodule, used for dividing, for each target audio signal, the target audio signal into M frames of audio signals;
an extraction submodule, used for determining the energy value of each frame of audio signal in each target audio signal and, for the same frame of audio signals, performing the following steps: determining the phase difference of the frame of audio signals between every two target audio signals, and determining the energy difference of the frame of audio signals between every two target audio signals according to the energy values of the frame of audio signals in the two target audio signals; and determining the audio features of the frame of audio signals according to the phase difference, the energy values, and the energy differences of the frame of audio signals;
an input submodule, used for inputting, for the same frame of audio signals, the multi-dimensional audio features of the frame of audio signals into a pre-trained sound source localization model to obtain second position probabilities, for each position, of the frame of audio signals output by the sound source localization model, and for determining a first position probability of the target audio for the position according to the M second position probabilities for the same position, wherein the number of positions is greater than or equal to 2 and the first position probability is used for representing the probability that the corresponding position is the sound source position of the target audio;
and the determining submodule is used for determining the position corresponding to the maximum first position probability as the sound source position of the target audio in the plurality of first position probabilities.
4. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1-2.
5. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1-2.
CN201911373874.0A 2019-12-27 2019-12-27 Sound source positioning method and device, readable storage medium and electronic equipment Active CN111161757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911373874.0A CN111161757B (en) 2019-12-27 2019-12-27 Sound source positioning method and device, readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911373874.0A CN111161757B (en) 2019-12-27 2019-12-27 Sound source positioning method and device, readable storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111161757A CN111161757A (en) 2020-05-15
CN111161757B true CN111161757B (en) 2021-09-03

Family

ID=70558281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911373874.0A Active CN111161757B (en) 2019-12-27 2019-12-27 Sound source positioning method and device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111161757B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2530484A1 (en) * 2011-06-01 2012-12-05 Dolby Laboratories Licensing Corporation Sound source localization apparatus and method
CN109782231A (en) * 2019-01-17 2019-05-21 北京大学 A kind of end-to-end sound localization method and system based on multi-task learning
CN110068795A (en) * 2019-03-31 2019-07-30 天津大学 A kind of indoor microphone array sound localization method based on convolutional neural networks
CN110517705A (en) * 2019-08-29 2019-11-29 北京大学深圳研究生院 A kind of binaural sound sources localization method and system based on deep neural network and convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9788109B2 (en) * 2015-09-09 2017-10-10 Microsoft Technology Licensing, Llc Microphone placement for sound source direction estimation

Also Published As

Publication number Publication date
CN111161757A (en) 2020-05-15

Similar Documents

Publication Publication Date Title
EP3812969A1 (en) Neural network model compression method, corpus translation method and device
EP3825923A1 (en) Hypernetwork training method and device, electronic device and storage medium
CN112149740B (en) Target re-identification method and device, storage medium and equipment
CN109754793A (en) Device and method for recommending the function of vehicle
CN109657539B (en) Face value evaluation method and device, readable storage medium and electronic equipment
CN110543849B (en) Detector configuration method and device, electronic equipment and storage medium
CN111739539A (en) Method, device and storage medium for determining number of speakers
CN110321410B (en) Log extraction method and device, storage medium and electronic equipment
KR20180025634A (en) Voice recognition apparatus and method
CN110930984A (en) Voice processing method and device and electronic equipment
CN111667810B (en) Method and device for acquiring polyphone corpus, readable medium and electronic equipment
KR20180133645A (en) Method and apparatus for searching geographic information using interactive speech recognition
CN110334716B (en) Feature map processing method, image processing method and device
KR20170120645A (en) Method and device for determining interchannel time difference parameter
CN111161757B (en) Sound source positioning method and device, readable storage medium and electronic equipment
JP5949311B2 (en) Estimation program, estimation apparatus, and estimation method
CN110956128A (en) Method, apparatus, electronic device, and medium for generating lane line image
US20200046595A1 (en) Source-of-sound based navigation for a visually-impaired user
JP7340630B2 (en) Multi-speaker diarization of speech input using neural networks
US20230269515A1 (en) Earphone position adjustment method and apparatus, and equipment and storage medium
CN111857366B (en) Method and device for determining double-click action of earphone and earphone
US11482211B2 (en) Method and apparatus for outputting analysis abnormality information in spoken language understanding
Jeon et al. Acoustic surveillance of hazardous situations using nonnegative matrix factorization and hidden Markov model
Frisch et al. A Bayesian stochastic machine for sound source localization
CN114863943B (en) Self-adaptive positioning method and device for environmental noise source based on beam forming

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant