CN111880148A

CN111880148A - Sound source positioning method, device, equipment and storage medium

Info

Publication number: CN111880148A
Application number: CN202010790574.9A
Authority: CN
Inventors: 王备
Original assignee: Beijing ByteDance Network Technology Co Ltd
Current assignee: Beijing ByteDance Network Technology Co Ltd
Priority date: 2020-08-07
Filing date: 2020-08-07
Publication date: 2020-11-03

Abstract

The application relates to a sound source positioning method, a sound source positioning device, a sound source positioning equipment and a storage medium. The method comprises the following steps: the audio acquisition equipment acquires audio signals by utilizing an arranged microphone array, wherein the microphone array comprises a plurality of microphones which are respectively arranged in different directions of the audio acquisition equipment; the audio acquisition equipment determines a frequency domain signal of the audio signal; the audio acquisition equipment calculates target generalized cross-correlation values in multiple directions based on frequency information of the frequency domain signal corresponding to the microphone; the audio acquisition device determines a sound source direction of the audio signal from the plurality of directions based on the delay characteristics characterized by the target generalized cross-correlation values in the plurality of directions. Therefore, the sound source can be quickly positioned based on the generalized cross-correlation value, and quick positioning can be realized even under the condition of multiple sound sources, so that the problems that the prior art is slow in reaction speed and cannot support multiple sound source scenes are fundamentally solved.

Description

Sound source positioning method, device, equipment and storage medium

Technical Field

The present application relates to audio processing technologies, and in particular, to a sound source localization method, apparatus, device, and storage medium.

Background

At present, a microphone array used in a teleconference scenario usually relies on energy estimation for audio transmission: using microphone array beamforming, the fixed beam with the maximum captured energy is selected, from a plurality of preset fixed beams in different directions, as the target signal to be transmitted. Although this method is simple to implement, it has the following disadvantages: first, sound source localization cannot be achieved; second, the response is slow, because energy estimation requires a certain accumulation time, so words are easily lost at the moment a new sound source appears; third, when multiple sound sources are present simultaneously, the relatively weak sound sources are ignored.

Disclosure of Invention

In order to solve the problems, the invention provides a sound source positioning method, a sound source positioning device, sound source positioning equipment and a sound source positioning storage medium, which can quickly position a sound source based on generalized cross-correlation values, can realize quick positioning even under the condition of multiple sound sources, and fundamentally solve the problems that the prior art is slow in response speed and cannot support multiple sound source scenes.

In a first aspect, an embodiment of the present application provides a sound source localization method, including:

the audio acquisition equipment acquires audio signals by utilizing an arranged microphone array, wherein the microphone array comprises a plurality of microphones which are respectively arranged in different directions of the audio acquisition equipment and are used for acquiring the audio signals from different directions;

the audio acquisition equipment determines a frequency domain signal of the audio signal;

the audio acquisition equipment calculates target generalized cross-correlation values in multiple directions based on frequency information of the frequency domain signals corresponding to the microphones, wherein the target generalized cross-correlation values in any one direction of the multiple directions are used for representing delay characteristics of frequency information reaching a pair of microphones in the microphone array;

the audio acquisition device determines a sound source direction of the audio signal from the plurality of directions based on the delay characteristics characterized by the target generalized cross-correlation values in the plurality of directions.

In a specific example of the present application, the method further includes:

combining any two microphones in the microphone array to obtain N microphone pairs, where $N = C_M^2 = M(M-1)/2$ and M is the number of microphones in the microphone array.

In a specific example of the present application, the calculating, by the audio acquisition device, target generalized cross-correlation values in multiple directions based on frequency information of the frequency domain signal corresponding to the microphone includes:

calculating generalized cross-correlation values corresponding to each pair of microphones in the microphone array for one direction based on each frequency information of the frequency domain signal corresponding to the microphone;

and obtaining a target generalized cross-correlation value in one direction based on all the generalized cross-correlation values in one direction so as to obtain target generalized cross-correlation values in multiple directions.

In a specific example of the present application, the determining, by the audio acquisition device, a sound source direction of the audio signal from the plurality of directions based on the delay features represented by the target generalized cross-correlation values in the plurality of directions includes:

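taking the direction corresponding to the maximum value among the target generalized cross-correlation values as the sound source direction of the audio signal.

In a specific example of the present application, the determining, by the audio acquisition device, a sound source direction of the audio signal from the plurality of directions based on the delay features represented by the target generalized cross-correlation values in the plurality of directions includes:

selecting a suspected sound source direction of the audio signal from the plurality of directions based on the delay characteristics represented by the target generalized cross-correlation values;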

determining a plurality of adjacent directions corresponding to the suspected sound source direction;

and calculating to obtain target generalized cross-correlation values of the multiple adjacent directions, and determining the sound source direction of the audio signal from the multiple adjacent directions.

In a specific example of the present application, the determining a sound source direction of the audio signal from the plurality of adjacent directions includes:

and taking the direction corresponding to the maximum value in the target generalized cross-correlation values of the multiple adjacent directions as the sound source direction of the audio signal.

In a second aspect, an embodiment of the present application provides a sound source localization apparatus, including:

an acquisition unit, configured to acquire audio signals by using an arranged microphone array, wherein the microphone array comprises a plurality of microphones, and the microphones are respectively arranged in different directions of the audio acquisition equipment and are used for acquiring the audio signals from different directions;

a signal conversion unit for determining a frequency domain signal of the audio signal;

a calculating unit, configured to calculate target generalized cross-correlation values in multiple directions based on frequency information of the frequency-domain signal corresponding to the microphone, where the target generalized cross-correlation value in any one of the multiple directions is used to characterize delay characteristics of frequency information reaching a pair of microphones in the microphone array;

a positioning unit, configured to determine a sound source direction of the audio signal from the multiple directions based on the delay characteristics represented by the target generalized cross-correlation values in the multiple directions.

In a specific example of the present application, the calculating unit is further configured to:

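combine any two microphones in the microphone array to obtain N microphone pairs, where $N = C_M^2 = M(M-1)/2$ and M is the number of microphones in the microphone array.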

In a specific example of the present application, the positioning unit is further configured to use a direction corresponding to a maximum value in the target generalized cross-correlation value as a sound source direction of the audio signal.

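In a specific example of the present application, the positioning unit is further configured to:

select a suspected sound source direction of the audio signal from the plurality of directions based on the delay characteristics represented by the target generalized cross-correlation values;

determine a plurality of adjacent directions corresponding to the suspected sound source direction;

and calculate target generalized cross-correlation values of the plurality of adjacent directions, and determine the sound source direction of the audio signal from the plurality of adjacent directions.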

In a third aspect, an embodiment of the present application provides a sound source localization apparatus, including:

one or more processors;

a memory communicatively coupled to the one or more processors;

one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs being configured to perform the method described above.

In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method described above.

Therefore, according to the scheme of the present application, the microphones in the microphone array collect audio signals from different directions. Because the delay characteristics with which an audio signal reaches the two microphones of a microphone pair (namely a pair formed by any two microphones in the microphone array) differ from direction to direction, the generalized cross-correlation values in multiple directions (namely the target generalized cross-correlation values) can be calculated rapidly from the frequency information corresponding to the audio signals, thereby completing the sound source localization. Moreover, this localization approach is also applicable to multiple sound sources and still achieves fast localization without word loss, thereby fundamentally solving the problems that the prior art responds slowly and cannot support multi-source scenarios.

Drawings

FIG. 1 is a schematic flow chart of a sound source localization method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a sound source localization method according to an embodiment of the present application in a specific application scenario;

FIG. 3 is a schematic diagram of a two-step positioning method according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a sound source localization apparatus according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a sound source localization device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

In some of the flows described in the specification and claims of the present application and in the above-described figures, a number of operations are included that occur in a particular order, but it should be clearly understood that the flows may include more or less operations, and that the operations may be performed sequentially or in parallel.

The embodiment of the application provides a sound source positioning method, a sound source positioning device, sound source positioning equipment and a storage medium; specifically, fig. 1 is a schematic flow chart of an implementation of a sound source localization method according to an embodiment of the present invention, and as shown in fig. 1, the method includes:

step 101: the audio acquisition equipment acquires audio signals by utilizing the arranged microphone array, wherein the microphone array comprises a plurality of microphones which are respectively arranged in different directions of the audio acquisition equipment and used for acquiring the audio signals from different directions. For example, the microphone array includes M microphones, and each of the microphones is disposed in a different direction of the audio acquisition device to acquire an audio signal from the different direction.

In this scheme, the microphone array is an omnidirectional microphone array: the microphones in the array are arranged in different directions of the audio acquisition equipment, so that audio signals from all directions can be acquired. In addition, because the delay with which an audio signal reaches microphones at different positions differs, this lays the foundation for sound source localization.

Step 102: the audio acquisition device determines a frequency domain signal of the audio signal, for example, the audio acquisition device performs short-time fourier transform on the audio signal acquired by all the microphones to obtain the frequency domain signal of the audio signal.

In practical applications, for the same sound source, since the microphones are disposed in different directions, the audio signals collected by different microphones for the same sound source are different. At this time, the audio acquisition device performs short-time fourier transform on the audio signals acquired by all the microphones to obtain frequency domain signals corresponding to the audio signals acquired by each microphone.
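By way of illustration only, the per-microphone transform of one frame may be sketched as follows. This is a minimal sketch and not the implementation of the present application: the function name `frame_to_frequency_domain`, the Hann window, and the frame length and sample rate used in the usage note are assumptions introduced for the example.

```python
import numpy as np


def frame_to_frequency_domain(frames, sample_rate):
    """Transform one time-domain frame per microphone into the frequency domain.

    frames: array of shape (M, frame_len), one row per microphone.
    Returns (spectra, freqs_hz): spectra has shape (M, frame_len // 2 + 1)
    and freqs_hz holds the centre frequency of each bin in Hz.
    """
    frames = np.asarray(frames, dtype=float)
    frame_len = frames.shape[1]
    # Assumed analysis window; a single windowed FFT is one STFT time slice.
    window = np.hanning(frame_len)
    # Real-input FFT along the time axis, applied to every microphone at once.
    spectra = np.fft.rfft(frames * window, axis=1)
    freqs_hz = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    return spectra, freqs_hz
```

For instance, with 6 microphones, a 512-sample frame and a 16 kHz sample rate, `spectra` has shape `(6, 257)`, and each row is the frequency domain signal of one microphone.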

Step 103: the audio acquisition equipment calculates target generalized cross-correlation values in multiple directions based on frequency information of the frequency domain signals corresponding to the microphones, wherein the target generalized cross-correlation values in any one direction of the multiple directions are used for representing delay characteristics of frequency information reaching a pair of microphones in the microphone array.

In a specific example, a pair of microphones, which may also be referred to as a microphone pair, may be obtained as follows: any two microphones in the microphone array are combined to obtain N microphone pairs, where $N = C_M^2 = M(M-1)/2$ and M is the number of microphones in the microphone array. That is, the microphones are combined two by two to obtain the microphone pairs, which lays a foundation for sound source localization.
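As a small illustration of the pairwise combination, the microphone pairs may be enumerated as below. The function name `microphone_pairs` and the zero-based microphone indices are assumptions made for this sketch.

```python
from itertools import combinations


def microphone_pairs(num_mics):
    """Return every unordered pair (i, j), i < j, of microphone indices."""
    return list(combinations(range(num_mics), 2))
```

For a 6-microphone array this yields 6 × 5 / 2 = 15 pairs, beginning with `(0, 1)` and ending with `(4, 5)`.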

Step 104: the audio acquisition device determines a sound source direction of the audio signal from the plurality of directions based on the delay characteristics characterized by the target generalized cross-correlation values in the plurality of directions.

In practical application, the audio signal can be an audio signal corresponding to one sound source, and can also be an audio signal with a plurality of sound sources mixed, that is, the scheme of the application can be used for positioning one sound source, and meanwhile, multi-sound-source positioning can also be realized. Here, when the audio signal is an audio signal in which a plurality of sound sources are mixed, the determined sound source directions are also a plurality of directions.

Here, the directions are a plurality of directions set in advance, and thus, sound source localization is accomplished by calculating a target generalized cross-correlation value in the directions set in advance.

In a specific example, the target generalized cross-correlation values may be obtained as follows: for one direction, a generalized cross-correlation value is calculated for each pair of microphones in the microphone array based on each item of frequency information of the frequency domain signals corresponding to the microphones; a target generalized cross-correlation value in that direction is then obtained based on all the generalized cross-correlation values for that direction, and in this way target generalized cross-correlation values in multiple directions are obtained. In other words, one item of frequency information together with one microphone pair yields one generalized cross-correlation value. Because the frequency domain signal includes a plurality of items of frequency information and there are a plurality of microphone pairs, a plurality of generalized cross-correlation values are obtained for one direction. For example, the target generalized cross-correlation value in a direction is obtained by adding all the generalized cross-correlation values for that direction.

In practical application, after the target generalized cross-correlation values in each direction are determined, the direction corresponding to the maximum value in the target generalized cross-correlation values is taken as the sound source direction of the audio signal, so that sound source positioning is realized.
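The summation and maximum selection described above can be sketched as follows, assuming the individual generalized cross-correlation values have already been computed and stored in an array indexed by direction, microphone pair and frequency. The array layout and the name `pick_source_direction` are assumptions of this sketch.

```python
import numpy as np


def pick_source_direction(gcc_values):
    """Select the direction with the largest target GCC value.

    gcc_values: array of shape (D, N, K) holding one GCC value per preset
    direction, microphone pair and frequency item.
    Returns (best_direction_index, target_values), where target_values[d]
    is the sum of all GCC values for direction d.
    """
    gcc_values = np.asarray(gcc_values, dtype=float)
    # Target GCC value per direction: add up over all pairs and frequencies.
    target_values = gcc_values.sum(axis=(1, 2))
    return int(np.argmax(target_values)), target_values
```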

In a specific example, to reduce the amount of computation, a two-step method may be adopted to obtain the sound source direction of the audio signal. First, in coarse positioning, a suspected sound source direction of the audio signal is selected from the multiple directions based on the delay characteristics represented by the target generalized cross-correlation values; for simplicity, the coarse positioning directions may specifically be the acquisition directions directly facing the microphones. Then, in fine positioning, a plurality of adjacent directions corresponding to the suspected sound source direction are determined, for example by selecting them within plus or minus a preset number of degrees of the suspected sound source direction; target generalized cross-correlation values of the plurality of adjacent directions are calculated in the same manner, and the sound source direction of the audio signal is determined from the plurality of adjacent directions based on those values. In this way, fast positioning is achieved with a reduced amount of computation. Of course, in a specific example, the direction corresponding to the maximum of the target generalized cross-correlation values of the plurality of adjacent directions may be taken as the sound source direction of the audio signal.

In a conference call scenario, after the sound source is localized, audio is acquired from the localized sound source direction based on the localization result, and the acquired signals are mixed and then transmitted, which completes the audio acquisition and transmission process in the conference call scenario. Moreover, because the scheme of the present application localizes quickly, problems such as word loss can be avoided; when the sound source direction changes, the sound source can still be localized quickly, which fundamentally solves the problems that the prior art responds slowly and cannot support multi-source conference scenarios. Furthermore, because only the selected audio is transmitted in the scheme of the present application, the call quality in the call scenario can be ensured, laying a foundation for improving the user experience.

The present application is described in further detail below with reference to a specific example. This example takes a conference call scenario and uses a uniform circular microphone array (all microphones are placed on a circle at equal intervals) to achieve fast sound source localization, thereby fundamentally solving the problems of slow response speed and the inability to support multi-source scenarios in the prior art. Here, sound source localization (SSL) in this example refers to fast sound source localization using the phase information of the signals collected by the microphones.

As shown in fig. 2, the fast sound source localization procedure based on the phase information is as follows:

obtaining a frame of time domain signals for all microphones in the array of microphones;

for all microphones, transforming the acquired Time domain signal into a frequency domain signal by Short Time Fourier Transform (STFT);

for all microphones, calculating the phase of the frequency domain signal to obtain frequency information corresponding to the frequency domain signal;

pairing all the microphones in the microphone array two by two to obtain $C_M^2 = M(M-1)/2$ microphone pairs, where M is the number of microphones. For example, when 6 microphones are used, 15 microphone pairs are obtained;

presetting D target directions, where D is a positive integer greater than or equal to 2; for each target direction, calculating a generalized cross-correlation (GCC) value from one microphone pair and one item of frequency information of the frequency domain signals, and repeating the calculation in the same manner for all frequency points and all microphone pairs, so that a plurality of generalized cross-correlation values are obtained for that target direction; summing these values gives the output value (that is, the target generalized cross-correlation value) in the target direction (indicated by a bold arrow in the figure);

for the D directions, D output values are obtained; the maximum of the D output values is found, and the corresponding target direction is the direction of arrival (DOA) of the sound source.
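The flow above may be sketched end to end as follows. This is a hedged sketch under stated assumptions, not the implementation of the present application: microphone m of the uniform circular array is assumed to sit at angle 2πm/M on a circle of a given radius, the source is assumed to be in the far field, the speed of sound is taken as 343 m/s, and the score of one pair at one frequency is taken as the cosine of the phase difference compensated by the delay expected for the candidate direction. All function and parameter names are hypothetical.

```python
import numpy as np
from itertools import combinations

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def arrival_times(theta, mic_angles, radius):
    """Far-field arrival time at each microphone relative to the array centre.

    theta: source direction in radians; mic_angles: angle of each microphone
    on the circle. Microphones nearer the source receive the wave earlier.
    """
    return -(radius / SPEED_OF_SOUND) * np.cos(theta - mic_angles)


def scan_directions(spectra, freqs_hz, mic_angles, radius, num_dirs):
    """Return (best_direction_index, outputs) over num_dirs preset directions.

    spectra: (M, K) frequency domain coefficients, one row per microphone,
    restricted by the caller to the frequency bins of interest.
    """
    phases = np.angle(spectra)        # phase of every bin, shape (M, K)
    omega = 2.0 * np.pi * freqs_hz    # angular frequency of every bin
    pairs = list(combinations(range(len(mic_angles)), 2))
    outputs = np.zeros(num_dirs)
    for d in range(num_dirs):
        theta = 2.0 * np.pi * d / num_dirs
        t = arrival_times(theta, mic_angles, radius)
        for i, j in pairs:
            # Expected delay for this direction: arrival time at microphone i
            # minus arrival time at microphone j (sign convention assumed).
            tau = t[i] - t[j]
            # One GCC value per frequency, summed over all frequencies.
            outputs[d] += np.sum(np.cos(phases[i] - phases[j] + omega * tau))
    return int(np.argmax(outputs)), outputs
```

When the true direction coincides with one of the preset directions, every cosine term for that direction is close to 1, so its output approaches the number of pairs multiplied by the number of frequency bins, which is the largest value attainable.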

In practical application, a generalized cross-correlation (hereinafter referred to as GCC) value can be calculated as follows:

taking the microphone pair composed of microphones #1 and #2 as an example, the generalized cross-correlation over the angular frequency ω is defined as:

$$\mathrm{GCC}_{12}(\tau) = \int_{-\infty}^{+\infty} \Psi_{12}(\omega)\, X_1(\omega)\, X_2^{*}(\omega)\, e^{j\omega\tau}\, d\omega$$

where $\Psi_{12}(\omega)$ is a weighting function of the angular frequency, $\omega = 2\pi f$ ($f$ denotes the frequency information), $X_1(\omega)$ and $X_2(\omega)$ are the frequency domain coefficients at angular frequency $\omega$ obtained by applying the STFT to the signals of microphones 1 and 2, respectively, $*$ denotes the complex conjugate, and $\tau$ denotes the time difference of arrival (TDOA) of a far-field sound source in the current direction between the positions of microphones 1 and 2.

Here, in order to achieve fast sound source localization, a phase transform (PHAT) weighting function may be used, that is:

$$\Psi_{12}(\omega) = \frac{1}{\left| X_1(\omega)\, X_2^{*}(\omega) \right|}$$

At this time, the generalized cross-correlation formula can be transformed into:

$$\mathrm{GCC}_{12}(\tau) = \int_{-\infty}^{+\infty} e^{\,j\left[\angle X_1(\omega) - \angle X_2(\omega) + \omega\tau\right]}\, d\omega$$

taking into account the conjugate symmetry of the Fourier transform of a real signal, i.e. $\angle X_m(-\omega) = -\angle X_m(\omega)$, $m = 1, 2, \ldots$, the above formula can be simplified as:

$$\mathrm{GCC}_{12}(\tau) = 2\int_{0}^{+\infty} \cos\left[\angle X_1(\omega) - \angle X_2(\omega) + \omega\tau\right] d\omega$$

in practical application, the upper and lower limits of the integral are restricted to a certain frequency band, for example [500, 3000] Hz, in consideration of the characteristics of speech, which increases the stability of the algorithm.
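To illustrate the PHAT weighting and the band limitation for a single microphone pair, the sketch below evaluates the same quantity in two ways over the positive in-band frequency bins: as the real part of the phase-only cross-spectrum rotated by the candidate delay, and as a sum of cosines of compensated phase differences. It is a sketch under assumptions: the integral is replaced by a sum over discrete bins, the constant factor of 2 is omitted, τ is taken as the arrival time at microphone 1 minus the arrival time at microphone 2, and the function names are hypothetical.

```python
import numpy as np


def band_mask(freqs_hz, band=(500.0, 3000.0)):
    """Boolean mask of the bins inside the frequency band, in Hz."""
    return (freqs_hz >= band[0]) & (freqs_hz <= band[1])


def gcc_phat_complex(x1, x2, freqs_hz, tau, band=(500.0, 3000.0)):
    """PHAT-weighted GCC for one pair, complex exponential form."""
    keep = band_mask(freqs_hz, band)
    cross = x1[keep] * np.conj(x2[keep])
    # PHAT weighting: divide by the magnitude so that only phase remains.
    weighted = cross / np.maximum(np.abs(cross), 1e-12)
    omega = 2.0 * np.pi * freqs_hz[keep]
    return float(np.real(np.sum(weighted * np.exp(1j * omega * tau))))


def gcc_phat_cosine(x1, x2, freqs_hz, tau, band=(500.0, 3000.0)):
    """Equivalent form using only the phases of the two spectra."""
    keep = band_mask(freqs_hz, band)
    omega = 2.0 * np.pi * freqs_hz[keep]
    phase_diff = np.angle(x1[keep]) - np.angle(x2[keep])
    return float(np.sum(np.cos(phase_diff + omega * tau)))
```

The cosine form needs only the phases, which is why the phase of each frequency domain signal can be computed once per microphone and reused for every pair and every candidate direction.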

Here, to reduce the amount of calculation, this example may also adopt a two-step SSL approach to achieve fast localization. Specifically, in a conference call scenario, sound source localization is usually required in the two-dimensional horizontal plane, that is, over the full 360° space. To improve the localization accuracy, the number D of preset target directions should be as large as possible. For example, when D = 60, the localization accuracy can reach 360°/(2D) = 3°. However, the amount of computation grows as D increases. Therefore, to significantly reduce the amount of computation without reducing the value of D (i.e., without reducing the accuracy), this example proposes a two-step localization scheme based on the circumferential symmetry of the uniform circular array. The specific steps are as follows:

assuming that a uniform circular array includes M microphones, for convenience of implementation D is chosen as an integer multiple of M, i.e., D = D₁M, D₁ ∈ Z⁺. For example, for a uniform circular array of 6 microphones, i.e., M = 6, D = D₁M = 60 with D₁ = 10 may be selected, where D₁ is the precision.

First, coarse positioning: scan the M directions directly facing the M microphones and find the direction corresponding to the maximum GCC value.

Second, fine positioning: scan D₁ − 1 fine directions on each side of the direction returned by the first step, that is, 2(D₁ − 1) + 1 directions in total including that direction, and find the direction corresponding to the maximum GCC value to obtain the final positioning result.

For example, as shown in fig. 3, with M = 6 and D = 60:

In the first step, coarse positioning is performed in the 6 directions directly facing the microphones to obtain the direction corresponding to the maximum GCC value (the direction corresponding to the gray icon in the figure).

In the second step, fine scanning (shown as an arc) is performed on both sides of the direction corresponding to the gray icon, each side having D₁ − 1 = D/M − 1 = 9 equally spaced directions; together with the direction of the gray icon itself, a total of 2(D₁ − 1) + 1 = 19 directions are scanned to obtain the final positioning result.

In the example shown in fig. 3, with two-step positioning only 6 + 19 = 25 directions need to be scanned, which reduces the amount of computation by more than half compared with directly scanning all 60 directions.
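The two-step scan can be sketched as below. The sketch assumes that microphone m faces the angle 360°·m/M, so that the directions facing the microphones are every D₁-th direction of the D-direction grid, and that a caller-supplied function `score` returns the target GCC value for an angle in degrees. Both the function name `two_step_search` and the `score` callable are hypothetical.

```python
def two_step_search(score, num_mics, d1):
    """Coarse-then-fine search over D = num_mics * d1 preset directions.

    score: callable mapping an angle in degrees to a target GCC value.
    Returns (best_angle_deg, number_of_directions_scanned).
    """
    num_dirs = num_mics * d1
    step_deg = 360.0 / num_dirs

    # Step 1, coarse positioning: only the directions facing the microphones.
    coarse_idx = [m * d1 for m in range(num_mics)]
    coarse_scores = {i: score(i * step_deg) for i in coarse_idx}
    coarse_best = max(coarse_scores, key=coarse_scores.get)

    # Step 2, fine positioning: d1 - 1 directions on each side of the coarse
    # result plus the coarse result itself, wrapping around the circle.
    fine_idx = [(coarse_best + k) % num_dirs for k in range(-(d1 - 1), d1)]
    fine_scores = {i: score(i * step_deg) for i in fine_idx}
    fine_best = max(fine_scores, key=fine_scores.get)

    return fine_best * step_deg, len(coarse_idx) + len(fine_idx)
```

With `num_mics = 6` and `d1 = 10`, the function evaluates 6 + 19 = 25 directions instead of 60, matching the count given for fig. 3.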

Here, regarding voice activity detection (VAD): when no speech is present, a sound source localization algorithm that keeps running may localize a noise source or produce random localization results. Therefore, the sound source localization algorithm needs to work together with a VAD algorithm: the DOA is output only when speech is detected; when speech is not present, the sound source localization output is NULL.
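The cooperation between VAD and localization can be sketched as a simple gate. The present application does not specify a particular VAD algorithm; the energy-threshold detector `energy_vad` below is only a stand-in assumed for the example, and the names `localize_with_vad` and `localize` are likewise hypothetical.

```python
import numpy as np


def energy_vad(frames, threshold=1e-4):
    """Stand-in VAD (assumption): mean signal power above a fixed threshold."""
    return float(np.mean(np.square(frames))) > threshold


def localize_with_vad(frames, localize, vad=energy_vad):
    """Run localization only when the VAD reports activity.

    localize: callable that takes the frames and returns a DOA.
    Returns None, corresponding to the NULL output, when no speech is found.
    """
    if not vad(frames):
        return None
    return localize(frames)
```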

An embodiment of the present application further provides a sound source localization apparatus, as shown in fig. 4, the apparatus includes:

the acquisition unit 41 is configured to acquire an audio signal by using an arranged microphone array, where the microphone array includes a plurality of microphones, and the plurality of microphones are respectively arranged in different directions of the audio acquisition device and are used to acquire audio signals from different directions;

a signal conversion unit 42 for determining a frequency domain signal of the audio signal;

a calculating unit 43, configured to calculate target generalized cross-correlation values in multiple directions based on frequency information of the frequency-domain signal corresponding to the microphone, where the target generalized cross-correlation value in any one of the multiple directions is used to characterize delay characteristics of frequency information arriving at a pair of microphones in the microphone array;

a localization unit 44 configured to determine a sound source direction of the audio signal from the plurality of directions based on the delay characteristics represented by the target generalized cross-correlation values in the plurality of directions.

In a specific example of the present application, the calculating unit 43 is further configured to:

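combine any two microphones in the microphone array to obtain N microphone pairs, where $N = C_M^2 = M(M-1)/2$ and M is the number of microphones in the microphone array.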

In a specific example of the present application, the positioning unit 44 is further configured to use a direction corresponding to a maximum value of the target generalized cross-correlation values as a sound source direction of the audio signal.

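In a specific example of the present application, the positioning unit 44 is further configured to:

select a suspected sound source direction of the audio signal from the plurality of directions based on the delay characteristics represented by the target generalized cross-correlation values;

determine a plurality of adjacent directions corresponding to the suspected sound source direction;

and calculate target generalized cross-correlation values of the plurality of adjacent directions, and determine the sound source direction of the audio signal from the plurality of adjacent directions.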

Here, it should be noted that: the descriptions of the embodiments of the apparatus are similar to the descriptions of the methods, and have the same advantages as the embodiments of the methods, and therefore are not repeated herein. For technical details that are not disclosed in the embodiments of the apparatus of the present invention, those skilled in the art should refer to the description of the embodiments of the method of the present invention to understand, and for brevity, will not be described again here.

An embodiment of the present application further provides a sound source localization device, including: one or more processors; a memory communicatively coupled to the one or more processors; and one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs being configured to perform the method described above.

In a specific example, the sound source localization device according to the embodiment of the present application may be embodied as the structure shown in fig. 5, and the sound source localization device includes at least a processor 51, a storage medium 52, and at least one external communication interface 53; the processor 51, the storage medium 52 and the external communication interface 53 are all connected by a bus 54. The processor 51 may be a microprocessor, a central processing unit, a digital signal processor, a programmable logic array, or another electronic component with processing functions. The storage medium stores computer executable code capable of performing the method of any of the above embodiments. In practical applications, the acquisition unit 41, the signal conversion unit 42, the calculation unit 43, and the positioning unit 44 can all be implemented by the processor 51.

Here, it should be noted that: the above description of the embodiment of the sound source localization device is similar to the above description of the method, and has the same beneficial effects as the embodiment of the method, and therefore, the description thereof is omitted. For technical details not disclosed in the embodiments of the sound source localization apparatus of the present invention, those skilled in the art should refer to the description of the embodiments of the method of the present invention to understand, and for the sake of brevity, no further description is provided here.

Embodiments of the present application also provide a computer-readable storage medium, which stores a computer program, and when the program is executed by a processor, the computer program implements the method described above.

A computer-readable storage medium can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). Additionally, the computer-readable storage medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.

The embodiments described above are only a part of the embodiments of the present invention, and not all of them. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Claims

1. A sound source localization method, characterized in that the method comprises:

2. The method of claim 1, further comprising:

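combining any two microphones in the microphone array to obtain N microphone pairs, wherein $N = C_M^2 = M(M-1)/2$ and M is the number of microphones in the microphone array.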

3. The method of claim 1 or 2, wherein the audio acquisition device calculates the target generalized cross-correlation values in a plurality of directions based on the frequency information of the frequency domain signal corresponding to the microphone, comprising:

4. The method of claim 1, wherein the audio acquisition device determines a sound source direction of the audio signal from the plurality of directions based on the delay characteristics characterized by the target generalized cross-correlation values in the plurality of directions, comprising:

5. The method of claim 1, wherein the audio acquisition device determines a sound source direction of the audio signal from the plurality of directions based on the delay characteristics characterized by the target generalized cross-correlation values in the plurality of directions, comprising:

6. The method according to claim 5, wherein said determining a sound source direction of the audio signal from the plurality of adjacent directions comprises:

7. A sound source localization apparatus, characterized in that the apparatus comprises:

8. The apparatus of claim 7, wherein the computing unit is further configured to:

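combine any two microphones in the microphone array to obtain N microphone pairs, wherein $N = C_M^2 = M(M-1)/2$ and M is the number of microphones in the microphone array.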

9. A sound source localization apparatus, comprising:

one or more processors;

a memory communicatively coupled to the one or more processors;

one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs being configured to perform the method of any of claims 1-6.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 6.