CN111986692A - Sound source tracking and pickup method and device based on microphone array - Google Patents


Info

Publication number
CN111986692A
CN111986692A (application number CN201910440423.8A)
Authority
CN
China
Prior art keywords
sound source, microphone array, voice activity, spatial spectral divergence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910440423.8A
Other languages
Chinese (zh)
Inventors
范展, 简小征, 姜开宇, 李傲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910440423.8A
Publication of CN111986692A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 — Noise filtering
    • G10L21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 — Microphone arrays; Beamforming

Abstract

Microphone array based sound source tracking and pickup methods and apparatus are described herein. The method comprises the following steps: estimating an instantaneous sound source azimuth θ̂ based on snapshot data received by the microphone array; calculating the spatial spectral divergence F(θ̂) of the speech-signal energy at the estimated instantaneous azimuth θ̂; detecting voice activity based on the calculated spatial spectral divergence F(θ̂); and, upon detection of voice activity, updating the sound source azimuth θ based on the estimated instantaneous azimuth θ̂.

Description

Sound source tracking and pickup method and device based on microphone array
Technical Field
The present disclosure relates to the field of microphone array technologies, and in particular, to a method and an apparatus for sound source tracking and pickup based on a microphone array.
Background
In recent years, with the rapid development of computer technology, people hope to control intelligent devices in more complex environments at a longer distance, and the traditional near-field voice technology cannot meet application requirements. Therefore, smart speech technology, especially far-field sound pickup technology based on microphone array, is becoming a current research focus. The dual-microphone array is a preferred solution for consumer electronics products such as smart televisions, smart speakers, mobile robots, and the like, due to advantages such as lower cost, flexibility in installation and use, and low power consumption compared to a multi-microphone array.
Beamforming is a core technology of microphone arrays: the acquired array data are weighted and summed so as to preserve signals from a desired direction while suppressing noise and interference from other directions, thereby enabling far-field sound pickup (usually beyond 1 meter). Beamforming is generally divided into two types. The first is data-independent beamforming, such as the delay-and-sum and fixed-summation methods, whose suppression of strong spatial interference sources is often unsatisfactory. The second is data-dependent beamforming, such as adaptive beamforming and adaptive sidelobe cancellation, whose weighting coefficients can be adjusted adaptively as the external environment changes in order to suppress strong interference sources; this type of beamforming, however, is very sensitive to errors in the underlying array model.
In application scenarios such as mobile robots and smart homes, a sound source is likely to be in motion, and therefore the position of the sound source relative to a microphone is often changed. The existing adaptive beamforming algorithm is very sensitive to directional deviation, and especially when the observation data contains an expected signal component, even a small directional deviation can easily distort a main lobe beam pattern, so that the expected signal is cancelled.
Disclosure of Invention
In view of the above, the present disclosure provides a microphone-array-based sound source tracking and pickup method and device.
According to a first aspect of the present disclosure, there is provided a microphone array based sound source tracking and pickup method, comprising: estimating an instantaneous sound source azimuth θ̂ based on snapshot data received by the microphone array; calculating the spatial spectral divergence F(θ̂) of the speech-signal energy at the estimated instantaneous azimuth θ̂; detecting voice activity based on the calculated spatial spectral divergence F(θ̂); and, upon detection of voice activity, updating the sound source azimuth θ based on the estimated instantaneous azimuth θ̂.
In some embodiments, estimating the instantaneous sound source azimuth θ̂ further comprises: estimating θ̂ over the operating frequency band of the microphone array by a maximum likelihood estimation method, based on N snapshots of data received by the microphone array within a time period, where N is a positive integer.
In some embodiments, employing the maximum likelihood estimation method further comprises: constructing a likelihood function L(φ, f_k) of an observation azimuth φ and the center frequency f_k of each sub-band of the operating band of the microphone array, where m = 0, 1 numbers the array elements of the microphone array, x_m is the snapshot data received by the m-th element, c is the propagation speed of sound waves in air, d is the element spacing of the microphone array, and the observation azimuth φ takes values in a set Φ of discrete observation azimuths covering the observation interval.
In some embodiments, calculating the spatial spectral divergence F(θ̂) further comprises: constructing a function F of the instantaneous sound source azimuth θ̂ to calculate the spatial spectral divergence F(θ̂) of the speech-signal energy at θ̂, where Φ is the set of discrete observation azimuths covering the observation interval and P(φ) is the spatial spectrum of the speech signal at observation azimuth φ.
In some embodiments, detecting voice activity further comprises: comparing the calculated spatial spectral divergence F(θ̂) with a predetermined detection threshold η to determine whether voice activity is detected.
In some embodiments, voice activity is determined to be detected when the spatial spectral divergence F(θ̂) is less than the detection threshold η, and determined not to be detected when F(θ̂) is greater than η (a small divergence indicates that the signal energy is concentrated around θ̂).
In some embodiments, updating the sound source azimuth θ further comprises performing the update θ ← μ·θ_prev + (1 − μ)·θ̂, where μ is a constant and θ_prev denotes the previous sound source azimuth.
In some embodiments, the method further comprises calculating adaptive beamforming weighting coefficients for the microphone array based on the updated sound source azimuth θ.
In some embodiments, calculating the adaptive beamforming weighting coefficients further comprises calculating them with a beam-pattern distortion constraint method comprising, in order: S101, constructing a feature matrix A and computing its value by solving a convex optimization problem (formula (7) below), where ε is a constant greater than 0 and less than 2 representing the beam-pattern distortion constraint factor, ā is the assumed steering vector calculated from the sound source azimuth θ, and R is the covariance matrix of the N valid snapshots; S102, judging the rank of A: when rank(A) > 1, performing a rank-1 decomposition of the feature matrix A (formula (8) below) to estimate the steering vector a, and when rank(A) = 1, performing an eigenvalue decomposition of A to estimate the steering vector a; S103, calculating the adaptive beamforming weighting coefficients w with a as the steering vector.
In some embodiments, the time period is 25ms in length.
In some embodiments, the microphone array is a dual microphone array.
According to a second aspect of the present disclosure, a microphone array based sound source tracking and pickup apparatus includes: an instantaneous sound source azimuth estimation module configured to estimate an instantaneous sound source azimuth θ̂ based on snapshot data received by the microphone array; a spatial spectral divergence calculation module configured to calculate the spatial spectral divergence F(θ̂) of the speech-signal energy at the estimated instantaneous azimuth θ̂; a voice activity detection module configured to detect voice activity based on the calculated spatial spectral divergence F(θ̂); and a sound source azimuth update module configured to, upon detection of voice activity, update the sound source azimuth θ based on the estimated instantaneous azimuth θ̂.
In some embodiments, the spatial spectral divergence calculation module being configured to calculate the spatial spectral divergence F(θ̂) further comprises: constructing a function F of the instantaneous sound source azimuth θ̂ to calculate the spatial spectral divergence F(θ̂) of the speech-signal energy at θ̂, where Φ is the set of discrete observation azimuths covering the observation interval and P(φ) is the spatial spectrum of the speech signal at observation azimuth φ.
In some embodiments, the apparatus further comprises a beamforming module configured to calculate adaptive beamforming weighting coefficients for the microphone array based on the updated sound source azimuth θ.
According to a third aspect of the present disclosure, there is provided a computing device comprising: a memory configured to store computer-executable instructions; a processor configured to perform any of the methods described above when the computer-executable instructions are executed by the processor.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed, perform any of the methods described above.
The method can effectively solve the problem of azimuth mismatch caused by sound source movement, and improves the robustness of the microphone array in a far field. The method and the device provided by the disclosure can realize far-field robust sound pickup when the position of a sound source is unknown and the position of the sound source moves relative to a microphone array. These and other advantages of the present disclosure will become apparent from and elucidated with reference to the embodiments described hereinafter.
Drawings
Embodiments of the present disclosure will now be described in more detail and with reference to the accompanying drawings, in which:
FIG. 1 illustrates an exemplary context diagram in which an embodiment according to the present disclosure may be implemented;
fig. 2 illustrates an exemplary flow diagram of a microphone array based sound source tracking and pickup method according to one embodiment of the present disclosure;
FIG. 3 illustrates a performance comparison graph of different beamforming algorithms according to one embodiment of the present disclosure;
FIG. 4 illustrates an exemplary schematic diagram of a microphone array based sound source tracking and pickup apparatus according to one embodiment of the present disclosure; and
fig. 5 illustrates an example system that includes an example computing device that represents one or more systems and/or devices that may implement the various techniques described herein.
Detailed Description
The following description provides specific details for a thorough understanding and enabling description of various embodiments of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these details. In some instances, well-known structures and functions have not been shown or described in detail to avoid unnecessarily obscuring the description of the embodiments of the disclosure. The terminology used in the present disclosure is to be understood in its broadest reasonable manner, even though it is being used in conjunction with a particular embodiment of the present disclosure.
First, some terms related to the embodiments of the present disclosure are explained so that those skilled in the art can understand that:
microphone array: the system comprises an audio front-end acquisition system consisting of a plurality of microphones, and the microphones are used for acquiring audio to acquire a source direction and performing beam forming calculation so as to achieve the purpose of enhancing the signal-to-noise ratio of an audio signal.
Beam forming: the microphone array only collects audio signals in a specific direction and suppresses audio signals in other directions.
Conventional beamforming: the objective is to select an appropriate weighting vector to compensate for the propagation delay of each element of the microphone so that the signals in the desired direction arrive at the array in phase, thereby producing a spatial response maximum in that direction. If analogized to a time-domain filter, beamforming can be viewed as a spatial filter, while the beam pattern is the spatial frequency response of the spatial filter. Conventional beamforming is very robust to model mismatch, mainly because it has fixed weighting coefficients whose characteristics do not change as the target signal, interference, and environmental noise characteristics change. But the conventional beamforming has very limited ability to suppress unknown strong interference.
Capon beamforming: which is a classical data dependent adaptive beamforming. The Capon beam forming is mainly characterized in that the weighting coefficient can be adaptively adjusted according to the change of the input data characteristics, so that the Capon beam forming has good inhibition capability on unknown strong interference components. The problem with this approach is that it is extremely sensitive to model mismatch.
Diagonal load beamforming: also known as noise injection. The method mainly aims at reducing the diffusion degree of the noise characteristic value of the covariance matrix caused by mismatching of the array steering vector and finite fast beat number, so that the influence of the noise characteristic vector on the self-adaptive weight vector is reduced. The algorithm can effectively reduce the distortion of the beam pattern and improve the robustness of the algorithm, but simultaneously, the null of the self-adaptive beam pattern is also lightened, and the interference suppression capability is reduced.
Worst-case optimal beamforming: the method is mainly aimed at the situation of mismatching of the steering vectors. Firstly, the real steering vectors are assumed to be distributed in the neighborhood of the assumed steering vectors, then an uncertain set of the steering vectors is constructed by utilizing the neighborhood, and finally, the minimum value of the output signal-to-interference-and-noise ratio corresponding to each vector in the set is maximized by applying constraint on the uncertain set, so that the weighting coefficient of the self-adaptive beam former is calculated.
FIG. 1 illustrates an exemplary context diagram in which an embodiment according to the present disclosure may be implemented. As shown in fig. 1, the microphone array 102 can capture sound over a range of angles and is not inherently directional. For example, when the microphone array 102 is a dual-microphone array, sound over an angular range of 180° can be collected. While the speaker 104 is engaged in voice activity, the microphone array 102 may beamform toward the speaker 104. After the microphone array 102 beamforms toward the speaker 104, the speaker 104's voice is enhanced, while noise outside the beam's directivity range is suppressed. It should be noted that the dual-microphone array is merely an example and is not limiting.
Fig. 2 illustrates an exemplary flow diagram 200 of a microphone array based sound source tracking and pickup method according to one embodiment of the present disclosure. In step 202, an instantaneous sound source azimuth θ̂ is estimated based on data received by the microphone array. In one embodiment, θ̂ is estimated over the operating band of the microphone array by a maximum likelihood estimation method, based on N snapshots of data received by the microphone array within a time period, where N is a positive integer. For example, the length of the time period may typically be 25 ms. Take a dual-microphone array consisting of two isotropic elements with element spacing d as an example. First, the operating band of the microphone array is divided into K mutually independent sub-bands with center frequencies f_1, f_2, …, f_K.
The sound source azimuth is estimated with a maximum likelihood estimation method in the following concrete steps:
In step 2022, taking the k-th sub-band (k = 1, 2, …, K) as an example (the other sub-bands are handled analogously), a likelihood function L(φ, f_k) of the observation azimuth φ and the center frequency f_k is constructed as formula (1), where x_m is the single snapshot of data received by the m-th element (m = 0, 1) and c is the propagation speed of sound waves in air.
In step 2024, the maximum value of equation (1) is obtained by iteration of equation (1) using newton's method, and the sound source direction of the sub-band is obtained:
Figure RE-DEST_PATH_IMAGE031
(2)。
other iterative methods may also be employed to solve the maximum of equation (1), as will be appreciated by those skilled in the art.
In step 2026, the information of the sound source directions calculated for the other sub-frequencies by the above method is integrated, and the total sound source direction is estimated according to the following equation:
Figure RE-DEST_PATH_IMAGE032
(3)。
other algorithms for estimating the direction of the sound source, such as minimum mean square error, etc., may also be employed, as will be appreciated by those skilled in the art.
In step 204, the spatial spectral divergence F(θ̂) of the speech-signal energy at the estimated instantaneous azimuth θ̂ is calculated. The spatial spectral divergence represents the degree of concentration of the signal energy in a given spatial direction, and is a function of angle. In an embodiment, the disclosure defines it by formula (4), where Φ is the set of observation azimuths obtained by discretizing the whole observation region and P(φ) is the spatial spectrum of the speech signal at observation azimuth φ. In a practical embodiment, P(φ) may be obtained with a conventional beamforming algorithm or with the Capon spatial spectrum estimation mentioned above. When a sound source appears in the direction θ̂, the signal energy converges toward θ̂ and the value of the divergence function F(θ̂) becomes smaller; conversely, when no sound source appears in the observation space, the signal energy is randomly distributed over the whole observation space and the value of F(θ̂) becomes larger. Voice activity detection can therefore be based on F(θ̂). The step of performing voice activity detection may comprise: substituting the sound source azimuth θ̂ estimated in formula (3) into formula (4) and calculating the spatial spectral divergence in the θ̂ direction, i.e. F(θ̂).
In step 206, voice activity is detected based on the calculated spatial spectral divergence F(θ̂). In one embodiment, a detection threshold η is preset. H1 denotes the state "speech signal present"; H0 denotes the state "no speech signal". When F(θ̂) ≥ η, the signal energy is distributed over the whole observation space and there is no speech signal. When F(θ̂) < η, the signal energy is concentrated in the θ̂ direction and there is a speech signal. That is, whether voice activity is detected may be determined according to:
F(θ̂) < η ⇒ H1,  F(θ̂) ≥ η ⇒ H0. (5)
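The divergence-based detection of steps 204–206 can be sketched as follows. The patent's divergence formula (4) is published only as an image, so the energy-spread measure below is a plausible stand-in (an assumption), as is the delay-and-sum spatial spectrum used for P(φ):

```python
import numpy as np

def spatial_spectrum(frames, fs, d=0.05, c=343.0,
                     grid=np.linspace(-90, 90, 181)):
    """Delay-and-sum spatial spectrum P(phi) for a 2-mic snapshot block."""
    n = frames.shape[1]
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    X = np.fft.rfft(frames, axis=1)
    tau = d * np.sin(np.deg2rad(grid)) / c
    shift = np.exp(2j * np.pi * np.outer(freqs, tau))        # (F, G)
    return np.sum(np.abs(X[0][:, None] + X[1][:, None] * shift) ** 2, axis=0)

def spectral_divergence(P, grid, theta_hat):
    """Energy spread of P(phi) around theta_hat: small when the energy is
    concentrated near theta_hat (a stand-in for the patent's formula (4))."""
    w = P / P.sum()
    return float(np.sum(w * (grid - theta_hat) ** 2))

def voice_active(F_val, eta):
    """Formula (5): H1 (speech present) when the divergence is below eta."""
    return F_val < eta
```

Any measure that shrinks as P(φ) concentrates around θ̂ and grows as it flattens would implement the same decision rule.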
In step 208, upon detection of voice activity, the sound source azimuth θ is updated based on the estimated instantaneous azimuth θ̂. The sound source azimuth θ is initialized at the beginning of the sound source detection process; for example, θ may be set to 0°. The instantaneous azimuth θ̂ calculated in each time period then updates the previous value of θ. Upon detection of voice activity, the sound source azimuth θ may be updated by
θ ← μ·θ_prev + (1 − μ)·θ̂, (6)
where μ is a constant with 0 < μ < 1 and θ_prev denotes the previous sound source azimuth. By defining the spatial spectral divergence and using it to judge whether voice activity is present, the method and device improve the robustness of microphone array azimuth tracking.
In one embodiment, the method 200 further comprises a step 210 of calculating adaptive beamforming weighting coefficients for the microphone array based on the updated sound source azimuth θ. The microphone adaptive beamforming process constitutes the microphone pickup. In one embodiment, the adaptive beamforming weighting coefficients of the microphone array are calculated with a beam-pattern distortion constraint method. In an embodiment, calculating the adaptive beamforming weighting coefficients may comprise the following steps:
In step 2102, a feature matrix A is constructed, and its value is computed by solving the convex optimization problem of formula (7), where ε is a constant greater than 0 and less than 2 representing the beam-pattern distortion constraint factor, ā is the assumed steering vector calculated from the sound source azimuth θ, tr(·) denotes the trace of a matrix, A ⪰ 0 denotes that A is positive semidefinite, and R is the covariance matrix of the N valid snapshots. Formula (7) can be solved with an interior-point method. Here s.t. denotes the constraints of the problem, likewise in the following formulas.
In step 2104, the rank of A is judged. When rank(A) > 1, a rank-1 decomposition of the feature matrix A is performed according to formula (8) to estimate the steering vector a; when rank(A) = 1, an eigenvalue decomposition of A is performed to estimate the steering vector a.
In step 2106, the adaptive beamforming weighting coefficients are calculated with a as the steering vector:
w = R⁻¹a / (aᴴ R⁻¹ a), (9)
where R is the covariance matrix of the N valid snapshots,
R = (1/N) Σ_{n=1}^{N} x(n) x(n)ᴴ. (10)
In step 212, the speech signal is enhanced with the updated beamforming coefficients, i.e.
y(n) = wᴴ x(n). (11)
As will be appreciated by those skilled in the art, beamforming algorithms other than the beam-pattern distortion constraint method, such as conventional beamforming, Capon beamforming, diagonal loading, or worst-case optimal beamforming, may also be used to calculate the adaptive beamforming weighting coefficients.
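As one of the alternatives just listed, a diagonally loaded Capon (MVDR) beamformer can be sketched as follows. This is not the patent's beam-pattern distortion constraint method; the loading level and function names are assumptions, while the weight formula matches the standard Capon form w = R⁻¹a/(aᴴR⁻¹a):

```python
import numpy as np

def steering_vector(theta_deg, freq, d=0.05, c=343.0):
    """Assumed steering vector of a 2-element array toward theta (one band)."""
    tau = d * np.sin(np.deg2rad(theta_deg)) / c
    return np.array([1.0, np.exp(-2j * np.pi * freq * tau)])

def mvdr_weights(snapshots, a, loading=1e-2):
    """Diagonally loaded Capon/MVDR weights.
    snapshots: (M, N) complex array of N snapshots; a: steering vector."""
    M, N = snapshots.shape
    R = snapshots @ snapshots.conj().T / N            # sample covariance (10)
    R += loading * np.trace(R).real / M * np.eye(M)   # diagonal loading
    Ri_a = np.linalg.solve(R, a)
    return Ri_a / (a.conj() @ Ri_a)                   # Capon form (9)

def beamform(snapshots, w):
    """Formula (11): enhanced output y(n) = w^H x(n)."""
    return w.conj() @ snapshots
```

The loading term inflates the noise eigenvalues of R, which, as noted in the definitions above, trades slightly shallower nulls for much better tolerance to steering-vector mismatch.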
Fig. 3 illustrates a performance comparison graph of several algorithms, where the abscissa indicates the observation direction deviation, i.e., the angle between the observation direction of the microphone array and the direction in which the sound source appears. The SINR curve of the solution proposed in the present disclosure in fig. 3 uses the observation azimuth obtained from the maximum likelihood sound source estimation and the spatial spectrum estimation described above. The standard beamforming, diagonal loading, and worst-case performance optimization algorithms use a fixed observation azimuth, i.e., an observation azimuth aligned with the sound source at 0 degrees of observation direction deviation. As can be seen from fig. 3, the technical solution proposed in the present disclosure achieves a significantly higher signal-to-interference-plus-noise ratio than the other beamforming methods.
Fig. 4 illustrates an exemplary schematic diagram of a microphone array based sound source tracking and pickup apparatus according to one embodiment of the present disclosure. In one embodiment, the microphone array based sound source tracking and pickup apparatus 400 includes an instantaneous sound source orientation estimation module 402, a spatial spectral divergence calculation module 404, a voice activity detection module 406, a sound source orientation update module 408, and a beamforming module 410.
The instantaneous sound source azimuth estimation module 402 is configured to estimate an instantaneous sound source azimuth θ̂ based on snapshot data received by the microphone array. In one embodiment, θ̂ is estimated over the operating band of the microphone array by a maximum likelihood estimation method, based on N snapshots of data received within a time period, where N is a positive integer. For example, the length of the time period may typically be 25 ms. Take a dual-microphone array consisting of two isotropic elements with element spacing d as an example. First, the operating band of the microphone array is divided into K mutually independent sub-bands with center frequencies f_1, f_2, …, f_K. The sound source azimuth is estimated with a maximum likelihood estimation method. Specifically:
First, taking the k-th (k = 1, 2, …, K) sub-band as an example (the other sub-bands are handled analogously), a likelihood function L(θ, f_k) of the observation direction θ and the center frequency f_k is constructed, see formula (1) above. In formula (1), x_m is the single snapshot data received by the m-th (m = 0, 1) array element, and c is the propagation speed of sound waves in air.
Then, formula (1) is iterated by Newton's method to find its maximum, yielding the sound source direction θ̂_k of the sub-band, see formula (2) above. As those skilled in the art will appreciate, other iterative methods may also be employed to find the maximum of formula (1).
Then, the sound source direction information calculated as above for the other sub-bands is combined to estimate the overall sound source direction θ̂, see formula (3) above. As those skilled in the art will appreciate, other sound source estimation algorithms may also be employed.
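Formulas (1)-(3) are not reproduced in this text, so the following sketch is only a hypothetical stand-in for the described steps: it grid-searches a per-sub-band likelihood for a two-element array (in place of the Newton iteration) and combines the sub-band estimates by simple averaging. The function names, the grid search, and the averaging rule are all assumptions, not the disclosure's exact method.

```python
import numpy as np

C = 343.0  # propagation speed of sound in air (m/s)

def ml_doa_subband(x0, x1, f_k, d, angles_deg):
    """Per-sub-band direction estimate for a two-element array: choose the
    angle whose steering phase best aligns the two channels (a grid search
    standing in for the Newton iteration of formula (2)).
    x0, x1: complex snapshot arrays of the two elements for this sub-band."""
    angles = np.deg2rad(np.asarray(angles_deg, dtype=float))
    tau = d * np.sin(angles) / C                    # inter-element delay per angle
    phase = np.exp(2j * np.pi * f_k * tau)          # steering phase of element 1
    # likelihood surface: energy of the phase-aligned sum over all snapshots
    surface = np.abs(x0[:, None] + x1[:, None] * np.conj(phase)[None, :]) ** 2
    return float(np.asarray(angles_deg)[np.argmax(surface.sum(axis=0))])

def ml_doa(subband_snapshots, center_freqs, d, angles_deg=range(-90, 91)):
    """Overall azimuth: average of the per-sub-band estimates (a simple
    stand-in for the combination step of formula (3))."""
    estimates = [ml_doa_subband(x0, x1, f, d, angles_deg)
                 for (x0, x1), f in zip(subband_snapshots, center_freqs)]
    return float(np.mean(estimates))
```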
The spatial spectral divergence calculation module 404 is configured to calculate the spatial spectral divergence F(θ̂) of the energy of the speech signal at the estimated instantaneous sound source azimuth θ̂. The spatial spectral divergence, a function of angle that represents the degree of concentration of signal energy in a given spatial direction, is defined in the present disclosure according to formula (4) described above, where Θ is the set of observation directions obtained by discretizing the whole observation region and P(θ) is the energy of the speech signal in observation direction θ. In a practical embodiment, P(θ) may be obtained using a conventional beamforming algorithm or Capon spatial spectrum estimation as mentioned above. When a sound source appears in direction θ₀, the signal energy converges in the θ₀ direction and the divergence F(θ₀) becomes small; in contrast, when no sound source appears in the observation space, the signal energy is randomly distributed over the whole observation space and F(θ₀) becomes large. Voice activity detection can therefore be performed based on F(θ₀). The specific implementation of voice activity detection is as follows: the sound source azimuth θ̂ estimated by formula (3) is substituted into formula (4), and the spatial spectral divergence F(θ̂) in the θ̂ direction is calculated.
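Formula (4) itself is not reproduced in this text; the ratio below is one hypothetical definition that matches the described behaviour (small when the spatial spectrum concentrates at the estimated azimuth, large when energy is spread over the whole observation region):

```python
import numpy as np

def spatial_spectral_divergence(P, idx_hat):
    """Divergence of the spatial spectrum P (one energy value per discrete
    observation direction in the set of directions) evaluated at the index
    idx_hat of the estimated azimuth.  Mean energy over all directions
    divided by the energy at the estimate: an assumed form, not formula (4)."""
    P = np.asarray(P, dtype=float)
    return float(P.mean() / P[idx_hat])
```

A spectrum with a peak at the estimate yields a value well below 1, while a flat spectrum yields exactly 1, so a threshold between the two separates the two hypothesis states.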
the voice activity detection module 406 is configured to be based onCalculated spatial spectral divergence
Figure RE-2080DEST_PATH_IMAGE012
Voice activity is detected. In one embodiment, the detection threshold is preset
Figure RE-953855DEST_PATH_IMAGE015
。H1A state indicating "speech signal present"; h0Indicating a "no speech signal" state. When in use
Figure RE-692004DEST_PATH_IMAGE038
When the signal energy is distributed in the whole observation space, no speech signal exists. When in use
Figure RE-680820DEST_PATH_IMAGE039
The signal energy is gathered in
Figure RE-137209DEST_PATH_IMAGE001
In the direction, there is a speech signal, see equation (5) above.
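The decision of formula (5) then reduces to a single threshold comparison; the sketch below assumes the convention described above (divergence below the threshold means the energy is concentrated at the estimated azimuth, i.e. state H1):

```python
def detect_voice_activity(divergence, threshold):
    """Hypothesis test of formula (5) as described in the text:
    divergence < threshold -> H1, speech signal present (returns True)
    divergence > threshold -> H0, no speech signal      (returns False)"""
    return divergence < threshold
```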
The sound source position update module 408 is configured to, upon detection of voice activity, update the sound source azimuth θ̄ based on the estimated instantaneous sound source azimuth θ̂. The sound source azimuth θ̄ is initialized at the beginning of the sound source detection process; for example, θ̄ may be set to 0°. The instantaneous sound source azimuth θ̂ calculated for each time period is then used to update the previous θ̄. Upon detection of voice activity, the sound source azimuth θ̄ is updated, see formula (6) above, where α is a constant with 0 < α < 1.
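Formula (6) is not reproduced here; exponential smoothing of the previous azimuth toward the new instantaneous estimate is one plausible form of such an update, with the constant α controlling how quickly the tracked azimuth follows the source:

```python
def update_azimuth(theta_prev, theta_inst, alpha=0.9, voice_active=True):
    """Tracked-azimuth update applied once per time period.  An assumed
    exponential-smoothing form (not necessarily formula (6)): the track is
    only moved when voice activity was detected, and the constant
    0 < alpha < 1 weights the previous azimuth against the new estimate."""
    if not voice_active:
        return theta_prev          # no speech detected: keep the old track
    return alpha * theta_prev + (1.0 - alpha) * theta_inst
```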
the beamforming module 410 is configured for updating the location of the sound source based on the updated location of the sound source
Figure RE-785173DEST_PATH_IMAGE003
Adaptive beamforming weighting coefficients for the microphone array are calculated. The process of microphone adaptive beamforming is microphone pickup. In one embodiment, adaptive beamforming weighting coefficients for the microphone array are calculated using a beam pattern distortion constraint. The method specifically comprises the following steps:
constructing a feature matrix
Figure RE-154975DEST_PATH_IMAGE019
Calculated by solving the convex optimization problem of equation (7) described above
Figure RE-576729DEST_PATH_IMAGE020
The value of (c). Wherein
Figure RE-514729DEST_PATH_IMAGE022
Is a constant greater than 0 and less than 2, which represents a beampattern distortion constraint factor,
Figure RE-654723DEST_PATH_IMAGE023
according to the direction of sound source
Figure RE-93795DEST_PATH_IMAGE003
The calculated assumed steering vector is calculated as a vector,
Figure RE-245422DEST_PATH_IMAGE044
and performing trace calculation on the matrix. The above formula (7) can be obtained by adopting an interior point method, and the operation amount is
Figure RE-478957DEST_PATH_IMAGE045
And judging the rank of A, and when the rank (A) is greater than 1, performing rank 1 decomposition on the feature matrix A according to the formula (8) to estimate the guide vector a. When rank (a) =1, eigenvalue decomposition is performed on the matrix a to estimate the steering vector a.
The adaptive beamforming weighting coefficients are calculated with a as the steering vector according to equation (9) described above. Wherein
Figure RE-840668DEST_PATH_IMAGE024
The covariance matrix for the N valid snapshot data is referred to as equation (10) above.
In one embodiment, a speech signal enhancement module is further included, which enhances the speech signal with the updated beamforming coefficients, see formula (11) described above. As those skilled in the art will appreciate, beamforming algorithms other than the beam pattern distortion constraint may also be employed, such as conventional beamforming, Capon beamforming, diagonal loading, and worst-case performance optimization beamforming.
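The disclosure's own method obtains the steering vector from the convex problem of formula (7); since that formula is not reproduced here, the sketch below instead uses Capon (MVDR) weights, one of the alternative beamformers named above, together with a formula (11)-style enhancement output. The two-element steering-vector model and the diagonal loading level are assumptions.

```python
import numpy as np

C = 343.0  # propagation speed of sound in air (m/s)

def steering_vector(theta_deg, d, f):
    """Assumed steering vector of a two-element array for azimuth theta."""
    tau = d * np.sin(np.deg2rad(theta_deg)) / C
    return np.array([1.0, np.exp(-2j * np.pi * f * tau)])

def capon_weights(R, a, loading=1e-3):
    """Capon/MVDR weights w = R^{-1} a / (a^H R^{-1} a); a small diagonal
    loading keeps the inversion stable on estimated covariances."""
    M = len(a)
    Rl = R + loading * (np.trace(R).real / M) * np.eye(M)
    Ri_a = np.linalg.solve(Rl, a)
    return Ri_a / (a.conj() @ Ri_a)

def enhance(w, X):
    """Enhanced output y[n] = w^H x[n] for each snapshot x[n] (row of X)."""
    return X @ w.conj()
```

With R estimated from the N valid snapshots as in formula (10), the distortionless constraint w^H a = 1 holds by construction, so a signal arriving from the tracked azimuth passes through the beamformer undistorted.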
Fig. 5 illustrates an example system 500 that includes an example computing device 510 that represents one or more systems and/or devices that may implement the various techniques described herein. Computing device 510 may be, for example, a server of a service provider, a device associated with a client (e.g., a client device), a system on a chip, and/or any other suitable computing device or computing system. The microphone array based sound source tracking and pickup apparatus 400 described above with respect to fig. 4 may take the form of a computing device 510. Alternatively, the microphone array based sound source tracking and pickup apparatus 400 may be implemented as a computer program in the form of a sound source tracking and pickup application 516.
The example computing device 510 as illustrated includes a processing system 511, one or more computer-readable media 512, and one or more I/O interfaces 513 communicatively coupled to each other. Although not shown, the computing device 510 may also include a system bus or other data and command transfer system that couples the various components to one another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. Various other examples are also contemplated, such as control and data lines.
Processing system 511 represents functionality that performs one or more operations using hardware. Thus, the processing system 511 is illustrated as including hardware elements 514 that may be configured as processors, functional blocks, and so forth. This may include implementation in hardware as an application specific integrated circuit or other logic device formed using one or more semiconductors. Hardware element 514 is not limited by the material from which it is formed or the processing mechanisms employed therein. For example, a processor may be comprised of semiconductor(s) and/or transistors (e.g., electronic Integrated Circuits (ICs)). In such a context, processor-executable instructions may be electronically-executable instructions.
The computer-readable medium 512 is illustrated as including a memory/storage device 515. Memory/storage 515 represents memory/storage capacity associated with one or more computer-readable media. The memory/storage 515 may include volatile media (such as Random Access Memory (RAM)) and/or nonvolatile media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). The memory/storage 515 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) as well as removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). The computer-readable medium 512 may be configured in various other ways as further described below.
One or more I/O interfaces 513 represent functionality that allows a user to enter commands and information to computing device 510, and optionally also allows information to be presented to the user and/or other components or devices using various input/output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice input), a scanner, touch functionality (e.g., capacitive or other sensors configured to detect physical touch), a camera (e.g., motion that may not involve touch may be detected as gestures using visible or invisible wavelengths such as infrared frequencies), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, a haptic response device, and so forth. Accordingly, the computing device 510 may be configured in various ways to support user interaction, as described further below.
The computing device 510 also includes a sound source tracking and pickup application 516. The sound source tracking and pickup application 516 may be, for example, a software instance of the microphone array based sound source tracking and pickup apparatus 400 described with respect to fig. 4, and in combination with other elements in the computing device 510 implement the techniques described herein.
Various techniques may be described herein in the general context of software, hardware elements, or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
An implementation of the described modules and techniques may be stored on or transmitted across some form of computer readable media. Computer readable media can include a variety of media that can be accessed by computing device 510. By way of example, and not limitation, computer-readable media may comprise "computer-readable storage media" and "computer-readable signal media".
"computer-readable storage medium" refers to a medium and/or device, and/or a tangible storage apparatus, capable of persistently storing information, as opposed to mere signal transmission, carrier wave, or signal per se. Accordingly, computer-readable storage media refers to non-signal bearing media. Computer-readable storage media include hardware such as volatile and nonvolatile, removable and non-removable media and/or storage devices implemented in a method or technology suitable for storage of information such as computer-readable instructions, data structures, program modules, logic elements/circuits or other data. Examples of computer readable storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage devices, tangible media, or an article of manufacture suitable for storing the desired information and accessible by a computer.
"computer-readable signal medium" refers to a signal-bearing medium configured to transmit instructions to hardware of computing device 510, such as via a network. Signal media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave, data signal or other transport mechanism. Signal media also includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
As previously described, hardware element 514 and computer-readable medium 512 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware form that may be used in some embodiments to implement at least some aspects of the techniques described herein. The hardware elements may include integrated circuits or systems-on-chips, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), Complex Programmable Logic Devices (CPLDs), and other implementations in silicon or components of other hardware devices. In this context, a hardware element may serve as a processing device that performs program tasks defined by instructions, modules, and/or logic embodied by the hardware element, as well as a hardware device for storing instructions for execution, such as the computer-readable storage medium described previously.
Combinations of the foregoing may also be used to implement the various techniques and modules described herein. Thus, software, hardware, or program modules and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage medium and/or by one or more hardware elements 514. The computing device 510 may be configured to implement particular instructions and/or functions corresponding to software and/or hardware modules. Thus, implementation of a module executable by the computing device 510 as software may be achieved at least partially in hardware, for example, using the computer-readable storage media and/or hardware elements 514 of the processing system. The instructions and/or functions may be executable/operable by one or more articles of manufacture (e.g., one or more computing devices 510 and/or processing systems 511) to implement the techniques, modules, and examples described herein.
In various implementations, the computing device 510 may assume a variety of different configurations. For example, the computing device 510 may be implemented as a computer-like device including a personal computer, a desktop computer, a multi-screen computer, a laptop computer, a netbook, and so forth. The computing device 510 may also be implemented as a mobile device-like device including mobile devices such as mobile telephones, portable music players, portable gaming devices, tablet computers, multi-screen computers, and the like. Computing device 510 may also be implemented as a television-like device that includes devices with or connected to a generally larger screen in a casual viewing environment. These devices include televisions, set-top boxes, game consoles, and the like.
The techniques described herein may be supported by these various configurations of computing device 510 and are not limited to specific examples of the techniques described herein. Functionality may also be implemented in whole or in part on the "cloud" 520 through the use of a distributed system, such as through the platform 522 described below.
Cloud 520 includes and/or is representative of a platform 522 for resources 524. The platform 522 abstracts underlying functionality of hardware (e.g., servers) and software resources of the cloud 520. The resources 524 may include applications and/or data that may be used when executing computer processes on servers remote from the computing device 510. The resources 524 may also include services provided over the internet and/or over a subscriber network such as a cellular or Wi-Fi network.
The platform 522 may abstract resources and functionality to connect the computing device 510 with other computing devices. The platform 522 may also serve to abstract the scaling of resources to provide a corresponding level of scale to encountered demand for the resources 524 implemented via the platform 522. Accordingly, in interconnected device embodiments, implementation of the functionality described herein may be distributed throughout the system 500. For example, the functionality may be implemented in part on the computing device 510 and in part by the platform 522 that abstracts the functionality of the cloud 520.
It should be understood that, for clarity, embodiments of the disclosure have been described with reference to different functional modules. However, it will be apparent that the functionality of each functional module may be implemented in a single module, in multiple modules, or as part of other functional modules without departing from the disclosure. For example, functionality illustrated as being performed by a single module may be performed by multiple different modules. Thus, references to specific functional modules are only to be seen as references to suitable means for providing the described functionality rather than as indicative of a strict logical or physical structure or organization. Thus, the present disclosure may be implemented in a single module or may be physically and functionally distributed between different modules and circuits.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various devices, elements, or components, these devices, elements, or components should not be limited by these terms. These terms are only used to distinguish one device, element, or component from another device, element, or component.
Although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present disclosure is limited only by the accompanying claims. Additionally, although individual features may be included in different claims, these may possibly advantageously be combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. The order of features in the claims does not imply any specific order in which the features must be performed. Furthermore, in the claims, the word "comprising" does not exclude other elements, and the indefinite article "a" or "an" does not exclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.

Claims (15)

1. A microphone array based sound source tracking and pickup method, comprising:
estimating an instantaneous sound source azimuth θ̂ based on snapshot data received by the microphone array;
calculating a spatial spectral divergence F(θ̂) of the energy of the speech signal at the estimated instantaneous sound source azimuth θ̂;
detecting voice activity based on the calculated spatial spectral divergence F(θ̂); and
upon detection of the voice activity, updating a sound source azimuth θ̄ based on the estimated instantaneous sound source azimuth θ̂.
2. The method of claim 1, wherein said estimating an instantaneous sound source azimuth θ̂ further comprises:
estimating the instantaneous sound source azimuth θ̂ over a working frequency band of the microphone array by a maximum likelihood estimation method, based on N snapshot data received by the microphone array within a time period, where N is a positive integer.
3. The method of claim 2, wherein said employing maximum likelihood estimation further comprises:
constructing a likelihood function L(θ, f_k) of an observation azimuth θ and the center frequency f_k of each sub-band of the working frequency band of the microphone array, wherein m = 0, 1 are the serial numbers of the array elements of the microphone array, x_m is the snapshot data received by the m-th array element, c is the propagation speed of sound waves in air, d is the array element spacing of the microphone array, and the observation azimuth θ belongs to a set Θ of discrete observation azimuths of the observation interval.
4. The method of claim 1, wherein said calculating the spatial spectral divergence F(θ̂) further comprises:
constructing a function F(θ̂) of the instantaneous sound source azimuth θ̂ to calculate the spatial spectral divergence F(θ̂) of the energy of the speech signal at the instantaneous sound source azimuth θ̂, wherein Θ is a set of discrete observation azimuths of the observation interval and P(θ) is the energy of the speech signal in observation direction θ.
5. The method of claim 1, wherein said detecting voice activity further comprises:
comparing the calculated spatial spectral divergence F(θ̂) with a predetermined detection threshold γ to determine whether voice activity is detected.
6. The method of claim 5, wherein it is determined that the voice activity is detected when the spatial spectral divergence F(θ̂) is less than the detection threshold γ; and it is determined that no voice activity is detected when the spatial spectral divergence F(θ̂) is greater than the detection threshold γ.
7. The method of claim 1, wherein said updating the sound source azimuth θ̄ further comprises performing the updating according to an update formula in which α is a constant and θ̄_prev denotes the previous sound source azimuth.
8. The method of claim 1, further comprising:
calculating adaptive beamforming weighting coefficients for the microphone array based on the updated sound source azimuth θ̄.
9. The method of claim 8, wherein said calculating adaptive beamforming weighting coefficients further comprises:
calculating the adaptive beamforming weighting coefficients by a beam pattern distortion constraint method, which comprises in order:
S101, constructing a feature matrix A and calculating its value by solving a convex optimization problem, wherein ε is a constant greater than 0 and less than 2 representing the beam pattern distortion constraint factor, ā is the assumed steering vector calculated from the sound source azimuth θ̄, and R is the covariance matrix of the N valid snapshot data;
S102, judging the rank of A: when Rank(A) > 1, performing a rank-1 decomposition on the feature matrix A to estimate a steering vector a, and when Rank(A) = 1, performing an eigenvalue decomposition on the matrix A to estimate the steering vector a; and
S103, calculating the adaptive beamforming weighting coefficients w with a as the steering vector.
10. The method of claim 2, wherein the time period is 25ms in length.
11. A method as claimed in any preceding claim wherein the microphone array is a dual microphone array.
12. A microphone array based sound source tracking and pickup apparatus, comprising:
an instantaneous sound source orientation estimation module configured to estimate an instantaneous sound source azimuth θ̂ based on snapshot data received by the microphone array;
a spatial spectral divergence calculation module configured to calculate a spatial spectral divergence F(θ̂) of the energy of the speech signal at the estimated instantaneous sound source azimuth θ̂;
a voice activity detection module configured to detect voice activity based on the calculated spatial spectral divergence F(θ̂); and
a sound source azimuth update module configured to, upon detection of the voice activity, update a sound source azimuth θ̄ based on the estimated instantaneous sound source azimuth θ̂.
13. The apparatus of claim 12, wherein the spatial spectral divergence calculation module is further configured to:
construct a function F(θ̂) of the instantaneous sound source azimuth θ̂ to calculate the spatial spectral divergence F(θ̂) of the energy of the speech signal at the instantaneous sound source azimuth θ̂, wherein Θ is a set of discrete observation azimuths of the observation interval and P(θ) is the energy of the speech signal in observation direction θ.
14. A non-transitory computer readable medium comprising computer program instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1-11.
15. A computing device comprising a processor and a memory having stored thereon a computer program configured to, when executed on the processor, cause the processor to perform the method of any of claims 1-11.
CN201910440423.8A 2019-05-24 2019-05-24 Sound source tracking and pickup method and device based on microphone array Pending CN111986692A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910440423.8A CN111986692A (en) 2019-05-24 2019-05-24 Sound source tracking and pickup method and device based on microphone array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910440423.8A CN111986692A (en) 2019-05-24 2019-05-24 Sound source tracking and pickup method and device based on microphone array

Publications (1)

Publication Number Publication Date
CN111986692A true CN111986692A (en) 2020-11-24

Family

ID=73436706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910440423.8A Pending CN111986692A (en) 2019-05-24 2019-05-24 Sound source tracking and pickup method and device based on microphone array

Country Status (1)

Country Link
CN (1) CN111986692A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105609100A (en) * 2014-10-31 2016-05-25 中国科学院声学研究所 Acoustic model training and constructing method, acoustic model and speech recognition system
CN105874535A (en) * 2014-01-15 2016-08-17 宇龙计算机通信科技(深圳)有限公司 Speech processing method and speech processing apparatus
CN106098075A (en) * 2016-08-08 2016-11-09 腾讯科技(深圳)有限公司 Audio collection method and apparatus based on microphone array
CN106504763A (en) * 2015-12-22 2017-03-15 电子科技大学 Based on blind source separating and the microphone array multiple target sound enhancement method of spectrum-subtraction
CN106952653A (en) * 2017-03-15 2017-07-14 科大讯飞股份有限公司 Noise remove method, device and terminal device
WO2017129239A1 (en) * 2016-01-27 2017-08-03 Nokia Technologies Oy System and apparatus for tracking moving audio sources
US10096328B1 (en) * 2017-10-06 2018-10-09 Intel Corporation Beamformer system for tracking of speech and noise in a dynamic environment
CN108962272A (en) * 2018-06-21 2018-12-07 湖南优浪语音科技有限公司 Sound pick-up method and system

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951273A (en) * 2021-02-02 2021-06-11 Zhengzhou University Numerical control machine tool cutter abrasion monitoring device based on microphone array and machine vision
CN112951273B (en) * 2021-02-02 2024-03-29 Zhengzhou University Numerical control machine tool cutter abrasion monitoring device based on microphone array and machine vision
CN113608167A (en) * 2021-10-09 2021-11-05 Alibaba Damo Academy (Hangzhou) Technology Co Ltd Sound source localization method, apparatus and device
CN113608167B (en) * 2021-10-09 2022-02-08 Alibaba Damo Academy (Hangzhou) Technology Co Ltd Sound source localization method, apparatus and device
WO2023108864A1 (en) * 2021-12-15 2023-06-22 苏州蛙声科技有限公司 Regional pickup method and system for miniature microphone array device

Similar Documents

Publication Publication Date Title
CN108122563B (en) Method for improving voice wake-up rate and correcting DOA
US10909988B2 (en) Systems and methods for displaying a user interface
CN110931036B (en) Microphone array beam forming method
US8583428B2 (en) Sound source separation using spatial filtering and regularization phases
CN111986692A (en) Sound source tracking and pickup method and device based on microphone array
WO2020108614A1 (en) Audio recognition method, and target audio positioning method, apparatus and device
JP4799443B2 (en) Sound receiving device and method
US20160192068A1 (en) Steering vector estimation for minimum variance distortionless response (mvdr) beamforming circuits, systems, and methods
US8098842B2 (en) Enhanced beamforming for arrays of directional microphones
CN109616136B (en) Adaptive beam forming method, device and system
US7626889B2 (en) Sensor array post-filter for tracking spatial distributions of signals and noise
US8005237B2 (en) Sensor array beamformer post-processor
US20130082875A1 (en) Processing Signals
CN110554357B (en) Sound source positioning method and device
US10957338B2 (en) 360-degree multi-source location detection, tracking and enhancement
US11218802B1 (en) Beamformer rotation
CN106537501A (en) Reverberation estimator
US11222646B2 (en) Apparatus and method for generating audio signal with noise attenuated based on phase change rate
WO2022105571A1 (en) Speech enhancement method and apparatus, and device and computer-readable storage medium
TW202125989A (en) Switching method for multiple antenna arrays and electronic device applying the same
Luo et al. Constrained maximum directivity beamformers based on uniform linear acoustic vector sensor arrays
CN113055071B (en) Switching method of multiple groups of array antennas and electronic device applying same
US11508348B2 (en) Directional noise suppression
CN111245490B (en) Broadband signal extraction method and device and electronic equipment
CN113491137B (en) Flexible differential microphone array with fractional order

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination