CN111986692A - Sound source tracking and pickup method and device based on microphone array - Google Patents


Info

Publication number
CN111986692A
CN111986692A (application number CN201910440423.8A)
Authority
CN
China
Prior art keywords
sound source, microphone array, voice activity, spatial spectral divergence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910440423.8A
Other languages
Chinese (zh)
Inventors
范展, 简小征, 姜开宇, 李傲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910440423.8A
Publication of CN111986692A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 — Noise filtering
    • G10L21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 — Microphone arrays; Beamforming

Abstract

Microphone array based sound source tracking and pickup methods and apparatus are described herein. The method comprises the following steps: estimating an instantaneous sound source azimuth θ̂ based on snapshot data received by the microphone array; calculating the spatial spectral divergence F(θ̂) of the speech-signal energy at the estimated instantaneous azimuth θ̂; detecting voice activity based on the calculated spatial spectral divergence F(θ̂); and, upon detection of voice activity, updating the sound source azimuth θ based on the estimated instantaneous azimuth θ̂.

Description

Sound source tracking and pickup method and device based on microphone array
Technical Field
The present disclosure relates to the field of microphone array technologies, and in particular, to a method and an apparatus for sound source tracking and pickup based on a microphone array.
Background
In recent years, with the rapid development of computer technology, people hope to control intelligent devices in more complex environments at a longer distance, and the traditional near-field voice technology cannot meet application requirements. Therefore, smart speech technology, especially far-field sound pickup technology based on microphone array, is becoming a current research focus. The dual-microphone array is a preferred solution for consumer electronics products such as smart televisions, smart speakers, mobile robots, and the like, due to advantages such as lower cost, flexibility in installation and use, and low power consumption compared to a multi-microphone array.
Beamforming is a core technology of microphone arrays: the acquired array data are weighted and summed so as to preserve signals from a desired direction while suppressing noise and interference from other directions, thereby enabling far-field sound pickup (usually beyond 1 meter). Beamforming is generally divided into two types. The first is data-independent beamforming, such as the delay-and-sum and fixed-summation methods, whose suppression of strong spatial interference sources is often unsatisfactory. The second is data-dependent beamforming, such as adaptive beamforming and adaptive sidelobe cancellation, whose weighting coefficients can be adjusted adaptively as the external environment changes in order to suppress strong interference sources; this type of beamforming, however, is very sensitive to errors in the underlying array model.
In application scenarios such as mobile robots and smart homes, a sound source is likely to be in motion, and therefore the position of the sound source relative to a microphone is often changed. The existing adaptive beamforming algorithm is very sensitive to directional deviation, and especially when the observation data contains an expected signal component, even a small directional deviation can easily distort a main lobe beam pattern, so that the expected signal is cancelled.
Disclosure of Invention
In view of the above, the present disclosure provides a microphone-array-based sound source tracking and pickup method and device.
According to a first aspect of the present disclosure, there is provided a microphone array based sound source tracking and pickup method, comprising: estimating an instantaneous sound source azimuth θ̂ based on snapshot data received by the microphone array; calculating the spatial spectral divergence F(θ̂) of the speech-signal energy at the estimated instantaneous azimuth θ̂; detecting voice activity based on the calculated spatial spectral divergence F(θ̂); and, upon detection of voice activity, updating the sound source azimuth θ based on the estimated instantaneous azimuth θ̂.
In some embodiments, estimating the instantaneous sound source azimuth θ̂ further comprises: estimating θ̂ over the operating frequency band of the microphone array by a maximum likelihood estimation method, based on N snapshots of data received by the microphone array within a time period, where N is a positive integer.
In some embodiments, employing the maximum likelihood estimation method further comprises: constructing a likelihood function L(φ, f_k) of an observation azimuth φ and the center frequency f_k of each sub-band of the operating band of the microphone array, where m = 0, 1 numbers the array elements of the microphone array, x_m is the snapshot data received by the m-th element, c is the propagation speed of sound waves in air, d is the element spacing of the microphone array, and the observation azimuth φ takes values in a set Φ of discrete observation azimuths covering the observation interval.
In some embodiments, calculating the spatial spectral divergence F(θ̂) further comprises: constructing a function F of the instantaneous sound source azimuth θ̂ to calculate the spatial spectral divergence F(θ̂) of the speech-signal energy at θ̂, where Φ is the set of discrete observation azimuths covering the observation interval and P(φ) is the spatial spectrum of the speech signal at observation azimuth φ.
In some embodiments, detecting voice activity further comprises: comparing the calculated spatial spectral divergence F(θ̂) with a predetermined detection threshold η to determine whether voice activity is detected.
In some embodiments, voice activity is determined to be detected when the spatial spectral divergence F(θ̂) is less than the detection threshold η, and determined not to be detected when F(θ̂) is greater than η (a small divergence indicates that the signal energy is concentrated around θ̂).
In some embodiments, updating the sound source azimuth θ further comprises performing the update θ ← μ·θ_prev + (1 − μ)·θ̂, where μ is a constant and θ_prev denotes the previous sound source azimuth.
In some embodiments, the method further comprises calculating adaptive beamforming weighting coefficients for the microphone array based on the updated sound source azimuth θ.
In some embodiments, calculating the adaptive beamforming weighting coefficients further comprises calculating them with a beam-pattern distortion constraint method comprising, in order: S101, constructing a feature matrix A and computing its value by solving a convex optimization problem (formula (7) below), where ε is a constant greater than 0 and less than 2 representing the beam-pattern distortion constraint factor, ā is the assumed steering vector calculated from the sound source azimuth θ, and R is the covariance matrix of the N valid snapshots; S102, judging the rank of A: when rank(A) > 1, performing a rank-1 decomposition of the feature matrix A (formula (8) below) to estimate the steering vector a, and when rank(A) = 1, performing an eigenvalue decomposition of A to estimate the steering vector a; S103, calculating the adaptive beamforming weighting coefficients w with a as the steering vector.
In some embodiments, the time period is 25ms in length.
In some embodiments, the microphone array is a dual microphone array.
According to a second aspect of the present disclosure, a microphone array based sound source tracking and pickup apparatus includes: an instantaneous sound source azimuth estimation module configured to estimate an instantaneous sound source azimuth θ̂ based on snapshot data received by the microphone array; a spatial spectral divergence calculation module configured to calculate the spatial spectral divergence F(θ̂) of the speech-signal energy at the estimated instantaneous azimuth θ̂; a voice activity detection module configured to detect voice activity based on the calculated spatial spectral divergence F(θ̂); and a sound source azimuth update module configured to, upon detection of voice activity, update the sound source azimuth θ based on the estimated instantaneous azimuth θ̂.
In some embodiments, the spatial spectral divergence calculation module being configured to calculate the spatial spectral divergence F(θ̂) further comprises: constructing a function F of the instantaneous sound source azimuth θ̂ to calculate the spatial spectral divergence F(θ̂) of the speech-signal energy at θ̂, where Φ is the set of discrete observation azimuths covering the observation interval and P(φ) is the spatial spectrum of the speech signal at observation azimuth φ.
In some embodiments, the apparatus further comprises a beamforming module configured to calculate adaptive beamforming weighting coefficients for the microphone array based on the updated sound source azimuth θ.
According to a third aspect of the present disclosure, there is provided a computing device comprising: a memory configured to store computer-executable instructions; a processor configured to perform any of the methods described above when the computer-executable instructions are executed by the processor.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed, perform any of the methods described above.
The method can effectively solve the problem of azimuth mismatch caused by sound source movement, and improves the robustness of the microphone array in a far field. The method and the device provided by the disclosure can realize far-field robust sound pickup when the position of a sound source is unknown and the position of the sound source moves relative to a microphone array. These and other advantages of the present disclosure will become apparent from and elucidated with reference to the embodiments described hereinafter.
Drawings
Embodiments of the present disclosure will now be described in more detail and with reference to the accompanying drawings, in which:
FIG. 1 illustrates an exemplary context diagram in which an embodiment according to the present disclosure may be implemented;
fig. 2 illustrates an exemplary flow diagram of a microphone array based sound source tracking and pickup method according to one embodiment of the present disclosure;
FIG. 3 illustrates a performance comparison graph of different beamforming algorithms according to one embodiment of the present disclosure;
FIG. 4 illustrates an exemplary schematic diagram of a microphone array based sound source tracking and pickup apparatus according to one embodiment of the present disclosure; and
fig. 5 illustrates an example system that includes an example computing device that represents one or more systems and/or devices that may implement the various techniques described herein.
Detailed Description
The following description provides specific details for a thorough understanding and enabling description of various embodiments of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these details. In some instances, well-known structures and functions have not been shown or described in detail to avoid unnecessarily obscuring the description of the embodiments of the disclosure. The terminology used in the present disclosure is to be understood in its broadest reasonable manner, even though it is being used in conjunction with a particular embodiment of the present disclosure.
First, some terms related to the embodiments of the present disclosure are explained so that those skilled in the art can understand that:
microphone array: the system comprises an audio front-end acquisition system consisting of a plurality of microphones, and the microphones are used for acquiring audio to acquire a source direction and performing beam forming calculation so as to achieve the purpose of enhancing the signal-to-noise ratio of an audio signal.
Beam forming: the microphone array only collects audio signals in a specific direction and suppresses audio signals in other directions.
Conventional beamforming: the objective is to select an appropriate weighting vector to compensate for the propagation delay of each element of the microphone so that the signals in the desired direction arrive at the array in phase, thereby producing a spatial response maximum in that direction. If analogized to a time-domain filter, beamforming can be viewed as a spatial filter, while the beam pattern is the spatial frequency response of the spatial filter. Conventional beamforming is very robust to model mismatch, mainly because it has fixed weighting coefficients whose characteristics do not change as the target signal, interference, and environmental noise characteristics change. But the conventional beamforming has very limited ability to suppress unknown strong interference.
Capon beamforming: which is a classical data dependent adaptive beamforming. The Capon beam forming is mainly characterized in that the weighting coefficient can be adaptively adjusted according to the change of the input data characteristics, so that the Capon beam forming has good inhibition capability on unknown strong interference components. The problem with this approach is that it is extremely sensitive to model mismatch.
Diagonal load beamforming: also known as noise injection. The method mainly aims at reducing the diffusion degree of the noise characteristic value of the covariance matrix caused by mismatching of the array steering vector and finite fast beat number, so that the influence of the noise characteristic vector on the self-adaptive weight vector is reduced. The algorithm can effectively reduce the distortion of the beam pattern and improve the robustness of the algorithm, but simultaneously, the null of the self-adaptive beam pattern is also lightened, and the interference suppression capability is reduced.
Worst-case optimal beamforming: the method is mainly aimed at the situation of mismatching of the steering vectors. Firstly, the real steering vectors are assumed to be distributed in the neighborhood of the assumed steering vectors, then an uncertain set of the steering vectors is constructed by utilizing the neighborhood, and finally, the minimum value of the output signal-to-interference-and-noise ratio corresponding to each vector in the set is maximized by applying constraint on the uncertain set, so that the weighting coefficient of the self-adaptive beam former is calculated.
FIG. 1 illustrates an exemplary context diagram in which an embodiment according to the present disclosure may be implemented. As shown in fig. 1, the microphone array 102 can capture sound over a range of angles and is not inherently directional. For example, when the microphone array 102 is a dual-microphone array, sound over an angular range of 180° can be collected. While the speaker 104 is engaged in voice activity, the microphone array 102 may beamform toward the speaker 104. After the microphone array 102 beamforms toward the speaker 104, the speaker 104's voice is enhanced, while noise outside the beam's directivity range is suppressed. It should be noted that the dual-microphone array is merely an example and is not limiting.
Fig. 2 illustrates an exemplary flow diagram 200 of a microphone array based sound source tracking and pickup method according to one embodiment of the present disclosure. In step 202, an instantaneous sound source azimuth θ̂ is estimated based on data received by the microphone array. In one embodiment, θ̂ is estimated over the operating band of the microphone array by a maximum likelihood estimation method, based on N snapshots of data received by the microphone array within a time period, where N is a positive integer. For example, the length of the time period may typically be 25 ms. Take a dual-microphone array consisting of two isotropic elements with element spacing d as an example. First, the operating band of the microphone array is divided into K mutually independent sub-bands with center frequencies f_1, f_2, …, f_K.
The sound source azimuth is estimated with a maximum likelihood estimation method in the following concrete steps:
In step 2022, taking the k-th sub-band (k = 1, 2, …, K) as an example (the other sub-bands are handled analogously), a likelihood function L(φ, f_k) of the observation azimuth φ and the center frequency f_k is constructed as formula (1), where x_m is the single snapshot of data received by the m-th element (m = 0, 1) and c is the propagation speed of sound waves in air.
In step 2024, the maximum value of equation (1) is obtained by iteration of equation (1) using newton's method, and the sound source direction of the sub-band is obtained:
Figure RE-DEST_PATH_IMAGE031
(2)。
other iterative methods may also be employed to solve the maximum of equation (1), as will be appreciated by those skilled in the art.
In step 2026, the information of the sound source directions calculated for the other sub-frequencies by the above method is integrated, and the total sound source direction is estimated according to the following equation:
Figure RE-DEST_PATH_IMAGE032
(3)。
other algorithms for estimating the direction of the sound source, such as minimum mean square error, etc., may also be employed, as will be appreciated by those skilled in the art.
In step 204, the spatial spectral divergence F(θ̂) of the speech-signal energy at the estimated instantaneous azimuth θ̂ is calculated. The spatial spectral divergence represents the degree of concentration of the signal energy in a given spatial direction, and is a function of angle. In an embodiment, the disclosure defines it by formula (4), where Φ is the set of observation azimuths obtained by discretizing the whole observation region and P(φ) is the spatial spectrum of the speech signal at observation azimuth φ. In a practical embodiment, P(φ) may be obtained with a conventional beamforming algorithm or with the Capon spatial spectrum estimation mentioned above. When a sound source appears in the direction θ̂, the signal energy converges toward θ̂ and the value of the divergence function F(θ̂) becomes smaller; conversely, when no sound source appears in the observation space, the signal energy is randomly distributed over the whole observation space and the value of F(θ̂) becomes larger. Voice activity detection can therefore be based on F(θ̂). The step of performing voice activity detection may comprise: substituting the sound source azimuth θ̂ estimated in formula (3) into formula (4) and calculating the spatial spectral divergence in the θ̂ direction, i.e. F(θ̂).
In step 206, voice activity is detected based on the calculated spatial spectral divergence F(θ̂). In one embodiment, a detection threshold η is preset. H1 denotes the state "speech signal present"; H0 denotes the state "no speech signal". When F(θ̂) ≥ η, the signal energy is distributed over the whole observation space and there is no speech signal. When F(θ̂) < η, the signal energy is concentrated in the θ̂ direction and there is a speech signal. That is, whether voice activity is detected may be determined according to:
F(θ̂) < η ⇒ H1,  F(θ̂) ≥ η ⇒ H0. (5)
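The divergence-based detection of steps 204–206 can be sketched as follows. The patent's divergence formula (4) is published only as an image, so the energy-spread measure below is a plausible stand-in (an assumption), as is the delay-and-sum spatial spectrum used for P(φ):

```python
import numpy as np

def spatial_spectrum(frames, fs, d=0.05, c=343.0,
                     grid=np.linspace(-90, 90, 181)):
    """Delay-and-sum spatial spectrum P(phi) for a 2-mic snapshot block."""
    n = frames.shape[1]
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    X = np.fft.rfft(frames, axis=1)
    tau = d * np.sin(np.deg2rad(grid)) / c
    shift = np.exp(2j * np.pi * np.outer(freqs, tau))        # (F, G)
    return np.sum(np.abs(X[0][:, None] + X[1][:, None] * shift) ** 2, axis=0)

def spectral_divergence(P, grid, theta_hat):
    """Energy spread of P(phi) around theta_hat: small when the energy is
    concentrated near theta_hat (a stand-in for the patent's formula (4))."""
    w = P / P.sum()
    return float(np.sum(w * (grid - theta_hat) ** 2))

def voice_active(F_val, eta):
    """Formula (5): H1 (speech present) when the divergence is below eta."""
    return F_val < eta
```

Any measure that shrinks as P(φ) concentrates around θ̂ and grows as it flattens would implement the same decision rule.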
In step 208, upon detection of voice activity, the sound source azimuth θ is updated based on the estimated instantaneous azimuth θ̂. The sound source azimuth θ is initialized at the beginning of the sound source detection process; for example, θ may be set to 0°. The instantaneous azimuth θ̂ calculated in each time period then updates the previous value of θ. Upon detection of voice activity, the sound source azimuth θ may be updated by
θ ← μ·θ_prev + (1 − μ)·θ̂, (6)
where μ is a constant with 0 < μ < 1 and θ_prev denotes the previous sound source azimuth. By defining the spatial spectral divergence and using it to judge whether voice activity is present, the method and device improve the robustness of microphone array azimuth tracking.
In one embodiment, the method 200 further comprises a step 210 of calculating adaptive beamforming weighting coefficients for the microphone array based on the updated sound source azimuth θ. The microphone adaptive beamforming process constitutes the microphone pickup. In one embodiment, the adaptive beamforming weighting coefficients of the microphone array are calculated with a beam-pattern distortion constraint method. In an embodiment, calculating the adaptive beamforming weighting coefficients may comprise the following steps:
In step 2102, a feature matrix A is constructed, and its value is computed by solving the convex optimization problem of formula (7), where ε is a constant greater than 0 and less than 2 representing the beam-pattern distortion constraint factor, ā is the assumed steering vector calculated from the sound source azimuth θ, tr(·) denotes the trace of a matrix, A ⪰ 0 denotes that A is positive semidefinite, and R is the covariance matrix of the N valid snapshots. Formula (7) can be solved with an interior-point method. Here s.t. denotes the constraints of the problem, likewise in the following formulas.
In step 2104, the rank of A is judged. When rank(A) > 1, a rank-1 decomposition of the feature matrix A is performed according to formula (8) to estimate the steering vector a; when rank(A) = 1, an eigenvalue decomposition of A is performed to estimate the steering vector a.
In step 2106, the adaptive beamforming weighting coefficients are calculated with a as the steering vector:
w = R⁻¹a / (aᴴ R⁻¹ a), (9)
where R is the covariance matrix of the N valid snapshots,
R = (1/N) Σ_{n=1}^{N} x(n) x(n)ᴴ. (10)
In step 212, the speech signal is enhanced with the updated beamforming coefficients, i.e.
y(n) = wᴴ x(n). (11)
As will be appreciated by those skilled in the art, beamforming algorithms other than the beam-pattern distortion constraint method, such as conventional beamforming, Capon beamforming, diagonal loading, or worst-case optimal beamforming, may also be used to calculate the adaptive beamforming weighting coefficients.
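As one of the alternatives just listed, a diagonally loaded Capon (MVDR) beamformer can be sketched as follows. This is not the patent's beam-pattern distortion constraint method; the loading level and function names are assumptions, while the weight formula matches the standard Capon form w = R⁻¹a/(aᴴR⁻¹a):

```python
import numpy as np

def steering_vector(theta_deg, freq, d=0.05, c=343.0):
    """Assumed steering vector of a 2-element array toward theta (one band)."""
    tau = d * np.sin(np.deg2rad(theta_deg)) / c
    return np.array([1.0, np.exp(-2j * np.pi * freq * tau)])

def mvdr_weights(snapshots, a, loading=1e-2):
    """Diagonally loaded Capon/MVDR weights.
    snapshots: (M, N) complex array of N snapshots; a: steering vector."""
    M, N = snapshots.shape
    R = snapshots @ snapshots.conj().T / N            # sample covariance (10)
    R += loading * np.trace(R).real / M * np.eye(M)   # diagonal loading
    Ri_a = np.linalg.solve(R, a)
    return Ri_a / (a.conj() @ Ri_a)                   # Capon form (9)

def beamform(snapshots, w):
    """Formula (11): enhanced output y(n) = w^H x(n)."""
    return w.conj() @ snapshots
```

The loading term inflates the noise eigenvalues of R, which, as noted in the definitions above, trades slightly shallower nulls for much better tolerance to steering-vector mismatch.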
Fig. 3 illustrates a performance comparison graph of several algorithms, where the abscissa indicates the observation direction deviation, i.e., the angle between the observation direction of the microphone array and the direction in which the sound source appears. The SINR curve of the solution proposed in the present disclosure in fig. 3 uses the observation azimuth obtained from the maximum likelihood sound source estimation and the spatial spectrum estimation described above. The standard beamforming, diagonal loading, and worst-case performance optimization algorithms use a fixed observation azimuth, i.e., an observation azimuth aligned with the sound source at 0 degrees of observation direction deviation. As can be seen from fig. 3, the technical solution proposed in the present disclosure achieves a significantly higher signal-to-interference-plus-noise ratio than the other beamforming methods.
Fig. 4 illustrates an exemplary schematic diagram of a microphone array based sound source tracking and pickup apparatus according to one embodiment of the present disclosure. In one embodiment, the microphone array based sound source tracking and pickup apparatus 400 includes an instantaneous sound source orientation estimation module 402, a spatial spectral divergence calculation module 404, a voice activity detection module 406, a sound source orientation update module 408, and a beamforming module 410.
The instantaneous sound source azimuth estimation module 402 is configured to estimate an instantaneous sound source azimuth θ̂ based on snapshot data received by the microphone array. In one embodiment, θ̂ is estimated over the operating band of the microphone array by a maximum likelihood estimation method, based on N snapshots of data received within a time period, where N is a positive integer. For example, the length of the time period may typically be 25 ms. Take a dual-microphone array consisting of two isotropic elements with element spacing d as an example. First, the operating band of the microphone array is divided into K mutually independent sub-bands with center frequencies f_1, f_2, …, f_K. The sound source azimuth is estimated with a maximum likelihood estimation method. Specifically:
First, taking the k-th (k = 1, 2, …, K) sub-band as an example (the other sub-bands are handled analogously), a likelihood function L(θ, f_k) of the observation direction θ and the center frequency f_k is constructed, see formula (1) above. In formula (1), x_m is the single snapshot data received by the m-th (m = 0, 1) array element, and c is the propagation speed of sound waves in air.
Then, formula (1) is iterated by Newton's method to find its maximum, yielding the sound source direction θ̂_k of the sub-band, see formula (2) above. As those skilled in the art will appreciate, other iterative methods may also be employed to find the maximum of formula (1).
Then, the sound source direction information calculated as above for the other sub-bands is combined to estimate the overall sound source direction θ̂, see formula (3) above. As those skilled in the art will appreciate, other sound source estimation algorithms may also be employed.
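Formulas (1)-(3) are not reproduced in this text, so the following sketch is only a hypothetical stand-in for the described steps: it grid-searches a per-sub-band likelihood for a two-element array (in place of the Newton iteration) and combines the sub-band estimates by simple averaging. The function names, the grid search, and the averaging rule are all assumptions, not the disclosure's exact method.

```python
import numpy as np

C = 343.0  # propagation speed of sound in air (m/s)

def ml_doa_subband(x0, x1, f_k, d, angles_deg):
    """Per-sub-band direction estimate for a two-element array: choose the
    angle whose steering phase best aligns the two channels (a grid search
    standing in for the Newton iteration of formula (2)).
    x0, x1: complex snapshot arrays of the two elements for this sub-band."""
    angles = np.deg2rad(np.asarray(angles_deg, dtype=float))
    tau = d * np.sin(angles) / C                    # inter-element delay per angle
    phase = np.exp(2j * np.pi * f_k * tau)          # steering phase of element 1
    # likelihood surface: energy of the phase-aligned sum over all snapshots
    surface = np.abs(x0[:, None] + x1[:, None] * np.conj(phase)[None, :]) ** 2
    return float(np.asarray(angles_deg)[np.argmax(surface.sum(axis=0))])

def ml_doa(subband_snapshots, center_freqs, d, angles_deg=range(-90, 91)):
    """Overall azimuth: average of the per-sub-band estimates (a simple
    stand-in for the combination step of formula (3))."""
    estimates = [ml_doa_subband(x0, x1, f, d, angles_deg)
                 for (x0, x1), f in zip(subband_snapshots, center_freqs)]
    return float(np.mean(estimates))
```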
The spatial spectral divergence calculation module 404 is configured to calculate the spatial spectral divergence F(θ̂) of the energy of the speech signal at the estimated instantaneous sound source azimuth θ̂. The spatial spectral divergence, a function of angle that represents the degree of concentration of signal energy in a given spatial direction, is defined in the present disclosure according to formula (4) described above, where Θ is the set of observation directions obtained by discretizing the whole observation region and P(θ) is the energy of the speech signal in observation direction θ. In a practical embodiment, P(θ) may be obtained using a conventional beamforming algorithm or Capon spatial spectrum estimation as mentioned above. When a sound source appears in direction θ₀, the signal energy converges in the θ₀ direction and the divergence F(θ₀) becomes small; in contrast, when no sound source appears in the observation space, the signal energy is randomly distributed over the whole observation space and F(θ₀) becomes large. Voice activity detection can therefore be performed based on F(θ₀). The specific implementation of voice activity detection is as follows: the sound source azimuth θ̂ estimated by formula (3) is substituted into formula (4), and the spatial spectral divergence F(θ̂) in the θ̂ direction is calculated.
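Formula (4) itself is not reproduced in this text; the ratio below is one hypothetical definition that matches the described behaviour (small when the spatial spectrum concentrates at the estimated azimuth, large when energy is spread over the whole observation region):

```python
import numpy as np

def spatial_spectral_divergence(P, idx_hat):
    """Divergence of the spatial spectrum P (one energy value per discrete
    observation direction in the set of directions) evaluated at the index
    idx_hat of the estimated azimuth.  Mean energy over all directions
    divided by the energy at the estimate: an assumed form, not formula (4)."""
    P = np.asarray(P, dtype=float)
    return float(P.mean() / P[idx_hat])
```

A spectrum with a peak at the estimate yields a value well below 1, while a flat spectrum yields exactly 1, so a threshold between the two separates the two hypothesis states.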
the voice activity detection module 406 is configured to be based onCalculated spatial spectral divergence
Figure RE-2080DEST_PATH_IMAGE012
Voice activity is detected. In one embodiment, the detection threshold is preset
Figure RE-953855DEST_PATH_IMAGE015
。H1A state indicating "speech signal present"; h0Indicating a "no speech signal" state. When in use
Figure RE-692004DEST_PATH_IMAGE038
When the signal energy is distributed in the whole observation space, no speech signal exists. When in use
Figure RE-680820DEST_PATH_IMAGE039
The signal energy is gathered in
Figure RE-137209DEST_PATH_IMAGE001
In the direction, there is a speech signal, see equation (5) above.
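The decision of formula (5) then reduces to a single threshold comparison; the sketch below assumes the convention described above (divergence below the threshold means the energy is concentrated at the estimated azimuth, i.e. state H1):

```python
def detect_voice_activity(divergence, threshold):
    """Hypothesis test of formula (5) as described in the text:
    divergence < threshold -> H1, speech signal present (returns True)
    divergence > threshold -> H0, no speech signal      (returns False)"""
    return divergence < threshold
```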
The sound source position update module 408 is configured to, upon detection of voice activity, update the sound source azimuth θ̄ based on the estimated instantaneous sound source azimuth θ̂. The sound source azimuth θ̄ is initialized at the beginning of the sound source detection process; for example, θ̄ may be set to 0°. The instantaneous sound source azimuth θ̂ calculated for each time period is then used to update the previous θ̄. Upon detection of voice activity, the sound source azimuth θ̄ is updated, see formula (6) above, where α is a constant with 0 < α < 1.
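Formula (6) is not reproduced here; exponential smoothing of the previous azimuth toward the new instantaneous estimate is one plausible form of such an update, with the constant α controlling how quickly the tracked azimuth follows the source:

```python
def update_azimuth(theta_prev, theta_inst, alpha=0.9, voice_active=True):
    """Tracked-azimuth update applied once per time period.  An assumed
    exponential-smoothing form (not necessarily formula (6)): the track is
    only moved when voice activity was detected, and the constant
    0 < alpha < 1 weights the previous azimuth against the new estimate."""
    if not voice_active:
        return theta_prev          # no speech detected: keep the old track
    return alpha * theta_prev + (1.0 - alpha) * theta_inst
```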
the beamforming module 410 is configured for updating the location of the sound source based on the updated location of the sound source
Figure RE-785173DEST_PATH_IMAGE003
Adaptive beamforming weighting coefficients for the microphone array are calculated. The process of microphone adaptive beamforming is microphone pickup. In one embodiment, adaptive beamforming weighting coefficients for the microphone array are calculated using a beam pattern distortion constraint. The method specifically comprises the following steps:
constructing a feature matrix
Figure RE-154975DEST_PATH_IMAGE019
Calculated by solving the convex optimization problem of equation (7) described above
Figure RE-576729DEST_PATH_IMAGE020
The value of (c). Wherein
Figure RE-514729DEST_PATH_IMAGE022
Is a constant greater than 0 and less than 2, which represents a beampattern distortion constraint factor,
Figure RE-654723DEST_PATH_IMAGE023
according to the direction of sound source
Figure RE-93795DEST_PATH_IMAGE003
The calculated assumed steering vector is calculated as a vector,
Figure RE-245422DEST_PATH_IMAGE044
and performing trace calculation on the matrix. The above formula (7) can be obtained by adopting an interior point method, and the operation amount is
Figure RE-478957DEST_PATH_IMAGE045
And judging the rank of A, and when the rank (A) is greater than 1, performing rank 1 decomposition on the feature matrix A according to the formula (8) to estimate the guide vector a. When rank (a) =1, eigenvalue decomposition is performed on the matrix a to estimate the steering vector a.
The adaptive beamforming weighting coefficients are calculated with a as the steering vector according to equation (9) described above. Wherein
Figure RE-840668DEST_PATH_IMAGE024
The covariance matrix for the N valid snapshot data is referred to as equation (10) above.
In one embodiment, a speech signal enhancement module is further included, which enhances the speech signal with the updated beamforming coefficients, see formula (11) described above. As those skilled in the art will appreciate, beamforming algorithms other than the beam pattern distortion constraint may also be employed, such as conventional beamforming, Capon beamforming, diagonal loading, and worst-case performance optimization beamforming.
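The disclosure's own method obtains the steering vector from the convex problem of formula (7); since that formula is not reproduced here, the sketch below instead uses Capon (MVDR) weights, one of the alternative beamformers named above, together with a formula (11)-style enhancement output. The two-element steering-vector model and the diagonal loading level are assumptions.

```python
import numpy as np

C = 343.0  # propagation speed of sound in air (m/s)

def steering_vector(theta_deg, d, f):
    """Assumed steering vector of a two-element array for azimuth theta."""
    tau = d * np.sin(np.deg2rad(theta_deg)) / C
    return np.array([1.0, np.exp(-2j * np.pi * f * tau)])

def capon_weights(R, a, loading=1e-3):
    """Capon/MVDR weights w = R^{-1} a / (a^H R^{-1} a); a small diagonal
    loading keeps the inversion stable on estimated covariances."""
    M = len(a)
    Rl = R + loading * (np.trace(R).real / M) * np.eye(M)
    Ri_a = np.linalg.solve(Rl, a)
    return Ri_a / (a.conj() @ Ri_a)

def enhance(w, X):
    """Enhanced output y[n] = w^H x[n] for each snapshot x[n] (row of X)."""
    return X @ w.conj()
```

With R estimated from the N valid snapshots as in formula (10), the distortionless constraint w^H a = 1 holds by construction, so a signal arriving from the tracked azimuth passes through the beamformer undistorted.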
Fig. 5 illustrates an example system 500 that includes an example computing device 510 that represents one or more systems and/or devices that may implement the various techniques described herein. Computing device 510 may be, for example, a server of a service provider, a device associated with a client (e.g., a client device), a system on a chip, and/or any other suitable computing device or computing system. The microphone array based sound source tracking and pickup apparatus 400 described above with respect to fig. 4 may take the form of a computing device 510. Alternatively, the microphone array based sound source tracking and pickup apparatus 400 may be implemented as a computer program in the form of a sound source tracking and pickup application 516.
The example computing device 510 as illustrated includes a processing system 511, one or more computer-readable media 512, and one or more I/O interfaces 513 communicatively coupled to each other. Although not shown, the computing device 510 may also include a system bus or other data and command transfer system that couples the various components to one another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. Various other examples are also contemplated, such as control and data lines.
Processing system 511 represents functionality that performs one or more operations using hardware. Thus, the processing system 511 is illustrated as including hardware elements 514 that may be configured as processors, functional blocks, and so forth. This may include implementation in hardware as an application specific integrated circuit or other logic device formed using one or more semiconductors. Hardware element 514 is not limited by the material from which it is formed or the processing mechanisms employed therein. For example, a processor may be comprised of semiconductor(s) and/or transistors (e.g., electronic Integrated Circuits (ICs)). In such a context, processor-executable instructions may be electronically-executable instructions.
The computer-readable medium 512 is illustrated as including a memory/storage device 515. Memory/storage 515 represents memory/storage capacity associated with one or more computer-readable media. The memory/storage 515 may include volatile media (such as Random Access Memory (RAM)) and/or nonvolatile media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). The memory/storage 515 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) as well as removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). The computer-readable medium 512 may be configured in various other ways as further described below.
One or more I/O interfaces 513 represent functionality that allows a user to enter commands and information to computing device 510, and optionally also allows information to be presented to the user and/or other components or devices using various input/output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice input), a scanner, touch functionality (e.g., capacitive or other sensors configured to detect physical touch), a camera (e.g., motion that may not involve touch may be detected as gestures using visible or invisible wavelengths such as infrared frequencies), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, a haptic response device, and so forth. Accordingly, the computing device 510 may be configured in various ways to support user interaction, as described further below.
The computing device 510 also includes a sound source tracking and pickup application 516. The sound source tracking and pickup application 516 may be, for example, a software instance of the microphone array based sound source tracking and pickup apparatus 400 described with respect to fig. 4, and in combination with other elements in the computing device 510 implement the techniques described herein.
Various techniques may be described herein in the general context of software, hardware elements, or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
An implementation of the described modules and techniques may be stored on or transmitted across some form of computer readable media. Computer readable media can include a variety of media that can be accessed by computing device 510. By way of example, and not limitation, computer-readable media may comprise "computer-readable storage media" and "computer-readable signal media".
"computer-readable storage medium" refers to a medium and/or device, and/or a tangible storage apparatus, capable of persistently storing information, as opposed to mere signal transmission, carrier wave, or signal per se. Accordingly, computer-readable storage media refers to non-signal bearing media. Computer-readable storage media include hardware such as volatile and nonvolatile, removable and non-removable media and/or storage devices implemented in a method or technology suitable for storage of information such as computer-readable instructions, data structures, program modules, logic elements/circuits or other data. Examples of computer readable storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage devices, tangible media, or an article of manufacture suitable for storing the desired information and accessible by a computer.
"computer-readable signal medium" refers to a signal-bearing medium configured to transmit instructions to hardware of computing device 510, such as via a network. Signal media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave, data signal or other transport mechanism. Signal media also includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
As previously described, hardware element 514 and computer-readable medium 512 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware form that may be used in some embodiments to implement at least some aspects of the techniques described herein. The hardware elements may include integrated circuits or systems-on-chips, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), Complex Programmable Logic Devices (CPLDs), and other implementations in silicon or components of other hardware devices. In this context, a hardware element may serve as a processing device that performs program tasks defined by instructions, modules, and/or logic embodied by the hardware element, as well as a hardware device for storing instructions for execution, such as the computer-readable storage medium described previously.
Combinations of the foregoing may also be used to implement the various techniques and modules described herein. Thus, software, hardware, or program modules and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage medium and/or by one or more hardware elements 514. The computing device 510 may be configured to implement particular instructions and/or functions corresponding to software and/or hardware modules. Thus, implementation of a module executable by the computing device 510 as software may be achieved at least partially in hardware, for example, using the computer-readable storage media and/or hardware elements 514 of the processing system. The instructions and/or functions may be executable/operable by one or more articles of manufacture (e.g., one or more computing devices 510 and/or processing systems 511) to implement the techniques, modules, and examples described herein.
In various implementations, the computing device 510 may assume a variety of different configurations. For example, the computing device 510 may be implemented as a computer-like device including a personal computer, a desktop computer, a multi-screen computer, a laptop computer, a netbook, and so forth. The computing device 510 may also be implemented as a mobile device-like device including mobile devices such as mobile telephones, portable music players, portable gaming devices, tablet computers, multi-screen computers, and the like. Computing device 510 may also be implemented as a television-like device that includes devices with or connected to a generally larger screen in a casual viewing environment. These devices include televisions, set-top boxes, game consoles, and the like.
The techniques described herein may be supported by these various configurations of computing device 510 and are not limited to specific examples of the techniques described herein. Functionality may also be implemented in whole or in part on the "cloud" 520 through the use of a distributed system, such as through the platform 522 described below.
Cloud 520 includes and/or is representative of a platform 522 for resources 524. The platform 522 abstracts underlying functionality of hardware (e.g., servers) and software resources of the cloud 520. The resources 524 may include applications and/or data that may be used when executing computer processes on servers remote from the computing device 510. The resources 524 may also include services provided over the internet and/or over a subscriber network such as a cellular or Wi-Fi network.
The platform 522 may abstract resources and functionality to connect the computing device 510 with other computing devices. The platform 522 may also serve to abstract the scaling of resources to provide a corresponding level of scale to encountered demand for the resources 524 implemented via the platform 522. Accordingly, in interconnected device embodiments, implementation of the functionality described herein may be distributed throughout the system 500. For example, the functionality may be implemented in part on the computing device 510 and in part by the platform 522 that abstracts the functionality of the cloud 520.
It should be understood that, for clarity, embodiments of the disclosure have been described with reference to different functional modules. However, it will be apparent that the functionality of each functional module may be implemented in a single module, in multiple modules, or as part of other functional modules without departing from the disclosure. For example, functionality illustrated as being performed by a single module may be performed by multiple different modules. Thus, references to specific functional modules are only to be seen as references to suitable means for providing the described functionality rather than as indicative of a strict logical or physical structure or organization. Thus, the present disclosure may be implemented in a single module or may be physically and functionally distributed between different modules and circuits.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various devices, elements, or components, these devices, elements, or components should not be limited by these terms. These terms are only used to distinguish one device, element, or component from another device, element, or component.
Although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present disclosure is limited only by the accompanying claims. Additionally, although individual features may be included in different claims, these may possibly advantageously be combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. The order of features in the claims does not imply any specific order in which the features must be performed. Furthermore, in the claims, the word "comprising" does not exclude other elements, and the indefinite article "a" or "an" does not exclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.

Claims (15)

1. A microphone array based sound source tracking and pickup method, comprising:
estimating an instantaneous sound source azimuth θ̂ based on snapshot data received by the microphone array;
calculating a spatial spectral divergence F(θ̂) of the energy of the speech signal at the estimated instantaneous sound source azimuth θ̂;
detecting voice activity based on the calculated spatial spectral divergence F(θ̂); and
upon detection of the voice activity, updating a sound source azimuth θ̄ based on the estimated instantaneous sound source azimuth θ̂.
2. The method of claim 1, wherein said estimating an instantaneous sound source azimuth θ̂ further comprises:
estimating the instantaneous sound source azimuth θ̂ over a working frequency band of the microphone array by a maximum likelihood estimation method, based on N snapshot data received by the microphone array within a time period, where N is a positive integer.
3. The method of claim 2, wherein said employing maximum likelihood estimation further comprises:
constructing a likelihood function L(θ, f_k) of an observation azimuth θ and the center frequency f_k of each sub-band of the working frequency band of the microphone array, wherein m = 0, 1 are the serial numbers of the array elements of the microphone array, x_m is the snapshot data received by the m-th array element, c is the propagation speed of sound waves in air, d is the array element spacing of the microphone array, and the observation azimuth θ belongs to a set Θ of discrete observation azimuths of the observation interval.
4. The method of claim 1, wherein said calculating the spatial spectral divergence F(θ̂) further comprises:
constructing a function F(θ̂) of the instantaneous sound source azimuth θ̂ to calculate the spatial spectral divergence F(θ̂) of the energy of the speech signal at the instantaneous sound source azimuth θ̂, wherein Θ is a set of discrete observation azimuths of the observation interval and P(θ) is the energy of the speech signal in observation direction θ.
5. The method of claim 1, wherein said detecting voice activity further comprises:
comparing the calculated spatial spectral divergence F(θ̂) with a predetermined detection threshold γ to determine whether voice activity is detected.
6. The method of claim 5, wherein it is determined that the voice activity is detected when the spatial spectral divergence F(θ̂) is less than the detection threshold γ; and it is determined that no voice activity is detected when the spatial spectral divergence F(θ̂) is greater than the detection threshold γ.
7. The method of claim 1, wherein said updating the sound source azimuth θ̄ further comprises performing the updating according to an update formula in which α is a constant and θ̄_prev denotes the previous sound source azimuth.
8. The method of claim 1, further comprising:
calculating adaptive beamforming weighting coefficients for the microphone array based on the updated sound source azimuth θ̄.
9. The method of claim 8, wherein said calculating adaptive beamforming weighting coefficients further comprises:
calculating the adaptive beamforming weighting coefficients by a beam pattern distortion constraint method, which comprises in order:
S101, constructing a feature matrix A and calculating its value by solving a convex optimization problem, wherein ε is a constant greater than 0 and less than 2 representing the beam pattern distortion constraint factor, ā is the assumed steering vector calculated from the sound source azimuth θ̄, and R is the covariance matrix of the N valid snapshot data;
S102, judging the rank of A: when Rank(A) > 1, performing a rank-1 decomposition on the feature matrix A to estimate a steering vector a, and when Rank(A) = 1, performing an eigenvalue decomposition on the matrix A to estimate the steering vector a; and
S103, calculating the adaptive beamforming weighting coefficients w with a as the steering vector.
10. The method of claim 2, wherein the time period is 25ms in length.
11. A method as claimed in any preceding claim wherein the microphone array is a dual microphone array.
12. A microphone array based sound source tracking and pickup apparatus, comprising:
an instantaneous sound source orientation estimation module configured to estimate an instantaneous sound source azimuth θ̂ based on snapshot data received by the microphone array;
a spatial spectral divergence calculation module configured to calculate a spatial spectral divergence F(θ̂) of the energy of the speech signal at the estimated instantaneous sound source azimuth θ̂;
a voice activity detection module configured to detect voice activity based on the calculated spatial spectral divergence F(θ̂); and
a sound source azimuth update module configured to, upon detection of the voice activity, update a sound source azimuth θ̄ based on the estimated instantaneous sound source azimuth θ̂.
13. The apparatus of claim 12, wherein the spatial spectral divergence calculation module is further configured to:
construct a function F(θ̂) of the instantaneous sound source azimuth θ̂ to calculate the spatial spectral divergence F(θ̂) of the energy of the speech signal at the instantaneous sound source azimuth θ̂, wherein Θ is a set of discrete observation azimuths of the observation interval and P(θ) is the energy of the speech signal in observation direction θ.
14. A non-transitory computer readable medium comprising computer program instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1-11.
15. A computing device comprising a processor and a memory having stored thereon a computer program configured to, when executed on the processor, cause the processor to perform the method of any of claims 1-11.
CN201910440423.8A 2019-05-24 2019-05-24 Sound source tracking and pickup method and device based on microphone array Pending CN111986692A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910440423.8A CN111986692A (en) 2019-05-24 2019-05-24 Sound source tracking and pickup method and device based on microphone array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910440423.8A CN111986692A (en) 2019-05-24 2019-05-24 Sound source tracking and pickup method and device based on microphone array

Publications (1)

Publication Number Publication Date
CN111986692A true CN111986692A (en) 2020-11-24

Family

ID=73436706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910440423.8A Pending CN111986692A (en) 2019-05-24 2019-05-24 Sound source tracking and pickup method and device based on microphone array

Country Status (1)

Country Link
CN (1) CN111986692A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105609100A (en) * 2014-10-31 2016-05-25 中国科学院声学研究所 Acoustic model training and constructing method, acoustic model and speech recognition system
CN105874535A (en) * 2014-01-15 2016-08-17 宇龙计算机通信科技(深圳)有限公司 Speech processing method and speech processing apparatus
CN106098075A (en) * 2016-08-08 2016-11-09 腾讯科技(深圳)有限公司 Audio collection method and apparatus based on microphone array
CN106504763A (en) * 2015-12-22 2017-03-15 电子科技大学 Based on blind source separating and the microphone array multiple target sound enhancement method of spectrum-subtraction
CN106952653A (en) * 2017-03-15 2017-07-14 科大讯飞股份有限公司 Noise remove method, device and terminal device
WO2017129239A1 (en) * 2016-01-27 2017-08-03 Nokia Technologies Oy System and apparatus for tracking moving audio sources
US10096328B1 (en) * 2017-10-06 2018-10-09 Intel Corporation Beamformer system for tracking of speech and noise in a dynamic environment
CN108962272A (en) * 2018-06-21 2018-12-07 湖南优浪语音科技有限公司 Sound pick-up method and system

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951273A (en) * 2021-02-02 2021-06-11 Zhengzhou University Numerical control machine tool cutter abrasion monitoring device based on microphone array and machine vision
CN112951273B (en) * 2021-02-02 2024-03-29 Zhengzhou University Numerical control machine tool cutter abrasion monitoring device based on microphone array and machine vision
CN113608167A (en) * 2021-10-09 2021-11-05 Alibaba Damo Academy (Hangzhou) Technology Co Ltd Sound source localization method, apparatus and device
CN113608167B (en) * 2021-10-09 2022-02-08 Alibaba Damo Academy (Hangzhou) Technology Co Ltd Sound source localization method, apparatus and device
WO2023108864A1 (en) * 2021-12-15 2023-06-22 苏州蛙声科技有限公司 Regional pickup method and system for miniature microphone array device

Similar Documents

Publication Publication Date Title
CN108122563B (en) Method for improving voice wake-up rate and correcting DOA
US10909988B2 (en) Systems and methods for displaying a user interface
CN110931036B (en) Microphone array beam forming method
US8583428B2 (en) Sound source separation using spatial filtering and regularization phases
CN111986692A (en) Sound source tracking and pickup method and device based on microphone array
WO2020108614A1 (en) Audio recognition method, and target audio positioning method, apparatus and device
JP4799443B2 (en) Sound receiving device and method
US20160192068A1 (en) Steering vector estimation for minimum variance distortionless response (mvdr) beamforming circuits, systems, and methods
US8098842B2 (en) Enhanced beamforming for arrays of directional microphones
CN109616136B (en) Adaptive beam forming method, device and system
US7626889B2 (en) Sensor array post-filter for tracking spatial distributions of signals and noise
US8005237B2 (en) Sensor array beamformer post-processor
US20130082875A1 (en) Processing Signals
CN110554357B (en) Sound source positioning method and device
US10957338B2 (en) 360-degree multi-source location detection, tracking and enhancement
US11218802B1 (en) Beamformer rotation
CN106537501A (en) Reverberation estimator
US11222646B2 (en) Apparatus and method for generating audio signal with noise attenuated based on phase change rate
WO2022105571A1 (en) Speech enhancement method and apparatus, and device and computer-readable storage medium
TW202125989A (en) Switching method for multiple antenna arrays and electronic device applying the same
Luo et al. Constrained maximum directivity beamformers based on uniform linear acoustic vector sensor arrays
CN113055071B (en) Switching method of multiple groups of array antennas and electronic device applying same
US11508348B2 (en) Directional noise suppression
CN111245490B (en) Broadband signal extraction method and device and electronic equipment
CN113491137B (en) Flexible differential microphone array with fractional order

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination