CN111986692A - Sound source tracking and pickup method and device based on microphone array - Google Patents
- Publication number
- CN111986692A (application number CN201910440423.8A)
- Authority
- CN
- China
- Prior art keywords
- sound source
- microphone array
- voice activity
- spatial spectral
- divergence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G — PHYSICS
- G10 — MUSICAL INSTRUMENTS; ACOUSTICS
- G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208 — Noise filtering
- G10L21/0216 — Noise filtering characterised by the method used for estimating noise
- G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166 — Microphone arrays; Beamforming
Abstract
Microphone array based sound source tracking and pickup methods and apparatus are described herein. The method comprises the following steps: estimating an instantaneous sound source azimuth based on snapshot data received by the microphone array; calculating the spatial spectral divergence of the speech signal energy at the estimated instantaneous azimuth; detecting voice activity based on the calculated spatial spectral divergence; and, upon detection of voice activity, updating the tracked sound source azimuth based on the estimated instantaneous azimuth.
Description
Technical Field
The present disclosure relates to the field of microphone array technologies, and in particular, to a method and an apparatus for sound source tracking and pickup based on a microphone array.
Background
In recent years, with the rapid development of computer technology, people wish to control intelligent devices at greater distances and in more complex environments, and traditional near-field voice technology cannot meet these application requirements. Smart speech technology, especially far-field sound pickup based on microphone arrays, has therefore become a research focus. The dual-microphone array is a preferred solution for consumer electronics products such as smart televisions, smart speakers, and mobile robots, owing to its lower cost, flexibility of installation and use, and lower power consumption compared with multi-microphone arrays.
Beamforming is a core technology of microphone arrays. By weighted summation of the acquired array data, it preserves signals from a desired direction while suppressing noise and interference from other directions, achieving sound pickup in the far field (usually beyond 1 meter). Beamforming is generally divided into two types. One type is data-independent beamforming, such as the delay-and-sum method and other fixed-weight methods; its suppression of strong spatial interference sources is often unsatisfactory. The other type is data-dependent beamforming, such as adaptive beamforming and adaptive sidelobe cancellation, whose weighting coefficients adapt to changes in the external environment in order to suppress strong interference sources; beamforming of this type, however, is very sensitive to errors in the underlying array model.
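The delay-and-sum idea mentioned above can be shown with a minimal sketch (not taken from the patent; the 5 cm spacing, frequencies, and angles are arbitrary illustrative assumptions): weights that compensate each element's propagation delay give unit gain toward the look direction and a reduced response elsewhere.

```python
import numpy as np

def steering_vector(theta, freq, d=0.05, c=343.0, num_mics=2):
    """Far-field steering vector of a uniform linear array (theta measured from broadside)."""
    delays = np.arange(num_mics) * d * np.sin(theta) / c
    return np.exp(-2j * np.pi * freq * delays)

def delay_and_sum_weights(theta, freq, **kw):
    """Data-independent weights: compensate each element's delay, then average."""
    a = steering_vector(theta, freq, **kw)
    return a / a.size

theta0, f = np.deg2rad(30.0), 2000.0
w = delay_and_sum_weights(theta0, f)
# Unit response toward the look direction, smaller response elsewhere
on_axis = abs(np.vdot(w, steering_vector(theta0, f)))
off_axis = abs(np.vdot(w, steering_vector(np.deg2rad(-60.0), f)))
```

Because the weights are fixed by geometry alone, this sketch also illustrates why data-independent beamforming cannot adapt to a strong interferer: the off-axis response is whatever the pattern happens to be at the interferer's angle.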
In application scenarios such as mobile robots and smart homes, the sound source is likely to be in motion, so its position relative to the microphones changes frequently. Existing adaptive beamforming algorithms are very sensitive to direction deviation; in particular, when the observation data contain a desired signal component, even a small deviation can distort the main lobe of the beam pattern and cancel the desired signal.
Disclosure of Invention
In view of the above, the present disclosure provides a microphone array based sound source tracking and pickup method and apparatus.
According to a first aspect of the present disclosure, there is provided a microphone array based sound source tracking and pickup method, comprising: estimating an instantaneous sound source azimuth based on snapshot data received by the microphone array; calculating the spatial spectral divergence of the speech signal energy at the estimated instantaneous azimuth; detecting voice activity based on the calculated spatial spectral divergence; and, upon detection of voice activity, updating the tracked sound source azimuth based on the estimated instantaneous azimuth.
In some embodiments, estimating the instantaneous sound source azimuth further comprises: estimating, by a maximum likelihood estimation method, the instantaneous sound source azimuth over the operating frequency band of the microphone array based on N pieces of snapshot data received by the microphone array in a time period, N being a positive integer.
In some embodiments, employing the maximum likelihood estimation method further comprises: constructing a likelihood function of the observation azimuth and the center frequency of each sub-band of the operating band of the microphone array, wherein the array elements of the microphone array are numbered m = 0, 1, the snapshot data received by the m-th element enter the likelihood function, c is the propagation speed of sound waves in air, d is the element spacing of the microphone array, and the observation azimuth ranges over a set of discrete observation azimuths covering the observation interval.
In some embodiments, calculating the spatial spectral divergence further comprises: constructing a function of the instantaneous sound source azimuth to calculate the spatial spectral divergence of the speech signal energy at that azimuth, the function being defined over a set of discrete observation azimuths covering the observation interval together with the spatial spectrum of the speech signal at each observation azimuth.
In some embodiments, detecting voice activity further comprises: comparing the calculated spatial spectral divergence with a predetermined detection threshold to determine whether voice activity is detected.
In some embodiments, voice activity is determined to be detected when the spatial spectral divergence is less than the detection threshold, and determined not to be detected when the spatial spectral divergence is greater than the detection threshold.
In some embodiments, updating the sound source azimuth further comprises: performing the update as a weighted combination of the previous sound source azimuth and the estimated instantaneous azimuth, the weight being a constant.
In some embodiments, the method further comprises calculating adaptive beamforming weighting coefficients for the microphone array based on the updated sound source azimuth.
In some embodiments, calculating the adaptive beamforming weighting coefficients further comprises calculating them by a beam pattern distortion constraint method comprising, in order: S101, constructing a feature matrix A and computing its value by solving a convex optimization problem, in which a constant greater than 0 and less than 2 represents the beam pattern distortion constraint factor, an assumed steering vector is calculated from the sound source azimuth, and a covariance matrix is formed from the N valid snapshot data; S102, judging the rank of A: when rank(A) > 1, performing a rank-1 decomposition of the feature matrix A to estimate a steering vector a, and when rank(A) = 1, performing an eigenvalue decomposition of A to estimate the steering vector a; S103, calculating the adaptive beamforming weighting coefficients with a as the steering vector.
In some embodiments, the time period is 25ms in length.
In some embodiments, the microphone array is a dual microphone array.
According to a second aspect of the present disclosure, a microphone array based sound source tracking and pickup apparatus includes: an instantaneous sound source azimuth estimation module configured to estimate an instantaneous sound source azimuth based on snapshot data received by the microphone array; a spatial spectral divergence calculation module configured to calculate the spatial spectral divergence of the speech signal energy at the estimated instantaneous azimuth; a voice activity detection module configured to detect voice activity based on the calculated spatial spectral divergence; and a sound source azimuth update module configured to, upon detection of voice activity, update the tracked sound source azimuth based on the estimated instantaneous azimuth.
In some embodiments, the spatial spectral divergence calculation module being configured to calculate the spatial spectral divergence further comprises: constructing a function of the instantaneous sound source azimuth to calculate the spatial spectral divergence of the speech signal energy at that azimuth, the function being defined over a set of discrete observation azimuths covering the observation interval together with the spatial spectrum of the speech signal at each observation azimuth.
In some embodiments, the apparatus further comprises a beamforming module configured to calculate adaptive beamforming weighting coefficients for the microphone array based on the updated sound source azimuth.
According to a third aspect of the present disclosure, there is provided a computing device comprising: a memory configured to store computer-executable instructions; a processor configured to perform any of the methods described above when the computer-executable instructions are executed by the processor.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed, perform any of the methods described above.
The method can effectively solve the problem of azimuth mismatch caused by sound source movement and improves the far-field robustness of the microphone array. The method and apparatus provided by the present disclosure achieve robust far-field sound pickup when the position of the sound source is unknown and the source moves relative to the microphone array. These and other advantages of the present disclosure will become apparent from and elucidated with reference to the embodiments described hereinafter.
Drawings
Embodiments of the present disclosure will now be described in more detail and with reference to the accompanying drawings, in which:
FIG. 1 illustrates an exemplary context diagram in which an embodiment according to the present disclosure may be implemented;
fig. 2 illustrates an exemplary flow diagram of a microphone array based sound source tracking and pickup method according to one embodiment of the present disclosure;
FIG. 3 illustrates a performance comparison of different beamforming algorithms according to one embodiment of the present disclosure;
FIG. 4 illustrates an exemplary schematic diagram of a microphone array based sound source tracking and pickup apparatus according to one embodiment of the present disclosure; and
fig. 5 illustrates an example system that includes an example computing device that represents one or more systems and/or devices that may implement the various techniques described herein.
Detailed Description
The following description provides specific details for a thorough understanding and enabling description of various embodiments of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these details. In some instances, well-known structures and functions have not been shown or described in detail to avoid unnecessarily obscuring the description of the embodiments of the disclosure. The terminology used in the present disclosure is to be understood in its broadest reasonable manner, even though it is being used in conjunction with a particular embodiment of the present disclosure.
First, some terms used in the embodiments of the present disclosure are explained for ease of understanding:
Microphone array: an audio front-end acquisition system consisting of a plurality of microphones, used to acquire audio, estimate the source direction, and perform beamforming calculations so as to enhance the signal-to-noise ratio of the audio signal.
Beamforming: the microphone array collects audio signals only from a specific direction and suppresses audio signals from other directions.
Conventional beamforming: the objective is to select an appropriate weighting vector to compensate for the propagation delay of each element of the microphone so that the signals in the desired direction arrive at the array in phase, thereby producing a spatial response maximum in that direction. If analogized to a time-domain filter, beamforming can be viewed as a spatial filter, while the beam pattern is the spatial frequency response of the spatial filter. Conventional beamforming is very robust to model mismatch, mainly because it has fixed weighting coefficients whose characteristics do not change as the target signal, interference, and environmental noise characteristics change. But the conventional beamforming has very limited ability to suppress unknown strong interference.
Capon beamforming: a classical data-dependent adaptive beamforming method. Its main feature is that the weighting coefficients adapt to changes in the statistics of the input data, giving it good suppression of unknown strong interference components. The problem with this approach is that it is extremely sensitive to model mismatch.
Diagonal load beamforming: also known as noise injection. The method mainly aims at reducing the diffusion degree of the noise characteristic value of the covariance matrix caused by mismatching of the array steering vector and finite fast beat number, so that the influence of the noise characteristic vector on the self-adaptive weight vector is reduced. The algorithm can effectively reduce the distortion of the beam pattern and improve the robustness of the algorithm, but simultaneously, the null of the self-adaptive beam pattern is also lightened, and the interference suppression capability is reduced.
Worst-case optimal beamforming: the method is mainly aimed at the situation of mismatching of the steering vectors. Firstly, the real steering vectors are assumed to be distributed in the neighborhood of the assumed steering vectors, then an uncertain set of the steering vectors is constructed by utilizing the neighborhood, and finally, the minimum value of the output signal-to-interference-and-noise ratio corresponding to each vector in the set is maximized by applying constraint on the uncertain set, so that the weighting coefficient of the self-adaptive beam former is calculated.
FIG. 1 illustrates an exemplary context diagram in which an embodiment according to the present disclosure may be implemented. As shown in FIG. 1, the microphone array 102 may capture sound within a range of angles and is not fixed to a single direction. For example, when the microphone array 102 is a dual-microphone array, it can collect sound over a 180° angular range. The microphone array 102 may beamform toward the speaker 104 while the speaker 104 is engaged in voice activity. After beamforming toward the speaker 104, the speaker's voice is enhanced, while noise outside the beamforming directivity range is suppressed. It should be noted that the dual-microphone array is merely an example and is not limiting.
Fig. 2 illustrates an exemplary flow diagram 200 of a microphone array based sound source tracking and pickup method according to one embodiment of the present disclosure. In step 202, an instantaneous sound source azimuth is estimated based on data received by the microphone array. In one embodiment, a maximum likelihood estimation method estimates the instantaneous sound source azimuth over the operating band of the microphone array based on N snapshots received by the array in a time period, where N is a positive integer. For example, the time period may typically be 25 ms. Take as an example a dual-microphone array consisting of two isotropic elements with spacing d. First, the operating band of the microphone array is divided into K mutually independent sub-bands with center frequencies f_1, ..., f_K. The sound source azimuth is estimated by the maximum likelihood method in the following steps:
In step 2022, taking the k-th sub-band (k = 1, 2, ..., K) as an example (the other sub-bands are treated analogously), a likelihood function P(θ, f_k) of the observation azimuth θ and center frequency f_k is constructed as

P(θ, f_k) = | Σ_{m=0,1} x_m(f_k) · e^{j 2π f_k m d sin θ / c} |²,    (1)

where x_m(f_k) is the snapshot data received by the m-th array element (m = 0, 1) and c is the propagation speed of sound waves in air.
In step 2024, equation (1) is iterated by Newton's method to find its maximum, giving the sound source azimuth of the sub-band:

θ̂_k = argmax_θ P(θ, f_k).    (2)
other iterative methods may also be employed to solve the maximum of equation (1), as will be appreciated by those skilled in the art.
In step 2026, the sound source azimuths calculated for the other sub-bands by the same method are combined, and the overall sound source azimuth is estimated as their average:

θ̂ = (1/K) Σ_{k=1}^{K} θ̂_k.    (3)
other algorithms for estimating the direction of the sound source, such as minimum mean square error, etc., may also be employed, as will be appreciated by those skilled in the art.
In step 204, the spatial spectral divergence of the speech signal energy at the estimated instantaneous azimuth θ̂ is calculated. The spatial spectral divergence represents the degree of concentration of signal energy in a given spatial direction and is a function of angle. In one embodiment, the present disclosure defines it as

F(θ) = ( Σ_{θ_i ∈ Θ} p(θ_i) ) / p(θ),    (4)

where Θ is the set of observation azimuths obtained by discretizing the whole observation region, and p(θ) is the spatial spectrum of the speech signal at observation azimuth θ. In a practical embodiment, p(θ) may be obtained using a conventional beamforming algorithm or the Capon spatial spectrum estimation mentioned above. When a sound source appears at azimuth θ̂, the signal energy converges toward that direction and the value of the divergence function becomes small; conversely, when no sound source is present in the observation space, the signal energy is randomly distributed over the whole space and the value of F becomes large. Voice activity detection can therefore be based on F. The detection step comprises: substituting the sound source azimuth θ̂ estimated in equation (3) into equation (4) and computing the spatial spectral divergence in that direction, i.e. F(θ̂).
In step 206, voice activity is detected based on the calculated spatial spectral divergence F(θ̂). In one embodiment, a detection threshold γ is preset. H1 denotes the state "speech signal present"; H0 denotes the state "no speech signal". When F(θ̂) > γ, the signal energy is distributed over the whole observation space and no speech signal is present. When F(θ̂) < γ, the signal energy is gathered in the θ̂ direction and a speech signal is present. That is, whether voice activity is detected is determined according to:

F(θ̂) < γ ⇒ H1;  F(θ̂) ≥ γ ⇒ H0.    (5)
In step 208, upon detection of voice activity, the tracked sound source azimuth θ_s is updated based on the estimated instantaneous azimuth θ̂. The azimuth θ_s is initialized at the beginning of the sound source detection process; for example, it may be set to 0°. The instantaneous azimuth θ̂ calculated in each time period is then used to update the previous θ_s. Upon detection of voice activity, the sound source azimuth may be updated as

θ_s ← (1 − α) θ_s + α θ̂,    (6)

where α is a constant and θ_s on the right-hand side denotes the previous sound source azimuth.
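Steps 204-208 can be sketched together as follows. The concrete divergence formula (total spatial-spectrum energy divided by the energy at the estimated azimuth) is an assumed form chosen only to match the described behavior — small when energy concentrates, large when it is diffuse — and the smoothing rule follows the update of step 208; the grid, threshold, and smoothing constant are illustrative.

```python
import numpy as np

def spatial_spectral_divergence(p, idx):
    """Assumed divergence: total spatial-spectrum energy over the energy at grid index idx.

    Small when energy concentrates at the estimated azimuth, large when it is diffuse.
    """
    return float(np.sum(p) / p[idx])

def update_azimuth(theta_s, theta_inst, divergence, gamma, alpha=0.3):
    """Update the tracked azimuth only when the divergence test indicates speech (H1)."""
    if divergence < gamma:  # energy concentrated -> voice activity detected
        return (1.0 - alpha) * theta_s + alpha * theta_inst
    return theta_s          # H0: keep the previous azimuth

concentrated = np.array([0.1, 0.2, 8.0, 0.2, 0.1])  # spectrum peaked at grid index 2
diffuse = np.full(5, 1.0)                           # energy spread over the whole grid
F_speech = spatial_spectral_divergence(concentrated, 2)
F_noise = spatial_spectral_divergence(diffuse, 2)
tracked = update_azimuth(0.0, 0.5, F_speech, gamma=2.0)
```

The gate keeps noise-only frames from dragging the tracked azimuth, while speech frames pull it smoothly toward the instantaneous estimate.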
By defining the spatial spectral divergence and using it to judge whether voice activity is present, the present disclosure improves the robustness of microphone array azimuth tracking.
In one embodiment, the method 200 further comprises a step 210 of calculating adaptive beamforming weighting coefficients for the microphone array based on the updated sound source azimuth θ_s. Microphone adaptive beamforming constitutes the sound pickup process. In one embodiment, the weighting coefficients are calculated using a beam pattern distortion constraint method, which may comprise the following steps:
In step 2102, a feature matrix A is constructed, and its value is computed by solving the convex optimization problem of equation (7), in which ε, a constant greater than 0 and less than 2, represents the beam pattern distortion constraint factor, ā is the assumed steering vector calculated from the sound source azimuth θ_s, R̂ is the covariance matrix of the N valid snapshots, tr(·) denotes the matrix trace, and A is constrained to be positive semidefinite. Equation (7) can be solved by an interior point method. Here s.t. indicates the constraints imposed, and the same applies in the following equations.
In step 2104, the rank of A is judged. When rank(A) > 1, a rank-1 decomposition of the feature matrix A is performed according to equation (8) to estimate the steering vector a; when rank(A) = 1, an eigenvalue decomposition of A is performed to estimate the steering vector a.
In step 2106, the adaptive beamforming weighting coefficients are calculated with a as the steering vector:

w = R̂⁻¹ a / (aᴴ R̂⁻¹ a),    (9)

where R̂ is the covariance matrix of the N valid snapshots, as in equation (10):

R̂ = (1/N) Σ_{n=1}^{N} x(n) x(n)ᴴ.    (10)
In step 212, the speech signal is enhanced with the updated beamforming coefficients, i.e.

y = wᴴ x.    (11)
As will be appreciated by those skilled in the art, beamforming algorithms other than the beam pattern distortion constraint method, such as conventional beamforming, Capon beamforming, diagonal loading, or worst-case optimal beamforming, may also be used to calculate the adaptive beamforming weighting coefficients.
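Once a steering vector a has been estimated, the weighting and enhancement of steps 2106 and 212 plausibly reduce to standard Capon/MVDR processing. The sketch below (synthetic data, arbitrary parameters, not the patent's implementation) forms the sample covariance from the snapshots, computes the weights, and produces the enhanced output y = wᴴx:

```python
import numpy as np

def capon_weights(R, a):
    """MVDR/Capon weights w = R^-1 a / (a^H R^-1 a) for an estimated steering vector a."""
    Ri_a = np.linalg.solve(R, a)
    return Ri_a / np.vdot(a, Ri_a)

rng = np.random.default_rng(1)
a = np.array([1.0 + 0j, np.exp(-0.6j)])       # estimated steering vector
s = rng.standard_normal(200)                  # desired signal waveform
noise = 0.1 * (rng.standard_normal((2, 200)) + 1j * rng.standard_normal((2, 200)))
X = np.outer(a, s) + noise                    # N = 200 snapshots of signal plus noise
R = X @ X.conj().T / X.shape[1]               # sample covariance of the snapshots
w = capon_weights(R, a)
y = w.conj() @ X                              # enhanced single-channel output y = w^H x
```

The distortionless constraint wᴴa = 1 guarantees that the component arriving from the steering direction passes with unit gain while the weights minimize the remaining output power.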
Fig. 3 illustrates a performance comparison of several algorithms, where the abscissa is the observation direction deviation, i.e. the angle between the observation direction of the microphone array and the actual direction of the sound source. The SINR curve of the solution proposed in the present disclosure uses the observation azimuth obtained from the maximum likelihood sound source estimation and spatial spectrum estimation described above. Standard beamforming, the diagonal loading algorithm, and the worst-case optimization algorithm use a fixed observation azimuth, i.e. one aligned with the sound source at a deviation of 0 degrees. As can be seen from Fig. 3, the technical solution proposed in the present disclosure achieves a significantly higher signal-to-interference-plus-noise ratio than the other beamforming methods.
Fig. 4 illustrates an exemplary schematic diagram of a microphone array based sound source tracking and pickup apparatus according to one embodiment of the present disclosure. In one embodiment, the microphone array based sound source tracking and pickup apparatus 400 includes an instantaneous sound source orientation estimation module 402, a spatial spectral divergence calculation module 404, a voice activity detection module 406, a sound source orientation update module 408, and a beamforming module 410.
The sound source azimuth estimation module 402 is configured to estimate an instantaneous sound source azimuth based on snapshot data received by the microphone array. In one embodiment, a maximum likelihood estimation method estimates the instantaneous azimuth over the operating band of the microphone array based on N snapshots received in a time period, where N is a positive integer; the time period may typically be 25 ms. Taking as an example a dual-microphone array consisting of two isotropic elements with spacing d, the operating band is first divided into K mutually independent sub-bands with center frequencies f_1, ..., f_K, and the sound source azimuth is estimated by the maximum likelihood method. Specifically:
First, taking the k-th sub-band (k = 1, 2, ..., K) as an example (the other sub-bands are treated analogously), the likelihood function P(θ, f_k) of observation azimuth θ and center frequency f_k is constructed, see equation (1) above, where x_m(f_k) is the single snapshot received by the m-th array element (m = 0, 1) and c is the propagation speed of sound waves in air.
Then equation (1) is iterated by Newton's method to find its maximum, giving the sound source azimuth of the sub-band, see equation (2) above. As will be appreciated by those skilled in the art, other iterative methods may also be employed to find the maximum of equation (1).
Finally, the sound source azimuths calculated for the other sub-bands by the same method are combined to estimate the overall sound source azimuth, see equation (3) above. As will be appreciated by those skilled in the art, other sound source estimation algorithms may also be employed.
The spatial spectral divergence calculation module 404 is configured to calculate the instantaneous sound source bearing at the estimatedSpatial spectral divergence of energy of up-speech signal. The spatial spectral divergence, which represents the degree of concentration of signal energy in a certain direction in space, is a function of angle, which the present disclosure defines according to equation (4) described above. Wherein the content of the first and second substances,is obtained by discretizing the whole observation regionTo a set of observation directions, p being the observation direction of the speech signalIs determined. In a practical embodiment of the method according to the invention,may be obtained using a conventional beamforming algorithm or Capon spatial spectrum estimation as mentioned above. In thatWhen the sound source appears in the direction, the signal energy isDirection of convergence when divergence functionThe value becomes smaller; in contrast, when no sound source appears on the observation space, the energy of the signal will be randomly distributed throughout the observation space,the value of (c) becomes large. Thereby, can be based onPerforms voice activity detection. The specific implementation steps of voice activity detection comprise: the sound source azimuth estimated in equation (3) In formula (4), calculatingSpatial spectral divergence in direction, i.e.:。
The voice activity detection module 406 is configured to detect voice activity based on the calculated spatial spectral divergence. In one embodiment, a detection threshold is preset, with H1 denoting the "speech signal present" state and H0 the "no speech signal" state. When the divergence is not below the threshold, the signal energy is distributed over the whole observation space and no speech signal is present; when the divergence falls below the threshold, the signal energy is concentrated at the estimated bearing and a speech signal is present; see formula (5) above.
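The threshold test of formula (5) then reduces to a single comparison; following the description above, a divergence below the preset threshold indicates the H1 (speech present) state.

```python
def detect_voice_activity(divergence, threshold):
    """Formula (5) sketch: energy concentrated at the estimated
    bearing gives a small divergence, so H1 (speech present) is
    declared when the divergence falls below the detection
    threshold, H0 (no speech) otherwise."""
    return divergence < threshold
```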
The sound source position update module 408 is configured to, upon detection of voice activity, update the tracked sound source bearing based on the estimated instantaneous sound source bearing. The tracked bearing is initialized at the start of the sound source detection process, for example to 0°, and is thereafter updated with the instantaneous bearing calculated in each time period. Upon detection of voice activity, the bearing is updated according to formula (6) above, in which the weighting coefficient is a constant.
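The bearing update of formula (6) can be sketched as a first-order recursive smoother. The patent's weighting constant and exact expression are not reproduced here, so the form below, including the interpolation on the unit circle, is an assumption.

```python
import math

def update_bearing(theta_prev, theta_inst, alpha=0.1):
    """Formula (6) sketch: recursive smoothing of the tracked bearing
    with the new instantaneous estimate.  The mixing constant
    0 < alpha < 1 is assumed; interpolating on the unit circle
    avoids spurious jumps across the 0/360-degree boundary."""
    tp, ti = math.radians(theta_prev), math.radians(theta_inst)
    s = (1.0 - alpha) * math.sin(tp) + alpha * math.sin(ti)
    c = (1.0 - alpha) * math.cos(tp) + alpha * math.cos(ti)
    return math.degrees(math.atan2(s, c)) % 360.0
```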
The beamforming module 410 is configured to calculate adaptive beamforming weighting coefficients for the microphone array based on the updated sound source bearing; adaptive beamforming with the microphone array constitutes the pickup process. In one embodiment, the adaptive beamforming weighting coefficients are calculated using a beam pattern distortion constraint, specifically as follows:
A feature matrix A is constructed, and its value is computed by solving the convex optimization problem of formula (7) above, in which a constant greater than 0 and less than 2 represents the beam pattern distortion constraint factor, the assumed steering vector is calculated from the sound source bearing, and the trace of the matrix is taken. Formula (7) can be solved by the interior-point method, with the computational cost stated above.
The rank of A is then examined: when rank(A) > 1, a rank-1 decomposition of the feature matrix A is performed according to formula (8) to estimate the steering vector a; when rank(A) = 1, eigenvalue decomposition of A yields the steering vector a.
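The steering-vector recovery just described can be sketched as follows. Formula (8) is only referenced above, so the principal-component surrogate used here, and the function name, are assumptions.

```python
import numpy as np

def estimate_steering_vector(A):
    """Recover the steering vector a from the Hermitian feature
    matrix A.  When rank(A) = 1, the principal eigenvector scaled
    by the square root of its eigenvalue reproduces A exactly;
    when rank(A) > 1, the same principal component is the usual
    rank-1 approximation (an assumed surrogate for the patent's
    formula (8), which is not reproduced here)."""
    w, v = np.linalg.eigh(A)   # eigenvalues in ascending order
    return v[:, -1] * np.sqrt(max(w[-1].real, 0.0))
```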
The adaptive beamforming weighting coefficients are then calculated according to formula (9) above, with a as the steering vector and with the covariance matrix of the N valid snapshot data given by formula (10) above.
In one embodiment, a speech signal enhancement module is further included, which enhances the speech signal with the updated beamforming coefficients; see formula (11) above. As those skilled in the art will appreciate, beamforming algorithms other than the beam pattern distortion constraint may also be employed, such as conventional beamforming, Capon beamforming, diagonal loading, and worst-case performance optimization beamforming.
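The weight calculation of formulas (9)-(10) and the enhancement of formula (11) can be sketched as follows. Since those formulas are only referenced, not reproduced, the MVDR-style weight expression, the diagonal `loading` term, and all function names here are assumptions rather than the patent's exact method.

```python
import numpy as np

def sample_covariance(X):
    """Formula (10) sketch: sample covariance of N snapshot
    vectors, X of shape (num_mics, N)."""
    return (X @ X.conj().T) / X.shape[1]

def mvdr_weights(R, a, loading=1e-6):
    """Formula (9) sketch, assuming the standard minimum-variance
    distortionless-response form w = R^{-1} a / (a^H R^{-1} a);
    the small diagonal loading keeps the inverse well conditioned."""
    Rl = R + loading * np.eye(R.shape[0])
    Ra = np.linalg.solve(Rl, a)
    return Ra / (a.conj() @ Ra)

def beamform(w, X):
    """Formula (11) sketch: enhanced output y[n] = w^H x[n]."""
    return w.conj() @ X
```

By construction the weights satisfy the distortionless constraint w^H a = 1, so the signal arriving from the steered bearing passes through with unit gain.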
Fig. 5 illustrates an example system 500 that includes an example computing device 510 that represents one or more systems and/or devices that may implement the various techniques described herein. Computing device 510 may be, for example, a server of a service provider, a device associated with a client (e.g., a client device), a system on a chip, and/or any other suitable computing device or computing system. The microphone array based sound source tracking and pickup apparatus 400 described above with respect to fig. 4 may take the form of a computing device 510. Alternatively, the microphone array based sound source tracking and pickup apparatus 400 may be implemented as a computer program in the form of a sound source tracking and pickup application 516.
The example computing device 510 as illustrated includes a processing system 511, one or more computer-readable media 512, and one or more I/O interfaces 513 communicatively coupled to each other. Although not shown, the computing device 510 may also include a system bus or other data and command transfer system that couples the various components to one another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. Various other examples are also contemplated, such as control and data lines.
The computer-readable medium 512 is illustrated as including a memory/storage device 515. Memory/storage 515 represents memory/storage capacity associated with one or more computer-readable media. The memory/storage 515 may include volatile media (such as Random Access Memory (RAM)) and/or nonvolatile media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). The memory/storage 515 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) as well as removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). The computer-readable medium 512 may be configured in various other ways as further described below.
One or more I/O interfaces 513 represent functionality that allows a user to enter commands and information to computing device 510, and optionally also allows information to be presented to the user and/or other components or devices using various input/output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice input), a scanner, touch functionality (e.g., capacitive or other sensors configured to detect physical touch), and a camera (e.g., which may employ visible or invisible wavelengths such as infrared to detect motion not involving touch as gestures). Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, a haptic response device, and so forth. Accordingly, the computing device 510 may be configured in various ways to support user interaction, as described further below.
The computing device 510 also includes a sound source tracking and pickup application 516. The sound source tracking and pickup application 516 may be, for example, a software instance of the microphone array based sound source tracking and pickup apparatus 400 described with respect to fig. 4, and in combination with other elements in the computing device 510 implement the techniques described herein.
Various techniques may be described herein in the general context of software, hardware elements, or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
An implementation of the described modules and techniques may be stored on or transmitted across some form of computer readable media. Computer readable media can include a variety of media that can be accessed by computing device 510. By way of example, and not limitation, computer-readable media may comprise "computer-readable storage media" and "computer-readable signal media".
"computer-readable storage medium" refers to a medium and/or device, and/or a tangible storage apparatus, capable of persistently storing information, as opposed to mere signal transmission, carrier wave, or signal per se. Accordingly, computer-readable storage media refers to non-signal bearing media. Computer-readable storage media include hardware such as volatile and nonvolatile, removable and non-removable media and/or storage devices implemented in a method or technology suitable for storage of information such as computer-readable instructions, data structures, program modules, logic elements/circuits or other data. Examples of computer readable storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage devices, tangible media, or an article of manufacture suitable for storing the desired information and accessible by a computer.
"computer-readable signal medium" refers to a signal-bearing medium configured to transmit instructions to hardware of computing device 510, such as via a network. Signal media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave, data signal or other transport mechanism. Signal media also includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
As previously described, hardware element 514 and computer-readable medium 512 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware form that may be used in some embodiments to implement at least some aspects of the techniques described herein. The hardware elements may include integrated circuits or systems-on-chips, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), Complex Programmable Logic Devices (CPLDs), and other implementations in silicon or components of other hardware devices. In this context, a hardware element may serve as a processing device that performs program tasks defined by instructions, modules, and/or logic embodied by the hardware element, as well as a hardware device for storing instructions for execution, such as the computer-readable storage medium described previously.
Combinations of the foregoing may also be used to implement the various techniques and modules described herein. Thus, software, hardware, or program modules and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage medium and/or by one or more hardware elements 514. The computing device 510 may be configured to implement particular instructions and/or functions corresponding to software and/or hardware modules. Thus, modules implemented as software executable by the computing device 510 may also be realized at least partially in hardware, for example using computer-readable storage media and/or hardware elements 514 of the processing system. The instructions and/or functions may be executable/operable by one or more articles of manufacture (e.g., one or more computing devices 510 and/or processing systems 511) to implement the techniques, modules, and examples described herein.
In various implementations, the computing device 510 may assume a variety of different configurations. For example, the computing device 510 may be implemented as a computer-like device including a personal computer, a desktop computer, a multi-screen computer, a laptop computer, a netbook, and so forth. The computing device 510 may also be implemented as a mobile device-like device including mobile devices such as mobile telephones, portable music players, portable gaming devices, tablet computers, multi-screen computers, and the like. Computing device 510 may also be implemented as a television-like device that includes devices with or connected to a generally larger screen in a casual viewing environment. These devices include televisions, set-top boxes, game consoles, and the like.
The techniques described herein may be supported by these various configurations of computing device 510 and are not limited to specific examples of the techniques described herein. Functionality may also be implemented in whole or in part on the "cloud" 520 through the use of a distributed system, such as through the platform 522 described below.
The platform 522 may abstract resources and functionality to connect the computing device 510 with other computing devices. The platform 522 may also serve to abstract the scaling of resources, providing a corresponding level of scale for the demand encountered for the resources 524 implemented via the platform 522. Thus, in interconnected device embodiments, implementation of functions described herein may be distributed throughout the system 500. For example, the functionality may be implemented in part on the computing device 510 and in part by the platform 522 that abstracts the functionality of the cloud 520.
It should be understood that embodiments of the disclosure have been described with reference to different functional blocks for clarity. However, it will be apparent that the functionality of each functional module may be implemented in a single module, in multiple modules, or as part of other functional modules without departing from the disclosure. For example, functionality illustrated to be performed by a single module may be performed by multiple different modules. Thus, references to specific functional blocks are only to be seen as references to suitable blocks for providing the described functionality rather than indicative of a strict logical or physical structure or organization. Thus, the present disclosure may be implemented in a single module or may be physically and functionally distributed between different modules and circuits.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various devices, elements, or components, these devices, elements, or components should not be limited by these terms. These terms are only used to distinguish one device, element, or component from another device, element, or component.
Although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present disclosure is limited only by the accompanying claims. Additionally, although individual features may be included in different claims, these may possibly advantageously be combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. The order of features in the claims does not imply any specific order in which the features must be worked. Furthermore, in the claims, the word "comprising" does not exclude other elements, and the indefinite article "a" or "an" does not exclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.
Claims (15)
1. A microphone array based sound source tracking and pickup method, comprising:
estimating instantaneous sound source orientation based on snapshot data received by the microphone array;
calculating the spatial spectral divergence of speech-signal energy at the estimated instantaneous sound source bearing;
2. The method of claim 1, wherein said estimating an instantaneous sound source bearing further comprises:
3. The method of claim 2, wherein said employing maximum likelihood estimation further comprises:
constructing a likelihood function with respect to an observation bearing and the center frequency of each sub-band of the operating band of the microphone array, wherein m = 0, 1 are the serial numbers of the array elements of the microphone array, the snapshot data received by each array element are used, c is the propagation speed of sound waves in air, the array element spacing of the microphone array is given, and the observation bearing belongs to a set of discrete observation bearings over the observation interval.
constructing a function of the instantaneous sound source bearing to calculate the spatial spectral divergence of speech-signal energy at that bearing, wherein a set of discrete observation bearings is obtained for the observation interval and the spatial spectrum of the speech signal in each observation direction is used.
6. The method of claim 5, wherein voice activity is determined to be detected when the spatial spectral divergence is less than the detection threshold; and no voice activity is determined to be detected when the spatial spectral divergence is greater than the detection threshold.
9. The method of claim 8, wherein said calculating adaptive beamforming weighting coefficients further comprises:
calculating the adaptive beamforming weighting coefficients using a beam pattern distortion constraint method, the method comprising, in order:
S101, constructing a feature matrix A and computing its value by solving the convex optimization problem of the following formula:
wherein a constant greater than 0 and less than 2 represents the beam pattern distortion constraint factor, the assumed steering vector is calculated from the sound source bearing, and the covariance matrix of the N valid snapshot data is used;
S102, determining the rank of A: when Rank(A) > 1, performing a rank-1 decomposition of the feature matrix A according to a formula to estimate a steering vector a; and
when Rank(A) = 1, performing eigenvalue decomposition on the matrix A to estimate the steering vector a;
10. The method of claim 2, wherein the time period is 25ms in length.
11. A method as claimed in any preceding claim wherein the microphone array is a dual microphone array.
12. A microphone array based sound source tracking and pickup apparatus comprising:
an instantaneous sound source bearing estimation module configured to estimate an instantaneous sound source bearing based on snapshot data received by the microphone array;
a spatial spectral divergence calculation module configured to calculate the spatial spectral divergence of speech-signal energy at the estimated instantaneous sound source bearing;
a voice activity detection module configured to detect voice activity based on the calculated spatial spectral divergence;
13. The apparatus of claim 12, wherein the spatial spectral divergence calculation module being configured to calculate the spatial spectral divergence further comprises:
constructing a function of the instantaneous sound source bearing to calculate the spatial spectral divergence of speech-signal energy at that bearing, wherein a set of discrete observation bearings is obtained for the observation interval and the spatial spectrum of the speech signal in each observation direction is used.
14. A non-transitory computer-readable medium storing computer program instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1-11.
15. A computing device comprising a processor and a memory having stored thereon a computer program configured to, when executed on the processor, cause the processor to perform the method of any of claims 1-11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910440423.8A CN111986692A (en) | 2019-05-24 | 2019-05-24 | Sound source tracking and pickup method and device based on microphone array |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910440423.8A CN111986692A (en) | 2019-05-24 | 2019-05-24 | Sound source tracking and pickup method and device based on microphone array |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111986692A true CN111986692A (en) | 2020-11-24 |
Family
ID=73436706
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910440423.8A Pending CN111986692A (en) | 2019-05-24 | 2019-05-24 | Sound source tracking and pickup method and device based on microphone array |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111986692A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105609100A (en) * | 2014-10-31 | 2016-05-25 | 中国科学院声学研究所 | Acoustic model training and constructing method, acoustic model and speech recognition system |
CN105874535A (en) * | 2014-01-15 | 2016-08-17 | 宇龙计算机通信科技(深圳)有限公司 | Speech processing method and speech processing apparatus |
CN106098075A (en) * | 2016-08-08 | 2016-11-09 | 腾讯科技(深圳)有限公司 | Audio collection method and apparatus based on microphone array |
CN106504763A (en) * | 2015-12-22 | 2017-03-15 | 电子科技大学 | Based on blind source separating and the microphone array multiple target sound enhancement method of spectrum-subtraction |
CN106952653A (en) * | 2017-03-15 | 2017-07-14 | 科大讯飞股份有限公司 | Noise remove method, device and terminal device |
WO2017129239A1 (en) * | 2016-01-27 | 2017-08-03 | Nokia Technologies Oy | System and apparatus for tracking moving audio sources |
US10096328B1 (en) * | 2017-10-06 | 2018-10-09 | Intel Corporation | Beamformer system for tracking of speech and noise in a dynamic environment |
CN108962272A (en) * | 2018-06-21 | 2018-12-07 | 湖南优浪语音科技有限公司 | Sound pick-up method and system |
2019-05-24: Application CN201910440423.8A filed (CN); publication CN111986692A; legal status: active, Pending
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112951273A (en) * | 2021-02-02 | 2021-06-11 | 郑州大学 | Digit control machine tool cutter wearing and tearing monitoring device based on microphone array and machine vision |
CN112951273B (en) * | 2021-02-02 | 2024-03-29 | 郑州大学 | Numerical control machine tool cutter abrasion monitoring device based on microphone array and machine vision |
CN113608167A (en) * | 2021-10-09 | 2021-11-05 | 阿里巴巴达摩院(杭州)科技有限公司 | Sound source positioning method, device and equipment |
CN113608167B (en) * | 2021-10-09 | 2022-02-08 | 阿里巴巴达摩院(杭州)科技有限公司 | Sound source positioning method, device and equipment |
WO2023108864A1 (en) * | 2021-12-15 | 2023-06-22 | 苏州蛙声科技有限公司 | Regional pickup method and system for miniature microphone array device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108122563B (en) | Method for improving voice awakening rate and correcting DOA | |
US10909988B2 (en) | Systems and methods for displaying a user interface | |
CN110931036B (en) | Microphone array beam forming method | |
US8583428B2 (en) | Sound source separation using spatial filtering and regularization phases | |
CN111986692A (en) | Sound source tracking and pickup method and device based on microphone array | |
WO2020108614A1 (en) | Audio recognition method, and target audio positioning method, apparatus and device | |
JP4799443B2 (en) | Sound receiving device and method | |
US20160192068A1 (en) | Steering vector estimation for minimum variance distortionless response (mvdr) beamforming circuits, systems, and methods | |
US8098842B2 (en) | Enhanced beamforming for arrays of directional microphones | |
CN109616136B (en) | Adaptive beam forming method, device and system | |
US7626889B2 (en) | Sensor array post-filter for tracking spatial distributions of signals and noise | |
US8005237B2 (en) | Sensor array beamformer post-processor | |
US20130082875A1 (en) | Processing Signals | |
CN110554357B (en) | Sound source positioning method and device | |
US10957338B2 (en) | 360-degree multi-source location detection, tracking and enhancement | |
US11218802B1 (en) | Beamformer rotation | |
CN106537501A (en) | Reverberation estimator | |
US11222646B2 (en) | Apparatus and method for generating audio signal with noise attenuated based on phase change rate | |
WO2022105571A1 (en) | Speech enhancement method and apparatus, and device and computer-readable storage medium | |
TW202125989A (en) | Switching method for multiple antenna arrays and electronic device applying the same | |
Luo et al. | Constrained maximum directivity beamformers based on uniform linear acoustic vector sensor arrays | |
CN113055071B (en) | Switching method of multiple groups of array antennas and electronic device applying same | |
US11508348B2 (en) | Directional noise suppression | |
CN111245490B (en) | Broadband signal extraction method and device and electronic equipment | |
CN113491137B (en) | Flexible differential microphone array with fractional order |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||