CN110858488A - Voice activity detection method, device, equipment and storage medium - Google Patents

Voice activity detection method, device, equipment and storage medium

Info

Publication number
CN110858488A
CN110858488A (application CN201810973780.6A)
Authority
CN
China
Prior art keywords
voice activity
activity detection
speaker
microphones
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810973780.6A
Other languages
Chinese (zh)
Inventor
刘章
余涛
刘礼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201810973780.6A priority Critical patent/CN110858488A/en
Publication of CN110858488A publication Critical patent/CN110858488A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/84Detection of presence or absence of voice signals for discriminating voice from noise


Abstract

The disclosure provides a voice activity detection method, device, equipment and storage medium. A sound source is localized according to signals received by at least some of the microphones in a microphone array so as to determine position information of a speaker; whether the speaker is located in a designated area is judged based on the position information; and, if the speaker is determined to be in the designated area, whether voice activity is present is further determined based on one or more voice activity detection modes. The false alarm rate can thereby be reduced and the accuracy of voice activity detection improved.

Description

Voice activity detection method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of voice activity detection technologies, and in particular, to a method, an apparatus, a device, and a storage medium for voice activity detection.
Background
Voice activity detection (VAD), also called voice endpoint detection, is the detection of the presence or absence of voice in a noisy environment. Voice activity detection is the basis of voice interaction and plays a vital role in voice wake-up, voice enhancement, and the like. When a device supporting a voice interaction function is in a highly noisy environment, how to distinguish noise from the voice of the target speaker so as to achieve voice activity detection is a problem that currently needs to be solved.
For example, with the development of artificial intelligence voice technology, many traditional devices, such as subway ticket vending machines, have an increasingly strong demand for human-computer voice interaction. However, to successfully apply voice interaction technology in a subway station ticket-purchasing scene, a highly noisy environment must be handled, where the noise includes, but is not limited to, babble noise from crowds of people speaking, interference from speakers around the ticket buyer, noise generated by crowd movement, mechanical noise from subway locomotive movement, loudspeaker interference, and the like. In such an acoustic environment, existing schemes cannot effectively distinguish the target human voice from the noise, and therefore cannot accurately detect the voice activity state.
Disclosure of Invention
It is an object of the present disclosure to propose a voice activity detection scheme that can effectively distinguish between a target person's voice and noise also in a noisy environment.
According to a first aspect of the present disclosure, a voice activity detection method is provided, including: positioning a sound source according to signals received by at least part of the microphones in a microphone array so as to determine position information of a speaker; judging whether the speaker is located in a designated area based on the position information; and, if it is determined that the speaker is located in the designated area, further determining whether voice activity is present based on one or more voice activity detection modes.
Optionally, the step of performing sound source localization comprises: acquiring a covariance matrix of signals received by at least part of microphones; performing eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues; selecting a first number of maximum eigenvalues from the plurality of eigenvalues, and forming a signal subspace based on eigenvectors corresponding to the selected eigenvalues, wherein the first number is equivalent to the estimated number of sound sources; and determining location information of the speaker based on the signal subspace.
Optionally, the position information includes an azimuth angle and a distance, the azimuth angle is an azimuth angle of the speaker in a coordinate system in which the at least part of the microphones are located, and the distance is a distance between the speaker and a center position of the at least part of the microphones.
Optionally, the step of determining the location information of the speaker comprises: determining a maximum response of the signal in a two-dimensional space based on the signal subspace; and determining the position information of the speaker based on the arrival direction corresponding to the maximum response.
Optionally, the designated area is a convex polygon formed by a plurality of vertices, and the step of determining whether the speaker is located in the designated area includes: taking one of the vertices as an endpoint, constructing rays that pass through each of the other vertices; finding, by binary search, the two rays adjacent to the target point corresponding to the position information, one on each side; and, in the case that two such rays are found, judging whether the target point lies on the side, nearer to the endpoint, of the line segment formed by the two vertices through which the found rays pass, and, if it does, judging that the target point is located in the designated area.
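As an illustrative sketch (not the patent's exact implementation), the vertex-fan test described above — rays from one vertex, binary search for the wedge containing the target point, then a single side-of-segment check — can be written as follows; the helper name and the counter-clockwise vertex ordering are assumptions:

```python
def point_in_convex_polygon(point, vertices):
    """Test whether `point` lies inside a convex polygon.

    `vertices` must be ordered counter-clockwise. Rays from vertices[0]
    through the other vertices partition the plane into wedges; binary
    search finds the wedge containing `point`, then one cross-product
    test against the far edge decides. Boundary points count as inside.
    """
    def cross(o, a, b):
        # z-component of (a - o) x (b - o): >0 means b is left of ray o->a
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

    n = len(vertices)
    v0 = vertices[0]
    # Outside the wedge spanned by the first and last rays -> outside.
    if cross(v0, vertices[1], point) < 0 or cross(v0, vertices[n-1], point) > 0:
        return False
    # Binary search for the wedge between rays v0->v[lo] and v0->v[lo+1].
    lo, hi = 1, n - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if cross(v0, vertices[mid], point) >= 0:
            lo = mid
        else:
            hi = mid
    # Inside iff the point is on the near (v0) side of edge v[lo]-v[lo+1].
    return cross(vertices[lo], vertices[lo+1], point) >= 0

square = [(0, 0), (1, 0), (1, 1), (0, 1)]   # unit square, CCW
print(point_in_convex_polygon((0.5, 0.5), square))  # True
print(point_in_convex_polygon((1.5, 0.5), square))  # False
```

The binary search makes each query O(log n) in the number of vertices, which matches the bisection idea described above.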
Optionally, the step of further determining whether voice activity is present based on one or more voice activity detection modes comprises: respectively judging whether voice activity exists by using at least two voice activity detection modes; and determining whether voice activity exists based on the judgment results of the at least two voice activity detection modes.
Optionally, the step of determining whether voice activity is present comprises: and determining that the voice activity exists under the condition that the judgment results of the at least two voice activity detection modes are that the voice activity exists.
Optionally, the one or more voice activity detection modes include: a first voice activity detection mode, configured to determine whether voice activity exists based on the spatial entropy of the signals received by the at least part of the microphones; and/or a second voice activity detection mode, configured to determine whether voice activity exists based on a neural network model.
Optionally, the first voice activity detection manner includes: acquiring a covariance matrix of signals received by at least part of microphones; performing eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues; the plurality of feature values are analyzed to determine whether voice activity is present.
Optionally, the step of analyzing the plurality of feature values comprises: normalizing the plurality of characteristic values; calculating the spatial entropy of a plurality of values obtained after normalization processing; and judging whether voice activity exists or not based on the comparison result of the spatial entropy and a preset threshold value.
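A minimal sketch of this spatial-entropy test, assuming natural-log entropy normalized by its maximum log M and an illustrative threshold of 0.8 (the patent does not fix these values): a single dominant source concentrates energy in a few eigenvalues (low entropy), while diffuse noise spreads it evenly (high entropy).

```python
import numpy as np

def spatial_entropy_vad(eigenvalues, threshold=0.8):
    """Decide voice activity from covariance eigenvalues.

    Normalizes the eigenvalues into a probability distribution,
    computes its entropy (scaled to [0, 1] by log M), and compares
    it with a preset threshold. Threshold and scaling are assumptions.
    """
    lam = np.asarray(eigenvalues, dtype=float)
    p = lam / lam.sum()                        # normalization step
    entropy = -np.sum(p * np.log(p + 1e-12))   # spatial entropy
    entropy /= np.log(len(lam))                # scale so uniform -> 1.0
    return entropy < threshold, entropy        # low entropy => voice present

active, h1 = spatial_entropy_vad([9.0, 0.5, 0.3, 0.2])   # one dominant source
diffuse, h2 = spatial_entropy_vad([1.0, 1.0, 1.0, 1.0])  # diffuse noise field
print(active, diffuse)  # True False
```

The comparison direction (voice when entropy is *below* the threshold) follows from the reasoning above: a directional speaker makes the eigenvalue distribution peaky.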
Optionally, the second voice activity detection manner includes: and predicting audio data acquired based on at least part of the microphones by using a voice activity detection model to judge whether voice activity exists, wherein the voice activity detection model is a neural network model and is used for predicting the voice activity state of the input audio data.
According to a second aspect of the present disclosure, there is also provided a voice activity detection method, comprising: acquiring a covariance matrix of signals received by at least part of microphones in a microphone array; performing eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues; and analyzing the plurality of characteristic values to obtain a first voice activity detection result.
Optionally, the step of analyzing the plurality of feature values comprises: normalizing the plurality of characteristic values; calculating the spatial entropy of a plurality of values obtained after normalization processing; and judging whether voice activity exists or not based on the comparison result of the spatial entropy and a preset threshold value.
Optionally, the voice activity detection method further comprises: predicting the audio data received by the microphone by using a voice activity detection model to obtain a second voice activity detection result, wherein the voice activity detection model is used for predicting the voice activity state of the input audio data; and determining whether voice activity is present based on the first voice activity detection result and the second voice activity detection result.
According to a third aspect of the present disclosure, there is also provided a voice activity detection method, including: carrying out sound source positioning according to signals received by at least part of microphones in a microphone array so as to determine the position information of a speaker, and judging whether the speaker is located in a designated area or not based on the position information so as to obtain a first voice activity detection result; determining whether voice activity exists based on spatial entropy of signals received by at least part of the microphones to obtain a second voice activity detection result, and/or determining whether voice activity exists based on a neural network model to obtain a third voice activity detection result; and determining whether voice activity is present based on the first voice activity detection result, the second voice activity detection result, and/or the third voice activity detection result.
According to a fourth aspect of the present disclosure, there is also provided a voice activity detection apparatus comprising: the positioning module is used for positioning a sound source according to signals received by at least part of microphones in the microphone array so as to determine the position information of a speaker; the judging module is used for judging whether the speaker is positioned in the designated area or not based on the position information; and the voice activity detection module is used for further judging whether voice activity exists based on one or more voice activity detection modes under the condition that the speaker is judged to be positioned in the specified area.
Optionally, the positioning module comprises: the covariance matrix acquisition module is used for acquiring a covariance matrix of signals received by at least part of microphones; the characteristic decomposition module is used for carrying out characteristic value decomposition on the covariance matrix to obtain a plurality of characteristic values; the signal subspace determining module is used for selecting a first number of maximum eigenvalues from the plurality of eigenvalues and forming a signal subspace based on eigenvectors corresponding to the selected eigenvalues, wherein the first number is equivalent to the estimated number of the sound sources; and the positioning submodule is used for determining the position information of the speaker based on the signal subspace.
Optionally, the position information includes an azimuth angle and a distance, the azimuth angle is an azimuth angle of the speaker in a coordinate system in which the at least part of the microphones are located, and the distance is a distance between the speaker and a center position of the at least part of the microphones.
Optionally, the positioning sub-module comprises: the maximum response determining module is used for determining the maximum response of the signal in the two-dimensional space based on the signal subspace; and the sound source position determining module is used for determining the position information of the speaker based on the arrival direction corresponding to the maximum response.
Optionally, the designated area is a convex polygon formed by a plurality of vertices, and the determining module includes: a construction module, configured to construct, with one of the plurality of vertices as an endpoint, a ray passing through each of the other vertices; a searching module, configured to find, by binary search, the adjacent rays on the two sides of the target point corresponding to the position information; and a judging submodule, configured to judge, in the case that two such rays are found, whether the target point lies on the side, nearer to the endpoint, of the line segment formed by the two vertices through which the found rays pass, and to judge that the target point is located in the designated area if it does.
Optionally, the voice activity detection module uses at least two voice activity detection modes to respectively determine whether voice activity exists, and determines whether voice activity exists based on determination results of the at least two voice activity detection modes.
Optionally, the voice activity detection module determines that voice activity exists when the determination result of the at least two voice activity detection modes is that voice activity exists.
Optionally, the voice activity detection module comprises: the covariance matrix acquisition module is used for acquiring a covariance matrix of signals received by at least part of microphones; the characteristic decomposition module is used for carrying out characteristic value decomposition on the covariance matrix to obtain a plurality of characteristic values; and the analysis module is used for analyzing the characteristic values to judge whether voice activity exists or not.
Optionally, the analysis module comprises: the normalization module is used for performing normalization processing on the characteristic values; the spatial entropy calculation module is used for calculating spatial entropies of a plurality of values obtained after normalization processing; and the comparison and judgment module is used for judging whether voice activity exists or not based on the comparison result of the spatial entropy and the preset threshold value.
Optionally, the voice activity detection module predicts audio data acquired based on at least part of the microphone using a voice activity detection model for predicting a voice activity state of the input audio data to determine whether voice activity is present.
According to a fifth aspect of the present disclosure, there is also provided a voice activity detection apparatus comprising: the covariance matrix acquisition module is used for acquiring a covariance matrix of signals received by at least part of microphones in the microphone array; the characteristic decomposition module is used for decomposing the characteristic value of the covariance matrix to obtain a plurality of characteristic values; and the analysis module is used for analyzing the characteristic values to obtain a first voice activity detection result.
Optionally, the analysis module comprises: the normalization module is used for performing normalization processing on the characteristic values; the spatial entropy calculation module is used for calculating spatial entropies of a plurality of values obtained after normalization processing; and the judging module is used for judging whether voice activity exists or not based on the comparison result of the spatial entropy and the preset threshold value.
Optionally, the voice activity detecting apparatus further includes: the voice activity detection module is used for predicting the audio data received by the microphone by using a voice activity detection model to obtain a second voice activity detection result, wherein the voice activity detection model is used for predicting the voice activity state of the input audio data; and a determination module to determine whether voice activity is present based on the first voice activity detection result and the second voice activity detection result.
According to a sixth aspect of the present disclosure, there is also provided a voice activity detection apparatus comprising: the first detection module is used for positioning a sound source according to signals received by at least part of microphones in the microphone array so as to determine the position information of a speaker, and judging whether the speaker is located in a specified area or not based on the position information so as to obtain a first voice activity detection result; the second detection module is used for judging whether voice activity exists or not based on the spatial entropy of the signals received by at least part of the microphones so as to obtain a second voice activity detection result, and/or the third detection module is used for judging whether voice activity exists or not based on the neural network model so as to obtain a third voice activity detection result; and a determining module for determining whether voice activity is present based on the first voice activity detection result, the second voice activity detection result, and/or the third voice activity detection result.
According to a seventh aspect of the present disclosure, there is also provided an apparatus for supporting a voice interaction function, including: a microphone array for receiving a sound input; and the terminal processor is used for positioning a sound source according to signals received by at least part of microphones in the microphone array so as to determine the position information of the speaker, judging whether the speaker is positioned in a specified area or not based on the position information, and further judging whether voice activity exists or not based on one or more voice activity detection modes under the condition that the speaker is positioned in the specified area.
Optionally, the apparatus further comprises: and the communication module is used for sending the audio data received by the microphone array to a server under the condition that the terminal processor judges that voice activity exists.
Optionally, the terminal processor wakes up the device in case it is determined that there is voice activity to provide the user with voice interaction functionality.
Optionally, the device is a device adapted for voice interaction by a user in a specified area at a distance from said device.
Optionally, the device is any one of: a ticket purchasing machine; a robot; an automobile.
According to an eighth aspect of the present disclosure, there is also provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as set forth in any one of the first to third aspects of the disclosure.
According to a ninth aspect of the present disclosure, there is also provided a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform a method as set forth in any one of the first to third aspects of the present disclosure.
The method and the device can realize the VAD function in a noisy environment, so that when a user uses equipment supporting a voice interaction function, such as a subway ticket vending machine, the equipment does not need to be specially awakened and can automatically sense the user's voice activation state. In addition, in a highly noisy environment such as a subway, a large amount of noise could otherwise be misjudged as voice even when no speaker is present in the designated area; the present scheme further reduces this false alarm rate and improves the accuracy of voice activity detection.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
Fig. 1 is a schematic diagram showing a subway ticket purchase scenario.
Fig. 2 is a flow chart illustrating voice activity detection.
Fig. 3 is a schematic diagram showing sound propagation in a near-field model.
Fig. 4 is a schematic diagram showing a positional relationship of the target point and the convex polygon.
Fig. 5 is a block diagram illustrating a structure of an apparatus for supporting a voice interaction function according to an embodiment of the present disclosure.
Fig. 6 is a schematic block diagram illustrating the structure of a voice activity detection apparatus according to an embodiment of the present disclosure.
Fig. 7 is a schematic structural diagram showing functional modules that the positioning module in fig. 6 may have.
Fig. 8 is a schematic structural diagram showing functional modules that the determination module in fig. 6 may have.
Fig. 9 is a schematic diagram showing the structure of functional modules that the voice activity detection module in fig. 6 may have.
Fig. 10 is a schematic block diagram showing the structure of a voice activity detection apparatus according to another embodiment of the present disclosure.
Fig. 11 is a schematic block diagram illustrating the structure of a voice activity detection apparatus according to another embodiment of the present disclosure.
FIG. 12 is a schematic block diagram illustrating the structure of a computing device in accordance with an embodiment of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
[ scheme overview ]
As shown in fig. 1, taking a ticket vending machine in a subway station, a train station, or a similar scene as an example, deploying a voice interaction function on the ticket vending machine means coping with a highly noisy environment. Besides the voice of the target speaker (i.e. the ticket buyer), the sound received by the ticket vending machine also includes interference from other speakers around the ticket buyer (such as interfering speaker 1, interfering speaker 2 and interfering speaker 3). In addition, environmental noise may include noise generated by crowd movement, mechanical noise from locomotive movement, loudspeaker interference, and the like.
For the acoustic environment, the existing scheme cannot effectively distinguish the target human voice from the noise, so that the activity state of the voice cannot be accurately detected.
In view of the above, the inventors of the present disclosure have found through intensive study that, in an application scenario like a ticket vending machine, the voice activation state can be determined by judging whether the sound source position is within a defined fixed area. For example, if the sound source position is determined to be within a square area in front of the ticket vending machine, the sound source may be considered to come from the ticket buyer, and the voice may thus be judged to be in an activated state. In addition, considering the highly noisy environment, a large amount of noise may still be erroneously judged as speech when there is no speaker in the defined area. The false alarm rate therefore needs to be further reduced, for which purpose the present disclosure proposes combining one or more VAD schemes.
FIG. 2 is a flow chart illustrating an overall implementation of the voice activity detection scheme of the present disclosure.
As shown in fig. 2, the voice activity detection scheme of the present disclosure mainly includes three parts, namely, sound source location estimation, region discrimination, and Voice Activity Detection (VAD). Signals collected by the microphone array can be firstly converted into digital audio data through analog-to-digital conversion (ADC), sound source position estimation can be carried out according to the audio data, the obtained position information can be used for carrying out region judgment, a preliminary voice activity detection result can be obtained based on the judgment result, and if the sound source is judged to be located in a specified region, the voice can be judged to be in an activated state. In case it is determined that the speech is in an active state, the speech activity state may be further detected based on one or more VAD schemes, whereby a final speech activity detection result may be output.
Based on the method and the device, the VAD function can be realized in a noisy environment, so that when a user uses equipment supporting a voice interaction function, such as a subway ticket vending machine, the equipment does not need to be specially awakened and can automatically sense the user's voice activation state. In addition, in a highly noisy environment such as a subway, a large amount of noise could otherwise be misjudged as voice even when no speaker is present in the designated area; the present scheme further reduces this false alarm rate and improves the accuracy of voice activity detection.
The following further describes aspects of the present disclosure.
[ Sound Source position estimation ]
A microphone array may be provided on a device that supports voice interaction functions, such as a ticket purchaser, for receiving nearby sound input.
A microphone array is a group of omnidirectional microphones arranged at different spatial positions according to a certain geometric rule; it spatially samples propagating sound, so that the acquired signals contain spatial position information. According to its topology, a microphone array can be classified as a linear array, a planar array, a volumetric array, and so on. Depending on the distance between the sound source and the microphone array, the sound field can be modeled as near field or far field. The near-field model treats sound waves as spherical waves and takes into account the amplitude differences between the signals received by the microphone elements; the far-field model treats sound waves as plane waves, ignores the amplitude differences between the array elements' received signals, and approximately considers them to be related by simple time delays.
Sound source localization may be performed based on signals received by at least some of the microphones of the microphone array to determine positional information of the speaker.
The determined location information may be two-dimensional location coordinates of the speaker or an azimuth angle and a distance of the speaker relative to the at least some of the microphones. The azimuth angle is the azimuth angle of the speaker in the coordinate system where the at least part of the microphones are located, and the distance is the distance between the speaker and the central positions of the at least part of the microphones.
As an example of the present disclosure, sound source localization may be performed using the MUSIC (MUltiple SIgnal Classification) algorithm according to the signals received by some or all of the microphones in the microphone array. The basic idea of the MUSIC algorithm is to perform eigenvalue decomposition on the covariance matrix of the array output data to obtain a signal subspace corresponding to the signal components and a noise subspace orthogonal to it, and then to estimate the parameters of the signal (incident direction, polarization information and signal strength) using the orthogonality of the two subspaces. For example, the orthogonality of the two subspaces can be used to form a spatial scanning spectrum, and a global search for spectral peaks then yields the parameter estimates.
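A simplified far-field sketch of the MUSIC idea for a linear array follows (the patent's own localization uses a near-field model and a two-dimensional search over angle and distance; the array geometry, frequency, and scan grid below are illustrative assumptions):

```python
import numpy as np

def music_spectrum(X, n_sources, mic_pos, freq, c=343.0,
                   angles=np.linspace(-90.0, 90.0, 181)):
    """Narrowband MUSIC pseudo-spectrum for a linear microphone array.

    X: (M, T) matrix of per-microphone STFT values at frequency bin
    `freq` over T frames. The pseudo-spectrum peaks where the steering
    vector is (nearly) orthogonal to the noise subspace.
    """
    M, T = X.shape
    R = X @ X.conj().T / T                  # sample covariance R(f)
    w, U = np.linalg.eigh(R)                # eigenvalues in ascending order
    Un = U[:, :M - n_sources]               # noise subspace (small eigenvalues)
    P = []
    for theta in np.deg2rad(angles):
        # far-field steering vector for mic positions (meters)
        a = np.exp(-2j * np.pi * freq * mic_pos * np.sin(theta) / c)
        P.append(1.0 / np.abs(a.conj() @ Un @ Un.conj().T @ a))
    return angles, np.asarray(P)

# Simulated check: one source at 30 degrees, 8 mics at 4 cm spacing, 1 kHz.
rng = np.random.default_rng(0)
pos = np.arange(8) * 0.04
s = rng.standard_normal(200) + 1j * rng.standard_normal(200)
a_true = np.exp(-2j * np.pi * 1000.0 * pos * np.sin(np.deg2rad(30.0)) / 343.0)
X = np.outer(a_true, s) + 0.01 * (rng.standard_normal((8, 200))
                                  + 1j * rng.standard_normal((8, 200)))
angles, P = music_spectrum(X, 1, pos, 1000.0)
print(angles[np.argmax(P)])                 # spectral peak near 30 degrees
```

The global peak search over `angles` corresponds to the spatial scanning spectrum described above; a near-field variant would instead scan an (angle, distance) grid with spherical-wave steering vectors.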
Taking the application of the microphone array of the present disclosure to a ticket vending machine as an example, the microphone array may be a linear array, and the sound field may be regarded as a near-field model. In the near field, the time difference of arrival of the sound source signal at each microphone, τ, varies not only with angle but also with distance, unlike in the far field. As shown in Fig. 3, let the distances from the target speaker to the microphones in the array be R_1, R_2, …, R_{N−1}, R_N, and let the propagation speed of sound in air be C. The time difference of the sound wave arriving at the i-th microphone relative to the 1st microphone is τ_i, where

τ_i = (R_i − R_1) / C
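As a numeric illustration of the near-field delay τ_i = (R_i − R_1) / C, with an assumed 4-microphone linear array and speaker position (geometry values are illustrative, not from the patent):

```python
import numpy as np

C = 343.0  # speed of sound in air, m/s
# 4-microphone linear array at 4 cm spacing along the x-axis (assumed geometry)
mics = np.array([[0.00, 0.0], [0.04, 0.0], [0.08, 0.0], [0.12, 0.0]])
speaker = np.array([0.30, 0.50])                 # near-field source position, meters
R = np.linalg.norm(mics - speaker, axis=1)       # distances R_1 ... R_4
tau = (R - R[0]) / C                             # delay of mic i relative to mic 1
print(tau)  # tau[0] is 0 by definition; mics closer to the source have tau < 0
```

Because the source is in the near field, these delays depend on the full geometry (both angle and distance), which is exactly why the near-field model cannot use the plane-wave approximation.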
the sound source localization process in the near-field model is described as follows.
A covariance matrix of signals received by at least some of the microphones of the microphone array may first be obtained. For example, the covariance matrix may be expressed as r (f), r (f) ═ E [ x (f)H]Wherein, x (f) is data of signals received by at least some microphones in the microphone array at different frequency points f after fourier transform (such as short-time fourier transform), and is frequency domain data. X (f) can be regarded as a vector, and each element in the vector represents data of a signal received by one microphone at different frequency points f after fourier transform. For example, X (f) can be represented as
X(f) = {X_1(f), X_2(f), …, X_M(f)}
where X_1(f), X_2(f), …, X_M(f) denote the Fourier-transformed (e.g., short-time Fourier-transformed) data of the signals received by the different microphones at frequency bin f, and M is the number of microphones. The expression X(f) actually implies a time variable t; the complete notation is X(f, t), representing the data over a time period t. E denotes the mathematical expectation, i.e., the mean. Taken over time t, E[X(f, t)X(f, t)^H] is
E[X(f, t)X(f, t)^H] = (1 / (N2 - N1)) · Σ_{t=N1}^{N2} X(f, t)X(f, t)^H
where N2 - N1 represents the length of the time period corresponding to X(f, t), N1 being the start time and N2 the end time.
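The frame-averaged covariance estimate above might be sketched as follows. The array-shape convention and the frame counts are assumptions for illustration.

```python
import numpy as np

def covariance_per_bin(X):
    """Sample covariance R(f) = (1/T) * sum_t X(f, t) X(f, t)^H for every
    frequency bin. X: complex STFT data of shape (M mics, F bins, T frames).
    The shape convention here is an assumption for illustration."""
    M, F, T = X.shape
    R = np.empty((F, M, M), dtype=complex)
    for f in range(F):
        frames = X[:, f, :]                    # (M, T) snapshot matrix
        R[f] = frames @ frames.conj().T / T    # average of outer products
    return R

# Fake STFT data: 4 mics, 8 bins, 100 frames.
rng = np.random.default_rng(1)
X = rng.standard_normal((4, 8, 100)) + 1j * rng.standard_normal((4, 8, 100))
R = covariance_per_bin(X)
print(R.shape, np.allclose(R[0], R[0].conj().T))   # (8, 4, 4) True
```

Each R(f) is Hermitian and positive semidefinite by construction, which is what the subsequent eigenvalue decomposition relies on.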
Eigenvalue decomposition is then performed on the covariance matrix to obtain a plurality of eigenvalues. A first number of the largest eigenvalues can be selected from the plurality of eigenvalues, and the eigenvectors corresponding to the selected eigenvalues form a signal subspace, while the eigenvectors corresponding to the remaining eigenvalues form a noise subspace. The first number equals the estimated number of sound sources; for example, when there are 3 sound source signals, the eigenvectors corresponding to the three largest eigenvalues can be taken to form the signal subspace. The estimated number of sound sources can be obtained from experience or by other estimation methods, which are not described here. After eigenvalue decomposition, R(f) = U_s(f)Σ_sU_s(f)^H + U_N(f)Σ_NU_N(f)^H, where U_s(f) is the signal subspace formed by the eigenvectors corresponding to the large eigenvalues, and U_N(f) is the noise subspace formed by the eigenvectors corresponding to the small eigenvalues. The subscripts S and N denote the two parts into which U is divided: S denotes signal and N denotes noise, so U_s represents the signal subspace and U_N the noise subspace. Σ denotes a diagonal matrix composed of eigenvalues. In fact, the eigenvalue decomposition of R(f) gives R(f) = U(f)ΣU(f)^H, where Σ has only main-diagonal elements, namely the eigenvalues obtained by the decomposition. By the magnitude of these main-diagonal elements (eigenvalues), U and Σ are partitioned into a large-eigenvalue class S (i.e., the signal subspace formed by the eigenvectors corresponding to the large eigenvalues) and a small-eigenvalue class N (i.e., the noise subspace formed by the eigenvectors corresponding to the remaining small eigenvalues), so that R(f) = U_s(f)Σ_sU_s(f)^H + U_N(f)Σ_NU_N(f)^H.
Based on the signal subspace, the sound source position can be determined. For example, the maximum response of a signal in a two-dimensional space may be determined based on the signal subspace, and the sound source position, i.e., the positional information of the speaker, may be determined based on the direction of arrival (DOA) to which the maximum response corresponds.
As an example, the response of the target signal in two-dimensional space is calculated as
S_(R,θ) = Σ_f || a(R, θ, f)^H · U_s(f) ||^2
where f ranges over the localization frequency band, and a(R, θ, f), the steering vector of the microphone array, can be obtained from the relative time differences τ. R is the distance from the sound source to the center of the microphone array, and θ is the azimuth of the sound source in the array coordinate system. Assuming the sound source is at position (R, θ), the relative time difference τ is defined as the difference between the time required for the sound to reach each microphone and the time required to reach the first microphone, τ = (τ_1, τ_2, …, τ_M), with τ_1 = 0. The steering vector at frequency f corresponding to position (R, θ) can then be obtained as a(R, θ, f) = (a_1, a_2, …, a_M), where
a_i = e^(-j·2πf·τ_i), i = 1, 2, …, M
The two-dimensional coordinates of the target speaker are (R_target, θ_target) = argmax_(R,θ) S_(R,θ). That is, the (R, θ) at which the response S_(R,θ) is maximal gives the position of the target speaker.
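The localization steps above (steering vectors built from the relative delays τ, a response accumulated over frequency, and a grid search for the maximum) can be sketched as below. The projection-onto-U_s form of the response and all array and grid values are assumptions for illustration, not the disclosure's exact formulation.

```python
import numpy as np

C = 343.0   # assumed speed of sound in air (m/s)

def steering_vector(mics, R, theta, f, c=C):
    """a(R, theta, f): element i is exp(-j*2*pi*f*tau_i), where tau_i is
    the near-field delay of microphone i relative to microphone 1."""
    src = np.array([R * np.cos(theta), R * np.sin(theta)])
    d = np.linalg.norm(mics - src, axis=1)    # R_1 .. R_M
    tau = (d - d[0]) / c                      # tau_1 = 0
    return np.exp(-2j * np.pi * f * tau)

def locate(mics, Us_per_f, freqs, R_grid, theta_grid):
    """Grid search for argmax_(R,theta) sum_f ||a(R,theta,f)^H U_s(f)||^2."""
    best, best_pos = -np.inf, None
    for R in R_grid:
        for theta in theta_grid:
            score = sum(
                np.linalg.norm(steering_vector(mics, R, theta, f).conj() @ Us) ** 2
                for f, Us in zip(freqs, Us_per_f))
            if score > best:
                best, best_pos = score, (R, theta)
    return best_pos

# Synthetic check: build U_s(f) directly from the true steering vectors,
# then confirm the grid search recovers the true (R, theta).
mics = np.array([[0.05 * i, 0.0] for i in range(4)])    # assumed 4-mic line
freqs = [500.0, 700.0, 900.0]
true_R, true_theta = 1.0, np.pi / 3
Us_per_f = []
for f in freqs:
    a = steering_vector(mics, true_R, true_theta, f)
    Us_per_f.append((a / np.linalg.norm(a))[:, None])
pos = locate(mics, Us_per_f, freqs, [0.5, 1.0, 1.5],
             [np.pi / 6, np.pi / 3, np.pi / 2])
print(pos)   # recovers (1.0, pi/3), the simulated speaker position
```

In practice the grid would be much finer and U_s(f) would come from the eigendecomposition of the measured covariance matrices.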
[Region Determination]
Based on the acquired position information, it is determined whether the speaker is located in the specified area.
The designated area mentioned herein may refer to the position where a user is located during voice interaction with the device. The designated area may be determined according to the specific application scenario, and its size may be obtained through field measurement or experience. For example, when the present disclosure is applied to a subway ticket vending machine, the designated area may be the location area where a ticket purchaser stands when operating the machine, such as a statistically determined area of predetermined shape and size in front of the machine.
Since the determined position information is the position of the speaker, and the designated area is the area where a user stands during voice interaction with the device, the voice activation state can be preliminarily judged by checking whether the speaker's position lies within the designated area. For example, when the speaker is located in the designated area, it may be preliminarily determined that the target speaker has uttered voice; when the speaker is not located in the designated area, it may be preliminarily determined that the target speaker has not.
One implementation of determining whether a speaker is located in a designated area based on location information is described below.
As shown in fig. 4, taking the application of the present disclosure to an automatic ticket vending machine supporting voice interaction as an example, a microphone array may be provided in the machine to receive surrounding sound input. The designated area may be the area enclosed by the dotted line in the figure; it may be measured in the field or obtained statistically, and may be represented as a convex polygon whose vertices are {P_0, P_1, …, P_N}.
Taking one of the vertices as an endpoint, rays through each of the other vertices may be constructed. The rays adjacent to the two sides of the target point corresponding to the position information are then found by bisection. When two such rays are found, it is determined whether the target point lies on the side, nearer the endpoint, of the line segment formed by the two vertices through which the two rays pass; if so, the target point lies within the designated area.
As shown in fig. 4, starting from P_0, a ray may be constructed to every other vertex of the convex polygon. Suppose the target point corresponding to the position information determined in step S110 lies inside the small triangle P_0P_iP_{i+1}; whether the target point is inside the convex polygon can then be determined by the following steps.
1. Find the rays immediately to the left and right of the target point using bisection. If the immediately-left ray is P_0P_1 or the immediately-right ray is P_0P_N, the target point is not inside the convex polygon; otherwise, the adjacent left and right rays P_0P_i and P_0P_{i+1} are found.
2. Judge whether the target point lies to the left of the ray P_iP_{i+1}. If so, the target point is inside the convex polygon; otherwise it is not.
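The two-step bisection test above can be sketched as follows. The counter-clockwise vertex ordering and the choice of counting boundary points as inside are assumptions.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive when b lies to the left
    of the ray from o through a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def in_convex_polygon(poly, p):
    """Bisection point-in-convex-polygon test following the two steps above.
    poly: vertices P0..PN in counter-clockwise order (assumed convention);
    boundary points are treated as inside."""
    n = len(poly)
    # Step 1: p must lie inside the fan bounded by rays P0P1 and P0P(n-1).
    if cross(poly[0], poly[1], p) < 0 or cross(poly[0], poly[n - 1], p) > 0:
        return False
    # Bisect to find the wedge P0-Pi-P(i+1) that contains p.
    lo, hi = 1, n - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if cross(poly[0], poly[mid], p) >= 0:
            lo = mid
        else:
            hi = mid
    # Step 2: inside iff p is on the left of the segment P(lo) -> P(lo+1).
    return cross(poly[lo], poly[lo + 1], p) >= 0

square = [(0, 0), (2, 0), (2, 2), (0, 2)]   # CCW convex polygon
print(in_convex_polygon(square, (1, 1)), in_convex_polygon(square, (3, 1)))
# prints: True False
```

The bisection makes the test O(log N) per query, which matters when the polygon has many vertices and positions are checked every frame.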
[VAD]
In the event that the speaker is determined to be located in the designated area, it may be further determined whether voice activity is present based on one or more voice activity detection modalities.
As shown in fig. 1, in a highly noisy environment such as a subway, even when there is no speaker in the designated area, it may be erroneously determined that there is one, due to the influence of nearby interfering speakers. Therefore, when it is determined that the speaker is located in the designated area, whether voice activity is actually present may be further determined based on a predetermined voice activity detection manner, to improve the accuracy of voice activity detection. The predetermined voice activity detection manner may be a single detection manner or a combination of multiple detection manners.
For example, a plurality of (at least two) different voice activity detection manners may be used to respectively determine whether voice activity exists, and determine whether voice activity exists based on voice activity detection results obtained by the different voice activity detection manners. For example, the voice activity can be determined to exist in the case that the voice activity detection results of different voice activity detection modes all indicate that the voice activity exists. Thus, the false alarm rate can be reduced by the multi-VAD detection method.
By way of example, the presence or absence of voice activity may be further determined by either or both of the following two voice activity detection approaches.
Manner one: voice activity detection based on spatial entropy
The sound signal received by the microphone array may contain the voice of the target speaker as well as ambient noise. Thus, voice activity detection may be performed based on the degree of disorder of the signal space of the sound signals received by at least some of the microphones of the microphone array. In the present disclosure, the degree of disorder of the signal space is characterized by spatial entropy: voice activity can be considered present when the spatial entropy is small, and absent when it is large.
As an example, a covariance matrix of the signals received by at least some microphones in the microphone array may first be obtained, and eigenvalue decomposition performed on it to obtain a plurality of eigenvalues. As described above, the subspace formed by the large eigenvalues may be regarded as the speech subspace, and that formed by the small eigenvalues as the noise subspace, so whether speech activity is present can be determined by analyzing the plurality of eigenvalues. For example, each eigenvalue can be regarded as one signal subspace (i.e., one signal source); the entropy of the plurality of eigenvalues (i.e., the spatial entropy) is calculated, and whether voice activity exists is judged from the calculated spatial entropy.
For example, the plurality of eigenvalues may be normalized, the spatial entropy of the normalized values calculated, and the spatial entropy compared with a predetermined threshold; whether voice activity exists is then determined from the comparison result. For instance, voice activity can be determined to be present if the spatial entropy is less than the predetermined threshold, and absent if it is greater than or equal to it. The value of the predetermined threshold may be set according to the actual situation; for example, it may be related to the selected localization frequency band, such as when the localization frequency band is 500-. The spatial entropy ES is
ES = -Σ_{i=1}^{N} p_i · log(p_i)
where p_i is the value obtained by normalizing the i-th eigenvalue, N is the number of eigenvalues obtained by the eigenvalue decomposition of the covariance matrix, and the base of the logarithm is any number greater than 1, such as 2, 10, or e, which is not limited in this disclosure.
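A sketch of the spatial-entropy VAD decision described above. The log base of 2 and the threshold value are illustrative assumptions, since the text notes both are scenario-dependent choices.

```python
import numpy as np

def spatial_entropy(eigenvalues, base=2.0):
    """ES = -sum_i p_i * log(p_i), p_i being the normalized eigenvalues.
    The log base (here 2) is an assumption; the text allows any base > 1."""
    lam = np.asarray(eigenvalues, dtype=float)
    p = lam / lam.sum()
    p = p[p > 0]                 # treat 0 * log(0) as 0
    return float(-(p * np.log(p) / np.log(base)).sum())

def spatial_entropy_vad(eigenvalues, threshold):
    """Return 1 (voice present) when the entropy is below the threshold;
    the threshold value itself is scenario-dependent, as noted above."""
    return int(spatial_entropy(eigenvalues) < threshold)

# One dominant eigenvalue (a directional source): low entropy, VAD fires.
print(spatial_entropy([100, 1, 1, 1]))      # well below 1 bit
# Flat eigenvalue spectrum (diffuse noise): maximal entropy log2(4) = 2.
print(spatial_entropy([1, 1, 1, 1]))
```

The two prints illustrate the intuition in the text: a single strong speaker concentrates energy in one subspace and drives the entropy down, while diffuse noise spreads it evenly and drives the entropy up.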
Manner two: voice activity detection based on a neural network model
A voice activity detection model may be used to predict on audio data acquired from at least some of the microphones of the microphone array, so as to determine whether voice activity is present. The voice activity detection model predicts the voice activity state of the input audio data; it may be a neural-network-based model and may be obtained by supervised machine learning.
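As a purely structural sketch of such a model's interface: the layer sizes, feature dimension, and the random (untrained) weights below are all assumptions; the disclosure assumes a model obtained by supervised training instead.

```python
import numpy as np

class TinyVADNet:
    """Structural sketch of a frame-wise neural VAD: one hidden layer and
    a sigmoid output thresholded at 0.5. All sizes are assumptions, and
    the weights here are random, i.e. untrained; the disclosure assumes a
    model obtained by supervised machine learning instead."""

    def __init__(self, n_features=40, n_hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((n_features, n_hidden)) * 0.1
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.standard_normal(n_hidden) * 0.1
        self.b2 = 0.0

    def predict(self, frames):
        """frames: (T, n_features) acoustic features -> T binary decisions."""
        h = np.tanh(frames @ self.W1 + self.b1)
        prob = 1.0 / (1.0 + np.exp(-(h @ self.w2 + self.b2)))
        return (prob > 0.5).astype(int)

model = TinyVADNet()
frames = np.random.default_rng(1).standard_normal((5, 40))  # fake features
print(model.predict(frames))   # one 0/1 voice-activity decision per frame
```

The point of the sketch is the shape of the interface: per-frame features in, a per-frame logical 0/1 decision out, which is the NN_t value consumed by the fusion step below.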
When further detecting whether voice activity exists, either one of the two manners may be used, or both may be used, with voice activity determined from the combined detection result. For example, the VAD detection result at time t may be expressed as VAD_t = ES_t · NN_t, where ES_t is the result obtained by the spatial-entropy-based manner and NN_t is the result obtained by the model-based manner. Each detection result may be a logical variable, i.e., represented by "0" and "1": for example, "1" may indicate that voice activity is present and "0" that it is absent, or vice versa.
So far, the implementation flow of the voice activity detection method of the present disclosure is described in detail with reference to fig. 1 to 4.
In summary, the voice activity detection scheme of the present disclosure mainly includes three parts: sound source position estimation, region determination, and VAD detection. The signals acquired by the microphone array can first be converted into digital signals by analog-to-digital conversion (ADC); sound source position estimation is then performed, region determination is applied to the obtained position information, and preliminary VAD information VL_t at a time t is obtained from the determination result. When the position information at time t falls within the designated region, VL_t can be taken to indicate that voice activity is present. Detection may then proceed further based on a predetermined voice activity detection manner; for example, the VAD information ES_t and NN_t at time t may be obtained by the two detection manners above. The voice activity detection result at time t is then VAD_t = VL_t · ES_t · NN_t. A voice activity detection result that is smooth in time can thus be obtained.
The present disclosure may be viewed as a multi-VAD information fusion scheme. The region determination may be regarded as a preliminary voice activity detection: the active state of the voice is preliminarily judged by whether the sound source position lies in the effective region, and when the voice is judged active, the state may be further verified by VAD (e.g., spatial-entropy-based VAD and/or neural-network-model-based VAD), so that the false alarm rate can be greatly reduced.
In one embodiment of the present disclosure, the voice activity detection mode of determining whether the sound source position is in the designated area and the voice activity detection mode based on the spatial entropy and/or the voice activity detection mode based on the neural network model may be performed in parallel. That is, sound source localization may be performed according to signals received by at least some of the microphones in the microphone array to determine location information of the speaker, and based on the location information, it is determined whether the speaker is located in a specified area, so as to obtain a first voice activity detection result. Meanwhile, whether voice activity exists or not can be judged based on the spatial entropy of the signals received by at least part of the microphones to obtain a second voice activity detection result, and/or whether voice activity exists or not can be judged based on a neural network model to obtain a third voice activity detection result. Finally, it may be determined whether voice activity is present based on the first voice activity detection result, the second voice activity detection result, and/or the third voice activity detection result.
Thus, the final voice activity detection result at time t may be expressed as VAD_t = VL_t · ES_t · NN_t, where VL_t denotes the first voice activity detection result, ES_t the second, and NN_t the third. A voice activity detection result that is smooth in time is thereby obtained.
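The fusion rule VAD_t = VL_t · ES_t · NN_t over logical variables can be sketched as:

```python
def fuse_vad(vl_t, es_t, nn_t):
    """VAD_t = VL_t * ES_t * NN_t over logical 0/1 variables: voice is
    declared only when the region check and both detectors agree."""
    return vl_t * es_t * nn_t

# Only the instant where all three results are 1 yields a detection.
decisions = [fuse_vad(*r) for r in [(1, 1, 1), (1, 1, 0), (0, 1, 1)]]
print(decisions)   # [1, 0, 0]
```

Multiplication of 0/1 variables is a logical AND, which is exactly why this fusion lowers the false alarm rate: every detector holds veto power.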
Fig. 5 is a block diagram illustrating a structure of an apparatus for supporting a voice interaction function according to an embodiment of the present disclosure.
The device shown in fig. 5 may be a voice interaction device deployed in a noisy environment. In particular, after deployment, the user is required, or is accustomed, to issue voice instructions from a designated area at a certain distance from the device in order to interact with it by voice.
For example, the device may be a ticket vending machine supporting voice interaction deployed in a noisy environment such as a subway or a train station. A user is usually located in a certain area in front of the machine when using it, which makes the scheme of judging whether the sound source position is in a designated area well suited. The device may also be any other device designed for voice interaction from a designated area at some distance from it, such as a car, a robot, or other devices that require the user to interact by voice at close range.
As shown in fig. 5, the device 500 includes a microphone array 510 and a terminal processor 520.
The microphone array 510 is used to receive sound input. The terminal processor 520 is configured to perform sound source localization according to signals received by at least some of the microphones in the microphone array to determine the position information of the speaker, determine based on the position information whether the speaker is located in a designated area, and, when the speaker is determined to be located in the designated area, further determine whether voice activity exists based on one or more voice activity detection manners. For details of the sound source localization, the region determination, and the voice activity detection, reference may be made to the description above, which is not repeated here.
When voice activity is determined to be present, the device may be awakened, for example by the terminal processor, to provide the voice interaction function to the user. In addition, for a frequently used device such as a ticket vending machine, the device may remain awake at all times, or remain awake for long stretches (for example, from six a.m. to ten p.m.). In that case, when voice activity is determined to be present, the audio data received by the microphone array 510 may be sent directly to a server for subsequent operations such as speech recognition and instruction issuing. The user can thus interact with the device by voice simply by standing in the designated area, without issuing a wake-up instruction, improving the user experience. Taking a subway ticket vending machine as an example, a user can stand in front of the machine and directly say that he or she wants to buy a subway ticket to a given station. The microphone array in the machine receives sound signals including this utterance and the surrounding noise; the terminal processor executes the voice activity detection method of the present disclosure, and when voice activity is judged to be present, the audio data received by the microphone array is sent to the server for subsequent operations such as speech recognition and instruction issuing. The terminal processor may also perform speech enhancement on the audio data received by the microphone array and send the enhanced audio data to the server.
Thus, as shown in fig. 5, the device 500 may also include a communication module 530. The communication module 530 is configured to send the audio data received by the microphone array 510 to a server when the terminal processor 520 determines that voice activity exists, so that the server performs subsequent operations such as speech recognition and instruction issuing.
[Voice Activity Detection Apparatus]
Fig. 6 is a schematic block diagram illustrating the structure of a voice activity detection apparatus according to an embodiment of the present disclosure. Wherein the functional blocks of the voice activity detection apparatus may be implemented by hardware, software, or a combination of hardware and software implementing the principles of the present disclosure. It will be appreciated by those skilled in the art that the functional blocks described in fig. 6 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.
In the following, functional modules that the voice activity detection apparatus may have and operations that each functional module may perform are briefly described, and for details related thereto, reference may be made to the description above in conjunction with fig. 1, and details are not repeated here.
Referring to fig. 6, the voice activity detecting apparatus 600 includes a positioning module 610, a determining module 620, and a voice activity detecting module 630.
The positioning module 610 is configured to perform sound source positioning according to signals received by at least some microphones in the microphone array to determine location information of the speaker. The determining module 620 is configured to determine whether the speaker is located in the designated area based on the location information. The voice activity detection module 630 is configured to further determine whether voice activity exists based on one or more voice activity detection manners if it is determined that the speaker is located in the designated area.
As shown in fig. 7, the positioning module 610 may include a covariance matrix acquisition module 611, a feature decomposition module 613, a signal subspace determination module 614, and a positioning sub-module 615. The covariance matrix acquisition module 611 is configured to acquire a covariance matrix of the signals received by at least some of the microphones. The feature decomposition module 613 is configured to perform eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues. The signal subspace determination module 614 is configured to select a first number of the largest eigenvalues from the plurality of eigenvalues and form a signal subspace from the eigenvectors corresponding to the selected eigenvalues, where the first number equals the estimated number of sound sources. The positioning sub-module 615 is used to determine the sound source position based on the signal subspace. The covariance matrix is R(f) = E[X(f)X(f)^H], where X(f) is the Fourier-transformed data, at frequency bin f, of the signals received by the different microphones among the at least some microphones.
As shown in fig. 7, the location sub-module 615 may include a maximum response determination module 6151 and a sound source location determination module 6153. The maximum response determining module 6151 is configured to determine the maximum response of the signal in the two-dimensional space based on the signal subspace; the sound source position determining module 6153 is configured to determine a sound source position based on the direction of arrival corresponding to the maximum response.
In one embodiment of the present disclosure, the designated area may be a convex polygon composed of a plurality of vertices. As shown in FIG. 8, the determination module 620 may include a construction module 621, a lookup module 623, and a determination sub-module 625. The constructing module 621 is configured to construct a ray passing through each of the other vertices of the plurality of vertices, respectively, with one of the plurality of vertices as an endpoint; the searching module 623 is configured to find out, by bisection, rays that are adjacent to two sides of the target point and correspond to the position information; the determining sub-module 625 is configured to determine, when the number of the found rays is two, whether the target point is located on a side, close to the end point, of a line segment formed by two vertices through which the two found rays pass, and determine, when the target point is located on the side, close to the end point, of the line segment, that the target point is located in the designated area.
As shown in fig. 9, the voice activity detection module 630 includes a covariance matrix acquisition module 631, a feature decomposition module 633, and an analysis module 635. The covariance matrix acquisition module 631 is configured to obtain a covariance matrix of the signals received by at least some of the microphones; the feature decomposition module 633 is configured to perform eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues; the analysis module 635 is configured to analyze the plurality of eigenvalues to determine whether voice activity is present.
As shown in fig. 9, the analysis module 635 includes a normalization module 6351, a spatial entropy calculation module 6353, and a comparison and judgment module 6355. The normalization module 6351 is configured to normalize the plurality of eigenvalues; the spatial entropy calculation module 6353 is configured to calculate the spatial entropy of the normalized values; the comparison and judgment module 6355 is used to determine whether there is voice activity based on the comparison of the spatial entropy with the predetermined threshold. The spatial entropy is ES = -Σ_{i=1}^{N} p_i · log(p_i), where p_i is the normalized i-th eigenvalue and N is the number of eigenvalues.
In addition, the voice activity detection module may also use a voice activity detection model to predict audio data acquired based on at least part of the microphones to determine whether voice activity exists, where the voice activity detection model is a neural network model and is used to predict a voice activity state of the input audio data, and details about a structure and a training process of the voice activity detection model are not repeated here.
Fig. 10 is a schematic structural diagram illustrating a voice activity detection apparatus according to another embodiment of the present disclosure.
As shown in fig. 10, the voice activity detection apparatus 1000 includes a covariance matrix acquisition module 1010, an eigenvalue decomposition module 1020, and an analysis module 1030. The covariance matrix acquisition module 1010 is configured to obtain a covariance matrix of the signals received by at least some microphones in the microphone array; the eigenvalue decomposition module 1020 is configured to perform eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues; the analysis module 1030 is configured to analyze the plurality of eigenvalues to obtain a first voice activity detection result.
Optionally, the analysis module 1030 may include a normalization module 1031, a spatial entropy calculation module 1033, and a determination module 1035. The normalization module 1031 is configured to normalize the plurality of eigenvalues; the spatial entropy calculation module 1033 is configured to calculate the spatial entropy of the normalized values; the determination module 1035 is configured to determine whether voice activity is present based on the comparison of the spatial entropy with the predetermined threshold. The spatial entropy is

ES = -Σ_{i=1}^{N} p_i · log(p_i)

where p_i is the value obtained by normalizing the i-th eigenvalue and N is the number of eigenvalues.
Optionally, the voice activity detection apparatus 1000 may further include a voice activity detection module 1040 and a determination module 1050. The voice activity detection module 1040 is configured to predict audio data received by the microphone by using a voice activity detection model to obtain a second voice activity detection result, where the voice activity detection model is configured to predict a voice activity state of the input audio data; and a determination module 1050 for determining whether voice activity is present based on the first voice activity detection result and the second voice activity detection result.
Fig. 11 is a schematic structural diagram illustrating a voice activity detection apparatus according to another embodiment of the present disclosure.
Referring to fig. 11, the voice activity detection apparatus 1100 includes a first detection module 1110, a second detection module 1120 and/or a third detection module 1130, and a determination module 1140.
The first detection module 1110 is configured to perform sound source localization according to signals received by at least some microphones of the microphone array to determine location information of a speaker, and determine whether the speaker is located in a designated area based on the location information to obtain a first voice activity detection result.
The second detecting module 1120 is configured to determine whether voice activity exists based on spatial entropy of signals received by the at least part of the microphones, so as to obtain a second voice activity detection result. The third detecting module 1130 is configured to determine whether voice activity exists based on the neural network model to obtain a third voice activity detecting result. The determination module 1140 is used to determine whether voice activity is present based on the first voice activity detection result, the second voice activity detection result, and/or the third voice activity detection result.
[Computing Device]
FIG. 12 is a schematic structural diagram of a computing device that can be used to implement the voice activity detection method according to an embodiment of the present disclosure.
Referring to fig. 12, computing device 1200 includes memory 1210 and processor 1220.
Processor 1220 may be a multi-core processor or may include multiple processors. In some embodiments, processor 1220 may include a general-purpose host processor and one or more special purpose coprocessors such as a Graphics Processor (GPU), Digital Signal Processor (DSP), or the like. In some embodiments, the processor 1220 may be implemented using custom circuitry, such as an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA).
The memory 1210 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions needed by the processor 1220 or other modules of the computer. The permanent storage may be a readable and writable storage device, and may be a non-volatile device that does not lose its stored instructions and data even after the computer is powered off. In some embodiments, the permanent storage employs a mass storage device (e.g., a magnetic or optical disk, or flash memory). In other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable volatile memory device, such as dynamic random access memory, and may store the instructions and data that some or all of the processors require at runtime. In addition, the memory 1210 may comprise any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be used. In some embodiments, the memory 1210 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, miniSD card, Micro-SD card, etc.), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted wirelessly or by wire.
The memory 1210 has stored thereon executable code that, when executed by the processor 1220, may cause the processor 1220 to perform the voice activity detection methods described above.
The voice activity detection method, apparatus and computing device according to the present disclosure have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the present disclosure may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the steps defined in the method of the present disclosure described above.
Alternatively, the present disclosure may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the various steps of the above-described method according to the present disclosure.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (25)

1. A voice activity detection method, comprising:
performing sound source localization according to signals received by at least part of the microphones in a microphone array, so as to determine position information of a speaker;
judging whether the speaker is located in a designated area or not based on the position information; and
in the event that the speaker is determined to be located in the designated area, further determining whether voice activity is present based on one or more voice activity detection modalities.
2. The voice activity detection method according to claim 1, wherein the step of performing sound source localization includes:
acquiring a covariance matrix of signals received by at least part of microphones;
performing eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues;
selecting a first number of largest eigenvalues from the plurality of eigenvalues, and forming a signal subspace based on the eigenvectors corresponding to the selected eigenvalues, wherein the first number is equal to an estimated number of sound sources; and
determining position information of the speaker based on the signal subspace.
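By way of illustration only (not part of the claims), the covariance and eigendecomposition steps of claim 2 might be sketched as follows for a two-microphone, real-valued case. The function names are hypothetical, and a practical implementation would typically operate on complex frequency-domain snapshots and use a library eigensolver:

```python
import math

def covariance(frames):
    """Sample covariance matrix of multichannel frames.

    frames: list of per-sample channel vectors, e.g. [[mic1, mic2], ...].
    """
    n = len(frames)
    m = len(frames[0])
    mean = [sum(f[i] for f in frames) / n for i in range(m)]
    cov = [[0.0] * m for _ in range(m)]
    for f in frames:
        for i in range(m):
            for j in range(m):
                cov[i][j] += (f[i] - mean[i]) * (f[j] - mean[j]) / n
    return cov

def eig_sym_2x2(c):
    """Eigenvalues (sorted descending) and orthonormal eigenvectors of a
    symmetric 2x2 matrix, in closed form."""
    a, b, d = c[0][0], c[0][1], c[1][1]
    tr, det = a + d, a * d - b * b
    disc = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    lam1, lam2 = tr / 2.0 + disc, tr / 2.0 - disc
    if abs(b) > 1e-12:
        v1 = (lam1 - d, b)
    else:
        v1 = (1.0, 0.0) if a >= d else (0.0, 1.0)
    norm = math.hypot(v1[0], v1[1])
    v1 = (v1[0] / norm, v1[1] / norm)
    # The second eigenvector is orthogonal to the first.
    return [lam1, lam2], [v1, (-v1[1], v1[0])]

# With a single dominant source, one eigenvalue dominates; its
# eigenvector spans the (one-dimensional) signal subspace.
frames = [[1.0, 2.0], [-1.0, -2.0], [2.0, 4.0], [-2.0, -4.0]]
eigenvalues, eigenvectors = eig_sym_2x2(covariance(frames))
```

In this sketch, the "first number" of claim 2 corresponds to how many of the largest eigenvalues are retained; here one eigenvalue carries essentially all the energy, so the estimated number of sources is one.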
3. The voice activity detection method according to claim 1 or 2,
the position information includes an azimuth angle and a distance, the azimuth angle is an azimuth angle of the speaker in a coordinate system in which the at least part of the microphones are located, and the distance is a distance between the speaker and a center position of the at least part of the microphones.
4. The voice activity detection method as claimed in claim 2, wherein the step of determining the location information of the speaker comprises:
determining a maximum response of the signal in a two-dimensional space based on the signal subspace; and
determining the position information of the speaker based on the direction of arrival corresponding to the maximum response.
5. The voice activity detection method according to claim 1, wherein the designated area is a convex polygon formed by a plurality of vertices, and the step of determining whether the speaker is located in the designated area comprises:
taking one of the plurality of vertices as an endpoint, constructing rays through each of the other vertices;
finding, by bisection, the two adjacent rays on either side of a target point corresponding to the position information; and
in the case that the two rays are found, determining whether the target point lies on the side, near the endpoint, of a line segment formed by the two vertices through which the two found rays pass, and determining that the target point is located in the designated area in the case that it lies on that side.
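As an illustrative sketch (names are hypothetical, not taken from the patent), the ray-and-bisection test of claim 5 corresponds to the standard O(log n) point-in-convex-polygon algorithm: fan out rays from one vertex, binary-search the wedge containing the target point, then check which side of the closing segment the point falls on:

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); its sign tells on which side of
    the ray o->a the point b lies."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def in_convex_polygon(poly, p):
    """poly: vertices of a convex polygon in counter-clockwise order.

    Bisection over the fan of rays from poly[0], as described in claim 5.
    """
    n = len(poly)
    o = poly[0]
    # Reject points outside the wedge between the first and last ray.
    if cross(o, poly[1], p) < 0 or cross(o, poly[n - 1], p) > 0:
        return False
    # Bisection: find adjacent rays o->poly[lo] and o->poly[hi] around p.
    lo, hi = 1, n - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if cross(o, poly[mid], p) >= 0:
            lo = mid
        else:
            hi = mid
    # p is inside iff it lies on the endpoint side of segment poly[lo]-poly[hi].
    return cross(poly[lo], poly[hi], p) >= 0
```

For example, with the unit-style square `[(0, 0), (2, 0), (2, 2), (0, 2)]` as the designated area, a target point at `(1, 1)` is accepted while `(3, 1)` is rejected.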
6. The voice activity detection method of claim 1, wherein the step of further determining whether voice activity is present based on one or more voice activity detection modalities comprises:
determining, using each of at least two voice activity detection modes respectively, whether voice activity is present; and
determining whether voice activity is present based on the determination results of the at least two voice activity detection modes.
7. The voice activity detection method according to claim 6, wherein the step of determining whether voice activity is present comprises:
determining that voice activity is present in the case that the determination results of each of the at least two voice activity detection modes indicate that voice activity is present.
8. The voice activity detection method according to claim 1, wherein the one or more voice activity detection modalities include:
a first voice activity detection mode, configured to determine whether voice activity is present based on spatial entropy of the signals received by the at least part of the microphones; and/or
a second voice activity detection mode, configured to determine whether voice activity is present based on a neural network model.
9. The voice activity detection method according to claim 8, wherein the first voice activity detection manner comprises:
acquiring a covariance matrix of the signals received by the at least part of the microphones;
performing eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues; and
analyzing the plurality of eigenvalues to determine whether voice activity is present.
10. The voice activity detection method according to claim 9, wherein the step of analyzing the plurality of eigenvalues comprises:
normalizing the plurality of eigenvalues;
calculating the spatial entropy of the plurality of values obtained after normalization; and
determining whether voice activity is present based on a comparison of the spatial entropy with a predetermined threshold.
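The spatial-entropy test of claims 9 and 10 can be sketched as follows (an illustrative, non-claimed Python fragment; the function name and the example threshold are hypothetical). Energy concentrated in a few eigenvalues yields low entropy, suggesting a directional source and hence likely voice activity; an evenly spread spectrum suggests diffuse noise:

```python
import math

def spatial_entropy_vad(eigenvalues, threshold):
    """Normalize the eigenvalues, compute their spatial entropy, and
    compare it against a predetermined threshold.

    Returns (voice_activity_present, entropy). Low entropy means the
    energy is concentrated in few eigenvalues, i.e. a directional source.
    """
    total = sum(eigenvalues)
    probs = [v / total for v in eigenvalues]  # normalization step
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy < threshold, entropy

# One dominant eigenvalue: low entropy, flagged as voice activity.
active, h = spatial_entropy_vad([9.0, 0.5, 0.5], threshold=1.0)
```

For a uniform spectrum of k equal eigenvalues the entropy is log(k) (e.g. about 1.386 for four microphones), so a diffuse noise field falls above a threshold of 1.0 while the concentrated example above falls well below it.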
11. The voice activity detection method according to claim 8, wherein the second voice activity detection manner includes:
predicting, using a voice activity detection model, the audio data acquired by the at least part of the microphones to determine whether voice activity is present, wherein the voice activity detection model is a neural network model for predicting the voice activity state of input audio data.
12. A voice activity detection method, comprising:
acquiring a covariance matrix of signals received by at least part of the microphones in a microphone array;
performing eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues; and
analyzing the plurality of eigenvalues to obtain a first voice activity detection result.
13. The voice activity detection method according to claim 12, wherein the step of analyzing the plurality of eigenvalues comprises:
normalizing the plurality of eigenvalues;
calculating the spatial entropy of the plurality of values obtained after normalization; and
determining whether voice activity is present based on a comparison of the spatial entropy with a predetermined threshold.
14. The voice activity detection method of claim 12, further comprising:
predicting, using a voice activity detection model, audio data received by the microphones to obtain a second voice activity detection result, wherein the voice activity detection model is used for predicting the voice activity state of input audio data; and
determining whether voice activity is present based on the first voice activity detection result and the second voice activity detection result.
15. A voice activity detection method, comprising:
performing sound source localization according to signals received by at least part of the microphones in a microphone array to determine position information of a speaker, and determining, based on the position information, whether the speaker is located in a designated area, so as to obtain a first voice activity detection result;
determining whether voice activity is present based on spatial entropy of the signals received by the at least part of the microphones to obtain a second voice activity detection result, and/or determining whether voice activity is present based on a neural network model to obtain a third voice activity detection result; and
determining whether voice activity is present based on the first voice activity detection result, the second voice activity detection result, and/or the third voice activity detection result.
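The fusion of detection results in claims 7, 14, and 15 could be sketched as follows (an illustrative fragment; the function name and the `mode` parameter are hypothetical, not from the patent). The conjunctive rule of claim 7 requires every available mode to report voice activity, while the "and/or" language of claim 15 allows a mode to be absent:

```python
def fuse_vad(results, mode="all"):
    """Fuse per-mode voice activity decisions.

    results: list of booleans (or None when a detection mode is not used,
    reflecting the "and/or" of claim 15).
    mode "all": conjunctive rule of claim 7 (every used mode must agree);
    mode "any": a single positive mode suffices.
    """
    present = [r for r in results if r is not None]
    if not present:
        return False
    return all(present) if mode == "all" else any(present)
```

For example, `fuse_vad([True, True])` reports voice activity, while `fuse_vad([True, False])` does not under the conjunctive rule.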
16. A voice activity detection apparatus comprising:
a positioning module configured to perform sound source localization according to signals received by at least part of the microphones in a microphone array, so as to determine position information of a speaker;
a judging module configured to determine, based on the position information, whether the speaker is located in a designated area; and
a voice activity detection module configured to further determine, in the case that the speaker is determined to be located in the designated area, whether voice activity is present based on one or more voice activity detection modes.
17. A voice activity detection apparatus comprising:
a covariance matrix acquisition module configured to acquire a covariance matrix of signals received by at least part of the microphones in a microphone array;
an eigendecomposition module configured to perform eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues; and
an analysis module configured to analyze the plurality of eigenvalues to obtain a first voice activity detection result.
18. A voice activity detection apparatus comprising:
a first detection module configured to perform sound source localization according to signals received by at least part of the microphones in a microphone array to determine position information of a speaker, and to determine, based on the position information, whether the speaker is located in a designated area, so as to obtain a first voice activity detection result;
a second detection module configured to determine whether voice activity is present based on spatial entropy of the signals received by the at least part of the microphones, to obtain a second voice activity detection result, and/or a third detection module configured to determine whether voice activity is present based on a neural network model, to obtain a third voice activity detection result; and
a determining module configured to determine whether voice activity is present based on the first voice activity detection result, the second voice activity detection result, and/or the third voice activity detection result.
19. An apparatus for supporting a voice interaction function, comprising:
a microphone array for receiving a sound input; and
a terminal processor configured to perform sound source localization according to signals received by at least part of the microphones in the microphone array to determine position information of a speaker, to determine, based on the position information, whether the speaker is located in a designated area, and, in the case that the speaker is located in the designated area, to further determine whether voice activity is present based on one or more voice activity detection modes.
20. The apparatus of claim 19, further comprising:
a communication module configured to send the audio data received by the microphone array to a server in the case that the terminal processor determines that voice activity is present.
21. The apparatus of claim 19, wherein,
the terminal processor wakes up the device upon determining that voice activity is present to provide a voice interaction function for the user.
22. The device of claim 19, wherein the device is adapted for voice interaction by a user located in a designated area at a distance from the device.
23. The apparatus of claim 19, wherein the apparatus is any one of:
a ticket purchasing machine;
a robot;
an automobile.
24. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1-15.
25. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 1-15.
CN201810973780.6A 2018-08-24 2018-08-24 Voice activity detection method, device, equipment and storage medium Pending CN110858488A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810973780.6A CN110858488A (en) 2018-08-24 2018-08-24 Voice activity detection method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN110858488A true CN110858488A (en) 2020-03-03

Family

ID=69636224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810973780.6A Pending CN110858488A (en) 2018-08-24 2018-08-24 Voice activity detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110858488A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102164328A (en) * 2010-12-29 2011-08-24 中国科学院声学研究所 Audio input system used in home environment based on microphone array
CN103077728A (en) * 2012-12-31 2013-05-01 上海师范大学 Patient weak voice endpoint detection method
CN103426440A (en) * 2013-08-22 2013-12-04 厦门大学 Voice endpoint detection device and voice endpoint detection method utilizing energy spectrum entropy spatial information
CN105223551A (en) * 2015-10-12 2016-01-06 吉林大学 A kind of wearable auditory localization tracker and method
CN106356061A (en) * 2016-10-24 2017-01-25 合肥华凌股份有限公司 Voice recognition method and system based on sound source localization and intelligent household appliance

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
徐望 (Xu Wang) et al., "A Speech Signal Endpoint Detection Algorithm Based on Feature-Space Energy Entropy", Journal on Communications (通信学报), 30 November 2003, pages 2-3 *
杨毅 (Yang Yi) et al., "Research on Speaker Classification and Localization Techniques Based on NIST Evaluation", Journal of Electronics & Information Technology (电子与信息学报), no. 05, 15 May 2011 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111326160A (en) * 2020-03-11 2020-06-23 南京奥拓电子科技有限公司 Speech recognition method, system and storage medium for correcting noise text
CN113393865B (en) * 2020-03-13 2022-06-03 阿里巴巴集团控股有限公司 Power consumption control, mode configuration and VAD method, apparatus and storage medium
CN113393865A (en) * 2020-03-13 2021-09-14 阿里巴巴集团控股有限公司 Power consumption control, mode configuration and VAD method, apparatus and storage medium
CN111323752A (en) * 2020-03-25 2020-06-23 哈尔滨工程大学 Far and near field transition interval sound source positioning method
CN111323752B (en) * 2020-03-25 2022-10-14 哈尔滨工程大学 Far and near field transition interval sound source positioning method
WO2022042635A1 (en) * 2020-08-31 2022-03-03 华为技术有限公司 Wake-up recognition method, audio device, and audio device group
WO2022068608A1 (en) * 2020-09-30 2022-04-07 华为技术有限公司 Signal processing method and electronic device
CN112489692A (en) * 2020-11-03 2021-03-12 北京捷通华声科技股份有限公司 Voice endpoint detection method and device
WO2022121184A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Sound event detection and localization method and apparatus, device, and readable storage medium
TWI806229B (en) * 2021-06-07 2023-06-21 美商惠普發展公司有限責任合夥企業 Computing device and non-transitory machine-readable storage medium for microphone directional beamforming adjustments
CN115240698A (en) * 2021-06-30 2022-10-25 达闼机器人股份有限公司 Model training method, voice detection positioning method, electronic device and storage medium
EP4307298A4 (en) * 2021-12-20 2024-04-03 Shenzhen Shokz Co., Ltd. Voice activity detection method and system, and voice enhancement method and system
CN115219984A (en) * 2022-06-15 2022-10-21 广州汽车集团股份有限公司 Speaker positioning method and system in vehicle
CN115219984B (en) * 2022-06-15 2023-10-27 广州汽车集团股份有限公司 Method and system for positioning speaker in vehicle
CN115116441A (en) * 2022-06-27 2022-09-27 南京大鱼半导体有限公司 Awakening method, device and equipment for voice recognition function

Similar Documents

Publication Publication Date Title
CN110858488A (en) Voice activity detection method, device, equipment and storage medium
US20200213728A1 (en) Audio-based detection and tracking of emergency vehicles
JP7114752B2 (en) Method and apparatus for sound source location detection
US9514751B2 (en) Speech recognition device and the operation method thereof
EP2800402B1 (en) Sound field analysis system
Shih et al. Occupancy estimation using ultrasonic chirps
CN110556103A (en) Audio signal processing method, apparatus, system, device and storage medium
US11605179B2 (en) System for determining anatomical feature orientation
WO2020112577A1 (en) Similarity measure assisted adaptation control of an echo canceller
JP2014098568A (en) Sound source position estimation device, sound source position estimation method, and sound source position estimation program
Wang et al. Time-frequency processing for sound source localization from a micro aerial vehicle
Choi et al. Active-beacon-based driver sound separation system for autonomous vehicle applications
Salvati et al. Sensitivity-based region selection in the steered response power algorithm
Luo et al. Indoor smartphone SLAM with learned echoic location features
CN112750455A (en) Audio processing method and device
US11783809B2 (en) User voice activity detection using dynamic classifier
Xia et al. Ava: An adaptive audio filtering architecture for enhancing mobile, embedded, and cyber-physical systems
Dai et al. Recognizing driver talking direction in running vehicles with a smartphone
Sledevič et al. An evaluation of hardware-software design for sound source localization based on SoC
US11308979B2 (en) Open vs enclosed spatial environment classification for a mobile or wearable device using microphone and deep learning method
CN110858485A (en) Voice enhancement method, device, equipment and storage medium
US10943602B2 (en) Open vs enclosed spatial environment classification for a mobile or wearable device using microphone and deep learning method
WO2017033638A1 (en) Information processing device, information processing method and program
Smaragdis et al. Context extraction through audio signal analysis
Chakraborty et al. Source ambiguity resolution of overlapped sounds in a multi-microphone room environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40024924

Country of ref document: HK