CN111833908A - Audio activity detection method, system, device and storage medium - Google Patents

Audio activity detection method, system, device and storage medium

Info

Publication number
CN111833908A
Authority
CN
China
Prior art keywords: current frame, energy, current, single gaussian, gaussian model
Prior art date
Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion)
Application number
CN202010547546.4A
Other languages
Chinese (zh)
Inventor
陈英博
Current Assignee: TP Link Technologies Co Ltd (listed assignees may be inaccurate)
Original Assignee
TP Link Technologies Co Ltd
Application filed by TP Link Technologies Co Ltd filed Critical TP Link Technologies Co Ltd
Priority to CN202010547546.4A
Publication of CN111833908A


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/21: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses an audio activity detection method comprising the following steps: framing the audio, where each frame comprises a plurality of time-domain points; calculating the energy of the current frame from the amplitudes of the time-domain points; judging whether the energy of the current frame is successfully matched with any of a plurality of single Gaussian models in a preset Gaussian mixture model; when the energy of the current frame is successfully matched with any single Gaussian model, updating the parameters of that single Gaussian model; and when the energy of the current frame fails to match all the single Gaussian models, judging that the current frame is in an active state. The invention also discloses an audio activity detection system, an audio activity detection device and a computer-readable storage medium. Embodiments of the invention can effectively improve the accuracy of audio activity detection, are computationally simple, and can cover all types of audio.

Description

Audio activity detection method, system, device and storage medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method, a system, a device, and a storage medium for audio activity detection.
Background
Audio activity detection, also called voice activity detection (VAD) in some scenarios, is a fundamental and very important function in audio processing. When there is only background sound in the environment, the state is generally called the mute state; when there is not only background sound but also sound of interest, the state is called the active state. Which state applies depends on the task. For example, if the current task is speech recognition, frames containing speech are considered active, and all other frames are considered mute. For audio classification and anomaly detection, however, the presence of only stationary background sounds (noise, fans, air conditioners) is called mute, and anything else is called active.
Most existing VAD methods perform detection on speech signals, for example by applying the spectral entropy method to the speech signal. The spectral entropy method rests on the premise that the spectrum of speech is steeper than that of noise. However, since the range of general audio is wider than that of speech, the effectiveness of the spectral entropy method drops greatly when detecting audio whose spectrum is not steep, so the accuracy of audio activity detection is low. As for VAD methods based on machine learning and deep learning, the required computing power is too high, and because the range of audio is so wide, all kinds of audio cannot possibly be covered during training.
Disclosure of Invention
Embodiments of the present invention provide an audio activity detection method, system, device, and storage medium, which can effectively improve the accuracy of audio activity detection, are simple to calculate, and can cover all types of audio.
To achieve the above object, an embodiment of the present invention provides an audio activity detection method, including:
framing the audio; each frame comprises a plurality of time domain points;
calculating the energy of the current frame according to the amplitude of each time domain point;
judging whether the energy of the current frame is successfully matched with a plurality of single Gaussian models in a preset mixed Gaussian model;
when the energy of the current frame is successfully matched with any single Gaussian model, updating the parameters of the current single Gaussian model;
and when the matching of the energy of the current frame and all the single Gaussian models fails, judging that the current frame is in an activated state.
Compared with the prior art, the audio activity detection method disclosed by the embodiment of the invention comprises the steps of firstly, framing the audio, and calculating the energy of the current frame according to the amplitude of the time domain point in each frame; then, judging whether the energy of the current frame is successfully matched with a plurality of single Gaussian models in a preset mixed Gaussian model or not, so as to judge whether the frame is a voice frame (an activated state) or a noise frame (a mute state); when the energy of the current frame is successfully matched with any single Gaussian model, updating the parameters of the current single Gaussian model; and when the matching of the energy of the current frame and all the single Gaussian models fails, judging that the current frame is in an activated state. The audio activity detection method disclosed by the embodiment of the invention can effectively improve the accuracy of audio activity detection, is simple to calculate, and can cover all types of audio.
As an improvement of the above scheme, the determining whether the energy of the current frame is successfully matched with a plurality of single Gaussian models in a preset Gaussian mixture model specifically includes:
sequentially matching the energy of the current frame with a plurality of single Gaussian models according to a preset matching sequence;
judging whether the difference value between the energy of the current frame and the mean value of the current single-Gaussian model is smaller than or equal to the product of the standard deviation of the current single-Gaussian model and a preset first threshold value;
if so, judging that the energy of the current frame is successfully matched with the current single Gaussian model; if not, judging that the matching of the energy of the current frame and the current single Gaussian model fails.
As an improvement of the above scheme, the parameters of the single gaussian model include weight, standard deviation and mean; then, the updating of the parameters of the current single gaussian model specifically includes:
updating at least one of a weight, a standard deviation, and a mean of the current single-Gaussian model.
As an improvement of the above scheme, after updating the parameters of the current single gaussian model, the method further includes:
adjusting the weights of all single Gaussian models in the Gaussian mixture model to enable the sum of the weights of all the single Gaussian models to be a preset fixed value;
calculating the sum of the weights of the single Gaussian model matched by the current frame and the other single Gaussian models whose matching order precedes it;
judging whether the weight sum is larger than a preset second threshold value or not;
if yes, judging that the current frame is in a mute state; if not, judging that the current frame is in an activated state.
As an improvement of the above scheme, after determining that the current frame is in the active state, the method further includes:
and updating the parameters of the single Gaussian model with the matching sequence at the last position in the mixed Gaussian model.
In order to achieve the above object, an embodiment of the present invention further provides an audio activity detection system, including:
a framing unit for framing the audio; each frame comprises a plurality of time domain points;
the frame energy calculating unit is used for calculating the energy of the current frame according to the amplitude of each time domain point;
the first judging unit is used for judging whether the energy of the current frame is successfully matched with a plurality of single Gaussian models in a preset Gaussian mixture model;
the first parameter updating unit is used for updating the parameters of the current single Gaussian model when the energy of the current frame is successfully matched with any single Gaussian model;
and the first determining unit is used for determining that the current frame is in an activated state when the matching of the energy of the current frame and all the single Gaussian models fails.
Compared with the prior art, the audio activity detection system disclosed by the embodiment of the invention comprises the following steps that firstly, a framing unit frames audio, and a frame energy calculation unit calculates the energy of a current frame according to the amplitude of a time domain point in each frame; then, a first judging unit judges whether the energy of the current frame is successfully matched with a plurality of single Gaussian models in a preset mixed Gaussian model or not, so as to judge whether the frame is a speech frame (an activated state) or a noise frame (a mute state); when the energy of the current frame is successfully matched with any single Gaussian model, a first parameter updating unit updates the parameters of the current single Gaussian model; when the matching of the energy of the current frame and all the single Gaussian models fails, the first judging unit judges that the current frame is in an activated state. The audio activity detection system disclosed by the embodiment of the invention can effectively improve the accuracy of audio activity detection, is simple to calculate, and can cover all types of audio.
As an improvement of the foregoing solution, the first determining unit is specifically configured to:
sequentially matching the energy of the current frame with a plurality of single Gaussian models according to a preset matching sequence;
judging whether the difference value between the energy of the current frame and the mean value of the current single-Gaussian model is smaller than or equal to the product of the standard deviation of the current single-Gaussian model and a preset first threshold value;
if so, judging that the energy of the current frame is successfully matched with the current single Gaussian model; if not, judging that the matching of the energy of the current frame and the current single Gaussian model fails.
As an improvement of the above, the audio activity detection system further includes:
the first weight adjusting unit is used for adjusting the weights of all the single Gaussian models in the Gaussian mixture model when the energy of the current frame is successfully matched with any single Gaussian model, so that the sum of the weights of all the single Gaussian models is a preset fixed value;
a weight sum calculation unit for calculating the sum of the weights of the single Gaussian model matched by the current frame and the other single Gaussian models whose matching order precedes it;
the second judging unit is used for judging whether the weight sum is larger than a preset second threshold value or not;
a second determining unit, configured to determine that the current frame is in a mute state when the sum of weights is greater than a preset second threshold; and the current frame is judged to be in an activated state when the weight sum is less than or equal to a preset second threshold value.
To achieve the above object, an embodiment of the present invention further provides an audio activity detection apparatus, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor executes the computer program to implement the audio activity detection method according to any one of the above embodiments.
To achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, where the computer program, when executed, controls an apparatus where the computer-readable storage medium is located to perform the audio activity detection method according to any one of the above embodiments.
Drawings
Fig. 1 is a flowchart of an audio activity detection method according to an embodiment of the present invention;
fig. 2 is a block diagram of an audio activity detection system according to an embodiment of the present invention;
fig. 3 is a block diagram of an audio activity detection apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of an audio activity detection method according to an embodiment of the present invention; the audio activity detection method comprises the following steps:
s1, framing the audio; each frame comprises a plurality of time domain points;
s2, calculating the energy of the current frame according to the amplitude of each time domain point;
s3, judging whether the energy of the current frame is successfully matched with a plurality of single Gaussian models in a preset Gaussian mixture model;
s4, when the energy of the current frame is successfully matched with any single Gaussian model, updating the parameters of the current single Gaussian model;
and S5, when the matching of the energy of the current frame and all the single Gaussian models fails, judging that the current frame is in an activated state.
It should be noted that the audio activity detection method in the embodiment of the present invention may be implemented by a speech recognition device, which acquires the audio information of the user and, according to the audio activity detection method, distinguishes speech frames (active state) from noise frames (mute state) in the audio information.
In the embodiment of the invention, a Gaussian mixture model is preset and used for distinguishing whether the current frame is in an activated state. The Gaussian mixture model has K single Gaussian models (K is generally 2 or 3), denoted 1 … k … K. Unlike the image case, the background noise of audio is necessarily quieter than the foreground, so during initialization and all subsequent updates a single Gaussian model with a smaller index keeps a smaller mean.
The Gaussian mixture model is represented as follows:

p(x) = sum_{k=1}^{K} w_k * N(x; u_k, v_k)    formula (1);

N(x; u_k, v_k) = 1/(sqrt(2π) * v_k) * exp(-(x - u_k)^2 / (2 * v_k^2))    formula (2);

where p(x) indicates that the probability distribution of x is represented by a weighted sum of K single Gaussian distributions; K denotes the total number of single Gaussian models; w_k denotes the weight, u_k the mean, and v_k the standard deviation of the kth single Gaussian model. During initialization, the initial means of the K single Gaussian models may all be set to u_0, or may be initialized to K sequentially increasing numbers; the initial standard deviation of each single Gaussian model is set to a preset constant v_0. As for the initial mean u_0 and initial standard deviation v_0, a period of audio can be captured from a camera in a quiet environment, an array of frame energies can be obtained by the framing and frame-energy methods described below, and the mean and standard deviation of that energy array are taken. The weight of each model is initialized to 1/K; w_k may also be given random weights, as long as the weights of the single Gaussian models are guaranteed to sum to 1.
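As a sketch, this initialization can be written in Python. The function name and the default K = 2 are illustrative assumptions; u_0 and v_0 are estimated from a quiet-audio energy array as the description suggests:

```python
import statistics

def init_gmm(quiet_energies, k=2):
    """Initialize K single-Gaussian models from frame energies of quiet audio.

    u0/v0 are the mean and standard deviation of the quiet-audio energy
    array; means are K sequentially increasing numbers so that smaller
    indices model quieter (background) energies, and weights are uniform
    so they sum to 1.
    """
    u0 = statistics.mean(quiet_energies)
    v0 = statistics.pstdev(quiet_energies)
    means = [u0 * (i + 1) for i in range(k)]   # sequentially increasing means
    stds = [v0] * k                            # common preset standard deviation
    weights = [1.0 / k] * k                    # sums to 1
    return means, stds, weights
```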
Specifically, in step S1, the audio is framed; wherein each frame comprises a plurality of time domain points. Illustratively, each frame may include 128 time-domain points. The specific manner of framing the audio signal stream may refer to an audio framing manner in the prior art, which is not specifically limited by the present invention.
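A minimal framing sketch, using non-overlapping frames of 128 time-domain points per the example (the function name is an assumption):

```python
def frame_audio(samples, frame_len=128):
    """Split a stream of time-domain samples into non-overlapping frames;
    a trailing partial frame is discarded for simplicity."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
```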
Specifically, in step S2, the energy of the current frame, i.e. the sum of the squares of the amplitudes of all time-domain points in the frame, is calculated. Assuming the frame is the ith frame, its energy satisfies the following formula:

x_i = sum_{p=1}^{P} s_p^2    formula (3);

where s_p is the amplitude of the p-th time-domain point of the current frame and P is the number of time-domain points in a frame.
Specifically, in step S3, the judging whether the energy of the current frame is successfully matched with the single Gaussian models in the preset Gaussian mixture model includes steps S31 to S33:
s31, sequentially matching the energy of the current frame with a plurality of single Gaussian models according to a preset matching sequence; the matching sequence is from small to large according to the subscript of the model;
s32, judging whether the difference value between the energy of the current frame and the mean value of the current single-Gaussian model is smaller than or equal to the product of the standard deviation of the current single-Gaussian model and a preset first threshold value, and satisfying the following formula:
abs(xi-uk)≤λ0*vkformula (4);
wherein λ is0The preset first threshold value can be set to be between 2 and 3, and can also be set to be other values;
s33, if yes, judging that the energy of the current frame is successfully matched with the current single Gaussian model, and not performing the subsequent matching of the single Gaussian model any more; if not, judging that the matching of the energy of the current frame and the current single Gaussian model fails.
For example, to distinguish background noise frames from speech frames, a Gaussian mixture model may be formed from two single Gaussian models, i.e. K = 2: one single Gaussian model describes the frame-energy distribution of the background noise, and the other describes the frame-energy distribution of speech. For a given frame, matching the frame energy against each single Gaussian model in turn serves to determine whether the frame is a speech frame or a noise frame.
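Steps S31 to S33 amount to a first-match scan over the models in increasing-mean order. A sketch, where λ_0 = 2.5 is one value inside the suggested 2 to 3 range and the function name is an assumption:

```python
def match_model(x, means, stds, lambda0=2.5):
    """Return the index of the first single-Gaussian model satisfying
    abs(x - u_k) <= lambda0 * v_k, or None if every model fails."""
    for k, (u, v) in enumerate(zip(means, stds)):
        if abs(x - u) <= lambda0 * v:
            return k      # stop at the first success (step S33)
    return None
```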
Specifically, in step S4, when the energy of the current frame is successfully matched with any single Gaussian model, the parameters of the current single Gaussian model are updated; the parameters of the single Gaussian model include the weight w_k, the standard deviation v_k and the mean u_k. The updating of the parameters of the current single Gaussian model specifically includes:
updating at least one of a weight, a standard deviation, and a mean of the current single-Gaussian model.
Exemplarily, if x_i is successfully matched with the kth model, one possible updating method is as follows:

u'_k = u_k + a_0 * (x_i - u_k)    formula (5);

v'_k = sqrt((1 - a_0) * v_k^2 + a_0 * (x_i - u'_k)^2)    formula (6);

w'_k = (1 - a_0) * w_k + a_0    formula (7);

where a_0 is a preset third threshold taking a value between 0 and 1.
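One possible realization of this update follows. Formula (5) is as given in the text; the variance and weight updates are written here in the standard adaptive-mixture (Stauffer-Grimson-style) form consistent with (5), which is an assumption since the original formulas survive only as image placeholders:

```python
import math

def update_matched(k, x, means, stds, weights, a0=0.05):
    """In-place update of the matched model k with learning rate a0
    (the preset third threshold, between 0 and 1)."""
    means[k] = means[k] + a0 * (x - means[k])                        # (5)
    stds[k] = math.sqrt((1 - a0) * stds[k] ** 2
                        + a0 * (x - means[k]) ** 2)                  # (6), assumed form
    weights[k] = (1 - a0) * weights[k] + a0                          # (7), assumed form
```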
Further, after the parameters of the current single gaussian model are updated, the method further includes steps S41 to S45:
s41, adjusting the weights of all single Gaussian models in the Gaussian mixture model to enable the sum of the weights of all the single Gaussian models to be a preset fixed value; wherein the preset fixed value is 1; scaling all the weights of the single Gaussian models by the following formula to make the sum of the weights 1;
Figure BDA0002541269830000081
s42, calculating the weight sum of the current frame and other frames which are matched with the sequence before the current frame;
s43, judging whether the weight sum is larger than a preset second threshold value or not, and satisfying the following formula:
Figure BDA0002541269830000082
wherein Thresh is the preset second threshold, and is generally between 0.5 and 0.8;
s44, if yes, determining that the current frame is in a mute state (noise frame), that is, vad is 0; if not, the current frame is determined to be in an activated state (speech frame), namely vad is equal to 1.
And S45, reordering the K models from small to large according to the updated mean value of the parameters.
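Steps S41 to S45 can be sketched together; the function name and the returned flag convention (1 = active, 0 = mute) are assumptions:

```python
def vad_decision(k_matched, means, stds, weights, thresh=0.6):
    """Renormalize the weights to sum to 1 (S41), sum the weights of the
    matched model and every model ordered before it (S42), decide mute
    (0) vs active (1) against the second threshold (S43/S44), then
    reorder the models by updated mean (S45). Returns the vad flag."""
    total = sum(weights)
    for j in range(len(weights)):
        weights[j] /= total                        # (8)
    cum = sum(weights[:k_matched + 1])             # (9)
    vad = 0 if cum > thresh else 1                 # S44
    order = sorted(range(len(means)), key=lambda j: means[j])
    means[:] = [means[j] for j in order]           # S45: background first
    stds[:] = [stds[j] for j in order]
    weights[:] = [weights[j] for j in order]
    return vad
```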
Specifically, in step S5, when the energy of the current frame fails to match all the single Gaussian models, the current frame is determined to be in an active state. That is, if x_i fails to match all K single Gaussian models, the frame is directly judged to be active, i.e. vad = 1. At this time, the parameters of the single Gaussian model whose matching order is last (the Kth) in the Gaussian mixture model are updated. The update satisfies the following formulas:

u_K = x_i    formula (10);

v_K = v_0    formula (11);

w_K = w_0    formula (12);

where w_0 is a preset initial weight given to the replaced model.
furthermore, after the three parameters of the Kth single Gaussian model (the last one) are updated, the weights of all the K models are scaled, and the K models are reordered from small to large according to the updated mean value.
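A sketch of this no-match branch; the reset weight w0 is an assumed parameter name, since the original weight formula survives only as an image placeholder:

```python
def reset_last_model(x, means, stds, weights, v0, w0=0.05):
    """Replace the last (Kth) model when no model matched: its mean
    becomes the frame energy (10), its standard deviation the initial
    v0 (11), and its weight a small preset w0 (12, assumed). The
    weights are then rescaled to sum to 1 and the models reordered by
    mean, as the text describes."""
    means[-1], stds[-1], weights[-1] = x, v0, w0
    total = sum(weights)
    weights[:] = [w / total for w in weights]
    order = sorted(range(len(means)), key=lambda j: means[j])
    means[:] = [means[j] for j in order]
    stds[:] = [stds[j] for j in order]
    weights[:] = [weights[j] for j in order]
```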
Compared with the prior art, the audio activity detection method disclosed by the embodiment of the invention comprises the steps of firstly, framing the audio, and calculating the energy of the current frame according to the amplitude of the time domain point in each frame; then, judging whether the energy of the current frame is successfully matched with a plurality of single Gaussian models in a preset mixed Gaussian model or not, so as to judge whether the frame is a voice frame (an activated state) or a noise frame (a mute state); when the energy of the current frame is successfully matched with any single Gaussian model, updating the parameters of the current single Gaussian model; and when the matching of the energy of the current frame and all the single Gaussian models fails, judging that the current frame is in an activated state. The audio activity detection method disclosed by the embodiment of the invention can effectively improve the accuracy of audio activity detection, is simple to calculate, and can cover all types of audio.
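Putting the steps together, the per-frame loop described above can be sketched end to end. All names and default values, and the reconstructed forms of the variance, weight and reset updates, are illustrative assumptions rather than the patent's exact implementation:

```python
import math

def detect_activity(samples, u0, v0, frame_len=128, k=2,
                    lambda0=2.5, a0=0.05, thresh=0.6, w0=0.05):
    """Return one vad flag (1 = active, 0 = mute) per frame.

    u0/v0 are the initial mean/std estimated from quiet audio."""
    means = [u0 * (i + 1) for i in range(k)]       # increasing means
    stds = [v0] * k
    weights = [1.0 / k] * k
    flags = []
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        x = sum(s * s for s in frame)              # frame energy
        matched = None
        for j in range(k):                         # first-match scan
            if abs(x - means[j]) <= lambda0 * stds[j]:
                matched = j
                break
        if matched is None:
            vad = 1                                # no model fits: active
            means[-1], stds[-1], weights[-1] = x, v0, w0   # reset last model
        else:
            means[matched] += a0 * (x - means[matched])
            stds[matched] = math.sqrt((1 - a0) * stds[matched] ** 2
                                      + a0 * (x - means[matched]) ** 2)
            weights[matched] = (1 - a0) * weights[matched] + a0
        total = sum(weights)
        weights = [w / total for w in weights]     # renormalize
        if matched is not None:
            vad = 0 if sum(weights[:matched + 1]) > thresh else 1
        order = sorted(range(k), key=lambda j: means[j])
        means = [means[j] for j in order]          # keep background first
        stds = [stds[j] for j in order]
        weights = [weights[j] for j in order]
        flags.append(vad)
    return flags
```

With these illustrative parameters, two quiet frames followed by one loud frame come out as mute, mute, active.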
Referring to fig. 2, fig. 2 is a block diagram of an audio activity detection system 10 according to an embodiment of the present invention; the audio activity detection system 10 comprises:
a framing unit 11, configured to frame audio; each frame comprises a plurality of time domain points;
a frame energy calculating unit 12, configured to calculate energy of the current frame according to the amplitude of each time-domain point;
the first judging unit 13 is configured to judge whether the energy of the current frame is successfully matched with a plurality of single Gaussian models in a preset Gaussian mixture model;
a first parameter updating unit 14, configured to update a parameter of a current single gaussian model when the energy of the current frame is successfully matched with any single gaussian model;
a first determining unit 15, configured to determine that the current frame is in an active state when matching between the energy of the current frame and all single gaussian models fails;
the first weight adjusting unit 16 is configured to, when the energy of the current frame is successfully matched with any one of the single gaussian models, adjust the weights of all the single gaussian models in the gaussian mixture model so that the sum of the weights of all the single gaussian models is a preset fixed value;
a weight sum calculating unit 17, configured to calculate the sum of the weights of the single Gaussian model matched by the current frame and the other single Gaussian models whose matching order precedes it;
a second judging unit 18, configured to judge whether the weight sum is greater than a preset second threshold;
a second determining unit 19, configured to determine that the current frame is in a mute state when the sum of weights is greater than a preset second threshold; and the current frame is judged to be in an activated state when the weight sum is less than or equal to a preset second threshold value.
It is worth mentioning that the audio activity detection system 10 in the embodiment of the present invention may be a speech recognition device, which obtains audio information of a user and distinguishes between a speech frame (active state) and a noise frame (mute state) for the audio information according to the audio activity detection method.
In the embodiment of the invention, a Gaussian mixture model is preset and used for distinguishing whether the current frame is in an activated state. The Gaussian mixture model has K single Gaussian models (K is generally 2 or 3), denoted 1 … k … K. Unlike the image case, the background noise of audio is necessarily quieter than the foreground, so during initialization and all subsequent updates a single Gaussian model with a smaller index keeps a smaller mean.
The Gaussian mixture model is represented as follows:

p(x) = sum_{k=1}^{K} w_k * N(x; u_k, v_k)    formula (1);

N(x; u_k, v_k) = 1/(sqrt(2π) * v_k) * exp(-(x - u_k)^2 / (2 * v_k^2))    formula (2);

where p(x) indicates that the probability distribution of x is represented by a weighted sum of K single Gaussian distributions; K denotes the total number of single Gaussian models; w_k denotes the weight, u_k the mean, and v_k the standard deviation of the kth single Gaussian model. During initialization, the initial means of the K single Gaussian models may all be set to u_0, or may be initialized to K sequentially increasing numbers; the initial standard deviation of each single Gaussian model is set to a preset constant v_0. As for the initial mean u_0 and initial standard deviation v_0, a period of audio can be captured from a camera in a quiet environment, an array of frame energies can be obtained by the framing and frame-energy methods, and the mean and standard deviation of that energy array are taken. The weight of each model is initialized to 1/K; w_k may also be given random weights, as long as the weights of the single Gaussian models are guaranteed to sum to 1.
Specifically, the framing unit 11 frames the audio; wherein each frame comprises a plurality of time domain points. Illustratively, each frame may include 128 time-domain points. The specific manner of framing the audio signal stream may refer to an audio framing manner in the prior art, which is not specifically limited by the present invention.
Specifically, the frame energy calculating unit 12 calculates the energy of the current frame according to the amplitude of each time-domain point in the current frame, where the energy of the current frame is the sum of the squares of the amplitudes of all time-domain points in the frame. Assuming the frame is the ith frame, its energy satisfies the following formula:

x_i = sum_{p=1}^{P} s_p^2    formula (3);

where s_p is the amplitude of the p-th time-domain point of the current frame and P is the number of time-domain points in a frame.
Specifically, the first judging unit 13 judges whether the energy of the current frame is successfully matched with a plurality of single Gaussian models in the preset Gaussian mixture model, which specifically includes:
sequentially matching the energy of the current frame with a plurality of single Gaussian models according to a preset matching sequence; the matching sequence is from small to large according to the subscript of the model;
judging whether the difference between the energy of the current frame and the mean of the current single Gaussian model is less than or equal to the product of the standard deviation of the current single Gaussian model and a preset first threshold, i.e. whether the following formula is satisfied:

abs(x_i - u_k) ≤ λ_0 * v_k    formula (4);

where λ_0 is the preset first threshold, which may be set between 2 and 3, or to another value;
if so, judging that the energy of the current frame is successfully matched with the current single Gaussian model, and no longer matching against the subsequent single Gaussian models; if not, judging that the matching of the energy of the current frame and the current single Gaussian model fails.
For example, to distinguish background noise frames from speech frames, a Gaussian mixture model may be formed from two single Gaussian models, i.e. K = 2: one single Gaussian model describes the frame-energy distribution of the background noise, and the other describes the frame-energy distribution of speech. For a given frame, matching the frame energy against each single Gaussian model in turn serves to determine whether the frame is a speech frame or a noise frame.
Specifically, when the energy of the current frame is successfully matched with any single Gaussian model, the first parameter updating unit 14 updates the parameters of the current single Gaussian model; the parameters of the single Gaussian model include the weight w_k, the standard deviation v_k and the mean u_k. The first parameter updating unit 14 is specifically configured to update at least one of the weight, the standard deviation, and the mean of the current single Gaussian model.
For example, if x_i is successfully matched with the k-th model, one possible update is:
u'_k = u_k + a_0 * (x_i − u_k)    formula (5);
[Formulas (6) and (7), which appear only as images in the source, give the corresponding updates of the standard deviation v_k and the weight w_k of the matched model.]
where a_0 is the preset third threshold, taking a value between 0 and 1. After the parameters are updated, the K models are reordered from small to large according to their means.
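The update of the matched model can be sketched as below. Formula (5) is taken from the text; since formulas (6) and (7) appear only as images in the source, the standard-deviation and weight updates here use a common running-average form and are therefore assumptions, not the disclosed formulas:

```python
import math

def update_matched(models, k, x, a0=0.05):
    """Update the parameters of the matched k-th model toward the new
    frame energy x, with learning rate a0 in (0, 1)."""
    u, v, w = models[k]
    u_new = u + a0 * (x - u)                                # formula (5)
    # assumed form of formula (6): running average of the variance
    v_new = math.sqrt(v * v + a0 * ((x - u) ** 2 - v * v))
    # assumed form of formula (7): increase the weight of the matched model
    w_new = w + a0 * (1.0 - w)
    models[k] = (u_new, v_new, w_new)
    models.sort(key=lambda m: m[0])   # reorder by mean, small to large
    return models
```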
Further, after the parameters of the current single Gaussian model are updated, the first weight adjusting unit 16 adjusts the weights of all the single Gaussian models in the Gaussian mixture model so that the sum of the weights of all the single Gaussian models equals a preset fixed value; here the preset fixed value is 1. All the weights are scaled by the following formula so that they sum to 1:
w'_k = w_k / (w_1 + w_2 + … + w_K)    formula (8);
the weight sum calculation unit 17 calculates the sum of the weight of the currently matched single Gaussian model and the weights of the single Gaussian models preceding it in the matching order;
the second judging unit 18 judges whether this weight sum is greater than a preset second threshold, i.e. whether the following formula is satisfied:
w_1 + w_2 + … + w_k > Thresh    formula (9);
where Thresh is the preset second threshold, generally between 0.5 and 0.8.
When the weight sum is greater than the preset second threshold, the second determination unit 19 determines that the current frame is in a mute state (noise frame), i.e. vad = 0; when the weight sum is less than or equal to the preset second threshold, the second determination unit 19 determines that the current frame is in an active state (speech frame), i.e. vad = 1.
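Formulas (8) and (9) and the decision rule above can be sketched together as follows; the function name and the threshold value 0.7 (inside the 0.5–0.8 range given above) are illustrative, and `k` is the zero-based index of the matched model:

```python
def vad_decision(models, k, thresh=0.7):
    """Normalize the model weights to sum to 1 (formula (8)), sum the
    weights of the matched model and its predecessors in the matching
    order (formula (9)), and return vad: 0 = mute/noise, 1 = active."""
    total = sum(w for (_, _, w) in models)
    weights = [w / total for (_, _, w) in models]   # formula (8)
    cum = sum(weights[: k + 1])                     # left-hand side of (9)
    return 0 if cum > thresh else 1
```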
Specifically, when the energy of the current frame fails to match all the single Gaussian models, the first determination unit 15 determines that the current frame is in the active state. That is, if x_i fails to match all K single Gaussian models, the frame is directly judged to be active, i.e. vad = 1. At this time, the parameters of the single Gaussian model last (K-th) in the matching order in the Gaussian mixture model are updated. The update satisfies the following formulas:
u_K = x_i    formula (10);
v_K = v_0    formula (11);
[Formula (12), which appears only as an image in the source, gives the corresponding reset of the weight w_K.]
Further, after the three parameters of the K-th (last) single Gaussian model are updated, the second weight adjusting unit (not shown in the figure) rescales the weights of all K models again, and reorders the K models from small to large according to the updated means.
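The no-match branch above can be sketched as follows. Here v0 is the preset initial standard deviation from formula (11); since formula (12) appears only as an image in the source, resetting the weight to a small preset initial value w0 is an assumption:

```python
def handle_no_match(models, x, v0=1.0, w0=0.05):
    """Re-initialize the last model in the matching order on the new
    frame energy x and declare the frame active (vad = 1)."""
    models[-1] = (x, v0, w0)   # u_K = x_i (10), v_K = v0 (11), w_K = w0 (assumed (12))
    total = sum(w for (_, _, w) in models)
    models = [(u, v, w / total) for (u, v, w) in models]  # rescale weights to sum 1
    models.sort(key=lambda m: m[0])   # reorder by mean, small to large
    return 1, models                  # vad = 1: active frame
```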
Compared with the prior art, the audio activity detection system 10 disclosed by the embodiment of the invention comprises a framing unit 11 for framing the audio, and a frame energy calculation unit 12 for calculating the energy of the current frame from the amplitudes of the time-domain points in each frame; then, the first judging unit 13 judges whether the energy of the current frame is successfully matched with the plurality of single Gaussian models in a preset Gaussian mixture model, so as to judge whether the frame is a speech frame (active state) or a noise frame (mute state); when the energy of the current frame is successfully matched with any single Gaussian model, the first parameter updating unit 14 updates the parameters of that single Gaussian model; when the energy of the current frame fails to match all the single Gaussian models, the first determination unit 15 determines that the current frame is in the active state. The audio activity detection system 10 disclosed by the embodiment of the invention can effectively improve the accuracy of audio activity detection, is computationally simple, and can cover all types of audio.
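The framing and frame-energy steps summarized above can be sketched as follows; the mean-square form used here is one common choice of frame-energy measure and the frame length is illustrative, since this passage does not fix the exact formula:

```python
def frame_energies(samples, frame_len=256):
    """Split the audio into frames of frame_len time-domain points and
    compute the mean-square energy of each complete frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(a * a for a in frame) / frame_len)
    return energies
```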
Referring to fig. 3, fig. 3 is a block diagram of an audio activity detection apparatus 20 according to an embodiment of the present invention. The audio activity detection device 20 of this embodiment comprises: a processor 21, a memory 22 and a computer program stored in said memory 22 and executable on said processor 21. The processor 21, when executing the computer program, implements the steps in the above-described audio activity detection method embodiments, such as steps S1-S5 shown in fig. 1. Alternatively, the processor 21, when executing the computer program, implements the functions of the modules/units in the above-mentioned device embodiments, such as the framing unit 11.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 22 and executed by the processor 21 to accomplish the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program in the audio activity detection device 20. For example, the computer program may be divided into a framing unit 11, a frame energy calculation unit 12, a first judging unit 13, a first parameter updating unit 14, a first determination unit 15, a first weight adjusting unit 16, a weight sum calculation unit 17, a second judging unit 18, and a second determination unit 19; for the specific functions of each module, refer to the specific working process of the audio activity detection system 10 described in the foregoing embodiment, which is not repeated here.
The audio activity detection device 20 may be a computing device such as a desktop computer, a notebook, a palm top computer, and a cloud server. The audio activity detection device 20 may include, but is not limited to, a processor 21, a memory 22. It will be appreciated by those skilled in the art that the schematic diagram is merely an example of the audio activity detection device 20 and does not constitute a limitation of the audio activity detection device 20 and may include more or fewer components than shown, or some components may be combined, or different components, e.g., the audio activity detection device 20 may also include an input-output device, a network access device, a bus, etc.
The Processor 21 may be a Central Processing Unit (CPU), other general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware component, etc. The general purpose processor may be a microprocessor, or the processor 21 may be any conventional processor or the like; the processor 21 is the control center of the audio activity detection device 20, with various interfaces and lines connecting the various parts of the overall audio activity detection device 20.
The memory 22 may be used to store the computer programs and/or modules, and the processor 21 may implement the various functions of the audio activity detection device 20 by running or executing the computer programs and/or modules stored in the memory 22 and invoking the data stored in the memory 22. The memory 22 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to use of the device (such as audio data, a phonebook, etc.), and the like. In addition, the memory 22 may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid-state storage device.
Wherein the integrated modules/units of the audio activity detection device 20, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the above embodiments may be implemented by a computer program, which may be stored in a computer readable storage medium and used by the processor 21 to implement the steps of the above embodiments of the method. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, etc. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A method for audio activity detection, comprising:
framing the audio; each frame comprises a plurality of time domain points;
calculating the energy of the current frame according to the amplitude of each time domain point;
judging whether the energy of the current frame is successfully matched with a plurality of single Gaussian models in a preset mixed Gaussian model;
when the energy of the current frame is successfully matched with any single Gaussian model, updating the parameters of the current single Gaussian model;
and when the matching of the energy of the current frame and all the single Gaussian models fails, judging that the current frame is in an activated state.
2. The method for detecting audio activity according to claim 1, wherein the determining whether the energy of the current frame is successfully matched with a plurality of single Gaussian models in a preset Gaussian mixture model comprises:
sequentially matching the energy of the current frame with a plurality of single Gaussian models according to a preset matching sequence;
judging whether the difference value between the energy of the current frame and the mean value of the current single-Gaussian model is smaller than or equal to the product of the standard deviation of the current single-Gaussian model and a preset first threshold value;
if so, judging that the energy of the current frame is successfully matched with the current single Gaussian model; if not, judging that the matching of the energy of the current frame and the current single Gaussian model fails.
3. The audio activity detection method of claim 1, wherein the parameters of the single gaussian model include weights, standard deviations, and mean values; then, the updating of the parameters of the current single gaussian model specifically includes:
updating at least one of a weight, a standard deviation, and a mean of the current single-Gaussian model.
4. The audio activity detection method of claim 2, wherein after updating the parameters of the current single-gaussian model, further comprising:
adjusting the weights of all single Gaussian models in the Gaussian mixture model to enable the sum of the weights of all the single Gaussian models to be a preset fixed value;
calculating the weight sum of the current frame and other frames which are matched with sequences before the current frame;
judging whether the weight sum is larger than a preset second threshold value or not;
if yes, judging that the current frame is in a mute state; if not, judging that the current frame is in an activated state.
5. The audio activity detection method of claim 2, wherein after determining that the current frame is active, further comprising:
and updating the parameters of the single Gaussian model with the matching sequence at the last position in the mixed Gaussian model.
6. An audio activity detection system, comprising:
a framing unit for framing the audio; each frame comprises a plurality of time domain points;
the frame energy calculating unit is used for calculating the energy of the current frame according to the amplitude of each time domain point;
the first judging unit is used for judging whether the energy of the current frame is successfully matched with a plurality of single Gaussian models in a preset Gaussian mixture model;
the first parameter updating unit is used for updating the parameters of the current single Gaussian model when the energy of the current frame is successfully matched with any single Gaussian model;
and the first judging unit is used for judging that the current frame is in an activated state when the matching of the energy of the current frame and all the single Gaussian models fails.
7. The audio activity detection system of claim 6, wherein the first determination unit is specifically configured to:
sequentially matching the energy of the current frame with a plurality of single Gaussian models according to a preset matching sequence;
judging whether the difference value between the energy of the current frame and the mean value of the current single-Gaussian model is smaller than or equal to the product of the standard deviation of the current single-Gaussian model and a preset first threshold value;
if so, judging that the energy of the current frame is successfully matched with the current single Gaussian model; if not, judging that the matching of the energy of the current frame and the current single Gaussian model fails.
8. The audio activity detection system of claim 7, further comprising:
the first weight adjusting unit is used for adjusting the weights of all the single Gaussian models in the Gaussian mixture model when the energy of the current frame is successfully matched with any single Gaussian model, so that the sum of the weights of all the single Gaussian models is a preset fixed value;
a weight sum calculation unit for calculating a weight sum of the current frame and other frames whose matching order is prior to the current frame;
the second judging unit is used for judging whether the weight sum is larger than a preset second threshold value or not;
a second determining unit, configured to determine that the current frame is in a mute state when the sum of weights is greater than a preset second threshold; and the current frame is judged to be in an activated state when the weight sum is less than or equal to a preset second threshold value.
9. An audio activity detection device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the audio activity detection method of any of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the audio activity detection method of any one of claims 1-5.
CN202010547546.4A 2020-06-16 2020-06-16 Audio activity detection method, system, device and storage medium Withdrawn CN111833908A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010547546.4A CN111833908A (en) 2020-06-16 2020-06-16 Audio activity detection method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010547546.4A CN111833908A (en) 2020-06-16 2020-06-16 Audio activity detection method, system, device and storage medium

Publications (1)

Publication Number Publication Date
CN111833908A true CN111833908A (en) 2020-10-27

Family

ID=72897734

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010547546.4A Withdrawn CN111833908A (en) 2020-06-16 2020-06-16 Audio activity detection method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN111833908A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114613391A (en) * 2022-02-18 2022-06-10 广州市欧智智能科技有限公司 Snore identification method and device based on half-band filter
CN114613391B (en) * 2022-02-18 2022-11-25 广州市欧智智能科技有限公司 Snore identification method and device based on half-band filter

Similar Documents

Publication Publication Date Title
US10332507B2 (en) Method and device for waking up via speech based on artificial intelligence
CN109961780B (en) A man-machine interaction method, device, server and storage medium
CN108304758B (en) Face characteristic point tracking method and device
EP2763134B1 (en) Method and apparatus for voice recognition
US7383178B2 (en) System and method for speech processing using independent component analysis under stability constraints
CN110853663B (en) Speech enhancement method based on artificial intelligence, server and storage medium
CN109272016B (en) Target detection method, device, terminal equipment and computer readable storage medium
US9142210B2 (en) Method and device for speaker recognition
CN110556126B (en) Speech recognition method and device and computer equipment
CN110672323B (en) Bearing health state assessment method and device based on neural network
US20220004920A1 (en) Classification device, classification method, and classification program
CN110503944B (en) Method and device for training and using voice awakening model
CN109766476B (en) Video content emotion analysis method and device, computer equipment and storage medium
CN110930987B (en) Audio processing method, device and storage medium
CN110741387A (en) Face recognition method and device, storage medium and electronic equipment
CN111833908A (en) Audio activity detection method, system, device and storage medium
WO2020098107A1 (en) Detection model-based emotions analysis method, apparatus and terminal device
CN110580897A (en) audio verification method and device, storage medium and electronic equipment
CN111640421B (en) Speech comparison method, device, equipment and computer readable storage medium
CN111798862A (en) Audio noise reduction method, system, device and storage medium
US10950244B2 (en) System and method for speaker authentication and identification
CN109377984B (en) ArcFace-based voice recognition method and device
CN109961152B (en) Personalized interaction method and system of virtual idol, terminal equipment and storage medium
CN113673349B (en) Method, system and device for generating Chinese text by image based on feedback mechanism
CN110858484A (en) Voice recognition method based on voiceprint recognition technology

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20201027