CN112614506B - Voice activation detection method and device - Google Patents

Voice activation detection method and device

Info

Publication number
CN112614506B
CN112614506B (application CN202011572868.0A)
Authority
CN
China
Prior art keywords
state
audio
mute
audio frame
probability distribution
Prior art date
Legal status
Active
Application number
CN202011572868.0A
Other languages
Chinese (zh)
Other versions
CN112614506A (en)
Inventor
王雪志
薛少飞
Current Assignee
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202011572868.0A
Publication of CN112614506A
Application granted
Publication of CN112614506B
Status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Abstract

The invention discloses a voice activation detection method and device. The voice activation detection method comprises the following steps: processing the received audio to generate audio frame features; calculating, based on a neural network classifier, the probability distribution value of each audio frame feature being noise or speech; and post-processing the probability distribution value of each audio frame feature being noise or speech, and outputting a state judgment result for each audio frame feature, wherein the state judgment result comprises a mute state, a pre-audio state, an audio state and a pre-mute state. The scheme effectively solves the problems of abnormal frames in the voice activation detection process and of silence and noise segments mixed into a person's speech, and greatly improves the accuracy and usability of voice activation detection. By optimizing voice activation detection performance, wake-up and recognition performance can be further improved.

Description

Voice activation detection method and device
Technical Field
The invention belongs to the field of voice recognition, and particularly relates to a voice activation detection method and device.
Background
Voice Activity Detection (VAD) aims to detect whether the current audio signal contains speech; that is, it judges the input signal, distinguishes the speech signal from various background noise signals, and applies a different processing method to each. The traditional way is to distinguish between speech and noise signals by zero-crossing rate and short-time energy. Neural-network-based voice activation detection has developed vigorously in recent years and has greatly improved the accuracy of distinguishing speech signals from noise signals.
The short-time zero-crossing rate represents how often the speech waveform in one frame crosses the horizontal axis (zero level). It relies on voiced audio having a high zero-crossing rate; it performs well where there is no noise but poorly where there is noise, with weak anti-interference capability, because the zero-crossing rate only counts how often the waveform crosses the horizontal axis within a certain time, and the waveform also crosses frequently when noise is present. The short-time-energy method computes the energy of each frame and judges whether it is a speech segment by the energy level; this is too direct, performs poorly in practice, and misjudges noise as speech when the noise energy is large. The neural-network method trains a classifier to judge whether a frame is speech or noise; it judges single frames well, but does not consider the relation between frames. Because a real speaker breathes while talking, transient noise segments exist inside speech; the network judges speech and noise frames with high accuracy, yet without considering the influence of preceding and following frames its performance in actual use is poor.
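To make the two traditional measures concrete, here is a minimal numpy sketch (not part of the patent; the frame length, sampling rate, and test signals are illustrative assumptions):

```python
import numpy as np

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of adjacent sample pairs whose signs differ,
    i.e. how often the waveform crosses the zero level."""
    signs = np.sign(frame)
    return float(np.mean(signs[:-1] != signs[1:]))

def short_time_energy(frame: np.ndarray) -> float:
    """Sum of squared samples within one frame."""
    return float(np.sum(frame ** 2))

# Toy comparison on a 10 ms frame at 16 kHz (160 samples):
sr = 16000
t = np.arange(160) / sr
tone = 0.5 * np.sin(2 * np.pi * 200 * t)   # voiced-like: low ZCR, high energy
noise = 0.05 * np.random.randn(160)        # noise-like: high ZCR, low energy
print(zero_crossing_rate(tone), short_time_energy(tone))
print(zero_crossing_rate(noise), short_time_energy(noise))
```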
Disclosure of Invention
An embodiment of the present invention provides a voice activation detection method and apparatus, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a voice activation detection method, including: processing the received audio to generate audio frame characteristics; calculating the probability distribution value of each audio frame characteristic which is noise or voice based on a neural network classifier; and post-processing the probability distribution value of each audio frame characteristic which is noise or voice, and outputting a state judgment result of each audio frame characteristic, wherein the state judgment result comprises a mute state, a pre-audio state, an audio state and a pre-mute state.
In a second aspect, an embodiment of the present invention provides a voice activation detection apparatus, including: an audio processing module configured to process the received audio to generate audio frame features; an audio analysis module configured to calculate, based on a neural network classifier, the probability distribution value of each audio frame feature being noise or speech; and a result conversion module configured to post-process the probability distribution value of each audio frame feature being noise or speech and output a state judgment result for each audio frame feature, wherein the state judgment result comprises a mute state, a pre-audio state, an audio state and a pre-mute state.
In a third aspect, an electronic device is provided, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a storage medium storing a computer program which, when executed by a processor, performs the steps of the method of the first aspect.
The method provided by the embodiment of the application greatly improves the accuracy and usability of voice activation detection by effectively handling abnormal frames in the voice activation detection process and solving the problem of silence and noise segments mixed into a person's speech.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the description of the embodiments are briefly introduced below. It is obvious that the drawings described below show some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a flowchart of a voice activity detection method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another voice activity detection method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating voice activity detection according to an embodiment of the present invention;
FIG. 4 is a block diagram of a voice activity detection apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a flowchart of an embodiment of a voice activity detection method according to the invention is shown.
As shown in fig. 1, in step 101, processing the received audio to generate an audio frame feature;
in step 102, calculating a probability distribution value of each audio frame feature being noise or voice based on a neural network classifier;
in step 103, post-processing the probability distribution value of each audio frame feature as noise or speech, and outputting a state determination result of each audio frame feature, where the state determination result includes a mute state, a pre-audio state, an audio state, and a pre-mute state.
In this embodiment, for step 101, the voice activity detection device processes the received audio to extract audio frame features. The audio processing is mainly framing, whose specific approach may follow the prior art, and the extracted audio frame features are Mel Frequency Cepstral Coefficient (MFCC) features. For example, the received audio is divided into audio frames of 10 ms length (not limited herein), and then the MFCC features of each audio frame are extracted, which is not described again here.
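A minimal sketch of this framing and MFCC-extraction step, assuming librosa as the feature library (the patent names no toolchain) and the 10 ms frame length from the example above; n_mfcc and the 25 ms analysis window are likewise assumptions:

```python
import numpy as np
import librosa  # assumed feature-extraction library; the patent names no toolchain

def extract_mfcc_frames(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Frame the audio with a 25 ms analysis window and a 10 ms hop, and
    return one MFCC vector per frame, shape (num_frames, n_mfcc)."""
    mfcc = librosa.feature.mfcc(
        y=audio,
        sr=sr,
        n_mfcc=13,                   # a common choice; not specified by the patent
        n_fft=int(0.025 * sr),       # 25 ms analysis window (assumption)
        hop_length=int(0.010 * sr),  # 10 ms hop, matching the 10 ms frames above
    )
    return mfcc.T
```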
Then, for step 102, the audio frame features are sent to a neural network classifier, which calculates the probability distribution value of each audio frame feature being noise or speech. The neural network classifier is configured in the voice activation detection system and has the function of obtaining the noise and speech probability distribution values of audio frame features; for example, if the probability distribution value of an audio frame feature being noise or speech is denoted P, the audio yields the probability distribution values P1, P2, …, Pn, which is not described again here.
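The patent does not disclose the classifier architecture. Purely for illustration, a small feed-forward network with a two-way softmax could look as follows; the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class FrameClassifier(nn.Module):
    """Hypothetical noise/speech frame classifier. The patent does not
    disclose the network architecture, so this small MLP is purely
    illustrative."""
    def __init__(self, n_mfcc: int = 13, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mfcc, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # logits for [noise, speech]
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, n_mfcc) -> per-frame probabilities P1, P2, ..., Pn
        return torch.softmax(self.net(frames), dim=-1)

# Example: speech probability of 100 random frames.
# speech_probs = FrameClassifier()(torch.randn(100, 13))[:, 1]
```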
Finally, in step 103, the probability distribution value of each audio frame feature being noise or speech is post-processed through a VAD (voice activity detection) post-processing model, and the state determination result of each audio frame feature is output, where the result is one of a mute state, a pre-audio state, an audio state and a pre-mute state, and each state has a corresponding preset threshold. For example, the probability distribution value P1 of the first audio frame feature is compared with the preset threshold and the determination result for that frame is output; P2 is compared with the preset threshold and its result is output; and so on, until Pn is compared with the preset threshold and its result is output.
In the scheme described in this embodiment, the received audio is processed to extract audio frame features, the probability distribution value of each audio frame feature being noise or speech is generated, the probability distribution values are post-processed, and the state determination result of each audio frame feature is output, which improves the accuracy and usability of voice activation detection.
In some optional embodiments, the mute state, the pre-audio state, the audio state and the pre-mute state form a closed loop, and the states of adjacent audio frame features can only switch between adjacent states. For example, if the current audio frame feature is determined to be in the mute state, the next audio frame feature may be determined to be in the mute state, the pre-mute state or the pre-audio state, and cannot convert directly into the audio state.
In some optional embodiments, post-processing the probability distribution value of each audio frame feature being noise or speech and outputting the state determination result of each audio frame feature includes: comparing the probability distribution value of each audio frame feature with the threshold of each state, and outputting the state determination result corresponding to each audio frame feature based on the comparison result. In the mute state there is a mute threshold T_sil; in the pre-audio state there is a pre-audio threshold T_prsp; in the audio state there is an audio threshold T_sp; in the pre-mute state there are a pre-mute low threshold T_low_prsil and a pre-mute high threshold T_high_prsil. For example, assume the mute threshold T_sil is 0.1, the pre-audio threshold T_prsp is 0.6, the audio threshold T_sp is 0.9, the pre-mute low threshold T_low_prsil is 0.3 and the pre-mute high threshold T_high_prsil is 0.5. If the probability distribution value of the first audio frame feature is 0.2, lower than the pre-mute low threshold T_low_prsil, the audio frame feature is determined to be in the mute state and the voice activation detection result is the mute state. The thresholds may also take other values such as 0.4 or 0.6, which is not limited in this application and is not described again here.
In some optional embodiments, in the mute state, if the probability distribution value of the audio frame feature is smaller than the mute threshold tsil, the pre-audio state is entered; in the pre-audio state, if the probability distribution value of the audio frame characteristics is smaller than the pre-audio threshold value T prsp, returning to the mute state; if the probability distribution value of the audio frame characteristics is greater than the pre-audio threshold value Tprsp and the duration is greater than or equal to a first preset time, entering the audio state; in the audio state, if the probability distribution value of the audio frame characteristics is smaller than the audio threshold value T sp, entering the pre-static state; in the pre-mute state, if the probability distribution value of the audio frame characteristics is smaller than the pre-mute low threshold value T low prsil, returning to the audio state; entering the mute state if the probability distribution value of the audio frame feature is greater than the pre-mute low threshold T low prsil and the duration is greater than or equal to a second preset time, or if the probability distribution value of the audio frame feature is greater than the pre-mute high threshold T high prsil and the duration is greater than or equal to a third preset time. In the pre-mute state and the pre-audio state, the probability distribution value of the audio frame feature is compared with the preset threshold, and whether the duration is greater than or equal to the preset time is also determined, for example, if the current state is the pre-audio state and the first preset time is 10ms, the present application is not limited, the probability distribution value of the next audio frame feature is greater than the preset audio threshold tprsp, and the duration is greater than or equal to 10ms, the next audio frame feature is determined to be the audio state, which is not described herein again.
In some optional embodiments, in the mute state, if the probability distribution value of the audio frame feature is not less than the mute threshold T_sil, the mute state is maintained. In the pre-audio state, if the probability distribution value is not less than the pre-audio threshold T_prsp and the duration is less than the first preset time, the pre-audio state is maintained. In the audio state, if the probability distribution value is not smaller than the audio threshold T_sp, the audio state is maintained. In the pre-mute state, if the probability distribution value is not less than the pre-mute low threshold T_low_prsil and the duration is less than the second preset time, or not less than the pre-mute high threshold T_high_prsil and the duration is less than the third preset time, the pre-mute state is maintained, which is not described again here.
In some optional embodiments, the third preset time is less than the second preset time. If the third preset time is M and the second preset time is N, then M < N; for example, M is 5 ms and N is 10 ms (not limited by this application). The audio frame feature is determined to be in the mute state only when the probability distribution value of the following audio frame features remains greater than the pre-mute high threshold T_high_prsil for 5 ms, or greater than the pre-mute low threshold T_low_prsil for 10 ms, which is not described again here.
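Putting the transition rules of the preceding paragraphs together, the post-processing can be sketched as the following state machine. The class name, the default thresholds and timings (taken from the illustrative values above), and the use of 1 - P(speech) as the mute probability are all assumptions, not details fixed by the patent:

```python
from enum import Enum

class State(Enum):
    SILENCE = "silence"        # mute state
    PRESPEECH = "prespeech"    # pre-audio state
    SPEECH = "speech"          # audio state
    PRESILENCE = "presilence"  # pre-mute state

class VadPostProcessor:
    """Four-state post-processing sketch following the transition rules above.
    Thresholds and timings default to the illustrative values from the text
    (T_sil=0.1, T_prsp=0.6, T_sp=0.9, T_low_prsil=0.3, T_high_prsil=0.5,
    T1=10 ms, T2=10 ms, T3=5 ms); p_speech is the classifier's speech
    probability for one frame, and the mute probability is taken as
    1 - p_speech (also an assumption)."""

    def __init__(self, frame_ms=10, t_sil=0.1, t_prsp=0.6, t_sp=0.9,
                 t_low_prsil=0.3, t_high_prsil=0.5,
                 t1_ms=10, t2_ms=10, t3_ms=5):
        self.frame_ms = frame_ms
        self.t_sil, self.t_prsp, self.t_sp = t_sil, t_prsp, t_sp
        self.t_low_prsil, self.t_high_prsil = t_low_prsil, t_high_prsil
        self.t1_ms, self.t2_ms, self.t3_ms = t1_ms, t2_ms, t3_ms
        self.state = State.SILENCE
        self.elapsed_ms = 0       # time the current pre-state condition has held
        self.high_elapsed_ms = 0  # time the pre-mute high threshold has been exceeded

    def step(self, p_speech: float) -> State:
        p_sil = 1.0 - p_speech
        if self.state is State.SILENCE:
            if p_sil <= self.t_sil:                # mute probability fell: speech may start
                self._enter(State.PRESPEECH)
        elif self.state is State.PRESPEECH:
            if p_speech <= self.t_prsp:            # false start: back to silence
                self._enter(State.SILENCE)
            else:
                self.elapsed_ms += self.frame_ms
                if self.elapsed_ms >= self.t1_ms:  # held for T1: confirmed speech
                    self._enter(State.SPEECH)
        elif self.state is State.SPEECH:
            if p_speech <= self.t_sp:
                self._enter(State.PRESILENCE)
        else:  # State.PRESILENCE
            if p_sil <= self.t_low_prsil:          # speech resumed
                self._enter(State.SPEECH)
            else:
                self.elapsed_ms += self.frame_ms
                self.high_elapsed_ms = (self.high_elapsed_ms + self.frame_ms
                                        if p_sil > self.t_high_prsil else 0)
                if self.elapsed_ms >= self.t2_ms or self.high_elapsed_ms >= self.t3_ms:
                    self._enter(State.SILENCE)
        return self.state

    def _enter(self, state: State) -> None:
        self.state = state
        self.elapsed_ms = 0
        self.high_elapsed_ms = 0
```

Each frame either keeps the current state, advances a pre-state timer, or triggers a transition to an adjacent state, mirroring the closed loop of fig. 2.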
In some optional embodiments, the audio frame features include mel-frequency cepstral coefficient features, and the related art may refer to the prior art and will not be described herein again.
Please refer to fig. 2, which shows a flowchart of another voice activity detection method according to an embodiment of the present invention; this flowchart mainly further defines the step of "post-processing the probability distribution value of each audio frame feature being noise or speech, and outputting the state determination result of each audio frame feature" in step 103 of the above embodiment.
As shown in fig. 2, the state decision results include a mute state 201, a pre-audio state 202, an audio state 203, and a pre-mute state 204, wherein,
in the mute state 201, there is a mute threshold T_sil, the threshold for continuing to judge the mute state while currently in it: when the mute probability is greater than T_sil, the state is considered to remain mute; otherwise it converts to the pre-audio state;
in the pre-audio state 202, there is a pre-audio threshold T_prsp, the threshold for continuing to judge the pre-audio state while currently in it: when the speech probability is greater than T_prsp, the state is considered to remain pre-audio; if it remains pre-audio continuously for the time T1, it jumps to the audio state; when the speech probability is less than T_prsp, it jumps back to the mute state;
in the audio state 203, there is an audio threshold T_sp, the threshold for continuing to judge the audio state while currently in it: when the speech probability is greater than T_sp, the state is considered to remain audio; otherwise it converts from the audio state to the pre-mute state;
in the pre-mute state 204, there are a pre-mute low threshold T_low_prsil and a pre-mute high threshold T_high_prsil, the thresholds for continuing to judge the pre-mute state while currently in it: when the mute probability is greater than T_low_prsil, the state is considered pre-mute; after remaining pre-mute continuously for the time T2, or when the mute probability is greater than the other, higher threshold T_high_prsil for a shorter continuous time, it jumps to the mute state; when the mute probability is less than T_low_prsil, it jumps back to the audio state.
Let the probability distribution value of the audio frame feature be P, the first time T1, the second time T2, and the third time T3. For example, assuming the first audio frame feature is determined to be in the pre-audio state 202, the pre-audio threshold T_prsp is 0.6, and T1 is 10 ms:
when the probability distribution value P of the second audio frame feature is 0.4, smaller than T_prsp, the second audio frame feature is determined to be in the mute state 201, and the voice activation detection result converts to the mute state 201;
when the probability distribution value P of the second audio frame feature is 0.7, greater than T_prsp, and the duration is 6 ms, the second audio frame feature is determined to be in the pre-audio state 202, and the voice activation detection result stays in the pre-audio state 202;
when the probability distribution value P of the second audio frame feature is 0.7, greater than T_prsp, and the duration is 12 ms, the second audio frame feature is determined to be in the audio state 203, and the voice activation detection result converts to the audio state 203.
In the solution described in this embodiment, the first audio frame feature may also be assumed to be in the mute state 201, the audio state 203 or the pre-mute state 204; the pre-audio threshold T_prsp may take other values, and T1 may take other values.
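Feeding the first two cases of the worked example through the hypothetical VadPostProcessor sketched above (the state machine advances in whole 10 ms frames, so the 6 ms case is not representable at this granularity):

```python
vad = VadPostProcessor(t_prsp=0.6, t1_ms=10)

vad.state = State.PRESPEECH
print(vad.step(0.4))   # P = 0.4 < T_prsp: falls back to State.SILENCE

vad.state = State.PRESPEECH
print(vad.step(0.7))   # P = 0.7 > T_prsp held for T1 (one 10 ms frame): State.SPEECH
```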
It should be noted that, although the above embodiments adopt numbers with definite precedence order such as step 101 and step 102 to define the precedence order of the steps, in an actual application scenario, some steps may be executed in parallel, and the precedence order of some steps is also not defined by the numbers, and this application is not limited herein and is not described herein again.
The following is a specific example describing some problems encountered by the inventors in implementing the present invention and the final solution, so that those skilled in the art can better understand the solution of the present application.
The inventors discovered the drawbacks of these similar techniques in the process of implementing the present invention:
the zero-crossing rate is good in performance at a position without noise, but is not good in performance at a position with noise, the anti-interference capability is poor, the zero-crossing rate is only used for counting the frequency of the waveform passing through a very axis within a certain time, and the waveform also frequently passes through a transverse axis when noise exists.
The short-time energy mode is too direct, the actual effect is poor, and when the noise energy is large, the short-time energy mode can be misjudged as voice.
The method based on the neural network classifier is good for judging a single frame, but the relation between frames is not considered, and a short noise section exists in voice due to ventilation in voice of an actual person, the accuracy of judging the voice and the noise frame by the neural network is high, but the influence of previous and next frames is not considered, and the performance is poor in actual use due to ventilation in voice of the actual person.
In the course of implementing the invention, the inventors also found the reason why the above solution is not easily conceived:
when the traditional VAD (voice activity detection) method is used under the condition of insufficient data volume, noise can be reduced through signal processing, and the influence of noise on VAD (voice activity detection) is avoided. The approach of neural networks would be to window smoothing after the classifier outputs the classification probabilities to take into account the effect of context. The traditional technology does not carry out post-processing operation on output results, and some neural network VAD (voice activity detection) systems consider the VAD (voice activity detection) post-processing operation, but the traditional technology is rough and does not play the role of VAD (voice activity detection) post-processing.
The technical innovations of the invention are:
The VAD system with added state transitions, on the one hand, strengthens frame-level accuracy using neural network technology, and on the other hand introduces Hamming-window smoothing and state-transition post-processing to strengthen the robustness and practicality of the VAD system.
Continuing to refer to FIG. 2: T_sil is the threshold for continuing to judge the mute state while in it; when the mute probability is greater than T_sil, the state is considered to remain mute, otherwise it converts to the pre-audio state. T_prsp is the threshold for continuing to judge the pre-audio state while in it; when the speech probability is greater than T_prsp, the state is considered to remain pre-audio; if it remains pre-audio continuously for the time T1, it jumps to the audio state; when the speech probability is less than T_prsp, the state jumps back to mute. T_sp is the threshold for continuing to judge the audio state while in it; when the speech probability is greater than T_sp, the state is considered to remain audio, otherwise it converts from the audio state to the pre-mute state. T_low_prsil is the threshold for continuing to judge the pre-mute state while in it; when the mute probability is greater than T_low_prsil, the state is considered pre-mute; if it remains pre-mute continuously for the time T2, or the mute probability is greater than the other, higher threshold T_high_prsil, it jumps to the mute state; when the mute probability is less than T_low_prsil, it jumps back to the audio state.
Referring to fig. 3, a flowchart of a voice activity detection system according to an embodiment of the present invention is shown, which illustrates the processing flow by which a state-transition-based VAD (voice activity detection) system determines whether a segment of audio is speech or noise.
The method comprises the following steps. Step one: the input of the system is a segment of audio; the audio is processed, mainly by framing and extracting per-frame features, and the extracted features are generally MFCC (Mel frequency cepstral coefficient) features.
Step two: and (4) sending the extracted features to a neural network classifier, and calculating the probability distribution of noise and voice in each frame.
Step three: and sending the probability distribution of the noise and the voice of each frame to a VAD (voice activated detection) post-processing model for post-processing, and then outputting the final judgment result of each frame.
The VAD (voice activity detection) flow determines, for every frame, which state it belongs to based on the output value of the model and the thresholds of the states. Consecutive frames can only jump between states connected by arrows in the figure. As can be seen from the figure, there are four states in total: the mute state (silence), the pre-audio state (prespeech), the audio state (speech), and the pre-mute state (presilence). The figure shows the conditional thresholds for the transitions between states, and a state transition is triggered only when one or more thresholds are met.
A beta version formed by the inventors in the process of implementing the invention:
Before the final scheme was determined, an earlier version was derived; it differs from the final version in that the pre-mute state (presilence) has only one threshold (T_prsil). Because one threshold is removed, this scheme is relatively simple to tune. Its disadvantage is that if a person pauses briefly while speaking, VAD (voice activation detection) is cut off and the following audio is not sent to the recognition service. The final version therefore adds high and low thresholds for deciding whether to enter the mute state (silence): when the probability is greater than the high threshold, the system stays in the pre-mute state (presilence) only a short time and then jumps to the mute state (silence); if the probability is only greater than the low threshold, it stays in the pre-mute state (presilence) longer before jumping to the mute state (silence).
Deeper effects found by the inventors in the process of implementing the invention:
the scheme can effectively solve the problems of abnormal frames in the VAD (voice activation detection) process and the condition that silence and noise sections are mixed in the speaking process of a person, and greatly improves the accuracy and the usability of the VAD (voice activation detection). By optimizing VAD (voice activity detection) performance, wake-up and recognition performance can be further improved.
Referring to fig. 4, a block diagram of a voice activity detection apparatus according to an embodiment of the invention is shown.
As shown in fig. 4, the apparatus includes an audio processing module 410, an audio analysis module 420, and a result conversion module 430.
The audio processing module 410 is configured to process the received audio to generate audio frame features; the audio analysis module 420 is configured to calculate, based on a neural network classifier, the probability distribution value of each audio frame feature being noise or speech; and the result conversion module 430 is configured to post-process the probability distribution value of each audio frame feature being noise or speech and output the state determination result of each audio frame feature, where the state determination result includes a mute state, a pre-audio state, an audio state, and a pre-mute state.
It should be understood that the modules depicted in fig. 4 correspond to various steps in the method described with reference to fig. 1. Thus, the operations and features described above for the method and the corresponding technical effects are also applicable to the modules in fig. 4, and are not described again here.
It should be noted that the modules in the embodiments of the present application are not intended to limit the solution of the present application; for example, the audio processing module may be described as a module that processes the received audio to generate audio frame features. In addition, the related functional modules may also be implemented by a hardware processor; for example, the audio processing module may be implemented by a processor, which is not described again here.
In other embodiments, the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions may execute the voice activation detection method in any of the above method embodiments;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
processing the received audio to generate audio frame characteristics;
calculating the probability distribution value of each audio frame characteristic which is noise or voice based on a neural network classifier;
and post-processing the probability distribution value of each audio frame characteristic which is noise or voice, and outputting a state judgment result of each audio frame characteristic, wherein the state judgment result comprises a mute state, a pre-audio state, an audio state and a pre-mute state.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the voice activation detection apparatus, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the voice activation detection apparatus over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention further provide a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium; the computer program includes program instructions which, when executed by a computer, cause the computer to execute any one of the above voice activation detection methods.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 5, the electronic device includes one or more processors 510 and a memory 520, with one processor 510 taken as an example in fig. 5. The apparatus for the voice activity detection method may further include an input device 530 and an output device 540. The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 5. The memory 520 is a non-volatile computer-readable storage medium as described above. The processor 510 executes various functional applications and data processing of the server by running the non-volatile software programs, instructions, and modules stored in the memory 520, i.e., implements the voice activation detection method of the above method embodiments. The input device 530 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the voice activation detection device. The output device 540 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided in the embodiment of the present invention.
As an embodiment, the electronic device is applied to a voice activation detection apparatus, and includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
processing the received audio to generate audio frame characteristics;
calculating the probability distribution value of each audio frame characteristic which is noise or voice based on a neural network classifier;
and post-processing the probability distribution value of each audio frame characteristic which is noise or voice, and outputting a state judgment result of each audio frame characteristic, wherein the state judgment result comprises a mute state, a pre-audio state, an audio state and a pre-mute state.

The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communications capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.
(2) Ultra mobile personal computer device: the equipment belongs to the category of personal computers, has calculation and processing functions and generally has mobile internet access characteristics. Such terminals include: PDA, MID, and UMPC devices, etc.
(3) A portable entertainment device: such devices may display and play multimedia content. The devices comprise audio and video players, handheld game consoles, electronic books, intelligent toys and portable vehicle-mounted navigation devices.
(4) A server: similar to a general computer architecture, but with higher requirements on processing capability, stability, reliability, security, scalability, manageability and the like, because it needs to provide highly reliable services.
(5) Other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. A voice activity detection method, comprising:
processing the received audio to generate audio frame characteristics, wherein the audio frame characteristics comprise Mel frequency cepstrum coefficient characteristics;
calculating the probability distribution value of each audio frame characteristic which is noise or voice based on a neural network classifier;
performing post-processing on the probability distribution value of each audio frame feature being noise or speech, and outputting a state determination result of each audio frame feature, wherein the state determination result comprises a mute state, a pre-audio state, an audio state and a pre-mute state; the mute state, the pre-audio state, the audio state and the pre-mute state form a closed loop, and the states of adjacent audio frame features can only switch between adjacent states; the probability distribution value of each audio frame feature is compared with the threshold of each state, and the state determination result corresponding to each audio frame feature is output based on the comparison result; in the mute state, there is a mute threshold T_sil; in the pre-audio state, there is a pre-audio threshold T_prsp; in the audio state, there is an audio threshold T_sp;
in the pre-mute state, there are a pre-mute low threshold T_low_prsil and a pre-mute high threshold T_high_prsil; in the pre-mute state, if the probability distribution value of the audio frame feature is greater than the pre-mute high threshold T_high_prsil, the state lingers in the pre-mute state for a short time and then jumps to the mute state; and if the probability distribution value of the audio frame feature is greater than the pre-mute low threshold T_low_prsil, the state stays in the pre-mute state for a longer time and then jumps to the mute state.
2. The method of claim 1, further comprising:
in the mute state, if the probability distribution value of the audio frame feature is smaller than the mute threshold T_sil, entering the pre-audio state;
in the pre-audio state, if the probability distribution value of the audio frame feature is smaller than the pre-audio threshold T_prsp, returning to the mute state; if the probability distribution value of the audio frame feature is larger than the pre-audio threshold T_prsp and the duration is greater than or equal to a first preset time, entering the audio state;
in the audio state, if the probability distribution value of the audio frame feature is smaller than the audio threshold T_sp, entering the pre-mute state; in the pre-mute state, if the probability distribution value of the audio frame feature is smaller than the pre-mute low threshold T_low_prsil, returning to the audio state; if the probability distribution value of the audio frame feature is greater than the pre-mute low threshold T_low_prsil and the duration is greater than or equal to a second preset time, or if the probability distribution value of the audio frame feature is greater than the pre-mute high threshold T_high_prsil and the duration is greater than or equal to a third preset time, entering the mute state.
3. The method of claim 1, further comprising:
in the mute state, if the probability distribution value of the audio frame feature is not less than the mute threshold T_sil, the mute state is maintained;
in the pre-audio state, if the probability distribution value of the audio frame feature is not less than the pre-audio threshold T_prsp and the duration is less than a first preset time, the pre-audio state is maintained;
in the audio state, if the probability distribution value of the audio frame feature is not smaller than the audio threshold T_sp, the audio state is maintained; in the pre-mute state, if the probability distribution value of the audio frame feature is not less than the pre-mute low threshold T_low_prsil and the duration is less than a second preset time, or not less than the pre-mute high threshold T_high_prsil and the duration is less than a third preset time, the pre-mute state is maintained.
4. The method according to claim 2 or 3, wherein the third preset time is less than the second preset time.
5. A voice activation detection apparatus comprising:
an audio processing module configured to process the received audio to generate audio frame features, wherein the audio frame features include mel-frequency cepstrum coefficient features;
the audio analysis module is used for calculating the probability distribution value of each audio frame characteristic which is noise or voice based on the neural network classifier;
a result conversion module configured to perform post-processing on the probability distribution value of each audio frame feature being noise or speech and output a state determination result of each audio frame feature, wherein the state determination result includes a mute state, a pre-audio state, an audio state and a pre-mute state; the mute state, the pre-audio state, the audio state and the pre-mute state form a closed loop, and the states of adjacent audio frame features can only convert between adjacent states; the probability distribution value of each audio frame feature is compared with the threshold of each state, and the state determination result corresponding to each audio frame feature is output based on the comparison result; in the mute state, there is a mute threshold T_sil; in the pre-audio state, there is a pre-audio threshold T_prsp; in the audio state, there is an audio threshold T_sp;
in the pre-mute state, there are a pre-mute low threshold T_low_prsil and a pre-mute high threshold T_high_prsil; in the pre-mute state, if the probability distribution value of the audio frame feature is greater than the pre-mute high threshold T_high_prsil, the state stays in the pre-mute state for a short time and then jumps to the mute state; and if the probability distribution value of the audio frame feature is greater than the pre-mute low threshold T_low_prsil, the state stays in the pre-mute state for a longer time and then jumps to the mute state.
6. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-4.
7. A storage medium having a computer program stored thereon, the computer program, when being executed by a processor, realizing the steps of the method according to any one of claims 1 to 4.
CN202011572868.0A 2020-12-23 2020-12-23 Voice activation detection method and device Active CN112614506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011572868.0A CN112614506B (en) 2020-12-23 2020-12-23 Voice activation detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011572868.0A CN112614506B (en) 2020-12-23 2020-12-23 Voice activation detection method and device

Publications (2)

Publication Number Publication Date
CN112614506A (en) 2021-04-06
CN112614506B (en) 2022-10-25

Family

ID=75247998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011572868.0A Active CN112614506B (en) 2020-12-23 2020-12-23 Voice activation detection method and device

Country Status (1)

CN (1): CN112614506B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116847245B * 2023-06-30 2024-04-09 Zhejiang Xinmai Microelectronics Co., Ltd. Digital audio automatic gain method, system and computer storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102044242B * 2009-10-15 2012-01-25 Huawei Technologies Co., Ltd. Method, device and electronic equipment for voice activation detection
CN105810214B * 2014-12-31 2019-11-05 Spreadtrum Communications (Shanghai) Co., Ltd. Voice-activation detecting method and device
CN108877778B * 2018-06-13 2019-09-17 Baidu Online Network Technology (Beijing) Co., Ltd. Sound end detecting method and equipment
CN109473123B * 2018-12-05 2022-05-31 Baidu Online Network Technology (Beijing) Co., Ltd. Voice activity detection method and device
CN110827858B * 2019-11-26 2022-06-10 Sipic Technology Co., Ltd. Voice endpoint detection method and system
CN111276127B * 2020-03-31 2023-02-24 Beijing ByteDance Network Technology Co., Ltd. Voice awakening method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN112614506A (en) 2021-04-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant after: Sipic Technology Co.,Ltd.
Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant before: AI SPEECH Co.,Ltd.
GR01 Patent grant