CN115171735A - Voice activity detection method, storage medium and electronic equipment - Google Patents

Voice activity detection method, storage medium and electronic equipment

Info

Publication number
CN115171735A
Authority
CN
China
Prior art keywords
audio
activity detection
voice activity
detection model
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210771781.9A
Other languages
Chinese (zh)
Inventor
邢原博
张宇飞
黄辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202210771781.9A priority Critical patent/CN115171735A/en
Publication of CN115171735A publication Critical patent/CN115171735A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Abstract

In the embodiments of this specification, the audio features of the audio to be recognized are input into a voice activity detection model to obtain the recognition result output by the model; classification parameters are then determined according to the recognition scene type corresponding to the audio to be recognized; finally, the audio to be recognized is recognized according to the recognition result output by the voice activity detection model and the classification parameters. As a result, only one universal voice activity detection model needs to be deployed, rather than a separate voice activity detection model for each recognition scene, which effectively reduces the cost of model deployment and iterative upgrades.

Description

Voice activity detection method, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a voice activity detection method, a storage medium, and an electronic device.
Background
Currently, speech recognition technology has been widely applied in various scenes, such as telephone speech recognition, voice conference recording, and the like.
In the prior art, pre-trained machine learning models are typically employed for speech recognition. Specifically, Voice Activity Detection (VAD) first needs to be performed on the audio to be recognized: the audio is input into a voice activity detection model to identify its speech segments and non-speech segments, and the identified speech segments are then input into a speech recognition model for speech recognition. The voice activity detection model is essentially a binary classification model that classifies the input audio into speech segments and non-speech segments.
However, in practical applications, the sampling frequency and the input mode of the audio to be recognized are different for different application scenarios, for example, in a telephone voice recognition scenario, the sampling frequency of the audio to be recognized is typically 8KHz, and the input mode is a streaming input, whereas in a voice conference recording scenario, the sampling frequency is typically 16KHz, and the input mode is a non-streaming input. Therefore, for the audio to be recognized in different application scenarios, different voice activity detection models often need to be trained to recognize the voice segments and the non-voice segments, which increases the costs of model deployment and iterative upgrade.
In view of this, how to reduce the cost of model deployment and iterative upgrade becomes an urgent problem to be solved.
Disclosure of Invention
Embodiments of the present specification provide a voice activity detection method, an apparatus, a storage medium, and an electronic device, so as to partially solve the problems in the prior art.
The embodiment of the specification adopts the following technical scheme:
the present specification provides a voice activity detection method, including:
acquiring audio to be identified;
extracting audio features from each audio frame of the audio to be identified;
inputting the audio features of each audio frame into a pre-trained voice activity detection model to obtain the probability that each audio frame output by the voice activity detection model is a voice section;
determining a classification parameter according to the identification scene type corresponding to the audio to be identified;
and identifying whether each audio frame is a voice section or not according to the probability that each audio frame output by the voice activity detection model is a voice section and the classification parameters.
Optionally, the acquiring the audio to be recognized specifically includes:
acquiring audio to be identified of streaming input; or acquiring the audio to be identified of the non-streaming input;
inputting the audio features of each audio frame into a pre-trained voice activity detection model, specifically comprising:
when the audio to be recognized is the audio of stream input, inputting the audio characteristics of each audio frame in the audio to be recognized into a pre-trained voice activity detection model in a stream input mode;
and when the audio to be recognized is the non-streaming input audio, inputting the audio features of each audio frame in the audio to be recognized into a pre-trained voice activity detection model in a non-streaming input mode.
Optionally, extracting audio features from each audio frame of the audio to be identified specifically includes:
according to the sampling frequency of the audio to be identified, extracting normalized audio features from each audio frame of the audio to be identified; the normalized audio features include: the MixBand FBank feature.
Optionally, determining a classification parameter according to the identification scene type corresponding to the audio to be identified specifically includes:
determining a probability threshold corresponding to the identification scene type according to the identification scene type corresponding to the audio to be identified;
according to the probability that each audio frame output by the voice activity detection model is a voice segment and the classification parameters, identifying whether each audio frame is a voice segment specifically comprises:
and for each audio frame, if the probability that the audio frame output by the voice activity detection model is a voice section is greater than the probability threshold value, identifying the audio frame as the voice section, otherwise, identifying the audio frame as a non-voice section.
Optionally, the acquiring the audio to be recognized specifically includes:
receiving an identification request, and acquiring audio to be identified corresponding to the identification request;
determining a classification parameter according to the identification scene type corresponding to the audio to be identified, specifically comprising:
determining an identification scene type corresponding to the audio to be identified according to the identification request;
determining a smoothing processing parameter according to the identification scene type corresponding to the audio to be identified;
according to the probability that each audio frame output by the voice activity detection model is a voice segment and the classification parameters, identifying whether each audio frame is a voice segment specifically comprises:
aiming at each audio frame, identifying whether the audio frame is a voice section or not according to the probability that the audio frame output by the voice activity detection model is a voice section and a predetermined probability threshold value;
and smoothing the recognition result of each audio frame according to the smoothing parameter and the recognition result of each audio frame in the audio to be recognized.
Optionally, the smoothing parameters comprise a minimum speech length n1 and a minimum non-speech length n2;
According to the smoothing processing parameter and the recognition result of each audio frame in the audio to be recognized, smoothing the recognition result of each audio frame, specifically comprising:
extracting continuous audio frames identified as speech segments and continuous audio frames identified as non-speech segments from the audio to be identified;
if the number of consecutive audio frames identified as speech segments is less than n1, these consecutive audio frames are all regarded as non-speech segments;
if the number of consecutive audio frames identified as non-speech segments is less than n2, these consecutive audio frames are all regarded as speech segments.
Optionally, the pre-training of the voice activity detection model specifically includes:
acquiring sample audio and a label of the sample audio, wherein the label of the sample audio comprises a speech section and a non-speech section in the sample audio;
extracting audio features from each sample audio frame of the sample audio;
inputting the audio features of each sample audio frame into a voice activity detection model to be trained in a streaming input mode to obtain a first recognition result output by the voice activity detection model to be trained; inputting the audio features of each sample audio frame into a voice activity detection model to be trained in a non-streaming input mode to obtain a second recognition result output by the voice activity detection model to be trained;
determining the loss of the voice activity detection model to be trained according to the label of the sample audio, the first recognition result and the second recognition result;
and adjusting the model parameters of the voice activity detection model to be trained by taking the loss minimization as a training target.
Optionally, determining a loss of the to-be-trained voice activity detection model according to the label of the sample audio, the first recognition result, and the second recognition result, specifically including:
determining a difference between the first recognition result and the label of the sample audio as a first difference;
determining a difference between the second recognition result and the label of the sample audio as a second difference;
determining the difference between the first recognition result and the pseudo label as a third difference by taking the second recognition result as the pseudo label;
and determining the loss of the voice activity detection model to be trained according to the first difference, the second difference and the third difference.
The present specification provides a voice activity detection apparatus, the apparatus comprising:
the acquisition module is used for acquiring the audio to be identified;
the characteristic extraction module is used for extracting audio characteristics from each audio frame of the audio to be identified;
the recognition module is used for inputting the audio characteristics of each audio frame into a pre-trained voice activity detection model to obtain the probability that each audio frame output by the voice activity detection model is a voice section;
the post-processing module is used for determining classification parameters according to the identification scene type corresponding to the audio to be identified; and identifying whether each audio frame is a voice section or not according to the probability that each audio frame output by the voice activity detection model is a voice section and the classification parameters.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the voice activity detection method described above.
The present specification provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the voice activity detection method described above.
The embodiment of the specification adopts at least one technical scheme which can achieve the following beneficial effects:
in the embodiments of the specification, the audio features of the audio to be recognized are input into the voice activity detection model to obtain the recognition result output by the model; classification parameters are then determined according to the recognition scene type corresponding to the audio to be recognized; finally, the audio to be recognized is recognized according to the recognition result output by the voice activity detection model and the classification parameters. Therefore, only one universal voice activity detection model needs to be deployed, rather than a separate voice activity detection model for each recognition scene, which effectively reduces the cost of model deployment and iterative upgrades.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification and are incorporated in and constitute a part of this specification, illustrate embodiments of the specification and, together with the description, serve to explain the principles of the specification; they are not intended to limit it. In the drawings:
fig. 1 is a schematic diagram of a voice activity detection process provided in an embodiment of the present specification;
fig. 2 is a schematic structural diagram of a voice activity detection apparatus provided in an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an electronic device provided in an embodiment of the present specification.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more clear, the technical solutions of the present disclosure will be clearly and completely described below with reference to the specific embodiments of the present disclosure and the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without any creative effort belong to the protection scope of the present specification.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a voice activity detection process provided in an embodiment of the present specification, which specifically includes the following steps:
s100: and acquiring the audio to be identified.
Embodiments of the present specification are directed to deploying only one voice activity detection model through which audio of various input modes (e.g., streaming input and non-streaming input), various sampling frequencies (e.g., 8KHz, 16 KHz) is recognized.
First, the voice activity detection apparatus obtains the audio to be recognized, which may be streaming input audio or non-streaming input audio. Specifically, a recognition request may be received and the audio to be recognized corresponding to that recognition request may be acquired. The audio to be recognized may be carried in the recognition request, or the recognition requester may attach a mark corresponding to the recognition request to the audio, indicating that it is the audio to be recognized for that request.
The recognition request may be sent to the voice activity detection apparatus by a recognition requester while executing a service: according to the service identifier of the service it is executing, the recognition requester sends a recognition request carrying that service identifier to the voice activity detection apparatus. Correspondingly, from the service identifier, the voice activity detection apparatus can determine the recognition scene type corresponding to the audio to be recognized, such as telephone speech recognition or voice conference recording. Of course, the voice activity detection apparatus may also determine the recognition scene type in other ways, which are not described in detail here.
The voice activity detection apparatus may pre-store the correspondence between different recognition scene types and their input modes and sampling frequencies. In general, the input mode and sampling frequency of the audio to be recognized differ across recognition scene types, so once the voice activity detection apparatus has determined the recognition scene type corresponding to the audio to be recognized, it can determine the input mode and sampling frequency of that audio and then perform voice activity detection accordingly.
S102: audio features are extracted from each audio frame of the audio to be identified.
Since the audio features extracted from the audio to be recognized with different sampling frequencies are different, in the embodiments of the present specification, normalized audio features need to be extracted from each audio frame of the audio to be recognized according to the sampling frequency of the audio to be recognized. The normalized audio features include, but are not limited to, the MixBand FBank feature. An audio frame described in the embodiments of the present specification may specifically be a section of audio of a specified duration, such as a section of audio of 10ms, in the audio to be recognized.
Specifically, for a given segment of an audio signal, a Fourier transform is typically used to convert the signal from the time domain to the frequency domain. For example, a 1s audio signal with a sampling frequency of 16KHz contains 16000 amplitude samples; assuming an audio frame of 10ms, the Fourier transform converts these samples into 100 audio features (each represented as a vector), where each audio feature represents the intensity of the corresponding audio frame at each frequency. According to the Nyquist sampling law, the resulting frequency range is 0-8000Hz; that is, the higher the sampling frequency of the audio signal, the easier it is to capture its high-frequency components. In general, if the sampling frequency of an audio signal is f, the audio features extracted from it can represent frequencies of at most f/2. The audio features of each audio frame are then filtered with a triangular Mel filter bank; with 40 filters, 40 feature values are obtained per frame. Finally, the 1s audio signal is converted into a 100 × 40 feature matrix as input to the subsequent voice activity detection model.
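The per-frame feature extraction described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: it assumes 10ms non-overlapping frames, a magnitude spectrum from an FFT per frame, and a 40-filter triangular Mel filter bank; the function name and parameter defaults are hypothetical.

```python
import numpy as np

def fbank_features(signal, sample_rate=16000, frame_ms=10, n_filters=40):
    """Per-frame log Mel filter-bank (FBank) features, shape (n_frames, n_filters)."""
    frame_len = int(sample_rate * frame_ms / 1000)        # e.g. 160 samples per 10 ms frame
    n_frames = len(signal) // frame_len                   # e.g. 100 frames for 1 s of audio
    frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))

    # Magnitude spectrum per frame; covers 0 .. sample_rate/2 (Nyquist)
    spectrum = np.abs(np.fft.rfft(frames, axis=1))        # (n_frames, frame_len//2 + 1)

    # Triangular Mel filter bank spanning 0 .. sample_rate/2
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    fbank = np.zeros((n_filters, spectrum.shape[1]))
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, mid):
            fbank[i - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[i - 1, k] = (hi - k) / max(hi - mid, 1)

    # 1 s of 16KHz audio -> a 100 x 40 feature matrix
    return np.log(spectrum @ fbank.T + 1e-10)
```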
As can be seen from the above description, for two audio signals with sampling frequencies of 8KHz and 16KHz, the above method extracts audio features of 40 feature values (i.e., 40 dimensions) per audio frame in both cases, but the physical meanings of these 40-dimensional features differ. For a signal sampled at 8KHz, the extracted 40-dimensional features describe the signal over 0-4000Hz, while for a signal sampled at 16KHz they describe the signal over 0-8000Hz. That is, although the feature dimensions are the same, what they express is not. Since this application aims to recognize audio with different sampling frequencies through one voice activity detection model, the physical meanings of the input audio features need to be unified; in other words, the audio features of the audio to be recognized need to be normalized, which can be done by extracting MixBand FBank features from the audio to be recognized.
Three methods of extracting normalized audio features (MixBand FBank features) are provided below.
Firstly, the audio to be identified is processed into audio whose sampling frequency equals a preset standard sampling frequency, and audio features are then extracted from each audio frame of the processed audio as the normalized audio features. Processing the audio to the standard sampling frequency may be done by interpolation or filtering, i.e., up-sampling or down-sampling. For example, assuming the standard sampling frequency is 16KHz: if the audio to be identified is sampled at 16KHz, audio features may be extracted directly as the normalized audio features; if it is sampled at 8KHz, it may first be interpolated to 16KHz, and audio features are then extracted from the 16KHz audio as the normalized audio features. Assuming instead that the standard sampling frequency is 8KHz, audio sampled at 16KHz may be filtered down to 8KHz, and audio features are then extracted from the 8KHz audio as the normalized audio features.
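A minimal sketch of this first approach, using polyphase resampling from SciPy as one possible form of the interpolation/filtering mentioned above; the 16KHz standard sampling frequency and the function name are illustrative assumptions.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_standard_rate(signal, sample_rate, standard_rate=16000):
    """Resample audio to the standard sampling frequency (up- or down-sampling),
    after which FBank features are extracted as usual."""
    if sample_rate == standard_rate:
        return np.asarray(signal)
    g = gcd(standard_rate, sample_rate)
    return resample_poly(signal, standard_rate // g, sample_rate // g)

# e.g. an 8KHz phone recording is upsampled to 16KHz before feature extraction:
# audio_16k = to_standard_rate(audio_8k, sample_rate=8000)
```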
Second, normalized audio features are extracted by low-dimensional alignment of the features. Specifically, a standard sampling frequency may be preset that is not lower than the maximum sampling frequency of the audio to be identified. When audio features are extracted using a triangular Mel filter bank containing m filters (m is a positive integer, for example 40), a positive integer k is determined from the difference between the standard sampling frequency and the sampling frequency of the audio to be identified: the larger the difference, the smaller k, and the smaller the difference, the larger k. (The first k of the m triangular filters corresponding to the standard sampling frequency are selected so that their frequency range covers the frequency domain of the audio to be identified as completely as possible; since the distribution of the filters over frequency is fixed once the endpoints are given, k can be determined directly from the standard sampling frequency and the sampling frequency of the audio to be identified.) These k of the m filters are then used to extract audio features from each audio frame of the audio to be identified. The extracted audio features have k dimensions, so m-k feature values need to be appended to convert them into m-dimensional audio features; specifically, the m-k values may be added by zero padding, i.e., padding with m-k values of 0. In this way, m-dimensional audio features can be extracted regardless of the sampling frequency of the audio to be identified, and each dimension has a uniform meaning. The relationship between the difference and the positive integer k may be linear or nonlinear.
For example, assuming the standard sampling frequency is 16KHz and the sampling frequency of the audio to be identified is 8KHz, for a triangular Mel filter bank containing 40 filters the positive integer k may be determined as 30: 30 filters are used to extract 30-dimensional audio features from each audio frame of the audio to be identified, and 10 feature values of 0 are then appended to obtain 40-dimensional audio features. The audio features of audio sampled at 8KHz can only represent frequencies of 0-4000Hz, while the audio features of audio sampled at 16KHz can represent frequencies of 0-8000Hz. However, because the Mel filters are not uniformly distributed over frequency (denser at lower frequencies and sparser at higher frequencies), the first 30 filters at a sampling frequency of 16KHz correspond to 0-4000Hz and the last 10 filters correspond to 4000-8000Hz. That is, the relationship between the difference between the standard sampling frequency and the sampling frequency of the audio to be identified and the positive integer k is a nonlinear negative correlation.
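A minimal sketch of this zero-padding alignment under the example above (m = 40 filters at the 16KHz standard frequency, k = 30 filters for 8KHz audio); the sampling-rate-to-k lookup table and the function name are illustrative assumptions beyond this example.

```python
import numpy as np

# Hypothetical lookup: number of filters k whose frequency range covers the
# audio's own 0 .. f/2 band, for a 40-filter bank defined at 16KHz.
K_FOR_RATE = {16000: 40, 8000: 30}

def mixband_align(features_k, sample_rate, m=40):
    """Zero-pad k-dimensional features (extracted with the first k filters of the
    standard filter bank) up to m dimensions so every input has the same layout."""
    k = K_FOR_RATE[sample_rate]
    assert features_k.shape[1] == k
    padded = np.zeros((features_k.shape[0], m), dtype=features_k.dtype)
    padded[:, :k] = features_k          # dimensions k..m-1 stay 0 for low-rate audio
    return padded
```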
Third, normalized audio features are extracted using a machine learning model. Specifically, a machine learning model that outputs audio at a preset standard sampling frequency may be trained in advance; the audio to be recognized is input into this model to obtain audio at the standard sampling frequency, which is then processed as in the first method above. Alternatively, the standard sampling frequency may be preset and a machine learning model that outputs (m-k)-dimensional audio features may be trained in advance; a positive integer k is determined as in the second method, k-dimensional audio features are extracted from the audio to be identified and input into the model to obtain the (m-k)-dimensional audio features, and the k-dimensional and (m-k)-dimensional features together form the m-dimensional normalized audio features.
S104: and inputting the audio characteristics of each audio frame into a pre-trained voice activity detection model to obtain the probability that each audio frame output by the voice activity detection model is a voice section.
After the audio features of each audio frame in the audio to be recognized are obtained in step S102, the audio features of each audio frame may be input into the pre-trained voice activity detection model, so as to obtain the probability that each audio frame output by the voice activity detection model is a voice segment.
Specifically, when the audio to be recognized is determined to be the audio of the streaming input according to the recognition request, the audio features of each audio frame in the audio to be recognized are input into the pre-trained voice activity detection model in a streaming input mode; and when the audio to be recognized is determined to be the non-streaming input audio according to the recognition request, inputting the audio features of each audio frame in the audio to be recognized into the pre-trained voice activity detection model in a non-streaming input mode.
For a streaming-input recognition scene, such as telephone speech recognition, the audio features of every i audio frames may be fed to the voice activity detection model as one input, whereas for a non-streaming recognition scene, such as voice conference recording, the audio features of every j audio frames may be fed as one input, where j is greater than i. For example, with i = 20 and j = 200, the audio features of every 20 audio frames are input to the voice activity detection model at a time for streaming input, and the audio features of every 200 audio frames are input at a time for non-streaming input.
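As a rough sketch of how one model can serve both input modes, the per-frame feature matrix can simply be chunked differently before being fed to the model; i = 20 and j = 200 follow the example above, while the function name is an assumption.

```python
def feature_chunks(features, streaming, i=20, j=200):
    """Yield model inputs: i frames per chunk for streaming audio, j frames per
    chunk for non-streaming audio (features is a sequence of per-frame vectors)."""
    size = i if streaming else j
    for start in range(0, len(features), size):
        yield features[start:start + size]
```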
S106: and determining classification parameters according to the identification scene type corresponding to the audio to be identified.
S108: and identifying whether each audio frame is a voice section or not according to the probability and the classification parameters of each audio frame output by the voice activity detection model as the voice section.
The classification parameters described in the embodiments of the present specification include, but are not limited to, a probability threshold and a smoothing parameter.
When the classification parameter is a probability threshold, the recognition scene type may be determined according to the recognition request, and the probability threshold corresponding to that recognition scene type is then determined; different recognition scene types correspond to different probability thresholds. Then, in step S108, for each audio frame, if the probability that the audio frame is a speech segment, as output by the voice activity detection model, is greater than the probability threshold, the audio frame is identified as a speech segment; otherwise, it is identified as a non-speech segment. For example, in a telephone speech recognition scene the noise is generally high (especially for telephone speech recognition during delivery dispatch), so the corresponding probability threshold may be set to a higher value, whereas in a voice conference recording scene the noise is generally low, so the corresponding probability threshold may be set to a lower value.
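A minimal sketch of this thresholding step; the scene names and threshold values below are purely illustrative assumptions, not values taken from the patent.

```python
# Hypothetical per-scene probability thresholds (noisy phone calls get a higher bar).
SCENE_THRESHOLD = {"phone_call": 0.7, "conference_recording": 0.5}

def classify_frames(speech_probs, scene_type):
    """Map per-frame speech probabilities to True (speech) / False (non-speech)."""
    threshold = SCENE_THRESHOLD[scene_type]
    return [p > threshold for p in speech_probs]
```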
When the classification parameter is a smoothing processing parameter, the identification scene type can be determined according to the identification request, and then the smoothing processing parameter corresponding to the identification scene type is determined according to the identification scene type. In step S108, for each audio frame, it may be identified whether the audio frame is a speech segment according to the probability that the audio frame output by the speech activity detection model is a speech segment and a predetermined probability threshold, and the identification result of each audio frame is smoothed according to the smoothing parameter and the identification result of each audio frame in the audio to be identified.
Specifically, the smoothing parameters may include a minimum speech length n1 and a minimum non-speech length n2. During smoothing, consecutive audio frames recognized as speech segments and consecutive audio frames recognized as non-speech segments are extracted from the audio to be recognized. If the number of consecutive audio frames recognized as speech segments is less than n1, those frames are all regarded as non-speech segments; if the number of consecutive audio frames recognized as non-speech segments is less than n2, those frames are all regarded as speech segments.
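A minimal sketch of this smoothing step as a single left-to-right pass over the per-frame labels; the patent does not specify the order in which short speech and short non-speech runs are handled, so that ordering is an assumption here.

```python
def smooth_labels(labels, n1, n2):
    """Relabel runs of speech shorter than n1 frames as non-speech, and runs of
    non-speech shorter than n2 frames as speech (labels: list of bools)."""
    labels = list(labels)
    start = 0
    while start < len(labels):
        end = start
        while end < len(labels) and labels[end] == labels[start]:
            end += 1                              # find the end of the current run
        run_len = end - start
        if labels[start] and run_len < n1:        # too-short speech run
            labels[start:end] = [False] * run_len
        elif not labels[start] and run_len < n2:  # too-short non-speech run
            labels[start:end] = [True] * run_len
        start = end
    return labels
```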
Of course, the classification parameters may also include a probability threshold and a smoothing parameter, and step S108 here specifically includes: for each audio frame, according to the probability that the audio frame output by the voice activity detection model is a voice segment and the probability threshold corresponding to the recognition scene type, if the probability that the audio frame output by the voice activity detection model is a voice segment is greater than the probability threshold, the audio frame is recognized as a voice segment, otherwise, the audio frame is recognized as a non-voice segment; and smoothing the recognition result of each audio frame according to the smoothing processing parameter and the recognition result of each audio frame in the audio to be recognized.
After the speech segment and the non-speech segment of the audio to be recognized are recognized by the method shown in fig. 1, the recognized speech segment may be input into a pre-trained speech recognition model, so as to recognize the input speech segment through the speech recognition model, i.e. convert the speech segment into a corresponding text.
With the above method, only one universal voice activity detection model needs to be deployed: after the output of the model is obtained, the corresponding post-processing is performed according to the recognition scene type (steps S106-S108), so the method is applicable to a variety of recognition scenes. There is no need to deploy a separate voice activity detection model for each recognition scene, which effectively reduces the cost of model deployment and iterative upgrades.
In particular, such a universal voice activity detection model with high generalization can process audio to be recognized with any input mode and any sampling frequency.
In an embodiment of the present disclosure, the method for training the generic voice activity detection model with higher generalization includes: obtaining sample audio and a label of the sample audio, wherein the label of the sample audio comprises a speech section and a non-speech section in the sample audio; extracting audio features from each sample audio frame of the sample audio; inputting the audio characteristics of each sample audio frame into a voice activity detection model to be trained in a streaming input mode to obtain a first recognition result output by the voice activity detection model to be trained; inputting the audio features of each sample audio frame into a voice activity detection model to be trained in a non-streaming input mode to obtain a second recognition result output by the voice activity detection model to be trained; determining the loss of the voice activity detection model to be trained according to the label of the sample audio, the first recognition result and the second recognition result; and adjusting the model parameters of the voice activity detection model to be trained by taking the loss minimization as a training target.
The method for extracting audio features from each sample audio frame of the sample audio is identical to the method for extracting normalized audio features (MixBand FBank features) in step S102 of fig. 1, and is not repeated here. Likewise, the way the audio features of each sample audio frame are input into the to-be-trained voice activity detection model in a streaming manner is identical to the way the audio features of each audio frame of the audio to be recognized are input into the pre-trained voice activity detection model in a streaming manner in step S104 of fig. 1, and the way the audio features of each sample audio frame are input in a non-streaming manner is identical to the non-streaming input in step S104 of fig. 1; neither is repeated here.
It should be noted that, in the embodiments of this specification, the voice activity detection model may be trained in a knowledge distillation manner. That is, when determining the loss, the difference between the first recognition result and the label of the sample audio is taken as a first difference, the difference between the second recognition result and the label of the sample audio is taken as a second difference, the second recognition result is used as a pseudo label and the difference between the first recognition result and this pseudo label is taken as a third difference, and the loss of the voice activity detection model to be trained is determined from the first, second, and third differences. The first, second, and third differences are each positively correlated with the loss; specifically, they may be weighted and summed according to preset weights, and the weighted sum is used as the loss of the voice activity detection model to be trained.
The second recognition result is used as the pseudo label when determining the third difference because, in general, recognition results obtained with non-streaming input are more accurate than those obtained with streaming input.
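A minimal PyTorch-style sketch of this loss, under the assumption that the model emits per-frame speech probabilities and that the three differences are measured with binary cross-entropy; the weights and the function name are illustrative, not values from the patent.

```python
import torch
import torch.nn.functional as F

def vad_distillation_loss(stream_probs, nonstream_probs, labels, w=(1.0, 1.0, 0.5)):
    """Weighted sum of the three differences (labels: float tensor of 0./1. per frame)."""
    d1 = F.binary_cross_entropy(stream_probs, labels)                 # streaming vs. annotation
    d2 = F.binary_cross_entropy(nonstream_probs, labels)              # non-streaming vs. annotation
    # The non-streaming output acts as a pseudo label; block its gradient.
    d3 = F.binary_cross_entropy(stream_probs, nonstream_probs.detach())
    return w[0] * d1 + w[1] * d2 + w[2] * d3
```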
The above method for training the universal voice activity detection model with high generalization is mainly based on joint training with streaming and non-streaming input; besides the joint training method described above, the voice activity detection model may also be obtained through other joint-training methods that combine streaming and non-streaming input.
Besides the joint training methods, a universal voice activity detection model with high generalization can also be trained in other, non-joint ways. For example, a model supporting streaming input may be selected as the voice activity detection model and trained with a supervised training method using sample audio and its labels, where the sample audio includes both streaming and non-streaming samples; models supporting streaming input include, but are not limited to, Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory (LSTM) networks. As another example, a model supporting non-streaming input may be selected as the voice activity detection model and trained with a supervised training method using sample audio and its labels, where the sample audio again includes both streaming and non-streaming samples; models supporting non-streaming input include, but are not limited to, the Transformer and LC-LSTM.
Based on the same idea, the embodiments of the present specification further provide an apparatus, a storage medium, and an electronic device corresponding to the voice activity detection method provided above.
Fig. 2 is a schematic structural diagram of a voice activity detection apparatus provided in an embodiment of the present specification, where the apparatus includes:
an obtaining module 201, configured to obtain an audio to be identified;
a feature extraction module 202, configured to extract audio features from each audio frame of the audio to be identified;
the recognition module 203 is configured to input the audio features of each audio frame into a pre-trained voice activity detection model, so as to obtain a probability that each audio frame output by the voice activity detection model is a voice segment;
the post-processing module 204 is configured to determine a classification parameter according to the identification scene type corresponding to the audio to be identified; and identifying whether each audio frame is a voice section or not according to the probability that each audio frame output by the voice activity detection model is a voice section and the classification parameters.
Optionally, the obtaining module 201 is specifically configured to obtain an audio to be identified, which is input in a streaming manner; or acquiring the audio to be identified of the non-streaming input;
the recognition module 203 is specifically configured to, when the audio to be recognized is a streaming input audio, input the audio features of each audio frame in the audio to be recognized into a pre-trained voice activity detection model in a streaming input manner; and when the audio to be recognized is the non-streaming input audio, inputting the audio features of each audio frame in the audio to be recognized into a pre-trained voice activity detection model in a non-streaming input mode.
Optionally, the feature extraction module 202 is specifically configured to extract a normalized audio feature from each audio frame of the audio to be identified according to the sampling frequency of the audio to be identified; the normalized audio features include: mixBand FBank feature.
Optionally, the post-processing module 204 is specifically configured to determine, according to an identification scene type corresponding to the audio to be identified, a probability threshold corresponding to the identification scene type;
the post-processing module 204 is specifically configured to, for each audio frame, identify the audio frame as a speech segment if the probability that the audio frame output by the speech activity detection model is a speech segment is greater than the probability threshold, and otherwise, identify the audio frame as a non-speech segment.
Optionally, the obtaining module 201 is specifically configured to receive an identification request, and obtain an audio to be identified corresponding to the identification request;
the post-processing module 204 is specifically configured to determine, according to the identification request, an identification scene type corresponding to the audio to be identified; determining a smoothing processing parameter according to the identification scene type corresponding to the audio to be identified;
the post-processing module 204 is specifically configured to, for each audio frame, identify whether the audio frame is a speech segment according to the probability that the audio frame output by the speech activity detection model is a speech segment and a predetermined probability threshold; and smoothing the recognition result of each audio frame according to the smoothing parameter and the recognition result of each audio frame in the audio to be recognized.
Optionally, the smoothing parameters comprise a minimum speech length n1 and a minimum non-speech length n2;
The post-processing module 204 is specifically configured to extract, from the audio to be recognized, consecutive audio frames recognized as speech segments and consecutive audio frames recognized as non-speech segments; if the number of consecutive audio frames recognized as speech segments is less than n1, those frames are all regarded as non-speech segments; if the number of consecutive audio frames recognized as non-speech segments is less than n2, those frames are all regarded as speech segments.
The device further comprises:
a training module 205, configured to obtain a sample audio and an annotation of the sample audio, where the annotation of the sample audio includes a speech segment and a non-speech segment in the sample audio; extracting audio features from each sample audio frame of the sample audio; inputting the audio features of each sample audio frame into a voice activity detection model to be trained in a streaming input mode to obtain a first recognition result output by the voice activity detection model to be trained; inputting the audio features of each sample audio frame into a voice activity detection model to be trained in a non-streaming input mode to obtain a second recognition result output by the voice activity detection model to be trained; determining the loss of the voice activity detection model to be trained according to the label of the sample audio, the first recognition result and the second recognition result; and adjusting the model parameters of the voice activity detection model to be trained by taking the loss minimization as a training target.
Optionally, the training module 205 is specifically configured to determine a difference between the first recognition result and the label of the sample audio as a first difference; determining a difference between the second recognition result and the label of the sample audio as a second difference; determining the difference between the first recognition result and the pseudo label as a third difference by taking the second recognition result as the pseudo label; and determining the loss of the voice activity detection model to be trained according to the first difference, the second difference and the third difference.
The present specification also provides a computer-readable storage medium storing a computer program which, when executed by a processor, is operable to perform the voice activity detection method provided above.
Based on the voice activity detection method provided above, the embodiment of the present specification further provides a schematic structural diagram of the electronic device shown in fig. 3. As shown in fig. 3, at the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, but may also include hardware required for other services. The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to realize the voice activity detection method.
Of course, besides the software implementation, this specification does not exclude other implementations, such as logic devices or combination of software and hardware, and so on, that is, the execution subject of the following processing flow is not limited to each logic unit, and may be hardware or logic devices.
In the 1990s, an improvement in a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or a software improvement (an improvement to a method flow). However, as technology has developed, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented with hardware entity modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a single PLD by programming it, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development, and the original code to be compiled must be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can easily be obtained simply by slightly logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for implementing various functions may also be regarded as structures within the hardware component. Or, the means for implementing various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the various elements may be implemented in the same one or more pieces of software and/or hardware in the practice of this description.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The description has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the description. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in computer-readable media, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, Phase-change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiment is described only briefly because it is substantially similar to the method embodiment; for relevant details, reference may be made to the corresponding description of the method embodiment.
The above description is only an example of the present disclosure, and is not intended to limit the present disclosure. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.

Claims (10)

1. A method for voice activity detection, comprising:
acquiring audio to be recognized;
extracting audio features from each audio frame of the audio to be recognized;
inputting the audio features of each audio frame into a pre-trained voice activity detection model to obtain the probability, output by the voice activity detection model, that each audio frame is a speech segment;
determining a classification parameter according to the recognition scene type corresponding to the audio to be recognized;
and recognizing whether each audio frame is a speech segment according to the probability, output by the voice activity detection model, that each audio frame is a speech segment and the classification parameter.
2. The method of claim 1, wherein acquiring the audio to be recognized specifically comprises:
acquiring audio to be recognized that is input in a streaming manner; or acquiring audio to be recognized that is input in a non-streaming manner;
and wherein inputting the audio features of each audio frame into a pre-trained voice activity detection model specifically comprises:
when the audio to be recognized is streaming-input audio, inputting the audio features of each audio frame of the audio to be recognized into the pre-trained voice activity detection model in a streaming input mode;
and when the audio to be recognized is non-streaming-input audio, inputting the audio features of each audio frame of the audio to be recognized into the pre-trained voice activity detection model in a non-streaming input mode.
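As an illustration of the two input modes in claim 2, the Python sketch below contrasts non-streaming inference (the whole utterance is visible at once) with a toy chunked streaming pass. The model interface, the chunk size, and the omission of cross-chunk state handling are all assumptions and are not part of the claim.

```python
import torch

def infer_nonstream(model, feats):
    # Non-streaming mode: the whole feature sequence (num_frames, feat_dim)
    # is passed to the model in a single call.
    return model(feats.unsqueeze(0)).squeeze(0)   # per-frame speech probabilities

def infer_stream(model, feats, chunk_frames=16):
    # Streaming mode: frames are fed in arrival order, chunk by chunk.
    # A real streaming model would also carry hidden state or a cache
    # across chunks; that bookkeeping is model-specific and omitted here.
    probs = []
    for start in range(0, feats.size(0), chunk_frames):
        piece = feats[start:start + chunk_frames]
        probs.append(model(piece.unsqueeze(0)).squeeze(0))
    return torch.cat(probs)
```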
3. The method of claim 1, wherein extracting audio features from each audio frame of the audio to be recognized specifically comprises:
extracting normalized audio features from each audio frame of the audio to be recognized according to the sampling frequency of the audio to be recognized, wherein the normalized audio features comprise a mixBand FBank feature.
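The claims do not spell out how the mixBand FBank feature is computed; as a rough stand-in, a conventional log-Mel filterbank (FBank) with per-utterance normalization can be sketched as follows. The use of librosa, the 40 Mel bands, and the 25 ms / 10 ms framing are illustrative assumptions only.

```python
import numpy as np
import librosa

def extract_normalized_fbank(path, n_mels=40, frame_ms=25, hop_ms=10):
    # Load at the file's native sampling frequency, since claim 3 ties
    # feature extraction to the sampling frequency of the audio to be recognized.
    signal, sr = librosa.load(path, sr=None)
    n_fft = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    # Log-Mel filterbank energies, one row per audio frame.
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    fbank = np.log(mel + 1e-6).T
    # Per-utterance mean/variance normalization as a stand-in for the
    # normalization the claim requires.
    return (fbank - fbank.mean(axis=0)) / (fbank.std(axis=0) + 1e-6)
```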
4. The method of claim 1, wherein determining the classification parameter according to the recognition scene type corresponding to the audio to be recognized specifically comprises:
determining a probability threshold corresponding to the recognition scene type according to the recognition scene type corresponding to the audio to be recognized;
and wherein recognizing whether each audio frame is a speech segment according to the probability, output by the voice activity detection model, that each audio frame is a speech segment and the classification parameter specifically comprises:
for each audio frame, if the probability, output by the voice activity detection model, that the audio frame is a speech segment is greater than the probability threshold, recognizing the audio frame as a speech segment; otherwise, recognizing the audio frame as a non-speech segment.
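A minimal sketch of the scene-dependent thresholding in claim 4; the scene names and threshold values below are invented for illustration, since the claim only requires that the threshold be looked up from the recognition scene type.

```python
from typing import Dict, List

# Hypothetical mapping from recognition scene type to probability threshold.
SCENE_THRESHOLDS: Dict[str, float] = {"phone_call": 0.6, "meeting_record": 0.4}

def classify_frames(frame_probs: List[float], scene: str) -> List[bool]:
    threshold = SCENE_THRESHOLDS[scene]
    # A frame is taken as a speech segment exactly when the model's
    # probability for it exceeds the scene-specific threshold.
    return [p > threshold for p in frame_probs]
```

A stricter threshold suppresses false detections in noisier scenes, while a looser one avoids clipping quiet speech; the claims leave the concrete values to the deployment.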
5. The method of claim 1, wherein acquiring the audio to be recognized specifically comprises:
receiving a recognition request, and acquiring the audio to be recognized corresponding to the recognition request;
wherein determining a classification parameter according to the recognition scene type corresponding to the audio to be recognized specifically comprises:
determining the recognition scene type corresponding to the audio to be recognized according to the recognition request;
and determining a smoothing parameter according to the recognition scene type corresponding to the audio to be recognized;
and wherein recognizing whether each audio frame is a speech segment according to the probability, output by the voice activity detection model, that each audio frame is a speech segment and the classification parameter specifically comprises:
for each audio frame, recognizing whether the audio frame is a speech segment according to the probability, output by the voice activity detection model, that the audio frame is a speech segment and a predetermined probability threshold;
and smoothing the recognition result of each audio frame according to the smoothing parameter and the recognition results of the audio frames in the audio to be recognized.
6. The method of claim 5, wherein the smoothing parameters comprise a minimum speech length n1 and a minimum non-speech length n2;
and wherein smoothing the recognition result of each audio frame according to the smoothing parameters and the recognition results of the audio frames in the audio to be recognized specifically comprises:
extracting, from the audio to be recognized, consecutive audio frames recognized as speech segments and consecutive audio frames recognized as non-speech segments;
if the number of consecutive audio frames recognized as speech segments is less than n1, regarding all of those consecutive audio frames as non-speech segments;
and if the number of consecutive audio frames recognized as non-speech segments is less than n2, regarding all of those consecutive audio frames as speech segments.
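A run-length interpretation of the smoothing in claim 6, sketched in Python. Treating the rule as a single left-to-right pass is an assumption; the claim does not fix the order in which short speech runs and short non-speech runs are flipped.

```python
from itertools import groupby
from typing import List

def smooth(frames: List[bool], min_speech: int, min_nonspeech: int) -> List[bool]:
    # frames: per-frame decisions (True = speech segment);
    # min_speech / min_nonspeech correspond to the scene-specific n1 / n2 of claim 6.
    smoothed: List[bool] = []
    for is_speech, run in groupby(frames):
        run = list(run)
        if is_speech and len(run) < min_speech:
            smoothed.extend([False] * len(run))   # too-short speech run -> non-speech
        elif not is_speech and len(run) < min_nonspeech:
            smoothed.extend([True] * len(run))    # too-short gap -> speech
        else:
            smoothed.extend(run)
    return smoothed
```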
7. The method of claim 1, wherein pre-training the voice activity detection model specifically comprises:
acquiring sample audio and a label of the sample audio, wherein the label of the sample audio indicates the speech segments and the non-speech segments in the sample audio;
extracting audio features from each sample audio frame of the sample audio;
inputting the audio features of each sample audio frame into a voice activity detection model to be trained in a streaming input mode to obtain a first recognition result output by the voice activity detection model to be trained; and inputting the audio features of each sample audio frame into the voice activity detection model to be trained in a non-streaming input mode to obtain a second recognition result output by the voice activity detection model to be trained;
determining the loss of the voice activity detection model to be trained according to the label of the sample audio, the first recognition result and the second recognition result;
and adjusting the model parameters of the voice activity detection model to be trained by taking the loss minimization as a training target.
8. The method of claim 7, wherein determining a loss of the voice activity detection model to be trained based on the label of the sample audio, the first recognition result, and the second recognition result comprises:
determining a difference between the first recognition result and the label of the sample audio as a first difference;
determining a difference between the second recognition result and the label of the sample audio as a second difference;
taking the second recognition result as a pseudo label, and determining a difference between the first recognition result and the pseudo label as a third difference;
and determining the loss of the voice activity detection model to be trained according to the first difference, the second difference and the third difference.
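A hedged sketch of the training objective in claims 7 and 8, using per-frame binary cross-entropy for each of the three differences and summing them with equal weights. Both the choice of cross-entropy and the equal weighting are assumptions, since the claims only require that the loss be determined from the three differences; detaching the non-streaming output when it serves as the pseudo label is likewise an implementation assumption.

```python
import torch
import torch.nn.functional as F

def vad_training_loss(p_stream: torch.Tensor,
                      p_nonstream: torch.Tensor,
                      labels: torch.Tensor) -> torch.Tensor:
    # p_stream / p_nonstream: per-frame speech probabilities from the streaming
    # and non-streaming forward passes; labels: 0/1 per-frame ground truth (float).
    first_diff = F.binary_cross_entropy(p_stream, labels)        # streaming vs. label
    second_diff = F.binary_cross_entropy(p_nonstream, labels)    # non-streaming vs. label
    # Third difference: streaming output vs. the non-streaming output
    # used as a pseudo label (gradient blocked through the pseudo label).
    third_diff = F.binary_cross_entropy(p_stream, p_nonstream.detach())
    return first_diff + second_diff + third_diff
```

Minimizing this combined loss, as claim 7 requires, adjusts the model parameters so that the streaming-mode output is pulled toward both the ground-truth labels and the non-streaming output.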
9. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 8.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1-8 when executing the program.
CN202210771781.9A 2022-06-30 2022-06-30 Voice activity detection method, storage medium and electronic equipment Pending CN115171735A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210771781.9A CN115171735A (en) 2022-06-30 2022-06-30 Voice activity detection method, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210771781.9A CN115171735A (en) 2022-06-30 2022-06-30 Voice activity detection method, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN115171735A true CN115171735A (en) 2022-10-11

Family

ID=83489755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210771781.9A Pending CN115171735A (en) 2022-06-30 2022-06-30 Voice activity detection method, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115171735A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115472152A (en) * 2022-11-01 2022-12-13 北京探境科技有限公司 Voice endpoint detection method and device, computer equipment and readable storage medium
CN115472152B (en) * 2022-11-01 2023-03-03 北京探境科技有限公司 Voice endpoint detection method and device, computer equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN109065044B (en) Awakening word recognition method and device, electronic equipment and computer readable storage medium
CN108986822A (en) Audio recognition method, device, electronic equipment and non-transient computer storage medium
CN115563366A (en) Model training and data analysis method, device, storage medium and equipment
CN115600157A (en) Data processing method and device, storage medium and electronic equipment
CN115171735A (en) Voice activity detection method, storage medium and electronic equipment
CN115545572B (en) Method, device, equipment and storage medium for business wind control
CN114022955A (en) Action recognition method and device
CN116186330B (en) Video deduplication method and device based on multi-mode learning
US20160180155A1 (en) Electronic device and method for processing voice in video
CN115620706B (en) Model training method, device, equipment and storage medium
CN110532755B (en) Computer-implemented risk identification method and device
CN111292725A (en) Voice decoding method and device
CN112397073B (en) Audio data processing method and device
CN115600090A (en) Ownership verification method and device for model, storage medium and electronic equipment
CN115019781A (en) Conversation service execution method, device, storage medium and electronic equipment
CN109325127B (en) Risk identification method and device
CN115952271B (en) Method and device for generating dialogue information, storage medium and electronic equipment
CN117011924B (en) Method and system for estimating number of speakers based on voice and image
CN116434787B (en) Voice emotion recognition method and device, storage medium and electronic equipment
CN113223501B (en) Method and device for executing voice interaction service
CN114495982A (en) Risk detection method, device and equipment
CN116844553A (en) Data processing method, device and equipment
CN117316189A (en) Business execution method and device based on voice emotion recognition
CN116416999A (en) Training method of speaker segmentation model, speaker segmentation method and device
CN117743824A (en) Model training and service execution method and device, storage medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination