CN113257284B - Voice activity detection model training method, voice activity detection method and related device - Google Patents


Info

Publication number
CN113257284B
Authority
CN
China
Prior art keywords
voice activity
activity detection
training
audio
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110641762.XA
Other languages
Chinese (zh)
Other versions
CN113257284A (en)
Inventor
郝洋
丁文彪
卢鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202110641762.XA
Publication of CN113257284A
Application granted
Publication of CN113257284B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training

Abstract

Embodiments of the invention provide a voice activity detection model training method, a voice activity detection method and a related device. The training method includes acquiring a voice activity detection training data set that contains voice activity detection training audio frame features and reference voice activity categories, where the training audio frame features are extracted from the voice activity detection training audio frames by a trained audio feature extraction model; obtaining, with the voice activity detection model, the training voice activity category of each training audio frame from its features; and optimizing the voice activity detection model until the trained voice activity detection model is obtained. Training the feature extraction model and the detection model separately reduces overfitting of the trained voice activity detection model and improves its robustness.

Description

Voice activity detection model training method, voice activity detection method and related device
Technical Field
The embodiment of the invention relates to the field of voice detection, in particular to a voice activity detection model training method, a voice activity detection method and a related device.
Background
Voice activity detection is widely used in the field of speech recognition to identify and eliminate long periods of silence from a stream of sound signals. The results of voice activity detection can further be used for speaking-duration statistics, and the speech segments that remain after removing silence can be passed to speech recognition to obtain text information, and so on.
The quality of voice activity detection therefore has a large influence both on tasks that operate on the acoustic signal and on downstream tasks that depend on the recognized text.
One important difficulty in voice activity detection is how to distinguish the sound signal of a human utterance (the human voice) from interfering sound signals in the environment that are not part of the human utterance (the noise).
Noise in real scenes comes in many categories, such as coughing, music and object collisions that are common in living environments, and this diversity increases the difficulty of voice activity detection. The results of prior-art voice activity detection methods are also far from ideal.
How to improve the effect of voice activity detection has therefore become a technical problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
The technical problem solved by the embodiment of the invention is how to improve the voice activity detection effect.
In order to solve the above problem, an embodiment of the present invention provides a method for training a voice activity detection model, including:
acquiring a voice activity detection training data set, wherein the voice activity detection training data set comprises voice activity detection training audio frame features of voice activity detection training audio frames of a voice activity detection training audio and reference voice activity categories of the voice activity detection training audio frames, the voice activity detection training audio frame features are acquired, based on the voice activity detection training audio frames, through a trained audio feature extraction model, and the number of categories in a feature extraction training data set used for training the audio feature extraction model is larger than the number of categories of the reference voice activity categories;
acquiring training voice activity types of the voice activity detection training audio frames according to the voice activity detection training audio frame characteristics by using the voice activity detection model;
and according to the training voice activity type and the reference voice activity type, acquiring voice activity detection loss of the voice activity detection model, and according to the voice activity detection loss, optimizing the voice activity detection model until the voice activity detection loss meets a preset voice activity detection loss threshold value, so as to obtain the trained voice activity detection model.
The embodiment of the invention also provides a voice activity detection method, which comprises the following steps:
acquiring the characteristics of the voice activity audio frame to be detected corresponding to the voice activity audio frame to be detected;
and obtaining, by using the voice activity detection model obtained by training with the above voice activity detection model training method, the voice activity detection category corresponding to the voice activity audio frame to be detected, based on the features of the voice activity audio frame to be detected.
The embodiment of the present invention further provides a training device for a voice activity detection model, including:
a voice activity detection training data set acquisition module adapted to acquire a voice activity detection training data set, the voice activity detection training data set including voice activity detection training audio frame features of a voice activity detection training audio frame and a reference voice activity category of the voice activity detection training audio frame, wherein the voice activity detection training audio frame features are acquired based on the voice activity detection training audio frame through a trained audio feature extraction model;
the training voice activity type acquisition module is suitable for acquiring the training voice activity type of the voice activity detection training audio frame according to the voice activity detection training audio frame characteristics by using the voice activity detection model;
and the voice activity detection model optimization module is suitable for acquiring the voice activity detection loss of the voice activity detection model according to the training voice activity type and the reference voice activity type, and optimizing the voice activity detection model according to the voice activity detection loss until the voice activity detection loss meets a preset voice activity detection loss threshold value to obtain the trained voice activity detection model.
An embodiment of the present invention further provides a voice activity detection apparatus, including:
the voice activity audio frame feature acquisition module is suitable for acquiring voice activity audio frame features to be detected corresponding to the voice activity audio frames to be detected;
and the voice activity detection type acquisition module is suitable for utilizing the voice activity detection model obtained by the training method of the voice activity detection model to obtain the voice activity detection type corresponding to the voice activity audio frame to be detected based on the characteristics of the voice activity audio frame to be detected.
The embodiment of the invention also provides a storage medium, wherein the storage medium stores a program suitable for training the voice activity detection model so as to realize the voice activity detection model training method, or the storage medium stores a program suitable for detecting the voice activity so as to realize the voice activity detection method.
The embodiment of the invention also provides electronic equipment, which comprises at least one memory and at least one processor; the memory stores a program that the processor invokes to perform the voice activity detection model training method or the voice activity detection method.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following advantages:
the voice activity detection model training method provided by the embodiment of the invention comprises the steps of acquiring a voice activity detection training data set, wherein the voice activity detection training data set comprises voice activity detection training audio frame characteristics of a voice activity detection training audio frame and reference voice activity categories of the voice activity detection training audio frame, the voice activity detection training audio frame characteristics pass through a trained audio feature extraction model, the voice activity detection training audio frame is acquired based on the voice activity detection training audio frame, and the category number in the feature extraction training data set used for training the audio feature extraction model is larger than the category number of the reference voice activity categories; and then, acquiring a training voice activity type of the voice activity detection training audio frame according to the voice activity detection training audio frame characteristic by using the voice activity detection model, finally acquiring the voice activity detection loss of the voice activity detection model according to the training voice activity type and the reference voice activity type, and optimizing the voice activity detection model according to the voice activity detection loss until the voice activity detection loss meets a preset voice activity detection loss threshold value to obtain the trained voice activity detection model.
It can be seen that, with the training method for a voice activity detection model provided in the embodiments of the present invention, before training the voice activity detection model, the trained audio feature extraction model is first used to extract the voice activity detection training audio frame feature of the voice activity detection training audio frame, and then the voice activity detection training audio frame feature is used to obtain the training voice activity class, and the voice activity detection model is trained by using the training voice activity class and the reference voice activity class of the voice activity detection training audio frame to obtain the trained voice activity detection model, and the audio feature extraction model and the voice activity detection model are trained respectively, so as to reduce the probability of overfitting of the trained voice activity detection model, and the number of classes in the feature extraction training data set used for training the audio feature extraction model is greater than the number of classes of the reference voice activity class, the voice activity detection training audio frame features extracted by the audio feature extraction model can be more accurate, the robustness (robustness and durability) of the trained voice activity detection model is improved, the voice activity detection model can be suitable for voice activity detection of various different types of audio, and the accuracy of voice activity detection results obtained by the trained voice activity detection model is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a schematic flow chart illustrating a method for training a voice activity detection model according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart illustrating the training steps of an audio feature extraction model in the method for training a voice activity detection model according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart illustrating a process of obtaining a feature extraction training data set in the training method for a voice activity detection model according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a voice activity detection method according to an embodiment of the present invention;
FIG. 5 is a diagram of a voice activity detection training apparatus according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating an apparatus for detecting voice activity according to an embodiment of the present invention;
fig. 7 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
As can be seen from the background, existing voice activity detection methods are not sufficiently accurate.
In order to solve the above problem, a voice activity detection model training method according to an embodiment of the present invention includes acquiring a voice activity detection training data set, where the voice activity detection training data set comprises voice activity detection training audio frame features of voice activity detection training audio frames and reference voice activity categories of the voice activity detection training audio frames, the voice activity detection training audio frame features are acquired, based on the voice activity detection training audio frames, through a trained audio feature extraction model, and the number of categories in the feature extraction training data set used for training the audio feature extraction model is greater than the number of categories of the reference voice activity categories; then, the training voice activity categories of the voice activity detection training audio frames are obtained with the voice activity detection model according to the voice activity detection training audio frame features; finally, the voice activity detection loss of the voice activity detection model is obtained from the training voice activity categories and the reference voice activity categories, and the voice activity detection model is optimized according to the voice activity detection loss until the voice activity detection loss meets a predetermined voice activity detection loss threshold, giving the trained voice activity detection model.
It can be seen that, with the voice activity detection model training method provided by the embodiments of the present invention, before the voice activity detection model is trained, the trained audio feature extraction model is first used to extract the voice activity detection training audio frame features of the voice activity detection training audio frames; those features are then used to obtain the training voice activity categories, and the voice activity detection model is trained with the training voice activity categories and the reference voice activity categories of the voice activity detection training audio frames to obtain the trained voice activity detection model. Because the audio feature extraction model and the voice activity detection model are trained separately, the probability of overfitting of the trained voice activity detection model is reduced; and because the number of categories in the feature extraction training data set used for training the audio feature extraction model is greater than the number of categories of the reference voice activity categories, the voice activity detection training audio frame features extracted by the audio feature extraction model are more accurate. This improves the robustness of the trained voice activity detection model, makes it applicable to voice activity detection on many different kinds of audio, and thereby improves the accuracy of the voice activity detection results obtained with the trained voice activity detection model.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a method for training a voice activity detection model according to an embodiment of the present invention.
The embodiment of the invention provides a voice activity detection model training method, which comprises the following steps:
Step S11: training an audio feature extraction model.
In the voice activity detection model training method provided by the embodiment of the invention, the voice activity detection training data set required for training the voice activity detection model is obtained by using the audio feature extraction model, so the audio feature extraction model needs to be trained first.
It is easily understood that the audio feature extraction model may extract features from the input audio so that the extracted features can subsequently be used for voice activity detection; specifically, the audio feature extraction model extracts the voice activity detection training audio frame features of the individual audio frames of the voice activity detection training audio (i.e., the voice activity detection training audio frames).
In one embodiment, the training of the audio feature extraction model can be completed at any time before the voice activity detection model is trained, and the trained model can then be used directly during voice activity detection model training; in other embodiments, the audio feature extraction model may instead be trained only when voice activity detection model training is required.
For convenience of description, the invention is explained by firstly training an audio feature extraction model when the training of a voice activity detection model is required:
the specific training method of the audio feature extraction model can be selected according to needs. In an embodiment, please refer to fig. 2, and fig. 2 is a flowchart illustrating a step of training an audio feature extraction model in a method for training a voice activity detection model according to an embodiment of the present invention.
The step S11 of training the audio feature extraction model may include:
step S111: and acquiring a feature extraction training data set.
The feature extraction training data set includes feature extraction audio frames and reference natural categories for the feature extraction audio frames.
The reference natural category may be a specific category of the feature extraction audio frame, such as a specific category of sounds like whistling, mouse clicks, piano sounds, human voices, and the like.
In an embodiment, please refer to fig. 3, and fig. 3 is a schematic flow chart illustrating a process of obtaining a feature extraction training data set in a training method for a voice activity detection model according to an embodiment of the present invention.
The step S111: obtaining the feature extraction training data set may include:
step S1111: and obtaining the feature extraction audio frequency spectrum corresponding to each audio frame of the feature extraction audio by using the window function, the set time window value and the set frame shift.
In particular, the window function may be selected as desired. The window function is used for processing, so that the leakage of frequency spectrum can be reduced.
In one embodiment, the window function may be a Hanning window, a Hamming window, or the like. When a Hanning window is selected as the window function, the obtained feature extraction audio spectrum has both good frequency resolution and little spectral leakage.
The set time window value is the length of each audio frame segment and can likewise be chosen as required. In one embodiment, the set time window value may be 25 ms to 50 ms, such as 25 ms, 30 ms or 40 ms.
The set frame shift can also be chosen as required. The frame shift determines how far consecutive frames are offset, and hence how much neighbouring frames overlap; a suitable frame shift preserves the continuity of the signal across frames. In one embodiment, the set frame shift may be 5 ms to 20 ms, such as 10 ms.
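As an illustration of the framing described above (not taken from the patent itself), the following Python sketch splits an audio signal into frames with a Hanning window, a 25 ms time window and a 10 ms frame shift, and computes the per-frame magnitude spectrum; the 16 kHz sample rate and all function names are assumptions.

```python
import numpy as np

def frame_spectra(signal, sample_rate=16000, win_ms=25, hop_ms=10):
    """Split audio into overlapping frames and return per-frame magnitude spectra.

    Illustrative values: 25 ms time window, 10 ms frame shift, Hanning window.
    Assumes len(signal) >= one window length.
    """
    win_len = int(sample_rate * win_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hanning(win_len)                      # window function reduces spectral leakage
    n_frames = 1 + (len(signal) - win_len) // hop_len
    spectra = []
    for i in range(n_frames):
        frame = signal[i * hop_len : i * hop_len + win_len] * window
        spectra.append(np.abs(np.fft.rfft(frame)))    # magnitude spectrum of this frame
    return np.stack(spectra)                          # shape: (n_frames, win_len // 2 + 1)
```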
Step S1112: and mapping the feature extraction audio frequency spectrum to a filter bank to calculate a feature extraction audio frequency sound spectrum corresponding to the feature extraction audio frequency spectrum.
Wherein the filter bank can be selected as desired.
In one embodiment, the filter bank comprises a mel filter bank, and the feature extracted audio spectrum comprises a feature extracted audio mel spectrum.
Specifically, the relation between the Mel frequency and the actual frequency is: F_mel(f) = 1125 ln(1 + f/700), where F_mel is the Mel frequency of the audio spectrum and f is the actual frequency of the audio spectrum.
Because the sound level heard by human ears is not in linear direct proportion to the frequency of the sound, the obtained feature extraction audio spectrum can better conform to the auditory feature of the human ears by adopting the Mel filter bank.
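The following sketch illustrates the Hz-to-Mel mapping given above and a triangular Mel filter bank onto which a frame's magnitude spectrum can be mapped; the 40 filters, the number of FFT bins and the sample rate are illustrative assumptions, not values fixed by the description.

```python
import numpy as np

def hz_to_mel(f):
    return 1125.0 * np.log(1.0 + f / 700.0)        # mapping used in the description

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1125.0) - 1.0)      # inverse of the mapping above

def mel_filterbank(n_filters=40, n_fft_bins=201, sample_rate=16000):
    """Triangular Mel filters; 40 filters and 201 FFT bins are illustrative choices."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft_bins - 1) * 2 * hz_points / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft_bins))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

# Mapping a frame's magnitude spectrum onto the filter bank gives the Mel sound spectrum:
# mel_spectrum = mel_filterbank() @ magnitude_spectrum
```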
Step S1113: and performing cepstrum analysis on the feature extraction audio frequency spectrum to obtain a frequency cepstrum coefficient corresponding to the feature extraction audio frequency spectrum.
In particular, when the feature extracted audio spectrum comprises a feature extracted audio mel-frequency spectrum, the frequency cepstral coefficients may comprise mel-frequency cepstral coefficients. By performing cepstrum analysis, the obtained frequency cepstrum coefficient corresponding to the feature extraction audio frequency spectrum can well reflect the features of the feature extraction audio frequency spectrum.
Step S1114: and utilizing the frequency cepstrum coefficient of the feature extraction audio data to obtain the feature extraction audio frame.
The frequency cepstrum coefficient corresponding to the feature extraction audio frequency spectrum corresponding to each obtained audio frame can be directly used as the feature extraction audio frame.
In another specific embodiment, the frequency cepstrum coefficients corresponding to the feature extraction audio frequency spectrums corresponding to the obtained audio frames may also be combined to obtain the feature extraction audio frames.
Therefore, the step of obtaining the feature-extracted audio frame by using the frequency cepstrum coefficients of the feature-extracted audio data may include framing at least two consecutive groups of the frequency cepstrum coefficients of the feature-extracted audio data to obtain combined frequency cepstrum coefficients as the feature-extracted audio frame.
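A hedged sketch of the cepstral analysis and of combining consecutive groups of coefficients into one feature extraction audio frame: the DCT-based cepstrum, the 13 coefficients per group and the group size of 4 are assumptions rather than values fixed by the description.

```python
import numpy as np
from scipy.fftpack import dct

def mel_cepstral_coeffs(mel_spectrum, n_coeffs=13):
    """Cepstral analysis of a Mel sound spectrum: logarithm followed by a DCT (MFCCs)."""
    log_mel = np.log(mel_spectrum + 1e-10)            # small offset avoids log(0)
    return dct(log_mel, type=2, norm='ortho')[:n_coeffs]

def stack_frames(coeff_sequence, group_size=4):
    """Combine `group_size` consecutive groups of coefficients into one audio frame.

    `coeff_sequence` is a sequence of per-frame coefficient vectors; at least
    `group_size` of them are assumed to be available.
    """
    frames = []
    for i in range(0, len(coeff_sequence) - group_size + 1, group_size):
        frames.append(np.concatenate(coeff_sequence[i:i + group_size]))
    return np.stack(frames)
```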
Step S1115: and marking the feature extraction audio frame by using the reference natural category of the feature extraction audio obtained in advance to obtain the feature extraction audio frame and the reference natural category of the feature extraction audio frame, namely the feature extraction training data set.
After the feature extraction audio has been processed through this series of steps, the obtained feature extraction audio frames greatly reduce the data volume while preserving the characteristics of the feature extraction audio well, which facilitates subsequent processing.
Step S112: and acquiring the feature extraction audio frame features of the feature extraction audio frame by using the feature extraction layer of the audio feature extraction model.
In order to train the audio feature extraction model, firstly, a feature extraction layer of the audio feature extraction model is used for acquiring features of a feature extraction audio frame.
The audio feature extraction model comprises a feature extraction layer and a natural category layer: the feature extraction layer is adapted to extract features from the audio, and the natural category layer derives the training natural category of the input audio from the extracted audio frame features; the categories of the training natural categories can, of course, be consistent with the categories of the reference natural categories of the feature extraction audio.
Specifically, the structure of the audio feature extraction model may be selected as needed. In a specific embodiment, the audio feature extraction model may adopt a VGGish structure, which comprises several convolutional layers, pooling layers and fully connected layers together with a Softmax classifier. The convolutional, pooling and fully connected layers serve as the feature extraction layer: their input is an audio frame, and their output is a 4096-dimensional vector that can be used as the feature vector of the input audio frame, i.e. the feature extraction audio frame feature. The Softmax classifier serves as the natural category layer: its input is the feature vector output by the feature extraction layer, and its output is a probability value or a concrete classification result over multiple classes (for example 1000 classes, adjustable as needed), which can be used as the category of the input audio frame.
Of course, the specific parameters of the audio feature extraction model structure can be adjusted according to the needs.
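The following PyTorch sketch only illustrates the split into a feature extraction layer and a natural category layer described above; the 4096-dimensional feature vector and the 1000-class output follow the example values in the text, while the layer counts, channel sizes and pooling choices are assumptions and do not reproduce the exact VGGish configuration.

```python
import torch
import torch.nn as nn

class AudioFeatureExtractionModel(nn.Module):
    """Feature extraction layer (conv + pooling + FC) followed by a natural category layer."""

    def __init__(self, n_natural_categories=1000, feature_dim=4096):
        super().__init__()
        self.feature_extraction_layer = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
            nn.Linear(256, feature_dim), nn.ReLU(),       # 4096-dimensional feature vector
        )
        self.natural_category_layer = nn.Linear(feature_dim, n_natural_categories)

    def forward(self, audio_frame):
        features = self.feature_extraction_layer(audio_frame)   # feature extraction audio frame feature
        logits = self.natural_category_layer(features)          # scores per natural category (Softmax applied in the loss)
        return features, logits
```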
Step S113: and extracting the characteristics of the audio frame according to the characteristics, and acquiring the training natural category of the characteristic extraction audio frame by using the natural category layer of the audio characteristic extraction model.
As described above, the natural category layer derives the training natural category of the input audio from the audio frame features extracted by the feature extraction layer.
Step S114: and acquiring the natural category loss of the audio feature extraction model according to the training natural category and the reference natural category.
The natural class loss may reflect a degree of coincidence of the training natural class and the reference natural class.
Step S115: judging whether the natural category loss meets a preset natural category loss threshold value, if not, executing a step S116; if so, step S117 is performed.
When the natural class loss satisfies a predetermined natural class loss threshold, it may be considered that the accuracy of the audio feature extraction model reaches a certain degree, and may satisfy the requirement, then step S117 is performed.
When the natural class loss does not satisfy the predetermined natural class loss threshold, it may be determined that the accuracy of the audio feature extraction model does not satisfy the requirement, and the optimization needs to be continued, then step S116 is performed.
Step S116: optimizing the audio feature extraction model according to the natural category loss; then, step S112 is performed.
And further optimizing the audio feature extraction model by taking the natural category loss as a reference, and adjusting parameters of the audio feature extraction model.
When optimizing the audio feature extraction model, different types of loss functions can be selected as required. In a specific embodiment, the natural category loss is a cross-entropy loss function, which gives a good optimization effect.
The learning rate used when optimizing the audio feature extraction model can also be selected as required. In a specific implementation, the learning rate is less than 0.001, which avoids excessively large update steps and gives a good optimization effect.
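A compressed sketch of the optimization loop for the audio feature extraction model, using the cross-entropy loss and a learning rate below 0.001 mentioned above; the Adam optimizer, the placeholder data batches and the threshold value of 0.1 are assumptions.

```python
import torch
import torch.nn as nn

model = AudioFeatureExtractionModel()                       # sketch from the previous block
criterion = nn.CrossEntropyLoss()                           # natural category loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning rate below 0.001
natural_loss_threshold = 0.1                                # predetermined natural category loss threshold (assumed value)

# Placeholder batches standing in for the feature extraction training data set.
feature_extraction_loader = [(torch.randn(8, 1, 4, 13), torch.randint(0, 1000, (8,)))]

for frames, reference_natural_categories in feature_extraction_loader:
    _, logits = model(frames)
    loss = criterion(logits, reference_natural_categories)  # natural category loss for this batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() <= natural_loss_threshold:               # loss meets the threshold: training is done
        break
```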
Step S117: and obtaining the trained audio feature extraction model.
The audio feature extraction model obtained by training in this way can extract the audio features in the audio well. Downstream tasks can then use the extracted audio features for voice activity detection, which helps avoid overfitting of the models used by those downstream tasks.
Step S12: a voice activity detection training data set is obtained.
The voice activity detection training data set comprises voice activity detection training audio frame characteristics of a voice activity detection training audio frame and reference voice activity categories of the voice activity detection training audio frame, wherein the voice activity detection training audio frame characteristics are obtained based on the voice activity detection training audio frame through an audio feature extraction model after training is completed.
The voice activity detection training data set may be obtained by processing a voice activity detection training audio, and the audio in the voice activity detection training data set may be different from the audio in the feature extraction training data set.
And based on the voice activity detection training audio frame, acquiring the voice activity detection training audio frame characteristics through a trained audio characteristic extraction model.
Specifically, in an embodiment, the specific manner of the voice activity detection training audio processing may refer to the foregoing description of fig. 3, that is, the step S12: the step of obtaining a voice activity detection training data set comprises:
firstly: and obtaining the voice activity detection training audio frequency spectrum corresponding to each audio frequency frame of the voice activity detection training audio by using the window function, the set time window value and the set frame shift.
In particular, the window function may be selected as desired. In one embodiment, the window function may be a hanning window, a hamming window, or the like.
Of course, the setting window value can also be set according to the requirement. In one embodiment, the set time window value may be 25ms to 50ms, such as 25ms, 30ms, or 40 ms.
Of course, the setting frame shift can also be set as required. In one embodiment, the set frame shift may be 5ms to 20ms, such as 10 ms.
Then, the voice activity detection training audio spectrum is mapped onto a filter bank to calculate the voice activity detection training audio sound spectrum corresponding to the voice activity detection training audio spectrum.
Wherein the filter bank can be selected as desired.
In one embodiment, the filter bank comprises a mel filter bank, the voice activity detection training audio spectrum comprises a voice activity detection training audio mel spectrum, and the frequency cepstral coefficients comprise mel frequency cepstral coefficients.
Secondly, the method comprises the following steps: and carrying out cepstrum analysis on the voice activity detection training audio spectrum to obtain a frequency cepstrum coefficient corresponding to the voice activity detection training audio spectrum.
By performing cepstrum analysis, the obtained frequency cepstrum coefficient corresponding to the voice activity detection training audio frequency spectrum can well reflect the characteristics of the voice activity detection training audio frequency spectrum.
And thirdly: and using the frequency cepstrum coefficient of the voice activity detection training audio data to obtain the voice activity detection training audio frame.
The obtained frequency cepstrum coefficient corresponding to the voice activity detection training audio frequency spectrum may be directly used as the voice activity detection training audio frame.
In another specific embodiment, the obtained frequency cepstrum coefficients corresponding to the voice activity detection training audio frequency spectrum may also be combined to obtain the voice activity detection training audio frame.
Therefore, the step of obtaining the voice activity detection training audio frame by using the frequency cepstrum coefficients of the voice activity detection training audio data may include framing at least two consecutive groups of the frequency cepstrum coefficients of the voice activity detection training audio data to obtain combined frequency cepstrum coefficients as the voice activity detection training audio frame.
Then, the voice activity detection training audio frame is marked by using the reference voice activity category of the voice activity detection training audio obtained in advance, so that the voice activity detection training audio frame and the reference voice activity category of the voice activity detection training audio frame are obtained.
The voice activity detection training audio is processed through a series of steps, and the obtained voice activity detection training audio frame can greatly reduce the data volume and well keep the characteristics of the voice activity detection training audio, so that subsequent characteristic extraction is facilitated.
The category of the reference voice activity category may be selected as desired.
Wherein the number of categories of the reference natural category may be greater than the number of categories of the reference voice activity category.
Since the data set labeled with the reference natural category (e.g., specific kind of sounds such as whistling, mouse clicks, piano sounds, etc.) is large in scale, the category of audio is also relatively detailed; the voice activity detection category data set marked with voice activity detection categories such as noise category and non-noise category is small in size, and the cost of self-marking is high. Therefore, the audio feature extraction model can be trained by using the data set marked with the reference natural category, and the voice activity detection model can be trained by using only the data set of the voice activity detection category, so that the cost can be reduced, and the training accuracy can be improved.
For example, the reference voice activity category may include at least a noise category and a non-noise category, since distinguishing between the noise category and the non-noise category may facilitate subsequent denoising.
Of course, the non-noise category may further include a human voice category and a mute category, where the sound intensity of the mute category is lower than a preset mute judgment threshold. Dividing the non-noise category into a human voice category and a mute category makes it convenient to remove the mute parts and extract the human voice parts for subsequent speech recognition and other processing.
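For illustration only, a minimal sketch of how a raw frame could be assigned one of the reference voice activity categories: the mute check follows the intensity-below-threshold rule above, while the RMS intensity measure, the threshold value and the externally supplied noise label are assumptions (how noise labels are produced is not specified here).

```python
import numpy as np

MUTE_JUDGMENT_THRESHOLD = 0.01   # preset mute judgment threshold (assumed value and scale)

def reference_voice_activity_category(frame_samples, is_noise):
    """Assign a raw audio frame to one of the reference categories: noise / human voice / mute."""
    intensity = np.sqrt(np.mean(np.square(frame_samples)))   # RMS energy as an assumed stand-in for sound intensity
    if intensity < MUTE_JUDGMENT_THRESHOLD:
        return "mute"
    return "noise" if is_noise else "human_voice"
```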
And finally: and extracting the voice activity detection training audio frame characteristics of the voice activity detection training audio frame by using the audio characteristic extraction model to obtain a voice activity detection training data set.
As described above, the audio feature extraction model trained in step S11 has been obtained, so the feature extraction layer of the trained audio feature extraction model can be used to extract the voice activity detection training audio frame features, which represent the information of the voice activity detection training audio frames well. The voice activity detection training audio frame features and the reference voice activity categories of the voice activity detection training audio frames together constitute the voice activity detection training data set.
And step S13, acquiring the training voice activity type of the voice activity detection training audio frame according to the voice activity detection training audio frame characteristics by using the voice activity detection model.
The voice activity detection model is suitable for obtaining the corresponding voice activity detection category according to the input audio characteristics.
It will be appreciated that the categories of the training voice activity categories may be consistent with the categories of the reference voice activity categories.
And step S14, acquiring the voice activity detection loss of the voice activity detection model according to the training voice activity type and the reference voice activity type.
After the training voice activity category is obtained, the difference between the training voice activity category and the reference voice activity category, i.e. the voice activity detection loss, can be obtained, and it is easy to understand that the voice activity detection loss can reflect the degree of coincidence between the training voice activity category and the reference voice activity category.
Step S15, judging whether the voice activity detection loss meets the preset voice activity detection loss threshold value, if yes, executing step S17; if not, step S16 is performed.
When the voice activity detection loss satisfies a predetermined voice activity detection loss threshold, it may be considered that the accuracy of the voice activity detection model reaches a certain degree, which may satisfy requirements.
When the voice activity detection loss does not satisfy the predetermined voice activity detection loss threshold, it may be considered that the accuracy of the voice activity detection model has not yet satisfied the requirement, and the optimization needs to be continued, and step S16 is executed.
Step S16: the voice activity detection model is optimized according to the voice activity detection loss, and then step S13 is performed.
When optimizing the voice activity detection model, different types of loss functions can be selected as required. In a specific embodiment, the voice activity detection loss is a cross-entropy loss function, which gives a good optimization effect.
The learning rate used when optimizing the voice activity detection model can also be selected as required. In a specific embodiment, the learning rate is less than or equal to 0.002, which makes the optimization fast and effective.
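Mirroring the earlier loop, a sketch of optimizing the voice activity detection model with the cross-entropy loss and a learning rate of at most 0.002 mentioned above; the small fully connected classifier over the 4096-dimensional features, the placeholder batches and the threshold value are assumptions.

```python
import torch
import torch.nn as nn

voice_activity_detection_model = nn.Sequential(            # assumed small classifier over the 4096-dim features
    nn.Linear(4096, 256), nn.ReLU(),
    nn.Linear(256, 3),                                      # e.g. noise / human voice / mute
)
criterion = nn.CrossEntropyLoss()                           # voice activity detection loss
optimizer = torch.optim.Adam(voice_activity_detection_model.parameters(), lr=2e-3)  # learning rate <= 0.002
detection_loss_threshold = 0.1                              # predetermined voice activity detection loss threshold (assumed value)

# Placeholder batches standing in for the voice activity detection training data set
# (features already extracted by the trained audio feature extraction model).
vad_training_loader = [(torch.randn(8, 4096), torch.randint(0, 3, (8,)))]

for training_audio_frame_features, reference_voice_activity_categories in vad_training_loader:
    logits = voice_activity_detection_model(training_audio_frame_features)
    loss = criterion(logits, reference_voice_activity_categories)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() <= detection_loss_threshold:             # loss meets the threshold: training is done
        break
```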
Step S17: and obtaining the trained voice activity detection model.
It can be seen that, with the voice activity detection model training method provided by the embodiments of the present invention, before the voice activity detection model is trained, the trained audio feature extraction model is first used to extract the voice activity detection training audio frame features of the voice activity detection training audio frames; those features are then used to obtain the training voice activity categories, and the voice activity detection model is trained with the training voice activity categories and the reference voice activity categories of the voice activity detection training audio frames to obtain the trained voice activity detection model. Because the audio feature extraction model and the voice activity detection model are trained separately, the probability of overfitting of the trained voice activity detection model is reduced; and because the number of categories in the feature extraction training data set used for training the audio feature extraction model is greater than the number of categories of the reference voice activity categories, the voice activity detection training audio frame features extracted by the audio feature extraction model are more accurate. This improves the robustness of the trained voice activity detection model, makes it applicable to voice activity detection on many different kinds of audio, and thereby improves the accuracy of the voice activity detection results obtained with the trained voice activity detection model.
Referring to fig. 4, fig. 4 is a flowchart illustrating a voice activity detection method according to an embodiment of the present invention, and an embodiment of the present invention further provides a voice activity detection method, including:
step S21: and acquiring the characteristics of the voice activity audio frame to be detected corresponding to the voice activity audio frame to be detected.
The trained audio feature extraction model can be used to acquire the features of the voice activity audio frame to be detected corresponding to the voice activity audio frame to be detected.
The method for acquiring the voice activity audio frame to be detected according to the voice activity audio to be detected is the same as the method for acquiring the feature extraction audio frame and the voice activity detection training audio frame, and specifically comprises the following steps:
firstly, a window function, a set time window value and a set frame shift are used to obtain the voice activity audio frequency spectrum to be detected corresponding to each audio frequency frame of the voice activity audio to be detected.
In particular, the window function may be selected as desired. In one embodiment, the window function may be a hanning window, a hamming window, or the like.
Of course, the setting window value can also be set according to the requirement. In one embodiment, the set time window value may be 25ms to 50ms, such as 25ms, 30ms, or 40 ms.
Of course, the setting frame shift can also be set as required. In one embodiment, the set frame shift may be 5ms to 20ms, such as 10 ms.
Secondly, the audio spectrum of the voice activity to be detected is mapped onto a filter bank to calculate the audio sound spectrum of the voice activity to be detected corresponding to the audio spectrum of the voice activity to be detected.
Wherein the filter bank can be selected as desired.
In a specific embodiment, the filter bank comprises a Mel filter bank, the audio sound spectrum of the voice activity to be detected comprises a Mel sound spectrum of the voice activity to be detected, and the frequency cepstral coefficients comprise Mel frequency cepstral coefficients.
And thirdly, performing cepstrum analysis on the audio frequency spectrum of the voice activity to be detected to obtain a frequency cepstrum coefficient corresponding to the audio frequency spectrum of the voice activity to be detected.
By performing cepstrum analysis, the obtained frequency cepstrum coefficient corresponding to the audio frequency spectrum of the voice activity to be detected can well reflect the characteristics of the audio frequency spectrum of the voice activity to be detected.
And then, obtaining the voice activity audio frame to be detected by using the frequency cepstrum coefficient of the voice activity audio data to be detected.
The obtained frequency cepstrum coefficient corresponding to the audio frequency spectrum of the voice activity to be detected can be directly used as the audio frame of the voice activity to be detected.
In another specific implementation, the obtained frequency cepstrum coefficients corresponding to the audio frequency spectrum of the voice activity to be detected may also be combined to obtain the audio frame of the voice activity to be detected.
Therefore, the step of obtaining the voice activity audio frame to be detected by using the frequency cepstrum coefficient of the voice activity audio data to be detected may include framing at least two consecutive groups of the frequency cepstrum coefficients of the voice activity audio data to be detected to obtain a combined frequency cepstrum coefficient, which is used as the voice activity audio frame to be detected.
The audio of the voice activity to be detected is processed through this series of steps, and the obtained audio frames of the voice activity to be detected greatly reduce the data volume while preserving the characteristics of the audio of the voice activity to be detected well, which facilitates subsequent feature extraction.
And finally, acquiring the voice activity audio frame characteristics to be detected corresponding to the voice activity audio frames to be detected by utilizing the trained audio characteristic extraction model.
As described above, the trained audio feature extraction model can be used to extract the features of the to-be-detected voice activity audio frame, and the features of the to-be-detected voice activity audio frame can well represent the information of the to-be-detected voice activity audio frame.
Step S22: obtaining, by using the voice activity detection model obtained by training with the above voice activity detection model training method, the voice activity detection category corresponding to the voice activity audio frame to be detected, based on the features of the voice activity audio frame to be detected.
In particular, the categories of the voice activity detection categories may be consistent with the reference voice activity categories and the training voice activity categories.
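Putting the detection steps together, one possible end-to-end inference sketch that reuses the helper functions and models sketched in the earlier blocks; all names, shapes and the category ordering are illustrative assumptions rather than the patent's concrete implementation.

```python
import numpy as np
import torch

CATEGORY_NAMES = ["noise", "human_voice", "mute"]           # assumed ordering of the detection categories

def detect_voice_activity(audio_signal, feature_model, vad_model, sample_rate=16000):
    """Return one voice activity detection category per voice activity audio frame to be detected."""
    spectra = frame_spectra(audio_signal, sample_rate)              # framing sketch from step S1111
    fb = mel_filterbank(n_fft_bins=spectra.shape[1], sample_rate=sample_rate)
    coeffs = [mel_cepstral_coeffs(fb @ s) for s in spectra]         # Mel sound spectrum -> cepstral coefficients
    frames = stack_frames(coeffs)                                   # combined coefficients, one row per audio frame
    x = torch.tensor(frames, dtype=torch.float32).view(len(frames), 1, 4, -1)  # 4 groups of 13 coefficients each
    with torch.no_grad():
        features, _ = feature_model(x)                              # voice activity audio frame features to be detected
        logits = vad_model(features)                                # voice activity detection model
    return [CATEGORY_NAMES[i] for i in logits.argmax(dim=1).tolist()]
```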
It can be seen that, by using the voice activity detection method provided by the embodiment of the present invention, the audio feature extraction model that is trained is first used to extract the feature of the voice activity audio frame to be detected, and then the trained voice activity detection model is used to obtain the voice activity detection category according to the feature of the voice activity audio frame to be detected.
In the following, the speech activity detection model training apparatus and the speech activity detection apparatus provided by the embodiments of the present invention are introduced, and the speech activity detection model training apparatus and the speech activity detection apparatus described below may be regarded as a functional module architecture that is required to be configured by an electronic device (e.g., a PC) to respectively implement the speech activity detection model training method and the speech activity detection method provided by the embodiments of the present invention. The contents of the speech activity detection model training apparatus and the speech activity detection apparatus described below may be referred to in correspondence with the contents of the speech activity detection model training method and the speech activity detection method described above, respectively.
Referring to fig. 5, fig. 5 is a schematic diagram of a speech activity detection model training device according to an embodiment of the present invention, where the embodiment of the present invention provides a speech activity detection model training device, including:
a voice activity detection training data set acquisition module 22 adapted to acquire a voice activity detection training data set including voice activity detection training audio frame features of a voice activity detection training audio frame of a voice activity detection training audio and a reference voice activity category of the voice activity detection training audio frame, wherein the voice activity detection training audio frame features are acquired by the trained audio feature extraction model 11 based on the voice activity detection training audio frame, and the category number of the reference natural category is greater than the category number of the reference voice activity category;
a training voice activity category obtaining module 23, adapted to obtain, by using the voice activity detection model 12, a training voice activity category of the voice activity detection training audio frame according to the voice activity detection training audio frame feature;
and the voice activity detection model optimization module 24 is adapted to obtain the voice activity detection loss of the voice activity detection model 12 according to the training voice activity category and the reference voice activity category, and optimize the voice activity detection model 12 according to the voice activity detection loss until the voice activity detection loss meets a predetermined voice activity detection loss threshold value, so as to obtain the trained voice activity detection model 12.
The voice activity detection model training device provided by the embodiment of the invention utilizes the audio characteristic extraction model 11 to obtain the voice activity detection training data set required by the training of the voice activity detection model, and therefore, the training of the audio characteristic extraction model 11 is required firstly.
It is easily understood that the audio feature extraction model 11 may extract features from the input audio so that the extracted features can subsequently be used for voice activity detection; specifically, the audio feature extraction model 11 extracts the voice activity detection training audio frame features of the individual audio frames of the voice activity detection training audio (i.e., the voice activity detection training audio frames).
In one embodiment, the training of the audio feature extraction model 11 may be completed at any time before the voice activity detection model is trained, and the trained model is then used directly during voice activity detection model training; of course, in other embodiments, the audio feature extraction model 11 may instead be trained only when voice activity detection model training is required.
It can be seen that, with the voice activity detection model training apparatus provided by the embodiment of the present invention, before the voice activity detection model 12 is trained, the trained audio feature extraction model 11 is first used to extract the voice activity detection training audio frame features of the voice activity detection training audio frames; those features are then used to obtain the training voice activity categories, and the voice activity detection model 12 is trained with the training voice activity categories and the reference voice activity categories of the voice activity detection training audio frames to obtain the trained voice activity detection model 12. Because the audio feature extraction model 11 and the voice activity detection model 12 are trained separately, the probability of overfitting of the trained voice activity detection model 12 is reduced; and because the number of categories in the feature extraction training data set used for training the audio feature extraction model 11 is greater than the number of categories of the reference voice activity categories, the voice activity detection training audio frame features extracted by the audio feature extraction model 11 are more accurate. This improves the robustness of the trained voice activity detection model 12, makes it applicable to voice activity detection on many different kinds of audio, and thereby improves the accuracy of the voice activity detection results obtained with the trained voice activity detection model 12.
In a specific embodiment, the method further comprises the following steps: an audio feature extraction model training module 21, wherein the audio feature extraction model training module 21 comprises:
a feature extraction training data set obtaining unit 211 adapted to obtain a feature extraction training data set including feature extraction audio frames and reference natural categories of the feature extraction audio frames;
a feature extraction audio frame feature obtaining unit 212 adapted to obtain a feature extraction audio frame feature of the feature extraction audio frame by using a feature extraction layer of the audio feature extraction model 11;
a training natural category obtaining unit 213 adapted to obtain, by using the natural category layer of the audio feature extraction model 11, the training natural categories of the feature extraction audio frames according to the feature extraction audio frame features, wherein the number of categories of the training natural categories is greater than the number of categories of the training voice activity categories;
an audio feature extraction model optimization unit 214, adapted to obtain a natural category loss of the audio feature extraction model 11 according to the training natural category and the reference natural category, and optimize the audio feature extraction model 11 according to the natural category loss until the natural category loss meets a predetermined natural category loss threshold, so as to obtain the trained audio feature extraction model 11;
the voice activity detection training data set acquisition module 22 is adapted to acquire the voice activity detection training audio frame features based on the voice activity detection training audio frame through a feature extraction layer of the audio feature extraction model 11.
The feature extraction training data set includes feature extraction audio frames and reference natural categories for the feature extraction audio frames.
The reference natural category may be a specific category of the feature extraction audio frame, such as a specific category of sounds like whistling, mouse clicks, piano sounds, human voices, and the like. Wherein the number of categories of the reference natural category may be greater than the number of categories of the reference voice activity category.
Since the data set labeled with the reference natural category (e.g., specific kind of sounds such as whistling, mouse clicks, piano sounds, etc.) is large in scale, the category of audio is also relatively detailed; the voice activity detection category data set marked with voice activity detection categories such as noise category and non-noise category is small in size, and the cost of self-marking is high. Therefore, the audio feature extraction model can be trained by using the data set marked with the reference natural category, and the voice activity detection model can be trained by using only the data set of the voice activity detection category, so that the cost can be reduced, and the training accuracy can be improved.
It is easy to understand that the audio feature extraction model 11 trained in this way can extract the audio features in the audio well, and the downstream task can use the extracted audio features to detect the voice activity category, so that overfitting of the model used by the downstream task is avoided.
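To make this separate-training scheme concrete, a minimal PyTorch-style sketch is given below. The module names, layer sizes, category counts and loss thresholds are illustrative assumptions, not the concrete implementation of this embodiment: the feature extraction network with its natural category head is first optimized against a natural-category cross-entropy loss on the large, finely labeled data set, then frozen, and a small voice activity detection classifier is trained on its output features using only the voice activity detection category data set.

```python
import torch
import torch.nn as nn

# Assumed sizes: 39-dim frame features in, 128-dim learned features,
# 50 natural categories (whistle, mouse click, piano, human voice, ...),
# 3 voice activity categories (noise, human voice, silence).
N_MFCC, FEAT_DIM, N_NATURAL, N_VAD = 39, 128, 50, 3

class AudioFeatureExtractor(nn.Module):
    """Feature extraction layer followed by a natural category layer."""
    def __init__(self):
        super().__init__()
        self.feature_layer = nn.Sequential(
            nn.Linear(N_MFCC, FEAT_DIM), nn.ReLU(),
            nn.Linear(FEAT_DIM, FEAT_DIM), nn.ReLU())
        self.natural_category_layer = nn.Linear(FEAT_DIM, N_NATURAL)

    def forward(self, x):
        feats = self.feature_layer(x)
        return feats, self.natural_category_layer(feats)

def pretrain_extractor(extractor, natural_loader, loss_threshold=0.5, max_epochs=50):
    """Stage 1: optimize the natural category loss until it meets a threshold."""
    opt = torch.optim.Adam(extractor.parameters(), lr=1e-3)
    ce = nn.CrossEntropyLoss()
    for _ in range(max_epochs):
        for frames, natural_labels in natural_loader:
            _, logits = extractor(frames)
            loss = ce(logits, natural_labels)
            opt.zero_grad(); loss.backward(); opt.step()
        if loss.item() < loss_threshold:
            break
    return extractor

def train_vad(extractor, vad_head, vad_loader, loss_threshold=0.2, max_epochs=50):
    """Stage 2: freeze the extractor and train only the voice activity classifier."""
    extractor.eval()
    opt = torch.optim.Adam(vad_head.parameters(), lr=1e-3)
    ce = nn.CrossEntropyLoss()
    for _ in range(max_epochs):
        for frames, vad_labels in vad_loader:
            with torch.no_grad():
                feats, _ = extractor(frames)
            loss = ce(vad_head(feats), vad_labels)
            opt.zero_grad(); loss.backward(); opt.step()
        if loss.item() < loss_threshold:
            break
    return vad_head

vad_head = nn.Linear(FEAT_DIM, N_VAD)  # the voice activity detection model
```

Freezing the pretrained extractor is what keeps the small voice activity detection data set from overfitting the feature layers; only the lightweight classifier is fitted to the voice activity categories.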
Optionally, the category of the reference voice activity category and the category of the training voice activity category each include at least a noise category and a non-noise category.
In this way, voice activity detection can distinguish noise-category audio from non-noise-category audio, which facilitates subsequent denoising.
Optionally, the non-noise class includes a human voice class and a mute class.
This makes it convenient to remove the silent segments, extract the speech segments, and carry out subsequent processing such as speech recognition.
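As a downstream illustration only (the 10 ms frame shift and the label names are assumptions, not values fixed by this embodiment), per-frame voice activity categories can be merged into speech time intervals so that silent and noise segments are dropped before speech recognition:

```python
from itertools import groupby

HOP_SECONDS = 0.01  # assumed 10 ms frame shift

def speech_segments(frame_labels, hop=HOP_SECONDS):
    """Merge consecutive 'voice' frames into (start_s, end_s) intervals,
    dropping frames labeled 'silence' or 'noise'."""
    segments, t = [], 0.0
    for label, group in groupby(frame_labels):
        n = sum(1 for _ in group)
        if label == "voice":
            segments.append((t, t + n * hop))
        t += n * hop
    return segments

# e.g. ['silence', 'voice', 'voice', 'noise', 'voice'] -> [(0.01, 0.03), (0.04, 0.05)]
```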
Optionally, the voice activity detection training data set obtaining module 22 includes a voice activity detection training audio frame obtaining unit, adapted to: obtaining the voice activity detection training audio frequency spectrum corresponding to each audio frame of the voice activity detection training audio by using the window function, the set time window value and the set frame shift;
mapping the voice activity detection training audio frequency spectrum to a filter bank to obtain the voice activity detection training audio spectrum corresponding to the voice activity detection training audio frequency spectrum;
performing cepstrum analysis on the voice activity detection training audio spectrum to obtain the frequency cepstrum coefficients corresponding to the voice activity detection training audio spectrum;
and obtaining the voice activity detection training audio frame by using the frequency cepstrum coefficients of the voice activity detection training audio data.
The voice activity detection training audio is processed through a series of steps, and the obtained voice activity detection training audio frame can greatly reduce the data volume and well keep the characteristics of the voice activity detection training audio, so that subsequent characteristic extraction is facilitated.
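The frame spectrum, mel filter bank and cepstral analysis steps described above can be sketched with librosa as follows; the 25 ms window, 10 ms frame shift, FFT size and the numbers of mel filters and cepstral coefficients are assumed values for illustration, not parameters fixed by this embodiment.

```python
import numpy as np
import librosa
import scipy.fftpack

# Assumed analysis parameters: 16 kHz audio, 25 ms window, 10 ms frame shift.
SR, WIN, HOP, N_FFT, N_MELS, N_MFCC = 16000, 400, 160, 512, 40, 13

def mfcc_frames(wav_path):
    y, _ = librosa.load(wav_path, sr=SR)
    # 1. Window function + set time window + set frame shift -> per-frame frequency spectrum.
    spec = np.abs(librosa.stft(y, n_fft=N_FFT, win_length=WIN,
                               hop_length=HOP, window="hann")) ** 2
    # 2. Map the frequency spectrum onto a mel filter bank -> mel (sound) spectrum.
    mel_fb = librosa.filters.mel(sr=SR, n_fft=N_FFT, n_mels=N_MELS)
    mel_spec = mel_fb @ spec
    # 3. Cepstral analysis (log, then DCT) -> mel frequency cepstral coefficients.
    log_mel = np.log(mel_spec + 1e-10)
    mfcc = scipy.fftpack.dct(log_mel, axis=0, norm="ortho")[:N_MFCC]
    return mfcc.T  # one row of cepstral coefficients per audio frame
```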
Optionally, the filter bank includes a mel filter bank, the voice activity detection training audio spectrum includes a voice activity detection training audio mel spectrum, and the frequency cepstral coefficients include mel frequency cepstral coefficients.
The voice activity detection training audio Mel sound spectrum better matches the auditory characteristics of the human ear, and the Mel frequency cepstrum coefficients reflect well the characteristics of the voice activity detection training audio spectrum.
Optionally, the voice activity detection training audio frame obtaining unit is further adapted to: framing the frequency cepstrum coefficients of at least two groups of continuous voice activity detection training audio data to obtain combined frequency cepstrum coefficients serving as the voice activity detection training audio frames.
The voice activity detection training audio is processed through this series of steps, and the resulting voice activity detection training audio frame greatly reduces the data volume while well preserving the characteristics of the training audio, which facilitates subsequent processing.
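A small sketch of this frame-combining step, assuming the MFCC matrix produced by the pipeline sketch above and an illustrative context of 5 consecutive frames:

```python
import numpy as np

def stack_frames(mfcc, context=5):
    """Concatenate the cepstral coefficients of `context` consecutive frames
    into one combined coefficient vector per training audio frame.
    mfcc: array of shape (n_frames, n_mfcc), e.g. from mfcc_frames() above."""
    n_frames, n_mfcc = mfcc.shape
    stacked = [mfcc[i:i + context].reshape(-1)
               for i in range(n_frames - context + 1)]
    return np.stack(stacked)  # shape: (n_frames - context + 1, context * n_mfcc)
```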
Referring to fig. 6, fig. 6 is a schematic diagram of a voice activity detection apparatus according to an embodiment of the present invention, and the embodiment of the present invention further provides a voice activity detection apparatus, including:
the voice activity audio frame feature acquisition module 31 is adapted to acquire voice activity audio frame features to be detected corresponding to the voice activity audio frames to be detected;
and a voice activity detection category obtaining module 32, adapted to utilize the voice activity detection model 12 obtained by the training method of the voice activity detection model to obtain a voice activity detection category corresponding to the voice activity audio frame to be detected based on the characteristics of the voice activity audio frame to be detected.
It can be seen that, with the voice activity detection apparatus provided in the embodiment of the present invention, the trained audio feature extraction model can be used to extract the features of the voice activity audio frame to be detected, and the trained voice activity detection model 12 then obtains the voice activity detection category from those features.
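A minimal sketch of this detection path, reusing the assumed AudioFeatureExtractor and vad_head names from the training sketch earlier: the frozen feature extraction layer produces features for the voice activity audio frames to be detected, and the trained voice activity detection model classifies each frame.

```python
import torch

VAD_CATEGORIES = ["noise", "voice", "silence"]  # assumed label order

@torch.no_grad()
def detect_voice_activity(extractor, vad_head, frame_features):
    """frame_features: tensor of shape (n_frames, N_MFCC) for the audio to detect."""
    extractor.eval(); vad_head.eval()
    feats, _ = extractor(frame_features)          # feature extraction layer only
    categories = vad_head(feats).argmax(dim=-1)   # per-frame voice activity category
    return [VAD_CATEGORIES[i] for i in categories.tolist()]
```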
Of course, an embodiment of the present invention further provides an electronic device. The electronic device provided in the embodiment of the present invention may load, in program form, the program modules that implement the voice activity detection model training method and the voice activity detection method provided in the embodiments of the present invention; it may be any electronic device with sufficient data processing capability, for example a terminal device or a server device.
Therefore, referring to fig. 7, fig. 7 is a schematic view of an electronic device according to an embodiment of the invention.
The device provided by the embodiment of the invention comprises: at least one memory 41 and at least one processor 42, the memory 41 storing one or more computer-executable instructions, the processor 42 invoking the one or more computer-executable instructions to perform the speech activity detection model training method and the speech activity detection method.
It will be appreciated that the device may also comprise at least one communication interface 43 and at least one communication bus 44; processor 42 and memory 41 may be located on the same electronic device, for example processor 42 and memory 41 may be located on a server device or a terminal device; the processor 42 and the memory 41 may also be located on different electronic devices.
In the embodiment of the present invention, the electronic device may be a tablet computer, a notebook computer, or the like capable of performing the voice activity detection model training and the voice activity detection.
In the embodiment of the present invention, the number of the processor 42, the communication interface 43, the memory 41 and the communication bus 44 is at least one, and the processor 42, the communication interface 43 and the memory 41 complete the communication with each other through the communication bus 44; it is clear that the communication connections of the processor 42, the communication interface 43, the memory 41 and the communication bus 44 shown in the figure are only an alternative.
Alternatively, the communication interface 43 may be an interface of a communication module, such as an interface of a GSM module; processor 42 may be a central processing unit CPU, or a specific integrated circuit ASIC, or one or more integrated circuits configured to implement an embodiment of the present invention; the memory 41 may comprise high speed RAM memory, and may also include non-volatile memory, such as at least one disk memory.
It should be noted that the above-mentioned apparatus may also include other devices (not shown) that may not be necessary to the disclosure of the embodiments of the present invention; these other components may not be necessary to understand the disclosure of embodiments of the present invention, which are not individually described herein.
An embodiment of the present invention further provides a storage medium, where the storage medium stores one or more computer-executable instructions, and the one or more computer-executable instructions are used to execute the voice activity detection model training method and the voice activity detection method.
The computer-executable instructions stored in the storage medium provided by the embodiment of the present invention implement the voice activity detection model training method provided by the embodiment of the present invention: before the voice activity detection model is trained, the trained audio feature extraction model is used to extract the voice activity detection training audio frame features of the voice activity detection training audio frame; the training voice activity category is then obtained from those features, and the voice activity detection model is trained with the training voice activity category and the reference voice activity category of the voice activity detection training audio frame to obtain the trained voice activity detection model. Because the audio feature extraction model and the voice activity detection model are trained separately, the probability of overfitting of the trained voice activity detection model is reduced; and because the number of categories in the feature extraction training data set used to train the audio feature extraction model is greater than the number of categories of the reference voice activity categories, the voice activity detection training audio frame features extracted by the audio feature extraction model are more accurate, the robustness of the trained voice activity detection model is improved, the model is applicable to voice activity detection over audio of various categories, and the accuracy of the voice activity detection results obtained with the trained voice activity detection model is improved.
The embodiments of the present invention described above are combinations of elements and features of the present invention. Unless otherwise mentioned, the elements or features may be considered optional. Each element or feature may be practiced without being combined with other elements or features. In addition, the embodiments of the present invention may be configured by combining some elements and/or features. The order of operations described in the embodiments of the present invention may be rearranged. Some configurations of any embodiment may be included in another embodiment, and may be replaced with corresponding configurations of the other embodiment. It is obvious to those skilled in the art that claims that are not explicitly cited in each other in the appended claims may be combined into an embodiment of the present invention or may be included as new claims in a modification after the filing of the present application.
Embodiments of the invention may be implemented by various means, such as hardware, firmware, software, or a combination thereof. In a hardware configuration, the method according to an exemplary embodiment of the present invention may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, and the like.
In a firmware or software configuration, embodiments of the present invention may be implemented in the form of modules, procedures, functions, and the like. The software codes may be stored in memory units and executed by processors. The memory unit is located inside or outside the processor, and may transmit and receive data to and from the processor via various known means.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Although the embodiments of the present invention are disclosed above, the embodiments of the present invention are not limited thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present embodiments, and it is intended that the scope of the present embodiments be defined by the appended claims.

Claims (12)

1. A method for training a voice activity detection model, comprising:
acquiring a voice activity detection training data set, wherein the voice activity detection training data set comprises voice activity detection training audio frame features of a voice activity detection training audio frame of a voice activity detection training audio and a reference voice activity category of the voice activity detection training audio frame, the voice activity detection training audio frame features are acquired, through a trained audio feature extraction model, based on the voice activity detection training audio frame, and the number of categories in the feature extraction training data set used by the trained audio feature extraction model is larger than the number of categories of the reference voice activity categories;
acquiring training voice activity types of the voice activity detection training audio frames according to the voice activity detection training audio frame characteristics by using the voice activity detection model;
acquiring voice activity detection loss of the voice activity detection model according to the training voice activity type and the reference voice activity type, and optimizing the voice activity detection model according to the voice activity detection loss until the voice activity detection loss meets a preset voice activity detection loss threshold value to obtain the trained voice activity detection model;
the training method of the audio feature extraction model comprises the following steps: acquiring a feature extraction training data set, wherein the feature extraction training data set comprises feature extraction audio frames and reference natural categories of the feature extraction audio frames, and the number of categories of the reference natural categories is greater than the number of categories of the reference voice activity categories; acquiring feature extraction audio frame features of the feature extraction audio frames by using a feature extraction layer of the audio feature extraction model; acquiring training natural categories of the feature extraction audio frames according to the feature extraction audio frame features by using a natural category layer of the audio feature extraction model; the reference voice activity categories and the training voice activity categories both at least include a noise category and a non-noise category, and the reference natural categories and the training natural categories each include at least one of a whistle sound, a mouse click sound, a piano sound, and a human voice.
2. The training method of a voice activity detection model according to claim 1, wherein the number of classes of the training natural classes is larger than the number of classes of the training voice activity classes, the training method of an audio feature extraction model further comprising:
acquiring the natural category loss of an audio feature extraction model according to the training natural category and the reference natural category, and optimizing the audio feature extraction model according to the natural category loss until the natural category loss meets a preset natural category loss threshold value to obtain the trained audio feature extraction model;
the obtaining of the voice activity detection training audio frame features based on the voice activity detection training audio frame through an audio feature extraction model comprises:
and the voice activity detection training audio frame features are obtained based on the voice activity detection training audio frame through a feature extraction layer of an audio feature extraction model.
3. The method of training a voice activity detection model of claim 1, wherein the non-noise classes include a human voice class and a silence class.
4. The method for training a voice activity detection model according to any one of claims 1-3, wherein the step of obtaining the voice activity detection training audio frame comprises:
obtaining the voice activity detection training audio frequency spectrum corresponding to each audio frame of the voice activity detection training audio by using a window function, a set time window value and a set frame shift;
mapping the voice activity detection training audio frequency spectrum to a filter bank to obtain the voice activity detection training audio spectrum corresponding to the voice activity detection training audio frequency spectrum;
performing cepstrum analysis on the voice activity detection training audio spectrum to obtain a frequency cepstrum coefficient corresponding to the voice activity detection training audio spectrum;
and obtaining the voice activity detection training audio frame by using the frequency cepstrum coefficient of the voice activity detection training audio data.
5. The method of training a voice activity detection model of claim 4, wherein the filter bank comprises a Mel filter bank, the voice activity detection training audio spectrum comprises a voice activity detection training audio Mel spectrum, and the frequency cepstral coefficients comprise Mel frequency cepstral coefficients.
6. The method of claim 4, wherein the step of deriving the voice activity detection training audio frame using the frequency cepstral coefficients of the voice activity detection training audio data comprises:
framing the frequency cepstrum coefficients of at least two groups of continuous voice activity detection training audio data to obtain combined frequency cepstrum coefficients serving as the voice activity detection training audio frames.
7. A method of voice activity detection, comprising:
acquiring the characteristics of the voice activity audio frame to be detected corresponding to the voice activity audio frame to be detected;
obtaining, by using the voice activity detection model trained with the voice activity detection model training method according to any one of claims 1 to 6, a voice activity detection category corresponding to the voice activity audio frame to be detected based on the voice activity audio frame features to be detected.
8. A speech activity detection model training apparatus, comprising:
a voice activity detection training data set acquisition module adapted to acquire a voice activity detection training data set, the voice activity detection training data set including voice activity detection training audio frame features of a voice activity detection training audio frame and a reference voice activity category of the voice activity detection training audio frame, wherein the voice activity detection training audio frame features are acquired based on the voice activity detection training audio frame through a trained audio feature extraction model;
the training voice activity type acquisition module is suitable for acquiring the training voice activity type of the voice activity detection training audio frame according to the voice activity detection training audio frame characteristics by using the voice activity detection model;
a voice activity detection model optimization module, adapted to obtain a voice activity detection loss of the voice activity detection model according to the training voice activity category and the reference voice activity category, and optimize the voice activity detection model according to the voice activity detection loss until the voice activity detection loss meets a predetermined voice activity detection loss threshold, so as to obtain the trained voice activity detection model;
an audio feature extraction model training module, the audio feature extraction model training module comprising:
a feature extraction training data set acquisition unit adapted to acquire a feature extraction training data set including feature extraction audio frames and reference natural categories of the feature extraction audio frames, wherein the number of categories of the reference natural categories is greater than the number of categories of the reference voice activity categories;
the characteristic extraction audio frame characteristic acquisition unit is suitable for acquiring the characteristic extraction audio frame characteristics of the characteristic extraction audio frame by utilizing the characteristic extraction layer of the audio characteristic extraction model;
the training natural category acquisition unit is adapted to acquire the training natural categories of the feature extraction audio frames according to the feature extraction audio frame features by using the natural category layer of the audio feature extraction model; the reference voice activity categories and the training voice activity categories both at least include a noise category and a non-noise category, and the reference natural categories and the training natural categories each include at least one of a whistle sound, a mouse click sound, a piano sound, and a human voice.
9. The voice activity detection model training apparatus of claim 8, wherein the number of classes of the training natural classes is greater than the number of classes of the training voice activity classes, the audio feature extraction model training module further comprising:
the audio feature extraction model optimization unit is suitable for obtaining the natural category loss of an audio feature extraction model according to the training natural category and the reference natural category, and optimizing the audio feature extraction model according to the natural category loss until the natural category loss meets a preset natural category loss threshold value to obtain the trained audio feature extraction model;
the voice activity detection training data set acquisition module is adapted to acquire the voice activity detection training audio frame features based on the voice activity detection training audio frame through a feature extraction layer of an audio feature extraction model.
10. A voice activity detection apparatus, comprising:
the voice activity audio frame feature acquisition module is suitable for acquiring voice activity audio frame features to be detected corresponding to the voice activity audio frames to be detected;
a voice activity detection category obtaining module, adapted to utilize the voice activity detection model obtained by the training method of the voice activity detection model according to any one of claims 1 to 6 to obtain a voice activity detection category corresponding to the voice activity audio frame to be detected based on the characteristics of the voice activity audio frame to be detected.
11. A storage medium, characterized in that the storage medium stores a program adapted for voice activity detection model training to implement the voice activity detection model training method according to any one of claims 1 to 6, or the storage medium stores a program adapted for voice activity detection to implement the voice activity detection method according to claim 7.
12. An electronic device comprising at least one memory and at least one processor; the memory stores a program that the processor invokes to perform the voice activity detection model training method according to any one of claims 1 to 6 or the voice activity detection method according to claim 7.