CN116895289A - Training method of voice activity detection model, voice activity detection method and device

Info

Publication number: CN116895289A
Authority: CN (China)
Prior art keywords: voice, result, module, processing, convolution
Legal status: Pending
Application number: CN202311049011.4A
Other languages: Chinese (zh)
Inventors: 张结, 王景渊, 周叶萍, 刘沛奇
Current assignee: University of Science and Technology of China (USTC)
Original assignee: University of Science and Technology of China (USTC)
Application filed by University of Science and Technology of China (USTC)
Priority to CN202311049011.4A
Publication of CN116895289A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00
    • G10L 25/78: Detection of presence or absence of voice signals
    • G10L 25/87: Detection of discrete points within a voice signal
    • G10L 25/03: characterised by the type of extracted parameters
    • G10L 25/27: characterised by the analysis technique
    • G10L 25/30: characterised by the analysis technique using neural networks
    • Y02T 10/40: Engine management systems (Y02/Y02T climate-change tags carried over from the original record)
Abstract

The disclosure provides a training method of a voice activity detection model, a voice activity detection method and a voice activity detection device. The training method comprises the steps of obtaining a training set, wherein the training set comprises a plurality of voice training samples; performing conversion processing on the voice training sample to obtain the target logarithmic Mel spectrum characteristics; processing the target logarithmic mel spectrum characteristics by using a gating convolution layer and a maximum pooling layer to obtain a coding result, wherein the convolution coding module comprises the gating convolution layer, the maximum pooling layer and a first full-connection layer; processing the coding result by using the first full-connection layer to obtain a prediction tag, wherein the prediction tag characterizes whether a voice signal exists in a voice training sample; processing the coding result by using a residual error decoding module to obtain a prediction result, wherein the initial voice detection model comprises a convolution coding module and a residual error decoding module; inputting the prediction label and the prediction result into a loss function, and outputting a loss result; and iteratively adjusting network parameters of the initial voice detection model according to the loss result to obtain a trained voice activity detection model.

Description

Training method of voice activity detection model, voice activity detection method and device
Technical Field
The present disclosure relates to the field of speech data processing technology, and more particularly, to a training method of a speech activity detection model, a speech activity detection method, a training apparatus of a speech activity detection model, a speech activity detection apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Voice activity detection (Voice Activity Detection, VAD) aims to identify whether a voice signal is present in an audio signal that is subject to various background noise disturbances. It is typically used as a front-end preprocessor and largely determines the performance of back-end tasks. For example, in automatic speech recognition (Automatic Speech Recognition, ASR), studies have shown that even when background noise is low, a mismatched front-end VAD can account for as much as half of the word error rate. In speech coding tasks, VAD can be effectively utilized to reduce the average bit rate and inter-channel interference. Other applications of VAD include speech separation, speech enhancement, speaker recognition, and the like.
In the process of implementing the disclosed concept, the inventors found at least the following problems in the related art: several key factors must be considered when combining a front-end voice activity detection module with more complex subsequent voice tasks. First, the VAD model is expected to have a lightweight model size in order to operate efficiently in resource-constrained environments. Second, it also needs to have low-latency characteristics to ensure real-time performance and immediate response.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a training method of a voice activity detection model, a voice activity detection method, a training apparatus of a voice activity detection model, a voice activity detection apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
One aspect of an embodiment of the present disclosure provides a training method of a voice activity detection model, including:
acquiring a training set, wherein the training set comprises a plurality of voice training samples;
for each voice training sample, performing conversion processing on the voice training sample to obtain a target logarithmic Mel spectrum characteristic;
processing the target logarithmic mel spectrum characteristics by using a gating convolution layer and a maximum pooling layer to obtain a coding result, wherein a convolution coding module comprises the gating convolution layer, the maximum pooling layer and a first full-connection layer;
processing the coding result by using the first full-connection layer to obtain a prediction tag, wherein the prediction tag represents whether a voice signal exists in the voice training sample;
processing the coding result by using a residual error decoding module to obtain a prediction result, wherein an initial voice detection model comprises the convolution coding module and the residual error decoding module;
inputting the prediction label and the prediction result into a loss function, and outputting a loss result;
and iteratively adjusting network parameters of the initial voice detection model according to the loss result to obtain a trained voice activity detection model.
According to an embodiment of the present disclosure, performing conversion processing on the voice training sample to obtain a target log mel-spectrum feature includes:
carrying out framing treatment on the voice training sample to obtain a plurality of voice sub-signals;
performing short-time Fourier transform processing on the voice sub-signals aiming at each voice sub-signal to obtain frequency domain information;
carrying out logarithmic Mel filtering processing on the frequency domain information to obtain initial logarithmic Mel spectrum characteristics;
and generating the target logarithmic Mel spectrum characteristics according to a plurality of the initial logarithmic Mel spectrum characteristics.
According to an embodiment of the present disclosure, the framing processing is performed on the voice training samples to obtain a plurality of voice sub-signals, including:
based on a preset step length, segmenting the voice training sample by using a preset Hanning window to obtain a plurality of voice sub-signals.
According to an embodiment of the present disclosure, the gating convolution layer includes a plurality of convolution layers;
The method for processing the target log mel spectrum features by using a gating convolution layer and a maximum pooling layer to obtain a coding result comprises the following steps:
processing the target logarithmic Mel spectrum characteristic by utilizing the convolution layers of the first part to obtain a voice characteristic;
processing the target logarithmic Mel spectrum characteristic by utilizing the convolution layers of the second part to obtain mask weights;
generating initial convolution characteristics according to the voice characteristics and the mask weights;
and processing the initial convolution characteristic by using the maximum pooling layer to generate the coding result.
According to an embodiment of the present disclosure, the residual decoding module includes i residual blocks connected in sequence and a second full connection layer connected to the last residual block;
processing the coding result by using the residual decoding module to obtain the prediction result includes:
in the case where i is equal to 1, processing the coding result by using the i-th residual block to obtain an initial output result;
in the case where i is not equal to 1, processing the output result of the (i-1)-th residual block by using the i-th residual block to obtain a target output result;
and processing the target output result output by the last residual block by using the second full connection layer to obtain the prediction result.
According to an embodiment of the present disclosure, for the i-th residual block:
processing input information by using a convolution block to obtain an initial output characteristic, wherein the convolution block comprises a plurality of convolution layers, and the input information represents the coding result or the initial output result that is input to the i-th residual block;
and generating a target output characteristic according to the input information and the initial output characteristic, wherein the target output characteristic represents an initial output result or a target output result output by the ith residual block.
According to an embodiment of the present disclosure, inputting the prediction tag and the prediction result into a loss function, outputting a loss result, includes:
inputting the prediction tag and the prediction result into a binary cross entropy function based on a preset activation function, and outputting the loss result, wherein the binary cross entropy function is generated according to a hyperparameter determined by the relative importance of the convolutional encoding module and the residual decoding module.
Another aspect of an embodiment of the present disclosure provides a voice activity detection method, including:
acquiring detection voice information;
inputting the detected voice information into a voice activity detection model, and outputting a recognition result, wherein the recognition result represents whether a voice signal exists in the detected voice information;
wherein the voice activity detection model is trained based on the method described above.
Another aspect of an embodiment of the present disclosure provides a training apparatus of a voice activity detection model, including:
the first acquisition module is used for acquiring a training set, wherein the training set comprises a plurality of voice training samples;
the conversion module is used for carrying out conversion processing on the voice training samples aiming at each voice training sample to obtain the target logarithmic Mel spectrum characteristics;
the first processing module is used for processing the target logarithmic mel spectrum characteristics by using a gating convolution layer and a maximum pooling layer to obtain a coding result, wherein the convolution coding module comprises the gating convolution layer, the maximum pooling layer and a first full-connection layer;
the second processing module is used for processing the coding result by using the first full-connection layer to obtain a prediction tag, wherein the prediction tag represents whether a voice signal exists in the voice training sample;
the third processing module is used for processing the coding result by using the residual error decoding module to obtain a prediction result, wherein the initial voice detection model comprises the convolution coding module and the residual error decoding module;
The loss calculation module is used for inputting the prediction label and the prediction result into a loss function and outputting a loss result;
and the iteration adjustment module is used for iteratively adjusting the network parameters of the initial voice detection model according to the loss result to obtain a trained voice activity detection model.
Another aspect of an embodiment of the present disclosure provides a voice activity detection apparatus, including:
the second acquisition module is used for acquiring the detection voice information;
the detection module is used for inputting the detection voice information into a voice activity detection model and outputting a recognition result, wherein the recognition result represents whether a voice signal exists in the detection voice information;
wherein the voice activity detection model is trained based on the method.
Another aspect of an embodiment of the present disclosure provides an electronic device, including: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described above.
Another aspect of an embodiment of the present disclosure provides a computer-readable storage medium storing computer-executable instructions that, when executed, are configured to implement a method as described above.
Another aspect of the disclosed embodiments provides a computer program product comprising computer executable instructions which, when executed, are to implement a method as described above.
According to the embodiment of the disclosure, the voice activity detection model is obtained by converting a voice training sample into a target log-mel spectrum feature, then generating a coding result according to the target log-mel spectrum feature by using a convolution coding module only comprising a gating convolution layer and a maximum pooling layer, and generating a prediction tag and a prediction result by using a first full connection layer and a residual decoding module according to the coding result respectively so as to adjust network parameters of an initial voice detection model according to a loss result calculated by the prediction tag and the prediction result. The convolution coding module only comprises the gating convolution layer and the maximum pooling layer, so that the batch normalization layer is omitted, the parameter number of the model is reduced, and the model is light under the condition that the model prediction accuracy is not obviously reduced.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments thereof with reference to the accompanying drawings in which:
FIG. 1 schematically illustrates an exemplary system architecture of a training method or a voice activity detection method to which a voice activity detection model may be applied, according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a training method of a speech activity detection model according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a structural schematic of a voice activity detection model according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a block diagram of a convolutional encoding module in accordance with an embodiment of the present disclosure;
fig. 5 schematically illustrates a structural schematic diagram of a residual decoding module according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart of a voice activity detection method according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of a training apparatus of a speech activity detection model according to an embodiment of the disclosure;
FIG. 8 schematically illustrates a block diagram of a voice activity detection apparatus according to an embodiment of the present disclosure;
fig. 9 schematically shows a block diagram of an electronic device adapted to implement the method described above, according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where expressions like "at least one of A, B and C" are used, they should generally be interpreted according to the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
Traditional VAD methods rely primarily on energy-based features such as time-domain power, spectral features, short-time energy, and spectral entropy. However, in environments with a low signal-to-noise ratio (Signal-to-Noise Ratio, SNR), accurately distinguishing human voice from noise with these conventional methods is challenging. In recent years, with the development of neural networks and their application in the field of speech processing, deep learning models such as deep fully connected networks and convolutional neural networks have also received research attention in this field. Although these methods improve upon conventional methods in performance, they are still largely affected by noise conditions.
To further overcome these limitations, the related art has proposed some more complex models. For example, by training a bottleneck deep neural network (bottleneck Deep Neural Network, bDNN) using multi-resolution cochleagram (Multi-Resolution CochleaGram, MRCG) features, the network can achieve better performance, albeit at a higher computational burden. Some related art proposes an adaptive context attention model (Adaptive Context Attention Model, ACAM) based on an attention mechanism, which uses contextual information in the training process and performs better than bDNN. However, the ACAM training process is not stable. Some related art proposes a time-frequency attention-based VAD model (STAM), which builds on ACAM with the same feature input to improve training stability.
Since the front-end voice activity detection module is often combined with more complex subsequent voice tasks, the intended VAD model needs to meet two important requirements in addition to noise robustness: a lightweight model size and low latency. The model size not only relates to spatial complexity but also affects decoding time, which directly determines latency performance. For example, in the ACAM and STAM methods, the VAD becomes non-causal due to the use of context information, which means the system must wait several time frames to build an input feature. Such high-latency, non-causal designs are clearly incompatible with online speech tasks (e.g., streaming automatic speech recognition).
In view of this, embodiments of the present disclosure provide a training method for a voice activity detection model, a voice activity detection method and a device. The training method comprises the steps of obtaining a training set, wherein the training set comprises a plurality of voice training samples; aiming at each voice training sample, converting the voice training sample to obtain the target logarithmic Mel spectrum characteristics; processing the target logarithmic mel spectrum characteristics by using a gating convolution layer and a maximum pooling layer to obtain a coding result, wherein the convolution coding module comprises the gating convolution layer, the maximum pooling layer and a first full-connection layer; processing the coding result by using a first full-connection layer to obtain a prediction tag, wherein the prediction tag characterizes whether a voice signal exists in a voice training sample; processing the coding result by using a residual error decoding module to obtain a prediction result, wherein the initial voice detection model comprises a convolution coding module and a residual error decoding module; inputting the prediction label and the prediction result into a loss function, and outputting a loss result; and iteratively adjusting network parameters of the initial voice detection model according to the loss result to obtain a trained voice activity detection model.
Fig. 1 schematically illustrates an exemplary system architecture 100 in which a training method of a speech activity detection model or a speech activity detection method may be applied according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients and/or social platform software, to name a few.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that, the training method of the voice activity detection model or the voice activity detection method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the training device or voice activity detection device of the voice activity detection model provided by embodiments of the present disclosure may be generally provided in the server 105. The training method of the voice activity detection model or the voice activity detection method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the training apparatus of the voice activity detection model or the voice activity detection apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Alternatively, the training method of the voice activity detection model or the voice activity detection method provided by the embodiments of the present disclosure may also be performed by the terminal device 101, 102, or 103, or may also be performed by other terminal devices other than the terminal device 101, 102, or 103. Accordingly, the training apparatus or the voice activity detection apparatus of the voice activity detection model provided in the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103, or in another terminal device different from the terminal device 101, 102, or 103.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 schematically illustrates a flowchart of a training method of a speech activity detection model according to an embodiment of the present disclosure. Fig. 3 schematically illustrates a structural diagram of a voice activity detection model according to an embodiment of the present disclosure.
As shown in fig. 2, the training method of the voice activity detection model includes operations S201 to S207.
In operation S201, a training set is acquired, wherein the training set includes a plurality of voice training samples;
in operation S202, for each voice training sample, performing conversion processing on the voice training sample to obtain a target log mel spectrum feature;
in operation S203, processing the target log mel spectrum feature by using the gating convolution layer and the maximum pooling layer to obtain a coding result, where the convolution coding module includes the gating convolution layer, the maximum pooling layer and the first full connection layer;
in operation S204, the encoding result is processed by using the first full connection layer to obtain a prediction tag, where the prediction tag characterizes whether a speech signal exists in the speech training sample;
In operation S205, the encoding result is processed by using the residual decoding module to obtain a prediction result, wherein the initial speech detection model includes a convolutional encoding module and a residual decoding module;
in operation S206, the prediction tag and the prediction result are input into the loss function, and the loss result is output;
in operation S207, the network parameters of the initial voice detection model are iteratively adjusted according to the loss result, resulting in a trained voice activity detection model.
According to an embodiment of the present disclosure, each speech training sample in the training set is converted into a target log-Mel spectrum feature X, as shown in fig. 3. The target log-Mel spectrum feature is then input into the convolutional encoding module, so that the gated convolution layer and the max pooling layer in the convolutional encoding module generate an encoding result from the target log-Mel spectrum feature. Because the convolutional encoding module of the present disclosure contains only a gated convolutional neural network (Convolutional Neural Network, CNN) layer and a max pooling layer, it eliminates the batch normalization (Batch Normalization, BN) layer used in the spectral attention module of the STAM model, which effectively reduces the parameter count without severely impacting performance. Meanwhile, the encoding result is also input into an additional first full connection layer, which outputs a binary result, namely a prediction label, indicating whether voice is contained. If the binary result is 1, a voice signal exists in the voice training sample; if the binary result is 0, no voice signal exists in the voice training sample.
According to the embodiment of the disclosure, a residual decoding module is utilized to process a coding result to obtain a prediction result, the prediction result is also a binary result of whether voice is contained, the prediction label and the prediction result are input into a loss function, and the loss result is output, so that network parameters of an initial voice detection model are iteratively adjusted according to the loss result, and a trained voice activity detection model can be obtained.
According to the embodiment of the disclosure, the voice activity detection model is obtained by converting a voice training sample into a target log-mel spectrum feature, then generating a coding result according to the target log-mel spectrum feature by using a convolution coding module only comprising a gating convolution layer and a maximum pooling layer, and generating a prediction tag and a prediction result by using a first full connection layer and a residual decoding module according to the coding result respectively so as to adjust network parameters of an initial voice detection model according to a loss result calculated by the prediction tag and the prediction result. The convolution coding module only comprises the gating convolution layer and the maximum pooling layer, so that the batch normalization layer is omitted, the parameter number of the model is reduced, and the model is light under the condition that the model prediction accuracy is not obviously reduced.
According to an embodiment of the present disclosure, a conversion process is performed on a voice training sample to obtain a target log mel-spectrum feature, including: carrying out framing treatment on the voice training sample to obtain a plurality of voice sub-signals; for each voice sub-signal, carrying out short-time Fourier transform processing on the voice sub-signal to obtain frequency domain information; carrying out logarithmic Mel filtering processing on the frequency domain information to obtain initial logarithmic Mel spectrum characteristics; and generating target logarithmic Mel spectrum characteristics according to the plurality of initial logarithmic Mel spectrum characteristics.
According to an embodiment of the present disclosure, in order to obtain frame-level characteristic signals, the input speech training sample is framed to obtain a plurality of speech sub-signals. Next, each frame's voice sub-signal is converted by a 1024-point short-time Fourier transform (Short-Time Fourier Transform, STFT) to obtain frequency domain information, and the frequency domain information is then passed through a log-Mel filter bank with D = 80 bands to obtain initial log-Mel spectrum features. Finally, the initial log-Mel spectrum features of multiple frames are concatenated to obtain the target log-Mel spectrum feature with context information, i.e., information of the current frame and past frames.
According to an embodiment of the present disclosure, since the target log-Mel spectrum feature and the prediction label input to the voice activity detection model are constructed from a series of frames, the feature vector at time index t contains information of the current and past frames, as given by equation (1):

X_t = [F_{t+t_0}, F_{t+t_1}, F_{t+t_2}, ..., F_{t+t_n}]^T    (1)

where X_t is the target log-Mel spectrum feature, t = [t_0, t_1, t_2, ..., t_n] denotes the set of relative time indices of the frames under consideration, and F denotes the frame-level log-Mel spectrum acoustic feature. Similarly, the vector of prediction labels for the current time step is shown in equation (2):

L_t = [l_{t+t_0}, l_{t+t_1}, l_{t+t_2}, ..., l_{t+t_n}]^T    (2)

where ^T denotes the transpose of a vector/matrix. It is evident from this feature construction that only current and past information is used when detecting the state of the current frame, and no future frames are needed, which yields the causality of the proposed voice activity detection model. Specifically, when an element l is 1, the corresponding frame contains a voice signal; when l is 0, the corresponding frame contains no voice signal.
According to an embodiment of the present disclosure, framing a speech training sample to obtain a plurality of speech sub-signals includes: based on the preset step length, the voice training sample is segmented by utilizing a preset hanning window, so that a plurality of voice sub-signals are obtained.
According to embodiments of the present disclosure, the preset step size and the preset Hanning window may be adjusted according to the actual situation, for example, a preset step size of 10 ms with a 25 ms Hanning window.
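As a concrete illustration, the following is a minimal sketch of this feature-extraction pipeline in PyTorch/torchaudio. The 25 ms Hanning window, 10 ms step, 1024-point STFT and D = 80 Mel bands follow the values given in this embodiment; the 16 kHz sampling rate, the amount of past context, and all function and variable names are assumptions for illustration only.

import torch
import torchaudio

SAMPLE_RATE = 16000  # assumed; the disclosure does not state a sampling rate

# 25 ms Hanning window, 10 ms step, 1024-point STFT, D = 80 Mel bands.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1024,
    win_length=int(0.025 * SAMPLE_RATE),  # 400 samples = 25 ms
    hop_length=int(0.010 * SAMPLE_RATE),  # 160 samples = 10 ms
    window_fn=torch.hann_window,
    n_mels=80,
)

def target_log_mel(waveform: torch.Tensor, n_context: int = 4) -> torch.Tensor:
    """Convert a mono waveform of shape (1, num_samples) into causal target
    log-Mel features: each time step concatenates the current frame with the
    past n_context frames, and never looks at future frames."""
    mel = mel_transform(waveform)                 # (1, 80, num_frames)
    log_mel = torch.log(mel + 1e-6).squeeze(0).T  # (num_frames, 80)
    # Left-pad by repeating the first frame so early frames have full context.
    padded = torch.cat([log_mel[:1].repeat(n_context, 1), log_mel], dim=0)
    frames = [padded[i : i + n_context + 1].reshape(-1)
              for i in range(log_mel.shape[0])]
    return torch.stack(frames)                    # (num_frames, 80*(n_context+1))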
Fig. 4 schematically illustrates a structural diagram of a convolutional encoding module according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the gated convolutional layer comprises a plurality of convolutional layers.
Processing the target log-Mel spectrum feature by using the gating convolution layer and the max pooling layer to obtain the coding result includes the following steps: processing the target log-Mel spectrum feature by utilizing the convolution layers of the first part to obtain a voice feature; processing the target log-Mel spectrum feature by utilizing the convolution layers of the second part to obtain mask weights; generating an initial convolution feature according to the voice feature and the mask weights; and processing the initial convolution feature by using the max pooling layer to generate the coding result.
According to an embodiment of the present disclosure, the convolution kernel size in the convolution layers is 3×3. The numbers of input and output channels increase layer by layer from {1, 2} to {8, 16}: when a convolution layer is added, the input and output channel counts of that layer are twice those of the previous layer. The window size of the pooling layer is 2×2. The numbers of hidden units and output units in the first full connection layer are 256 and 1, respectively. In the present embodiment, the number of convolution layers is empirically set to 4.
It should be noted that, the above parameters may be specifically set according to actual requirements, and the setting of the above parameters is not a limitation on the protection scope of the present disclosure, and for example, the number of layers of the convolution layer or the size of the convolution kernel may be set according to actual requirements.
In accordance with an embodiment of the present disclosure, the convolutional encoding module consists of a gated convolution layer composed of multiple convolution layers and a max pooling layer, as indicated by the dashed box in fig. 4. Given the input target log-Mel spectrum feature X, the upper convolution layers are responsible for extracting the voice feature CX, while the lower convolution layers generate the corresponding mask weights MX. The voice feature CX is multiplied by the mask weights MX to generate an initial convolution feature, which is then input to the max pooling layer. The max pooling layer outputs the encoding result E. At the same time, the encoding result E is sent to an additional first full connection layer to obtain the prediction label L̂_E used for loss calculation during training, where the prediction label L̂_E may be regarded as a preliminary prediction output of the voice activity detection model.
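The PyTorch sketch below illustrates one plausible realization of this convolutional encoding module. The channel progression {1→2→4→8→16}, 3×3 kernels, 2×2 max pooling, 256 hidden units, and the absence of batch normalization follow the description above; the sigmoid mask branch, the single pooling step after the gated stack, and all names are assumptions rather than the disclosure's exact architecture.

import torch
import torch.nn as nn

class GatedConvEncoder(nn.Module):
    """Convolutional encoding module: gated convolution layers plus max
    pooling (no batch normalization layer), with an auxiliary first fully
    connected layer producing the prediction label."""

    def __init__(self, channels=(1, 2, 4, 8, 16), hidden=256):
        super().__init__()
        self.feat_convs = nn.ModuleList()  # upper path: voice features CX
        self.mask_convs = nn.ModuleList()  # lower path: mask weights MX
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            self.feat_convs.append(nn.Conv2d(c_in, c_out, 3, padding=1))
            self.mask_convs.append(nn.Conv2d(c_in, c_out, 3, padding=1))
        self.pool = nn.MaxPool2d(2)  # 2x2 pooling window
        # LazyLinear avoids hard-coding the flattened feature size.
        self.fc = nn.Sequential(nn.Flatten(1), nn.LazyLinear(hidden),
                                nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        # x: (batch, 1, time, mel) target log-Mel spectrum feature
        for feat, mask in zip(self.feat_convs, self.mask_convs):
            cx = feat(x)                 # voice feature CX
            mx = torch.sigmoid(mask(x))  # mask weight MX in [0, 1]
            x = cx * mx                  # gated initial convolution feature
        encoding = self.pool(x)          # max pooling -> encoding result E
        label_logit = self.fc(encoding)  # prediction label logit
        return encoding, label_logit

For example, with a 5-frame context and 80 Mel bands, GatedConvEncoder()(torch.randn(8, 1, 5, 80)) returns an encoding E of shape (8, 16, 2, 40) plus one label logit per sample.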
Fig. 5 schematically illustrates a structural diagram of a residual decoding module according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, as shown in fig. 5, the residual decoding module includes i residual blocks connected in sequence and a second full connection layer connected to the last residual block.
Processing the coding result by using the residual decoding module to obtain the prediction result includes the following steps:
in the case where i is equal to 1, processing the coding result by using the i-th residual block to obtain an initial output result; in the case where i is not equal to 1, processing the output result of the (i-1)-th residual block by using the i-th residual block to obtain a target output result; and processing the target output result output by the last residual block by using the second full connection layer to obtain the prediction result.
According to embodiments of the present disclosure, the prediction label L̂_E is the output obtained through the first full connection layer based on the encoding result E alone. This means that the encoding result E already contains some semantic information, and it is desirable to preserve this information as much as possible in the residual decoding module. To achieve this, residual connections are used in the residual decoding module, which help preserve the information from the previous stage so that only the residual error of the characteristic information needs to be trained. By stacking multiple residual blocks, the module can achieve a large receptive field with a small number of parameters, which benefits from the nature of the convolution kernels in the residual blocks. Four residual convolution blocks (i.e., the residual blocks of the present disclosure) may be configured in the residual decoding module, each residual convolution block including at least two convolution layers, the plurality of residual convolution blocks forming a pipelined structure.
According to an embodiment of the disclosure, referring to fig. 5, after the encoding result E is input to the residual decoding module, the 1st residual block generates an initial output result from the encoding result E; the initial output result is input to the 2nd residual block, which outputs a target output result; the target output result of the 2nd residual block is input to the 3rd residual block, which outputs a target output result; and so on. The target output result of the last residual block is processed by the second full connection layer to obtain the prediction result L̂_D.
According to an embodiment of the present disclosure, for the i-th residual block: processing input information by using a convolution block to obtain an initial output characteristic, wherein the convolution block comprises a plurality of convolution layers, and the input information represents a coding result or an initial output result of an i-th residual block; and generating target output characteristics according to the input information and the initial output characteristics, wherein the target output characteristics represent an initial output result or a target output result output by the ith residual block.
According to an embodiment of the present disclosure, the convolution kernel size of the convolution layers in the convolution block is 3×3, the number of channels in each convolution layer is {1,4,1}, and the number of output units in the second full connection layer is 1. The number of residual blocks can be specifically set according to practical requirements, for example, 4.
According to an embodiment of the disclosure, for each residual block, taking the 1st residual block as an example, the convolution layers in the 1st residual block perform feature extraction on the encoding result E to obtain an initial output feature, and the sum of the initial output feature and the encoding result E is taken as the output of the 1st residual block, i.e., the target output feature. The target output feature of the 1st residual block is then taken as the input information of the 2nd residual block and processed in the same manner as in the 1st residual block, and so on; the target output feature of the last residual block is input to the second full connection layer to obtain the prediction result L̂_D.
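Under the stated values (four residual blocks, 3×3 kernels, per-block channel sequence {1, 4, 1}, a second fully connected layer with one output unit), a matching PyTorch sketch of the residual decoding module might look as follows; how the multi-channel encoding E is mapped to the single-channel blocks is not specified above, so the reshape noted in the comments, like all names here, is an assumption.

import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block: a small convolution stack whose output is added back
    to its input, so only the residual error is learned."""

    def __init__(self):
        super().__init__()
        # Per-block channel sequence {1, 4, 1} with 3x3 kernels.
        self.conv_block = nn.Sequential(
            nn.Conv2d(1, 4, 3, padding=1), nn.ReLU(),
            nn.Conv2d(4, 1, 3, padding=1),
        )

    def forward(self, x):
        return x + self.conv_block(x)  # skip connection preserves E's information

class ResidualDecoder(nn.Module):
    """Residual decoding module: sequentially connected residual blocks
    followed by a second fully connected layer with one output unit."""

    def __init__(self, num_blocks: int = 4):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualBlock() for _ in range(num_blocks)])
        self.fc = nn.Sequential(nn.Flatten(1), nn.LazyLinear(1))

    def forward(self, encoding):
        # encoding: result E, assumed reshaped to (batch, 1, H, W), e.g. by
        # folding E's channels into the spatial dimensions beforehand.
        out = self.blocks(encoding)
        return self.fc(out)            # prediction result logit (L̂_D)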
According to an embodiment of the present disclosure, inputting the prediction label and the prediction result into the loss function and outputting the loss result includes: based on a preset activation function, inputting the prediction label and the prediction result into a binary cross entropy function, and outputting the loss result, wherein the binary cross entropy function is generated according to a hyperparameter determined by the relative importance of the convolutional encoding module and the residual decoding module.
According to an embodiment of the present disclosure, adam optimizer may be used in training, learning rate from 10 -3 Becomes 10 -5 The learning rate per round was 0.8 relative to the decay rate of the previous step. Predictive tags using the output of a convolutional encoding module And the prediction result of the output of the residual decoding module +.>To calculate the binary Cross Entropy (CE) loss relative to the true value, the loss result is shown in formula (3):
wherein CE represents a binary cross entropy loss function, S represents a sigmoid activation function, and the super parameter k is used for specifying the importance of the convolutional encoding module and the residual decoding module, and may be specifically set according to actual requirements, for example, 0.7. Note that all frames are considered in the loss calculation.
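A minimal sketch of this loss under the reconstructed form of equation (3), taking raw logits from both branches (PyTorch's binary_cross_entropy_with_logits applies the sigmoid S internally); k = 0.7 follows the example value above, and the function name is illustrative:

import torch
import torch.nn.functional as F

def vad_loss(encoder_logit: torch.Tensor,
             decoder_logit: torch.Tensor,
             target: torch.Tensor,
             k: float = 0.7) -> torch.Tensor:
    """Equation (3): k * CE(S(L_E), L) + (1 - k) * CE(S(L_D), L),
    computed over all frames."""
    loss_enc = F.binary_cross_entropy_with_logits(encoder_logit, target)
    loss_dec = F.binary_cross_entropy_with_logits(decoder_logit, target)
    return k * loss_enc + (1.0 - k) * loss_dec

# Assumed optimizer setup matching the text: Adam starting at 1e-3 with a
# per-round decay ratio of 0.8 (ExponentialLR), decaying toward 1e-5.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.8)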
In one embodiment, model training and experimental verification of the present disclosure were performed on an Intel Xeon E5-2680 CPU and an NVIDIA GeForce RTX 3090 GPU. The model was built using the Python language and the PyTorch framework. In the hyperparameter configuration, all experiments used an Adam optimizer, with the learning rate decayed from 10^-3 to 10^-5 and a per-round learning rate decay ratio of 0.8 relative to the previous step.
Two datasets, QUT-NOISE-TIMIT and LibriSpeech, were used to verify the VAD performance of the present disclosure. The QUT-NOISE-TIMIT dataset is formed by mixing the TIMIT dataset with the QUT-NOISE background noise dataset. This mixing yields noisy speech data with an accumulated duration of 600 hours across ten scenes at six different SNR levels (-10, -5, 0, 5, 10, 15) dB. To maintain the independence of the training set and the test set, 100 hours of data were randomly selected as the training set and another 100 hours as the test set; thus, the training set and the test set have no overlapping segments in terms of either human voice or background noise.
Since LibriSpeech was collected primarily for the English ASR task, the present disclosure mixes the LibriSpeech dev-clean dataset with NoiseX-92 to create another noisy test dataset containing 15 environments and 15 hours of recordings at 6 different SNR levels (-10, -5, 0, 5, 10, 15) dB. Since the focus of the present disclosure is VAD under low signal-to-noise conditions, only the results for SNR ∈ {-10, -5, 0, 5} dB are tested.
According to an embodiment of the present disclosure, the QUT-NOISE-TIMIT dataset is used for both training and testing, while the LibriSpeech dataset is used for testing only. ACAM, STAM, and the model proposed by the present disclosure were each tested on the two datasets. The input to STAM is non-causal, whereas the input of the present disclosure is causal.
According to an embodiment of the present disclosure, the accuracy of the VAD is measured using the area under the curve (Area Under the Curve, AUC), i.e., the area under the receiver operating characteristic (Receiver Operating Characteristic, ROC) curve. In addition, the THOP package in Python is used to calculate the number of parameters and floating point operations (FLOPs), which measure the spatial complexity and the computational complexity of the model, respectively.
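For reference, complexity figures of this kind are typically obtained with THOP's profile function as sketched below; the stand-in model and the dummy input shape are placeholders, not the disclosure's actual model.

import torch
import torch.nn as nn
from thop import profile  # pip install thop

# Stand-in model; in practice the trained VAD model would be profiled instead.
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1),
    nn.Flatten(1),
    nn.Linear(16 * 5 * 80, 1),
)
dummy = torch.randn(1, 1, 5, 80)  # assumed (batch, channel, frames, Mel bands)
flops, params = profile(model, inputs=(dummy,))
print(f"FLOPs: {flops:.0f}, parameters: {params:.0f}")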
Tables 1 and 2 compare the prediction results on the two datasets, respectively, according to embodiments of the present disclosure. On both datasets, the method proposed by the present disclosure clearly achieves the best performance regardless of noise level, surpassing the existing best non-causal STAM model. This is mainly because the proposed convolutional encoding module is already capable of raw VAD prediction, and the residual decoding provides further label refinement.
Table 1. AUC scores on the QUT-NOISE-TIMIT dataset.
Table 2. AUC scores on the noisy LibriSpeech dataset.
It can be observed from table 3 that, in addition to its performance advantage over the existing best non-causal STAM model, the method proposed by the present disclosure introduces fewer FLOPs and parameters, showing wider applicability.
Table 3. Complexity comparison of the models.
Table 4 shows a more complete comparison with the STAM model on the QUT-NOISE-TIMIT dataset, considering several types of additive noise and different SNR noise levels. It can be seen that the method proposed by the present disclosure achieves better results than the non-causal STAM in the CAFE-FOODCOURTB, REVERB-CARPARK, and REVERB-POOL environments, where REVERB in the latter two denotes a reverberant environment. Under the remaining conditions, the performance of the two methods is almost equivalent. In the two reverberant environments, the AUC of the voice activity detection model proposed by the present disclosure is on average 2.8% higher than that of STAM, showing greater robustness against reverberation.
Table 4. AUC scores for different environments on the QUT-NOISE-TIMIT dataset.
Moreover, the voice activity detection model proposed by the present disclosure has a real-time factor of about 0.03 on the QUT-NOISE-TIMIT dataset, which is acceptable for real-time applications.
Overall, compared with STAM, the voice activity detection model proposed by the present disclosure solves the non-causality problem while improving VAD accuracy by 0.74% on average, with fewer parameters and less computation.
Fig. 6 schematically illustrates a flowchart of a voice activity detection method according to an embodiment of the present disclosure.
As shown in fig. 6, the voice activity detection method includes operations S601 to S602.
In operation S601, detected voice information is acquired;
in operation S602, the detected voice information is input into the voice activity detection model, and a recognition result is output, wherein the recognition result characterizes whether a voice signal exists in the detected voice information.
According to the embodiment of the disclosure, the detected voice information can be collected by a voice collection device such as a microphone, and the collected detected voice information is input into the voice activity detection model of the disclosure, so that whether a voice signal exists in the detected voice information can be recognized.
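As an illustrative end-to-end usage, the sketch below strings together the hypothetical feature-extraction and model components from earlier in this description; the 0.5 decision threshold, the channel-folding reshape, and all names are assumptions.

import torch

def detect_voice(waveform: torch.Tensor, encoder, decoder) -> torch.Tensor:
    """Return a per-frame 0/1 decision for the detected voice information.
    Assumes target_log_mel, GatedConvEncoder and ResidualDecoder from the
    sketches above, with trained weights already loaded."""
    feats = target_log_mel(waveform)           # (num_frames, 80 * (context+1))
    x = feats.view(feats.shape[0], 1, -1, 80)  # (frames, 1, context frames, 80)
    with torch.no_grad():
        encoding, _ = encoder(x)               # encoding result E per frame
        # Fold E's channels into the spatial dims for the 1-channel decoder.
        b, c, h, w = encoding.shape
        logit = decoder(encoding.reshape(b, 1, c * h, w))
    return (torch.sigmoid(logit) > 0.5).long().squeeze(-1)  # 1 = voice present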
According to the embodiment of the disclosure, a voice training sample is converted into a target log-mel spectrum characteristic in the training process of a voice activity detection model, then a convolution coding module only comprising a gating convolution layer and a maximum pooling layer is utilized to generate a coding result according to the target log-mel spectrum characteristic, a prediction label and a prediction result are respectively generated by utilizing a first full-connection layer and a residual decoding module according to the coding result, and network parameters of an initial voice detection model are adjusted according to the loss results calculated by the prediction label and the prediction result, so that the voice activity detection model is realized. The convolution coding module only comprises the gating convolution layer and the maximum pooling layer, so that the batch normalization layer is omitted, the parameter number of the model is reduced, the model is light under the condition that the model prediction accuracy is not obviously reduced, and efficient operation under the environment with limited resources is facilitated.
Fig. 7 schematically illustrates a block diagram of a training apparatus of a speech activity detection model according to an embodiment of the disclosure.
As shown in fig. 7, the training apparatus 700 of the voice activity detection model includes a first acquisition module 710, a conversion module 720, a first processing module 730, a second processing module 740, a third processing module 750, a loss calculation module 760, and an iteration adjustment module 770.
A first obtaining module 710, configured to obtain a training set, where the training set includes a plurality of voice training samples;
the conversion module 720 is configured to perform conversion processing on the voice training samples for each voice training sample to obtain a target log mel spectrum feature;
a first processing module 730, configured to process the target log mel spectrum feature by using a gating convolution layer and a max pooling layer to obtain a coding result, where the convolution coding module includes the gating convolution layer, the max pooling layer, and a first full connection layer;
a second processing module 740, configured to process the encoding result by using the first full-connection layer to obtain a prediction tag, where the prediction tag characterizes whether a speech signal exists in the speech training sample;
a third processing module 750, configured to process the encoding result by using a residual decoding module to obtain a prediction result, where the initial speech detection model includes a convolutional encoding module and a residual decoding module;
The loss calculation module 760 is configured to input the prediction label and the prediction result into a loss function, and output a loss result;
the iteration adjusting module 770 is configured to iteratively adjust network parameters of the initial voice detection model according to the loss result to obtain a trained voice activity detection model.
According to the embodiment of the disclosure, the voice activity detection model is obtained by converting a voice training sample into a target log-mel spectrum feature, then generating a coding result according to the target log-mel spectrum feature by using a convolution coding module only comprising a gating convolution layer and a maximum pooling layer, and generating a prediction tag and a prediction result by using a first full connection layer and a residual decoding module according to the coding result respectively so as to adjust network parameters of an initial voice detection model according to a loss result calculated by the prediction tag and the prediction result. The convolution coding module only comprises the gating convolution layer and the maximum pooling layer, so that the batch normalization layer is omitted, the parameter number of the model is reduced, and the model is light under the condition that the model prediction accuracy is not obviously reduced.
According to an embodiment of the present disclosure, the conversion module 720 includes a framing unit, a conversion unit, a filtering unit, and a first generation unit.
The framing unit is used for framing the voice training samples to obtain a plurality of voice sub-signals;
the conversion unit is used for carrying out short-time Fourier transform processing on the voice sub-signals aiming at each voice sub-signal to obtain frequency domain information;
the filtering unit is used for carrying out logarithmic Mel filtering processing on the frequency domain information to obtain initial logarithmic Mel spectrum characteristics;
the first generation unit is used for generating target logarithmic Mel spectrum characteristics according to the initial logarithmic Mel spectrum characteristics.
According to an embodiment of the present disclosure, the framing unit comprises a framing sub-unit.
The framing sub-unit is used for segmenting the voice training sample by using a preset Hanning window based on a preset step length to obtain a plurality of voice sub-signals.
According to an embodiment of the present disclosure, a gated convolutional layer includes a plurality of convolutional layers;
according to an embodiment of the present disclosure, the first processing module 730 includes a first convolution unit, a second convolution unit, a second generation unit, and a third generation unit.
The first convolution unit is used for processing the target log-mel spectrum characteristic by utilizing the convolution layer of the first part to obtain a voice characteristic;
the second convolution unit is used for processing the target log-mel spectrum characteristic by utilizing the convolution layer of the second part to obtain mask weights;
The second generating unit is used for generating initial convolution characteristics according to the voice characteristics and the mask weights;
and the third generating unit is used for processing the initial convolution characteristic by utilizing the maximum pooling layer and generating a coding result.
According to an embodiment of the present disclosure, the residual decoding module includes i residual blocks connected in sequence and a second full connection layer connected with the last residual block;
according to an embodiment of the present disclosure, the third processing module 750 includes a first processing unit, a second processing unit, and an obtaining unit.
The first processing unit is used for processing the coding result by using the i-th residual block in the case where i is equal to 1, to obtain an initial output result;
the second processing unit is used for processing the output result of the (i-1)-th residual block by using the i-th residual block in the case where i is not equal to 1, to obtain a target output result;
and the obtaining unit is used for processing a target output result output by the last residual block by using the second full connection layer to obtain a prediction result.
According to an embodiment of the present disclosure, each of the first processing unit and the second processing unit includes a convolution subunit and a generating subunit.
The convolution subunit is used for processing input information by using a convolution block to obtain an initial output characteristic, wherein the convolution block comprises a plurality of convolution layers, and the input information represents the coding result or the initial output result output by the (i-1)-th residual block;
And the generating subunit is used for generating target output characteristics according to the input information and the initial output characteristics, wherein the target output characteristics represent an initial output result or a target output result output by the ith residual block.
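A minimal residual-block sketch consistent with the description above, again assuming PyTorch; the element-wise sum of the input information and the initial output characteristic is the standard residual combination and is an assumption here, as are the layer count and width:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One residual block: a convolution block plus a skip connection.

    The convolution block comprises a plurality of convolution layers;
    their number and width are illustrative assumptions.
    """

    def __init__(self, channels=16):
        super().__init__()
        self.conv_block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, input_info):
        # Initial output characteristic obtained from the convolution block.
        initial_out = self.conv_block(input_info)
        # Target output characteristic generated from the input
        # information and the initial output characteristic.
        return input_info + initial_out
```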
According to an embodiment of the present disclosure, the loss calculation module 760 includes a loss calculation unit.
And the loss calculation unit is used for inputting the prediction label and the prediction result into a binary cross entropy function based on a preset activation function and outputting the loss result, wherein the binary cross entropy function is generated according to a super-parameter determined by the relative importance between the convolution coding module and the residual decoding module.
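One plausible reading of this loss — a hedged sketch assuming a sigmoid as the preset activation function and a single hypothetical hyper-parameter `alpha` weighting the two branches — is:

```python
import torch.nn.functional as F

def combined_loss(pred_label, pred_result, target, alpha=0.5):
    """Binary cross entropy over both branches.

    `alpha` is a hypothetical stand-in for the super-parameter that the
    disclosure says is determined by the relative importance of the
    convolution coding module and the residual decoding module.
    """
    # The sigmoid activation is folded into the with-logits form.
    loss_label = F.binary_cross_entropy_with_logits(pred_label, target)
    loss_result = F.binary_cross_entropy_with_logits(pred_result, target)
    return alpha * loss_label + (1.0 - alpha) * loss_result
```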
Fig. 8 schematically illustrates a block diagram of a voice activity detection apparatus according to an embodiment of the present disclosure.
As shown in fig. 8, the voice activity detection apparatus 800 includes a second acquisition module 810 and a detection module 820.
A second obtaining module 810, configured to obtain detected voice information;
the detection module 820 is configured to input the detected voice information into the voice activity detection model, and output a recognition result, where the recognition result characterizes whether a voice signal exists in the detected voice information.
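At inference time, usage might look like the following sketch, reusing the hypothetical `extract_log_mel` front end shown earlier; the 0.5 decision threshold and the single-logit output are assumptions:

```python
import torch

def detect_voice_activity(model, wave_path, threshold=0.5):
    """Run a trained detection model on one recording.

    Assumes `model` maps a (batch, channel, mel, time) tensor to a
    single detection logit; `extract_log_mel` is the illustrative
    front end sketched earlier.
    """
    features = torch.from_numpy(extract_log_mel(wave_path)).float()
    features = features.unsqueeze(0).unsqueeze(0)  # (batch, channel, mel, time)
    model.eval()
    with torch.no_grad():
        prob = torch.sigmoid(model(features))
    # Recognition result: whether a voice signal exists in the input.
    return bool(prob.item() > threshold)
```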
According to the embodiments of the present disclosure, during the training of the voice activity detection model, a voice training sample is converted into a target log-mel spectrum characteristic; a convolution coding module comprising only a gating convolution layer and a maximum pooling layer generates a coding result according to the target log-mel spectrum characteristic; a first full connection layer and a residual decoding module respectively generate a prediction label and a prediction result according to the coding result; and the network parameters of the initial voice detection model are adjusted according to the loss result calculated from the prediction label and the prediction result, thereby obtaining the voice activity detection model. Because the convolution coding module comprises only the gating convolution layer and the maximum pooling layer, the batch normalization layer is omitted, which reduces the number of model parameters, makes the model lightweight without a noticeable drop in prediction accuracy, and facilitates efficient operation in resource-constrained environments.
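Tying the sketches together, the following assembles an illustrative end-to-end model and one training step; `GatedConvEncoder`, `ResidualBlock`, and `combined_loss` are the hypothetical components above, and every size, the optimizer, and the learning rate are assumptions rather than the disclosure's prescription:

```python
import torch
import torch.nn as nn

class VADModel(nn.Module):
    """Illustrative assembly of the convolution coding module (with a
    first full connection layer) and the residual decoding module
    (with a second full connection layer)."""

    def __init__(self, channels=16, num_blocks=3):
        super().__init__()
        self.encoder = GatedConvEncoder(in_ch=1, out_ch=channels)
        self.blocks = nn.Sequential(
            *[ResidualBlock(channels) for _ in range(num_blocks)])
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # illustrative shape reduction
        self.fc1 = nn.Linear(channels, 1)  # first full connection layer
        self.fc2 = nn.Linear(channels, 1)  # second full connection layer

    def forward(self, x):
        coding = self.encoder(x)
        pred_label = self.fc1(self.squeeze(coding).flatten(1))
        decoded = self.blocks(coding)
        pred_result = self.fc2(self.squeeze(decoded).flatten(1))
        return pred_label, pred_result

# One illustrative training step on a dummy batch of log-Mel features.
model = VADModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 1, 64, 100)                 # (batch, channel, mel, time)
target = torch.randint(0, 2, (8, 1)).float()   # voice present / absent
pred_label, pred_result = model(x)
loss = combined_loss(pred_label, pred_result, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```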
Any number of the modules, units, or sub-units according to embodiments of the present disclosure, or at least some of the functionality of any number of them, may be implemented in one module. Any one or more of the modules, units, or sub-units according to embodiments of the present disclosure may be split into multiple modules for implementation. Any one or more of the modules, units, or sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, or an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging circuits, or in any one of, or a suitable combination of, the three implementation manners of software, hardware, and firmware. Alternatively, one or more of the modules, units, or sub-units according to embodiments of the present disclosure may be at least partially implemented as computer program modules which, when executed, may perform the corresponding functions.
For example, any number of the first acquisition module 710, the conversion module 720, the first processing module 730, the second processing module 740, the third processing module 750, the loss calculation module 760, and the iteration adjustment module 770, or of the second acquisition module 810 and the detection module 820, may be combined into one module/unit/sub-unit, or any one of them may be split into multiple modules/units/sub-units. Alternatively, at least some of the functionality of one or more of these modules/units/sub-units may be combined with at least some of the functionality of other modules/units/sub-units and implemented in one module/unit/sub-unit. According to embodiments of the present disclosure, at least one of the first acquisition module 710, the conversion module 720, the first processing module 730, the second processing module 740, the third processing module 750, the loss calculation module 760, and the iteration adjustment module 770, or of the second acquisition module 810 and the detection module 820, may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, or an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging circuits, or in any one of, or a suitable combination of, the three implementation manners of software, hardware, and firmware. Alternatively, at least one of these modules may be at least partially implemented as a computer program module which, when executed, may perform the corresponding functions.
It should be noted that the training apparatus for the voice activity detection model and the voice activity detection apparatus in the embodiments of the present disclosure correspond, respectively, to the training method for the voice activity detection model and the voice activity detection method in the embodiments of the present disclosure; for details of the apparatus portions, reference may be made to the corresponding method portions, which are not repeated here.
Fig. 9 schematically shows a block diagram of an electronic device adapted to implement the method described above, according to an embodiment of the disclosure. The electronic device shown in fig. 9 is merely an example, and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 9, an electronic device 900 according to an embodiment of the present disclosure includes a processor 901 that can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. The processor 901 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 901 may also include on-board memory for caching purposes. The processor 901 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 903, various programs and data necessary for the operation of the electronic device 900 are stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. The processor 901 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 902 and/or the RAM 903. Note that the program may be stored in one or more memories other than the ROM 902 and the RAM 903. The processor 901 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 900 may also include an input/output (I/O) interface 905, which is also connected to the bus 904. The electronic device 900 may also include one or more of the following components connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a LAN card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as needed so that a computer program read therefrom can be installed into the storage section 908.
According to embodiments of the present disclosure, the method flow according to embodiments of the present disclosure may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from the network via the communication portion 909 and/or installed from the removable medium 911. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 901. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM) or flash memory, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 902 and/or RAM 903 and/or one or more memories other than ROM 902 and RAM 903 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program, the computer program comprising program code for performing the methods provided by the embodiments of the present disclosure. When the computer program product runs on an electronic device, the program code causes the electronic device to implement the training method of the voice activity detection model or the voice activity detection method provided by the embodiments of the present disclosure.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal over a network medium, downloaded and installed via the communication section 909, and/or installed from the removable medium 911. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, or any suitable combination of the foregoing.
According to embodiments of the present disclosure, the program code for the computer programs provided by the embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, "C", or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the latter case, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.

Those skilled in the art will appreciate that the features recited in the various embodiments of the present disclosure and/or in the claims may be combined in various ways, even if such combinations are not explicitly recited in the present disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be combined without departing from the spirit and teachings of the present disclosure. All such combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (10)

1. A training method of a voice activity detection model, comprising:
acquiring a training set, wherein the training set comprises a plurality of voice training samples;
performing conversion processing on each voice training sample to obtain a target logarithmic Mel spectrum characteristic;
processing the target log Mel spectrum characteristics by using a gating convolution layer and a maximum pooling layer to obtain a coding result, wherein a convolution coding module comprises the gating convolution layer, the maximum pooling layer and a first full connection layer;
processing the coding result by using the first full connection layer to obtain a prediction label, wherein the prediction label represents whether a voice signal exists in the voice training sample;
processing the coding result by using a residual decoding module to obtain a prediction result, wherein an initial voice detection model comprises the convolution coding module and the residual decoding module;
inputting the prediction label and the prediction result into a loss function, and outputting a loss result;
and iteratively adjusting network parameters of the initial voice detection model according to the loss result to obtain a trained voice activity detection model.
2. The method of claim 1, wherein performing conversion processing on the voice training sample to obtain the target logarithmic Mel spectrum characteristic comprises:
performing framing processing on the voice training sample to obtain a plurality of voice sub-signals;
performing short-time Fourier transform processing on each voice sub-signal to obtain frequency domain information;
carrying out logarithmic Mel filtering processing on the frequency domain information to obtain initial logarithmic Mel spectrum characteristics;
and generating the target logarithmic Mel spectrum characteristics according to the initial logarithmic Mel spectrum characteristics.
3. The method of claim 2, wherein performing framing processing on the voice training sample to obtain the plurality of voice sub-signals comprises:
performing segmentation processing on the voice training sample by using a preset Hanning window based on a preset step size to obtain the plurality of voice sub-signals.
4. The method of claim 1, wherein the gating convolution layer comprises a plurality of convolution layers;
and processing the target logarithmic Mel spectrum characteristics by using the gating convolution layer and the maximum pooling layer to obtain the coding result comprises:
processing the target logarithmic mel spectrum characteristic by utilizing a convolution layer of the first part to obtain a voice characteristic;
processing the target logarithmic mel-spectrum characteristics by using a convolution layer of the second part to obtain mask weights;
generating initial convolution features according to the voice features and the mask weights;
and processing the initial convolution characteristic by using the maximum pooling layer to generate the coding result.
5. The method of claim 1, wherein the residual decoding module comprises i residual blocks connected in sequence and a second full connection layer connected to a last residual block;
wherein processing the coding result by using the residual decoding module to obtain the prediction result comprises:
processing the coding result by using the i-th residual block under the condition that i is equal to 1, to obtain an initial output result;
processing the initial output result output by the (i-1)-th residual block by using the i-th residual block under the condition that i is not equal to 1, to obtain a target output result;
and processing a target output result output by the last residual block by using the second full connection layer to obtain the prediction result.
6. The method of claim 5, wherein, for the i-th residual block:
processing input information by using a convolution block to obtain an initial output characteristic, wherein the convolution block comprises a plurality of convolution layers, and the input information represents the coding result or the initial output result output by the (i-1)-th residual block;
and generating a target output characteristic according to the input information and the initial output characteristic, wherein the target output characteristic represents an initial output result or a target output result output by the ith residual block.
7. The method of claim 1, wherein inputting the prediction label and the prediction result into the loss function and outputting the loss result comprises:
inputting the prediction label and the prediction result into a binary cross entropy function based on a preset activation function, and outputting the loss result, wherein the binary cross entropy function is generated according to a super-parameter determined by the relative importance between the convolution coding module and the residual decoding module.
8. A voice activity detection method, comprising:
acquiring detected voice information;
inputting the detected voice information into a voice activity detection model, and outputting a recognition result, wherein the recognition result represents whether a voice signal exists in the detected voice information;
wherein the voice activity detection model is trained based on the method of any one of claims 1 to 7.
9. A training apparatus of a voice activity detection model, comprising:
the first acquisition module is used for acquiring a training set, wherein the training set comprises a plurality of voice training samples;
the conversion module is used for performing conversion processing on each voice training sample to obtain a target logarithmic Mel spectrum characteristic;
the first processing module is used for processing the target logarithmic mel spectrum characteristics by using a gating convolution layer and a maximum pooling layer to obtain a coding result, wherein the convolution coding module comprises the gating convolution layer, the maximum pooling layer and a first full-connection layer;
the second processing module is used for processing the coding result by using the first full connection layer to obtain a prediction label, wherein the prediction label represents whether a voice signal exists in the voice training sample;
the third processing module is used for processing the coding result by using a residual decoding module to obtain a prediction result, wherein an initial voice detection model comprises the convolution coding module and the residual decoding module;
the loss calculation module is used for inputting the prediction label and the prediction result into a loss function and outputting a loss result;
and the iteration adjustment module is used for iteratively adjusting the network parameters of the initial voice detection model according to the loss result to obtain a trained voice activity detection model.
10. A voice activity detection apparatus comprising:
the second acquisition module is used for acquiring detected voice information;
the detection module is used for inputting the detected voice information into a voice activity detection model and outputting a recognition result, wherein the recognition result represents whether a voice signal exists in the detected voice information;
wherein the voice activity detection model is trained based on the method of any one of claims 1 to 7.