CN115035897B - Keyword detection method and system - Google Patents
Keyword detection method and system
- Publication number
- CN115035897B CN115035897B CN202210952631.8A CN202210952631A CN115035897B CN 115035897 B CN115035897 B CN 115035897B CN 202210952631 A CN202210952631 A CN 202210952631A CN 115035897 B CN115035897 B CN 115035897B
- Authority
- CN
- China
- Prior art keywords
- layer
- time domain
- frequency spectrum
- convolution
- processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 64
- 238000001228 spectrum Methods 0.000 claims abstract description 54
- 230000004913 activation Effects 0.000 claims abstract description 28
- 238000000034 method Methods 0.000 claims abstract description 16
- 238000013528 artificial neural network Methods 0.000 claims abstract description 13
- 238000013527 convolutional neural network Methods 0.000 claims abstract description 12
- 238000000605 extraction Methods 0.000 claims abstract description 9
- 230000008569 process Effects 0.000 claims description 9
- 238000011176 pooling Methods 0.000 claims description 8
- 238000001914 filtration Methods 0.000 claims description 6
- 230000008859 change Effects 0.000 claims description 5
- 238000009432 framing Methods 0.000 claims description 5
- 230000037433 frameshift Effects 0.000 claims description 3
- 238000004364 calculation method Methods 0.000 abstract description 6
- 230000006870 function Effects 0.000 description 21
- 238000012935 Averaging Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 1
- 230000000750 progressive effect Effects 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Complex Calculations (AREA)
Abstract
The invention relates to a keyword detection method and a keyword detection system. The method comprises: acquiring a voice signal to be processed; performing feature extraction on the voice signal to determine Mel-spectrum features, and performing dimension-changing processing on the Mel-spectrum features, after which the features are one-dimensional; and determining a detection result from the dimension-changed Mel-spectrum features using a trained time-domain convolutional neural network. Each time-domain bottleneck block in the trained network comprises, connected in sequence: a point convolution layer, a BN layer, a ReLU activation function, a depthwise time-domain convolution layer, a BN layer, a ReLU activation function, a point convolution layer, a BN layer and a ReLU activation function; within each bottleneck block, the input of the first point convolution layer is connected to the output of the last BN layer. The method and system reduce the computation required for keyword detection and improve its accuracy.
Description
Technical Field
The invention relates to the technical field of keyword detection in voice recognition, in particular to a keyword detection method and a keyword detection system.
Background
Keyword detection systems typically run on mobile devices (e.g., Apple's "Hey Siri" and Xiaomi's "Xiao Ai"), whose memory is small and whose computing power is limited, so a keyword detection system must simultaneously achieve high accuracy, low latency, low memory use and low computation. The conventional convolutional neural networks commonly used in keyword detection rely on two-dimensional convolutions, which are computationally complex and noticeably time-consuming.
Therefore, it is desirable to provide a method or system capable of reducing the amount of calculation for keyword detection and improving the accuracy of keyword detection.
Disclosure of Invention
The invention aims to provide a keyword detection method and a keyword detection system, which can reduce the calculation amount of keyword detection and improve the accuracy of keyword detection.
In order to achieve the purpose, the invention provides the following scheme:
a keyword detection method includes:
acquiring a voice signal to be processed;
extracting the characteristics of the voice signal to be processed, determining Mel frequency spectrum characteristics, and performing variable-dimension processing on the Mel frequency spectrum characteristics; the Mel frequency spectrum characteristic after the variable dimension processing is a one-dimensional characteristic;
determining a detection result of the voice signal to be processed by utilizing a trained time domain convolution neural network according to the Mel frequency spectrum characteristic after the dimension changing processing; the detection result is a keyword or a non-keyword corresponding to the voice signal to be processed;
the process of detecting the keywords by the trained time domain convolution neural network comprises the following steps:
inputting the Mel frequency spectrum characteristics after the dimension changing processing into a first layer time domain convolution layer for convolution processing;
inputting the features after the convolution processing into a BN layer for regularization processing;
inputting the regularized features into a ReLU activation function, and then sequentially into 6 time-domain bottleneck blocks, a global average pooling layer and a fully connected layer, so as to determine the output probability of each keyword and of the non-keyword class, the word with the highest output probability being taken as the detection result of the trained time-domain convolutional neural network. Each time-domain bottleneck block comprises, connected in sequence: a point convolution layer, a BN layer, a ReLU activation function, a depthwise time-domain convolution layer, a BN layer, a ReLU activation function, a point convolution layer, a BN layer and a ReLU activation function; within each time-domain bottleneck block, the input of the first point convolution layer is connected to the output of the last BN layer.
Optionally, the performing feature extraction on the voice signal to be processed to determine mel-frequency spectrum features specifically includes:
pre-emphasis, framing and windowing, fast Fourier transform (FFT), Mel filtering and taking the logarithm are applied to the voice signal to be processed to determine the Mel-spectrum features; the frame length used for framing and windowing is 30 ms and the frame shift is 10 ms; 40 Mel filters are used in the Mel filtering.
Optionally, the performing feature extraction on the voice signal to be processed, determining a mel-frequency spectrum feature, and performing variable-dimension processing on the mel-frequency spectrum feature specifically includes:
and performing dimension changing processing on the Mel frequency spectrum characteristic by using a view function or a reshape function in a PyTorch frame.
Optionally, the number of output channels of the first layer of time domain convolution layer is C, and the convolution kernel is 1 × 3.
Optionally, the convolution kernels of the dot convolution layers are each 1 x 1 in size.
Optionally, the convolution kernels of the depth-time domain convolution layer each have a size of 1 × 9.
A keyword detection system, comprising:
the voice signal acquisition module is used for acquiring a voice signal to be processed;
the Mel frequency spectrum feature extraction and dimension variation module is used for extracting features of the voice signal to be processed, determining Mel frequency spectrum features and carrying out dimension variation processing on the Mel frequency spectrum features; the Mel frequency spectrum characteristic after the dimension changing processing is a one-dimensional characteristic;
the detection result determining module is used for determining the detection result of the voice signal to be processed by utilizing the trained time domain convolution neural network according to the Mel frequency spectrum characteristic after the dimension changing processing; the detection result is a keyword or a non-keyword corresponding to the voice signal to be processed;
the process of detecting the keywords by the trained time domain convolution neural network comprises the following steps:
inputting the Mel frequency spectrum characteristics after the dimension changing processing into a first layer time domain convolution layer for convolution processing;
inputting the features after the convolution processing into a BN layer for regularization processing;
inputting the characteristics after the regularization processing into a ReLU activation function, further sequentially inputting 6 layers of time domain bottleneck blocks, a global average pooling layer and a full connection layer to determine corresponding output probabilities of all keywords and non-keywords, and taking a word corresponding to the maximum output probability as a detection result of the trained time domain convolutional neural network; the time domain bottleneck block comprises: the device comprises a point convolution layer, a BN layer, a ReLU activation function, a depth time domain convolution layer, a BN layer, a ReLU activation function, a point convolution layer, a BN layer and a ReLU activation function which are connected in sequence; wherein, the input end of the point convolution layer of the first layer in each layer of time domain bottleneck block is connected with the output end of the BN layer of the last layer.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
according to the keyword detection method and system provided by the invention, the traditional residual block is replaced by the time domain bottleneck block (TBB) in the trained time domain convolutional neural network, so that the calculation amount required by convolution is greatly reduced, the accuracy of the system is improved, and the deployment difficulty of hardware implementation is reduced. In addition, the trained time domain convolution neural network takes the one-dimensional characteristics as input, so that the characteristic dimension is reduced, efficient convolution processing can be realized by using less parameter quantity, the calculated quantity of keyword detection is reduced, and the accuracy of keyword detection is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the embodiments will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a keyword detection method according to the present invention;
fig. 2 is a schematic structural diagram of a keyword detection system provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a keyword detection method and a keyword detection system, which can reduce the calculation amount of keyword detection and improve the accuracy of keyword detection.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a schematic flow chart of a keyword detection method provided by the present invention, and as shown in fig. 1, the keyword detection method provided by the present invention includes:
s101, acquiring a voice signal to be processed.
S102, extracting the characteristics of the voice signal to be processed, determining Mel frequency spectrum characteristics, and carrying out variable-dimension processing on the Mel frequency spectrum characteristics; the Mel frequency spectrum feature after the dimension changing processing is a one-dimensional feature.
S102, specifically including:
pre-emphasis, framing and windowing, fast Fourier transform (FFT), Mel filtering and taking the logarithm are applied to the voice signal to be processed to determine the Mel-spectrum features; the frame length is 30 ms and the frame shift is 10 ms.
As a specific example, each feature is computed over 1 s of speech, which contains 98 frames, since (1000 − 30)/10 + 1 = 98. The Mel filtering uses 40 Mel filters, giving a 40 × 98 Mel-spectrum feature; and since the speech input is single-channel, the feature finally obtained is 1 × 40 × 98.
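The frame and feature bookkeeping above can be checked with a short calculation. This is an illustrative sketch using the values stated in the text (30 ms frames, 10 ms shift, 40 Mel filters over 1 s of speech); the helper name is not from the patent.

```python
# Frame-count and feature-shape bookkeeping for the front end described above.
def num_frames(duration_ms: int, frame_len_ms: int, frame_shift_ms: int) -> int:
    """Number of full frames that fit in the signal (no padding)."""
    return (duration_ms - frame_len_ms) // frame_shift_ms + 1

frames = num_frames(1000, 30, 10)   # 98 frames per second of speech
feature_shape = (1, 40, frames)     # single channel, 40 Mel bins, 98 frames
print(frames, feature_shape)        # 98 (1, 40, 98)
```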
And performing dimension changing processing on the Mel frequency spectrum characteristic by using a view function or a reshape function in a PyTorch frame.
As a specific example, the 1 × 40 × 98 Mel-spectrum feature is reshaped into a 40 × 1 × 98 feature; that is, the single-channel two-dimensional 40 × 98 feature becomes 40 channels of one-dimensional 1 × 98 time-domain features.
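A minimal PyTorch sketch of this dimension-changing step, assuming the 1 × 40 × 98 feature is held in a contiguous tensor: `view()` (or `reshape()`) turns the single-channel 40 × 98 map into 40 one-dimensional time-domain features of length 98.

```python
import torch

mel = torch.randn(1, 40, 98)   # 1 channel, 40 Mel bins, 98 frames
mel_1d = mel.view(40, 1, 98)   # 40 channels, each a 1 x 98 time-domain feature
# mel.reshape(40, 1, 98) is equivalent, and also works on non-contiguous tensors
print(mel_1d.shape)            # torch.Size([40, 1, 98])
```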
S103, determining a detection result of the voice signal to be processed by utilizing a trained time domain convolution neural network according to the Mel frequency spectrum characteristic after the dimension changing processing; and the detection result is a keyword or a non-keyword corresponding to the voice signal to be processed.
The process of detecting the keywords by the trained time domain convolution neural network comprises the following steps:
inputting the dimension-changed Mel-spectrum features into the first time-domain convolution layer for convolution processing; the first time-domain convolution layer has 40 input channels (the number of Mel filters used in feature extraction), C output channels, and a 1 × 3 convolution kernel. The 1 × 3 kernels convolve the 1 × 98 time-domain feature of each channel to produce a C × 1 × N layer output, where C is the number of output channels of the layer and N depends not only on the convolution kernel and the input length but also on the layer's stride and padding: N = 1 + (M − F + 2P)/S, where M is the input length, i.e. 98; F is the kernel size, i.e. 3; P is the padding, which is 1 for the first layer; and S is the stride, which is 1 for the first layer; giving N = 98.
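The output-length formula used above, N = 1 + (M − F + 2P)/S, can be sketched directly. With the stated first-layer values M = 98, F = 3, S = 1, a padding of P = 1 is what keeps the output length at 98 (P = 0 would give 96); the helper name is illustrative.

```python
def conv_out_len(m: int, f: int, p: int, s: int) -> int:
    """Output length of a 1-D convolution: N = 1 + (M - F + 2P) // S."""
    return 1 + (m - f + 2 * p) // s

print(conv_out_len(98, 3, 1, 1))  # 98: padding 1 preserves the length
print(conv_out_len(98, 3, 0, 1))  # 96: no padding shortens it by F - 1
```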
In order to reduce overfitting, the features after convolution processing are input into the BN layer to be subjected to regularization processing.
To add nonlinearity to the network, the regularized features are input into a ReLU activation function and then pass sequentially through 6 time-domain bottleneck blocks, a global average pooling layer and a fully connected layer, which determine the output probability of each keyword and of the non-keyword class; the word with the highest output probability is taken as the detection result of the trained time-domain convolutional neural network. Each time-domain bottleneck block comprises, connected in sequence: a point convolution layer with 1 × 1 kernels, a BN layer, a ReLU activation function, a depthwise time-domain convolution layer with 1 × 9 kernels, a BN layer, a ReLU activation function, a point convolution layer with 1 × 1 kernels, a BN layer and a ReLU activation function. Within each time-domain bottleneck block, the input of the first point convolution layer is connected to the output of the last BN layer.
As a specific embodiment, each time-domain bottleneck block has a shortcut that connects the input of the block's first point convolution layer to the output of its last BN layer. The shortcut copies the input of the convolution layers; this copy is added to the output of the BN layer and passed through the ReLU activation function to give the input of the next layer.
The point convolution layers with 1 × 1 kernels match dimensions and change the number of output channels. The depthwise time-domain convolution layer with 1 × 9 kernels obtains a receptive field of the same size as a conventional 3 × 3 convolution layer while requiring markedly less computation; its stride, padding and channel counts can be chosen as required. After the input passes through the depthwise time-domain convolution layer, a C × 1 × N output is obtained (C being the layer's number of output channels). This output passes through the 1 × 1 point convolution layer and is added to the block input carried by the shortcut (if the original input is not C × 1 × N, a dimension-changing operation, realizable with a 1 × 1 convolution, is inserted on the shortcut), giving the input of the next time-domain bottleneck block. The operation repeats until the output of the last (6th) time-domain bottleneck block is obtained; this output enters the global average pooling layer for dimension reduction, finally yielding an output of size C (the number of output channels of the global average pooling layer). It then enters the fully connected layer, which outputs n dimensions; after softmax processing, n output probabilities are produced, where n is the number of keyword and non-keyword classes configured for detection. For example, with n = 12 (11 keywords and 1 non-keyword), 12 output probabilities are produced, one per category.
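The bottleneck-block structure just described can be sketched in PyTorch. This is an illustrative reconstruction, not the patented implementation: the channel sizes are assumptions, and the shortcut is shown only for the case where input and output dimensions already match.

```python
import torch
import torch.nn as nn

class TemporalBottleneckBlock(nn.Module):
    """Sketch of one TBB: pointwise 1x1 -> BN -> ReLU -> depthwise 1x9 ->
    BN -> ReLU -> pointwise 1x1 -> BN, with the block input added via a
    shortcut before the final ReLU. Channel sizes are illustrative."""

    def __init__(self, channels: int, hidden: int):
        super().__init__()
        self.pw1 = nn.Conv1d(channels, hidden, kernel_size=1)
        self.bn1 = nn.BatchNorm1d(hidden)
        # groups=hidden makes this a depthwise (per-channel) 1 x 9 convolution;
        # padding=4 keeps the time length unchanged
        self.dw = nn.Conv1d(hidden, hidden, kernel_size=9, padding=4, groups=hidden)
        self.bn2 = nn.BatchNorm1d(hidden)
        self.pw2 = nn.Conv1d(hidden, channels, kernel_size=1)
        self.bn3 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.pw1(x)))
        out = self.relu(self.bn2(self.dw(out)))
        out = self.bn3(self.pw2(out))
        return self.relu(out + x)  # shortcut: block input + last BN output

block = TemporalBottleneckBlock(channels=64, hidden=128).eval()
y = block(torch.randn(1, 64, 98))  # Conv1d layout: (batch, channels, time)
print(y.shape)                     # torch.Size([1, 64, 98])
```

Note that `nn.Conv1d` expects (batch, channels, length); the patent's 40 × 1 × 98 reshaped feature corresponds to a (1, 40, 98) tensor in this layout.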
The category with the highest output probability is selected and the corresponding keyword or non-keyword is output; if a keyword is output, the keyword detection succeeds and is recorded, whereas a non-keyword output is not recorded.
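The decision step above reduces to an argmax over the softmax outputs, recording only keyword hits. The labels and the probability vector here are made-up illustrations, not values from the patent.

```python
probs = [0.02, 0.05, 0.80, 0.13]                 # n = 4 softmax outputs
labels = ["yes", "no", "stop", "<non-keyword>"]  # last class is the non-keyword
best = max(range(len(probs)), key=probs.__getitem__)  # index of max probability
detected = labels[best]
record = detected != "<non-keyword>"             # only keyword hits are recorded
print(detected, record)  # stop True
```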
Compared with a conventional voice wake-up system that uses a two-dimensional convolutional neural network as the classifier, the present system uses a time-domain convolutional neural network, replacing the conventional two-dimensional convolutions with one-dimensional time-domain convolutions. This markedly reduces the computation and memory footprint of the keyword detection system, improves its accuracy, and eases deployment in hardware.
Compared with the C × M × N output of a conventional two-dimensional convolution (C being the number of output channels and M × N the size of the feature input), the C × 1 × N layer output greatly reduces the size of the feature maps, and with it the data storage and computation of the system, thereby also reducing power consumption. Moreover, because the feature dimensionality is reduced, efficient convolution can be achieved with fewer parameters, improving keyword detection accuracy and finally yielding a lightweight keyword detection scheme suitable for mobile devices.
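A rough parameter count illustrates the saving claimed above, for one assumed layer size: a conventional 3 × 3 two-dimensional convolution versus the 1 × 9 depthwise plus 1 × 1 pointwise factorisation. The channel sizes (C_in = C_out = 64) are assumptions, not values from the patent, and biases are ignored.

```python
c_in, c_out = 64, 64
conv2d_3x3 = c_out * c_in * 3 * 3           # standard 2-D conv: 36864 weights
depthwise_1x9 = c_in * 9                    # one 1 x 9 filter per channel: 576
pointwise_1x1 = c_out * c_in                # 1 x 1 channel mixing: 4096
factorised = depthwise_1x9 + pointwise_1x1  # 4672 weights, roughly 7.9x fewer
print(conv2d_3x3, factorised)               # 36864 4672
```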
Fig. 2 is a schematic structural diagram of a keyword detection system provided by the present invention, and as shown in fig. 2, the keyword detection system provided by the present invention includes:
a voice signal obtaining module 201, configured to obtain a voice signal to be processed.
A mel-frequency spectrum feature extraction and dimension change module 202, configured to perform feature extraction on the voice signal to be processed, determine mel-frequency spectrum features, and perform dimension change processing on the mel-frequency spectrum features; the Mel frequency spectrum feature after the dimension changing processing is a one-dimensional feature.
The detection result determining module 203 is configured to determine a detection result of the voice signal to be processed by using the trained time domain convolutional neural network according to the mel frequency spectrum feature after the dimension change processing; and the detection result is a keyword or a non-keyword corresponding to the voice signal to be processed.
The process of detecting the keywords by the trained time domain convolution neural network comprises the following steps:
inputting the Mel frequency spectrum characteristics subjected to the dimension changing processing into a first layer of time domain convolution layer for convolution processing;
and inputting the features subjected to the convolution processing into the BN layer for regularization processing.
Inputting the characteristics after the regularization processing into a ReLU activation function, further sequentially inputting 6 layers of time domain bottleneck blocks, a global average pooling layer and a full connection layer to determine corresponding output probabilities of all keywords and non-keywords, and taking a word corresponding to the maximum output probability as a detection result of the trained time domain convolutional neural network; the time domain bottleneck block comprises: the device comprises a point convolution layer, a BN layer, a ReLU activation function, a depth time domain convolution layer, a BN layer, a ReLU activation function, a point convolution layer, a BN layer and a ReLU activation function which are connected in sequence; wherein, the input end of the point convolution layer of the first layer in each layer of time domain bottleneck block is connected with the output end of the BN layer of the last layer.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principle and the embodiment of the present invention are explained by applying specific examples, and the above description of the embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the foregoing, the description is not to be taken in a limiting sense.
Claims (4)
1. A keyword detection method is characterized by comprising the following steps:
acquiring a voice signal to be processed;
extracting the characteristics of the voice signal to be processed, determining Mel frequency spectrum characteristics, and performing variable-dimension processing on the Mel frequency spectrum characteristics; the Mel frequency spectrum characteristic after the dimension changing processing is a one-dimensional characteristic;
determining a detection result of the voice signal to be processed by utilizing a trained time domain convolution neural network according to the Mel frequency spectrum characteristic after the dimension changing processing; the detection result is a keyword or a non-keyword corresponding to the voice signal to be processed;
the process of detecting the keywords by the trained time domain convolution neural network comprises the following steps:
inputting the Mel frequency spectrum characteristics after the dimension changing processing into a first layer time domain convolution layer for convolution processing;
inputting the features after the convolution processing into a BN layer for regularization processing;
inputting the characteristics after the regularization processing into a ReLU activation function, further sequentially inputting 6 layers of time domain bottleneck blocks, a global average pooling layer and a full connection layer to determine corresponding output probabilities of all keywords and non-keywords, and taking a word corresponding to the maximum output probability as a detection result of the trained time domain convolutional neural network; the time domain bottleneck block comprises: the device comprises a point convolution layer, a BN layer, a ReLU activation function, a depth time domain convolution layer, a BN layer, a ReLU activation function, a point convolution layer, a BN layer and a ReLU activation function which are connected in sequence; the input end of the point convolution layer of the first layer in each layer of time domain bottleneck block is connected with the output end of the BN layer of the last layer; the number of output channels of the first layer of time domain convolution layer is C, and the convolution kernel is 1 x 3; the convolution kernel size of the point convolution layer is 1 x 1; the convolution kernel size of the depth time domain convolution layer is 1 x 9.
2. The method as claimed in claim 1, wherein said extracting features of the speech signal to be processed, determining mel-frequency spectrum features, and performing dimension-changing processing on the mel-frequency spectrum features specifically comprises:
pre-emphasis, framing and windowing, fast Fourier transform, mel filtering and logarithm processing are carried out on the voice signal to be processed, and Mel frequency spectrum characteristics are determined; the frame length of the framing windowing is 30ms, and the frame shift is 10ms; the Mel filtering process includes: 40 Mel filters.
3. The method according to claim 1, wherein the performing feature extraction on the to-be-processed speech signal, determining mel-frequency spectrum features, and performing dimension-changing processing on the mel-frequency spectrum features specifically comprises:
and carrying out dimension changing processing on the Mel frequency spectrum characteristic by using a view function or a reshape function in a PyTorch framework.
4. A keyword detection system for implementing the keyword detection method according to any one of claims 1 to 3, the keyword detection system comprising:
the voice signal acquisition module is used for acquiring a voice signal to be processed;
the Mel frequency spectrum feature extraction and dimension change module is used for extracting features of the voice signal to be processed, determining Mel frequency spectrum features and carrying out dimension change processing on the Mel frequency spectrum features; the Mel frequency spectrum characteristic after the variable dimension processing is a one-dimensional characteristic;
the detection result determining module is used for determining the detection result of the voice signal to be processed by utilizing the trained time domain convolution neural network according to the Mel frequency spectrum characteristic after the dimension changing processing; the detection result is a keyword or a non-keyword corresponding to the voice signal to be processed;
the process of detecting the keywords by the trained time domain convolution neural network comprises the following steps:
inputting the Mel frequency spectrum characteristics subjected to the dimension changing processing into a first layer of time domain convolution layer for convolution processing;
inputting the features after the convolution processing into a BN layer for regularization processing;
inputting the characteristics after the regularization processing into a ReLU activation function, further sequentially inputting 6 layers of time domain bottleneck blocks, a global average pooling layer and a full connection layer to determine corresponding output probabilities of all keywords and non-keywords, and taking a word corresponding to the maximum output probability as a detection result of the trained time domain convolutional neural network; the time domain bottleneck block comprises: the device comprises a point convolution layer, a BN layer, a ReLU activation function, a deep time domain convolution layer, a BN layer, a ReLU activation function, a point convolution layer, a BN layer and a ReLU activation function which are connected in sequence; wherein, the input end of the point convolution layer of the first layer in each layer of time domain bottleneck block is connected with the output end of the BN layer of the last layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210952631.8A CN115035897B (en) | 2022-08-10 | 2022-08-10 | Keyword detection method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115035897A CN115035897A (en) | 2022-09-09 |
CN115035897B true CN115035897B (en) | 2022-11-11 |
Family
ID=83130310
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210952631.8A Active CN115035897B (en) | 2022-08-10 | 2022-08-10 | Keyword detection method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115035897B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110334800A (en) * | 2019-07-18 | 2019-10-15 | 南京风兴科技有限公司 | A kind of lightweight 3D convolutional network system for video identification |
CN112825250A (en) * | 2019-11-20 | 2021-05-21 | 芋头科技(杭州)有限公司 | Voice wake-up method, apparatus, storage medium and program product |
CN113344188A (en) * | 2021-06-18 | 2021-09-03 | 东南大学 | Lightweight neural network model based on channel attention module |
CN114708855A (en) * | 2022-06-07 | 2022-07-05 | 中科南京智能技术研究院 | Voice awakening method and system based on binary residual error neural network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220084535A1 (en) * | 2021-10-06 | 2022-03-17 | Intel Corporation | Reduced latency streaming dynamic noise suppression using convolutional neural networks |
- 2022-08-10 CN CN202210952631.8A patent/CN115035897B/en active Active
Non-Patent Citations (2)
Title |
---|
"TIME-DOMAIN SPEAKER VERIFICATION USING TEMPORAL CONVOLUTIONAL NETWORKS"; Sangwook Han; ICASSP 2021; 2021-12-31; pp. 6688-6691 * |
"End-to-end speech enhancement method based on ultra-lightweight channel attention"; Hong Yi; Chinese Journal of Intelligent Science and Technology; 2021-09-30; Vol. 3, No. 3; pp. 351-357 * |
Also Published As
Publication number | Publication date |
---|---|
CN115035897A (en) | 2022-09-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102213013B1 (en) | Frequency-based audio analysis using neural networks | |
US10937438B2 (en) | Neural network generative modeling to transform speech utterances and augment training data | |
CN109448719B (en) | Neural network model establishing method, voice awakening method, device, medium and equipment | |
CN111583903B (en) | Speech synthesis method, vocoder training method, device, medium, and electronic device | |
CN110718211B (en) | Keyword recognition system based on hybrid compressed convolutional neural network | |
CN111785288B (en) | Voice enhancement method, device, equipment and storage medium | |
CN109919295B (en) | Embedded audio event detection method based on lightweight convolutional neural network | |
CN111357051B (en) | Speech emotion recognition method, intelligent device and computer readable storage medium | |
CN109543029B (en) | Text classification method, device, medium and equipment based on convolutional neural network | |
US11854536B2 (en) | Keyword spotting apparatus, method, and computer-readable recording medium thereof | |
CN114708855B (en) | Voice awakening method and system based on binary residual error neural network | |
Peter et al. | End-to-end keyword spotting using neural architecture search and quantization | |
CN112233675A (en) | Voice awakening method and system based on separation convolutional neural network | |
CN113409827B (en) | Voice endpoint detection method and system based on local convolution block attention network | |
CN115035897B (en) | Keyword detection method and system | |
CN111489739B (en) | Phoneme recognition method, apparatus and computer readable storage medium | |
CN111259189A (en) | Music classification method and device | |
CN112397086A (en) | Voice keyword detection method and device, terminal equipment and storage medium | |
Pan et al. | An efficient hybrid learning algorithm for neural network–based speech recognition systems on FPGA chip | |
CN107919136B (en) | Digital voice sampling frequency estimation method based on Gaussian mixture model | |
CN112989106B (en) | Audio classification method, electronic device and storage medium | |
CN111160517A (en) | Convolutional layer quantization method and device of deep neural network | |
CN113409775B (en) | Keyword recognition method and device, storage medium and computer equipment | |
CN113609970A (en) | Underwater target identification method based on grouped-convolution deep U_Net |
CN113763976A (en) | Method and device for reducing noise of audio signal, readable medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||