CN115035897B - Keyword detection method and system - Google Patents
Keyword detection method and system
- Publication number
- CN115035897B CN115035897B CN202210952631.8A CN202210952631A CN115035897B CN 115035897 B CN115035897 B CN 115035897B CN 202210952631 A CN202210952631 A CN 202210952631A CN 115035897 B CN115035897 B CN 115035897B
- Authority
- CN
- China
- Prior art keywords
- layer
- time domain
- frequency spectrum
- convolution
- processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 64
- 238000001228 spectrum Methods 0.000 claims abstract description 54
- 230000004913 activation Effects 0.000 claims abstract description 28
- 238000000034 method Methods 0.000 claims abstract description 16
- 238000013528 artificial neural network Methods 0.000 claims abstract description 13
- 238000013527 convolutional neural network Methods 0.000 claims abstract description 12
- 238000000605 extraction Methods 0.000 claims abstract description 9
- 230000008569 process Effects 0.000 claims description 9
- 238000011176 pooling Methods 0.000 claims description 8
- 238000001914 filtration Methods 0.000 claims description 6
- 230000008859 change Effects 0.000 claims description 5
- 238000009432 framing Methods 0.000 claims description 5
- 230000037433 frameshift Effects 0.000 claims description 3
- 238000004364 calculation method Methods 0.000 abstract description 6
- 230000006870 function Effects 0.000 description 21
- 238000012935 Averaging Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 1
- 230000000750 progressive effect Effects 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Complex Calculations (AREA)
Abstract
The invention relates to a keyword detection method and a keyword detection system. The method comprises: acquiring a voice signal to be processed; performing feature extraction on the voice signal to determine Mel-spectrum features, and performing dimension-changing processing on the Mel-spectrum features, after which the features are one-dimensional; and determining a detection result from the dimension-changed Mel-spectrum features using a trained time-domain convolutional neural network. Each time-domain bottleneck block in the trained network comprises, connected in sequence: a point convolution layer, a BN layer, a ReLU activation function, a depthwise time-domain convolution layer, a BN layer, a ReLU activation function, a point convolution layer, a BN layer and a ReLU activation function; within each bottleneck block, the input of the first point convolution layer is connected to the output of the last BN layer. The method and system reduce the computation required for keyword detection and improve its accuracy.
Description
Technical Field
The invention relates to the technical field of keyword detection in voice recognition, in particular to a keyword detection method and a keyword detection system.
Background
Keyword detection systems typically run on mobile devices (e.g., Apple's "Hey Siri" and Xiaomi's "Xiao Ai"), whose memory is small and whose computing power is limited, so a keyword detection system must simultaneously achieve high accuracy, low latency, low memory use and low computation. The conventional convolutional neural networks commonly used in keyword detection rely on two-dimensional convolutions, which are computationally complex and noticeably time-consuming.
Therefore, it is desirable to provide a method or system capable of reducing the amount of calculation for keyword detection and improving the accuracy of keyword detection.
Disclosure of Invention
The invention aims to provide a keyword detection method and a keyword detection system, which can reduce the calculation amount of keyword detection and improve the accuracy of keyword detection.
In order to achieve the purpose, the invention provides the following scheme:
a keyword detection method includes:
acquiring a voice signal to be processed;
extracting the characteristics of the voice signal to be processed, determining Mel frequency spectrum characteristics, and performing variable-dimension processing on the Mel frequency spectrum characteristics; the Mel frequency spectrum characteristic after the variable dimension processing is a one-dimensional characteristic;
determining a detection result of the voice signal to be processed by utilizing a trained time domain convolution neural network according to the Mel frequency spectrum characteristic after the dimension changing processing; the detection result is a keyword or a non-keyword corresponding to the voice signal to be processed;
the process of detecting the keywords by the trained time domain convolution neural network comprises the following steps:
inputting the Mel frequency spectrum characteristics after the dimension changing processing into a first layer time domain convolution layer for convolution processing;
inputting the features after the convolution processing into a BN layer for regularization processing;
inputting the regularized features into a ReLU activation function, and then sequentially into 6 time-domain bottleneck blocks, a global average pooling layer and a fully connected layer, so as to determine the output probability of each keyword and of the non-keyword class, the word with the highest output probability being taken as the detection result of the trained time-domain convolutional neural network. Each time-domain bottleneck block comprises, connected in sequence: a point convolution layer, a BN layer, a ReLU activation function, a depthwise time-domain convolution layer, a BN layer, a ReLU activation function, a point convolution layer, a BN layer and a ReLU activation function; within each time-domain bottleneck block, the input of the first point convolution layer is connected to the output of the last BN layer.
Optionally, the performing feature extraction on the voice signal to be processed to determine mel-frequency spectrum features specifically includes:
pre-emphasis, framing and windowing, fast Fourier transform (FFT), Mel filtering and taking the logarithm are applied to the voice signal to be processed to determine the Mel-spectrum features; the frame length used for framing and windowing is 30 ms and the frame shift is 10 ms; 40 Mel filters are used in the Mel filtering.
Optionally, the performing feature extraction on the voice signal to be processed, determining a mel-frequency spectrum feature, and performing variable-dimension processing on the mel-frequency spectrum feature specifically includes:
and performing dimension changing processing on the Mel frequency spectrum characteristic by using a view function or a reshape function in a PyTorch frame.
Optionally, the number of output channels of the first layer of time domain convolution layer is C, and the convolution kernel is 1 × 3.
Optionally, the convolution kernels of the dot convolution layers are each 1 x 1 in size.
Optionally, the convolution kernels of the depth-time domain convolution layer each have a size of 1 × 9.
A keyword detection system, comprising:
the voice signal acquisition module is used for acquiring a voice signal to be processed;
the Mel frequency spectrum feature extraction and dimension variation module is used for extracting features of the voice signal to be processed, determining Mel frequency spectrum features and carrying out dimension variation processing on the Mel frequency spectrum features; the Mel frequency spectrum characteristic after the dimension changing processing is a one-dimensional characteristic;
the detection result determining module is used for determining the detection result of the voice signal to be processed by utilizing the trained time domain convolution neural network according to the Mel frequency spectrum characteristic after the dimension changing processing; the detection result is a keyword or a non-keyword corresponding to the voice signal to be processed;
the process of detecting the keywords by the trained time domain convolution neural network comprises the following steps:
inputting the Mel frequency spectrum characteristics after the dimension changing processing into a first layer time domain convolution layer for convolution processing;
inputting the features after the convolution processing into a BN layer for regularization processing;
inputting the characteristics after the regularization processing into a ReLU activation function, further sequentially inputting 6 layers of time domain bottleneck blocks, a global average pooling layer and a full connection layer to determine corresponding output probabilities of all keywords and non-keywords, and taking a word corresponding to the maximum output probability as a detection result of the trained time domain convolutional neural network; the time domain bottleneck block comprises: the device comprises a point convolution layer, a BN layer, a ReLU activation function, a depth time domain convolution layer, a BN layer, a ReLU activation function, a point convolution layer, a BN layer and a ReLU activation function which are connected in sequence; wherein, the input end of the point convolution layer of the first layer in each layer of time domain bottleneck block is connected with the output end of the BN layer of the last layer.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
according to the keyword detection method and system provided by the invention, the traditional residual block is replaced by the time domain bottleneck block (TBB) in the trained time domain convolutional neural network, so that the calculation amount required by convolution is greatly reduced, the accuracy of the system is improved, and the deployment difficulty of hardware implementation is reduced. In addition, the trained time domain convolution neural network takes the one-dimensional characteristics as input, so that the characteristic dimension is reduced, efficient convolution processing can be realized by using less parameter quantity, the calculated quantity of keyword detection is reduced, and the accuracy of keyword detection is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the embodiments will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a keyword detection method according to the present invention;
fig. 2 is a schematic structural diagram of a keyword detection system provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a keyword detection method and a keyword detection system, which can reduce the calculation amount of keyword detection and improve the accuracy of keyword detection.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a schematic flow chart of a keyword detection method provided by the present invention, and as shown in fig. 1, the keyword detection method provided by the present invention includes:
s101, acquiring a voice signal to be processed.
S102, extracting the characteristics of the voice signal to be processed, determining Mel frequency spectrum characteristics, and carrying out variable-dimension processing on the Mel frequency spectrum characteristics; the Mel frequency spectrum feature after the dimension changing processing is a one-dimensional feature.
S102, specifically including:
pre-emphasis, framing and windowing, fast Fourier transform (FFT), Mel filtering and taking the logarithm are applied to the voice signal to be processed to determine the Mel-spectrum features; the frame length is 30 ms and the frame shift is 10 ms.
As a specific example, each feature is computed over 1 s of speech, which contains 98 frames, since (1000 − 30)/10 + 1 = 98. The Mel filtering uses 40 Mel filters, giving a 40 × 98 Mel-spectrum feature; and since the speech input is single-channel, the feature finally obtained is 1 × 40 × 98.
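The frame and feature bookkeeping above can be checked with a short calculation. This is an illustrative sketch using the values stated in the text (30 ms frames, 10 ms shift, 40 Mel filters over 1 s of speech); the helper name is not from the patent.

```python
# Frame-count and feature-shape bookkeeping for the front end described above.
def num_frames(duration_ms: int, frame_len_ms: int, frame_shift_ms: int) -> int:
    """Number of full frames that fit in the signal (no padding)."""
    return (duration_ms - frame_len_ms) // frame_shift_ms + 1

frames = num_frames(1000, 30, 10)   # 98 frames per second of speech
feature_shape = (1, 40, frames)     # single channel, 40 Mel bins, 98 frames
print(frames, feature_shape)        # 98 (1, 40, 98)
```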
And performing dimension changing processing on the Mel frequency spectrum characteristic by using a view function or a reshape function in a PyTorch frame.
As a specific example, the 1 × 40 × 98 Mel-spectrum feature is reshaped into a 40 × 1 × 98 feature; that is, the single-channel two-dimensional 40 × 98 feature becomes 40 channels of one-dimensional 1 × 98 time-domain features.
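A minimal PyTorch sketch of this dimension-changing step, assuming the 1 × 40 × 98 feature is held in a contiguous tensor: `view()` (or `reshape()`) turns the single-channel 40 × 98 map into 40 one-dimensional time-domain features of length 98.

```python
import torch

mel = torch.randn(1, 40, 98)   # 1 channel, 40 Mel bins, 98 frames
mel_1d = mel.view(40, 1, 98)   # 40 channels, each a 1 x 98 time-domain feature
# mel.reshape(40, 1, 98) is equivalent, and also works on non-contiguous tensors
print(mel_1d.shape)            # torch.Size([40, 1, 98])
```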
S103, determining a detection result of the voice signal to be processed by utilizing a trained time domain convolution neural network according to the Mel frequency spectrum characteristic after the dimension changing processing; and the detection result is a keyword or a non-keyword corresponding to the voice signal to be processed.
The process of detecting the keywords by the trained time domain convolution neural network comprises the following steps:
inputting the dimension-changed Mel-spectrum features into the first time-domain convolution layer for convolution processing; the first time-domain convolution layer has 40 input channels (the number of Mel filters used in feature extraction), C output channels, and a 1 × 3 convolution kernel. The 1 × 3 kernels convolve the 1 × 98 time-domain feature of each channel to produce a C × 1 × N layer output, where C is the number of output channels of the layer and N depends not only on the convolution kernel and the input length but also on the layer's stride and padding: N = 1 + (M − F + 2P)/S, where M is the input length, i.e. 98; F is the kernel size, i.e. 3; P is the padding, which is 1 for the first layer; and S is the stride, which is 1 for the first layer; giving N = 98.
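The output-length formula used above, N = 1 + (M − F + 2P)/S, can be sketched directly. With the stated first-layer values M = 98, F = 3, S = 1, a padding of P = 1 is what keeps the output length at 98 (P = 0 would give 96); the helper name is illustrative.

```python
def conv_out_len(m: int, f: int, p: int, s: int) -> int:
    """Output length of a 1-D convolution: N = 1 + (M - F + 2P) // S."""
    return 1 + (m - f + 2 * p) // s

print(conv_out_len(98, 3, 1, 1))  # 98: padding 1 preserves the length
print(conv_out_len(98, 3, 0, 1))  # 96: no padding shortens it by F - 1
```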
In order to reduce overfitting, the features after convolution processing are input into the BN layer to be subjected to regularization processing.
To add nonlinearity to the network, the regularized features are input into a ReLU activation function and then pass sequentially through 6 time-domain bottleneck blocks, a global average pooling layer and a fully connected layer, which determine the output probability of each keyword and of the non-keyword class; the word with the highest output probability is taken as the detection result of the trained time-domain convolutional neural network. Each time-domain bottleneck block comprises, connected in sequence: a point convolution layer with 1 × 1 kernels, a BN layer, a ReLU activation function, a depthwise time-domain convolution layer with 1 × 9 kernels, a BN layer, a ReLU activation function, a point convolution layer with 1 × 1 kernels, a BN layer and a ReLU activation function. Within each time-domain bottleneck block, the input of the first point convolution layer is connected to the output of the last BN layer.
As a specific embodiment, each time-domain bottleneck block has a shortcut that connects the input of the block's first point convolution layer to the output of its last BN layer. The shortcut copies the input of the convolution layers; this copy is added to the output of the BN layer and passed through the ReLU activation function to give the input of the next layer.
The point convolution layers with 1 × 1 kernels match dimensions and change the number of output channels. The depthwise time-domain convolution layer with 1 × 9 kernels obtains a receptive field of the same size as a conventional 3 × 3 convolution layer while requiring markedly less computation; its stride, padding and channel counts can be chosen as required. After the input passes through the depthwise time-domain convolution layer, a C × 1 × N output is obtained (C being the layer's number of output channels). This output passes through the 1 × 1 point convolution layer and is added to the block input carried by the shortcut (if the original input is not C × 1 × N, a dimension-changing operation, realizable with a 1 × 1 convolution, is inserted on the shortcut), giving the input of the next time-domain bottleneck block. The operation repeats until the output of the last (6th) time-domain bottleneck block is obtained; this output enters the global average pooling layer for dimension reduction, finally yielding an output of size C (the number of output channels of the global average pooling layer). It then enters the fully connected layer, which outputs n dimensions; after softmax processing, n output probabilities are produced, where n is the number of keyword and non-keyword classes configured for detection. For example, with n = 12 (11 keywords and 1 non-keyword), 12 output probabilities are produced, one per category.
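The bottleneck-block structure just described can be sketched in PyTorch. This is an illustrative reconstruction, not the patented implementation: the channel sizes are assumptions, and the shortcut is shown only for the case where input and output dimensions already match.

```python
import torch
import torch.nn as nn

class TemporalBottleneckBlock(nn.Module):
    """Sketch of one TBB: pointwise 1x1 -> BN -> ReLU -> depthwise 1x9 ->
    BN -> ReLU -> pointwise 1x1 -> BN, with the block input added via a
    shortcut before the final ReLU. Channel sizes are illustrative."""

    def __init__(self, channels: int, hidden: int):
        super().__init__()
        self.pw1 = nn.Conv1d(channels, hidden, kernel_size=1)
        self.bn1 = nn.BatchNorm1d(hidden)
        # groups=hidden makes this a depthwise (per-channel) 1 x 9 convolution;
        # padding=4 keeps the time length unchanged
        self.dw = nn.Conv1d(hidden, hidden, kernel_size=9, padding=4, groups=hidden)
        self.bn2 = nn.BatchNorm1d(hidden)
        self.pw2 = nn.Conv1d(hidden, channels, kernel_size=1)
        self.bn3 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.pw1(x)))
        out = self.relu(self.bn2(self.dw(out)))
        out = self.bn3(self.pw2(out))
        return self.relu(out + x)  # shortcut: block input + last BN output

block = TemporalBottleneckBlock(channels=64, hidden=128).eval()
y = block(torch.randn(1, 64, 98))  # Conv1d layout: (batch, channels, time)
print(y.shape)                     # torch.Size([1, 64, 98])
```

Note that `nn.Conv1d` expects (batch, channels, length); the patent's 40 × 1 × 98 reshaped feature corresponds to a (1, 40, 98) tensor in this layout.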
The category with the highest output probability is selected and the corresponding keyword or non-keyword is output; if a keyword is output, the keyword detection succeeds and is recorded, whereas a non-keyword output is not recorded.
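The decision step above reduces to an argmax over the softmax outputs, recording only keyword hits. The labels and the probability vector here are made-up illustrations, not values from the patent.

```python
probs = [0.02, 0.05, 0.80, 0.13]                 # n = 4 softmax outputs
labels = ["yes", "no", "stop", "<non-keyword>"]  # last class is the non-keyword
best = max(range(len(probs)), key=probs.__getitem__)  # index of max probability
detected = labels[best]
record = detected != "<non-keyword>"             # only keyword hits are recorded
print(detected, record)  # stop True
```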
Compared with a conventional voice wake-up system that uses a two-dimensional convolutional neural network as the classifier, the present system uses a time-domain convolutional neural network, replacing the conventional two-dimensional convolutions with one-dimensional time-domain convolutions. This markedly reduces the computation and memory footprint of the keyword detection system, improves its accuracy, and eases deployment in hardware.
Compared with the C × M × N output of a conventional two-dimensional convolution (C being the number of output channels and M × N the size of the feature input), the C × 1 × N layer output greatly reduces the size of the feature maps, and with it the data storage and computation of the system, thereby also reducing power consumption. Moreover, because the feature dimensionality is reduced, efficient convolution can be achieved with fewer parameters, improving keyword detection accuracy and finally yielding a lightweight keyword detection scheme suitable for mobile devices.
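A rough parameter count illustrates the saving claimed above, for one assumed layer size: a conventional 3 × 3 two-dimensional convolution versus the 1 × 9 depthwise plus 1 × 1 pointwise factorisation. The channel sizes (C_in = C_out = 64) are assumptions, not values from the patent, and biases are ignored.

```python
c_in, c_out = 64, 64
conv2d_3x3 = c_out * c_in * 3 * 3           # standard 2-D conv: 36864 weights
depthwise_1x9 = c_in * 9                    # one 1 x 9 filter per channel: 576
pointwise_1x1 = c_out * c_in                # 1 x 1 channel mixing: 4096
factorised = depthwise_1x9 + pointwise_1x1  # 4672 weights, roughly 7.9x fewer
print(conv2d_3x3, factorised)               # 36864 4672
```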
Fig. 2 is a schematic structural diagram of a keyword detection system provided by the present invention, and as shown in fig. 2, the keyword detection system provided by the present invention includes:
a voice signal obtaining module 201, configured to obtain a voice signal to be processed.
A mel-frequency spectrum feature extraction and dimension change module 202, configured to perform feature extraction on the voice signal to be processed, determine mel-frequency spectrum features, and perform dimension change processing on the mel-frequency spectrum features; the Mel frequency spectrum feature after the dimension changing processing is a one-dimensional feature.
The detection result determining module 203 is configured to determine a detection result of the voice signal to be processed by using the trained time domain convolutional neural network according to the mel frequency spectrum feature after the dimension change processing; and the detection result is a keyword or a non-keyword corresponding to the voice signal to be processed.
The process of detecting the keywords by the trained time domain convolution neural network comprises the following steps:
inputting the Mel frequency spectrum characteristics subjected to the dimension changing processing into a first layer of time domain convolution layer for convolution processing;
and inputting the features subjected to the convolution processing into the BN layer for regularization processing.
Inputting the characteristics after the regularization processing into a ReLU activation function, further sequentially inputting 6 layers of time domain bottleneck blocks, a global average pooling layer and a full connection layer to determine corresponding output probabilities of all keywords and non-keywords, and taking a word corresponding to the maximum output probability as a detection result of the trained time domain convolutional neural network; the time domain bottleneck block comprises: the device comprises a point convolution layer, a BN layer, a ReLU activation function, a depth time domain convolution layer, a BN layer, a ReLU activation function, a point convolution layer, a BN layer and a ReLU activation function which are connected in sequence; wherein, the input end of the point convolution layer of the first layer in each layer of time domain bottleneck block is connected with the output end of the BN layer of the last layer.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principle and the embodiment of the present invention are explained by applying specific examples, and the above description of the embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the foregoing, the description is not to be taken in a limiting sense.
Claims (4)
1. A keyword detection method is characterized by comprising the following steps:
acquiring a voice signal to be processed;
extracting the characteristics of the voice signal to be processed, determining Mel frequency spectrum characteristics, and performing variable-dimension processing on the Mel frequency spectrum characteristics; the Mel frequency spectrum characteristic after the dimension changing processing is a one-dimensional characteristic;
determining a detection result of the voice signal to be processed by utilizing a trained time domain convolution neural network according to the Mel frequency spectrum characteristic after the dimension changing processing; the detection result is a keyword or a non-keyword corresponding to the voice signal to be processed;
the process of detecting the keywords by the trained time domain convolution neural network comprises the following steps:
inputting the Mel frequency spectrum characteristics after the dimension changing processing into a first layer time domain convolution layer for convolution processing;
inputting the features after the convolution processing into a BN layer for regularization processing;
inputting the characteristics after the regularization processing into a ReLU activation function, further sequentially inputting 6 layers of time domain bottleneck blocks, a global average pooling layer and a full connection layer to determine corresponding output probabilities of all keywords and non-keywords, and taking a word corresponding to the maximum output probability as a detection result of the trained time domain convolutional neural network; the time domain bottleneck block comprises: the device comprises a point convolution layer, a BN layer, a ReLU activation function, a depth time domain convolution layer, a BN layer, a ReLU activation function, a point convolution layer, a BN layer and a ReLU activation function which are connected in sequence; the input end of the point convolution layer of the first layer in each layer of time domain bottleneck block is connected with the output end of the BN layer of the last layer; the number of output channels of the first layer of time domain convolution layer is C, and the convolution kernel is 1 x 3; the convolution kernel size of the point convolution layer is 1 x 1; the convolution kernel size of the depth time domain convolution layer is 1 x 9.
2. The method as claimed in claim 1, wherein said extracting features of the speech signal to be processed, determining mel-frequency spectrum features, and performing dimension-changing processing on the mel-frequency spectrum features specifically comprises:
pre-emphasis, framing and windowing, fast Fourier transform, mel filtering and logarithm processing are carried out on the voice signal to be processed, and Mel frequency spectrum characteristics are determined; the frame length of the framing windowing is 30ms, and the frame shift is 10ms; the Mel filtering process includes: 40 Mel filters.
3. The method according to claim 1, wherein the performing feature extraction on the to-be-processed speech signal, determining mel-frequency spectrum features, and performing dimension-changing processing on the mel-frequency spectrum features specifically comprises:
and carrying out dimension changing processing on the Mel frequency spectrum characteristic by using a view function or a reshape function in a PyTorch framework.
4. A keyword detection system for implementing the keyword detection method according to any one of claims 1 to 3, the keyword detection system comprising:
the voice signal acquisition module is used for acquiring a voice signal to be processed;
the Mel frequency spectrum feature extraction and dimension change module is used for extracting features of the voice signal to be processed, determining Mel frequency spectrum features and carrying out dimension change processing on the Mel frequency spectrum features; the Mel frequency spectrum characteristic after the variable dimension processing is a one-dimensional characteristic;
the detection result determining module is used for determining the detection result of the voice signal to be processed by utilizing the trained time domain convolution neural network according to the Mel frequency spectrum characteristic after the dimension changing processing; the detection result is a keyword or a non-keyword corresponding to the voice signal to be processed;
the process of detecting the keywords by the trained time domain convolution neural network comprises the following steps:
inputting the Mel frequency spectrum characteristics subjected to the dimension changing processing into a first layer of time domain convolution layer for convolution processing;
inputting the features after the convolution processing into a BN layer for regularization processing;
inputting the characteristics after the regularization processing into a ReLU activation function, further sequentially inputting 6 layers of time domain bottleneck blocks, a global average pooling layer and a full connection layer to determine corresponding output probabilities of all keywords and non-keywords, and taking a word corresponding to the maximum output probability as a detection result of the trained time domain convolutional neural network; the time domain bottleneck block comprises: the device comprises a point convolution layer, a BN layer, a ReLU activation function, a deep time domain convolution layer, a BN layer, a ReLU activation function, a point convolution layer, a BN layer and a ReLU activation function which are connected in sequence; wherein, the input end of the point convolution layer of the first layer in each layer of time domain bottleneck block is connected with the output end of the BN layer of the last layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210952631.8A CN115035897B (en) | 2022-08-10 | 2022-08-10 | Keyword detection method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115035897A CN115035897A (en) | 2022-09-09 |
CN115035897B true CN115035897B (en) | 2022-11-11 |
Family
ID=83130310
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210952631.8A Active CN115035897B (en) | 2022-08-10 | 2022-08-10 | Keyword detection method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115035897B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110334800A (en) * | 2019-07-18 | 2019-10-15 | 南京风兴科技有限公司 | A kind of lightweight 3D convolutional network system for video identification |
CN112825250A (en) * | 2019-11-20 | 2021-05-21 | 芋头科技(杭州)有限公司 | Voice wake-up method, apparatus, storage medium and program product |
CN113344188A (en) * | 2021-06-18 | 2021-09-03 | 东南大学 | Lightweight neural network model based on channel attention module |
CN114708855A (en) * | 2022-06-07 | 2022-07-05 | 中科南京智能技术研究院 | Voice awakening method and system based on binary residual error neural network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220084535A1 (en) * | 2021-10-06 | 2022-03-17 | Intel Corporation | Reduced latency streaming dynamic noise suppression using convolutional neural networks |
- 2022-08-10 CN CN202210952631.8A patent/CN115035897B/en active Active
Non-Patent Citations (2)
Title |
---|
"TIME-DOMAIN SPEAKER VERIFICATION USING TEMPORAL CONVOLUTIONAL NETWORKS"; Sangwook Han; ICASSP 2021; 2021-12-31; pp. 6688-6691 * |
"End-to-end speech enhancement method based on ultra-lightweight channel attention"; Hong Yi; Chinese Journal of Intelligent Science and Technology; 2021-09-30; Vol. 3, No. 3; pp. 351-357 * |
Also Published As
Publication number | Publication date |
---|---|
CN115035897A (en) | 2022-09-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102213013B1 (en) | Frequency-based audio analysis using neural networks | |
US10937438B2 (en) | Neural network generative modeling to transform speech utterances and augment training data | |
CN109448719B (en) | Neural network model establishing method, voice awakening method, device, medium and equipment | |
CN111583903B (en) | Speech synthesis method, vocoder training method, device, medium, and electronic device | |
CN110718211B (en) | Keyword recognition system based on hybrid compressed convolutional neural network | |
CN111785288B (en) | Voice enhancement method, device, equipment and storage medium | |
CN109919295B (en) | Embedded audio event detection method based on lightweight convolutional neural network | |
CN111357051B (en) | Speech emotion recognition method, intelligent device and computer readable storage medium | |
CN109543029B (en) | Text classification method, device, medium and equipment based on convolutional neural network | |
US11854536B2 (en) | Keyword spotting apparatus, method, and computer-readable recording medium thereof | |
CN114708855B (en) | Voice awakening method and system based on binary residual error neural network | |
Peter et al. | End-to-end keyword spotting using neural architecture search and quantization | |
CN112233675A (en) | Voice awakening method and system based on separation convolutional neural network | |
CN113409827B (en) | Voice endpoint detection method and system based on local convolution block attention network | |
CN115035897B (en) | Keyword detection method and system | |
CN111489739B (en) | Phoneme recognition method, apparatus and computer readable storage medium | |
CN111259189A (en) | Music classification method and device | |
CN112397086A (en) | Voice keyword detection method and device, terminal equipment and storage medium | |
Pan et al. | An efficient hybrid learning algorithm for neural network–based speech recognition systems on FPGA chip | |
CN107919136B (en) | Digital voice sampling frequency estimation method based on Gaussian mixture model | |
CN112989106B (en) | Audio classification method, electronic device and storage medium | |
CN111160517A (en) | Convolutional layer quantization method and device of deep neural network | |
CN113409775B (en) | Keyword recognition method and device, storage medium and computer equipment | |
CN113609970A (en) | Underwater target identification method based on grouped-convolution deep U_Net |
CN113763976A (en) | Method and device for reducing noise of audio signal, readable medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||