CN115035897B - Keyword detection method and system - Google Patents

Keyword detection method and system

Info

Publication number
CN115035897B
Authority
CN
China
Prior art keywords
layer
time domain
frequency spectrum
convolution
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210952631.8A
Other languages
Chinese (zh)
Other versions
CN115035897A (en)
Inventor
王啸
李郡
尚德龙
周玉梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Nanjing Intelligent Technology Research Institute
Original Assignee
Zhongke Nanjing Intelligent Technology Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Nanjing Intelligent Technology Research Institute filed Critical Zhongke Nanjing Intelligent Technology Research Institute
Priority to CN202210952631.8A priority Critical patent/CN115035897B/en
Publication of CN115035897A publication Critical patent/CN115035897A/en
Application granted granted Critical
Publication of CN115035897B publication Critical patent/CN115035897B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to a keyword detection method and system. The method comprises: acquiring a speech signal to be processed; extracting features from the speech signal to be processed, determining Mel-spectrum features, and applying dimension-change processing to the Mel-spectrum features, so that the dimension-changed Mel-spectrum feature is one-dimensional; and determining a detection result from the dimension-changed Mel-spectrum feature using a trained time-domain convolutional neural network. Each time-domain bottleneck block in the trained time-domain convolutional neural network comprises, connected in sequence, a pointwise convolution layer, a BN layer, a ReLU activation, a depthwise time-domain convolution layer, a BN layer, a ReLU activation, a pointwise convolution layer, a BN layer and a ReLU activation; within each time-domain bottleneck block, the input of the first pointwise convolution layer is connected by a shortcut to the output of the last BN layer. The method and system reduce the computation required for keyword detection and improve detection accuracy.

Description

Keyword detection method and system
Technical Field
The invention relates to the technical field of keyword detection in speech recognition, and in particular to a keyword detection method and a keyword detection system.
Background
Keyword detection systems typically run in mobile-device applications (for example, Apple's "Hey Siri" or Xiaomi's "Xiao Ai"). Because mobile devices have little memory and limited computing power, a keyword detection system must simultaneously offer high accuracy, low latency, low memory usage and low computation. The two-dimensional convolutions used in the conventional convolutional neural networks common in keyword detection systems are computationally complex and noticeably time-consuming.
Therefore, there is a need for a method or system that can reduce the computation required for keyword detection while improving its accuracy.
Disclosure of Invention
The object of the invention is to provide a keyword detection method and a keyword detection system that reduce the computation required for keyword detection and improve detection accuracy.
To achieve this object, the invention provides the following scheme:
A keyword detection method comprises:
acquiring a speech signal to be processed;
extracting features from the speech signal to be processed, determining Mel-spectrum features, and applying dimension-change processing to the Mel-spectrum features; the dimension-changed Mel-spectrum feature is one-dimensional;
determining a detection result of the speech signal to be processed from the dimension-changed Mel-spectrum feature using a trained time-domain convolutional neural network; the detection result is the keyword or non-keyword corresponding to the speech signal to be processed;
keyword detection with the trained time-domain convolutional neural network proceeds as follows:
the dimension-changed Mel-spectrum feature is input to a first time-domain convolution layer for convolution;
the convolved features are input to a BN layer for regularization;
the regularized features are passed through a ReLU activation function and then, in sequence, through six time-domain bottleneck blocks, a global average pooling layer and a fully connected layer to obtain output probabilities for each keyword and for the non-keyword class; the word with the highest output probability is taken as the detection result of the trained time-domain convolutional neural network. Each time-domain bottleneck block comprises, connected in sequence, a pointwise convolution layer, a BN layer, a ReLU activation, a depthwise time-domain convolution layer, a BN layer, a ReLU activation, a pointwise convolution layer, a BN layer and a ReLU activation; within each time-domain bottleneck block, the input of the first pointwise convolution layer is connected by a shortcut to the output of the last BN layer.
Optionally, extracting features from the speech signal to be processed and determining the Mel-spectrum features specifically comprises:
performing pre-emphasis, framing and windowing, fast Fourier transform (FFT), Mel filtering and logarithm operations on the speech signal to be processed to determine the Mel-spectrum features; the frame length used for framing and windowing is 30 ms and the frame shift is 10 ms; the Mel filtering uses 40 Mel filters.
Optionally, extracting features from the speech signal to be processed, determining the Mel-spectrum features and applying dimension-change processing to them specifically comprises:
applying dimension-change processing to the Mel-spectrum features using the view function or the reshape function of the PyTorch framework.
Optionally, the first time-domain convolution layer has C output channels and a 1 × 3 convolution kernel.
Optionally, the convolution kernels of the pointwise convolution layers are each of size 1 × 1.
Optionally, the convolution kernels of the depthwise time-domain convolution layers are each of size 1 × 9.
A keyword detection system comprises:
a speech signal acquisition module, configured to acquire a speech signal to be processed;
a Mel-spectrum feature extraction and dimension-change module, configured to extract features from the speech signal to be processed, determine Mel-spectrum features and apply dimension-change processing to the Mel-spectrum features; the dimension-changed Mel-spectrum feature is one-dimensional;
a detection result determination module, configured to determine the detection result of the speech signal to be processed from the dimension-changed Mel-spectrum feature using the trained time-domain convolutional neural network; the detection result is the keyword or non-keyword corresponding to the speech signal to be processed;
keyword detection with the trained time-domain convolutional neural network proceeds as follows:
the dimension-changed Mel-spectrum feature is input to a first time-domain convolution layer for convolution;
the convolved features are input to a BN layer for regularization;
the regularized features are passed through a ReLU activation function and then, in sequence, through six time-domain bottleneck blocks, a global average pooling layer and a fully connected layer to obtain output probabilities for each keyword and for the non-keyword class; the word with the highest output probability is taken as the detection result of the trained time-domain convolutional neural network. Each time-domain bottleneck block comprises, connected in sequence, a pointwise convolution layer, a BN layer, a ReLU activation, a depthwise time-domain convolution layer, a BN layer, a ReLU activation, a pointwise convolution layer, a BN layer and a ReLU activation; within each time-domain bottleneck block, the input of the first pointwise convolution layer is connected by a shortcut to the output of the last BN layer.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects:
In the keyword detection method and system provided by the invention, the conventional residual block is replaced in the trained time-domain convolutional neural network by a time-domain bottleneck block (TBB), which greatly reduces the computation required for convolution, improves system accuracy and lowers the difficulty of hardware deployment. In addition, the trained time-domain convolutional neural network takes one-dimensional features as input; the reduced feature dimensionality allows efficient convolution with fewer parameters, which reduces the computation required for keyword detection and improves detection accuracy.
Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention, and a person skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a keyword detection method according to the present invention;
FIG. 2 is a schematic structural diagram of a keyword detection system provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The object of the invention is to provide a keyword detection method and a keyword detection system that reduce the computation required for keyword detection and improve detection accuracy.
To make the above objects, features and advantages of the present invention more comprehensible, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a schematic flow chart of the keyword detection method provided by the present invention. As shown in FIG. 1, the keyword detection method provided by the present invention comprises:
S101, acquiring a speech signal to be processed.
S102, extracting features from the speech signal to be processed, determining Mel-spectrum features, and applying dimension-change processing to the Mel-spectrum features; the dimension-changed Mel-spectrum feature is one-dimensional.
S102 specifically comprises the following.
Pre-emphasis, framing and windowing, fast Fourier transform (FFT), Mel filtering and logarithm operations are performed on the speech signal to be processed to determine the Mel-spectrum features; the frame length used for framing and windowing is 30 ms and the frame shift is 10 ms.
As a specific example, each feature is computed from 1 s of speech, which contains 98 frames (98 = (1000 - 30)/10 + 1). The Mel filtering uses 40 Mel filters, so a 40 × 98 Mel-spectrum feature is obtained. Since the speech signal is a single-channel input, the final Mel-spectrum feature has size 1 × 40 × 98.
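As a minimal sketch of this front end, assuming a 16 kHz sampling rate (the patent does not specify one) and using torchaudio's MelSpectrogram transform as an implementation convenience (the pre-emphasis step is omitted for brevity), the 1 × 40 × 98 log-Mel feature can be computed as follows:

```python
import torch
import torchaudio

SAMPLE_RATE = 16000                      # assumed; not stated in the patent
FRAME_LEN = int(0.030 * SAMPLE_RATE)     # 30 ms frame length -> 480 samples
FRAME_SHIFT = int(0.010 * SAMPLE_RATE)   # 10 ms frame shift  -> 160 samples

# Framing/windowing, FFT and Mel filtering (40 filters) in one transform;
# center=False means no edge padding, so 1 s of audio yields exactly 98 frames.
mel_frontend = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=FRAME_LEN,
    win_length=FRAME_LEN,
    hop_length=FRAME_SHIFT,
    n_mels=40,
    center=False,
)

waveform = torch.randn(1, SAMPLE_RATE)               # 1 s of dummy single-channel audio
log_mel = torch.log(mel_frontend(waveform) + 1e-6)   # logarithm step
print(log_mel.shape)                                 # torch.Size([1, 40, 98])
```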
The Mel-spectrum feature is then dimension-changed using the view function or the reshape function of the PyTorch framework.
As a specific example, the 1 × 40 × 98 Mel-spectrum feature is dimension-changed into a 40 × 1 × 98 feature; that is, the single-channel 40 × 98 two-dimensional feature becomes 40 channels of 1 × 98 time-domain features.
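The dimension change itself is a single call; the sketch below reuses the log_mel tensor from the previous snippet and is purely illustrative.

```python
# 1 x 40 x 98 (single-channel 2-D feature) -> 40 x 1 x 98 (40 channels of 1 x 98)
features = log_mel.view(40, 1, 98)       # equivalently: log_mel.reshape(40, 1, 98)
print(features.shape)                    # torch.Size([40, 1, 98])

# For a batched Conv1d implementation the same data is simply read as
# (batch=1, channels=40, length=98); only the interpretation of the
# dimensions changes, not the values.
conv1d_input = log_mel.view(1, 40, 98)
```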
S103, determining the detection result of the speech signal to be processed from the dimension-changed Mel-spectrum feature using the trained time-domain convolutional neural network; the detection result is the keyword or non-keyword corresponding to the speech signal to be processed.
Keyword detection with the trained time-domain convolutional neural network proceeds as follows.
The dimension-changed Mel-spectrum feature is input to the first time-domain convolution layer for convolution. This layer has 40 input channels (the number of Mel filters used in feature extraction), C output channels and a 1 × 3 convolution kernel; convolving the 1 × 98 time-domain feature input with the 1 × 3 kernels yields a layer output of size C × 1 × N, where C is the number of output channels of the layer. Besides the kernel size and the input length, N also depends on the stride and padding of the layer: N = 1 + (M - F + 2P)/S, where M is the input length, i.e. 98; F is the kernel size, i.e. 3; P is the padding; and S is the stride. With stride 1 and padding 1 in the first layer, N = 98.
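A sketch of this first layer using nn.Conv1d (applying a 1 × 3 kernel to a 1 × 98 sequence is equivalent to a 1-D convolution with kernel size 3); the choice C = 64 is an illustrative assumption, not a value fixed by the method.

```python
import torch
import torch.nn as nn

C = 64   # number of output channels of the first layer; illustrative choice

# First time-domain convolution layer: 40 input channels (one per Mel filter),
# kernel size 3 (the 1 x 3 kernel above), stride 1, padding 1 -> N stays 98.
first_conv = nn.Conv1d(in_channels=40, out_channels=C, kernel_size=3,
                       stride=1, padding=1)

x = torch.randn(1, 40, 98)   # reshaped Mel-spectrum feature, batch size 1
y = first_conv(x)
print(y.shape)               # torch.Size([1, 64, 98]): C channels, N = 98
```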
To reduce overfitting, the convolved features are input to a BN layer for regularization.
To add non-linearity to the network, the regularized features are passed through a ReLU activation function and then, in sequence, through six time-domain bottleneck blocks, a global average pooling layer and a fully connected layer to obtain output probabilities for each keyword and for the non-keyword class; the word with the highest output probability is taken as the detection result of the trained time-domain convolutional neural network. Each time-domain bottleneck block comprises, connected in sequence, a pointwise convolution layer with 1 × 1 kernels, a BN layer, a ReLU activation, a depthwise time-domain convolution layer with 1 × 9 kernels, a BN layer, a ReLU activation, a pointwise convolution layer with 1 × 1 kernels, a BN layer and a ReLU activation. Within each time-domain bottleneck block, the input of the first pointwise convolution layer is connected to the output of the last BN layer.
As a specific embodiment, this connection is a shortcut: it copies the input of the block's first convolution layer, adds it to the output of the last BN layer, and passes the sum through the final ReLU activation to produce the input of the next layer.
The 1 × 1 pointwise convolution layers are used to match dimensions and adjust the number of output channels. The 1 × 9 depthwise time-domain convolution layer provides a receptive field comparable to that of a conventional 3 × 3 convolution layer while requiring markedly less computation; its stride, padding and input/output channel counts can be chosen as needed. After the input passes through the depthwise time-domain convolution layer, an output of size C × 1 × N is obtained (C being the number of output channels of that layer). This output then passes through the second 1 × 1 pointwise convolution layer and is added to the block input carried by the shortcut (if the original input is not of size C × 1 × N, a dimension-change operation, implementable with a 1 × 1 convolution kernel, is inserted on the shortcut) to form the input of the next time-domain bottleneck block. This is repeated until the output of the last, i.e. sixth, time-domain bottleneck block is obtained, as sketched below. The sixth block's output enters the global average pooling layer for dimensionality reduction, producing a C-dimensional output (C being the number of output channels of the global average pooling layer). Finally, the fully connected layer outputs an n-dimensional vector, and after softmax n output probabilities are produced, where n is the number of keyword and non-keyword classes configured for detection. Taking n = 12 as an example, with 11 keywords and 1 non-keyword class, 12 output probabilities corresponding to these categories are output.
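The time-domain bottleneck block described above can be sketched as the PyTorch module below. The hidden (expansion) channel count, the bias-free convolutions, and the use of Conv1d with kernel sizes 1 and 9 in place of the 1 × 1 and 1 × 9 kernels are implementation assumptions; the optional 1 × 1 convolution on the shortcut mirrors the dimension-matching operation mentioned above.

```python
import torch.nn as nn

class TimeDomainBottleneckBlock(nn.Module):
    """Pointwise conv -> BN -> ReLU -> depthwise time-domain conv -> BN -> ReLU
    -> pointwise conv -> BN, with the block input added to the last BN output
    via a shortcut and passed through a final ReLU (a sketch of the TBB)."""

    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.pw1 = nn.Conv1d(in_ch, mid_ch, kernel_size=1, bias=False)    # 1 x 1 pointwise
        self.bn1 = nn.BatchNorm1d(mid_ch)
        self.dw = nn.Conv1d(mid_ch, mid_ch, kernel_size=9, stride=stride,
                            padding=4, groups=mid_ch, bias=False)         # 1 x 9 depthwise
        self.bn2 = nn.BatchNorm1d(mid_ch)
        self.pw2 = nn.Conv1d(mid_ch, out_ch, kernel_size=1, bias=False)   # 1 x 1 pointwise
        self.bn3 = nn.BatchNorm1d(out_ch)
        self.relu = nn.ReLU(inplace=True)

        # Shortcut: identity when shapes already match, otherwise a 1 x 1
        # convolution (the dimension-change operation placed on the shortcut).
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Conv1d(in_ch, out_ch, kernel_size=1,
                                      stride=stride, bias=False)
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.pw1(x)))
        out = self.relu(self.bn2(self.dw(out)))
        out = self.bn3(self.pw2(out))
        return self.relu(out + self.shortcut(x))   # add shortcut, then final ReLU
```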
The class with the highest output probability is selected and the corresponding keyword or non-keyword is output. If a keyword is output, keyword detection has succeeded and the detection is recorded; if the non-keyword class is output, nothing is recorded.
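Assembling the pieces, a sketch of the complete classifier and of the decision step follows. The per-block channel widths, n = 12 classes and the assumption that the last index is the non-keyword class are illustrative choices, and TimeDomainBottleneckBlock refers to the sketch above.

```python
import torch
import torch.nn as nn

class KeywordDetectionNet(nn.Module):
    """First time-domain conv + BN + ReLU, six time-domain bottleneck blocks,
    global average pooling and a fully connected layer (a sketch; the
    channel widths below are illustrative assumptions)."""

    def __init__(self, n_mels=40, base_ch=64, num_classes=12):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv1d(n_mels, base_ch, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm1d(base_ch),
            nn.ReLU(inplace=True),
        )
        self.blocks = nn.Sequential(
            *[TimeDomainBottleneckBlock(base_ch, base_ch * 2, base_ch) for _ in range(6)]
        )
        self.pool = nn.AdaptiveAvgPool1d(1)        # global average pooling -> (batch, C, 1)
        self.fc = nn.Linear(base_ch, num_classes)  # n-dimensional output

    def forward(self, x):                          # x: (batch, 40, 98)
        x = self.stem(x)
        x = self.blocks(x)
        x = self.pool(x).squeeze(-1)               # (batch, C)
        return self.fc(x)                          # logits; softmax applied at inference

# Decision step: pick the class with the highest output probability and treat
# the last index as the non-keyword class (an assumption about label order).
model = KeywordDetectionNet()
logits = model(torch.randn(1, 40, 98))
probs = torch.softmax(logits, dim=-1)
pred = int(probs.argmax(dim=-1))
if pred != model.fc.out_features - 1:
    print(f"keyword {pred} detected with probability {probs[0, pred]:.3f}")
```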
Compared with a conventional voice wake-up system that uses a two-dimensional convolutional neural network as its classifier, the present system uses a time-domain convolutional neural network as the classifier and replaces the conventional two-dimensional convolutions with one-dimensional time-domain convolutions. This markedly reduces the computation and memory footprint of the keyword detection system, improves system accuracy and lowers the difficulty of hardware deployment.
Compared with the C × M × N output of a conventional two-dimensional convolution (where C is the number of output channels and M × N is the size of the feature input), the C × N output of the time-domain convolution greatly reduces the feature size, and with it the amount of data to store and the system's computational load, which in turn reduces power consumption. Moreover, because the feature dimensionality is reduced, efficient convolution can be achieved with fewer parameters, improving keyword detection accuracy and ultimately yielding a lightweight keyword detection scheme suitable for mobile devices.
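To make the saving concrete, a back-of-the-envelope comparison of multiply-accumulate (MAC) counts for a single layer under assumed sizes; the channel counts are illustrative and are not taken from the patent.

```python
# Assume C_in = C_out = 64, a 40 x 98 2-D feature map for the 2-D case,
# a length-98 1-D sequence for the time-domain case, and "same" padding.
C_in, C_out, M, N = 64, 64, 40, 98

macs_2d = C_in * C_out * 3 * 3 * M * N     # conventional 3 x 3 2-D convolution
macs_td = C_in * 9 * N + C_in * C_out * N  # 1 x 9 depthwise + 1 x 1 pointwise conv

print(f"3 x 3 2-D convolution:              {macs_2d:,} MACs")  # 144,506,880
print(f"depthwise 1 x 9 + pointwise 1 x 1:  {macs_td:,} MACs")  # 457,856
```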
FIG. 2 is a schematic structural diagram of the keyword detection system provided by the present invention. As shown in FIG. 2, the keyword detection system provided by the present invention comprises:
a speech signal acquisition module 201, configured to acquire a speech signal to be processed;
a Mel-spectrum feature extraction and dimension-change module 202, configured to extract features from the speech signal to be processed, determine Mel-spectrum features and apply dimension-change processing to the Mel-spectrum features; the dimension-changed Mel-spectrum feature is one-dimensional;
a detection result determination module 203, configured to determine the detection result of the speech signal to be processed from the dimension-changed Mel-spectrum feature using the trained time-domain convolutional neural network; the detection result is the keyword or non-keyword corresponding to the speech signal to be processed.
Keyword detection with the trained time-domain convolutional neural network proceeds as follows:
the dimension-changed Mel-spectrum feature is input to the first time-domain convolution layer for convolution;
the convolved features are input to a BN layer for regularization;
the regularized features are passed through a ReLU activation function and then, in sequence, through six time-domain bottleneck blocks, a global average pooling layer and a fully connected layer to obtain output probabilities for each keyword and for the non-keyword class; the word with the highest output probability is taken as the detection result of the trained time-domain convolutional neural network. Each time-domain bottleneck block comprises, connected in sequence, a pointwise convolution layer, a BN layer, a ReLU activation, a depthwise time-domain convolution layer, a BN layer, a ReLU activation, a pointwise convolution layer, a BN layer and a ReLU activation; within each time-domain bottleneck block, the input of the first pointwise convolution layer is connected by a shortcut to the output of the last BN layer.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the parts the embodiments have in common may be referred to across them. Since the disclosed system corresponds to the disclosed method, its description is relatively brief, and the relevant points can be found in the description of the method.
Specific examples have been used herein to explain the principle and embodiments of the present invention; the above description of the embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, a person skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In view of the foregoing, the contents of this specification shall not be construed as limiting the invention.

Claims (4)

1. A keyword detection method, characterized by comprising:
acquiring a speech signal to be processed;
extracting features from the speech signal to be processed, determining Mel-spectrum features, and applying dimension-change processing to the Mel-spectrum features; the dimension-changed Mel-spectrum feature is one-dimensional;
determining a detection result of the speech signal to be processed from the dimension-changed Mel-spectrum feature using a trained time-domain convolutional neural network; the detection result is the keyword or non-keyword corresponding to the speech signal to be processed;
wherein keyword detection with the trained time-domain convolutional neural network comprises:
inputting the dimension-changed Mel-spectrum feature to a first time-domain convolution layer for convolution;
inputting the convolved features to a BN layer for regularization;
passing the regularized features through a ReLU activation function and then, in sequence, through six time-domain bottleneck blocks, a global average pooling layer and a fully connected layer to obtain output probabilities for each keyword and for the non-keyword class, and taking the word with the highest output probability as the detection result of the trained time-domain convolutional neural network; wherein each time-domain bottleneck block comprises, connected in sequence, a pointwise convolution layer, a BN layer, a ReLU activation, a depthwise time-domain convolution layer, a BN layer, a ReLU activation, a pointwise convolution layer, a BN layer and a ReLU activation; the input of the first pointwise convolution layer in each time-domain bottleneck block is connected by a shortcut to the output of the last BN layer; the first time-domain convolution layer has C output channels and a 1 × 3 convolution kernel; the convolution kernels of the pointwise convolution layers are of size 1 × 1; and the convolution kernels of the depthwise time-domain convolution layer are of size 1 × 9.
2. The keyword detection method according to claim 1, wherein extracting features from the speech signal to be processed, determining Mel-spectrum features, and applying dimension-change processing to the Mel-spectrum features specifically comprises:
performing pre-emphasis, framing and windowing, fast Fourier transform, Mel filtering and logarithm operations on the speech signal to be processed to determine the Mel-spectrum features; the frame length used for framing and windowing is 30 ms and the frame shift is 10 ms; the Mel filtering uses 40 Mel filters.
3. The keyword detection method according to claim 1, wherein extracting features from the speech signal to be processed, determining Mel-spectrum features, and applying dimension-change processing to the Mel-spectrum features specifically comprises:
applying dimension-change processing to the Mel-spectrum features using the view function or the reshape function of the PyTorch framework.
4. A keyword detection system for implementing the keyword detection method according to any one of claims 1 to 3, the keyword detection system comprising:
a speech signal acquisition module, configured to acquire a speech signal to be processed;
a Mel-spectrum feature extraction and dimension-change module, configured to extract features from the speech signal to be processed, determine Mel-spectrum features and apply dimension-change processing to the Mel-spectrum features; the dimension-changed Mel-spectrum feature is one-dimensional;
a detection result determination module, configured to determine the detection result of the speech signal to be processed from the dimension-changed Mel-spectrum feature using the trained time-domain convolutional neural network; the detection result is the keyword or non-keyword corresponding to the speech signal to be processed;
wherein keyword detection with the trained time-domain convolutional neural network comprises:
inputting the dimension-changed Mel-spectrum feature to a first time-domain convolution layer for convolution;
inputting the convolved features to a BN layer for regularization;
passing the regularized features through a ReLU activation function and then, in sequence, through six time-domain bottleneck blocks, a global average pooling layer and a fully connected layer to obtain output probabilities for each keyword and for the non-keyword class, and taking the word with the highest output probability as the detection result of the trained time-domain convolutional neural network; wherein each time-domain bottleneck block comprises, connected in sequence, a pointwise convolution layer, a BN layer, a ReLU activation, a depthwise time-domain convolution layer, a BN layer, a ReLU activation, a pointwise convolution layer, a BN layer and a ReLU activation; and the input of the first pointwise convolution layer in each time-domain bottleneck block is connected by a shortcut to the output of the last BN layer.
CN202210952631.8A 2022-08-10 2022-08-10 Keyword detection method and system Active CN115035897B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210952631.8A CN115035897B (en) 2022-08-10 2022-08-10 Keyword detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210952631.8A CN115035897B (en) 2022-08-10 2022-08-10 Keyword detection method and system

Publications (2)

Publication Number Publication Date
CN115035897A CN115035897A (en) 2022-09-09
CN115035897B true CN115035897B (en) 2022-11-11

Family

ID=83130310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210952631.8A Active CN115035897B (en) 2022-08-10 2022-08-10 Keyword detection method and system

Country Status (1)

Country Link
CN (1) CN115035897B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220084535A1 (en) * 2021-10-06 2022-03-17 Intel Corporation Reduced latency streaming dynamic noise suppression using convolutional neural networks

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334800A (en) * 2019-07-18 2019-10-15 南京风兴科技有限公司 A kind of lightweight 3D convolutional network system for video identification
CN112825250A (en) * 2019-11-20 2021-05-21 芋头科技(杭州)有限公司 Voice wake-up method, apparatus, storage medium and program product
CN113344188A (en) * 2021-06-18 2021-09-03 东南大学 Lightweight neural network model based on channel attention module
CN114708855A (en) * 2022-06-07 2022-07-05 中科南京智能技术研究院 Voice awakening method and system based on binary residual error neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"TIME-DOMAIN SPEAKER VERIFICATION USING TEMPORAL CONVOLUTIONAL NETWORKS";Sangwook Han;《ICASSP 2021》;20211231;正文第6688-6691页 *
"基于超轻量通道注意力的端对端语音增强方法";洪依;《智能科学与技术学报》;20210930;第3卷(第3期);正文第351-357页 *

Also Published As

Publication number Publication date
CN115035897A (en) 2022-09-09

Similar Documents

Publication Publication Date Title
KR102213013B1 (en) Frequency-based audio analysis using neural networks
US10937438B2 (en) Neural network generative modeling to transform speech utterances and augment training data
CN109448719B (en) Neural network model establishing method, voice awakening method, device, medium and equipment
CN111583903B (en) Speech synthesis method, vocoder training method, device, medium, and electronic device
CN110718211B (en) Keyword recognition system based on hybrid compressed convolutional neural network
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN109919295B (en) Embedded audio event detection method based on lightweight convolutional neural network
CN111357051B (en) Speech emotion recognition method, intelligent device and computer readable storage medium
CN109543029B (en) Text classification method, device, medium and equipment based on convolutional neural network
US11854536B2 (en) Keyword spotting apparatus, method, and computer-readable recording medium thereof
CN114708855B (en) Voice awakening method and system based on binary residual error neural network
Peter et al. End-to-end keyword spotting using neural architecture search and quantization
CN112233675A (en) Voice awakening method and system based on separation convolutional neural network
CN113409827B (en) Voice endpoint detection method and system based on local convolution block attention network
CN115035897B (en) Keyword detection method and system
CN111489739B (en) Phoneme recognition method, apparatus and computer readable storage medium
CN111259189A (en) Music classification method and device
CN112397086A (en) Voice keyword detection method and device, terminal equipment and storage medium
Pan et al. An efficient hybrid learning algorithm for neural network–based speech recognition systems on FPGA chip
CN107919136B (en) Digital voice sampling frequency estimation method based on Gaussian mixture model
CN112989106B (en) Audio classification method, electronic device and storage medium
CN111160517A (en) Convolutional layer quantization method and device of deep neural network
CN113409775B (en) Keyword recognition method and device, storage medium and computer equipment
CN113609970A (en) Underwater target identification method based on grouping convolution depth U _ Net
CN113763976A (en) Method and device for reducing noise of audio signal, readable medium and electronic equipment

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant