CN113362854A - Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment - Google Patents

Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment

Info

Publication number
CN113362854A
Authority
CN
China
Prior art keywords
acoustic event
event detection
attention mechanism
detection method
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110619344.0A
Other languages
Chinese (zh)
Other versions
CN113362854B (en)
Inventor
韩纪庆
关亚东
薛嘉宾
郑贵滨
郑铁然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202110619344.0A
Publication of CN113362854A
Application granted
Publication of CN113362854B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 — specially adapted for particular use
    • G10L25/51 — specially adapted for particular use for comparison or discrimination
    • G10L25/03 — characterised by the type of extracted parameters
    • G10L25/18 — the extracted parameters being spectral information of each sub-band
    • G10L25/27 — characterised by the analysis technique
    • G10L25/30 — characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

An acoustic event detection method, system, storage medium and device based on a sparse self-attention mechanism, belonging to the field of machine auditory intelligence. The method aims to solve the problem that existing temporal feature extraction networks cannot model temporal structure effectively, which limits the performance of existing acoustic event detection systems. A Mel spectrogram is first extracted from the input audio signal; the Mel spectrogram is then input into a convolutional neural network to extract local features, and temporal features are extracted with a Transformer Encoder based on a sparse self-attention mechanism; finally, the features are input into a fully connected layer for classification, the results are post-processed, and the category and start-stop time of each detected acoustic event are output. The method is mainly intended for the detection of acoustic events.

Description

Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment
Technical Field
The invention belongs to the field of machine auditory intelligence and relates to an acoustic event detection method, system, storage medium and device.
Background
Acoustic event detection refers to analyzing and processing a sound signal to identify the types of acoustic events occurring in the audio signal and the start and end time of each acoustic event. Acoustic event detection has wide application prospects in security, smart homes, smart cities, multimedia information retrieval, biodiversity monitoring, and environmental perception for intelligent robots.
An existing acoustic event detection system structurally comprises a spectral feature extraction module, a neural network module and a post-processing module, with the neural network module as its core. The neural network module consists of two main parts: a local feature extraction network and a temporal feature extraction network. The existing temporal feature extraction network usually adopts a self-attention mechanism, so that when processing the features of a specific time instant the network is influenced by the features of all time instants in the audio segment; in practice, the features of many time instants are useless or even harmful for modeling the features of the current instant. The network therefore cannot model temporal structure effectively, which limits the performance of existing acoustic event detection systems and affects the practicality of the prior art.
Disclosure of Invention
The invention aims to solve the problem that existing temporal feature extraction networks cannot model temporal structure effectively, which limits the performance of existing acoustic event detection systems.
An acoustic event detection method based on a sparse self-attention mechanism comprises the following steps:
A Mel spectrogram is first extracted from the input audio signal; the Mel spectrogram is then input into a convolutional neural network to extract local features, and temporal features are extracted with a Transformer Encoder based on a sparse self-attention mechanism; finally, the temporal features are input into a fully connected layer for classification, the results are post-processed, and the category and start-stop time of each detected acoustic event are output;
the process of extracting the temporal features comprises the following steps:
the extracted local features H_i are input into a single-layer Transformer Encoder model, and the attention weights are normalized with a sparse normalization method; the normalization of the attention weight matrix comprises the following steps:
2.1 Let the t-th column of A be A_t, and sort the elements of A_t in descending order; A is the attention weight matrix in the self-attention layer;
2.2 Find the intermediate parameter k_t satisfying:
k_t = max{ k ∈ [T] | 1 + k·A_{t,k} > Σ_{j≤k} A_{t,j} }
where T is the size of the time dimension, [T] = {1, 2, ..., T}, and A_{t,k}, A_{t,j} are the k-th and j-th elements of the vector A_t, respectively;
2.3 Compute the threshold τ_t:
τ_t = (Σ_{j≤k_t} A_{t,j} − 1) / k_t
2.4 For each element j of A_t, compute:
A'_{t,j} = [A_{t,j} − τ_t]_+
where [·]_+ denotes [x]_+ = max{0, x};
2.5 Return to step 2.1 until t = T, obtaining the normalized attention weight matrix A'.
Further, the attention weight matrix in the self-attention layer is:
A = QKᵀ / √(d_k)
where Q and K are the Query and Key matrices in self-attention, respectively, and d_k is the feature dimension.
Further, the Mel spectrogram is input into the convolutional neural network to extract local features; the convolutional neural network is composed of at least one convolution module, wherein each convolution module comprises a convolutional layer, a normalization layer, a nonlinear layer and a max-pooling layer.
Further, the convolutional neural network for extracting local features is composed of seven convolution modules, and the numbers of convolution filters are 16, 32, 64, 128 and 128 in sequence; the pooling sizes of the max-pooling layers are (2,2), (1,2) and (1,2).
The convolutional layers in each convolution module are two-dimensional convolutional layers with a kernel size of (3,3) and a stride of (1,1).
Further, the final classification with the fully connected layer comprises the following steps:
the features are classified using a fully connected layer with a hidden size of 128 and a Sigmoid activation function.
Further, the process of post-processing the results and outputting the category and start-stop time of each detected acoustic event comprises the following steps:
the output probabilities are smoothed with a median filter to obtain the acoustic event prediction probability ŷ_{t,c}; a value of ŷ_{t,c} greater than 0.5 indicates that an acoustic event of class c occurs at time t, and otherwise no event of class c occurs; this gives, for each time instant, the information of whether each sound event is active, from which the onset and offset times of each sound event are obtained.
Further, the process of extracting the mel spectrogram from the input audio signal comprises the following steps:
the input sound signal is a sound segment of 10 seconds, and the sampling rate is 16 kHz; the Mel spectrum extraction process adopts a window length of 2048, frame shift of 255, 128 Mel domain filters, and combines the numbersThe values are mapped to a natural log domain; finally, a 10 second sound fragment, the extracted Mel spectrogram XiIs (648,128); where 648 is the number of frames and 128 is the order of the mel-filter coefficients.
An acoustic event detection system based on a sparse self-attention mechanism, the system being configured to perform a sparse self-attention mechanism based acoustic event detection method.
A storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement a sparse self-attention mechanism based acoustic event detection method.
An apparatus comprising a processor and a memory, the memory having stored therein at least one instruction that is loaded and executed by the processor to implement a sparse self-attention mechanism based acoustic event detection method.
Advantageous effects:
By sparsifying the attention weights, the proposed method reduces the coupling of the model to irrelevant time instants when modeling the temporal structure of sound events, achieving more effective temporal modeling and improving the performance of existing acoustic event detection systems. The method has been verified on an internationally published acoustic event detection dataset, and the results show that its classification performance is greatly improved compared with existing systems.
Drawings
Fig. 1 is a schematic diagram of an acoustic event detection method based on a sparse self-attention mechanism.
Fig. 2 is a schematic structural diagram of the convolutional neural network portion in fig. 1, wherein x7 indicates that the left bracket contains modules stacked 7 times.
FIG. 3 is a schematic structural diagram of the Transformer Encoder network portion of FIG. 1, which includes the proposed self-attention weight sparsification method.
Fig. 4 is a graph comparing the detection performance of the method of the present invention and the original baseline system on the international public data set.
Detailed Description
The first embodiment is as follows:
the invention provides an acoustic event detection method based on a sparse self-attention mechanism, which is used for replacing a Softmax normalization method in the self-attention mechanism with a sparse normalization method, so that coupling with useless features at other moments can be selectively reduced when a time sequence feature extraction network carries out time sequence modeling, and the time sequence features of an acoustic signal can be more effectively modeled.
Fig. 1 is a schematic diagram of an implementation model of the acoustic event detection method based on a sparse self-attention mechanism. First, a Mel spectrogram is extracted from the input audio signal; the Mel spectrogram is then input into a convolutional neural network (CNN) to extract local features, and temporal features are extracted with a Transformer Encoder based on a sparse self-attention mechanism; finally, the temporal features are input into a fully connected layer for classification, the results are post-processed, and the category and start-stop time of each detected acoustic event are output. Specifically, the method comprises the following steps:
step 1, extracting local features of the audio signal.
Step 1.1, extracting a Mel spectrogram.
First, the commonly used Mel spectrogram features are extracted from the input sound signal as the model input. In some embodiments, the input sound signal is a 10-second sound segment with a sample rate of 16 kHz. The Mel spectrum extraction uses a window length of 2048, a frame shift of 255 and 128 Mel-domain filters, and maps the values to the natural-log domain. For a 10-second sound segment, the extracted Mel spectrogram X_i has shape (648, 128), where 648 is the number of frames and 128 is the number of Mel filter bands.
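As an illustration of this step only (not the patent's reference code), the following minimal sketch uses the librosa library with the parameters above; the function name is an assumption, and the exact frame count can differ slightly from 648 depending on padding conventions.

import numpy as np
import librosa

def extract_log_mel(wav_path):
    # Load a 10-second clip at a 16 kHz sampling rate.
    y, sr = librosa.load(wav_path, sr=16000, duration=10.0)
    # Window length 2048, frame shift 255, 128 Mel-domain filters.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=255, n_mels=128)
    # Map the values to the natural-log domain (small offset avoids log(0)).
    log_mel = np.log(mel + 1e-10)
    # Transpose to (frames, 128) so that X_i has the (frame, Mel-band) layout.
    return log_mel.T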
Step 1.2, extracting local features.
The extracted Mel spectrogram is input into a convolutional neural network model. The convolutional neural network consists of a series of convolution modules, each containing a convolutional layer, a normalization layer, a nonlinear layer and a max-pooling layer, as shown in Fig. 2. In some embodiments, 7 convolution modules are used; the two-dimensional convolution kernel size is (3,3) with a stride of (1,1), and the numbers of convolution filters are (16, 32, 64, 128) in that order. The pooling sizes of the max-pooling layers are ((2,2), (2,2), (1,2), (1,2), (1,2)). All two-dimensional convolutional layers, max-pooling layers, batch normalization layers and rectified linear units are standard components of common neural network frameworks.
The input Mel spectrogram is mapped by the convolutional neural network to local features H_i, where H_i has shape (157, 128); 157 is the time dimension and 128 is the feature dimension.
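A hedged PyTorch sketch of this local-feature extractor follows. The kernel size, stride, and the filter and pooling values listed above are used where given; the remaining per-module values, the class names, and the handling of the residual frequency axis are illustrative assumptions rather than the patent's reference implementation.

import torch
import torch.nn as nn

class ConvModule(nn.Module):
    # One convolution module: conv layer -> batch norm -> ReLU -> max pooling.
    def __init__(self, in_ch, out_ch, pool):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=(3, 3), stride=(1, 1), padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.MaxPool2d(pool),
        )

    def forward(self, x):
        return self.net(x)

class LocalFeatureCNN(nn.Module):
    # Stack of 7 convolution modules mapping the Mel spectrogram to local features H_i.
    def __init__(self):
        super().__init__()
        channels = [1, 16, 32, 64, 128, 128, 128, 128]                     # trailing 128s assumed
        pools = [(2, 2), (2, 2), (1, 2), (1, 2), (1, 2), (1, 2), (1, 2)]   # last two pools assumed
        self.blocks = nn.Sequential(
            *[ConvModule(channels[i], channels[i + 1], pools[i]) for i in range(7)]
        )

    def forward(self, mel):                  # mel: (batch, 1, frames, 128)
        h = self.blocks(mel)                 # (batch, 128, frames', freq')
        h = h.mean(dim=-1)                   # collapse the remaining frequency axis (assumption)
        return h.transpose(1, 2)             # (batch, frames', 128), i.e. H_i per clip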
Step 2, extracting temporal features.
The extracted local features H_i are input into a single-layer Transformer Encoder model. The Transformer Encoder consists of a fully connected layer, a self-attention layer and a dropout layer; the detailed configuration and parameters of the model are shown in FIG. 3. The number of attention heads is 16, the dimension of the linear mapping layer is 512, and the dropout rate is 0.2. The attention weights are normalized with the proposed sparse normalization method; the other components are standard components of common neural network frameworks. The network output tensor is denoted M_i, with shape (157, 128).
The attention weight matrix in the self-attention layer is computed as:
A = QKᵀ / √(d_k)    (1)
where Q and K are the Query and Key matrices in self-attention, respectively, and d_k is the feature dimension.
The normalization operation on the obtained attention weight matrix is as follows:
2.1 Let the t-th column of A be A_t, and sort the elements of A_t in descending order;
2.2 Find the intermediate parameter k_t satisfying:
k_t = max{ k ∈ [T] | 1 + k·A_{t,k} > Σ_{j≤k} A_{t,j} }    (2)
where T is the size of the time dimension, [T] = {1, 2, ..., T}, and A_{t,k}, A_{t,j} are the k-th and j-th elements of the vector A_t, respectively;
2.3 Compute the threshold τ_t:
τ_t = (Σ_{j≤k_t} A_{t,j} − 1) / k_t    (3)
2.4 For each element j of A_t, compute:
A'_{t,j} = [A_{t,j} − τ_t]_+    (4)
where [·]_+ denotes [x]_+ = max{0, x}.
2.5 Return to step 2.1 until t = T, obtaining the normalized attention weight matrix A'.
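For concreteness, steps 2.1-2.5 amount to a sparsemax-style projection of each column of A. The following is a minimal NumPy sketch, written under the assumption that A holds the scaled scores of equation (1) and that each column A_t is processed independently as described above; the function name and the usage lines at the end are illustrative assumptions, not the patent's reference implementation.

import numpy as np

def sparse_normalize(A):
    # Steps 2.1-2.5: sparse normalization of each column A_t of the weight matrix A.
    A = np.asarray(A, dtype=np.float64)
    T = A.shape[0]                                   # size of the time dimension
    A_prime = np.zeros_like(A)
    for t in range(A.shape[1]):                      # loop over columns A_t
        a = A[:, t]
        srt = np.sort(a)[::-1]                       # 2.1: sort in descending order
        csum = np.cumsum(srt)
        ks = np.arange(1, T + 1)
        k_t = ks[1 + ks * srt > csum].max()          # 2.2: largest k with 1 + k*A_{t,k} > sum_{j<=k} A_{t,j}
        tau_t = (csum[k_t - 1] - 1.0) / k_t          # 2.3: threshold tau_t
        A_prime[:, t] = np.maximum(a - tau_t, 0.0)   # 2.4: [A_{t,j} - tau_t]_+
    return A_prime                                   # 2.5: normalized weight matrix A'

# Illustrative usage with random Query/Key matrices (shapes assumed):
T, d_k = 157, 128
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k))
A = Q @ K.T / np.sqrt(d_k)         # scaled attention scores, equation (1)
A_prime = sparse_normalize(A)      # each column now sums to 1, with relatively small weights set exactly to 0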
The characteristics of the proposed procedure are analyzed as follows. When performing temporal modeling for an acoustic event of class S at a certain time t_1, the ideal situation is to take a weighted sum only over the features of the time instants in the audio segment that belong to class S, while the weights of features at other time instants t_2 not belonging to class S are 0; here the weight means the similarity between the feature at time t_1 and the feature at time t_2. In general, a larger attention weight value reflects similarity between acoustic events of the same class, while a smaller attention weight value often reflects similarity between acoustic events of different classes.
Compared with the ordinary normalized attention weights obtained by the softmax transformation, the normalized attention weights A' obtained by the above method can ignore relatively small attention weight values. Therefore, when the normalized attention weights are used for the weighted summation of features at different time instants, the sparse attention weights A' make the neural network less affected by acoustic event features of other categories when modeling acoustic event features of the same category. Moreover, the invention does not simply zero out small weight values with a hard threshold; instead, it considers all weight values jointly and adaptively sets the relatively small ones to 0.
In conclusion, the proposed method is beneficial to the neural network's modeling of sound events and improves the performance of the acoustic event detection system.
Step 3, feature classification.
The features are classified with a fully connected layer, with a hidden size of 128 and an output dimension of (157, 10); the activation function is a Sigmoid, and the output matrix is denoted O_i.
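A small PyTorch sketch of this classification step, under the assumption that the 128-dimensional encoder output M_i is mapped frame-wise to the 10 classes; the class name is illustrative.

import torch
import torch.nn as nn

class FrameClassifier(nn.Module):
    # Fully connected classification head with Sigmoid activation.
    def __init__(self, feat_dim=128, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, m):                    # m: (batch, 157, 128), encoder output M_i
        return torch.sigmoid(self.fc(m))     # O_i: (batch, 157, 10) frame-wise probabilities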
Step 4, post-processing.
The output probabilities are smoothed with a median filter to obtain the acoustic event prediction probability ŷ_{t,c}. A value of ŷ_{t,c} greater than 0.5 indicates that an acoustic event of class c occurs at time t; otherwise no event of class c occurs. This yields, for each time instant, the information of whether each sound event is active, from which the onset and offset times of each sound event are obtained.
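A hedged sketch of this post-processing using scipy's median filter; the filter length, the frame-to-seconds conversion and the function name are illustrative assumptions rather than values fixed by the text.

import numpy as np
from scipy.signal import medfilt

def post_process(probs, frame_dur=10.0 / 157, win=7, thr=0.5):
    # probs: (157, 10) frame-wise probabilities O_i; returns (class, onset_s, offset_s) events.
    events = []
    for c in range(probs.shape[1]):
        smoothed = medfilt(probs[:, c], kernel_size=win)   # median-filter smoothing
        active = smoothed > thr                            # class c active at frame t when y_hat > 0.5
        t = 0
        while t < len(active):
            if active[t]:
                onset = t
                while t < len(active) and active[t]:
                    t += 1
                events.append((c, onset * frame_dur, t * frame_dur))
            else:
                t += 1
    return events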
Examples
To verify the effectiveness of the method, the scheme of the first embodiment was evaluated on the internationally published acoustic event detection dataset DESED and compared with the original baseline method. As shown in Fig. 4, the proposed method achieves better recognition performance than the original baseline system on all ten acoustic event classes. The average performance of the original system on this dataset is 44.22%, while the average performance of the proposed method is 47.65%, exceeding the performance of the first-ranked single model of DCASE 2020 Challenge Task 4. The experimental results therefore fully verify the effectiveness of the invention.
The second embodiment is as follows:
the embodiment is an acoustic event detection system based on a sparse self-attention mechanism, and the system is used for executing an acoustic event detection method based on the sparse self-attention mechanism.
The third embodiment is as follows:
the present embodiment is a storage medium having at least one instruction stored therein, where the at least one instruction is loaded and executed by a processor to implement a sparse auto-attention mechanism based acoustic event detection method.
The fourth embodiment is as follows:
the present embodiment is an apparatus comprising a processor and a memory, the storage medium having at least one instruction stored therein, the at least one instruction being loaded and executed by the processor to implement a sparse self-attention mechanism based acoustic event detection method.
The above-described calculation examples of the present invention are merely to explain the calculation model and the calculation flow of the present invention in detail, and are not intended to limit the embodiments of the present invention. It will be apparent to those skilled in the art that other variations and modifications of the present invention can be made based on the above description, and it is not intended to be exhaustive or to limit the invention to the precise form disclosed, and all such modifications and variations are possible and contemplated as falling within the scope of the invention.

Claims (10)

1. An acoustic event detection method based on a sparse self-attention mechanism comprises the following steps:
A Mel spectrogram is first extracted from the input audio signal; the Mel spectrogram is then input into a convolutional neural network to extract local features, and temporal features are extracted with a Transformer Encoder based on a sparse self-attention mechanism; finally, the temporal features are input into a fully connected layer for classification, the results are post-processed, and the category and start-stop time of each detected acoustic event are output;
the method is characterized in that the process of extracting the temporal features comprises the following steps:
the extracted local features H_i are input into a single-layer Transformer Encoder model, and the attention weights are normalized with a sparse normalization method; the normalization of the attention weight matrix comprises the following steps:
2.1 Let the t-th column of A be A_t, and sort the elements of A_t in descending order; A is the attention weight matrix in the self-attention layer;
2.2 Find the intermediate parameter k_t satisfying:
k_t = max{ k ∈ [T] | 1 + k·A_{t,k} > Σ_{j≤k} A_{t,j} }
where T is the size of the time dimension, [T] = {1, 2, ..., T}, and A_{t,k}, A_{t,j} are the k-th and j-th elements of the vector A_t, respectively;
2.3 Compute the threshold τ_t:
τ_t = (Σ_{j≤k_t} A_{t,j} − 1) / k_t
2.4 For each element j of A_t, compute:
A'_{t,j} = [A_{t,j} − τ_t]_+
where [·]_+ denotes [x]_+ = max{0, x};
2.5 Return to step 2.1 until t = T, obtaining the normalized attention weight matrix A'.
2. The sparse self-attention mechanism-based acoustic event detection method according to claim 1, wherein the attention weight matrix in the self-attention layer is:
A = QKᵀ / √(d_k)
where Q and K are the Query and Key matrices in self-attention, respectively, and d_k is the feature dimension.
3. The sparse self-attention mechanism-based acoustic event detection method according to claim 2, wherein the Mel spectrogram is input into the convolutional neural network to extract local features, the convolutional neural network being composed of at least one convolution module, the convolution module comprising a convolutional layer, a normalization layer, a nonlinear layer and a max-pooling layer.
4. The sparse self-attention mechanism-based acoustic event detection method according to claim 3, wherein the convolutional neural network for extracting local features is composed of seven convolution modules, and the numbers of convolution filters are 16, 32, 64 and 128 in sequence; the pooling sizes of the max-pooling layers are (2,2), (1,2) and (1,2).
The convolutional layers in each convolution module are two-dimensional convolutional layers with a kernel size of (3,3) and a stride of (1,1).
5. The sparse self-attention mechanism-based acoustic event detection method according to claim 4, wherein the final classification with the fully connected layer comprises the following steps:
the features are classified using a fully connected layer with a hidden size of 128 and a Sigmoid activation function.
6. The sparse self-attention mechanism-based acoustic event detection method according to any one of claims 1 to 5, wherein the step of post-processing the results and outputting the category and start-stop time of each detected acoustic event comprises the following steps:
the output probabilities are smoothed with a median filter to obtain the acoustic event prediction probability ŷ_{t,c}; a value of ŷ_{t,c} greater than 0.5 indicates that an acoustic event of class c occurs at time t, and otherwise no event of class c occurs; this gives, for each time instant, the information of whether each sound event is active, from which the onset and offset times of each sound event are obtained.
7. The sparse self-attention mechanism-based acoustic event detection method according to claim 6, wherein the process of extracting the Mel spectrogram from the input audio signal comprises the following steps:
the input sound signal is a 10-second sound segment with a sampling rate of 16 kHz; the Mel spectrum extraction uses a window length of 2048, a frame shift of 255 and 128 Mel-domain filters, and the values are mapped to the natural-log domain; for a 10-second sound segment, the extracted Mel spectrogram X_i has shape (648, 128), where 648 is the number of frames and 128 is the number of Mel filter bands.
8. An acoustic event detection system based on a sparse self-attention mechanism, characterized in that the system is configured to perform a sparse self-attention mechanism based acoustic event detection method according to one of claims 1 to 7.
9. A storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement a sparse self-attention mechanism based acoustic event detection method as claimed in any one of claims 1 to 7.
10. An apparatus comprising a processor and a memory, wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement a sparse self-attention mechanism based acoustic event detection method as claimed in any one of claims 1 to 7.
CN202110619344.0A 2021-06-03 2021-06-03 Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment Active CN113362854B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110619344.0A CN113362854B (en) 2021-06-03 2021-06-03 Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110619344.0A CN113362854B (en) 2021-06-03 2021-06-03 Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN113362854A true CN113362854A (en) 2021-09-07
CN113362854B CN113362854B (en) 2022-11-15

Family

ID=77531749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110619344.0A Active CN113362854B (en) 2021-06-03 2021-06-03 Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN113362854B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023245991A1 (en) * 2022-06-24 2023-12-28 南方电网调峰调频发电有限公司储能科研院 Power plant equipment state auditory monitoring method merging frequency band top-down attention mechanism

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110600059A (en) * 2019-09-05 2019-12-20 Oppo广东移动通信有限公司 Acoustic event detection method and device, electronic equipment and storage medium
CN111145718A (en) * 2019-12-30 2020-05-12 中国科学院声学研究所 Chinese mandarin character-voice conversion method based on self-attention mechanism
CN111899760A (en) * 2020-07-17 2020-11-06 北京达佳互联信息技术有限公司 Audio event detection method and device, electronic equipment and storage medium
CN111933188A (en) * 2020-09-14 2020-11-13 电子科技大学 Sound event detection method based on convolutional neural network
CN112802484A (en) * 2021-04-12 2021-05-14 四川大学 Panda sound event detection method and system under mixed audio frequency
CN113223506A (en) * 2021-05-28 2021-08-06 思必驰科技股份有限公司 Speech recognition model training method and speech recognition method
US20220068462A1 (en) * 2020-08-28 2022-03-03 doc.ai, Inc. Artificial Memory for use in Cognitive Behavioral Therapy Chatbot
US20220108698A1 (en) * 2020-10-07 2022-04-07 Mitsubishi Electric Research Laboratories, Inc. System and Method for Producing Metadata of an Audio Signal

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110600059A (en) * 2019-09-05 2019-12-20 Oppo广东移动通信有限公司 Acoustic event detection method and device, electronic equipment and storage medium
CN111145718A (en) * 2019-12-30 2020-05-12 中国科学院声学研究所 Chinese mandarin character-voice conversion method based on self-attention mechanism
CN111899760A (en) * 2020-07-17 2020-11-06 北京达佳互联信息技术有限公司 Audio event detection method and device, electronic equipment and storage medium
US20220068462A1 (en) * 2020-08-28 2022-03-03 doc.ai, Inc. Artificial Memory for use in Cognitive Behavioral Therapy Chatbot
CN111933188A (en) * 2020-09-14 2020-11-13 电子科技大学 Sound event detection method based on convolutional neural network
US20220108698A1 (en) * 2020-10-07 2022-04-07 Mitsubishi Electric Research Laboratories, Inc. System and Method for Producing Metadata of an Audio Signal
CN112802484A (en) * 2021-04-12 2021-05-14 四川大学 Panda sound event detection method and system under mixed audio frequency
CN113223506A (en) * 2021-05-28 2021-08-06 思必驰科技股份有限公司 Speech recognition model training method and speech recognition method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ANDRÉ F. T. MARTINS et al.: "From softmax to sparsemax: A sparse model of attention and multi-label classification", arXiv:1602.02068v2
GONÇALO M. CORREIA et al.: "Adaptively Sparse Transformers", arXiv:1909.00015v1
KOICHI MIYAZAKI et al.: "Convolution-augmented Transformer for semi-supervised sound event detection", Detection and Classification of Acoustic Scenes and Events (DCASE) 2020
M. KOICHI et al.: "Weakly supervised sound event detection with self-attention", ICASSP 2020
QIUQIANG KONG: "Sound Event Detection of Weakly Labelled Data with CNN-Transformer and Automatic Threshold Optimization", arXiv:1912.04761v2

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023245991A1 (en) * 2022-06-24 2023-12-28 南方电网调峰调频发电有限公司储能科研院 Power plant equipment state auditory monitoring method merging frequency band top-down attention mechanism

Also Published As

Publication number Publication date
CN113362854B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
Bayar et al. On the robustness of constrained convolutional neural networks to jpeg post-compression for image resampling detection
CN111651504A (en) Multi-element time sequence multilayer space-time dependence modeling method based on deep learning
US20140019390A1 (en) Apparatus and method for audio fingerprinting
CN111564163B (en) RNN-based multiple fake operation voice detection method
CN110968845B (en) Detection method for LSB steganography based on convolutional neural network generation
CN111526144A (en) Abnormal flow detection method and system based on DVAE-Catboost
CN113362854B (en) Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment
CN111091839A (en) Voice awakening method and device, storage medium and intelligent device
CN116150509B (en) Threat information identification method, system, equipment and medium for social media network
CN111276133B (en) Audio recognition method, system, mobile terminal and storage medium
CN116527357A (en) Web attack detection method based on gate control converter
Liang et al. Image resampling detection based on convolutional neural network
CN109617864B (en) Website identification method and website identification system
Dehdar et al. Image steganalysis using modified graph clustering based ant colony optimization and Random Forest
CN114615010A (en) Design method of edge server-side intrusion prevention system based on deep learning
CN106909944B (en) Face picture clustering method
CN117375896A (en) Intrusion detection method and system based on multi-scale space-time feature residual fusion
CN112418173A (en) Abnormal sound identification method and device and electronic equipment
Xin et al. Research on feature selection of intrusion detection based on deep learning
CN116935303A (en) Weak supervision self-training video anomaly detection method
CN114554491A (en) Wireless local area network intrusion detection method based on improved SSAE and DNN models
CN114171057A (en) Transformer event detection method and system based on voiceprint
Kung et al. Augment deep BP-parameter learning with local XAI-structural learning
CN112769619A (en) Multi-classification network fault prediction method based on decision tree
Jia et al. A Method of Malicious Data Flow Detection Based on Convolutional Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant