CN113362854B - Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment - Google Patents


Info

Publication number
CN113362854B
Authority
CN
China
Prior art keywords: acoustic event, attention mechanism, event detection, detection method, attention
Legal status: Active
Application number
CN202110619344.0A
Other languages
Chinese (zh)
Other versions
CN113362854A
Inventor
韩纪庆
关亚东
薛嘉宾
郑贵滨
郑铁然
Current Assignee
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date: 2021-06-03
Filing date: 2021-06-03
Publication date: 2022-11-15
Application filed by Harbin Institute of Technology
2021-06-03: Priority to CN202110619344.0A
2021-09-07: Publication of CN113362854A
2022-11-15: Application granted
2022-11-15: Publication of CN113362854B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

An acoustic event detection method, system, storage medium and device based on a sparse self-attention mechanism belong to the field of machine auditory intelligence. The invention aims to solve the problem that existing temporal feature extraction networks cannot achieve effective temporal modeling, which limits the performance of existing acoustic event detection systems. First, a Mel spectrogram is extracted from the input audio signal; the Mel spectrogram is then input into a convolutional neural network to extract local features, and a Transformer encoder based on a sparse self-attention mechanism extracts the temporal features; finally, the features are input into a fully connected layer for classification, the result is post-processed, and the category and start-stop time of each detected acoustic event are output. The invention is mainly used for the detection of acoustic events.

Description

Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment
Technical Field
The invention belongs to the field of machine auditory intelligence, and relates to a method, a system, a storage medium and a device for detecting acoustic events.
Background
Acoustic event detection refers to the analysis and processing of sound signals to identify the types of acoustic events occurring in an audio signal and the start-stop time of each acoustic event. Acoustic event detection has wide application prospects in security, smart homes, smart cities, multimedia information retrieval, biodiversity monitoring, and environment perception for intelligent robots.
Structurally, an existing acoustic event detection system comprises a spectral feature extraction module, a neural network module and a post-processing module, with the neural network module being the core module. The neural network module mainly consists of two parts: a local feature extraction network and a temporal feature extraction network. The existing temporal feature extraction network usually adopts a self-attention mechanism, so that when processing the features of a specific moment the network is influenced by the features of all moments in the audio segment; in fact, the features of many moments are useless or even harmful for modeling the features of the current moment. Consequently, the network cannot achieve effective temporal modeling, which limits the performance of existing acoustic event detection systems and affects the practicality of the prior art.
Disclosure of Invention
The invention aims to solve the problem that the existing temporal feature extraction network cannot achieve effective temporal modeling, which limits the performance of existing acoustic event detection systems.
An acoustic event detection method based on a sparse self-attention mechanism comprises the following steps:
firstly, extracting a Mel spectrogram from an input audio signal, then inputting the Mel spectrogram into a convolutional neural network to extract local features, and extracting time domain features by using a Transformer Encoder based on a sparse self-attention mechanism; finally, inputting the extracted features into a fully connected layer for classification, post-processing the result, and outputting the category and start-stop time of each detected acoustic event;
the process of extracting the time domain features comprises the following steps:
extracting local feature H i Inputting the data into a single-layer Transformer Encoder model, and normalizing the attention weight by adopting a sparse normalization method; the normalization operation on the obtained attention weight matrix comprises the following steps:
2.1 The t-th column of A is denoted A_t; the elements of A_t are sorted in descending order; A is the attention weight matrix in the self-attention layer;
2.2 The intermediate parameter k_t satisfying the following condition is found:

k_t = \max\{k \in [T] \mid 1 + k A_{t,k} > \sum_{j \le k} A_{t,j}\}

where T denotes the size of the time dimension, [T] = {1, 2, ..., T}, and A_{t,k}, A_{t,j} are the k-th and j-th elements of the sorted vector A_t, respectively;
2.3 The threshold τ_t is computed:

\tau_t = \frac{\sum_{j \le k_t} A_{t,j} - 1}{k_t}

2.4 For each element j of A_t, the following is computed:

A'_{t,j} = [A_{t,j} - \tau_t]_+

where [·]_+ denotes [·]_+ = max{0, ·};
2.5 Return to step 2.1 for the next t until t = T, yielding the normalized attention weight matrix A'.
Further, the attention weight matrix in the self-attention layer is

A = \frac{Q K^\top}{\sqrt{d_k}}

where Q and K are the Query and Key matrices in self-attention, and d_k is the feature dimension.
Further, the Mel spectrogram is input into a convolutional neural network to extract local features; the convolutional neural network is composed of at least one convolution module, each convolution module comprising a convolutional layer, a normalization layer, a nonlinear layer and a max pooling layer.
Further, the convolutional neural network for extracting local features consists of seven convolution modules, the numbers of stacked convolution filters being 16, 32, 64, 128, 128, 128 and 128 in sequence; the pooling sizes of the max pooling layers are (2,2), (2,2), (1,2), (1,2), (1,2), (1,2) and (1,2).
The convolutional layers in each convolution module are two-dimensional convolutional layers with a kernel size of (3,3) and a stride of (1,1).
Further, the process of inputting the features into the fully connected layer for classification comprises the following steps:
the features are classified using a fully connected layer, wherein the hidden layer parameter is 128 and the activation function is the Sigmoid activation function.
Further, the process of post-processing the result and outputting the category and start-stop time of each detected acoustic event comprises the following steps:
the output probabilities are smoothed with a median filter to obtain the predicted probability of each acoustic event; when the predicted probability of class c at time t is greater than 0.5, an acoustic event of class c is considered to occur at time t, and otherwise the event of class c is considered not to occur; the information of whether each sound event occurs at each moment is thus obtained, from which the onset and offset times of the sound events can be derived.
Further, the process of extracting the Mel spectrogram from the input audio signal comprises the following steps:
the input sound signal is a 10-second sound segment with a sampling rate of 16 kHz; the Mel spectrogram is extracted with a window length of 2048, a frame shift of 255 and 128 Mel-domain filters, and the values are mapped to the natural log domain; for a 10-second sound segment, the extracted Mel spectrogram X_i has dimensions (648, 128), where 648 is the number of frames and 128 is the number of Mel filters.
An acoustic event detection system based on a sparse self-attention mechanism is used for executing an acoustic event detection method based on the sparse self-attention mechanism.
A storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement a sparse self-attention mechanism based acoustic event detection method.
An apparatus comprising a processor and a memory, the memory having stored therein at least one instruction, the at least one instruction being loaded and executed by the processor to implement the sparse self-attention mechanism-based acoustic event detection method.
Beneficial effects:
By sparsifying the attention weights, the method provided by the invention reduces the coupling of the model with information from irrelevant moments when modeling the temporal structure of sound events, thereby achieving more effective temporal modeling and improving the performance of existing acoustic event detection systems. The proposed method was verified on an internationally published acoustic event detection dataset, and the results show that its classification performance is greatly improved compared with the existing system.
Drawings
Fig. 1 is a schematic diagram of an acoustic event detection method based on a sparse self-attention mechanism.
Fig. 2 is a schematic structural diagram of the convolutional neural network part of Fig. 1, where ×7 indicates that the bracketed module is stacked 7 times.
Fig. 3 is a schematic structural diagram of the Transformer Encoder network part of Fig. 1, which includes the proposed self-attention weight sparsification method.
Fig. 4 is a graph comparing the detection performance of the method of the present invention and the original baseline system on the international public data set.
Detailed Description
The first embodiment is as follows:
the invention provides an acoustic event detection method based on a sparse self-attention mechanism, which is used for replacing a Softmax normalization method in the self-attention mechanism with a sparse normalization method, so that coupling with useless features at other moments can be selectively reduced when a time sequence feature extraction network carries out time sequence modeling, and the time sequence features of an acoustic signal can be more effectively modeled.
Fig. 1 is a schematic diagram of the implementation model of the acoustic event detection method based on a sparse self-attention mechanism. First, a Mel spectrogram is extracted from the input audio signal; the Mel spectrogram is then input into a convolutional neural network (CNN) to extract local features, and a Transformer Encoder based on a sparse self-attention mechanism extracts the temporal features; finally, the features are input into a fully connected layer for classification, the result is post-processed, and the category and start-stop time of each detected acoustic event are output. Specifically, the method comprises the following steps:
step 1, extracting local features of the audio signal.
Step 1.1, extracting a Mel spectrogram.
First, the commonly used Mel spectrogram features are extracted from the input sound signal as model input. In some embodiments, the input sound signal is a 10-second sound segment with a sampling rate of 16 kHz. The Mel spectrogram is extracted with a window length of 2048, a frame shift of 255 and 128 Mel-domain filters, and the values are mapped to the natural log domain. For a 10-second sound segment, the extracted Mel spectrogram X_i has dimensions (648, 128), where 648 is the number of frames and 128 is the number of Mel filters.
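A minimal sketch of this feature extraction step is given below, assuming librosa is used; the function name and the small epsilon added before the logarithm are illustrative choices, and the exact frame count depends on the padding convention.

```python
import librosa
import numpy as np

def extract_log_mel(wav_path):
    """Log-Mel spectrogram as described above: 10 s clip at 16 kHz,
    window length 2048, frame shift 255, 128 Mel filters, natural log."""
    y, sr = librosa.load(wav_path, sr=16000, duration=10.0)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=2048, hop_length=255, n_mels=128)
    log_mel = np.log(mel + 1e-10)   # map to the natural log domain
    return log_mel.T                # (frames, 128); the patent reports (648, 128)
```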
Step 1.2, extracting local features.
The extracted Mel spectrogram is input into a convolutional neural network model. The convolutional neural network consists of a series of convolution modules, each comprising a convolutional layer, a normalization layer, a nonlinear layer and a max pooling layer, as shown in Fig. 2. In some embodiments, 7 convolution modules are used; the two-dimensional convolutional layers have a kernel size of (3,3) and a stride of (1,1), and the numbers of stacked convolution filters are (16, 32, 64, 128, 128, 128, 128). The pooling sizes of the max pooling layers are ((2,2), (2,2), (1,2), (1,2), (1,2), (1,2), (1,2)). All two-dimensional convolutional layers, max pooling layers, batch normalization layers and rectified linear units are standard components of common neural network frameworks.
The input Mel spectrogram is mapped by the convolutional neural network to local features H_i, where H_i has dimensions (157, 128); 157 is the time dimension and 128 is the feature dimension.
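A minimal PyTorch sketch of such a local feature extractor follows; the use of batch normalization, ReLU and mean pooling over the collapsed frequency axis are assumptions where the text does not fix the details, and the class and function names are illustrative.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, pool):
    """One convolution module: 3x3 conv (stride 1) -> batch norm -> ReLU -> max pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=(3, 3), stride=(1, 1), padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(pool),
    )

class LocalFeatureCNN(nn.Module):
    """Seven stacked convolution modules; maps (B, 1, frames, 128) to (B, frames', 128)."""
    def __init__(self):
        super().__init__()
        filters = [16, 32, 64, 128, 128, 128, 128]
        pools = [(2, 2), (2, 2), (1, 2), (1, 2), (1, 2), (1, 2), (1, 2)]
        blocks, in_ch = [], 1
        for out_ch, pool in zip(filters, pools):
            blocks.append(conv_block(in_ch, out_ch, pool))
            in_ch = out_ch
        self.cnn = nn.Sequential(*blocks)

    def forward(self, x):            # x: (B, 1, T, 128) log-Mel input
        h = self.cnn(x)              # (B, 128, T', F') with the frequency axis collapsed
        h = h.mean(dim=-1)           # drop the frequency axis (assumption)
        return h.transpose(1, 2)     # (B, T', 128) time-major local features H_i
```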
Step 2, extracting time domain features.
The extracted local features H_i are input into a single-layer Transformer Encoder model. The Transformer Encoder model consists of a fully connected layer, a self-attention layer and a dropout layer; the detailed configuration and parameters are shown in Fig. 3. The number of attention heads is 16, the dimension of the linear mapping layer is 512, and the dropout rate is 0.2. The attention weights are normalized by the proposed sparse normalization method; the other components are standard components of common neural network frameworks. The network output tensor is denoted M_i, with dimensions (157, 128).
The attention weight matrix in the self-attention layer is computed as

A = \frac{Q K^\top}{\sqrt{d_k}}    (1)

where Q and K are the Query and Key matrices in self-attention, and d_k is the feature dimension.
The normalization of the obtained attention weight matrix proceeds as follows:
2.1 The t-th column of A is denoted A_t; the elements of A_t are sorted in descending order;
2.2 The intermediate parameter k_t satisfying the following condition is found:

k_t = \max\{k \in [T] \mid 1 + k A_{t,k} > \sum_{j \le k} A_{t,j}\}    (2)

where T denotes the size of the time dimension, [T] = {1, 2, ..., T}, and A_{t,k}, A_{t,j} are the k-th and j-th elements of the sorted vector A_t, respectively;
2.3 The threshold τ_t is computed:

\tau_t = \frac{\sum_{j \le k_t} A_{t,j} - 1}{k_t}    (3)

2.4 For each element j of A_t, the following is computed:

A'_{t,j} = [A_{t,j} - \tau_t]_+    (4)

where [·]_+ denotes [·]_+ = max{0, ·}.
2.5 Return to step 2.1 for the next t until t = T, yielding the normalized attention weight matrix A'.
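Steps 2.1 to 2.5 can be sketched as follows; this is essentially the sparsemax projection applied to the unnormalized attention scores A = QKᵀ/√d_k. The vectorized form, the function name and the choice of normalization axis are assumptions; the text describes the operation column by column.

```python
import torch

def sparse_normalize(A, dim=-1):
    """Sparse normalization of attention weights (steps 2.1-2.5 above).

    For each slice A_t along `dim`: sort descending (2.1), find
    k_t = max{k : 1 + k * A_sorted[k] > sum_{j<=k} A_sorted[j]} (2.2),
    compute tau_t = (sum_{j<=k_t} A_sorted[j] - 1) / k_t (2.3), and
    return A' = [A - tau_t]_+ (2.4).
    """
    A_sorted, _ = torch.sort(A, dim=dim, descending=True)          # step 2.1
    cumsum = A_sorted.cumsum(dim)
    shape = [1] * A.dim()
    shape[dim] = -1
    k = torch.arange(1, A.size(dim) + 1, device=A.device, dtype=A.dtype).view(shape)
    support = 1 + k * A_sorted > cumsum                            # step 2.2
    k_t = (support.to(A.dtype) * k).max(dim=dim, keepdim=True).values
    tau = (cumsum.gather(dim, k_t.long() - 1) - 1) / k_t           # step 2.3
    return torch.clamp(A - tau, min=0)                             # step 2.4

# Usage on unnormalized attention scores, e.g.:
#   scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
#   attn = sparse_normalize(scores)
```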
The characteristics of the above procedure are analyzed as follows. Consider the temporal modeling of an acoustic event of class S occurring at a time t_1. Ideally, the weighted sum should involve only the features of the moments within the audio segment that belong to class S, while the weights corresponding to the features of other moments t_2 not belonging to class S should be 0; here the weight denotes the similarity between the feature at time t_1 and the feature at time t_2. In general, a larger attention weight indicates similarity between acoustic events of the same class, while a smaller attention weight often indicates similarity between acoustic events of different classes.
Compared with the ordinary normalized attention weights obtained by the Softmax transformation, the normalized attention weights A' obtained by the above method can ignore relatively small attention weight values. Therefore, when the normalized attention weights are used to weight and sum the features at different moments, the sparse attention weights A' make the neural network less affected by acoustic event features of other categories when modeling acoustic event features of the same category. Moreover, the invention does not simply set small weight values to 0 with a hard threshold; instead, relatively small weight values are adaptively set to 0 after all weight values are considered together.
In summary, the invention facilitates the neural network's modeling of sound events and improves the performance of the acoustic event detection system.
Step 3, feature classification.
The features are classified using a fully connected layer; the hidden layer parameter is 128, the output dimension is (157, 10), the activation function is the Sigmoid activation function, and the output matrix is denoted O_i.
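A minimal sketch of this classification head is given below, reading the hidden layer parameter 128 as the 128-dimensional encoder output mapped to 10 event classes; this reading and the class name are assumptions.

```python
import torch
import torch.nn as nn

class FrameClassifier(nn.Module):
    """Frame-wise classifier: 128-dim Transformer features -> 10 classes, Sigmoid output."""
    def __init__(self, in_dim=128, n_classes=10):
        super().__init__()
        self.fc = nn.Linear(in_dim, n_classes)

    def forward(self, m):                     # m: (B, 157, 128) encoder output M_i
        return torch.sigmoid(self.fc(m))      # O_i: (B, 157, 10) per-frame probabilities
```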
Step 4, post-processing.
The output probabilities are smoothed with a median filter to obtain the predicted probability of each acoustic event. When the predicted probability of class c at time t is greater than 0.5, an acoustic event of class c is considered to occur at time t; otherwise, the event of class c is considered not to occur. In this way, the information of whether each sound event occurs at each moment is obtained, from which the onset and offset times of each sound event can be derived.
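A sketch of this post-processing step follows; the median filter length, the per-frame hop in seconds and the function name are illustrative assumptions not specified in the text.

```python
import numpy as np
from scipy.ndimage import median_filter

def decode_events(probs, frame_hop_s=0.064, filter_len=7, threshold=0.5):
    """Median-filter frame probabilities and decode event onsets/offsets.

    probs: (T, C) array of per-frame class probabilities (the matrix O_i).
    Returns a list of (class_index, onset_seconds, offset_seconds).
    """
    smoothed = median_filter(probs, size=(filter_len, 1))   # smooth along time only
    active = smoothed > threshold                            # class c active at frame t
    events = []
    for c in range(active.shape[1]):
        t = 0
        while t < active.shape[0]:
            if active[t, c]:
                onset = t
                while t < active.shape[0] and active[t, c]:
                    t += 1
                events.append((c, onset * frame_hop_s, t * frame_hop_s))
            else:
                t += 1
    return events
```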
Examples
To verify the effectiveness of the method, the scheme of the first embodiment is evaluated on DESED, a publicly available international acoustic event detection dataset, and the proposed method is compared with the original baseline method. As shown in Fig. 4, the method proposed by the invention achieves better recognition performance than the original baseline system on all ten classes of acoustic events. The average performance of the original system on this dataset is 44.22%, while the average performance of the proposed method is 47.65%, exceeding the performance of the first-ranked single model of DCASE 2020 Challenge Task 4. The experimental results therefore fully verify the effectiveness of the invention.
The second embodiment is as follows:
the embodiment is an acoustic event detection system based on a sparse self-attention mechanism, and the system is used for executing an acoustic event detection method based on the sparse self-attention mechanism.
The third embodiment is as follows:
the present embodiment is a storage medium having at least one instruction stored therein, where the at least one instruction is loaded and executed by a processor to implement a sparse auto-attention mechanism based acoustic event detection method.
The fourth embodiment is as follows:
the present embodiment is an apparatus comprising a processor and a memory, the storage medium having at least one instruction stored therein, the at least one instruction being loaded and executed by the processor to implement a sparse self-attention mechanism based acoustic event detection method.
The above examples merely describe the computational model and workflow of the invention in detail and are not intended to limit the embodiments of the invention. It will be apparent to those skilled in the art that other variations and modifications can be made on the basis of the above description; the description is not intended to be exhaustive or to limit the invention to the precise form disclosed, and all such modifications and variations fall within the scope of the invention.

Claims (10)

1. An acoustic event detection method based on a sparse self-attention mechanism comprises the following steps:
firstly, extracting a Mel spectrogram from an input audio signal, then inputting the Mel spectrogram into a convolutional neural network to extract local features, and extracting time domain features by using a Transformer Encoder based on a sparse self-attention mechanism; finally, inputting the extracted features into a fully connected layer for classification, post-processing the result, and outputting the category and start-stop time of each detected acoustic event;
the method is characterized in that the process of extracting the time domain features comprises the following steps:
the extracted local features H i Inputting the data into a single-layer Transformer Encoder model, and normalizing the attention weight by adopting a sparse normalization method; the normalization operation on the obtained attention weight matrix comprises the following steps:
2.1 letting the t-th column of A be A_t, and sorting the elements of A_t in descending order, A being the attention weight matrix in the self-attention layer;
2.2 finding the intermediate parameter k_t satisfying

k_t = \max\{k \in [T] \mid 1 + k A_{t,k} > \sum_{j \le k} A_{t,j}\}

where T denotes the size of the time dimension, [T] = {1, 2, ..., T}, and A_{t,k}, A_{t,j} are the k-th and j-th elements of the sorted vector A_t, respectively;
2.3 computing the threshold τ_t:

\tau_t = \frac{\sum_{j \le k_t} A_{t,j} - 1}{k_t}

2.4 computing, for each element j of A_t,

A'_{t,j} = [A_{t,j} - \tau_t]_+

where [·]_+ denotes [·]_+ = max{0, ·};
2.5 returning to step 2.1 for the next t until t = T, obtaining the normalized attention weight matrix A'.
2. The method according to claim 1, wherein the attention weight matrix in the self-attention layer is

A = \frac{Q K^\top}{\sqrt{d_k}}

where Q and K are the Query and Key matrices in self-attention, and d_k is the feature dimension.
3. The sparse self-attention mechanism-based acoustic event detection method of claim 2, wherein the Mel spectrogram is input into a convolutional neural network to extract local features, the convolutional neural network being composed of at least one convolution module, each convolution module comprising a convolutional layer, a normalization layer, a nonlinear layer and a max pooling layer.
4. The sparse self-attention mechanism-based acoustic event detection method according to claim 3, wherein the convolutional neural network for extracting local features consists of seven convolution modules, the numbers of stacked convolution filters being 16, 32, 64, 128, 128, 128 and 128 in sequence; the pooling sizes of the max pooling layers are (2,2), (2,2), (1,2), (1,2), (1,2), (1,2) and (1,2);
the convolutional layers in each convolution module are two-dimensional convolutional layers with a kernel size of (3,3) and a stride of (1,1).
5. The sparse self-attention mechanism-based acoustic event detection method according to claim 4, wherein the process of inputting the features into the fully connected layer for classification comprises the following steps:
classifying the features using a fully connected layer, wherein the hidden layer parameter is 128 and the activation function is the Sigmoid activation function.
6. The sparse self-attention mechanism-based acoustic event detection method according to any one of claims 1 to 5, wherein the process of post-processing the result and outputting the category and start-stop time of each detected acoustic event comprises the following steps:
smoothing the output probabilities with a median filter to obtain the predicted probability of each acoustic event; when the predicted probability of class c at time t is greater than 0.5, determining that an acoustic event of class c occurs at time t, and otherwise determining that the event of class c does not occur; thereby obtaining, for each moment, the information of whether each sound event occurs, from which the onset and offset times of the sound events are obtained.
7. The sparse self-attention mechanism-based acoustic event detection method of claim 6, wherein the process of extracting the Mel spectrogram from the input audio signal comprises the following steps:
the input sound signal is a 10-second sound segment with a sampling rate of 16 kHz; the Mel spectrogram is extracted with a window length of 2048, a frame shift of 255 and 128 Mel-domain filters, and the values are mapped to the natural log domain; for a 10-second sound segment, the extracted Mel spectrogram X_i has dimensions (648, 128), where 648 is the number of frames and 128 is the number of Mel filters.
8. A sparse self-attention mechanism based acoustic event detection system, wherein the system is configured to perform a sparse self-attention mechanism based acoustic event detection method of any one of claims 1 to 7.
9. A storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement a sparse self-attention mechanism based acoustic event detection method as claimed in any one of claims 1 to 7.
10. An apparatus comprising a processor and a memory, the memory having stored therein at least one instruction, the at least one instruction being loaded and executed by the processor to implement a sparse self-attention mechanism based acoustic event detection method as claimed in any one of claims 1 to 7.
CN202110619344.0A 2021-06-03 2021-06-03 Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment Active CN113362854B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110619344.0A 2021-06-03 2021-06-03 Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110619344.0A 2021-06-03 2021-06-03 Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN113362854A 2021-09-07
CN113362854B 2022-11-15

Family

ID=77531749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110619344.0A Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment 2021-06-03 2021-06-03

Country Status (1)

Country Link
CN (1) CN113362854B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116825131A (en) * 2022-06-24 2023-09-29 南方电网调峰调频发电有限公司储能科研院 Power plant equipment state auditory monitoring method integrating frequency band self-downward attention mechanism

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110600059A (en) * 2019-09-05 2019-12-20 Oppo广东移动通信有限公司 Acoustic event detection method and device, electronic equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111145718B (en) * 2019-12-30 2022-06-07 中国科学院声学研究所 Chinese mandarin character-voice conversion method based on self-attention mechanism
CN111899760B (en) * 2020-07-17 2024-05-07 北京达佳互联信息技术有限公司 Audio event detection method and device, electronic equipment and storage medium
US20220068462A1 (en) * 2020-08-28 2022-03-03 doc.ai, Inc. Artificial Memory for use in Cognitive Behavioral Therapy Chatbot
CN111933188B (en) * 2020-09-14 2021-02-05 电子科技大学 Sound event detection method based on convolutional neural network
US11756551B2 (en) * 2020-10-07 2023-09-12 Mitsubishi Electric Research Laboratories, Inc. System and method for producing metadata of an audio signal
CN112802484B (en) * 2021-04-12 2021-06-18 四川大学 Panda sound event detection method and system under mixed audio frequency
CN113223506B (en) * 2021-05-28 2022-05-20 思必驰科技股份有限公司 Speech recognition model training method and speech recognition method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110600059A (en) * 2019-09-05 2019-12-20 Oppo广东移动通信有限公司 Acoustic event detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113362854A (en) 2021-09-07

Similar Documents

Publication Publication Date Title
Bayar et al. On the robustness of constrained convolutional neural networks to jpeg post-compression for image resampling detection
CN111476023A (en) Method and device for identifying entity relationship
Huang et al. A novel method for detecting image forgery based on convolutional neural network
CN109446804B (en) Intrusion detection method based on multi-scale feature connection convolutional neural network
CN111091839B (en) Voice awakening method and device, storage medium and intelligent device
CN110968845B (en) Detection method for LSB steganography based on convolutional neural network generation
CN111276133B (en) Audio recognition method, system, mobile terminal and storage medium
CN113362854B (en) Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment
CN116527357A (en) Web attack detection method based on gate control converter
CN116150509B (en) Threat information identification method, system, equipment and medium for social media network
CN113179250A (en) Web unknown threat detection method and system
CN111526144A (en) Abnormal flow detection method and system based on DVAE-Catboost
CN116229960B (en) Robust detection method, system, medium and equipment for deceptive voice
CN117375896A (en) Intrusion detection method and system based on multi-scale space-time feature residual fusion
CN116340746A (en) Feature selection method based on random forest improvement
CN107403618B (en) Audio event classification method based on stacking base sparse representation and computer equipment
Ramesh Babu et al. A novel framework design for semantic based image retrieval as a cyber forensic tool
CN116506210A (en) Network intrusion detection method and system based on flow characteristic fusion
WO2021088176A1 (en) Binary multi-band power distribution-based low signal-to-noise ratio sound event detection method
Liang et al. Image resampling detection based on convolutional neural network
CN114554491A (en) Wireless local area network intrusion detection method based on improved SSAE and DNN models
Dehdar et al. Image steganalysis using modified graph clustering based ant colony optimization and Random Forest
CN117633604A (en) Audio and video intelligent processing method and device, storage medium and electronic equipment
CN114547601B (en) Random forest intrusion detection method based on multi-layer classification strategy
Xin et al. Research on feature selection of intrusion detection based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant