CN110782878B - Attention mechanism-based multi-scale audio scene recognition method


Info

Publication number
CN110782878B
CN110782878B
Authority
CN
China
Prior art keywords
scale
processing
audio
layer
module
Prior art date
Legal status
Active
Application number
CN201910960088.4A
Other languages
Chinese (zh)
Other versions
CN110782878A (en)
Inventor
张涛 (Zhang Tao)
梁晋华 (Liang Jinhua)
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201910960088.4A priority Critical patent/CN110782878B/en
Publication of CN110782878A publication Critical patent/CN110782878A/en
Application granted granted Critical
Publication of CN110782878B publication Critical patent/CN110782878B/en

Classifications

    • G - PHYSICS
      • G06 - COMPUTING; CALCULATING OR COUNTING
        • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00 - Computing arrangements based on biological models
            • G06N3/02 - Neural networks
              • G06N3/04 - Architecture, e.g. interconnection topology
                • G06N3/045 - Combinations of networks
      • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L15/00 - Speech recognition
            • G10L15/04 - Segmentation; Word boundary detection
            • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
              • G10L15/063 - Training
            • G10L15/08 - Speech classification or search
              • G10L15/16 - Speech classification or search using artificial neural networks
            • G10L15/28 - Constructional details of speech recognition systems
              • G10L15/285 - Memory allocation or algorithm optimisation to reduce hardware requirements


Abstract

A multi-scale audio scene recognition method based on an attention mechanism comprises the following steps: establishing an attention-mechanism-based multi-scale audio scene recognition convolutional neural network model for accurately recognizing audio scenes with different frequency band ranges and different durations; inputting a training set of audio files covering different scene categories, together with their corresponding scene labels, into the model and training it; reading an audio file and preprocessing it to obtain audio signal segments; extracting a logarithmic Mel spectrogram from each audio signal segment; and inputting the logarithmic Mel spectrogram into the trained model to obtain the final scene category. The method achieves good recognition accuracy on multi-scale sound scenes with different frequency band ranges and durations, and can be applied to embedded mobile devices and the like.

Description

Attention mechanism-based multi-scale audio scene recognition method
Technical Field
The invention relates to audio scene recognition methods, and in particular to a multi-scale audio scene recognition method based on an attention mechanism.
Background
Audio scene recognition is a class of methods that let a machine identify the specific background information behind audio (e.g., a park, a street, or a restaurant) by processing a recorded audio file or an uploaded data stream, so that the machine can mimic human perception.
In the field of machine learning, many different models and audio feature representations have been proposed to solve the scene recognition problem. Studies applying neural networks to scene audio date back to 1997. In 1998, Liu et al. used recurrent neural networks (RNNs) and a nearest-neighbor classifier to distinguish five classes of ambient sounds. However, because too many parameters were introduced during training, both of these neural network models had very high complexity and performed poorly after training. In the challenge hosted by the IEEE AASP in 2013, many participating teams attempted to distinguish 10 classes of sound scenes using conventional machine learning methods, such as Gaussian mixture models (GMMs), support vector machines (SVMs), tree-based methods, and bag-based methods. Although these methods have low computational complexity, their model structures are relatively simple and cannot fully exploit the ever-growing amounts of data available under the current big-data trend, so they cannot achieve satisfactory audio scene recognition performance.
In recent years, the introduction of convolutional neural networks (CNNs) has advanced the application of neural networks and deep learning in fields such as pattern recognition. The ideas of local perception and weight sharing allow more features to be captured while reducing model parameters, thereby improving network performance. In 2017, Valenti et al. applied CNNs to the field of audio scene recognition and obtained good results. In 2018, Kong et al. proposed a CNN structure with 8 convolutional layers and achieved good performance in the DCASE challenge held by the IEEE AASP that year. However, existing CNN-based methods usually perform scene recognition with single-scale sound features, so the trained models tend to suit only certain special scene types, and the overall scene recognition accuracy is not ideal.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a multi-scale audio scene recognition method, based on an attention mechanism, that offers higher accuracy and better real-time performance.
The technical solution adopted by the invention is as follows: a multi-scale audio scene recognition method based on an attention mechanism comprises the following steps:
1) establishing an attention-mechanism-based multi-scale audio scene recognition convolutional neural network model, which is used for accurately recognizing audio scenes with different frequency band ranges and different durations;
2) inputting a training set of audio files covering different scene categories, together with their corresponding scene labels, into the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model and training the model;
3) reading an audio file and preprocessing it to obtain an audio signal segment;
4) extracting a logarithmic Mel spectrogram from said audio signal segment;
5) inputting the logarithmic Mel spectrogram into the trained attention-mechanism-based multi-scale audio scene recognition convolutional neural network model to obtain the final scene category.
The attention-mechanism-based multi-scale audio scene recognition convolutional neural network model in step 1) comprises the following components in series: a feature extraction module, built from an Xception model, for extracting features of different scales from the received logarithmic Mel spectrogram; a feature processing module for processing the different-scale features extracted by the feature extraction module to obtain feature vectors representing the different scales; an attention module for fusing the feature vectors representing the different scales and performing scene classification; and a weight distribution module for processing the lowest-scale features output by the feature extraction module and feeding the result to the attention module.
The different-scale features output by the second, third and fourth pooling layers of the feature extraction module are respectively sent to the feature processing module, and the lowest-scale features output by the first pooling layer of the feature extraction module are sent to the weight distribution module.
The feature processing module comprises:
a first transverse connection structure, which sequentially applies 1 × 1 convolution, 3 × 3 convolution and global pooling to the received upper-layer scale features to obtain an upper-layer scale feature vector, sends the vector to the attention module, and sends the upper-layer scale feature information after the 1 × 1 convolution to the second transverse connection structure;
a second transverse connection structure, which applies a 1 × 1 convolution to the received middle-layer scale features and up-samples the received upper-layer scale feature information, adds the two results to obtain middle-layer scale feature information, applies 3 × 3 convolution and global pooling to it to obtain a middle-layer scale feature vector, sends the vector to the attention module, and sends the middle-layer scale feature information to the third transverse connection structure;
and a third transverse connection structure, which applies a 1 × 1 convolution to the received bottom-layer scale features and up-samples the received middle-layer scale feature information, adds the two results to obtain bottom-layer scale feature information, and applies 3 × 3 convolution and global pooling to it to obtain a bottom-layer scale feature vector, which is sent to the attention module.
The weight distribution module sequentially applies 1 × 1 convolution, 3 × 3 convolution, global pooling and fully connected layer processing to the lowest-scale features to obtain three weight coefficients that distribute attention across the different scales, and sends the weight coefficients to the attention module.
The attention module computes a weighted average of the upper-layer, middle-layer and bottom-layer scale feature vectors output by the feature processing module, using the three weight coefficients output by the weight distribution module, and then sequentially applies fully connected layer processing and classification to obtain the final scene category.
The preprocessing in step 3) truncates the input signal into signal segments of fixed 10 s duration.
Step 4) comprises the following steps:
(1) performing frame division and windowing on the input audio signal segment;
(2) passing the resulting audio frames through a Mel filter bank: within each time step, the energy passing through each Mel filter is computed; the energies of all filters within a time step form an energy vector, and the energy vectors of all time steps are combined to obtain the two-dimensional Mel spectrogram of the audio frame;
(3) taking the logarithm of the two-dimensional Mel spectrogram to obtain the logarithmic Mel spectrogram.
Compared with conventional single-scale convolutional neural network methods, the attention-mechanism-based multi-scale audio scene recognition method of the invention achieves good recognition accuracy on multi-scale sound scenes with different frequency band ranges and durations, and at equivalent model complexity it attains higher overall accuracy. Because the model complexity is low, memory consumption in practical applications is small and real-time performance is good, so the method can be applied to embedded mobile devices and the like. In addition, since the method relies heavily on tensor computation, it can run on GPUs/TPUs and the like to greatly increase processing speed.
Drawings
FIG. 1 is a flow chart of the multi-scale audio scene recognition method based on an attention mechanism according to the present invention;
FIG. 2 is a schematic diagram of the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model of the present invention;
FIG. 3 is a schematic diagram of the first transverse connection structure of the feature processing module in the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model;
FIG. 4 is a schematic diagram of the second transverse connection structure of the feature processing module in the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model;
FIG. 5 is a schematic diagram of the third transverse connection structure of the feature processing module in the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model;
FIG. 6 is a schematic diagram of the weight distribution module in the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model;
FIG. 7 is a schematic diagram of the attention module in the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model.
Detailed Description
The multi-scale audio scene recognition method based on an attention mechanism of the present invention is described in detail below with reference to embodiments and the drawings.
The invention discloses a multi-scale audio scene recognition method based on an attention mechanism, which extracts the sound features of key acoustic events with different durations and frequency band distributions in a scene. On this basis, the invention uses an attention mechanism to distribute attention over the extracted sound features of each scale, so that the features of all scales are fused into a new sound feature for recognizing the sound scene.
As shown in fig. 1, the method for identifying a multi-scale audio scene based on an attention mechanism of the present invention includes the following steps:
1) establishing an attention-mechanism-based multi-scale audio scene recognition convolutional neural network model, which is used for accurately recognizing audio scenes with different frequency band ranges and different durations; wherein:
As shown in fig. 2, the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model comprises: a feature extraction module 1, built from an Xception model, for extracting features of different scales from the received logarithmic Mel spectrogram; a feature processing module 2 for processing the different-scale features extracted by the feature extraction module 1 to obtain feature vectors representing the different scales; an attention module 4 for fusing the feature vectors representing the different scales and performing scene classification; and a weight distribution module 3 for processing the lowest-scale features output by the feature extraction module 1 and feeding the result to the attention module 4.
The different-scale features output by the second, third and fourth pooling layers of the feature extraction module 1 are respectively sent to the feature processing module 2, and the lowest-scale features output by the first pooling layer of the feature extraction module 1 are sent to the weight distribution module 3.
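The patent does not state how these intermediate pooling outputs are routed out of the backbone. One common way, sketched below in Python/PyTorch purely as an assumption, is to register forward hooks on the four pooling layers so that a single forward pass through the Xception backbone exposes all four scale features; the function name and the way the pooling layers are located are hypothetical:

```python
from typing import Dict, List

import torch
import torch.nn as nn


def tap_pooling_outputs(pool_layers: List[nn.Module]) -> Dict[int, torch.Tensor]:
    """Collect the outputs of the backbone's first four pooling layers.

    `pool_layers` is an ordered list of those pooling modules; how to locate
    them inside a particular Xception implementation differs between
    libraries and is left open here.
    """
    feats: Dict[int, torch.Tensor] = {}
    for i, layer in enumerate(pool_layers):
        # The default argument pins the index at hook-registration time.
        layer.register_forward_hook(
            lambda module, inputs, output, i=i: feats.__setitem__(i, output))
    return feats

# After one forward pass of the backbone, feats[0] (first pooling layer,
# lowest scale) goes to the weight distribution module 3, while feats[1],
# feats[2] and feats[3] go to the feature processing module 2.
```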
The feature processing module 2 comprises:
As shown in fig. 3, the first transverse connection structure 2.1 sequentially applies 1 × 1 convolution, 3 × 3 convolution and global pooling to the received upper-layer scale features to obtain an upper-layer scale feature vector, sends the vector to the attention module 4, and sends the upper-layer scale feature information after the 1 × 1 convolution to the second transverse connection structure 2.2;
As shown in fig. 4, the second transverse connection structure 2.2 applies a 1 × 1 convolution to the received middle-layer scale features and up-samples the received upper-layer scale feature information, adds the two results to obtain middle-layer scale feature information, applies 3 × 3 convolution and global pooling to it to obtain a middle-layer scale feature vector, sends the vector to the attention module 4, and sends the middle-layer scale feature information to the third transverse connection structure 2.3;
As shown in fig. 5, the third transverse connection structure 2.3 applies a 1 × 1 convolution to the received bottom-layer scale features and up-samples the received middle-layer scale feature information, adds the two results to obtain bottom-layer scale feature information, and applies 3 × 3 convolution and global pooling to it to obtain a bottom-layer scale feature vector, which is sent to the attention module 4. A sketch of these three structures is given below.
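To make the data flow concrete, here is a minimal PyTorch sketch of the three transverse connection structures. The channel width `c` and the nearest-neighbour interpolation mode are illustrative assumptions not fixed by the patent; only the order of operations (1 × 1 convolution, upsample-and-add, 3 × 3 convolution, global pooling) and the routing between the three structures follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureProcessingModule(nn.Module):
    """Sketch of the transverse connection structures 2.1, 2.2 and 2.3.

    Inputs are the upper-, middle- and bottom-layer scale features taken
    from three pooling stages of the backbone; the channel width `c` is an
    illustrative assumption.
    """

    def __init__(self, ch_upper: int, ch_middle: int, ch_bottom: int, c: int = 256):
        super().__init__()
        self.lat_u = nn.Conv2d(ch_upper, c, kernel_size=1)          # 1 x 1 conv (2.1)
        self.lat_m = nn.Conv2d(ch_middle, c, kernel_size=1)         # 1 x 1 conv (2.2)
        self.lat_b = nn.Conv2d(ch_bottom, c, kernel_size=1)         # 1 x 1 conv (2.3)
        self.smooth_u = nn.Conv2d(c, c, kernel_size=3, padding=1)   # 3 x 3 convs
        self.smooth_m = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.smooth_b = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.gap = nn.AdaptiveAvgPool2d(1)                          # global pooling

    def forward(self, f_upper, f_middle, f_bottom):
        # Structure 2.1: 1 x 1 conv, then 3 x 3 conv + global pooling.
        p_u = self.lat_u(f_upper)
        v_u = self.gap(self.smooth_u(p_u)).flatten(1)

        # Structure 2.2: 1 x 1 conv on the middle features, up-sample the
        # 1 x 1-convolved upper features, add, then 3 x 3 conv + pooling.
        p_m = self.lat_m(f_middle) + F.interpolate(
            p_u, size=f_middle.shape[-2:], mode="nearest")
        v_m = self.gap(self.smooth_m(p_m)).flatten(1)

        # Structure 2.3: the same pattern one level down.
        p_b = self.lat_b(f_bottom) + F.interpolate(
            p_m, size=f_bottom.shape[-2:], mode="nearest")
        v_b = self.gap(self.smooth_b(p_b)).flatten(1)

        # Three scale feature vectors for the attention module 4.
        return v_u, v_m, v_b
```

The 1 × 1 convolutions align the channel counts so that coarser maps can be up-sampled and added to finer ones, mirroring the lateral-connection pattern of feature pyramid networks.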
As shown in fig. 6, the weight distribution module 3 sequentially applies 1 × 1 convolution, 3 × 3 convolution, global pooling and fully connected layer processing to the lowest-scale features to obtain three weight coefficients that distribute attention across the different scales, and sends the weight coefficients to the attention module 4.
As shown in fig. 7, the attention module 4 computes a weighted average of the upper-layer, middle-layer and bottom-layer scale feature vectors output by the feature processing module 2, using the three weight coefficients output by the weight distribution module 3, and then sequentially applies fully connected layer processing and classification to obtain the final scene category. A sketch of these two modules follows.
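The weight distribution module 3 and the attention module 4 can be sketched in the same spirit. The softmax normalization of the three coefficients and the number of scene classes are assumptions; the patent states only that three weight coefficients are produced and used for a weighted average followed by fully connected layer processing and classification.

```python
import torch
import torch.nn as nn


class WeightDistributionModule(nn.Module):
    """Sketch of module 3: 1 x 1 conv, 3 x 3 conv, global pooling and a
    fully connected layer producing three attention weights."""

    def __init__(self, ch_in: int, c: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch_in, c, kernel_size=1),
            nn.Conv2d(c, c, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(c, 3),  # one coefficient per scale
        )

    def forward(self, f_lowest):
        # Softmax normalization is an assumption; the text says only that
        # three weight coefficients are output.
        return torch.softmax(self.body(f_lowest), dim=1)


class AttentionModule(nn.Module):
    """Sketch of module 4: weighted average of the three scale vectors,
    then a fully connected layer for classification."""

    def __init__(self, dim: int, num_classes: int = 10):  # class count assumed
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, v_u, v_m, v_b, weights):
        stacked = torch.stack([v_u, v_m, v_b], dim=1)     # (batch, 3, dim)
        fused = (weights.unsqueeze(-1) * stacked).sum(1)  # weighted average
        return self.fc(fused)                             # class scores

# Usage: weights = weight_module(f_lowest)
#        logits  = attention(v_u, v_m, v_b, weights)
```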
2) Inputting a training set of audio files covering different scene categories, together with their corresponding scene labels, into the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model and training the model;
3) Reading an audio file and preprocessing it to obtain audio signal segments; the preprocessing truncates the input signal into segments of fixed 10 s duration to facilitate the subsequent processing. A minimal sketch of this step follows.
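A minimal sketch of this preprocessing, assuming single-channel audio loaded with librosa; the sampling rate and the zero-padding of a trailing short remainder are assumptions, since the text fixes only the 10 s segment length:

```python
import librosa
import numpy as np


def split_into_segments(path: str, seg_seconds: int = 10, sr: int = 44100):
    """Cut an audio file into fixed-length 10 s segments.

    The sampling rate and the zero-padding of the last, shorter remainder
    are illustrative assumptions.
    """
    y, sr = librosa.load(path, sr=sr, mono=True)
    seg_len = seg_seconds * sr
    n_segs = int(np.ceil(len(y) / seg_len))
    y = np.pad(y, (0, n_segs * seg_len - len(y)))        # pad the tail
    return y.reshape(n_segs, seg_len)                    # one row per segment
```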
4) Extracting a logarithmic Mel spectrogram from said audio signal segment; this comprises the following steps:
(1) performing frame division and windowing on the input audio signal segment; a speech signal is a typical non-stationary signal, but because the motion of the vocal organs is very slow compared with the speed of sound-wave vibration, a speech signal is generally considered stationary over a period of 10 ms to 30 ms; the signal under test is therefore framed and windowed.
(2) passing the resulting audio frames through a Mel filter bank: within each time step, the energy passing through each Mel filter is computed; the energies of all filters within a time step form an energy vector, and the energy vectors of all time steps are combined to obtain the two-dimensional Mel spectrogram of the audio frame.
The energy passing through the m-th Mel filter in each time step of the audio frame is calculated by the following formula:

E_m = Σ_k |X(k)|² · H_m(k),  m = 1, 2, …, M

where M is the number of Mel filters, H_m(k) is the transfer function of the m-th Mel filter, and X(k) is the corresponding FFT amplitude value.
(3) taking the logarithm of the two-dimensional Mel spectrogram to obtain the logarithmic Mel spectrogram. A direct implementation sketch of steps (1) to (3) is given below.
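Steps (1) to (3) can be written down directly. The sketch below uses numpy and librosa with placeholder frame, hop and filter-count values; the concrete values of the embodiment appear in the example further below.

```python
import librosa
import numpy as np


def log_mel_spectrogram(segment, sr=44100, n_fft=2048, hop=1024, n_mels=64):
    """Sketch of steps (1) to (3): framing/windowing, Mel filter-bank
    energies E_m = sum_k |X(k)|^2 H_m(k), then the logarithm.

    Frame, hop and filter counts here are placeholders, not the values
    used in the embodiment below.
    """
    # (1) Framing and windowing are handled inside the STFT.
    X = librosa.stft(segment, n_fft=n_fft, hop_length=hop, window="hamming")
    power = np.abs(X) ** 2                                # |X(k)|^2 per frame
    # (2) Mel filter bank H_m(k); energies per filter and time step.
    H = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_energy = H @ power                                # (n_mels, n_frames)
    # (3) Logarithm, with a small floor to avoid log(0).
    return np.log(mel_energy + 1e-10)
```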
5) Inputting the logarithmic Mel spectrogram into the trained attention-mechanism-based multi-scale audio scene recognition convolutional neural network model to obtain the final scene category.
Specific examples are given below:
for fast convergence and best performance, an Adam optimizer is used and an adaptive learning rate attenuation is set in the training process, and 32 samples are trained each time. The concrete measures are as follows:
1. Read the audio signal and truncate it; each recording is cut into speech segments of fixed 10 s duration.
2. Frame and window the fixed-duration speech signal; each frame contains 2048 sampling points and is weighted with a 2048-point Hamming window.
3. Pass the framed signal through a Mel filter bank for feature extraction and take the logarithm to obtain the logarithmic Mel spectrogram; the number of filters is 134, the filter window length is 1704 points, and adjacent frames overlap by 852 points.
4. Input the logarithmic Mel spectrogram into the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model and perform forward propagation to obtain the final scene category.
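A hedged training-loop sketch under the settings above (Adam optimizer, adaptive learning-rate decay, batch size 32); the initial learning rate, the epoch count, the ReduceLROnPlateau policy and the dataset object are illustrative assumptions, since the embodiment fixes only the optimizer, the decay and the batch size:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def train(model: nn.Module, train_set, epochs: int = 100, lr: float = 1e-3):
    """Train with Adam, adaptive LR decay and batch size 32.

    The LR schedule and epoch count are assumptions for illustration.
    """
    loader = DataLoader(train_set, batch_size=32, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        total = 0.0
        for log_mel, label in loader:          # batches of log-Mel spectrograms
            opt.zero_grad()
            loss = loss_fn(model(log_mel), label)
            loss.backward()
            opt.step()
            total += loss.item()
        sched.step(total / len(loader))        # decay the LR when loss plateaus
    return model
```

Steps 1 to 4 can likewise be strung together for inference. Interpreting the 1704-point filter window as librosa's win_length and the 852-point overlap as its hop_length, as well as the sampling rate and the trained `model` object, are assumptions about the filing:

```python
import librosa
import numpy as np
import torch


def predict_scene(model, path: str, sr: int = 44100) -> int:
    """Sketch of inference steps 1 to 4 with the embodiment's parameters."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    y = y[: 10 * sr]                                        # step 1: one 10 s segment
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=2048, win_length=1704, hop_length=852,
        window="hamming", n_mels=134)                       # steps 2 and 3
    log_mel = np.log(mel + 1e-10)
    x = torch.from_numpy(log_mel).float()[None, None]       # (1, 1, mels, frames)
    with torch.no_grad():
        scores = model(x)                                   # step 4: forward pass
    return int(scores.argmax(dim=1))                        # predicted scene index
```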
TABLE 1 Comparison of various audio scene recognition algorithms
[Table 1 is reproduced as an image in the original filing; it compares the recognition accuracy of the proposed method with that of two reference methods on the same data set.]
As shown in Table 1, on the same data set the accuracy of the attention-mechanism-based multi-scale audio scene recognition method is clearly higher than that of the two reference methods, demonstrating the better performance of the method.

Claims (6)

1. A multi-scale audio scene recognition method based on an attention mechanism is characterized by comprising the following steps:
1) establishing a multi-scale audio scene recognition convolutional neural network model based on an attention mechanism, wherein the model is used for accurately recognizing audio scenes with different frequency band ranges and different durations;
the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model comprises the following components in series: a feature extraction module (1), built from an Xception model, for extracting features of different scales from the received logarithmic Mel spectrogram; a feature processing module (2) for processing the different-scale features extracted by the feature extraction module (1) to obtain feature vectors representing the different scales; an attention module (4) for fusing the feature vectors representing the different scales and performing scene classification; and a weight distribution module (3) for processing the lowest-scale features output by the feature extraction module (1) and feeding the result to the attention module (4); wherein the feature processing module (2) comprises:
a first transverse connection structure (2.1), which sequentially applies 1 × 1 convolution, 3 × 3 convolution and global pooling to the received upper-layer scale features to obtain an upper-layer scale feature vector, sends the vector to the attention module (4), and sends the upper-layer scale feature information after the 1 × 1 convolution to the second transverse connection structure (2.2);
a second transverse connection structure (2.2), which applies a 1 × 1 convolution to the received middle-layer scale features and up-samples the received upper-layer scale feature information, adds the two results to obtain middle-layer scale feature information, applies 3 × 3 convolution and global pooling to it to obtain a middle-layer scale feature vector, sends the vector to the attention module (4), and sends the middle-layer scale feature information to the third transverse connection structure (2.3);
a third transverse connection structure (2.3), which applies a 1 × 1 convolution to the received bottom-layer scale features and up-samples the received middle-layer scale feature information, adds the two results to obtain bottom-layer scale feature information, and applies 3 × 3 convolution and global pooling to it to obtain a bottom-layer scale feature vector, which is sent to the attention module (4);
2) inputting a training set of audio files covering different scene categories, together with their corresponding scene labels, into the attention-mechanism-based multi-scale audio scene recognition convolutional neural network model and training the model;
3) reading an audio file and preprocessing it to obtain an audio signal segment;
4) extracting a logarithmic Mel spectrogram from said audio signal segment;
5) inputting the logarithmic Mel spectrogram into the trained attention-mechanism-based multi-scale audio scene recognition convolutional neural network model to obtain the final scene category.
2. The attention-mechanism-based multi-scale audio scene recognition method according to claim 1, wherein the different-scale features output by the second, third and fourth pooling layers of the feature extraction module (1) are respectively fed into the feature processing module (2), and the lowest-scale features output by the first pooling layer of the feature extraction module (1) are fed into the weight distribution module (3).
3. The attention-mechanism-based multi-scale audio scene recognition method according to claim 1, wherein the weight distribution module (3) sequentially applies 1 × 1 convolution, 3 × 3 convolution, global pooling and fully connected layer processing to the lowest-scale features to obtain three weight coefficients that distribute attention across the different scales, and sends the weight coefficients to the attention module (4).
4. The attention-mechanism-based multi-scale audio scene recognition method according to claim 1, wherein the attention module (4) computes a weighted average of the upper-layer, middle-layer and bottom-layer scale feature vectors output by the feature processing module (2), using the three weight coefficients output by the weight distribution module (3), and then sequentially applies fully connected layer processing and classification to obtain the final scene category.
5. The method as claimed in claim 1, wherein the preprocessing in step 3) truncates the input signal into signal segments of fixed 10 s duration.
6. The attention mechanism-based multi-scale audio scene recognition method according to claim 1, wherein the step 4) comprises:
(1) performing frame division and windowing on an input audio signal segment;
(2) passing the resulting audio frames through a Mel filter bank: within each time step, the energy passing through each Mel filter is computed; the energies of all filters within a time step form an energy vector, and the energy vectors of all time steps are combined to obtain the two-dimensional Mel spectrogram of the audio frame;
(3) taking the logarithm of the two-dimensional Mel spectrogram to obtain the logarithmic Mel spectrogram.
CN201910960088.4A 2019-10-10 2019-10-10 Attention mechanism-based multi-scale audio scene recognition method Active CN110782878B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910960088.4A CN110782878B (en) 2019-10-10 2019-10-10 Attention mechanism-based multi-scale audio scene recognition method


Publications (2)

Publication Number  Publication Date
CN110782878A (en)  2020-02-11
CN110782878B (en)  2022-04-05

Family

ID=69385078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910960088.4A Active CN110782878B (en) 2019-10-10 2019-10-10 Attention mechanism-based multi-scale audio scene recognition method

Country Status (1)

Country Link
CN (1) CN110782878B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111309965B (en) * 2020-03-20 2024-02-13 腾讯科技(深圳)有限公司 Audio matching method, device, computer equipment and storage medium
CN111477250B (en) * 2020-04-07 2023-11-28 北京达佳互联信息技术有限公司 Audio scene recognition method, training method and device for audio scene recognition model
CN111816205B (en) * 2020-07-09 2023-06-20 中国人民解放军战略支援部队航天工程大学 Airplane audio-based intelligent recognition method for airplane models
CN112036467B (en) * 2020-08-27 2024-01-12 北京鹰瞳科技发展股份有限公司 Abnormal heart sound identification method and device based on multi-scale attention neural network
CN112633175A (en) * 2020-12-24 2021-04-09 哈尔滨理工大学 Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment
CN112885372B (en) * 2021-01-15 2022-08-09 国网山东省电力公司威海供电公司 Intelligent diagnosis method, system, terminal and medium for power equipment fault sound
CN112562741B (en) * 2021-02-20 2021-05-04 金陵科技学院 Singing voice detection method based on dot product self-attention convolution neural network
CN112700794B (en) * 2021-03-23 2021-06-22 北京达佳互联信息技术有限公司 Audio scene classification method and device, electronic equipment and storage medium
CN114245280B (en) * 2021-12-20 2023-06-23 清华大学深圳国际研究生院 Scene self-adaptive hearing aid audio enhancement system based on neural network
CN116825131A (en) * 2022-06-24 2023-09-29 南方电网调峰调频发电有限公司储能科研院 Power plant equipment state auditory monitoring method integrating frequency band self-downward attention mechanism
CN116030800A (en) * 2023-03-30 2023-04-28 南昌航天广信科技有限责任公司 Audio classification recognition method, system, computer and readable storage medium
CN116543795B (en) * 2023-06-29 2023-08-29 天津大学 Sound scene classification method based on multi-mode feature fusion


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679085B2 (en) * 2017-10-31 2020-06-09 University Of Florida Research Foundation, Incorporated Apparatus and method for detecting scene text in an image

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108010514A (en) * 2017-11-20 2018-05-08 四川大学 A kind of method of speech classification based on deep neural network
CN109637522A (en) * 2018-12-26 2019-04-16 杭州电子科技大学 A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph
CN110085218A (en) * 2019-03-26 2019-08-02 天津大学 A kind of audio scene recognition method based on feature pyramid network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Acoustic scene classification using convolutional neural networks and multi-scale multi-feature extraction; An Dang et al.; 2018 IEEE International Conference on Consumer Electronics (ICCE); 2018-03-29; pp. 1-4 *
Feature Learning With Matrix Factorization Applied to Acoustic Scene Classification; Victor Bisot et al.; IEEE/ACM Transactions on Audio, Speech, and Language Processing; 2017-05-23; pp. 1216-1229 *
Image semantic segmentation based on multi-scale feature fusion (基于多尺度特征融合的图像语义分割); Hua Jing et al.; Journal of China Jiliang University (中国计量大学学报); 2019-09-30; pp. 323-330 *

Also Published As

Publication number  Publication date
CN110782878A (en)  2020-02-11


Legal Events

Code  Title
PB01  Publication
SE01  Entry into force of request for substantive examination
GR01  Patent grant