CN110796027B - Sound scene recognition method based on neural network model of tight convolution - Google Patents

Sound scene recognition method based on neural network model of tight convolution

Info

Publication number
CN110796027B
Authority
CN
China
Prior art keywords
convolution
tight
network model
neural network
feature
Prior art date
Legal status
Active
Application number
CN201910960583.5A
Other languages
Chinese (zh)
Other versions
CN110796027A (en)
Inventor
张涛
冯国庆
梁晋华
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201910960583.5A priority Critical patent/CN110796027B/en
Publication of CN110796027A publication Critical patent/CN110796027A/en
Application granted granted Critical
Publication of CN110796027B publication Critical patent/CN110796027B/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/12Classification; Matching
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

A sound scene recognition method based on a tightly-convolved neural network model comprises the following steps: establishing a tightly-convolved neural network model for sound scene classification; inputting a training set containing audio files of different scene categories, together with their corresponding scene labels, into the model and training it; reading an audio file and preprocessing it to obtain audio signal segments; extracting a log-mel spectrogram from each audio signal segment; and inputting the log-mel spectrogram into the trained model to obtain the final sound scene category. The invention ensures that effective features are fully exploited so that accuracy is preserved, while simplifying the network model to reduce memory consumption; it is therefore more efficient for sound scene recognition and better meets the performance constraints of sound scene recognition devices.

Description

Sound scene recognition method based on neural network model of tight convolution
Technical Field
The invention relates to sound scene recognition methods, and in particular to a sound scene recognition method based on a tightly-convolved neural network model.
Background
Sound scene recognition is a technology that processes collected sound signals and then judges the scene they come from; it is widely applied in smart homes, security monitoring, audio retrieval, and similar areas. In recent years, with the advent of various deep neural network frameworks, convolutional neural networks have been used increasingly in sound scene recognition. Most applications of sound scene recognition require the recognition function to run on mobile devices, but neural network models are typically very large and place heavy demands on the hardware, and the computing resources of mobile devices can hardly satisfy a large-scale neural network model. How to reduce model size and increase inference speed while maintaining good performance is therefore a research focus. Reducing unnecessary parameter computation is of great significance for designing a lightweight network that can be applied to scene recognition on mobile devices and can meet the real-time requirements of scene recognition in production and daily life.
According to how information is distributed over space and channels, different methods for reducing information redundancy have been proposed. Spatial pooling, commonly in the form of max pooling or average pooling, effectively reduces spatial information redundancy: it removes unnecessary feature information and enlarges the receptive field. A common method for reducing redundancy across channels is channel pruning, in which some channels are discarded at random without considering the information carried by their feature maps. This is simple to implement, but the randomly discarded channels inevitably carry some important information, so the accuracy of the neural network model decreases.
Limited computing resources have always been a major obstacle to the development of neural networks. The AlexNet framework proposed in 2012 introduced the concept of grouped convolution to work around GPU memory limits: the network was split into two parts that ran simultaneously on two graphics cards. The ShuffleNet paper published by Face++ (Megvii) in 2017 built on grouped convolution and proposed shuffled grouped convolution, in which the sub-groups are randomly shuffled and regrouped after grouping; grouped convolution reduces the parameter count, and the shuffling increases the flow of information between channels. The lightweight network framework MobileNet v1, published by Google in 2017, applies depthwise separable convolution, which splits a conventional convolution into a depthwise convolution followed by a 1×1 (pointwise) convolution, greatly reducing the parameter count.
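To make the savings concrete, consider a 3×3 convolution mapping 64 input channels to 128 output channels: a standard convolution needs 64×128×3×3 = 73,728 weights, while the separable version needs only 64×3×3 + 64×128 = 8,768, roughly an 8× reduction. A minimal Python check of this arithmetic (an illustration, not part of the patent):

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def separable_params(c_in, c_out, k):
    """Depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    return c_in * k * k + c_in * c_out

print(conv_params(64, 128, 3))       # 73728
print(separable_params(64, 128, 3))  # 8768 -> about 1/8.4 of the standard conv
```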
Disclosure of Invention
The technical problem the invention aims to solve is to provide a sound scene recognition method based on a tightly-convolved neural network model that reduces model complexity while maintaining good performance.
The technical scheme adopted by the invention is as follows. A sound scene recognition method based on a tightly-convolved neural network model comprises the following steps:
1) Establishing a tightly-convolved neural network model for sound scene classification;
2) Inputting a training set containing audio files of different scene categories, together with the corresponding scene labels, into the tightly-convolved neural network model for sound scene classification, and training the model;
3) Reading an audio file and preprocessing it to obtain audio signal segments;
4) Extracting a log-mel spectrogram from said audio signal segments;
5) Inputting the log-mel spectrogram into the trained tightly-convolved neural network model for sound scene classification to obtain the final sound scene category.
The tightly-convolved neural network model in step 1) comprises the following components connected in series:
the first feature extraction module, which performs one pass of feature extraction on the received log-mel spectrogram with different convolution kernels and applies a nonlinear transformation, yielding first nonlinear feature maps under the different convolution kernels;
the second feature extraction module, which performs a second feature extraction on the first nonlinear feature maps with different convolution kernels and applies a second nonlinear transformation, yielding second nonlinear feature maps under the different convolution kernels;
the tight convolution module, formed by connecting n tight convolution units in series, which successively extracts depth features from the second nonlinear feature maps with different convolution kernels;
and the Softmax layer, which performs a weighted decision over the finally extracted depth feature maps under the different convolution kernels and outputs the sound scene recognition result.
The first feature extraction module and the second feature extraction module have the same structure, each comprising a convolution layer that extracts features from the received input with different convolution kernels and a ReLU activation function layer that applies a nonlinear transformation to the extracted feature maps under the different convolution kernels.
The n tight convolution units have the same structure, each comprising a tight convolution layer that extracts depth features from the received feature maps with different convolution kernels and a ReLU activation function layer that applies a nonlinear transformation to the feature maps extracted by the tight convolution layer under the different convolution kernels.
The tight convolution layer comprises the following components connected in series: a depthwise convolution layer that extracts feature maps with different convolution kernels, a channel compression layer that reduces the number of feature maps extracted by the depthwise convolution layer under the different convolution kernels, and a 1×1 convolution layer that performs a 1×1 convolution on the reduced feature maps.
The channel compression layer operates as follows:
(1) Group the received feature maps extracted by the depthwise convolution layer under different convolution kernels, with two or more feature maps in each group;
(2) For each group, compare or average the parameters at the same position across all feature maps in the group, and take the maximum value from the comparison (or the computed average) as the parameter at that position of a new feature map, thereby obtaining the new feature map of the group;
(3) Output each new feature map.
The preprocessing in step 3) cuts the input signal into segments with a fixed duration of 10 s.
Step 4) comprises:
(1) Framing and windowing the input audio signal segment;
(2) Passing the resulting audio frames through a mel filter bank, computing the energy passing through each mel filter within each time step of the audio frames, forming the energies of all mel filters within each time step into an energy vector, and combining the energy vectors over all time steps to obtain the two-dimensional mel spectrogram of the corresponding audio frames;
(3) Taking the logarithm of the two-dimensional mel spectrogram to obtain the log-mel spectrogram.
Compared with a conventional convolutional neural network model, the sound scene recognition method based on a tightly-convolved neural network model greatly reduces the parameter count and the number of floating-point operations while keeping the same accuracy. The resulting lightweight sound scene recognition network has low computation, memory, and power requirements and lowers the demand on computing resources, so it is more practical and better suited for deployment on mobile devices. The invention effectively reduces redundant feature information and avoids processing unnecessary features in the neural network; it ensures that effective features are fully exploited so that accuracy is unchanged, while simplifying the network model to reduce memory consumption, making sound scene recognition more efficient and better matching the performance constraints of sound scene recognition devices.
Drawings
FIG. 1 is a schematic diagram of the structure of the tightly-convolved neural network model in the sound scene recognition method of the present invention;
FIG. 2 is a schematic diagram of the structure of a tight convolution unit in the tightly-convolved neural network model;
FIG. 3 is a schematic diagram of the composition of the channel compression layer in a tight convolution unit.
Detailed Description
The sound scene recognition method based on a tightly-convolved neural network model of the present invention is described in detail below with reference to embodiments and the accompanying drawings.
The invention discloses a sound scene recognition method based on a tightly-convolved neural network model, comprising the following steps:
1) Establishing a tightly-convolved neural network model for sound scene classification; as shown in fig. 1, the tightly-convolved neural network model comprises the following components connected in series:
the first feature extraction module 1, which performs one pass of feature extraction on the received log-mel spectrogram with different convolution kernels and applies a nonlinear transformation, yielding first nonlinear feature maps under the different convolution kernels;
the second feature extraction module 2, which performs a second feature extraction on the first nonlinear feature maps with different convolution kernels and applies a second nonlinear transformation, yielding second nonlinear feature maps under the different convolution kernels;
the tight convolution module 3, formed by connecting n tight convolution units 3.1, 3.2, …, 3.n in series, which successively extracts depth features from the second nonlinear feature maps with different convolution kernels;
and the Softmax layer 4, which performs a weighted decision over the finally extracted depth feature maps under the different convolution kernels and outputs the sound scene recognition result. Wherein:
the first feature extraction module 1 and the second feature extraction module 2 have the same structure and both comprise: and the ReLU activation function layer is used for carrying out characteristic extraction on the received logarithmic Meier diagram by adopting different convolution kernels and carrying out nonlinear transformation on the extracted characteristic diagram under the different convolution kernels.
As shown in fig. 2, the n tight convolution units 3.1, 3.2, …, 3.n have the same structure; each comprises a tight convolution layer that extracts depth features from the received feature maps with different convolution kernels, and a ReLU activation function layer that applies a nonlinear transformation to the feature maps extracted by the tight convolution layer under the different convolution kernels. The tight convolution layer comprises the following components connected in series: a depthwise convolution layer 3.10 that extracts feature maps with different convolution kernels, a channel compression layer 3.11 that reduces the number of feature maps extracted by the depthwise convolution layer 3.10 under the different convolution kernels, and a 1×1 convolution layer 3.12 that performs a 1×1 convolution on the reduced feature maps.
The channel compression layer 3.11 reduces information redundancy between channels and the amount of parameter computation. It fuses different feature maps across channels, keeping the parameters that carry the most information among two or more feature maps and discarding parameters with relatively little information; this reduces information redundancy and the parameter count of the next convolution while preserving the integrity of the features. To further reduce computation, separable convolution replaces ordinary convolution; combined with channel compression, this reduces model complexity well without lowering model accuracy. As shown in fig. 3, the channel compression layer 3.11 operates as follows (a code sketch follows this list):
(1) Group the received feature maps extracted by the depthwise convolution layer 3.10 under different convolution kernels, with two or more feature maps in each group;
(2) For each group, compare or average the parameters at the same position across all feature maps in the group, and take the maximum value from the comparison (or the computed average) as the parameter at that position of a new feature map, thereby obtaining the new feature map of the group;
(3) Output each new feature map.
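The following PyTorch sketch renders the tight convolution unit of fig. 2 under one possible reading of the above description (depthwise convolution → channel compression by group-wise max or mean → 1×1 convolution → ReLU). It is an illustrative sketch, not the patented implementation: the names `ChannelCompression`, `TightConvUnit`, `tight_coeff`, and `mode` are ours, and the 3×3 depthwise kernel size is an assumption.

```python
import torch
import torch.nn as nn

class ChannelCompression(nn.Module):
    """Group channels and keep one map per group by taking the element-wise
    max or mean over each group (hypothetical reading of the patent)."""

    def __init__(self, tight_coeff=2, mode="max"):
        super().__init__()
        self.g = tight_coeff   # feature maps per group ("tight coefficient")
        self.mode = mode       # "max" or "mean"

    def forward(self, x):
        b, c, h, w = x.shape
        assert c % self.g == 0, "channel count must divide into groups"
        x = x.view(b, c // self.g, self.g, h, w)
        if self.mode == "max":
            return x.max(dim=2).values
        return x.mean(dim=2)

class TightConvUnit(nn.Module):
    """Depthwise conv -> channel compression -> 1x1 conv -> ReLU."""

    def __init__(self, c_in, c_out, tight_coeff=2, mode="max"):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)
        self.compress = ChannelCompression(tight_coeff, mode)
        self.pointwise = nn.Conv2d(c_in // tight_coeff, c_out, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.compress(self.depthwise(x))))

x = torch.randn(1, 64, 128, 431)   # (batch, channels, mel bins, time steps)
y = TightConvUnit(64, 128)(x)
print(y.shape)                     # torch.Size([1, 128, 128, 431])
```

With a tight coefficient of g, the compression step cuts the channel count, and hence the parameters of the following 1×1 convolution, by a factor of g.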
2) Inputting a training set containing audio files of different scene categories, together with the corresponding scene labels, into the tightly-convolved neural network model for sound scene classification, and training the model;
3) Reading an audio file and preprocessing it to obtain audio signal segments; the preprocessing cuts the input signal into segments with a fixed duration of 10 s.
4) Extracting a log-mel spectrogram from said audio signal segments, which comprises the following steps (see the sketch after this list):
(1) A speech signal is a typical non-stationary signal, but because the motion of the vocal organs is very slow compared with the speed of sound-wave vibration, a speech signal is generally considered stationary over a span of 10 ms to 30 ms; the input audio signal segments are therefore framed and windowed;
(2) Pass the resulting audio frames through a mel filter bank, compute the energy passing through each mel filter within each time step of the audio frames, form the energies of all mel filters within each time step into an energy vector, and combine the energy vectors over all time steps to obtain the two-dimensional mel spectrogram of the corresponding audio frames;
the energy passing through each Mel filter in each time step range in the audio frame is calculated by adopting the following formula:
where M is the number of Mel filters, H (k) is the transfer function of the Mel filters, and X (k) is the magnitude value of the corresponding FFT.
(3) Take the logarithm of the two-dimensional mel spectrogram to obtain the log-mel spectrogram.
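For concreteness, steps (1)-(3) can be sketched with librosa as below. This is an assumption-laden illustration: the patent names no library, the 48 kHz sampling rate is a guess, and the filter count (134), window length (1704), and overlap (852) are taken from the embodiment described further below.

```python
import numpy as np
import librosa

def log_mel_spectrogram(path, sr=48000, n_fft=2048, win_length=1704,
                        hop_length=852, n_mels=134, duration=10.0):
    """Log-mel spectrogram of a fixed-duration audio segment.
    sr is an assumption; the patent does not state the sampling rate."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, win_length=win_length,
        hop_length=hop_length, window="hamming", n_mels=n_mels, power=2.0)
    return np.log(mel + 1e-10)   # small offset avoids log(0)

logmel = log_mel_spectrogram("scene.wav")
print(logmel.shape)   # (134, number of time steps)
```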
5) Inputting the log-mel spectrogram into the trained tightly-convolved neural network model for sound scene classification to obtain the final sound scene category.
To verify the effectiveness of the sound scene recognition method based on a tightly-convolved neural network model, this section compares the MobileNet v1 and CNN8 network frameworks with and without tight convolution. The DCASE2019 data set is used; training uses the Adam optimizer with adaptive learning-rate decay and a batch size of 32. The prediction process involves only forward propagation, and the specific steps are as follows (a code sketch follows the list):
1. Reading the audio signal and truncating it into speech segments with a fixed duration of 10 s;
2. Framing and windowing each fixed-duration speech signal, with 2048 sampling points per frame and a Hamming window of length 2048 applied to each frame;
3. Extracting features from the framed signal through a mel filter bank and taking the logarithm, with 134 filters, a filter window length of 1704 points, and an overlap of 852 points between frames;
4. Inputting the mel spectrogram into the tightly-convolved neural network model and performing forward propagation to obtain the final sound scene category.
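Putting the sketches together, a minimal forward-propagation (prediction) pass might look as follows. `TightConvNet` is a hypothetical stand-in for the full network of fig. 1, whose exact layer widths and unit count the patent does not enumerate; the 10-class output matches the DCASE2019 acoustic scene task.

```python
import torch

# log_mel_spectrogram and TightConvUnit are the sketches defined above.
class TightConvNet(torch.nn.Module):
    """Two ordinary convolution blocks, n tight convolution units,
    and a softmax classifier (a hypothetical rendering of fig. 1)."""

    def __init__(self, n_classes=10, n_units=4):
        super().__init__()
        self.features = torch.nn.Sequential(
            torch.nn.Conv2d(1, 32, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(32, 64, 3, padding=1), torch.nn.ReLU(),
            *[TightConvUnit(64 * 2**i, 64 * 2**(i + 1)) for i in range(n_units)])
        self.classifier = torch.nn.Linear(64 * 2**n_units, n_classes)

    def forward(self, x):
        h = self.features(x).mean(dim=(2, 3))   # global average pooling
        return torch.softmax(self.classifier(h), dim=1)

model = TightConvNet()
model.eval()
x = torch.from_numpy(log_mel_spectrogram("scene.wav")).float()
with torch.no_grad():
    probs = model(x[None, None])   # add batch and channel dimensions
print(int(probs.argmax(dim=1)))    # index of the predicted scene category
```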
For the MobileNet v1 network, every convolution except the first-layer convolution (which keeps its original structure) is replaced by tight convolution, with the tight coefficient set to 4 and the averaging method used in the channel compression operation. The fully connected layer of the original MobileNet v1 is also removed, further reducing the parameter count.
The CNN8 network is essentially a simplified VGG16. To retain more of the original features, the first and second layers still use ordinary convolution, while the remaining convolution layers replace ordinary convolution with tight convolution, with the tight coefficient set to 2 and the maximum-value method used in the channel compression operation.
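In terms of the sketch above, the two experimental settings would correspond to instantiations like the following (hypothetical, since the patent describes these configurations only in prose):

```python
# MobileNet v1 variant: tight coefficient 4, channel compression by averaging.
mobilenet_unit = TightConvUnit(64, 128, tight_coeff=4, mode="mean")

# CNN8 variant: tight coefficient 2, channel compression by maximum.
cnn8_unit = TightConvUnit(64, 128, tight_coeff=2, mode="max")
```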
Table 1 compares six neural network models. With the width scaling factor set to 1.0 in the MobileNet v1 framework, the parameter count and number of floating-point operations of the tight-convolution MobileNet v1 are about 1/4 of those of the ordinary-convolution network, and the accuracy is improved. Although it has somewhat more parameters and floating-point operations than MobileNet v1 with a width scaling factor of 0.5, the accuracy of that smaller MobileNet v1 is two percentage points lower. This shows that tight convolution handles the feature information more appropriately during channel compression, discarding redundant information while retaining more valuable information, which makes the model more efficient. Comparing the tight-convolution CNN8 with the ordinary-convolution CNN8, the parameter count and floating-point operations drop to about 1/14 while the accuracy stays essentially unchanged, so the model compression effect is even more pronounced.
Table 1 Comparison of various sound scene recognition algorithms

Claims (1)

1. A sound scene recognition method based on a tightly-convolved neural network model, characterized by comprising the following steps:
1) Establishing a tightly-convolved neural network model for sound scene classification; the tightly-convolved neural network model comprises the following components connected in series:
the first feature extraction module (1), which performs one pass of feature extraction on the received log-mel spectrogram with different convolution kernels and applies a nonlinear transformation, yielding first nonlinear feature maps under the different convolution kernels;
the second feature extraction module (2), which performs a second feature extraction on the first nonlinear feature maps with different convolution kernels and applies a second nonlinear transformation, yielding second nonlinear feature maps under the different convolution kernels;
the tight convolution module (3), formed by connecting n tight convolution units (3.1, 3.2, …, 3.n) in series, which successively extracts depth features from the second nonlinear feature maps with different convolution kernels;
the Softmax layer (4), which performs a weighted decision over the finally extracted depth feature maps under the different convolution kernels and outputs the sound scene recognition result;
the first feature extraction module (1) and the second feature extraction module (2) have the same structure and both comprise: the ReLU activation function layer is used for carrying out nonlinear transformation on the feature graphs under the extracted different convolution kernels;
the n tight convolution units (3.1, 3.2, 3. N) have the same structure and comprise a tight convolution layer for extracting depth features of the received feature images by adopting different convolution kernels and a ReLU activation function layer for performing nonlinear transformation on the feature images extracted by the tight convolution layer under the different convolution kernels;
the tight convolution layer comprises the following components in series: a depth convolution layer (3.10) for extracting feature images using different convolution kernels, a channel compression layer (3.11) for reducing feature images under the different convolution kernels extracted by the depth convolution layer (3.10), and a 1 x 1 convolution layer (3.12) for performing 1 x 1 convolution on the reduced feature images;
the channel compression layer (3.11) comprises:
(1) Grouping received characteristic diagrams under different convolution kernels extracted by a depth convolution layer (3.10), wherein each group has more than 2 characteristic diagrams;
(2) Comparing or averaging the parameters of the same position of all the feature images of each group, and taking the maximum value or the obtained average value in the comparison result as the parameter of the same position of the new feature image, thereby obtaining the new feature image of the group;
(3) Outputting each group of new feature graphs;
2) Inputting a training set containing audio files of different scene categories, together with the corresponding scene labels, into the tightly-convolved neural network model for sound scene classification, and training the model;
3) Reading an audio file and preprocessing it to obtain audio signal segments;
the preprocessing cuts the input signal into segments with a fixed duration of 10 s;
4) Extracting a log-mel spectrogram from said audio signal segments, comprising:
(1) Framing and windowing the input audio signal segment;
(2) Passing the resulting audio frames through a mel filter bank, computing the energy passing through each mel filter within each time step of the audio frames, forming the energies of all mel filters within each time step into an energy vector, and combining the energy vectors over all time steps to obtain the two-dimensional mel spectrogram of the corresponding audio frames;
(3) Taking the logarithm of the two-dimensional mel spectrogram to obtain the log-mel spectrogram;
5) Inputting the log-mel spectrogram into the trained tightly-convolved neural network model for sound scene classification to obtain the final sound scene category.
CN201910960583.5A 2019-10-10 2019-10-10 Sound scene recognition method based on neural network model of tight convolution Active CN110796027B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910960583.5A CN110796027B (en) 2019-10-10 2019-10-10 Sound scene recognition method based on neural network model of tight convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910960583.5A CN110796027B (en) 2019-10-10 2019-10-10 Sound scene recognition method based on neural network model of tight convolution

Publications (2)

Publication Number Publication Date
CN110796027A CN110796027A (en) 2020-02-14
CN110796027B (en) 2023-10-17

Family

ID=69438941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910960583.5A Active CN110796027B (en) 2019-10-10 2019-10-10 Sound scene recognition method based on neural network model of tight convolution

Country Status (1)

Country Link
CN (1) CN110796027B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461172B (en) * 2020-03-04 2023-05-30 哈尔滨工业大学 Lightweight characteristic fusion method of hyperspectral remote sensing data based on two-dimensional point group convolution
CN112016639B (en) * 2020-11-02 2021-01-26 四川大学 Flexible separable convolution framework and feature extraction method and application thereof in VGG and ResNet
CN113539283A (en) * 2020-12-03 2021-10-22 腾讯科技(深圳)有限公司 Audio processing method and device based on artificial intelligence, electronic equipment and storage medium
CN112634928B (en) * 2020-12-08 2023-09-29 北京有竹居网络技术有限公司 Sound signal processing method and device and electronic equipment
CN112786057B (en) * 2021-02-23 2023-06-02 厦门熵基科技有限公司 Voiceprint recognition method and device, electronic equipment and storage medium
CN113281660A (en) * 2021-05-21 2021-08-20 张家港清研检测技术有限公司 Method for detecting unqualified battery cell in retired power battery pack
CN113793622B (en) * 2021-09-10 2023-08-29 中国科学院声学研究所 Audio scene recognition method, system and device


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190066657A1 (en) * 2017-08-31 2019-02-28 National Institute Of Information And Communications Technology Audio data learning method, audio data inference method and recording medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107895192A (en) * 2017-12-06 2018-04-10 广州华多网络科技有限公司 Depth convolutional network compression method, storage medium and terminal
CN108231067A (en) * 2018-01-13 2018-06-29 福州大学 Sound scenery recognition methods based on convolutional neural networks and random forest classification
CN108520757A (en) * 2018-03-31 2018-09-11 华南理工大学 Music based on auditory properties is applicable in scene automatic classification method
CN109448702A (en) * 2018-10-30 2019-03-08 上海力声特医学科技有限公司 Artificial cochlea's auditory scene recognition methods
CN109978137A (en) * 2019-03-20 2019-07-05 厦门美图之家科技有限公司 A kind of processing method of convolutional neural networks
CN110085218A (en) * 2019-03-26 2019-08-02 天津大学 A kind of audio scene recognition method based on feature pyramid network
CN110188863A (en) * 2019-04-30 2019-08-30 杭州电子科技大学 A kind of convolution kernel and its compression algorithm of convolutional neural networks
CN110223715A (en) * 2019-05-07 2019-09-10 华南理工大学 It is a kind of based on sound event detection old solitary people man in activity estimation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on convolutional neural networks for abnormal sound recognition; Hu Tao et al.; Signal Processing (《信号处理》); 357-367 *

Also Published As

Publication number Publication date
CN110796027A (en) 2020-02-14

Similar Documents

Publication Publication Date Title
CN110796027B (en) Sound scene recognition method based on neural network model of tight convolution
CN110782878B (en) Attention mechanism-based multi-scale audio scene recognition method
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN109841226B (en) Single-channel real-time noise reduction method based on convolution recurrent neural network
CN111933188B (en) Sound event detection method based on convolutional neural network
CN110808033B (en) Audio classification method based on dual data enhancement strategy
CN110033756B (en) Language identification method and device, electronic equipment and storage medium
CN109448719A (en) Establishment of Neural Model method and voice awakening method, device, medium and equipment
CN111508524B (en) Method and system for identifying voice source equipment
CN109559755A (en) A kind of sound enhancement method based on DNN noise classification
CN111599376A (en) Sound event detection method based on cavity convolution cyclic neural network
CN112183582A (en) Multi-feature fusion underwater target identification method
CN112735466B (en) Audio detection method and device
CN111862978A (en) Voice awakening method and system based on improved MFCC (Mel frequency cepstrum coefficient)
Yu Research on music emotion classification based on CNN-LSTM network
CN115035887A (en) Voice signal processing method, device, equipment and medium
CN113948107A (en) Engine fault diagnosis method based on end-to-end CNN fault diagnosis model
CN113488069A (en) Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network
CN117524252B (en) Light-weight acoustic scene perception method based on drunken model
Zhang et al. Filamentary Convolution for Spoken Language Identification: A Brain-Inspired Approach
CN113505266B (en) Two-stage anchor-based dynamic video abstraction method
Cai et al. A Contrastive Semi-Supervised Learning Framework For Anomaly Sound Detection.
CN114464201A (en) Single-channel speech enhancement method based on attention mechanism and convolutional neural network
Cao et al. LightCAM: A Fast and Light Implementation of Context-Aware Masking based D-Tdnn for Speaker Verification
CN114997210A (en) Machine abnormal sound identification and detection method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant