CN110796027B - Sound scene recognition method based on neural network model of tight convolution - Google Patents
- Publication number
- CN110796027B (application CN201910960583.5A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- tight
- network model
- neural network
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2218/00—Aspects of pattern recognition specially adapted for signal processing
- G06F2218/12—Classification; Matching
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
A sound scene recognition method based on a tight-convolution neural network model comprises the following steps: establishing a tight-convolution neural network model for sound scene classification; inputting a training set containing audio files of different scene categories, together with the corresponding category labels, into the model and training it; reading an audio file and preprocessing it to obtain audio signal segments; extracting a log-Mel spectrogram from each audio signal segment; and inputting the log-Mel spectrogram into the trained model to obtain the final sound scene category. The invention ensures that effective features are fully utilized, so that accuracy is preserved, while simplifying the network model to reduce memory consumption; sound scene recognition therefore becomes more efficient and better meets the performance constraints of sound scene recognition devices.
Description
Technical Field
The invention relates to a sound scene recognition method, and in particular to a sound scene recognition method based on a tight-convolution neural network model.
Background
Sound scene recognition is a technology that processes collected sound signals and then judges the scene in which they were recorded; it is widely applied in smart homes, security monitoring, audio retrieval, and similar areas. In recent years, with the advent of various deep neural network frameworks, convolutional neural networks have been used increasingly for sound scene recognition. Most applications require the recognition function to run on mobile devices, but large neural network models place heavy demands on hardware, and the computing resources of mobile devices can rarely satisfy them. How to reduce model size and increase inference speed while maintaining good performance is therefore a research focus. Reducing unnecessary parameter computation is important for designing a lightweight network that can perform scene recognition on mobile devices and meet the real-time requirements of production and daily life.
According to how information is distributed over space and channels, different methods have been proposed to reduce redundancy. Spatial pooling, commonly max pooling or average pooling, effectively reduces spatial redundancy: it discards unnecessary feature information and enlarges the receptive field. On the channel dimension, a common approach is channel pruning, i.e., randomly discarding some channels without considering the information their feature maps carry. This is simple to implement, but the randomly discarded channels inevitably carry some important information, which reduces the accuracy of the neural network model.
Limited computing resources have always been an important obstacle to neural network development. The AlexNet framework proposed in 2012 introduced the concept of grouped convolution to work around limited GPU memory: the network was split into two parts that ran simultaneously on two GPUs. The ShuffleNet paper published by Megvii (Face++) in 2017 built on grouped convolution with channel shuffling: the channel groups are shuffled and regrouped, so grouped convolution reduces the parameter count while the shuffle increases information flow between channels. The lightweight MobileNet v1 framework published by Google in 2017 applies depthwise separable convolution, splitting a conventional convolution into a depthwise convolution and a 1×1 convolution, which greatly reduces the parameter count.
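The parameter savings of grouped and depthwise separable convolution described above can be checked with simple arithmetic. The sketch below counts weights for one layer under each scheme; the layer shape (64 → 128 channels, 3×3 kernel) is an illustrative choice of ours, not a figure from the patent:

```python
# Weight counts (biases ignored) for one convolutional layer mapping
# c_in channels to c_out channels with a k x k kernel.

def standard_conv_params(c_in, c_out, k):
    return c_in * c_out * k * k

def grouped_conv_params(c_in, c_out, k, groups):
    # Each group convolves c_in/groups input channels to c_out/groups outputs.
    return groups * (c_in // groups) * (c_out // groups) * k * k

def depthwise_separable_params(c_in, c_out, k):
    # One k x k filter per input channel, followed by a 1x1 convolution.
    return c_in * k * k + c_in * c_out

c_in, c_out, k = 64, 128, 3
print(standard_conv_params(c_in, c_out, k))        # 73728
print(grouped_conv_params(c_in, c_out, k, 2))      # 36864, half of standard
print(depthwise_separable_params(c_in, c_out, k))  # 8768
```

With two groups the count halves, and the depthwise separable form is roughly an order of magnitude smaller, which is the effect MobileNet v1 exploits.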
Disclosure of Invention
The invention aims to provide a sound scene recognition method based on a tight-convolution neural network model that reduces model complexity while maintaining good performance.
The technical scheme adopted by the invention is as follows: a sound scene recognition method based on a tight-convolution neural network model comprises the following steps:
1) Establishing a tight-convolution neural network model for sound scene classification;
2) Inputting a training set containing audio files of different scene categories, together with the corresponding category labels, into the tight-convolution neural network model for sound scene classification, and training it;
3) Reading an audio file and preprocessing it to obtain audio signal segments;
4) Extracting a log-Mel spectrogram from each audio signal segment;
5) Inputting the log-Mel spectrogram into the trained tight-convolution neural network model to obtain the final sound scene category.
The tight-convolution neural network model in step 1) comprises the following components in series:
a first feature extraction module, which performs one pass of feature extraction on the received log-Mel spectrogram with different convolution kernels and applies a nonlinear transformation to obtain first nonlinear feature maps under the different convolution kernels;
a second feature extraction module, which performs a second pass of feature extraction on the first nonlinear feature maps with different convolution kernels and applies a second nonlinear transformation to obtain second nonlinear feature maps under the different convolution kernels;
a tight convolution module, formed by connecting n tight convolution units in series, which successively extracts depth features from the second nonlinear feature maps with different convolution kernels;
and a Softmax layer, which performs a weighted decision on the finally extracted depth feature maps under the different convolution kernels and outputs the sound scene recognition result.
The first feature extraction module and the second feature extraction module have the same structure; each comprises a convolution layer for extracting features from the received input with different convolution kernels, and a ReLU activation function layer for applying a nonlinear transformation to the feature maps extracted under the different convolution kernels.
The n tight convolution units have the same structure; each comprises a tight convolution layer, which extracts depth features from the received feature maps with different convolution kernels, and a ReLU activation function layer, which applies a nonlinear transformation to the feature maps extracted by the tight convolution layer under the different convolution kernels.
The tight convolution layer comprises, in series: a depthwise convolution layer that extracts feature maps with different convolution kernels, a channel compression layer that reduces the number of feature maps extracted by the depthwise convolution layer, and a 1×1 convolution layer that applies a 1×1 convolution to the reduced set of feature maps.
The channel compression layer operates as follows:
(1) The received feature maps extracted by the depthwise convolution layer under different convolution kernels are divided into groups, each group containing two or more feature maps;
(2) For each group, the values at each position of all the feature maps in the group are either compared or averaged, and the maximum value (or the average) is taken as the value at the same position of a new feature map, yielding one new feature map per group;
(3) The new feature maps are output.
The preprocessing in step 3) cuts the input signal into segments with a fixed duration of 10 s.
Step 4) comprises:
(1) Framing and windowing the input audio signal segment;
(2) Passing the resulting audio frames through a Mel filter bank: for each time step, the energy passing through each Mel filter is calculated; the per-filter energies of a time step form an energy vector, and the energy vectors of all time steps are combined into a two-dimensional Mel spectrogram of the corresponding audio frames;
(3) Taking the logarithm of the two-dimensional Mel spectrogram to obtain the log-Mel spectrogram.
Compared with a conventional convolutional neural network model, the sound scene recognition method of the invention greatly reduces the parameter count and floating-point operations while maintaining the same accuracy. The resulting lightweight network consumes little computation, memory, and power, lowering the demand on computing resources, so it is more practical and can be better applied to mobile devices. The invention effectively reduces redundant feature information and avoids processing unnecessary features in the network; effective features are fully utilized so that accuracy is preserved, while the simplified model reduces memory consumption, making sound scene recognition more efficient and better matched to the performance constraints of recognition devices.
Drawings
FIG. 1 is a schematic diagram of the construction of a tightly-convoluted neural network model in a sound scene recognition method based on the tightly-convoluted neural network model of the present invention;
FIG. 2 is a schematic diagram of the construction of a tightly-convolved cell in a tightly-convolved neural network model;
fig. 3 is a schematic diagram of the composition of a channel compression layer in a tight convolution unit.
Detailed Description
A sound scene recognition method based on a tight-convolution neural network model according to the present invention will be described in detail with reference to the embodiments and the accompanying drawings.
The sound scene recognition method based on a tight-convolution neural network model of the invention comprises the following steps:
1) Establishing a tight-convolution neural network model for sound scene classification; as shown in fig. 1, the model comprises the following components in series:
the first feature extraction module 1, which performs one pass of feature extraction on the received log-Mel spectrogram with different convolution kernels and applies a nonlinear transformation to obtain first nonlinear feature maps under the different convolution kernels;
the second feature extraction module 2, which performs a second pass of feature extraction on the first nonlinear feature maps with different convolution kernels and applies a second nonlinear transformation to obtain second nonlinear feature maps under the different convolution kernels;
the tight convolution module 3, formed by connecting n tight convolution units 3.1, 3.2, …, 3.n in series, which successively extracts depth features from the second nonlinear feature maps with different convolution kernels;
and the Softmax layer 4, which performs a weighted decision on the finally extracted depth feature maps under the different convolution kernels and outputs the sound scene recognition result. Wherein:
the first feature extraction module 1 and the second feature extraction module 2 have the same structure and both comprise: and the ReLU activation function layer is used for carrying out characteristic extraction on the received logarithmic Meier diagram by adopting different convolution kernels and carrying out nonlinear transformation on the extracted characteristic diagram under the different convolution kernels.
As shown in fig. 2, the n tight convolution units 3.1, 3.2, …, 3.n have the same structure; each comprises a tight convolution layer, which extracts depth features from the received feature maps with different convolution kernels, and a ReLU activation function layer, which applies a nonlinear transformation to the feature maps extracted by the tight convolution layer. The tight convolution layer comprises, in series: a depthwise convolution layer 3.10 that extracts feature maps with different convolution kernels, a channel compression layer 3.11 that reduces the number of feature maps extracted by the depthwise convolution layer 3.10, and a 1×1 convolution layer 3.12 that applies a 1×1 convolution to the reduced set of feature maps.
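As we read the description, one tight convolution layer is a depthwise convolution, a channel-compression step, and a 1×1 convolution applied in series. The NumPy sketch below is our own minimal illustration of that pipeline; the tensor shapes, "valid" padding, and a compression group size of 4 are assumptions for the example, not the patent's exact configuration:

```python
import numpy as np

def depthwise_conv(x, kernels):
    """x: (C, H, W); kernels: (C, kh, kw). One kernel per channel, 'valid' padding."""
    c, h, w = x.shape
    kh, kw = kernels.shape[1:]
    out = np.zeros((c, h - kh + 1, w - kw + 1))
    for ch in range(c):
        for i in range(h - kh + 1):
            for j in range(w - kw + 1):
                out[ch, i, j] = np.sum(x[ch, i:i + kh, j:j + kw] * kernels[ch])
    return out

def channel_compress(x, group, mode="max"):
    """Merge every `group` adjacent channels into one by max or mean."""
    c, h, w = x.shape
    assert c % group == 0
    grouped = x.reshape(c // group, group, h, w)
    return grouped.max(axis=1) if mode == "max" else grouped.mean(axis=1)

def pointwise_conv(x, weights):
    """1x1 convolution: weights has shape (C_out, C_in)."""
    return np.tensordot(weights, x, axes=([1], [0]))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))           # an 8-channel input feature map
y = depthwise_conv(x, rng.standard_normal((8, 3, 3)))
y = channel_compress(y, group=4, mode="mean")  # 8 -> 2 channels (tight coefficient 4)
y = np.maximum(pointwise_conv(y, rng.standard_normal((16, 2))), 0)  # 1x1 conv + ReLU
print(y.shape)  # (16, 14, 14)
```

Because the 1×1 convolution sees only the compressed channels (2 instead of 8 here), its weight matrix shrinks by the tight coefficient, which is where the parameter saving comes from.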
The channel compression layer 3.11 reduces information redundancy between channels and the amount of parameter computation. It fuses different feature maps across channels, retaining the values that carry the most information among two or more feature maps and discarding values with relatively little information; this reduces redundancy and the parameter count of the next convolution while preserving the integrity of the features. To further reduce computation, separable convolution replaces ordinary convolution; combined with channel compression, it lowers model complexity without reducing accuracy. As shown in fig. 3, the channel compression layer 3.11 operates as follows:
(1) The received feature maps extracted by the depthwise convolution layer 3.10 under different convolution kernels are divided into groups, each group containing two or more feature maps;
(2) For each group, the values at each position of all the feature maps in the group are either compared or averaged, and the maximum value (or the average) is taken as the value at the same position of a new feature map, yielding one new feature map per group;
(3) The new feature maps are output.
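A concrete miniature of steps (1)–(3): four feature maps are grouped in pairs, and each pair is collapsed into one new map by the maximum or averaging variant. The group size of 2 and the numbers are illustrative choices of ours:

```python
import numpy as np

# Four 2x2 feature maps from the depthwise convolution, grouped in pairs.
features = np.array([
    [[1., 5.], [3., 0.]],   # group 0
    [[2., 4.], [1., 7.]],   # group 0
    [[9., 0.], [2., 2.]],   # group 1
    [[1., 8.], [6., 1.]],   # group 1
])

grouped = features.reshape(2, 2, 2, 2)   # (groups, group_size, H, W) -- step (1)
max_maps = grouped.max(axis=1)           # step (2), maximum variant
mean_maps = grouped.mean(axis=1)         # step (2), averaging variant

print(max_maps[0])   # [[2. 5.] [3. 7.]]
print(mean_maps[1])  # [[5.  4. ] [4.  1.5]]
```

Either variant halves the channel count here while keeping, position by position, the strongest (or the consensus) response of the group.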
2) Inputting a training set containing audio files of different scene categories, together with the corresponding category labels, into the tight-convolution neural network model for sound scene classification, and training it;
3) Reading an audio file and preprocessing it to obtain audio signal segments; the preprocessing cuts the input signal into segments with a fixed duration of 10 s.
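The cutting step can be sketched as follows. The 44.1 kHz sample rate and the policy of discarding a trailing partial segment are our assumptions; the patent does not state either:

```python
import numpy as np

def split_into_segments(signal, sample_rate, seconds=10):
    """Cut a 1-D audio signal into consecutive fixed-duration segments.

    Trailing samples that do not fill a whole segment are discarded
    (an assumption; the source does not specify this).
    """
    n = sample_rate * seconds
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]

sr = 44100                       # assumed sample rate
audio = np.zeros(sr * 25)        # a 25 s recording
segments = split_into_segments(audio, sr)
print(len(segments), len(segments[0]))  # 2 441000
```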
4) Extracting a log-Mel spectrogram from the audio signal segment, comprising:
(1) A speech signal is a typical non-stationary signal, but because the sound-producing organs move very slowly compared with the speed of sound-wave vibration, a speech signal is generally regarded as stationary over an interval of 10 ms to 30 ms; the input audio signal segment is therefore framed and windowed;
(2) The resulting audio frames are passed through a Mel filter bank: for each time step, the energy passing through each Mel filter is calculated; the per-filter energies of a time step form an energy vector, and the energy vectors of all time steps are combined into a two-dimensional Mel spectrogram of the corresponding audio frames;
the energy passing through each Mel filter in each time step range in the audio frame is calculated by adopting the following formula:
where M is the number of Mel filters, H (k) is the transfer function of the Mel filters, and X (k) is the magnitude value of the corresponding FFT.
(3) Taking the logarithm of the two-dimensional Mel spectrogram to obtain the log-Mel spectrogram.
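Steps (1)–(3) above can be sketched for a single frame. The triangular-filter construction below is the standard textbook (HTK-style) recipe, not taken from the patent, and the frame and filter counts are illustrative; only the energy formula E(m) = Σ_k |X(k)|² H_m(k) mirrors the text:

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Standard triangular Mel filters H_m(k) over the rFFT bins (textbook recipe)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    H = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            H[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            H[m - 1, k] = (right - k) / max(right - center, 1)
    return H

def log_mel_frame(frame, H):
    """E(m) = sum_k |X(k)|^2 H_m(k), then a log: one column of the log-Mel spectrogram."""
    X = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))  # windowed FFT magnitude
    return np.log(H @ (X ** 2) + 1e-10)                     # small offset avoids log(0)

H = mel_filterbank(n_mels=64, n_fft=2048, sr=44100)
frame = np.sin(2 * np.pi * 440 * np.arange(2048) / 44100)   # a 440 Hz test tone
print(log_mel_frame(frame, H).shape)  # (64,)
```

Stacking one such column per frame yields the two-dimensional log-Mel spectrogram that is fed to the network.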
5) Inputting the log-Mel spectrogram into the trained tight-convolution neural network model for sound scene classification to obtain the final sound scene category.
To verify the effectiveness of the sound scene recognition method based on a tight-convolution neural network model, this section compares two network frameworks, MobileNet v1 and CNN8, with and without tight convolution. The DCASE2019 data set is used; training uses the Adam optimizer with adaptive learning-rate decay and a batch size of 32 samples. Prediction involves only forward propagation; the specific steps are as follows:
1. Read the audio signal and truncate it into speech segments with a fixed duration of 10 s;
2. Frame and window each fixed-duration speech signal, with 2048 sampling points per frame and a 2048-point Hamming window applied to each frame;
3. Extract features from the framed signal through a Mel filter bank and take the logarithm; the number of filters is 134, the filter window length is 1704 points, and adjacent frames overlap by 852 points;
4. Input the Mel spectrogram into the tight-convolution neural network model and perform forward propagation to obtain the final sound scene category.
For the MobileNet v1 network, every convolution except the first layer (which keeps its original structure) is replaced by a tight convolution, with the tight coefficient set to 4 and averaging used in the channel compression operation. The fully connected layer of the original MobileNet v1 is also removed, further reducing the number of parameters.
The CNN8 network is essentially a simplified VGG16. To retain more of the original features, its first two layers still use ordinary convolution, while the remaining convolution layers use tight convolution, with the tight coefficient set to 2 and the maximum used in the channel compression operation.
Table 1 compares six neural network models. With a width scale factor of 1.0, the parameter count and floating-point operations of the tight-convolution MobileNet v1 are about 1/4 of those of the ordinary-convolution network, and accuracy improves. Although it has somewhat more parameters and floating-point operations than MobileNet v1 with a width scale factor of 0.5, that model's accuracy is two percentage points lower. This shows that tight convolution handles feature information more appropriately during channel compression, discarding redundant information and retaining more valuable information, and thus makes the model more efficient. Comparing the tight-convolution CNN8 with the ordinary-convolution CNN8, with accuracy essentially unchanged, the parameter count and floating-point operations drop to about 1/14, an even more pronounced compression effect.
Table 1 comparison of various audio scene recognition algorithms
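The per-layer effect of the tight coefficient can be sanity-checked with arithmetic: with tight coefficient t, the 1×1 convolution sees C_in/t input channels instead of C_in. The counts below are our own illustrative figures (256 → 256 channels, 3×3 kernel, t = 4), not the values from Table 1:

```python
def separable_params(c_in, c_out, k=3):
    # Depthwise k x k filters plus a full 1x1 convolution.
    return c_in * k * k + c_in * c_out

def tight_params(c_in, c_out, t, k=3):
    # Depthwise k x k filters, weightless channel compression by factor t,
    # then a 1x1 convolution over the compressed c_in // t channels.
    return c_in * k * k + (c_in // t) * c_out

c_in, c_out, t = 256, 256, 4
print(separable_params(c_in, c_out))  # 67840
print(tight_params(c_in, c_out, t))   # 18688, roughly 0.275 of the above
```

The depthwise part is unchanged, so the overall ratio approaches 1/t only as the pointwise term dominates; that matches the roughly 1/4 reduction reported for the tight-convolution MobileNet v1.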
Claims (1)
1. A sound scene recognition method based on a neural network model of tight convolution is characterized by comprising the following steps:
1) Establishing a neural network model for the tight convolution of sound scene classification; the tightly-convoluted neural network model comprises the following components in series:
the first feature extraction module (1) is used for carrying out feature extraction once on the received logarithmic mel diagram by adopting different convolution kernels and carrying out nonlinear transformation to obtain a first nonlinear feature diagram under the different convolution kernels;
the second feature extraction module (2) is used for carrying out secondary feature extraction on the first nonlinear feature map by adopting different convolution kernels and carrying out secondary nonlinear transformation to obtain a second nonlinear feature map under different convolution kernels;
the tight convolution module (3) is formed by sequentially connecting n tight convolution units (3.1, 3.2 and 3.n) in series and is used for sequentially extracting depth features of the second nonlinear feature map by adopting different convolution kernels;
the Softmax layer (4) is used for carrying out weighted judgment on the finally extracted depth feature images under different convolution kernels and outputting a sound scene recognition result;
the first feature extraction module (1) and the second feature extraction module (2) have the same structure; each comprises a convolution layer for extracting features with different convolution kernels and a ReLU activation function layer for applying a nonlinear transformation to the feature maps extracted under the different convolution kernels;
the n tight convolution units (3.1, 3.2, …, 3.n) have the same structure; each comprises a tight convolution layer, which extracts depth features from the received feature maps with different convolution kernels, and a ReLU activation function layer, which applies a nonlinear transformation to the feature maps extracted by the tight convolution layer under the different convolution kernels;
the tight convolution layer comprises the following components in series: a depth convolution layer (3.10) for extracting feature images using different convolution kernels, a channel compression layer (3.11) for reducing feature images under the different convolution kernels extracted by the depth convolution layer (3.10), and a 1 x 1 convolution layer (3.12) for performing 1 x 1 convolution on the reduced feature images;
the channel compression layer (3.11) comprises:
(1) Grouping received characteristic diagrams under different convolution kernels extracted by a depth convolution layer (3.10), wherein each group has more than 2 characteristic diagrams;
(2) Comparing or averaging the parameters of the same position of all the feature images of each group, and taking the maximum value or the obtained average value in the comparison result as the parameter of the same position of the new feature image, thereby obtaining the new feature image of the group;
(3) Outputting each group of new feature graphs;
2) Inputting training sets containing audio files of different scene categories and corresponding scene categories into a tightly-convolved neural network model for sound scene classification, and training the tightly-convolved neural network model for sound scene classification;
3) Reading an audio file and preprocessing to obtain an audio signal fragment;
the pretreatment is to cut the input signal into signal segments with fixed duration of 10 s;
4) Extracting a logarithmic mel-graph from said audio signal segment; comprising the following steps:
(1) Framing and windowing an input audio signal segment;
(2) The obtained audio frames pass through a Mel filter group, the energy passing through each Mel filter in each time step range in the audio frames is calculated, all the energy passing through the Mel filters obtained in each time step range are formed into energy vectors, the energy vectors in all the time step ranges are combined, and a two-dimensional Mel diagram of the corresponding audio frames is obtained;
(3) Carrying out logarithmic processing on the two-dimensional Mel diagram to obtain a logarithmic Mel diagram;
5) Inputting the logarithmic mel diagram into a trained neural network model for the tight convolution of the sound scene classification to obtain the final sound scene classification.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910960583.5A CN110796027B (en) | 2019-10-10 | 2019-10-10 | Sound scene recognition method based on neural network model of tight convolution |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110796027A CN110796027A (en) | 2020-02-14 |
CN110796027B true CN110796027B (en) | 2023-10-17 |
Family
ID=69438941
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910960583.5A Active CN110796027B (en) | 2019-10-10 | 2019-10-10 | Sound scene recognition method based on neural network model of tight convolution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110796027B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111461172B (en) * | 2020-03-04 | 2023-05-30 | 哈尔滨工业大学 | Lightweight characteristic fusion method of hyperspectral remote sensing data based on two-dimensional point group convolution |
CN112016639B (en) * | 2020-11-02 | 2021-01-26 | 四川大学 | Flexible separable convolution framework and feature extraction method and application thereof in VGG and ResNet |
CN113539283B (en) * | 2020-12-03 | 2024-07-16 | 腾讯科技(深圳)有限公司 | Audio processing method and device based on artificial intelligence, electronic equipment and storage medium |
CN112634928B (en) * | 2020-12-08 | 2023-09-29 | 北京有竹居网络技术有限公司 | Sound signal processing method and device and electronic equipment |
CN112786057B (en) * | 2021-02-23 | 2023-06-02 | 厦门熵基科技有限公司 | Voiceprint recognition method and device, electronic equipment and storage medium |
CN113281660A (en) * | 2021-05-21 | 2021-08-20 | 张家港清研检测技术有限公司 | Method for detecting unqualified battery cell in retired power battery pack |
CN113793622B (en) * | 2021-09-10 | 2023-08-29 | 中国科学院声学研究所 | Audio scene recognition method, system and device |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107895192A (en) * | 2017-12-06 | 2018-04-10 | Guangzhou Huaduo Network Technology Co., Ltd. | Deep convolutional network compression method, storage medium and terminal |
CN108231067A (en) * | 2018-01-13 | 2018-06-29 | Fuzhou University | Sound scene recognition method based on convolutional neural network and random forest classification |
CN108520757A (en) * | 2018-03-31 | 2018-09-11 | South China University of Technology | Automatic classification method of music-suitable scenes based on auditory properties |
CN109448702A (en) * | 2018-10-30 | 2019-03-08 | Shanghai Lishengte Medical Technology Co., Ltd. | Auditory scene recognition method for artificial cochlea |
CN109978137A (en) * | 2019-03-20 | 2019-07-05 | Xiamen Meitu Zhijia Technology Co., Ltd. | Processing method for convolutional neural networks |
CN110085218A (en) * | 2019-03-26 | 2019-08-02 | Tianjin University | Audio scene recognition method based on feature pyramid network |
CN110188863A (en) * | 2019-04-30 | 2019-08-30 | Hangzhou Dianzi University | Convolution kernel of a convolutional neural network and compression algorithm therefor |
CN110223715A (en) * | 2019-05-07 | 2019-09-10 | South China University of Technology | In-home activity estimation method for elderly people living alone based on sound event detection |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190066657A1 (en) * | 2017-08-31 | 2019-02-28 | National Institute Of Information And Communications Technology | Audio data learning method, audio data inference method and recording medium |
- 2019-10-10: Application filed (CN201910960583.5A); patent CN110796027B/en, legal status Active
Non-Patent Citations (1)
Title |
---|
Research on Convolutional Neural Networks in Abnormal Sound Recognition; Hu Tao et al.; Journal of Signal Processing; 357-367 * |
Also Published As
Publication number | Publication date |
---|---|
CN110796027A (en) | 2020-02-14 |
Similar Documents
Publication | Title |
---|---|
CN110796027B (en) | Sound scene recognition method based on neural network model of tight convolution |
CN110782878B (en) | Attention mechanism-based multi-scale audio scene recognition method |
CN110600017B (en) | Training method of speech processing model, speech recognition method, system and device |
CN111933188B (en) | Sound event detection method based on convolutional neural network |
CN110245608B (en) | Underwater target identification method based on half tensor product neural network |
CN110390952B (en) | City sound event classification method based on dual-feature 2-DenseNet parallel connection |
CN109890043B (en) | Wireless signal denoising method based on generative adversarial network |
CN107393542A (en) | Bird species identification method based on dual-channel neural network |
CN110033756B (en) | Language identification method and device, electronic device and storage medium |
CN110808033A (en) | Audio classification method based on dual data enhancement strategy |
CN115602152B (en) | Speech enhancement method based on multi-stage attention network |
CN111508524B (en) | Method and system for identifying voice source equipment |
CN109559755A (en) | Speech enhancement method based on DNN noise classification |
CN112183582A (en) | Multi-feature fusion underwater target identification method |
CN111862978A (en) | Voice wake-up method and system based on improved MFCC (Mel-frequency cepstral coefficients) |
CN115035887A (en) | Speech signal processing method, device, equipment and medium |
US12079703B2 | Convolution-augmented transformer models |
CN108564967B (en) | Mel energy voiceprint feature extraction method for cry detection systems |
CN114283829A (en) | Speech enhancement method based on dynamic gated convolutional recurrent network |
CN116434759B (en) | Speaker identification method based on SRS-CL network |
CN112397090A (en) | Real-time sound classification method and system based on FPGA |
Yu | Research on music emotion classification based on CNN-LSTM network |
CN113488069B (en) | Rapid extraction method and device for high-dimensional speech features based on generative adversarial network |
CN114997210A (en) | Machine abnormal sound recognition and detection method based on deep learning |
CN115328661A (en) | Computing power balance execution method and chip based on voice and image features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||