CN113851115A - Complex sound identification method based on one-dimensional convolutional neural network


Info

Publication number: CN113851115A
Application number: CN202111044338.3A
Authority: CN (China)
Legal status: Pending
Prior art keywords: neural network, convolutional neural, dimensional, layer, features
Other languages: Chinese (zh)
Inventors: 殷波 (Yin Bo), 杜泽华 (Du Zehua), 魏志强 (Wei Zhiqiang), 董西峰 (Dong Xifeng)
Assignee (current and original): Ocean University of China
Filing date: 2021-09-07
Publication date: 2021-12-28

Classifications

    • G10L15/063 — Training of speech recognition systems; creation of reference templates
    • G06N3/045 — Neural network architectures; combinations of networks
    • G06N3/08 — Neural networks; learning methods
    • G10L19/04 — Speech or audio analysis-synthesis for redundancy reduction using predictive techniques
    • G10L25/30 — Speech or voice analysis techniques using neural networks
    • G10L2015/0631 — Creating reference templates; clustering


Abstract

The invention discloses a complex sound identification method based on a one-dimensional convolutional neural network. A random completion algorithm is adopted to process complex sounds, padding the original data to the same length for input to the one-dimensional convolutional neural network. A pre-emphasis module and a simplified attention mechanism module are embedded in the basic framework of the network: the pre-emphasis module is placed at the input of the network to pre-emphasize the input data and participate in tuning the network model, while the simplified attention mechanism module is placed in the deep layers of the network, where a global average pooling function and a sigmoid function are used to obtain global features with attention. The method optimizes the network model and achieves a good identification effect.

Description

Complex sound identification method based on one-dimensional convolutional neural network
Technical Field
The invention belongs to the technical field of audio processing, relates to a complex sound identification technology, and particularly relates to a complex sound identification method based on a one-dimensional convolutional neural network.
Background
Complex sounds are non-speech sounds in the environment. Their sources are varied, the signals themselves are non-stationary, and they are often accompanied by highly interfering background noise, so the acoustic features of different sound scenes are either not distinctive enough or highly similar to one another. Complex sound recognition automatically identifies the specific types of complex sounds in the environment, such as children playing, car horns, and street music. In related fields of sound classification, such as speech classification and music classification, very high accuracy has been achieved; in complex sound recognition, however, the non-stationarity of the signal itself means that speech or music classification schemes are clearly not suitable, so an effective recognition model for complex sounds needs to be provided.
At present, methods that combine neural networks to solve the complex sound classification problem fall into three main categories, according to the input data: methods based on the raw signal, methods based on handcrafted features, and methods based on multiple kinds of input data. The first trains the network directly on the raw signal; its advantages are that no manual feature extraction is needed, the processing pipeline is greatly simplified, and the model is simple and easy to popularize. The second processes the raw data and manually extracts features of the sound signal, such as a spectrogram or mel-frequency cepstral coefficients. The third is a multi-input composite network that takes both the raw sound signal and the manually extracted features as input; its advantage is that the raw (time-domain) features and frequency-domain features of the signal can be combined, compensating for the insufficiency of any single data representation, but the model is complex, places high demands on the platform hardware, and is inconvenient to apply.
Deep learning models based on the raw audio signal have been used by many researchers to solve complex sound recognition problems; for example, the one-dimensional convolutional neural network model proposed by Dai et al. achieves good recognition accuracy. However, it is difficult for deep learning models to extract effective features from the raw signal, and the models proposed in the prior art are complex and need further optimization. Solving the complex sound problem from the raw audio signal therefore remains a major challenge. To achieve a good recognition effect, the existing schemes still face the following problems:
(1) The problem of inconsistent raw data lengths
In practice, the audio durations in a data set (for example, the UrbanSound8K data set, or data collected in a real environment) are often inconsistent, while a one-dimensional convolutional neural network model requires a fixed input length, so the data must be padded. Common padding methods include cubic spline interpolation and zero padding. Many audio clips in a data set differ greatly from the target length; for example, the actual duration may be 1 second while the target length is 4 seconds. Cubic spline interpolation is clearly unsuitable in such cases, and zero padding is too simple: the data lose much information, and the padded zeros may mask valid information. The invention therefore proposes a random completion algorithm that pads the data using the original data itself, enriching the data features while filling.
(2) Attention mechanism
An attention mechanism lets the model focus on useful information and can further improve model performance. The invention proposes a simplified attention mechanism for one-dimensional convolutional neural networks, which obtains an attention feature vector by weighting the global features and multiplying them by the original feature vector.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a complex sound recognition method based on a one-dimensional convolutional neural network. First, a random completion algorithm is proposed, which pads raw audio data of uneven length to the same length for input to the network model. The network model is then optimized: a pre-emphasis technique and a simplified attention mechanism are introduced into the neural network for training. Finally, a complex sound recognition model is constructed.
In order to solve the technical problems, the invention adopts the technical scheme that:
A complex sound identification method based on a one-dimensional convolutional neural network adopts a random completion algorithm to process complex sounds, padding the original data to the same length for input to the one-dimensional convolutional neural network. A pre-emphasis module and a simplified attention mechanism module are embedded in the basic framework of the network: the pre-emphasis module is placed at the input of the network to pre-emphasize the input data and participate in tuning the network model, while the simplified attention mechanism module is placed in the deep layers of the network, where a global average pooling function and a sigmoid function are used to obtain global features with attention.
Further, the detailed steps of the complex sound recognition method based on the one-dimensional convolutional neural network are as follows:
First, raw data processing: pad the raw data with the random completion algorithm to obtain clipped, randomly completed raw audio of consistent length, which serves as the input data of the one-dimensional convolutional neural network;
Second, pre-emphasis: pre-emphasize the input data through the pre-emphasis module, then process them through one convolutional layer;
Third, one-dimensional convolutional neural network: obtain a feature vector through the one-dimensional convolutional neural network, whose structure stacks a block of two convolutional layers with the same number of channels followed by a pooling layer three times, giving 6 convolutional layers in total;
Fourth, attention mechanism: input the feature vector into the simplified attention mechanism module to obtain features with attention;
Fifth, output classification: finally, output the final recognition result through a two-layer fully connected structure and a softmax classification function.
Further, the specific steps of the random completion algorithm are as follows:
(1) with the target sample length being N seconds, divide all samples into two categories: those of at least N/2 seconds and those shorter than N/2 seconds;
for a sample of at least N/2 seconds, randomly select a starting point from which the sample can be completed to N seconds in one pass, clip from that starting point to the required length, and append the clipped audio segment to the end of the original audio to complete it;
(2) for a sample shorter than N/2 seconds, copy the whole sample repeatedly until its length is at least N seconds, and finally clip it to N seconds.
Furthermore, the pre-emphasis module has a two-layer convolution structure: the initial values of the first-layer convolution kernel are set to -0.97 and 1 (stacked continuously), and the initial value of the second-layer convolution kernel is set to 1 so that the pre-emphasis coefficient can be further adjusted.
Further, the number of convolution kernels of each layer of the pre-emphasis module is set to be 1.
Further, the simplified attention mechanism works as follows: first, global average pooling compresses the features into a one-dimensional feature whose length equals the number of channels, yielding the global features of the model; the features are then input into a sigmoid function to obtain a weight for each channel; finally, the weights are multiplied by the one-dimensional feature from the original global average pooling to obtain new global features, which are the features with attention;
the expression for the attention mechanism is as follows:
W = σ(GAP(F)), F_O = W ⊗ GAP(F)    formula (1)
where F is the deep output feature of the one-dimensional convolutional neural network, GAP(·) denotes global average pooling, σ(·) is the sigmoid function, W is the weight vector, and F_O is the global feature with attention.
Compared with the prior art, the invention has the advantages that:
(1) The random completion method designed by the invention pads raw data of uneven length to the same length, which is convenient for input to the network model. It makes up for the simplicity of zero padding by supplementing the original data with the original data itself, preserving properties such as the timing of the original data to the greatest extent, providing more useful features, and contributing noticeably to improved classification performance.
(2) The pre-emphasis module designed by the invention integrates the pre-emphasis technique into the convolutional neural network through the convolution operation of convolutional layers. Adding a convolutional layer whose kernel has an initial value of 1 and a length of 1 provides a buffer for the preceding pre-emphasis layer, allows the network to be adjusted further as appropriate, lightens the tuning burden of the subsequent one-dimensional convolutional neural network, and improves performance.
(3) The simplified attention mechanism designed by the invention obtains global features with attention using global average pooling and sigmoid functions, which benefits model classification.
(4) Combining the above key points, an end-to-end complex sound identification model based on the one-dimensional convolutional neural network is constructed; the model captures the features of raw complex sound well and achieves a good recognition effect.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of complex sound recognition according to the present invention;
FIG. 2 compares the raw data after the random completion method of the present invention and after zero padding;
FIG. 3 is a diagram of a pre-emphasis module of the present invention;
FIG. 4 is a simplified attention mechanism model architecture diagram of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
This embodiment provides a complex sound identification method based on a one-dimensional convolutional neural network, which comprises the following two aspects. On one hand, a random completion algorithm is adopted to process the complex sound, padding the raw data to the same length for input to the one-dimensional convolutional neural network. On the other hand, the network model structure is optimized: a pre-emphasis module and a simplified attention mechanism module are embedded in the basic framework of the one-dimensional convolutional neural network. The pre-emphasis module is placed at the input of the network to pre-emphasize the input data and participate in tuning the network model; the simplified attention mechanism module is placed in the deep layers of the network, where a global average pooling function and a sigmoid function are used to obtain global features with attention.
In conjunction with the complex sound recognition flowchart shown in FIG. 1, the detailed steps are as follows:
First, raw data processing: the raw data are padded with the random completion algorithm to obtain clipped, randomly completed raw audio of consistent length, which serves as the input data of the one-dimensional convolutional neural network.
The specific steps of the random completion algorithm are as follows:
(1) assuming the target sample length is 4 seconds, divide all samples into two categories: those of at least 2 seconds and those shorter than 2 seconds;
for a sample of at least 2 seconds, randomly select a starting point from which the sample can be completed to 4 seconds in one pass, clip from that point to the required length, and append the clipped audio segment to the end of the original audio to complete it;
(2) for a sample shorter than 2 seconds, copy the whole sample repeatedly until its length is at least 4 seconds, and finally clip it to 4 seconds.
A comparison of the raw data after the random completion method and after zero padding is shown in FIG. 2. The pseudo code of the random completion algorithm is given in the original publication as an image and is not reproduced here.
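In place of that pseudo code, the following is a minimal Python sketch of the random completion algorithm as described in the steps above; the function name, the NumPy representation, and the handling of clips that already reach the target length are assumptions made for illustration.

```python
import numpy as np

def random_complete(audio: np.ndarray, sr: int, target_sec: int = 4) -> np.ndarray:
    """Pad or clip a 1-D audio signal to target_sec seconds using the
    random completion scheme described above (a sketch, not the
    patent's exact pseudo code)."""
    target_len = target_sec * sr

    if len(audio) >= target_len:
        # Already long enough: simply clip to the target length.
        return audio[:target_len]

    if len(audio) >= target_len // 2:
        # Sample of at least N/2 seconds: one randomly placed segment
        # suffices to complete the sample to N seconds in one pass.
        need = target_len - len(audio)
        start = np.random.randint(0, len(audio) - need + 1)
        return np.concatenate([audio, audio[start:start + need]])

    # Sample shorter than N/2 seconds: tile the whole sample until it
    # reaches at least N seconds, then clip to exactly N seconds.
    reps = -(-target_len // len(audio))  # ceiling division
    return np.tile(audio, reps)[:target_len]
```

Because the appended segment starts at a random offset, repeated passes over the same clip yield different completions, which is where the enrichment of data features noted above comes from.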
Second, pre-emphasis: the input data are pre-emphasized by the pre-emphasis module and then pass through a convolutional layer with a large convolution kernel.
The pre-emphasis module has a two-layer convolution structure: the initial values of the first-layer convolution kernel are set to -0.97 and 1 (stacked continuously), and the initial value of the second-layer convolution kernel is 1, allowing the pre-emphasis coefficient to be adjusted further. FIG. 3 shows the pre-emphasis module. Its purpose is to pre-emphasize the input data rather than extract features, so the number of convolution kernels in each layer is set to 1. During model learning, the pre-emphasis module also participates in network tuning.
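Read this way, the first-layer kernel [-0.97, 1] realizes the classical pre-emphasis filter y[n] = x[n] - 0.97·x[n-1] as a learnable convolution, and the length-1 second layer lets the network rescale it. Below is a minimal PyTorch sketch under that reading; the padding choice is an assumption, since the patent does not specify it.

```python
import torch
import torch.nn as nn

class PreEmphasis(nn.Module):
    """Two-layer learnable pre-emphasis (a sketch, assuming a [-0.97, 1]
    initialization for layer 1 and a length-1 kernel initialized to 1
    for layer 2, as the description suggests)."""
    def __init__(self, coeff: float = 0.97):
        super().__init__()
        # Layer 1: kernel length 2, initialized to [-coeff, 1]; since
        # PyTorch convolution is cross-correlation, this computes
        # out[n] = x[n] - coeff * x[n-1] after left padding.
        self.emphasis = nn.Conv1d(1, 1, kernel_size=2, bias=False)
        # Layer 2: kernel length 1, initialized to 1; it gives the
        # network room to fine-tune the pre-emphasis coefficient.
        self.adjust = nn.Conv1d(1, 1, kernel_size=1, bias=False)
        with torch.no_grad():
            self.emphasis.weight.copy_(torch.tensor([[[-coeff, 1.0]]]))
            self.adjust.weight.fill_(1.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, samples); pad one sample on the left so the
        # output length matches the input length.
        x = nn.functional.pad(x, (1, 0))
        return self.adjust(self.emphasis(x))
```

Both kernels remain ordinary convolution weights, so they are updated by backpropagation along with the rest of the network, which is how the module "participates in network tuning".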
Third, one-dimensional convolutional neural network: a feature vector is obtained through the one-dimensional convolutional neural network. Its structure stacks a block of two convolutional layers with the same number of channels followed by a pooling layer three times, giving 6 convolutional layers in total.
Fourth, attention mechanism: the feature vector is input into the simplified attention mechanism module to obtain features with attention.
As shown in FIG. 4, the attention mechanism is placed in the deep layers of the one-dimensional convolutional neural network. First, Global Average Pooling (GAP) compresses the features into a one-dimensional feature whose length equals the number of channels, yielding the global features of the model. These features are then input into a sigmoid function to obtain a weight for each channel. Finally, the weights are multiplied by the one-dimensional feature from the original GAP to obtain new global features, which are the features with attention.
The expression for the attention mechanism is as follows:
W = σ(GAP(F)), F_O = W ⊗ GAP(F)    formula (1)
where F is the deep output feature of the one-dimensional convolutional neural network, GAP(·) denotes global average pooling, σ(·) is the sigmoid function, W is the weight vector, and F_O is the global feature with attention.
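Following formula (1) and the description above, the simplified attention reduces to global average pooling followed by a sigmoid gate applied to the pooled vector itself. A minimal PyTorch sketch (tensor shapes assumed for illustration):

```python
import torch

def simplified_attention(feat: torch.Tensor) -> torch.Tensor:
    """feat: deep features of shape (batch, channels, length).
    Returns the global features with attention F_O of shape
    (batch, channels), per formula (1)."""
    gap = feat.mean(dim=-1)        # global average pooling: (batch, channels)
    weights = torch.sigmoid(gap)   # per-channel weight vector W
    return weights * gap           # F_O = W * GAP(F)
```

Unlike a full squeeze-and-excitation block, there is no fully connected bottleneck between the pooling and the gate, which appears to be what makes the mechanism "simplified".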
Fifth, output classification: finally, the final recognition result is output through a two-layer fully connected structure and a softmax classification function.
With reference to FIG. 1, the invention integrates the random completion algorithm, the pre-emphasis module, and the simplified attention mechanism into a complex sound recognition model. The input audio data are first randomly completed to obtain clipped, completed raw audio, which serves as the network input. The pre-emphasis module then pre-emphasizes the input data, after which the data pass through a convolutional layer with a large convolution kernel. The conventional one-dimensional convolutional structure that follows stacks a block of two equal-channel convolutional layers and a pooling layer three times, for 6 convolutional layers in total; in addition, the first three layers use dilated convolutions with dilation coefficients of 2, 3 and 4 respectively to further enlarge the model's receptive field. The feature vector is then input into the simplified attention mechanism module to obtain features with attention, and finally the recognition result is output through a two-layer fully connected structure and a softmax classification function.
The model structure and parameters are shown in Table 1, taking a sample rate of 16 kHz and a sample length of 4 seconds as an example.
TABLE 1. Model structure and parameters
[Table 1 is rendered as an image in the original publication; the layer-by-layer parameters are not reproduced here.]
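Since Table 1 is only available as an image, the following PyTorch sketch fills in assumed channel counts, kernel sizes, and pooling sizes purely to make the stated structure concrete: a large-kernel input convolution, three blocks of two equal-channel convolutions plus pooling, dilation coefficients 2, 3 and 4, the simplified attention, and a two-layer fully connected head. It reuses the PreEmphasis module and simplified_attention function sketched above.

```python
import torch.nn as nn

def conv_block(c_in: int, c_out: int, dilation: int) -> nn.Sequential:
    # Two convolutional layers with the same number of output channels,
    # followed by a pooling layer; the first convolution is dilated.
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=3, padding=dilation, dilation=dilation),
        nn.ReLU(),
        nn.Conv1d(c_out, c_out, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool1d(4),
    )

class ComplexSoundNet(nn.Module):
    """End-to-end model sketch; all widths and kernel sizes are assumed,
    as the patent gives them only in Table 1."""
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.pre_emphasis = PreEmphasis()          # sketched earlier
        self.input_conv = nn.Sequential(           # large-kernel input convolution
            nn.Conv1d(1, 32, kernel_size=80, stride=4), nn.ReLU())
        self.blocks = nn.Sequential(               # dilation coefficients 2, 3, 4
            conv_block(32, 64, dilation=2),
            conv_block(64, 128, dilation=3),
            conv_block(128, 256, dilation=4))
        self.classifier = nn.Sequential(           # two-layer fully connected head
            nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, n_classes))

    def forward(self, x):                          # x: (batch, 1, 64000) at 16 kHz, 4 s
        x = self.blocks(self.input_conv(self.pre_emphasis(x)))
        x = simplified_attention(x)                # (batch, 256) global features
        return self.classifier(x)                  # logits; softmax applied at inference
```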
Experimental configuration and results:
1. Loss function
The model uses the classical cross-entropy loss function:
H(p, q) = -∑_x p(x) log q(x)    formula (2)
where p represents the true sample distribution and q is the predicted distribution of the trained model.
2. Optimization algorithm
The optimizer uses stochastic gradient descent with a momentum of 0.9; the velocity update is as follows:
v_t = γ·v_{t-1} + lr·grad    formula (3)
where v is the velocity, γ is the momentum coefficient (typically set to 0.9), lr is the learning rate, and grad is the gradient.
3. Learning rate
The learning rate decays in discrete steps; the specific schedule is given in the original publication as a table rendered as an image and is not reproduced here.
During model training, the batch size is set to 64 and training runs for 200 rounds.
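The following PyTorch sketch ties the stated configuration together (cross-entropy loss, SGD with momentum 0.9, discrete step decay, batch size 64, 200 training rounds); the initial learning rate, the decay milestones, and the train_set object are assumptions, since the schedule table above is not reproduced.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

model = ComplexSoundNet(n_classes=10)              # sketched above
criterion = nn.CrossEntropyLoss()                  # formula (2)
optimizer = torch.optim.SGD(model.parameters(),    # formula (3)
                            lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(  # discrete step decay
    optimizer, milestones=[80, 140, 180], gamma=0.1)

# train_set is assumed to yield (waveform, label) pairs of fixed 4 s length.
loader = DataLoader(train_set, batch_size=64, shuffle=True)
for epoch in range(200):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```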
4. Results of the experiment
The complex sound recognition model based on the one-dimensional convolutional neural network achieves recognition accuracies of 84.4%, 73.8% and 88.6% on the ESC10, ESC50 and UrbanSound8K data sets respectively, demonstrating that the method is effective.
It is understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art should understand that they can make various changes, modifications, additions and substitutions within the spirit and scope of the present invention.

Claims (6)

1. A complex sound identification method based on a one-dimensional convolutional neural network, characterized in that a random completion algorithm is adopted to process complex sounds, padding the original data to the same length for input to the one-dimensional convolutional neural network; and a pre-emphasis module and a simplified attention mechanism module are embedded in the basic framework of the one-dimensional convolutional neural network, wherein the pre-emphasis module is arranged at the input of the one-dimensional convolutional neural network to pre-emphasize the input data and participate in tuning the network model, and the simplified attention mechanism module is arranged in the deep layers of the one-dimensional convolutional neural network, where a global average pooling function and a sigmoid function are used to obtain global features with attention.
2. The complex sound recognition method based on the one-dimensional convolutional neural network of claim 1, wherein the detailed steps are as follows:
First, raw data processing: pad the raw data with the random completion algorithm to obtain clipped, randomly completed raw audio of consistent length, which serves as the input data of the one-dimensional convolutional neural network;
Second, pre-emphasis: pre-emphasize the input data through the pre-emphasis module, then process them through one convolutional layer;
Third, one-dimensional convolutional neural network: obtain a feature vector through the one-dimensional convolutional neural network, whose structure stacks a block of two convolutional layers with the same number of channels followed by a pooling layer three times, giving 6 convolutional layers in total;
Fourth, attention mechanism: input the feature vector into the simplified attention mechanism module to obtain features with attention;
Fifth, output classification: finally, output the final recognition result through a two-layer fully connected structure and a softmax classification function.
3. The complex sound recognition method based on the one-dimensional convolutional neural network of claim 2, wherein the specific steps of the random completion algorithm are as follows:
(1) with the target sample length being N seconds, divide all samples into two categories: those of at least N/2 seconds and those shorter than N/2 seconds;
for a sample of at least N/2 seconds, randomly select a starting point from which the sample can be completed to N seconds in one pass, clip from that starting point to the required length, and append the clipped audio segment to the end of the original audio to complete it;
(2) for a sample shorter than N/2 seconds, copy the whole sample repeatedly until its length is at least N seconds, and finally clip it to N seconds.
4. The method of claim 2, wherein the pre-emphasis module has a two-layer convolution structure, the initial values of the first-layer convolution kernel are set to -0.97 and 1 and stacked continuously, and the initial value of the second-layer convolution kernel is 1 so that the pre-emphasis coefficient can be further adjusted.
5. The method of claim 4, wherein the number of convolution kernels of each layer of the pre-emphasis module is set to 1.
6. The complex sound identification method based on the one-dimensional convolutional neural network of claim 2, wherein the simplified attention mechanism works as follows: first, global average pooling compresses the features into a one-dimensional feature whose length equals the number of channels, yielding the global features of the model; the features are then input into a sigmoid function to obtain a weight for each channel; finally, the weights are multiplied by the one-dimensional feature from the original global average pooling to obtain new global features, which are the features with attention;
the expression for the attention mechanism is as follows:
W = σ(GAP(F)), F_O = W ⊗ GAP(F)    formula (1)
where F is the deep output feature of the one-dimensional convolutional neural network, GAP(·) denotes global average pooling, σ(·) is the sigmoid function, W is the weight vector, and F_O is the global feature with attention.
Priority Application (1)

CN202111044338.3A — Complex sound identification method based on one-dimensional convolutional neural network; priority date and filing date: 2021-09-07.

Publication (1)

CN113851115A, published 2021-12-28 (legal status: Pending). Family ID: 78973314. Country: CN (China).

Citations (* cited by examiner)

Patent citations (7):
• US20160093293A1 * — Method and device for preprocessing speech signal (Samsung Electronics Co., Ltd.), published 2016-03-31
• CN110047506A * — Key audio detection method based on convolutional neural networks and multiple kernel learning SVM (Hangzhou Dianzi University), published 2019-07-23
• CN110070888A * — Parkinson's speech recognition method based on convolutional neural networks (颐保医疗科技(上海)有限公司), published 2019-07-30
• CN111160438A * — Acoustic garbage classification method using a one-dimensional convolutional neural network (Zhejiang University), published 2020-05-15
• CN112199548A * — Music audio classification method based on a convolutional recurrent neural network (South China University of Technology), published 2021-01-08
• CN112863550A * — Crying detection method and system based on attention residual learning (德鲁动力科技(成都)有限公司), published 2021-05-28
• US20210256386A1 * — Neural acoustic model (SoundHound, Inc.), published 2021-08-19

Non-patent citations (3):
• Pooi Shiang Tan: "Acoustic Event Detection with MobileNet and 1D Convolutional Neural Network", 2020 IEEE 2nd International Conference on Artificial Intelligence in Engineering and Technology, December 2020, pp. 1-6 *
• Xifeng Dong: "Environment Sound Event Classification With a Two-Stream Convolutional Neural Network", IEEE Access, July 2020, pp. 125714-125721, DOI: 10.1109/ACCESS.2020.3007906 *
• Liu Hang, Wang Xili: "Remote sensing image segmentation model based on attention mechanism" (基于注意力机制的遥感图像分割模型), Laser & Optoelectronics Progress (激光与光电子学进展), No. 04, December 2020 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination