CN112863550B - Crying detection method and system based on attention residual learning - Google Patents
Crying detection method and system based on attention residual learning
- Publication number
- CN112863550B (application number CN202110224859.0A)
- Authority
- CN
- China
- Prior art keywords
- block2
- block4
- crying
- data
- attention mechanism
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Child & Adolescent Psychology (AREA)
- General Health & Medical Sciences (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a crying detection method and system based on attention residual learning, comprising: S1, collecting crying data; S2, dividing the crying data into a training set and a verification set; S3, training the constructed attention-based residual neural network with the training set to obtain a trained attention-based residual neural network, and evaluating the training result with the verification set. Introducing the residual network solves the vanishing-gradient problem of CNN models with many layers, and introducing the attention mechanism lets the residual model give more weight to the features that express crying, so the accuracy of crying recognition in real scenes and the generalization ability in practical scenes are improved.
Description
Technical Field
The invention relates to the technical field of voice recognition, in particular to a crying detection method and system based on attention residual learning.
Background
Voice recognition on existing quadruped robots lacks abnormal-sound detection, especially on home companion robot dogs. Crying is the main way infants express themselves, so automatic detection of infant crying plays an important role in the field of home companionship and can effectively reduce the nursing burden on parents. There have been many studies on feature selection, model selection and mechanisms for infant crying; traditional machine learning methods such as SVM, and classification of spectrograms with CNN models, are commonly used.
Traditional machine learning methods such as SVM generally depend on feature selection: the quality of the selected features determines the quality of the recognition result, and it is difficult for hand-picked features to reflect the signal comprehensively. A convolutional neural network can learn features from the spectrogram, but increasing the number of layers makes training difficult, and the shallow CNN models used for detecting infant crying give poor results. Cry recognition is mainly challenged by the uncertainty and instability of noise in the actual environment.
In an actual environment containing non-stationary noise, using only a single feature or too few features makes the cry recognition rate too low, while a model that adopts complex features with a deep CNN network has a potential vanishing-gradient problem.
Disclosure of Invention
The invention provides a crying detection method and system based on attention residual learning to solve the above technical problems.
The invention is realized by the following technical scheme:
The crying detection method based on attention residual learning comprises the following steps:
S1, collecting crying data;
S2, dividing the crying data into a training set and a verification set;
S3, training the constructed attention-based residual neural network with the training set to obtain a trained residual neural network, and evaluating the training result with the verification set.
Further, the residual neural network comprises a Block1, a first Block2, a second Block2, a first Block4, a second Block4, a third Block4 and a Block5 which are sequentially connected in series; the output of the Block1 is connected with the input of the second Block2 through a skip connection unit, and the input of the second Block2 is connected with the input of the first Block4 through a skip connection unit;
the first Block2, the second Block2, the first Block4, the second Block4, the third Block4 and the Block5 all introduce a hybrid attention mechanism.
Further, the Block1 includes a two-dimensional convolutional layer for achieving 2-fold down-sampling;
the first Block4, the second Block4 and the third Block4 all comprise a third Block2 and a fourth Block 2; the third Block2 is connected in series with a fourth Block 2;
the first Block2, the second Block2, the third Block2 and the fourth Block2 each comprise two two-dimensional convolutional layers, and a hybrid attention mechanism is introduced after the second two-dimensional convolutional layer;
the Block5 comprises two two-dimensional convolutional layers and a sigmoid layer, and a hybrid attention mechanism is introduced before the first two-dimensional convolutional layer.
Further, the input of the third Block2 is connected with the input of the fourth Block2 through a skip connection unit comprising a Block 3;
the Block3 includes two parallel two-dimensional pooling layers and a Concatenate layer for combining the outputs of the two pooling layers along the last tensor dimension and outputting the result.
Further, in the Block3 contained in the third Block4, the pooling region of the two-dimensional pooling layers implements 2-fold downsampling, padding is used so that the output area equals the input image area, and the Concatenate layer combines the outputs of the two parallel two-dimensional pooling layers along the last tensor dimension and outputs the result.
Further, the formula of the hybrid attention mechanism is as follows:
S = σ((F_up(F_res(F_res(F_dn(U)) + F_up(F_res(F_res(F_dn(F_res(F_dn(U))))))))·W_1 + b_1)·W_2 + b_2)   (1)
In equation (1), F_dn denotes max pooling, F_up denotes bilinear interpolation, F_res denotes the residual-mechanism calculation, σ denotes the sigmoid function, and S is the resulting attention weight map; W_1 and W_2 are convolution kernel weights; b_1 and b_2 are convolution kernel biases.
Furthermore, the number of convolution kernels of the two-dimensional convolution layer of Block1 is 24;
the convolution kernel size, number and stride of the two-dimensional convolution layers in the first Block2 and the second Block2 are the same;
the number of convolution kernels of the two-dimensional convolution layers contained in the third Block2 and the fourth Block2 in the first Block4 is 48;
the number of convolution kernels of the two-dimensional convolution layers contained in the third Block2 and the fourth Block2 in the second Block4 is 96;
the number of convolution kernels of the two-dimensional convolution layers contained in the third Block2 and the fourth Block2 in the third Block4 is 192;
the third Block2 is used to implement 2 times downsampling;
the number of convolution kernels of the first two-dimensional convolution layer of Block5 is increased to 768; the convolution kernel size of the second two-dimensional convolution layer is 1 and the number is 1.
Further, in S1, the sample is amplified according to the signal-to-noise ratio.
Further, before S2 the collected crying data are preprocessed, and the preprocessing includes two operations:
Mode 1: pre-emphasis of the speech signal;
Mode 2: framing and windowing of the speech signal.
Further, in S3, feature extraction is performed on the data in the training set, and the extracted audio features are used to train the residual neural network;
the audio features comprise at least one of the short-time zero-crossing rate, short-time average energy, short-time average amplitude, energy entropy, spectral centroid, spectral entropy, spectral flux, Mel-frequency cepstral coefficients and chromagram.
The crying detection system based on attention residual learning includes:
a first data acquisition module: used for collecting the sound data to be detected;
a second data acquisition module: used for collecting sample data;
a data preprocessing module: used for preprocessing the sample data;
a feature extraction module: used for extracting audio features from the sample data;
a crying model module: used for training the audio features in the sample data with the attention-based residual neural network algorithm to obtain a crying model;
a crying recognition module: used for inputting the sound data to be detected into the crying model for calculation and determining whether the sound data to be detected is crying.
Compared with the prior art, the invention has the following beneficial effects:
Introducing the residual network solves the vanishing-gradient problem of CNN models with very many layers, and introducing the attention mechanism lets the residual model give more weight to the features that express crying; the method can improve the accuracy of crying recognition in real scenes and the generalization ability in practical scenes.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention.
FIG. 1 is a flow chart of model training;
FIG. 2 is a schematic diagram of a residual block;
FIG. 3 is a block diagram of a residual neural network based on an attention mechanism;
FIG. 4 is a Block diagram of Block 1;
FIG. 5 is a Block diagram of Block 2;
FIG. 6 is a Block diagram of Block 3;
FIG. 7 is a Block diagram of Block 5.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
As shown in fig. 1, the crying detection method based on attention residual learning disclosed by the invention comprises the following steps:
S1, collecting crying data;
S2, dividing the crying data into a training set and a verification set;
S3, training the constructed attention-based residual neural network with the training set to obtain a trained attention-based residual neural network, and evaluating the training result with the verification set.
The sound data to be detected are then input into the trained attention-based residual neural network to identify whether they are crying.
Based on the above method, the present invention discloses an embodiment.
Example 1
As shown in fig. 1, the present embodiment includes the following steps:
Step 1: data collection. In this embodiment, the data set has three main sources:
450 clean crying recordings collected from the "donateacry" project on GitHub; 40 crying recordings from the "crying baby" category of the ESC-50 data set; and 400 crying recordings manually recorded from the network.
All samples are non-silent and 5 seconds long, and negative samples are taken from the other categories of the ESC-50 data set. There are therefore 890 positive samples (infant crying) and 900 negative samples, so the positive and negative samples are roughly balanced.
Because the number of data samples is small, and in order to better match the practical application environment, this embodiment also performs data amplification on the collected samples, specifically as follows:
Common indoor environmental noise in a family room, such as air-conditioning noise, is selected from UrbanSound8K. Tests show that different signal-to-noise ratios lead to different model performance, and the model accuracy deteriorates noticeably when the noise is stronger (the signal-to-noise ratio is lower); a signal-to-noise ratio of 35 dB is finally selected for sample amplification.
Signal-to-noise ratio: the ratio of the power of the useful speech signal to the power of the noise mixed into it. The signal-to-noise ratio can be calculated with equation (1):
SNR = 10·lg(Σ_n s²(n) / Σ_n r²(n))   (1)
In equation (1), s(n) is the speech signal and r(n) is the noise signal.
In this embodiment, 50% of the sample data are finally selected for amplification, giving 1335 positive samples and 1350 negative samples.
Adding noise to the data at different signal-to-noise ratios in this way amplifies the sample data, which improves the accuracy of crying recognition in real scenes and the generalization ability in practical scenes.
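As an illustration, the noise-mixing step can be sketched as follows (a minimal sketch assuming NumPy; the function name, the tiling of the noise clip and the default 35 dB target are illustrative choices, not text taken from the patent):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db=35.0):
    """Mix a noise clip into a speech clip at a target signal-to-noise ratio (in dB)."""
    # Tile or truncate the noise so it matches the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    p_speech = np.mean(speech ** 2)            # power of the useful speech signal
    p_noise = np.mean(noise ** 2) + 1e-12      # power of the noise (guard against zero)

    # Scale the noise so that 10*log10(p_speech / p_scaled_noise) equals snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise
```

In this embodiment such a mixture would be applied to roughly half of the samples to obtain the amplified data described above.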
Step 2: data preprocessing.
In this embodiment, data preprocessing mainly uses two operations: pre-emphasis, and framing with windowing.
2.1 pre-emphasis.
When sound is produced in the oral cavity, its energy is concentrated at low frequencies and the high-frequency components are attenuated more strongly, so high-frequency detail is easily lost in later processing. Pre-emphasis compensates for this attenuation of the high-frequency part: the audio signal is passed through a first-order FIR high-pass filter, which makes the spectrum of the pre-emphasized speech signal flatter. The pre-emphasis transfer function is:
H(z) = 1 − αz⁻¹   (2)
In equation (2), α is the pre-emphasis coefficient, a constant that determines the pre-emphasis strength, with 0.9 < α < 1.
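A minimal sketch of this filter, assuming NumPy (the helper name and the typical coefficient α = 0.97 are illustrative; any value in the stated range 0.9 < α < 1 applies):

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order FIR high-pass pre-emphasis: y[n] = x[n] - alpha * x[n-1], i.e. H(z) = 1 - alpha*z^-1."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```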
2.2 framing and windowing.
In an audio signal the spectrum changes over time, so features cannot be extracted directly from the whole audio segment. The speech signal is therefore divided into segments of 10 ms to 30 ms, over which it is generally considered to be short-time stationary.
Framing is typically achieved by windowing, the windowing formula:
S_w(n) = S(n)·W(n)   (3)
In equation (3), S(n) is the original signal and W(n) is the window function.
Commonly used window functions are rectangular windows, Hamming windows, Hanning windows.
Rectangular window formula:
w(n) = 1 for 0 ≤ n ≤ N−1, and w(n) = 0 otherwise   (4)
Hamming window formula:
w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1   (5)
Hanning window formula:
w(n) = 0.5·[1 − cos(2πn/(N−1))], 0 ≤ n ≤ N−1   (6)
this embodiment takes a Hamming window as the window function. And the values chosen in window length and frame shift are: the window length is 2048 points, the frame shift is 1024 points, and the best effect is achieved in the subsequent feature extraction.
Step 3: construct the training set and the verification set.
Step 4: perform feature extraction and feature combination on the data in the training set.
Although a neural network is able to extract the information contained in the data, processing the raw audio signal directly is very difficult, so feature engineering is necessary: good feature extraction can greatly improve the recognition performance of the neural network and increase training accuracy and efficiency. Feature extraction for speech is very mature, and the commonly used speech features are as follows:
1, short-time zero-crossing rate: the number of zero crossings of the signal per unit time is defined as the zero-crossing rate; the short-time zero-crossing rate corresponds intuitively to the number of times the signal waveform crosses the time axis.
2, short-time average energy: the short-time average energy can help distinguish unvoiced and voiced sound; when the signal-to-noise ratio is high, the signal is clean and the noise component is small, it can be used to separate voiced segments from silence and thus cut out the silent segments.
The short-time average energy is mathematically defined as the weighted sum of squares of the signal amplitudes within one frame:
E_n = Σ_m [x(m)·w(n−m)]²   (7)
In equation (7), x(m) is the sound signal and w(·) is the window function.
3, short-time average amplitude: the short-time average energy requires the sum of squares of the signal samples, and squaring is overly sensitive to large amplitudes: if a high level is encountered, the short-time average energy rises sharply and may even overflow. To overcome this drawback, the short-time average amplitude replaces the sum of squares with the sum of absolute values, and it can also measure changes in sound intensity:
M_n = Σ_m |x(m)|·w(n−m)   (8)
In equation (8), x(m) is the sound signal and w(·) is the window function.
4, energy entropy: the energy entropy describes how much the audio signal varies over time and can be used as an audio feature; it takes a higher value if the energy envelope of the signal changes abruptly.
5, spectral centroid: the spectral centroid indicates in which frequency band the sound energy is concentrated; the higher its value, the more the signal energy is concentrated at higher frequencies. Sounds with more low-frequency components sound lower and duller and have a relatively low spectral centroid, while sounds with more high-frequency components sound brighter and have a relatively high spectral centroid.
6, spectral entropy: the spectral entropy measures the complexity contained in the audio signal; the greater the complexity, the greater the spectral entropy. Its mathematical expression is:
H = −Σ_w f(w)·ln f(w)   (9)
In equation (9), f(w) is the normalized spectral density of one frame of the signal.
7, spectral flux: the spectral flux quantifies how the spectrum changes over time; a spectrally stable or nearly constant signal, such as white Gaussian noise, has a low spectral flux, while a signal whose spectrum changes abruptly has a high spectral flux.
8, Mel-frequency cepstral coefficients: the Mel-frequency cepstral coefficients (MFCC) are an important feature in speech processing. They are obtained by applying a linear cosine transform to the logarithmic power of the signal on the non-linear Mel frequency scale, and they reflect the non-linear characteristics of human auditory frequency perception. The Mel scale is defined as:
mel(f) = 2595·lg(1 + f/700)   (10)
In equation (10), f is the linear frequency in Hz.
9, chromagram: the chromagram divides the whole spectrum into 12 bands corresponding to the 12 semitones of a musical octave, so that the signal can be divided according to chroma.
The results obtained by training with different feature combinations are shown in Table 1.
Table 1: Improvement brought to the model by different feature combinations
In Table 1, MSG denotes the log-Mel spectrogram, MFCC the Mel-frequency cepstral coefficients, CG the chromagram, and ZCR the zero-crossing rate.
The audio features finally selected in this embodiment are therefore the combination of the log-Mel spectrogram, Mel-frequency cepstral coefficients, chromagram, and zero-crossing rate.
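A minimal sketch of this feature combination, assuming the librosa library (the numbers of Mel bands and MFCC coefficients are illustrative defaults, not values stated in the patent; the frame and hop sizes follow the windowing step above):

```python
import numpy as np
import librosa

def extract_features(y, sr, n_fft=2048, hop_length=1024):
    """Compute log-Mel spectrogram, MFCC, chromagram and zero-crossing rate and stack them."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=64)
    log_mel = librosa.power_to_db(mel)                                          # log-Mel spectrogram
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=n_fft, hop_length=hop_length)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length)  # 12 bands
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop_length)
    # Stack everything along the feature axis; result shape: (n_features, n_frames).
    return np.concatenate([log_mel, mfcc, chroma, zcr], axis=0)
```

The resulting feature-by-frame matrix, with a channel axis added, is the kind of two-dimensional input that the convolutional network described below can consume.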
Step 5: design the attention-based residual neural network and train it with the training set.
The performance of a convolutional neural network is strongly related to its depth, and a deeper network structure can improve the recognition effect. In practice, however, once the depth of a convolutional network reaches a certain point the model's performance stops improving and may even get worse, a phenomenon attributed to vanishing gradients. Residual blocks are therefore added to the convolutional network, with residual units linked by skip connections, so that in a deep convolutional network the output of some layers can be passed directly across intermediate layers to later layers.
As shown in fig. 2, the residual block passes the input to the output through a skip connection and adds it to the branch output F(x); the learning objective of the network changes accordingly: it is no longer the overall mapping H(x) but the residual, i.e. the difference between the output and the input.
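A minimal residual-block sketch, assuming Keras (the filter count, kernel size and activation are illustrative and not the exact values of the patented blocks):

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    """Learn the residual F(x) and add it back to the input, so H(x) = F(x) + x."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    if shortcut.shape[-1] != filters:                  # 1x1 projection if channel counts differ
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.Activation("relu")(layers.Add()([y, shortcut]))
```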
This embodiment designs an attention-based residual neural network as shown in fig. 3, comprising a Block1, a first Block2, a second Block2, a first Block4, a second Block4, a third Block4 and a Block5 connected in series.
The output of Block1 is connected to the input of the second Block2 through a skip connection unit, and the input of the second Block2 is connected to the input of the first Block4 through a skip connection unit; the first Block4, the second Block4 and the third Block4 each comprise a third Block2, a fourth Block2 and a Block3. The third Block2 is connected in series with the fourth Block2, and the input of the third Block2 is connected to the input of the fourth Block2 through a skip connection unit comprising the Block3.
As shown in fig. 4, Block1 includes a Batch Normalization layer and a two-dimensional convolution layer (Conv2D) that realizes 2-fold downsampling: the convolution kernels are 3×3, 24 in number, with strides (1, 2).
As shown in fig. 5, Block2 includes two two-dimensional convolutional layers, and a hybrid (interpolated) attention mechanism is introduced after the second one. The formula of the hybrid attention mechanism is:
S = σ((F_up(F_res(F_res(F_dn(U)) + F_up(F_res(F_res(F_dn(F_res(F_dn(U))))))))·W_1 + b_1)·W_2 + b_2)   (11)
In equation (11), F_dn denotes max pooling, F_up denotes bilinear interpolation, F_res denotes the residual-mechanism calculation, σ denotes the sigmoid function, and S is the resulting attention weight map; W_1 and W_2 are convolution kernel weights; b_1 and b_2 are convolution kernel biases.
With the hybrid attention mechanism, the number of channels is unchanged from the input to the output of each layer. The module reduces the spatial dimensions by downsampling, which enlarges the receptive field of the convolutional features and lets it effectively infer where the high-frequency features of the input image are located; it then upsamples by interpolation, so that the dimensions are restored while the feature regions are localized more precisely.
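A sketch of this module, assuming Keras, following the operator order of equation (11); the internal structure of the residual unit F_res, the application of the resulting weights S to U by element-wise multiplication, and the requirement that the spatial dimensions be divisible by 4 are assumptions of this sketch rather than details given in the patent:

```python
from tensorflow.keras import layers

def residual_unit(x):
    """Small residual unit used as F_res inside the attention branch (illustrative)."""
    y = layers.Conv2D(x.shape[-1], 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(x.shape[-1], 3, padding="same")(y)
    return layers.Add()([x, y])

def hybrid_attention(u):
    """Hybrid (interpolated) attention following equation (11)."""
    a = layers.MaxPooling2D(2)(u)                                        # F_dn(U)
    branch = residual_unit(a)                                            # F_res(F_dn(U))
    deep = layers.MaxPooling2D(2)(branch)                                # F_dn(F_res(F_dn(U)))
    deep = residual_unit(residual_unit(deep))                            # F_res(F_res(...))
    up_inner = layers.UpSampling2D(2, interpolation="bilinear")(deep)    # inner F_up(...)
    merged = residual_unit(layers.Add()([branch, up_inner]))             # F_res(F_res(F_dn(U)) + F_up(...))
    z = layers.UpSampling2D(2, interpolation="bilinear")(merged)         # outer F_up(...)
    z = layers.Conv2D(u.shape[-1], 1)(z)                                 # * W_1 + b_1 (1x1 convolution)
    z = layers.Conv2D(u.shape[-1], 1)(z)                                 # * W_2 + b_2 (1x1 convolution)
    s = layers.Activation("sigmoid")(z)                                  # attention weights S
    return layers.Multiply()([u, s])                                     # re-weight the input features
```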
As shown in fig. 6, Block3 includes two parallel two-dimensional pooling layers and a Concatenate layer that combines the outputs of the two pooling layers along the last tensor dimension and outputs the result.
In the Block3 contained in the third Block4, the pooling region of the MaxPooling2D layer realizes 2-fold downsampling, padding is used so that the output area equals the input image area, and the Concatenate layer combines the outputs of the two parallel two-dimensional pooling layers along the last tensor dimension and outputs the result.
As shown in fig. 7, Block5 includes two two-dimensional convolutional layers and a sigmoid layer; a hybrid attention mechanism is also introduced before its first two-dimensional convolutional layer.
In this embodiment, the two-dimensional convolutional layer in Block1 realizes 2-fold downsampling.
The number of convolution kernels and the strides of the two-dimensional convolutional layers in the first Block2 and the second Block2 are the same.
The number of convolution kernels of the two Block2s contained in the first Block4 increases to 48, in the second Block4 to 96, and in the third Block4 to 192; in each of the three Block4s, the first of its two Block2s performs the 2-fold downsampling.
The number of convolution kernels of the first two-dimensional convolutional layer of Block5 increases to 768, and the second two-dimensional convolutional layer has a 1×1 kernel with a kernel count of 1. The prediction result is finally output through GlobalAveragePooling2D and a one-dimensional sigmoid, which judges whether the sound is crying.
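The block wiring described above can be sketched as follows, assuming Keras. The way the skip paths are merged (element-wise addition), the pooling types used in Block3 (max and average), the axis along which the 2-fold downsampling is applied, the 24-kernel width of the first two Block2s, the simplified stand-in for the hybrid attention module, and the input shape are all assumptions of this sketch rather than details fixed by the patent:

```python
from tensorflow.keras import Input, Model, layers

def hybrid_attention(u):
    # Stand-in for the hybrid attention module sketched after equation (11),
    # reduced here to a 1x1-convolution + sigmoid gate to keep this sketch short.
    s = layers.Activation("sigmoid")(layers.Conv2D(u.shape[-1], 1)(u))
    return layers.Multiply()([u, s])

def block1(x):
    # Block1: BatchNorm + 3x3 Conv2D, 24 kernels, strides (1, 2) for 2-fold downsampling.
    return layers.Conv2D(24, 3, strides=(1, 2), padding="same")(layers.BatchNormalization()(x))

def block2(x, filters, downsample=False):
    # Block2: two 3x3 convolutions with hybrid attention after the second one.
    strides = (1, 2) if downsample else (1, 1)
    y = layers.Conv2D(filters, 3, strides=strides, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return hybrid_attention(y)

def block3(x):
    # Block3: two parallel pooling layers concatenated along the channel axis,
    # which halves the time axis and doubles the channel count of the skip path.
    p1 = layers.MaxPooling2D(pool_size=(1, 2), padding="same")(x)
    p2 = layers.AveragePooling2D(pool_size=(1, 2), padding="same")(x)
    return layers.Concatenate(axis=-1)([p1, p2])

def block4(x, filters):
    # Block4: two Block2 units; the input also reaches the second Block2 through
    # the Block3 skip path. Merging the two paths by addition is an assumption.
    main = block2(x, filters, downsample=True)            # third Block2 (2-fold downsampling)
    skip = block3(x)                                      # skip path with matching shape
    return block2(layers.Add()([main, skip]), filters)    # fourth Block2

def block5(x):
    # Block5: hybrid attention, 768-kernel conv, 1x1 single-kernel conv,
    # global average pooling and a sigmoid score (crying / not crying).
    y = hybrid_attention(x)
    y = layers.Conv2D(768, 3, padding="same", activation="relu")(y)
    y = layers.Conv2D(1, 1)(y)
    return layers.Activation("sigmoid")(layers.GlobalAveragePooling2D()(y))

def build_model(input_shape=(96, 432, 1)):                # (feature bins, frames, channels), illustrative
    inp = Input(shape=input_shape)
    x1 = block1(inp)                                      # Block1
    x2 = block2(x1, 24)                                   # first Block2
    in2 = layers.Add()([x1, x2])                          # skip: Block1 output -> second Block2 input
    x3 = block2(in2, 24)                                  # second Block2
    in4 = layers.Add()([in2, x3])                         # skip: second Block2 input -> first Block4 input
    x4 = block4(in4, 48)                                  # first Block4
    x5 = block4(x4, 96)                                   # second Block4
    x6 = block4(x5, 192)                                  # third Block4
    return Model(inp, block5(x6))
```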
The recognition capability of the final model obtained in this embodiment is shown in Table 2:
Table 2: Comparison of the residual network of the invention with and without the attention mechanism

Model | Model score |
---|---|
Residual network without attention mechanism | 96.5% |
Residual network with attention mechanism | 98.6% |

As can be seen from Table 2, the residual network performs better after the attention mechanism is added, and the residual network itself also resolves the gradient-vanishing problem that an overly deep convolutional neural network may cause.
The invention also discloses a crying detection system based on attention residual learning, which includes:
a first data acquisition module: used for collecting the sound data to be detected;
a second data acquisition module: used for collecting sample data;
a data preprocessing module: used for preprocessing the sample data;
a feature extraction module: used for extracting audio features from the sample data;
a crying model module: used for training the audio features in the sample data with the attention-based residual neural network algorithm to obtain a crying model;
a crying recognition module: used for inputting the sound data to be detected into the crying model for calculation and determining whether the sound data to be detected is crying.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (7)
1. A crying detection method based on attention residual learning, characterized by comprising the following steps:
S1, collecting crying data;
S2, dividing the crying data into a training set and a verification set;
S3, training the constructed attention-based residual neural network with the training set to obtain a trained residual neural network, and evaluating the training result with the verification set;
the residual neural network comprises a Block1, a first Block2, a second Block2, a first Block4, a second Block4, a third Block4 and a Block5 which are sequentially connected in series; the output of the Block1 is connected with the input of the second Block2 through a skip connection unit, and the input of the second Block2 is connected with the input of the first Block4 through a skip connection unit;
a hybrid attention mechanism is introduced into each of the first Block2, the second Block2, the first Block4, the second Block4, the third Block4 and the Block5;
the Block1 includes a two-dimensional convolutional layer for implementing 2-fold down-sampling;
the first Block4, the second Block4 and the third Block4 all comprise a third Block2 and a fourth Block 2; the third Block2 is connected in series with a fourth Block 2;
the first Block2, the second Block2, the third Block2 and the fourth Block2 each comprise two two-dimensional convolutional layers, and a hybrid attention mechanism is introduced after the second two-dimensional convolutional layer;
the Block5 comprises two two-dimensional convolutional layers and a sigmoid layer, and a hybrid attention mechanism is introduced before the first two-dimensional convolutional layer;
the input of the third Block2 is connected to the input of a fourth Block2 through a hopping connection unit comprising Block 3;
the Block3 includes two parallel two-dimensional pooling layers and a Concatenate layer for combining the outputs of the two pooling layers along the last tensor dimension and outputting the result.
2. The method of claim 1, characterized in that: the pooling region of the two-dimensional pooling layers in the Block3 contained in the third Block4 implements 2-fold downsampling, padding is used so that the output area equals the input image area, and the Concatenate layer combines the outputs of the two parallel two-dimensional pooling layers along the last tensor dimension and outputs the result.
3. The method of claim 1 or 2, characterized in that: the formula of the hybrid attention mechanism is as follows:
S = σ((F_up(F_res(F_res(F_dn(U)) + F_up(F_res(F_res(F_dn(F_res(F_dn(U))))))))·W_1 + b_1)·W_2 + b_2)   (1)
4. The method of claim 1 or 2, wherein the method comprises: in S1, the sample is amplified according to the signal-to-noise ratio.
5. The method of claim 1, wherein the method comprises: the preprocessing of the collected crying data before S2 includes two ways:
the first method is as follows: pre-emphasis is carried out on the voice signals;
the second method comprises the following steps: speech signals are framed and windowed.
6. The method of claim 1 or 5, wherein the method comprises: in the step S3, feature extraction is performed on the data in the training set, and the extracted audio features are used for training a residual error neural network;
the audio features comprise at least one of the short-time zero-crossing rate, short-time average energy, short-time average amplitude, energy entropy, spectral centroid, spectral entropy, spectral flux, Mel-frequency cepstral coefficients and chromagram.
7. A crying detection system based on attention residual learning, characterized by comprising:
a first data acquisition module: used for collecting the sound data to be detected;
a second data acquisition module: used for collecting sample data;
a data preprocessing module: used for preprocessing the sample data;
a feature extraction module: used for extracting audio features from the sample data;
a crying model module: used for training the audio features in the sample data with the attention-based residual neural network algorithm to obtain a crying model;
a crying recognition module: used for inputting the sound data to be detected into the crying model for calculation and determining whether the sound data to be detected is crying;
the residual neural network comprises a Block1, a first Block2, a second Block2, a first Block4, a second Block4, a third Block4 and a Block5 which are sequentially connected in series; the output of the Block1 is connected with the input of the second Block2 through a skip connection unit, and the input of the second Block2 is connected with the input of the first Block4 through a skip connection unit;
a hybrid attention mechanism is introduced into each of the first Block2, the second Block2, the first Block4, the second Block4, the third Block4 and the Block5;
the Block1 includes a two-dimensional convolution layer to implement 2-fold down-sampling;
the first Block4, the second Block4 and the third Block4 all comprise a third Block2 and a fourth Block 2; the third Block2 is connected in series with a fourth Block 2;
the first Block2, the second Block2, the third Block2 and the fourth Block2 each comprise two two-dimensional convolutional layers, and a hybrid attention mechanism is introduced after the second two-dimensional convolutional layer;
the Block5 comprises two two-dimensional convolutional layers and a sigmoid layer, and a hybrid attention mechanism is introduced before the first two-dimensional convolutional layer;
the input of the third Block2 is connected to the input of the fourth Block2 through a skip connection unit comprising a Block3;
the Block3 includes two parallel two-dimensional pooling layers and a Concatenate layer for combining the outputs of the two pooling layers along the last tensor dimension and outputting the result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110224859.0A CN112863550B (en) | 2021-03-01 | 2021-03-01 | Crying detection method and system based on attention residual learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110224859.0A CN112863550B (en) | 2021-03-01 | 2021-03-01 | Crying detection method and system based on attention residual learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112863550A CN112863550A (en) | 2021-05-28 |
CN112863550B (en) | 2022-08-16
Family
ID=75990713
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110224859.0A Active CN112863550B (en) | 2021-03-01 | 2021-03-01 | Crying detection method and system based on attention residual learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112863550B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112992121B (en) * | 2021-03-01 | 2022-07-12 | 德鲁动力科技(成都)有限公司 | Voice enhancement method based on attention residual error learning |
CN113851115A (en) * | 2021-09-07 | 2021-12-28 | 中国海洋大学 | Complex sound identification method based on one-dimensional convolutional neural network |
CN114333898A (en) * | 2021-12-10 | 2022-04-12 | 科大讯飞股份有限公司 | Sound event detection method, device and system and readable storage medium |
CN116386661B (en) * | 2023-06-05 | 2023-08-08 | 成都启英泰伦科技有限公司 | Crying detection model training method based on dual attention and data enhancement |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111477221A (en) * | 2020-05-28 | 2020-07-31 | 中国科学技术大学 | Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network |
CN112992121A (en) * | 2021-03-01 | 2021-06-18 | 德鲁动力科技(成都)有限公司 | Voice enhancement method based on attention residual error learning |
CN113012714A (en) * | 2021-02-22 | 2021-06-22 | 哈尔滨工程大学 | Acoustic event detection method based on pixel attention mechanism capsule network model |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107818779A (en) * | 2017-09-15 | 2018-03-20 | 北京理工大学 | A kind of infant's crying sound detection method, apparatus, equipment and medium |
CN108511002B (en) * | 2018-01-23 | 2020-12-01 | 太仓鸿羽智能科技有限公司 | Method for recognizing sound signal of dangerous event, terminal and computer readable storage medium |
CN110110729B (en) * | 2019-03-20 | 2022-08-30 | 中国地质大学(武汉) | Building example mask extraction method for realizing remote sensing image based on U-shaped CNN model |
WO2020222985A1 (en) * | 2019-04-30 | 2020-11-05 | The Trustees Of Dartmouth College | System and method for attention-based classification of high-resolution microscopy images |
CN110600059B (en) * | 2019-09-05 | 2022-03-15 | Oppo广东移动通信有限公司 | Acoustic event detection method and device, electronic equipment and storage medium |
CN110675405B (en) * | 2019-09-12 | 2022-06-03 | 电子科技大学 | Attention mechanism-based one-shot image segmentation method |
KR102276964B1 (en) * | 2019-10-14 | 2021-07-14 | 고려대학교 산학협력단 | Apparatus and Method for Classifying Animal Species Noise Robust |
CN111859954A (en) * | 2020-07-01 | 2020-10-30 | 腾讯科技(深圳)有限公司 | Target object identification method, device, equipment and computer readable storage medium |
CN112382311B (en) * | 2020-11-16 | 2022-08-19 | 谭昊玥 | Infant crying intention identification method and device based on hybrid neural network |
CN112382302A (en) * | 2020-12-02 | 2021-02-19 | 漳州立达信光电子科技有限公司 | Baby cry identification method and terminal equipment |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111477221A (en) * | 2020-05-28 | 2020-07-31 | 中国科学技术大学 | Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network |
CN113012714A (en) * | 2021-02-22 | 2021-06-22 | 哈尔滨工程大学 | Acoustic event detection method based on pixel attention mechanism capsule network model |
CN112992121A (en) * | 2021-03-01 | 2021-06-18 | 德鲁动力科技(成都)有限公司 | Voice enhancement method based on attention residual error learning |
Also Published As
Publication number | Publication date |
---|---|
CN112863550A (en) | 2021-05-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112863550B (en) | Crying detection method and system based on attention residual learning | |
CN112992121B (en) | Voice enhancement method based on attention residual error learning | |
Dewi et al. | The study of baby crying analysis using MFCC and LFCC in different classification methods | |
CN103489454B (en) | Based on the sound end detecting method of wave configuration feature cluster | |
Sroka et al. | Human and machine consonant recognition | |
CN111292762A (en) | Single-channel voice separation method based on deep learning | |
JPH08509556A (en) | Method and system for detecting and generating transients in acoustic signals | |
CN103054586B (en) | Chinese speech automatic audiometric method based on Chinese speech audiometric dynamic word list | |
JP2015096921A (en) | Acoustic signal processing device and method | |
CN112185342A (en) | Voice conversion and model training method, device and system and storage medium | |
Roy et al. | DeepLPC-MHANet: Multi-head self-attention for augmented Kalman filter-based speech enhancement | |
Roy et al. | DeepLPC: A deep learning approach to augmented Kalman filter-based single-channel speech enhancement | |
Wu et al. | Research on acoustic feature extraction of crying for early screening of children with autism | |
CN103258537A (en) | Method utilizing characteristic combination to identify speech emotions and device thereof | |
Eklund | Data augmentation techniques for robust audio analysis | |
CN114255780B (en) | Noise robust blind reverberation time estimation method based on deep neural network | |
CN115565550A (en) | Baby crying emotion identification method based on characteristic diagram light convolution transformation | |
Adam et al. | Wavelet cesptral coefficients for isolated speech recognition | |
CN116386589A (en) | Deep learning voice reconstruction method based on smart phone acceleration sensor | |
CN114302301B (en) | Frequency response correction method and related product | |
Cai et al. | The best input feature when using convolutional neural network for cough recognition | |
Zhang et al. | URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement | |
CN114446316A (en) | Audio separation method, and training method, device and equipment of audio separation model | |
CN107039046B (en) | Voice sound effect mode detection method based on feature fusion | |
Gupta et al. | Morse wavelet transform-based features for voice liveness detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |