CN112863550B - Crying detection method and system based on attention residual learning - Google Patents

Crying detection method and system based on attention residual learning

Info

Publication number
CN112863550B
CN112863550B (application number CN202110224859.0A)
Authority
CN
China
Prior art keywords
block2
block4
crying
data
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110224859.0A
Other languages
Chinese (zh)
Other versions
CN112863550A (en)
Inventor
李学生
李晨
朱麒宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Delu Power Technology Chengdu Co ltd
Original Assignee
Delu Power Technology Chengdu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Delu Power Technology Chengdu Co ltd filed Critical Delu Power Technology Chengdu Co ltd
Priority to CN202110224859.0A priority Critical patent/CN112863550B/en
Publication of CN112863550A publication Critical patent/CN112863550A/en
Application granted granted Critical
Publication of CN112863550B publication Critical patent/CN112863550B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a crying detection method and system based on attention residual learning, comprising: S1, collecting crying data; S2, dividing the crying data into a training set and a verification set; S3, training the constructed attention-based residual neural network with the training set to obtain a trained attention-based residual neural network, and evaluating the training results with the verification set. Introducing a residual network solves the vanishing-gradient problem of CNN models with many layers, and introducing an attention mechanism lets the residual model focus on the features that express crying, so the accuracy of crying recognition in real scenes and the generalization ability in practical scenes are both improved.

Description

Crying detection method and system based on attention residual learning
Technical Field
The invention relates to the technical field of voice recognition, in particular to a crying detection method and system based on attention residual learning.
Background
Existing voice recognition on quadruped robots lacks abnormal-sound detection, especially on family-companion robot dogs. Crying is the main way infants express themselves, so automatic detection of infant crying plays an important role in family companionship and can effectively reduce the nursing burden on parents. There have been many studies on feature selection, model selection and mechanisms for baby cry; conventional machine learning methods such as SVM, and classification of spectrograms with CNN models, are commonly used.
Traditional machine learning methods such as SVM generally depend on feature selection: the quality of the selected features determines the quality of the recognition result, and hand-picked features rarely reflect the signal comprehensively. A convolutional neural network can learn features from the spectrogram, but training becomes difficult as the number of layers grows, shallow CNN models used for infant-cry detection give poor results, and crying recognition is mainly challenged by the uncertainty and instability of noise in the actual environment.
In a real environment containing unstable noise, using a single feature or too few features makes the cry recognition rate too low; yet when complex features are adopted, a deep CNN model suffers from a potential vanishing-gradient problem.
Disclosure of Invention
The invention provides a crying detection method and system based on attention residual error learning to solve the technical problems.
The invention is realized by the following technical scheme:
the crying detection method based on attention residual learning comprises the following steps:
s1, collecting crying data;
s2, dividing the crying data into a training set and a verification set;
s3, training the constructed residual error neural network based on the attention mechanism by adopting a training set to obtain a trained residual error neural network; and the training results are evaluated by adopting a validation set.
Further, the residual neural network comprises a Block1, a first Block2, a second Block2, a first Block4, a second Block4, a third Block4 and a Block5 which are sequentially connected in series; the output of the Block1 is connected with the input of the second Block2 through a skip connection unit, and the input of the second Block2 is connected with the input of the first Block4 through a skip connection unit;
the first Block2, the second Block2, the first Block4, the second Block4, the third Block4 and the Block5 all introduce a hybrid attention mechanism.
Further, the Block1 includes a two-dimensional convolutional layer for achieving 2-fold down-sampling;
the first Block4, the second Block4 and the third Block4 all comprise a third Block2 and a fourth Block 2; the third Block2 is connected in series with a fourth Block 2;
the first Block2, the second Block2, the third Block2 and the fourth Block2 all comprise two-dimensional convolutional layers, and a hybrid attention mechanism is introduced after the second two-dimensional convolutional layer;
the Block5 includes two-dimensional convolutional layers and a sigmoid layer, and a hybrid attention mechanism is introduced before the first two-dimensional convolutional layer.
Further, the input of the third Block2 is connected with the input of the fourth Block2 through a skip connection unit comprising a Block3;
the Block3 includes two parallel two-dimensional pooling layers and a concatenate layer for combining and outputting the outputs of the two-dimensional pooling layers along the last tensor dimension.
Further, the pooling region of the two-dimensional pooling layers in the Block3 included in the third Block4 is used to implement 2-fold downsampling, padding is used to make the output area match the input area, and the concatenate layer is used to combine and output the outputs of the two parallel two-dimensional pooling layers along the last tensor dimension.
Further, it is characterized in that the formula of the hybrid attention mechanism is as follows:
S = σ((F_up(F_res(F_res(F_dn(U)) + F_up(F_res(F_res(F_dn(F_res(F_dn(U))))))))·W_1 + b_1)·W_2 + b_2)    (1)
In formula (1), F_dn denotes max pooling, F_up denotes bilinear interpolation, F_res denotes the residual-unit computation, σ denotes the sigmoid function, and S is the resulting attention mechanism weight; W_1 and W_2 are the convolution kernel weights; b_1 and b_2 are the convolution kernel biases.
Furthermore, the number of convolution kernels of the two-dimensional convolution layer of Block1 is 24;
the convolution kernel size, the number and the step number of the two-dimensional convolution layers in the first Block2 and the second Block2 are the same;
the number of convolution kernels of the two-dimensional convolution layers contained in the third Block2 and the fourth Block2 in the first Block4 is 48;
the number of convolution kernels of the two-dimensional convolution layers contained in the third Block2 and the fourth Block2 in the second Block4 is 96;
the number of convolution kernels of the two-dimensional convolution layers contained in the third Block2 and the fourth Block2 in the third Block4 is 192;
the third Block2 is used to implement 2 times downsampling;
the number of convolution kernels of the first two-dimensional convolution layer of Block5 is increased to 768; the convolution kernel size of the second two-dimensional convolution layer is 1 and the number is 1.
Further, in S1, the sample is amplified according to the signal-to-noise ratio.
Further, step S2 is preceded by preprocessing the collected crying data, and the preprocessing includes two operations:
the first: pre-emphasis is carried out on the voice signal;
the second: the voice signal is framed and windowed.
Further, in S3, feature extraction is performed on data in a training set, and the extracted audio features are used for training a residual neural network;
the audio features comprise at least one of short-time zero crossing rate, short-time average energy, short-time average amplitude, energy entropy, frequency spectrum centroid, spectrum entropy, frequency spectrum flux, mel-frequency cepstrum coefficient and chromatogram.
Cry detection system based on attention residual learning, including:
the first data acquisition module: the device is used for collecting sound data to be detected;
the second data acquisition module: the device is used for collecting sample data;
a data preprocessing module: the method is used for preprocessing sample data:
a feature extraction module: for extracting audio features in sample data:
a crying model module: the system is used for training the audio features in the training sample data by using a residual error neural network algorithm based on an attention mechanism to obtain a crying model;
crying identification module: and the crying model is used for inputting the data of the sound to be detected into the crying model for calculation, and determining whether the data of the sound to be detected is crying or not.
Compared with the prior art, the invention has the following beneficial effects:
the method of introducing the residual network solves the problem that the gradient of the CNN model with extremely large layer number disappears, and introduces an attention mechanism to enable the residual model to be more added with the characteristic of being capable of expressing crying; the method can improve the accuracy of crying recognition in a real scene and improve the generalization capability in a real scene.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention.
FIG. 1 is a flow chart of model training;
FIG. 2 is a schematic diagram of a residual block;
FIG. 3 is a block diagram of a residual neural network based on an attention mechanism;
FIG. 4 is a Block diagram of Block 1;
FIG. 5 is a Block diagram of Block 2;
FIG. 6 is a Block diagram of Block 3;
FIG. 7 is a Block diagram of Block 5.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
As shown in fig. 1, the crying detection method based on attention residual learning disclosed by the invention comprises the following steps:
s1, collecting crying data;
s2, dividing the crying data into a training set and a verification set;
s3, training the constructed residual error neural network based on the attention mechanism by adopting a training set to obtain a trained residual error neural network based on the attention mechanism; and the training results are evaluated by adopting a validation set.
The sound data to be detected is input into the trained attention-based residual neural network, which identifies whether the sound data to be detected is crying.
Based on the above method, the present invention discloses an embodiment.
Example 1
As shown in fig. 1, the present embodiment includes the following steps:
step 1, collecting crying data samples.
In this embodiment, the data set sources mainly include three:
450 clean crying recordings collected from the "donate-a-cry" project on GitHub; 40 crying recordings from the "Crying baby" category of the ESC-50 dataset; and 400 crying recordings manually recorded from the internet.
All clips are non-silent and 5 seconds long, and negative samples were taken from the other categories of the ESC-50 dataset. In total there were 890 positive samples (baby crying) and 900 negative samples, so the positive and negative samples are roughly balanced.
Because the number of data samples is small, in order to better conform to the practical application environment, the embodiment also provides a method for performing data amplification on the collected sample data, which specifically comprises the following steps:
Common indoor household environmental noise, such as air-conditioning sound, was selected from UrbanSound8K. Tests show that different signal-to-noise ratios lead to different model performance, and model accuracy degrades markedly when the noise is stronger (the signal-to-noise ratio is lower); a signal-to-noise ratio of 35 dB was finally chosen for sample amplification.
Signal-to-noise ratio: the ratio of the power of the useful speech signal to the power of the noise mixed into it. The signal-to-noise ratio can be calculated using equation (1):
SNR = 10·lg( Σ_n s²(n) / Σ_n r²(n) )    (1)
In formula (1), s(n) is the speech signal and r(n) is the noise signal.
In this embodiment, 50% of sample data is finally selected for sample amplification, so that 1335 positive samples and 1350 negative samples are obtained.
By adding noise to the data at controlled signal-to-noise ratios, the sample data are amplified, which improves the accuracy of crying recognition in real scenes and the generalization capability in practical scenes.
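For concreteness, the following is a minimal Python sketch of this SNR-controlled amplification, assuming 1-D numpy arrays at a common sampling rate; the function name, the tiling of the noise clip and the default 35 dB target are illustrative choices, not details fixed by the patent.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float = 35.0) -> np.ndarray:
    """Add environmental noise to a clip at a target signal-to-noise ratio (dB).

    Sketch only: both inputs are 1-D float arrays at the same sampling rate.
    """
    # Tile or crop the noise so it covers the whole speech clip.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    # Signal and noise powers; equation (1) is 10*lg(P_speech / P_noise).
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12

    # Scale the noise so that 10*log10(p_speech / (scale**2 * p_noise)) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

Applying such a function to a randomly chosen 50% of the clips, as described above, yields the 1335 positive and 1350 negative samples.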
And 2, preprocessing data.
In this embodiment, data preprocessing mainly consists of two steps: pre-emphasis, and framing with windowing.
2.1 pre-emphasis.
During speech production, most of the sound energy is concentrated at low frequencies and the high-frequency components are attenuated more strongly; pre-emphasis compensates for this high-frequency attenuation. Concretely, the audio signal is passed through a first-order FIR high-pass filter, which makes the spectrum of the pre-emphasized speech signal flatter. The pre-emphasis transfer function is:
H(z) = 1 − αz⁻¹    (2)
In formula (2), α is a constant, the pre-emphasis coefficient, which determines the pre-emphasis strength; its value range is 0.9 < α < 1.
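As a sketch, the first-order FIR pre-emphasis H(z) = 1 − αz⁻¹ reduces to a one-line difference equation in the time domain; α = 0.97 below is a typical value within the stated range 0.9 < α < 1, not a value fixed by this embodiment.

```python
import numpy as np

def pre_emphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order FIR high-pass pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])
```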
2.2 framing and windowing.
In an audio signal, the frequency content changes over time, so features cannot be extracted directly from the whole clip; the speech signal is therefore divided into 10 ms to 30 ms segments, which can be regarded as stationary over such short durations.
Framing is typically implemented by windowing; the windowing formula is:
S_w(n) = S(n)·W(n)    (3)
In formula (3), S(n) represents the original signal and W(n) represents the window function.
Commonly used window functions are rectangular windows, Hamming windows, Hanning windows.
Rectangular window formula:
w(n) = 1, 0 ≤ n ≤ N−1;  w(n) = 0 otherwise    (4)
Hamming window formula:
w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1;  w(n) = 0 otherwise    (5)
Hanning window formula:
w(n) = 0.5·[1 − cos(2πn/(N−1))], 0 ≤ n ≤ N−1;  w(n) = 0 otherwise    (6)
this embodiment takes a Hamming window as the window function. And the values chosen in window length and frame shift are: the window length is 2048 points, the frame shift is 1024 points, and the best effect is achieved in the subsequent feature extraction.
And 3, constructing a training set and a verification set.
And 4, performing feature extraction and feature combination on the data in the training set.
Although a neural network is capable of extracting the information contained in the data on its own, processing the raw audio signal directly is very difficult, so feature engineering is essential: good feature extraction can greatly improve the recognition performance of the neural network and the accuracy and efficiency of training. Feature extraction for speech is very mature, and the commonly used speech features are as follows:
1, short-time zero-crossing rate: the number of zero crossings of the signal per unit time is defined as the zero-crossing rate; the short-time zero-crossing rate corresponds visually to the number of times the signal waveform crosses the time axis.
2, short-time average energy: the short-time average energy can help distinguish voiced from unvoiced segments; when the signal-to-noise ratio is high, the signal is clean and the noise component is small, it can be used to separate voiced segments from silence and thus cut out the silent segments.
The mathematical definition of the short-time average energy is a weighted sum of squares of the signal amplitudes within a frame, which is mathematically represented as:
E_n = Σ_m [x(m)·w(n−m)]²    (7)
In formula (7), x(m) represents the sound signal and w(·) represents the window function.
3, short-time average amplitude: the short-time average energy requires the sum of squares of the signal samples, and squaring is too sensitive to the signal level; if a high level is encountered, the value rises sharply and may even overflow. To overcome this drawback, the short-time average amplitude replaces the sum of squares with a sum of absolute values, and it can also measure changes in sound intensity. The mathematical expression is:
M_n = Σ_m |x(m)|·w(n−m)    (8)
In formula (8), x(m) represents the sound signal and w(·) represents the window function.
4, energy entropy: the energy entropy can describe the time variation degree of the audio signal and can be used as an audio characteristic. This feature has a higher value if there is a sudden change in the energy envelope of the signal.
5, spectral centroid: the spectral centroid indicates in which frequency band the sound energy is concentrated. The higher the spectral centroid, the more the signal's energy is concentrated at high frequencies. Sounds dominated by low-frequency components are perceived as lower and duller and have a relatively low spectral centroid, while sounds with more high-frequency components are perceived as brighter and have a relatively high spectral centroid.
6, spectral entropy: the spectral entropy reflects the complexity contained in the audio signal; the greater the complexity, the greater the spectral entropy. The mathematical expression is:
H = −Σ_ω f(ω)·log f(ω)    (9)
In formula (9), f(ω) is the spectral density function of a frame of the signal.
7, spectral flux: spectral flux quantifies how the spectrum changes over time. A spectrally stable or nearly constant signal, such as white Gaussian noise, has low spectral flux, while a signal with abrupt spectral changes has high spectral flux.
8, mel-frequency cepstral coefficients: the mel-frequency cepstral coefficients (MFCC) are an important feature in speech processing. They are obtained by mapping the log power spectrum of the signal onto the nonlinear mel frequency scale and applying a linear cosine transform; MFCC reflect the nonlinear characteristics of human auditory perception of frequency. The mel scale is defined as:
Mel(f) = 2595·lg(1 + f/700)    (10)
In formula (10), f is the linear frequency in Hz.
9, chromagram: the chromagram divides the whole spectrum into 12 bands corresponding to the 12 semitone classes (chroma) of the musical octave, so the spectral content can be grouped by chroma.
The results obtained by training with different combinations of features are shown in table 1:
Table 1: Improvement in model performance for different feature combinations
In Table 1, MSG stands for the logarithmic Mel spectrogram, MFCC for mel-frequency cepstral coefficients, CG for the chromagram, and ZCR for the zero-crossing rate.
The audio features finally selected in this embodiment are therefore the combination of logarithmic Mel spectrogram, mel-frequency cepstral coefficients, chromagram and zero-crossing rate.
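A sketch of extracting this feature combination with librosa is given below. The frame parameters mirror the 2048/1024 choice above; the sampling rate, the number of MFCCs (20) and the vertical stacking of all features into a single (features × frames) matrix are assumptions about how the combined input is assembled, not details stated in the patent.

```python
import numpy as np
import librosa

def extract_features(y: np.ndarray, sr: int = 22050,
                     n_fft: int = 2048, hop: int = 1024) -> np.ndarray:
    """Log-Mel spectrogram + MFCC + chromagram + zero-crossing rate,
    stacked along the feature axis into one (features, frames) array."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    log_mel = librosa.power_to_db(mel)                                   # logarithmic Mel spectrogram (MSG)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=n_fft, hop_length=hop)             # mel-frequency cepstral coefficients
    chroma = librosa.feature.chroma_stft(y=y, sr=sr,
                                         n_fft=n_fft, hop_length=hop)    # 12-band chromagram (CG)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft,
                                             hop_length=hop)             # zero-crossing rate (ZCR)
    return np.vstack([log_mel, mfcc, chroma, zcr])
```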
And 5, designing a residual error neural network based on an attention mechanism, and training the residual error neural network by adopting a training set.
The performance of a convolutional neural network is strongly related to the depth of the network: a deeper network structure can improve the recognition effect. In practice, however, once a convolutional network exceeds a certain depth its performance stops improving and may even degrade; this phenomenon is called gradient vanishing. Residual blocks are added to the convolutional network and residual units can be connected across layers, so that in a deep convolutional network the output of some layers can be passed directly across intermediate layers to later layers.
As shown in fig. 2, the residual block passes the input through the skip mapping r(x) directly to the output and adds it to the branch output F(x); the learning objective of the network changes accordingly: it is no longer the overall mapping H(x) but the residual, i.e. the difference between the output and the input, F(x) = H(x) − x.
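A minimal Keras sketch of such a residual block is shown below; the two 3×3 convolutions, batch normalization and ReLU placement are a common choice assumed here, and the input is assumed to already have `filters` channels so the identity skip can be added directly.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x: tf.Tensor, filters: int) -> tf.Tensor:
    """y = F(x) + x: the stacked layers only learn the residual F(x) = H(x) - x."""
    shortcut = x                                     # identity skip connection r(x)
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                  # H(x) = F(x) + x
    return layers.ReLU()(y)
```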
The present embodiment designs a residual neural network based on attention mechanism as shown in fig. 3, which includes: the device comprises a Block1, a first Block2, a second Block2, a first Block4, a second Block4, a third Block4 and a Block5 which are sequentially connected in series.
The output of Block1 is connected to the input of the second Block2 through a skip connection unit, and the input of the second Block2 is connected to the input of the first Block4 through a skip connection unit; the first Block4, the second Block4 and the third Block4 each comprise a third Block2, a fourth Block2 and a Block3. The third Block2 is connected in series with the fourth Block2, and the input of the third Block2 is connected to the input of the fourth Block2 through a skip connection unit comprising Block3.
As shown in FIG. 4, Block1 includes a Batch Normalization layer and a two-dimensional convolution layer (Conv2D) that realizes 2-fold downsampling. The convolution kernels of the two-dimensional convolution layer are 3×3 in size and 24 in number, with strides (1,2), which realizes the 2-fold downsampling.
As shown in FIG. 5, Block2 includes two two-dimensional convolutional layers, and a hybrid attention mechanism (interpolated-attn) is introduced after the second one. The formula of the hybrid attention mechanism is:
S = σ((F_up(F_res(F_res(F_dn(U)) + F_up(F_res(F_res(F_dn(F_res(F_dn(U))))))))·W_1 + b_1)·W_2 + b_2)    (11)
In formula (11), F_dn denotes max pooling, F_up denotes bilinear interpolation, F_res denotes the residual-unit computation, σ denotes the sigmoid function, and S is the resulting attention mechanism weight; W_1 and W_2 are the convolution kernel weights; b_1 and b_2 are the convolution kernel biases.
With the hybrid attention mechanism introduced, the number of channels is unchanged from input to output in every layer of the module. The module reduces the spatial dimensions by downsampling, which enlarges the receptive field of the convolutional features and makes it possible to infer where the salient high-frequency features of the input lie; it then upsamples by interpolation, restoring the original size while localizing the feature regions more precisely.
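One way to realize equation (11) in Keras is sketched below, reusing the residual_block above for F_res, MaxPooling2D for F_dn, bilinear UpSampling2D for F_up and two 1×1 convolutions for (W_1, b_1) and (W_2, b_2). Applying S to the input by element-wise multiplication, and requiring even spatial dimensions and `filters` input channels, are assumptions of this sketch.

```python
def hybrid_attention(u: tf.Tensor, filters: int) -> tf.Tensor:
    """Interpolated (hybrid) attention following equation (11); sketch only."""
    f_dn = lambda t: layers.MaxPooling2D(pool_size=2)(t)                        # F_dn: max pooling
    f_up = lambda t: layers.UpSampling2D(size=2, interpolation="bilinear")(t)   # F_up: bilinear interpolation
    f_res = lambda t: residual_block(t, filters)                                # F_res: residual unit

    # Inner branch: F_up(F_res(F_res(F_dn(F_res(F_dn(U))))))
    inner = f_up(f_res(f_res(f_dn(f_res(f_dn(u))))))
    # Outer branch: F_up(F_res(F_res(F_dn(U)) + inner))
    outer = f_up(f_res(layers.Add()([f_res(f_dn(u)), inner])))

    # Two 1x1 convolutions play the role of (W_1, b_1) and (W_2, b_2); sigmoid yields S.
    s = layers.Conv2D(filters, 1)(outer)
    s = layers.Conv2D(filters, 1, activation="sigmoid")(s)
    return layers.Multiply()([u, s])  # re-weight the input features with the attention map S
```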
As shown in fig. 6, Block3 includes two parallel two-dimensional pooling layers and a concatenate layer that combines and outputs the outputs of the two pooling layers along the last tensor dimension.
In the Block3 contained in the third Block4, the pooling region of MaxPooling2D is used to implement 2-fold downsampling, padding is used to make the output area match the input area, and the concatenate layer combines and outputs the outputs of the two parallel two-dimensional pooling layers along the last tensor dimension.
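Putting the pieces together, the following sketch shows one plausible reading of Block2, Block3 and Block4 in Keras. The choice of a max-pooling and an average-pooling branch inside Block3, the uniform placement of the 2-fold downsampling in both the main path and the skip path, and the fact that concatenating the two pooled copies doubles the channel count to match the doubled kernel number of the next stage are all assumptions of this sketch.

```python
def block2(x: tf.Tensor, filters: int, strides: int = 1) -> tf.Tensor:
    """Block2: two Conv2D layers with hybrid attention after the second one."""
    y = layers.Conv2D(filters, 3, strides=strides, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    return hybrid_attention(y, filters)

def block3(x: tf.Tensor, downsample: bool = True) -> tf.Tensor:
    """Block3: two parallel 2-D pooling layers concatenated on the last (channel) axis."""
    stride = 2 if downsample else 1
    a = layers.MaxPooling2D(pool_size=2, strides=stride, padding="same")(x)
    b = layers.AveragePooling2D(pool_size=2, strides=stride, padding="same")(x)  # assumed second branch
    return layers.Concatenate(axis=-1)([a, b])           # doubles the channel count

def block4(x: tf.Tensor, filters: int, downsample: bool = True) -> tf.Tensor:
    """Block4: third Block2 (downsampling) -> Block3 skip added -> fourth Block2."""
    main = block2(x, filters, strides=2 if downsample else 1)    # third Block2
    skip = block3(x, downsample=downsample)                      # skip path to the fourth Block2's input
    merged = layers.Add()([main, skip])                          # requires filters == 2 * channels(x)
    return block2(merged, filters)                               # fourth Block2
```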
As shown in FIG. 7, Block5 includes two two-dimensional convolutional layers and a sigmoid layer; a hybrid attention mechanism is also introduced before the first two-dimensional convolutional layer.
In this embodiment, the two-dimensional convolutional layer in Block1 realizes 2-fold downsampling.
The convolution kernel sizes, numbers and strides of the two-dimensional convolutional layers in the first Block2 and the second Block2 are the same.
The number of convolution kernels of the two Block2s contained in the first Block4 increases to 48; in the second Block4 it increases to 96; and in the third Block4 it increases to 192. In each of the three Block4s, the first of the two Block2s (the third Block2) performs the 2-fold downsampling.
The number of convolution kernels of the first two-dimensional convolutional layer of Block5 increases to 768, and the second two-dimensional convolutional layer has a kernel size of 1 and one kernel. The prediction result is finally output through GlobalAveragePooling2D and a 1-dimensional sigmoid, which judges whether the sound is crying.
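Finally, a sketch assembling the blocks above into the overall network of fig. 3, reusing the helper functions and imports from the earlier sketches. The input feature-map shape (e.g. 128 mel bins × 256 frames), the 3×3 kernel of the first Block5 convolution and the exact placement of the two outer skip additions follow one plausible reading of the description and figures and should be treated as assumptions.

```python
from tensorflow.keras import Model

def build_crying_detector(input_shape=(128, 256, 1)) -> Model:
    """Attention residual network for cry detection (illustrative sketch)."""
    inp = layers.Input(shape=input_shape)

    # Block1: BatchNormalization + Conv2D, 24 kernels of 3x3, strides (1,2) -> 2-fold downsampling.
    x = layers.BatchNormalization()(inp)
    x = layers.Conv2D(24, 3, strides=(1, 2), padding="same")(x)
    b1_out = x

    x = block2(x, 24)                       # first Block2
    x = layers.Add()([x, b1_out])           # skip: Block1 output -> second Block2 input
    b2_in = x
    x = block2(x, 24)                       # second Block2
    x = layers.Add()([x, b2_in])            # skip: second Block2 input -> first Block4 input

    x = block4(x, 48)                       # first Block4
    x = block4(x, 96)                       # second Block4
    x = block4(x, 192)                      # third Block4

    # Block5: hybrid attention, Conv2D with 768 kernels, 1x1 Conv2D with 1 kernel,
    # then GlobalAveragePooling2D and a 1-dimensional sigmoid output.
    x = hybrid_attention(x, 192)
    x = layers.Conv2D(768, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(1, 1)(x)
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Activation("sigmoid")(x)
    return Model(inp, out)
```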
The recognition capability of the final model obtained in this example is shown in table 2:
Table 2: Comparison of the residual network of the invention with and without the attention mechanism
Model | Model score
Residual network without attention mechanism | 96.5%
Residual network with attention mechanism | 98.6%
As can be seen from table 2, after the attention mechanism is added to the residual error network, the residual error network performs better, and the residual error network itself also solves the problem of gradient disappearance possibly caused by too deep convolutional neural network.
The invention discloses a crying detection system based on attention residual error learning, which comprises:
the first data acquisition module: used for collecting the sound data to be detected;
the second data acquisition module: used for collecting sample data;
a data preprocessing module: used for preprocessing the sample data;
a feature extraction module: used for extracting audio features from the sample data;
a crying model module: used for training a residual neural network based on the attention mechanism on the audio features of the sample data to obtain a crying model;
a crying identification module: used for inputting the sound data to be detected into the crying model for calculation and determining whether the sound data to be detected is crying.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A crying detection method based on attention residual learning, characterized by comprising the following steps:
s1, collecting crying data;
s2, dividing the crying data into a training set and a verification set;
s3, training the constructed residual error neural network based on the attention mechanism by adopting a training set to obtain a trained residual error neural network; evaluating the training result by adopting a verification set;
the residual neural network comprises a Block1, a first Block2, a second Block2, a first Block4, a second Block4, a third Block4 and a Block5 which are sequentially connected in series; the output of the Block1 is connected with the input of the second Block2 through a skip connection unit, and the input of the second Block2 is connected with the input of the first Block4 through a skip connection unit;
a hybrid attention mechanism is introduced into each of the first Block2, the second Block2, the first Block4, the second Block4, the third Block4 and the Block5;
the Block1 includes a two-dimensional convolutional layer for implementing 2-fold downsampling;
the first Block4, the second Block4 and the third Block4 each comprise a third Block2 and a fourth Block2; the third Block2 is connected in series with the fourth Block2;
the first Block2, the second Block2, the third Block2 and the fourth Block2 all comprise two-dimensional convolutional layers, and a hybrid attention mechanism is introduced after the second two-dimensional convolutional layer;
the Block5 includes two-dimensional convolutional layers and a sigmoid layer, and a hybrid attention mechanism is introduced before the first two-dimensional convolutional layer;
the input of the third Block2 is connected with the input of the fourth Block2 through a skip connection unit comprising a Block3;
the Block3 includes two parallel two-dimensional pooling layers and a concatenate layer for combining and outputting the outputs of the two-dimensional pooling layers along the last tensor dimension.
2. The method of claim 1, characterized in that: the pooling region of the two-dimensional pooling layers in the Block3 contained in the third Block4 is used to implement 2-fold downsampling, padding is used to make the output area equal to the input area, and the concatenate layer is used to combine and output the outputs of the two parallel two-dimensional pooling layers along the last tensor dimension.
3. The method of claim 1 or 2, characterized in that: the formula of the hybrid attention mechanism is as follows:
S = σ((F_up(F_res(F_res(F_dn(U)) + F_up(F_res(F_res(F_dn(F_res(F_dn(U))))))))·W_1 + b_1)·W_2 + b_2)    (1)
In formula (1), F_dn denotes max pooling, F_up denotes bilinear interpolation, F_res denotes the residual-unit computation, σ denotes the sigmoid function, and S is the resulting attention mechanism weight; W_1 and W_2 are the convolution kernel weights; b_1 and b_2 are the convolution kernel biases.
4. The method of claim 1 or 2, wherein the method comprises: in S1, the sample is amplified according to the signal-to-noise ratio.
5. The method of claim 1, characterized in that: the collected crying data is preprocessed before S2, and the preprocessing includes two operations:
the first: pre-emphasis is carried out on the voice signal;
the second: the voice signal is framed and windowed.
6. The method of claim 1 or 5, wherein the method comprises: in the step S3, feature extraction is performed on the data in the training set, and the extracted audio features are used for training a residual error neural network;
the audio features comprise at least one of short-time zero-crossing rate, short-time average energy, short-time average amplitude, energy entropy, spectral centroid, spectral entropy, spectral flux, mel-frequency cepstral coefficients and chromagram.
7. A crying detection system based on attention residual learning, characterized by comprising:
the first data acquisition module: used for collecting the sound data to be detected;
the second data acquisition module: used for collecting sample data;
a data preprocessing module: used for preprocessing the sample data;
a feature extraction module: used for extracting audio features from the sample data;
a crying model module: used for training a residual neural network based on the attention mechanism on the audio features of the sample data to obtain a crying model;
a crying identification module: used for inputting the sound data to be detected into the crying model for calculation and determining whether the sound data to be detected is crying;
the residual neural network comprises a Block1, a first Block2, a second Block2, a first Block4, a second Block4, a third Block4 and a Block5 which are sequentially connected in series; the output of the Block1 is connected with the input of the second Block2 through a skip connection unit, and the input of the second Block2 is connected with the input of the first Block4 through a skip connection unit;
a hybrid attention mechanism is introduced into each of the first Block2, the second Block2, the first Block4, the second Block4, the third Block4 and the Block5;
the Block1 includes a two-dimensional convolutional layer for implementing 2-fold downsampling;
the first Block4, the second Block4 and the third Block4 each comprise a third Block2 and a fourth Block2; the third Block2 is connected in series with the fourth Block2;
the first Block2, the second Block2, the third Block2 and the fourth Block2 all comprise two-dimensional convolutional layers, and a hybrid attention mechanism is introduced after the second two-dimensional convolutional layer;
the Block5 includes two-dimensional convolutional layers and a sigmoid layer, and a hybrid attention mechanism is introduced before the first two-dimensional convolutional layer;
the input of the third Block2 is connected with the input of the fourth Block2 through a skip connection unit comprising a Block3;
the Block3 includes two parallel two-dimensional pooling layers and a concatenate layer for combining and outputting the outputs of the two-dimensional pooling layers along the last tensor dimension.
CN202110224859.0A 2021-03-01 2021-03-01 Crying detection method and system based on attention residual learning Active CN112863550B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110224859.0A CN112863550B (en) 2021-03-01 2021-03-01 Crying detection method and system based on attention residual learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110224859.0A CN112863550B (en) 2021-03-01 2021-03-01 Crying detection method and system based on attention residual learning

Publications (2)

Publication Number Publication Date
CN112863550A CN112863550A (en) 2021-05-28
CN112863550B true CN112863550B (en) 2022-08-16

Family

ID=75990713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110224859.0A Active CN112863550B (en) 2021-03-01 2021-03-01 Crying detection method and system based on attention residual learning

Country Status (1)

Country Link
CN (1) CN112863550B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992121B (en) * 2021-03-01 2022-07-12 德鲁动力科技(成都)有限公司 Voice enhancement method based on attention residual error learning
CN113851115A (en) * 2021-09-07 2021-12-28 中国海洋大学 Complex sound identification method based on one-dimensional convolutional neural network
CN114333898A (en) * 2021-12-10 2022-04-12 科大讯飞股份有限公司 Sound event detection method, device and system and readable storage medium
CN116386661B (en) * 2023-06-05 2023-08-08 成都启英泰伦科技有限公司 Crying detection model training method based on dual attention and data enhancement

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111477221A (en) * 2020-05-28 2020-07-31 中国科学技术大学 Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network
CN112992121A (en) * 2021-03-01 2021-06-18 德鲁动力科技(成都)有限公司 Voice enhancement method based on attention residual error learning
CN113012714A (en) * 2021-02-22 2021-06-22 哈尔滨工程大学 Acoustic event detection method based on pixel attention mechanism capsule network model

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107818779A (en) * 2017-09-15 2018-03-20 北京理工大学 A kind of infant's crying sound detection method, apparatus, equipment and medium
CN108511002B (en) * 2018-01-23 2020-12-01 太仓鸿羽智能科技有限公司 Method for recognizing sound signal of dangerous event, terminal and computer readable storage medium
CN110110729B (en) * 2019-03-20 2022-08-30 中国地质大学(武汉) Building example mask extraction method for realizing remote sensing image based on U-shaped CNN model
WO2020222985A1 (en) * 2019-04-30 2020-11-05 The Trustees Of Dartmouth College System and method for attention-based classification of high-resolution microscopy images
CN110600059B (en) * 2019-09-05 2022-03-15 Oppo广东移动通信有限公司 Acoustic event detection method and device, electronic equipment and storage medium
CN110675405B (en) * 2019-09-12 2022-06-03 电子科技大学 Attention mechanism-based one-shot image segmentation method
KR102276964B1 (en) * 2019-10-14 2021-07-14 고려대학교 산학협력단 Apparatus and Method for Classifying Animal Species Noise Robust
CN111859954A (en) * 2020-07-01 2020-10-30 腾讯科技(深圳)有限公司 Target object identification method, device, equipment and computer readable storage medium
CN112382311B (en) * 2020-11-16 2022-08-19 谭昊玥 Infant crying intention identification method and device based on hybrid neural network
CN112382302A (en) * 2020-12-02 2021-02-19 漳州立达信光电子科技有限公司 Baby cry identification method and terminal equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111477221A (en) * 2020-05-28 2020-07-31 中国科学技术大学 Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network
CN113012714A (en) * 2021-02-22 2021-06-22 哈尔滨工程大学 Acoustic event detection method based on pixel attention mechanism capsule network model
CN112992121A (en) * 2021-03-01 2021-06-18 德鲁动力科技(成都)有限公司 Voice enhancement method based on attention residual error learning

Also Published As

Publication number Publication date
CN112863550A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
CN112863550B (en) Crying detection method and system based on attention residual learning
CN112992121B (en) Voice enhancement method based on attention residual error learning
Dewi et al. The study of baby crying analysis using MFCC and LFCC in different classification methods
CN103489454B (en) Based on the sound end detecting method of wave configuration feature cluster
Sroka et al. Human and machine consonant recognition
CN111292762A (en) Single-channel voice separation method based on deep learning
JPH08509556A (en) Method and system for detecting and generating transients in acoustic signals
CN103054586B (en) Chinese speech automatic audiometric method based on Chinese speech audiometric dynamic word list
JP2015096921A (en) Acoustic signal processing device and method
CN112185342A (en) Voice conversion and model training method, device and system and storage medium
Roy et al. DeepLPC-MHANet: Multi-head self-attention for augmented Kalman filter-based speech enhancement
Roy et al. DeepLPC: A deep learning approach to augmented Kalman filter-based single-channel speech enhancement
Wu et al. Research on acoustic feature extraction of crying for early screening of children with autism
CN103258537A (en) Method utilizing characteristic combination to identify speech emotions and device thereof
Eklund Data augmentation techniques for robust audio analysis
CN114255780B (en) Noise robust blind reverberation time estimation method based on deep neural network
CN115565550A (en) Baby crying emotion identification method based on characteristic diagram light convolution transformation
Adam et al. Wavelet cesptral coefficients for isolated speech recognition
CN116386589A (en) Deep learning voice reconstruction method based on smart phone acceleration sensor
CN114302301B (en) Frequency response correction method and related product
Cai et al. The best input feature when using convolutional neural network for cough recognition
Zhang et al. URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement
CN114446316A (en) Audio separation method, and training method, device and equipment of audio separation model
CN107039046B (en) Voice sound effect mode detection method based on feature fusion
Gupta et al. Morse wavelet transform-based features for voice liveness detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant