CN112863550A - Crying detection method and system based on attention residual learning

Crying detection method and system based on attention residual learning

Info

Publication number
CN112863550A
CN112863550A (application number CN202110224859.0A)
Authority
CN
China
Prior art keywords
crying
block2
data
block4
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110224859.0A
Other languages
Chinese (zh)
Other versions
CN112863550B (en)
Inventor
李学生
李晨
朱麒宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Delu Power Technology Chengdu Co Ltd
Original Assignee
Delu Power Technology Chengdu Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Delu Power Technology Chengdu Co Ltd filed Critical Delu Power Technology Chengdu Co Ltd
Priority to CN202110224859.0A priority Critical patent/CN112863550B/en
Publication of CN112863550A publication Critical patent/CN112863550A/en
Application granted granted Critical
Publication of CN112863550B publication Critical patent/CN112863550B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention relates to a crying detection method and system based on attention residual learning, comprising: S1, collecting crying data; S2, dividing the crying data into a training set and a validation set; S3, training the constructed attention-based residual neural network with the training set to obtain a trained attention-based residual neural network, and evaluating the training results with the validation set. Introducing the residual network solves the vanishing-gradient problem of CNN models with many layers, and introducing the attention mechanism lets the residual model focus on features that better express crying, so the method improves the accuracy of crying recognition in real scenes and the generalization ability in practical scenes.

Description

Crying detection method and system based on attention residual learning
Technical Field
The invention relates to the technical field of voice recognition, in particular to a crying detection method and system based on attention residual learning.
Background
Existing voice recognition on quadruped robots, in particular home companion robot dogs, lacks abnormal-sound detection. Crying is the main way infants express themselves, and automatic detection of infant crying plays an important role in home companionship and can effectively reduce the burden on caregivers. There have been many studies on feature and model selection and mechanisms for infant cry detection; traditional machine learning methods such as SVM, and CNN models that classify spectrograms, are commonly used.
Traditional machine learning methods such as SVM generally depend on feature selection: the quality of the chosen features determines the quality of the recognition result, and it is difficult to select features that comprehensively reflect the characteristics of infant crying. A convolutional neural network can learn features from a spectrogram, but training becomes difficult as the layers deepen, and shallow CNN models used for infant cry detection give poor results. The main challenge of cry recognition in a real environment is the uncertainty and instability of noise.
In a real environment containing non-stationary noise, using only a single feature or too few features makes the cry recognition rate too low; yet with complex features, a deep CNN network model suffers from a potential vanishing-gradient problem.
Disclosure of Invention
To solve the above technical problems, the invention provides a crying detection method and system based on attention residual learning.
The invention is realized by the following technical scheme:
the crying detection method based on attention residual learning comprises the following steps:
S1, collecting crying data;
S2, dividing the crying data into a training set and a validation set;
S3, training the constructed attention-based residual neural network with the training set to obtain a trained residual neural network, and evaluating the training results with the validation set.
Further, the residual neural network comprises a Block1, a first Block2, a second Block2, a first Block4, a second Block4, a third Block4 and a Block5 which are sequentially connected in series; the output of the Block1 is connected with the input of the second Block2 through a skip connection unit, and the input of the second Block2 is connected with the input of the first Block4 through a skip connection unit;
the first Block2, the second Block2, the first Block4, the second Block4, the third Block4 and the Block5 all introduce a mixed attention mechanism.
Further, the Block1 includes a two-dimensional convolutional layer for achieving 2-fold down-sampling;
the first Block4, the second Block4 and the third Block4 all comprise a third Block2 and a fourth Block 2; the third Block2 is connected in series with a fourth Block 2;
the first Block2, the second Block2, the third Block2 and the fourth Block2 all comprise two two-dimensional convolutional layers, and a mixed attention mechanism is introduced after the second two-dimensional convolutional layer;
the Block5 comprises two two-dimensional convolutional layers and a sigmoid layer, and a mixed attention mechanism is introduced before the first two-dimensional convolutional layer.
Further, the input of the third Block2 is connected with the input of the fourth Block2 through a skip connection unit comprising a Block 3;
the Block3 comprises two parallel two-dimensional pooling layers and a concatenate layer for combining and outputting the outputs of the two-dimensional pooling layers in the last tensor dimension.
Further, the pooling region of the two-dimensional pooling layer in the Block3 contained in the third Block4 is used to implement 2-fold downsampling, padding is used to keep the output image area equal to the input image area, and the concatenate layer is used to combine and output the outputs of the two parallel two-dimensional pooling layers in the last tensor dimension.
Further, the formula of the mixed attention mechanism is as follows:
S = σ((F_up(F_res(F_res(F_dn(U)) + F_up(F_res(F_res(F_dn(F_res(F_dn(U))))))))*W1 + b1)*W2 + b2)   (1)
In formula (1), F_dn denotes max pooling, F_up denotes bilinear interpolation, S is the resulting attention weight, F_res denotes a residual computation, and σ denotes the sigmoid function; W1 and W2 are convolution kernel weights; b1 and b2 are convolution kernel biases.
Furthermore, the number of convolution kernels of the two-dimensional convolution layer of Block1 is 24;
the convolution kernel size, the number of kernels and the stride of the two-dimensional convolutional layers in the first Block2 and the second Block2 are the same;
the number of convolution kernels of the two-dimensional convolution layers contained in the third Block2 and the fourth Block2 in the first Block4 is 48;
the number of convolution kernels of the two-dimensional convolution layers contained in the third Block2 and the fourth Block2 in the second Block4 is 96;
the number of convolution kernels of the two-dimensional convolution layers contained in the third Block2 and the fourth Block2 in the third Block4 is 192;
the third Block2 is used to implement 2 times downsampling;
the number of convolution kernels of the first two-dimensional convolution layer of Block5 is increased to 768; the convolution kernel size of the second two-dimensional convolution layer is 1 and the number is 1.
Further, in S1, the samples are augmented according to the signal-to-noise ratio.
Further, the collected crying data are preprocessed before S2, and the preprocessing comprises two modes:
Mode 1: pre-emphasis of the speech signal;
Mode 2: framing and windowing of the speech signal.
Further, in S3, feature extraction is first performed on the data in the training set, and the extracted audio features are used to train the residual neural network;
the audio features comprise at least one of the short-time zero-crossing rate, short-time average energy, short-time average amplitude, energy entropy, spectral centroid, spectral entropy, spectral flux, Mel-frequency cepstral coefficients and chromagram.
A crying detection system based on attention residual learning, comprising:
a first data acquisition module: used for collecting the sound data to be detected;
a second data acquisition module: used for collecting sample data;
a data preprocessing module: used for preprocessing the sample data;
a feature extraction module: used for extracting the audio features of the sample data;
a crying model module: used for training on the audio features of the training sample data with an attention-based residual neural network algorithm to obtain a crying model;
a crying identification module: used for inputting the sound data to be detected into the crying model for calculation and determining whether the sound data to be detected is crying.
Compared with the prior art, the invention has the following beneficial effects:
the method of introducing the residual network solves the problem that the gradient of the CNN model with extremely large layer number disappears, and introduces an attention mechanism to enable the residual model to be more added with the characteristic of being capable of expressing crying; the method can improve the accuracy of crying recognition in a real scene and improve the generalization capability in a real scene.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention.
FIG. 1 is a flow chart of model training;
FIG. 2 is a schematic diagram of a residual block;
FIG. 3 is a block diagram of a residual neural network based on an attention mechanism;
FIG. 4 is a Block diagram of Block 1;
FIG. 5 is a Block diagram of Block 2;
FIG. 6 is a Block diagram of Block 3;
FIG. 7 is a Block diagram of Block 5.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
As shown in fig. 1, the crying detection method based on attention residual learning disclosed by the invention comprises the following steps:
S1, collecting crying data;
S2, dividing the crying data into a training set and a validation set;
S3, training the constructed attention-based residual neural network with the training set to obtain a trained attention-based residual neural network, and evaluating the training results with the validation set.
The sound data to be detected are then input into the trained attention-based residual neural network to identify whether they are crying.
Based on the above method, the present invention discloses an embodiment.
Example 1
As shown in fig. 1, the present embodiment includes the following steps:
Step 1: collecting crying data samples.
In this embodiment, the data set has three main sources:
450 clean crying recordings collected from the "donate-a-cry" project on GitHub; 40 crying recordings from the "crying baby" category of the ESC-50 dataset; and 400 crying recordings collected manually from the internet.
Silent segments were removed from all data, and each clip is 5 seconds long; negative samples were taken from the other categories of the ESC-50 dataset. In total there are 890 positive samples (baby crying) and 900 negative samples, so the positive and negative samples are roughly balanced.
Because the number of data samples is small, and in order to better match the practical application environment, this embodiment also performs data augmentation on the collected sample data, as follows:
Common indoor environmental noise in a family room, such as air-conditioning sound, is selected from UrbanSound8K. Tests show that different signal-to-noise ratios lead to different model performance: the accuracy of the model deteriorates noticeably when the noise intensity is higher (the signal-to-noise ratio is lower). A signal-to-noise ratio of 35 dB was finally chosen for sample augmentation.
Signal-to-noise ratio: the ratio of the power of the useful speech signal to the power of the noise mixed into it. The signal-to-noise ratio can be calculated with equation (1):
SNR = 10*log10( Σ s²(n) / Σ r²(n) )   (1)
In formula (1), s(n) is the speech signal and r(n) is the noise signal.
In this embodiment, 50% of the sample data were finally selected for augmentation, giving 1335 positive samples and 1350 negative samples.
In this way, noise is added to the data at different signal-to-noise ratios to augment the sample data, which improves the accuracy of crying recognition in real scenes and the generalization ability in practical scenes.
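As a minimal illustration of this augmentation step, the sketch below mixes a noise clip into a crying clip at a target signal-to-noise ratio following equation (1). It assumes both clips are mono NumPy arrays at the same sample rate; the function and variable names are illustrative rather than taken from the patent.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float = 35.0) -> np.ndarray:
    """Scale `noise` so the mixture speech + noise reaches the requested SNR (in dB)."""
    # Repeat or trim the noise so it covers the whole speech clip.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    # Average powers of the two signals (equation (1) uses their ratio).
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12

    # Solve 10*log10(p_speech / (gain**2 * p_noise)) = snr_db for the noise gain.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

With snr_db=35.0 this reproduces the 35 dB setting chosen above; lowering the value simulates the stronger noise for which the model accuracy was observed to degrade.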
Step 2: data preprocessing.
In this embodiment, two preprocessing operations are mainly used: pre-emphasis, and framing with windowing.
2.1 pre-emphasis.
When sound is produced in the vocal tract, its energy is concentrated at low frequencies, and the high-frequency components are attenuated more strongly during speech production and may be filtered out in later processing. Pre-emphasis compensates for the attenuation of the high-frequency part: the audio signal is passed through a first-order FIR high-pass filter, which makes the spectrum of the pre-emphasized speech signal flatter. The pre-emphasis transfer function is:
H(z) = 1 - αz^(-1)   (2)
In formula (2), α is a constant, the pre-emphasis coefficient; it determines the pre-emphasis strength and takes values in the range 0.9 < α < 1.
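A minimal sketch of the pre-emphasis filter of equation (2), assuming a typical coefficient of α = 0.97 (the patent only bounds the coefficient to 0.9 < α < 1):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Apply H(z) = 1 - alpha*z^(-1), i.e. y[n] = x[n] - alpha*x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```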
2.2 framing and windowing.
In an audio signal the frequency content changes over time, so features cannot be extracted directly from the whole audio segment. The speech signal is therefore divided into segments of 10 ms to 30 ms, over which it can be regarded as stationary.
Framing is typically achieved by windowing. The windowing formula is:
S_w(n) = S(n)W(n)   (3)
In formula (3), S(n) is the original signal and W(n) is the window function.
Commonly used window functions are rectangular windows, Hamming windows, Hanning windows.
The rectangular window formula:
W(n) = 1 for 0 ≤ n ≤ N-1, and W(n) = 0 otherwise   (4)
The Hamming window formula:
W(n) = 0.54 - 0.46*cos(2πn/(N-1)) for 0 ≤ n ≤ N-1, and W(n) = 0 otherwise   (5)
The Hanning window formula:
W(n) = 0.5*(1 - cos(2πn/(N-1))) for 0 ≤ n ≤ N-1, and W(n) = 0 otherwise   (6)
this embodiment takes a Hamming window as the window function. And the values chosen in window length and frame shift are: the window length is 2048 points, the frame shift is 1024 points, and the best effect is achieved in the subsequent feature extraction.
Step 3: constructing the training set and the validation set.
Step 4: feature extraction and feature combination on the data in the training set.
Although a neural network has a strong ability to extract the information contained in data, processing the raw audio signal directly is very difficult, so feature engineering is necessary. Good feature extraction can greatly improve the recognition performance of the neural network as well as the training accuracy and efficiency. Feature extraction for speech is very mature; the commonly used speech features are as follows:
1. Short-time zero-crossing rate: the number of zero crossings of the signal per unit time is defined as the zero-crossing rate; the short-time zero-crossing rate corresponds intuitively to the number of times the signal waveform crosses the time axis.
2. Short-time average energy: the short-time average energy can help distinguish unvoiced and voiced sounds; under high signal-to-noise ratio, with clean signals and few noise components, it can be used to separate sound segments from silent segments and thereby cut out the silence.
The short-time average energy is mathematically defined as the windowed sum of squares of the signal amplitudes within one frame:
E_n = Σ_m [x(m)w(n-m)]²   (7)
In formula (7), x(m) is the sound signal and w(·) is the window function.
3. Short-time average amplitude: the short-time average energy requires the sum of squares of the signal samples, and squaring is too sensitive to the signal level; if a high level is encountered during the computation, the energy can increase sharply and even overflow. To overcome this drawback, the short-time average amplitude replaces the sum of squares with a sum of absolute values; it can likewise measure changes in sound intensity. Its mathematical expression is:
M_n = Σ_m |x(m)|w(n-m)   (8)
In formula (8), x(m) is the sound signal and w(·) is the window function.
4. Energy entropy: the energy entropy describes the degree of temporal variation of the audio signal and can be used as an audio feature. It takes a higher value if there is a sudden change in the energy envelope of the signal.
5. Spectral centroid: the spectral centroid indicates in which frequency band the sound energy is concentrated. The higher the spectral centroid, the more the signal energy is concentrated at high frequencies. Sounds with more low-frequency components sound lower and duller and have a relatively low spectral centroid; sounds with more high-frequency components sound brighter and have a relatively high spectral centroid.
6. Spectral entropy: the spectral entropy measures the complexity contained in the audio signal; the greater the complexity, the greater the spectral entropy. Its mathematical expression is:
H = -Σ_w f(w)log f(w)   (9)
In formula (9), f(w) is the spectral density function of one frame of the signal.
7. Spectral flux: the spectral flux quantifies the change of the spectrum over time. A signal with a stable or nearly constant spectrum, such as white Gaussian noise, has low spectral flux, while a signal with abrupt spectral changes has high spectral flux.
8. Mel-frequency cepstral coefficients: the Mel-frequency cepstral coefficients (MFCC) are an important feature in speech processing. They are obtained by applying a linear cosine transform to the log power of the signal on a nonlinear Mel frequency scale, and they reflect the nonlinear character of human auditory perception of frequency. The Mel scale is expressed as:
Mel(f) = 2595*log10(1 + f/700)   (10)
In formula (10), f is the linear frequency in Hz.
9. Chromagram: the chromagram divides the whole spectrum into 12 bands corresponding to the 12 semitones of a musical octave, which can be distinguished by chroma.
The results obtained by training with different combinations of features are shown in Table 1:
table 1: lifting capability meter for model by different characteristic combination
In Table 1, MSG denotes the log-Mel spectrogram, MFCC the Mel-frequency cepstral coefficients, CG the chromagram, and ZCR the zero-crossing rate.
The audio features finally selected in this embodiment are therefore the combination of the log-Mel spectrogram, Mel-frequency cepstral coefficients, chromagram, and zero-crossing rate.
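A sketch of how this feature combination could be extracted with librosa, reusing the 2048/1024 frame parameters from Step 2. Stacking the features frame-wise into a single matrix, and the choice of 20 MFCC coefficients, are assumptions: the patent does not specify how the features are arranged before being fed to the network.

```python
import numpy as np
import librosa

def extract_features(y: np.ndarray, sr: int) -> np.ndarray:
    """Log-Mel spectrogram (MSG) + MFCC + chromagram (CG) + zero-crossing rate (ZCR)."""
    n_fft, hop = 2048, 1024
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    log_mel = librosa.power_to_db(mel)                                              # MSG
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=n_fft, hop_length=hop)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop)   # CG
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)  # ZCR
    return np.vstack([log_mel, mfcc, chroma, zcr])    # shape: (n_features, n_frames)
```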
Step 5: designing the attention-based residual neural network and training it with the training set.
The performance of a convolutional neural network is strongly related to its depth, and a deeper network structure can improve the recognition effect. In practice, however, once a convolutional network reaches a certain depth its performance no longer improves and may even become worse; this phenomenon is called gradient vanishing. Residual blocks are added to the convolutional network, and the residual units are connected across layers, so that in a deep convolutional network the output of some layers can be passed directly across intermediate layers to later layers.
As shown in Fig. 2, the residual block passes the input directly to the output through a skip connection and adds it to the output F(x) of the stacked layers; the learning objective of the network changes accordingly: it is no longer the overall mapping H(x) but the residual, i.e. the difference between the output and the input.
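A minimal Keras sketch of the residual idea in Fig. 2: the stacked convolutions learn the residual F(x) = H(x) - x and the skip connection adds the input back. The layer sizes here are illustrative rather than the patent's exact configuration; the same helper is reused as a stand-in in the later sketches.

```python
from tensorflow.keras import layers

def residual_block(x, filters: int = 24):
    """Return F(x) + x, so the convolutions only need to learn the residual F(x)."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    if shortcut.shape[-1] != filters:          # match channel counts before the addition
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)
```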
This embodiment designs a residual neural network based on the attention mechanism as shown in Fig. 3, which comprises Block1, a first Block2, a second Block2, a first Block4, a second Block4, a third Block4 and Block5 connected in series in sequence.
The output of Block1 is connected to the input of the second Block2 through a skip connection unit, and the input of the second Block2 is connected to the input of the first Block4 through a skip connection unit. The first Block4, the second Block4 and the third Block4 each comprise a third Block2, a fourth Block2 and a Block3. The third Block2 is connected in series with the fourth Block2, and the input of the third Block2 is connected to the input of the fourth Block2 through a skip connection unit comprising the Block3.
As shown in Fig. 4, Block1 includes a Batch Normalization layer and a two-dimensional convolutional layer (Conv2D) that achieves 2-fold downsampling. The two-dimensional convolutional layer has 24 kernels of size 3×3 with strides of (1, 2), realizing the 2-fold downsampling.
As shown in Fig. 5, Block2 includes two two-dimensional convolutional layers, and a mixed attention mechanism (interpolated attention) is introduced after the second one. The formula of the mixed attention mechanism is as follows:
S = σ((F_up(F_res(F_res(F_dn(U)) + F_up(F_res(F_res(F_dn(F_res(F_dn(U))))))))*W1 + b1)*W2 + b2)   (11)
In formula (11), F_dn denotes max pooling, F_up denotes bilinear interpolation, S is the resulting attention weight, F_res denotes a residual computation, and σ denotes the sigmoid function; W1 and W2 are convolution kernel weights; b1 and b2 are convolution kernel biases.
With the mixed attention mechanism introduced, the number of channels is unchanged from input to output at each layer of the network. The module reduces the spatial dimensions by downsampling, which enlarges the receptive field of the convolutional features and allows the regions containing high-frequency features in the input to be inferred effectively; it then upsamples by interpolation, restoring the dimensions while locating the feature regions more precisely.
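The sketch below shows one way the attention weights S of formula (11) could be realized in Keras, reusing the residual_block helper above as a stand-in for F_res, whose internal structure the patent does not detail. Max pooling plays the role of F_dn, bilinear upsampling the role of F_up, and two 1x1 convolutions supply W1, b1 and W2, b2; the resulting mask rescales the input features, keeping the channel count unchanged. Input spatial dimensions divisible by 4 are assumed.

```python
from tensorflow.keras import layers

def mixed_attention(u):
    """Attention weights S per formula (11); the input features are rescaled as S * U."""
    filters = u.shape[-1]                             # channel count stays unchanged
    f_res = lambda t: residual_block(t, filters)      # stand-in for F_res
    f_dn = lambda t: layers.MaxPooling2D(pool_size=2)(t)                           # F_dn
    f_up = lambda t: layers.UpSampling2D(size=2, interpolation="bilinear")(t)      # F_up

    b = f_res(f_dn(u))                                # F_res(F_dn(U)): shallow branch
    deep = f_res(f_res(f_dn(b)))                      # deeper branch at quarter resolution
    merged = f_res(layers.Add()([b, f_up(deep)]))     # add the upsampled deep branch, refine
    mask = f_up(merged)                               # F_up back to the input resolution
    mask = layers.Conv2D(filters, 1)(mask)            # (.)*W1 + b1
    mask = layers.Conv2D(filters, 1, activation="sigmoid")(mask)   # sigma((.)*W2 + b2)
    return layers.Multiply()([u, mask])               # apply S to the input features U
```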
As shown in Fig. 6, Block3 includes two parallel two-dimensional pooling layers and a concatenate layer that combines and outputs the outputs of the two pooling layers in the last tensor dimension.
In the Block3 contained in the third Block4, the pooling region of MaxPooling2D realizes 2-fold downsampling, padding keeps the output image area equal to the input image area, and the concatenate layer combines and outputs the outputs of the two parallel two-dimensional pooling layers in the last tensor dimension.
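A sketch of a Block3-like module under these assumptions; the description names only the max-pooling branch, so using average pooling for the second parallel branch is purely an illustrative guess.

```python
from tensorflow.keras import layers

def block3(x, downsample: bool = False):
    """Two parallel 2-D pooling branches concatenated along the last tensor dimension."""
    pool = 2 if downsample else 1        # the Block3 inside the third Block4 downsamples by 2
    branch_a = layers.MaxPooling2D(pool_size=pool, padding="same")(x)
    branch_b = layers.AveragePooling2D(pool_size=pool, padding="same")(x)   # assumed branch
    return layers.Concatenate(axis=-1)([branch_a, branch_b])
```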
As shown in Fig. 7, Block5 includes two two-dimensional convolutional layers and a sigmoid layer, and a mixed attention mechanism is also introduced before the first two-dimensional convolutional layer.
In this embodiment, the two-dimensional convolutional layer in Block1 realizes the 2-fold downsampling.
The number of convolution kernels and the strides of the two-dimensional convolutional layers in the first Block2 and the second Block2 are the same.
However, the number of convolution kernels of the two-dimensional convolutional layers contained in the two Block2s of the first Block4 increases to 48; in the second Block4 it increases to 96; and in the third Block4 it increases to 192. In each of the three Block4s, the third Block2 performs the 2-fold downsampling.
The number of convolution kernels of the first two-dimensional convolutional layer of Block5 increases to 768, and the second two-dimensional convolutional layer has a kernel size of 1 and a single kernel. The prediction result is finally output through GlobalAveragePooling2D and a one-dimensional sigmoid, which judges whether the sound is crying.
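A sketch of the Block5 output head as just described, reusing the mixed_attention helper above. The 768 and 1 kernel counts follow the description, while the 3x3 size of the first convolution is an assumption.

```python
from tensorflow.keras import layers

def block5(x):
    """Mixed attention, a 768-kernel conv, a 1x1 conv with one kernel, then GAP + sigmoid."""
    x = mixed_attention(x)                               # attention before the first conv
    x = layers.Conv2D(768, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(1, 1)(x)                           # kernel size 1, a single kernel
    x = layers.GlobalAveragePooling2D()(x)
    return layers.Activation("sigmoid")(x)               # probability that the clip is crying
```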
The recognition capability of the final model obtained in this embodiment is shown in Table 2:
table 2: comparison table of residual error network of the invention and without using attention mechanism
Model (model) Model score
Residual error network without attention mechanism 96.5%
Residual error network with attention mechanism 98.6%
As can be seen from Table 2, the residual network performs better after the attention mechanism is added, and the residual network itself also solves the vanishing-gradient problem that can arise when a convolutional neural network is too deep.
The invention discloses a crying detection system based on attention residual learning, which comprises:
a first data acquisition module: used for collecting the sound data to be detected;
a second data acquisition module: used for collecting sample data;
a data preprocessing module: used for preprocessing the sample data;
a feature extraction module: used for extracting the audio features of the sample data;
a crying model module: used for training on the audio features of the training sample data with an attention-based residual neural network algorithm to obtain a crying model;
a crying identification module: used for inputting the sound data to be detected into the crying model for calculation and determining whether the sound data to be detected is crying.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A crying detection method based on attention residual learning, characterized by comprising the following steps:
S1, collecting crying data;
S2, dividing the crying data into a training set and a validation set;
S3, training the constructed attention-based residual neural network with the training set to obtain a trained residual neural network, and evaluating the training results with the validation set.
2. The method of claim 1, wherein the method comprises: the residual neural network comprises a Block1, a first Block2, a second Block2, a first Block4, a second Block4, a third Block4 and a Block5 which are sequentially connected in series; the output of the Block1 is connected with the input of the second Block2 through a skip connection unit, and the input of the second Block2 is connected with the input of the first Block4 through a skip connection unit;
the first Block2, the second Block2, the first Block4, the second Block4, the third Block4 and the Block5 all introduce a mixed attention mechanism.
3. The method of claim 2, wherein the method comprises: the Block1 includes a two-dimensional convolutional layer for implementing 2-fold down-sampling;
the first Block4, the second Block4 and the third Block4 all comprise a third Block2 and a fourth Block 2; the third Block2 is connected in series with a fourth Block 2;
the first Block2, the second Block2, the third Block2 and the fourth Block2 all comprise two two-dimensional convolutional layers, and a mixed attention mechanism is introduced after the second two-dimensional convolutional layer;
the Block5 comprises two two-dimensional convolutional layers and a sigmoid layer, and a mixed attention mechanism is introduced before the first two-dimensional convolutional layer.
4. The method of claim 2, wherein the method comprises: the input of the third Block2 is connected to the input of a fourth Block2 through a hopping connection unit comprising Block 3;
the Block3 comprises two parallel two-dimensional pooling layers and a concatenate layer for combining and outputting the outputs of the two-dimensional pooling layers in the last tensor dimension.
5. The method of claim 4, wherein: the pooling region of the two-dimensional pooling layer in the Block3 contained in the third Block4 is used to implement 2-fold downsampling, padding is used to keep the output image area equal to the input image area, and the concatenate layer is used to combine and output the outputs of the two parallel two-dimensional pooling layers in the last tensor dimension.
6. The method of claim 2, 3, 4 or 5, wherein the formula of the mixed attention mechanism is as follows:
S = σ((F_up(F_res(F_res(F_dn(U)) + F_up(F_res(F_res(F_dn(F_res(F_dn(U))))))))*W1 + b1)*W2 + b2)   (1)
In formula (1), F_dn denotes max pooling, F_up denotes bilinear interpolation, S is the resulting attention weight, F_res denotes a residual computation, and σ denotes the sigmoid function; W1 and W2 are convolution kernel weights; b1 and b2 are convolution kernel biases.
7. The method of claim 3, 4 or 5, wherein in S1 the samples are augmented according to the signal-to-noise ratio.
8. The method of claim 1, wherein the collected crying data are preprocessed before S2, the preprocessing comprising two modes:
Mode 1: pre-emphasis of the speech signal;
Mode 2: framing and windowing of the speech signal.
9. The method of claim 1 or 8, wherein in S3 feature extraction is performed on the data in the training set, and the extracted audio features are used to train the residual neural network;
the audio features comprise at least one of the short-time zero-crossing rate, short-time average energy, short-time average amplitude, energy entropy, spectral centroid, spectral entropy, spectral flux, Mel-frequency cepstral coefficients and chromagram.
10. A crying detection system based on attention residual learning, characterized by comprising:
a first data acquisition module: used for collecting the sound data to be detected;
a second data acquisition module: used for collecting sample data;
a data preprocessing module: used for preprocessing the sample data;
a feature extraction module: used for extracting the audio features of the sample data;
a crying model module: used for training on the audio features of the training sample data with an attention-based residual neural network algorithm to obtain a crying model;
a crying identification module: used for inputting the sound data to be detected into the crying model for calculation and determining whether the sound data to be detected is crying.
CN202110224859.0A 2021-03-01 2021-03-01 Crying detection method and system based on attention residual learning Active CN112863550B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110224859.0A CN112863550B (en) 2021-03-01 2021-03-01 Crying detection method and system based on attention residual learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110224859.0A CN112863550B (en) 2021-03-01 2021-03-01 Crying detection method and system based on attention residual learning

Publications (2)

Publication Number Publication Date
CN112863550A true CN112863550A (en) 2021-05-28
CN112863550B CN112863550B (en) 2022-08-16

Family

ID=75990713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110224859.0A Active CN112863550B (en) 2021-03-01 2021-03-01 Crying detection method and system based on attention residual learning

Country Status (1)

Country Link
CN (1) CN112863550B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107818779A (en) * 2017-09-15 2018-03-20 北京理工大学 A kind of infant's crying sound detection method, apparatus, equipment and medium
CN108511002A (en) * 2018-01-23 2018-09-07 努比亚技术有限公司 The recognition methods of hazard event voice signal, terminal and computer readable storage medium
CN110110729A (en) * 2019-03-20 2019-08-09 中国地质大学(武汉) Construction example mask extracting method based on U-shaped CNN model realization remote sensing images
WO2020222985A1 (en) * 2019-04-30 2020-11-05 The Trustees Of Dartmouth College System and method for attention-based classification of high-resolution microscopy images
CN110600059A (en) * 2019-09-05 2019-12-20 Oppo广东移动通信有限公司 Acoustic event detection method and device, electronic equipment and storage medium
CN110675405A (en) * 2019-09-12 2020-01-10 电子科技大学 Attention mechanism-based one-shot image segmentation method
WO2021075709A1 (en) * 2019-10-14 2021-04-22 고려대학교 산학협력단 Apparatus and method for identifying animal species robustly against noisy environment
CN111477221A (en) * 2020-05-28 2020-07-31 中国科学技术大学 Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network
CN111859954A (en) * 2020-07-01 2020-10-30 腾讯科技(深圳)有限公司 Target object identification method, device, equipment and computer readable storage medium
CN112382311A (en) * 2020-11-16 2021-02-19 谭昊玥 Infant crying intention identification method and device based on hybrid neural network
CN112382302A (en) * 2020-12-02 2021-02-19 漳州立达信光电子科技有限公司 Baby cry identification method and terminal equipment
CN113012714A (en) * 2021-02-22 2021-06-22 哈尔滨工程大学 Acoustic event detection method based on pixel attention mechanism capsule network model
CN112992121A (en) * 2021-03-01 2021-06-18 德鲁动力科技(成都)有限公司 Voice enhancement method based on attention residual error learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YANGGANG LI: "Spatio-temporal deep residual network with hierarchical attentions for video event recognition", ACM Transactions on Multimedia Computing, Communications, and Applications *
张珍元: "Research on sound event detection technology based on neural networks", China Masters' Theses Full-text Database (Information Science and Technology) *
杨淋坚: "Rale detection method combining a residual network and an attention mechanism", Automation & Information Engineering *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992121A (en) * 2021-03-01 2021-06-18 德鲁动力科技(成都)有限公司 Voice enhancement method based on attention residual error learning
CN112992121B (en) * 2021-03-01 2022-07-12 德鲁动力科技(成都)有限公司 Voice enhancement method based on attention residual error learning
CN116386661A (en) * 2023-06-05 2023-07-04 成都启英泰伦科技有限公司 Crying detection model training method based on dual attention and data enhancement
CN116386661B (en) * 2023-06-05 2023-08-08 成都启英泰伦科技有限公司 Crying detection model training method based on dual attention and data enhancement

Also Published As

Publication number Publication date
CN112863550B (en) 2022-08-16

Similar Documents

Publication Publication Date Title
CN112992121B (en) Voice enhancement method based on attention residual error learning
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN103489454B (en) Based on the sound end detecting method of wave configuration feature cluster
Sroka et al. Human and machine consonant recognition
CN101023469B (en) Digital filtering method, digital filtering equipment
Dewi et al. The study of baby crying analysis using MFCC and LFCC in different classification methods
CN112863550B (en) Crying detection method and system based on attention residual learning
CN111292762A (en) Single-channel voice separation method based on deep learning
JP6482173B2 (en) Acoustic signal processing apparatus and method
CN103054586B (en) Chinese speech automatic audiometric method based on Chinese speech audiometric dynamic word list
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
CN104900238A (en) Audio real-time comparison method based on sensing filtering
JP2015096921A (en) Acoustic signal processing device and method
Roy et al. DeepLPC: A deep learning approach to augmented Kalman filter-based single-channel speech enhancement
Roy et al. DeepLPC-MHANet: Multi-head self-attention for augmented Kalman filter-based speech enhancement
Eklund Data augmentation techniques for robust audio analysis
CN112185342A (en) Voice conversion and model training method, device and system and storage medium
Wu et al. Research on acoustic feature extraction of crying for early screening of children with autism
Adam et al. Wavelet cesptral coefficients for isolated speech recognition
CN111261192A (en) Audio detection method based on LSTM network, electronic equipment and storage medium
Chen et al. InQSS: a speech intelligibility assessment model using a multi-task learning network
CN114255780B (en) Noise robust blind reverberation time estimation method based on deep neural network
CN114302301B (en) Frequency response correction method and related product
Salhi et al. Robustness of auditory teager energy cepstrum coefficients for classification of pathological and normal voices in noisy environments
Cai et al. The best input feature when using convolutional neural network for cough recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant