CN114595772A - Infant crying classification method based on Transformer fusion model - Google Patents

Infant crying classification method based on Transformer fusion model Download PDF

Info

Publication number
CN114595772A
Authority
CN
China
Prior art keywords
model
spectrogram
feature
transformer
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210236093.2A
Other languages
Chinese (zh)
Inventor
Li Bin (李彬)
Jiang Bo (江波)
Wang Yan (王妍)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202210236093.2A priority Critical patent/CN114595772A/en
Publication of CN114595772A publication Critical patent/CN114595772A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an infant cry classification method based on a Transformer fusion model. An input audio sample is transformed by an audio processing module to obtain a spectrogram; the spectrogram is passed through a pre-trained Resnet-50 model to obtain spectrogram features; the spectrogram features are fed into a spectrogram enhancement module and an attention mechanism module, which respectively extract data-augmented feature representations and discriminative feature representations within and between channels; the two bilinear feature representations are fused by a Transformer fusion module, which highlights useful information, suppresses redundant information and further strengthens the representational capacity of the feature map; the fused feature map is then used to classify the infant cry, and the final classification result is obtained through multiple training iterations.

Description

Infant crying classification method based on Transformer fusion model
Technical Field
The invention relates to computer speech technology, and in particular to an infant cry classification method based on a Transformer fusion model.
Background
Automatic classification of infant crying is an important research field in bioengineering: medical and engineering techniques are used to analyze the cry signal and distinguish the infant's physiological and pathological states. Unlike the verbal information in adult speech, it is difficult to identify what infants are trying to convey through their cries, so it is important to design an effective infant cry classification model that efficiently obtains and recognizes these physiological and pathological states.
Traditional infant cry classification models are mostly single-branch classification models. They mainly fall into two groups: methods based on traditional machine learning classifiers, including MLP, SVM and decision-tree models, and methods based on deep learning classifiers, including Resnet-50, a transfer-learned Resnet-50 combined with an SVM, graph convolution models and the R-CNN series. Traditional machine learning classifiers have many limitations, such as small data scale and poor generalization ability, and are difficult to apply to complex and changeable real-world scenes. Deep learning classifiers, on the other hand, mostly focus only on deeply extracting complex cry feature representations; they cannot simultaneously expand the limited cry samples and fully mine the discriminative feature representations within and between the cry channels.
In summary, existing infant cry classification has the following problems:
(1) Existing methods cannot adaptively acquire discriminative features, and they ignore the information interaction of features within and between spectrogram feature channels.
(2) For cry data with limited labels, existing methods cannot effectively increase the robustness of spectrogram features, because cry data are sensitive and transcribing the original recordings is time-consuming.
Disclosure of Invention
Purpose of the invention: the invention aims to overcome the defects of the prior art and provides an infant cry classification method based on a Transformer fusion model. Built on the idea of information fusion, on the one hand a spectrogram enhancement module is used to extract robust feature representations and thereby expand the cry data; on the other hand a spatial and channel attention module is used to extract discriminative feature representations within and between channels; finally the two feature representations are fused through a Transformer module, making full use of the fusion idea to further improve classification performance.
The technical scheme is as follows: the invention relates to a method for classifying infant crying based on a Transformer fusion model, comprising the following steps:
step (1), inputting crying audio data of a baby to be classified, preprocessing the input audio data through an audio processing module, and generating a spectrogram;
step (2), constructing a training model, and preliminarily extracting spectrogram characteristics;
a pre-trained Resnet-50 model is taken as the basic network; the parameters of the first 8 groups of convolutional layers in the Resnet-50 model are fixed so that only the parameters of the last two layers participate in training; two fully connected layers with 1024 neurons and a dropout layer with a rate of 0.7 are then added to the basic network to obtain the training model;
step (3), for the spectrogram features obtained in step (2), a spectrogram enhancement module extracts a robust feature representation, while an attention mechanism module extracts discriminative feature representations within and between channels;
step (4), the two feature representations obtained in step (3) are fused through a Transformer fusion module; the fused feature map is then used to classify the infant cry, and the final classification result is obtained through multiple training iterations.
Further, the preprocessing in step (1) converts the audio file samples (in formats such as wav, pcm or mp3) into 256 × 256 spectrograms using audio processing software (e.g., Sound eXchange); in the single-channel spectrogram, the horizontal axis represents time and the vertical axis represents frequency.
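To make this step concrete, a minimal Python sketch that produces a 256 × 256 spectrogram image from an audio file is given below. The preprocessing above uses the Sound eXchange (SoX) tool, so the librosa/Pillow-based code is only an illustrative equivalent; the STFT parameters and the file name are assumptions (the 44100 Hz sampling rate follows the embodiment).

```python
# Hedged sketch: the preprocessing described above uses SoX, not this code;
# the STFT parameters below are illustrative assumptions.
import numpy as np
import librosa
from PIL import Image

def audio_to_spectrogram(path, sr=44100, out_size=(256, 256)):
    y, _ = librosa.load(path, sr=sr)                        # load and resample the cry audio
    stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
    spec_db = librosa.amplitude_to_db(stft, ref=np.max)     # log-magnitude spectrogram
    # Normalize to 0-255 and resize to the 256 x 256 input expected by the model
    norm = (spec_db - spec_db.min()) / (np.ptp(spec_db) + 1e-8)
    img = Image.fromarray((255 * norm).astype(np.uint8))
    return np.array(img.resize(out_size))                   # rows: frequency, columns: time

spec = audio_to_spectrogram("cry_sample.wav")               # "cry_sample.wav" is a hypothetical file
```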
Further, the basic training network model in step (2) is constructed as follows:
a pre-trained Resnet-50 model is taken as the backbone network, the parameters of the first 8 groups of convolutional layers in Resnet-50 (Conv1 to Conv8) are fixed, and only the parameters of the last two layers are updated during training; two fully connected layers with 1024 neurons and a dropout layer with a rate of 0.7 are then added to the backbone to form the basic network model used to preliminarily extract spectrogram features.
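A minimal PyTorch sketch of this basic network is shown below. The description does not name the framework layers that make up the "first 8 groups of convolutional layers", so the freezing boundary, the ReLU activations and the final classification layer here are assumptions.

```python
# Hedged sketch of the base network: pretrained ResNet-50, early layers frozen,
# two 1024-unit fully connected layers and dropout with rate 0.7.
import torch.nn as nn
from torchvision import models

def build_base_network(num_classes=3):
    backbone = models.resnet50(pretrained=True)              # pre-trained Resnet-50 backbone
    for name, param in backbone.named_parameters():
        # Assumption: freeze everything except the last residual stage and the head.
        if not (name.startswith("layer4") or name.startswith("fc")):
            param.requires_grad = False
    in_features = backbone.fc.in_features                    # 2048 for ResNet-50
    backbone.fc = nn.Sequential(
        nn.Linear(in_features, 1024), nn.ReLU(),
        nn.Linear(1024, 1024), nn.ReLU(),
        nn.Dropout(p=0.7),                                   # dropout rate 0.7 as described above
        nn.Linear(1024, num_classes),
    )
    return backbone
```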
Further, the specific method by which the spectrogram enhancement module in step (3) extracts the robust spectrogram features is as follows: masking operations are applied to the time-domain and frequency-domain channels of the spectrogram; specifically, two frequency-domain masks are set, with random widths between 0 and 20, and two time-domain masks are set, with random widths between 10 and 30. The time-warping operation is removed, because infant cries carry no strong semantic information in the speech time sequence.
When training data are insufficient, the spectrogram enhancement module augments the spectrum of the audio data and dynamically expands the data set; spectrogram enhancement helps the neural network learn better spectrogram feature representations, increases the robustness of the trained network against time-domain deformation and loss of frequency-domain segments, and ultimately improves the performance of the final audio classifier.
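As an illustration of the masking described above (SpecAugment-style, with time warping deliberately omitted), a small sketch follows; the zero fill value and the array layout (frequency on axis 0, time on axis 1) are assumptions.

```python
# Hedged sketch of the spectrogram enhancement module: two frequency masks of
# random width 0-20 and two time masks of random width 10-30, no time warping.
import numpy as np

def mask_spectrogram(spec, n_freq_masks=2, n_time_masks=2, rng=None):
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(n_freq_masks):
        width = int(rng.integers(0, 21))                     # frequency mask width in [0, 20]
        start = int(rng.integers(0, max(1, n_freq - width)))
        spec[start:start + width, :] = 0
    for _ in range(n_time_masks):
        width = int(rng.integers(10, 31))                    # time mask width in [10, 30]
        start = int(rng.integers(0, max(1, n_time - width)))
        spec[:, start:start + width] = 0
    return spec
```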
Further, the attention mechanism module in step (3) comprises a channel attention mechanism and a spatial attention mechanism; the specific working process is as follows:
for the channel attention mechanism, channel information is first aggregated using global maximum pooling for each feature map, and the feature descriptors generated by the global maximum pooling are sent to a two-layer perceptron, as shown in equation (1):
P_c = M_mlp(M_max(P)),    (1)
the final feature map is generated by multiplying the channel attention map and the original feature map; as shown in equation (2):
P_f1 = P_c ⊗ P,    (2)
where M_max(·) denotes the global maximum pooling over the channel feature maps, M_mlp(·) is the two-layer perceptron, P denotes the input feature map, P_c is the generated channel attention map, and P_f1 denotes the feature map generated by the channel attention mechanism.
For the spatial attention mechanism, global maximum pooling is applied at each point of the feature map, and the resulting feature descriptors are sent to the two-layer perceptron; finally, the attention map is multiplied with the feature map to generate the output feature map, as shown in equation (3):
P_f2 = M_mlp(M_max(P_f1)) ⊗ P_f1,    (3)
where M_max(·) in equation (3) denotes the global maximum pooling at each point of the feature map, and P_f2 denotes the feature map generated by the spatial attention mechanism.
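A hedged PyTorch sketch of the attention branch corresponding to equations (1)-(3) follows; the perceptron sizes and the sigmoid normalization of the attention maps are assumptions, since the description only specifies global maximum pooling followed by a two-layer perceptron.

```python
# Hedged sketch of the channel + spatial attention module of equations (1)-(3).
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Two-layer perceptron for the channel attention map (eq. 1); sizes are assumptions.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        # Pointwise two-layer perceptron (1x1 convolutions) for the spatial attention map.
        self.spatial_mlp = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=1), nn.ReLU(),
            nn.Conv2d(8, 1, kernel_size=1),
        )

    def forward(self, p):                                     # p: (B, C, H, W)
        # Eq. (1): global max pooling of each feature map, then the two-layer perceptron.
        pc = self.channel_mlp(p.amax(dim=(2, 3)))             # (B, C)
        pc = torch.sigmoid(pc).unsqueeze(-1).unsqueeze(-1)    # sigmoid normalization is an assumption
        pf1 = pc * p                                          # eq. (2): P_f1 = P_c ⊗ P
        # Eq. (3): max pooling over channels at each point, perceptron, then multiply.
        ps = self.spatial_mlp(pf1.amax(dim=1, keepdim=True))  # (B, 1, H, W)
        pf2 = torch.sigmoid(ps) * pf1
        return pf2
```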
Compared with a single-branch spectrogram enhancement module, this dual-branch design can obtain more robust feature embeddings.
Further, the Transformer fusion module in step (4) does not use positional encoding, and a single Transformer block fuses the robust feature representation with the discriminative feature representations within and between channels; the sentence tokens in the Transformer fusion module are set to 49, each token being 1 × 1 × 128 (w × h × c) in size; the final classification result is then obtained through 2 fully connected layers and a softmax.
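The fusion stage might be sketched as below, assuming each branch yields a 7 × 7 × 128 feature map that is flattened into the 49 tokens of dimension 128 mentioned above; the number of attention heads, the hidden width of the head and the element-wise addition used to merge the two branches are assumptions.

```python
# Hedged sketch of the Transformer fusion module: one Transformer block, no
# positional encoding, 49 tokens of dimension 128, two FC layers and softmax.
import torch
import torch.nn as nn

class TransformerFusion(nn.Module):
    def __init__(self, dim=128, num_classes=3):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),                       # second FC layer outputs class scores
        )

    def forward(self, feat_a, feat_b):                         # each branch: (B, 128, 7, 7)
        tokens = (feat_a + feat_b).flatten(2).transpose(1, 2)  # (B, 49, 128); merging by addition is an assumption
        fused = self.block(tokens)                             # one Transformer block, no positional encoding
        return self.head(fused.mean(dim=1))                    # pool tokens, then 2 FC layers (logits)

    def predict(self, feat_a, feat_b):
        return torch.softmax(self.forward(feat_a, feat_b), dim=-1)  # softmax gives the final classification
```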
On the one hand, the invention considers learning features within channels, which benefits the information interaction between channels; on the other hand, discriminative feature representations between channels are learned through the channel and spatial attention modules. A Transformer is therefore used to fuse the two branch modules, organically combining the learned features.
Further, in the iterative training process of step (4), a stochastic gradient method is run for 200 iterations with a batch size of 32 and a model learning rate of 0.0001; once the result no longer changes over ten consecutive iterations, the final model is saved for model testing;
the model is trained and tested in a 5-fold cross validation mode, and finally the average test precision of 5 folds is used as a classification result.
Beneficial effects: given the limited number of infant cry samples and the difficulty of effectively mining discriminative cry feature representations, the invention provides a spectrogram enhancement random-mask module combined with a spatial and channel attention mechanism, obtaining better classification feature representations of infant cries. At the same time, a Transformer mechanism fuses the features, realizing their complementary effect and further improving the robustness of infant cry classification.
Compared with the prior art, the invention has the following advantages:
(1) The training model constructed by the invention is a novel bilinear fusion network that mines audio spectrogram features at multiple levels.
(2) The spectrogram enhancement module extracts robust feature representations of the spectrogram within channels, increases the number of audio samples and dynamically expands the data set.
(3) The attention mechanism module extracts discriminative feature representations and fully mines the discriminative features between channels.
(4) The feature fusion module realizes the complementary effect among features, highlights useful information, suppresses redundant information and further enhances the representational capacity of the feature map.
Drawings
FIG. 1 is a general classification flow diagram of the present invention;
FIG. 2 is a schematic diagram of a network model according to an embodiment;
FIG. 3 is a sample audio frequency spectrum of an embodiment;
FIG. 4 is a schematic diagram of a feature matrix of an embodiment;
FIG. 5 is a schematic diagram of the enhanced spectrum in the embodiment;
FIG. 6 is a diagram illustrating the classification result of baby cry types according to the embodiment.
Detailed Description
The technical solution of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.
As shown in fig. 1, the method for classifying baby crying based on the Transformer fusion model of the present invention includes the following steps:
step (1), the crying audio data of the infant to be classified are input, and the audio samples are converted into 256 × 256 spectrograms using Sound eXchange audio processing software; in the single-channel spectrogram, the horizontal axis represents time and the vertical axis represents frequency;
step (2), constructing a training model, and preliminarily extracting spectrogram characteristics;
as shown in fig. 2, a pre-trained Resnet-50 model is used as the basic network; the parameters of the first 8 groups of convolutional layers in the Resnet-50 model are fixed so that only the parameters of the last two layers participate in training; two fully connected layers with 1024 neurons and a dropout layer with a rate of 0.7 are then added to the basic network to obtain the training model;
step (3), for the spectrogram features obtained in step (2), a spectrogram enhancement module extracts a robust feature representation, while an attention mechanism module extracts discriminative feature representations within and between channels;
step (4), the two feature representations obtained in step (3) are fused through a Transformer fusion module; the fused feature map is then used to classify the infant cry, and the final classification result is obtained through multiple training iterations.
Example 1:
the embodiment comprises the following steps:
step (1), the training data are first processed into 256 × 256 spectrograms by the SoX tool and used as the input of the subsequent model; the audio parameters include a sampling rate of 44100 Hz, a sample size of 16 bits, a sample encoding of 16-bit signed-integer PCM, and 3 channels; a specific audio spectrogram sample is shown in fig. 3;
step (2), a pre-trained Resnet-50 model is used as the basic network and fine-tuned to construct the training model, specifically as follows:
the pre-trained Resnet-50 model is taken as the backbone network, the parameters of the first 8 groups of convolutional layers in Resnet-50 (Conv1 to Conv8) are fixed, and only the parameters of the last two layers are updated during training; two fully connected layers with 1024 neurons and a dropout layer with a rate of 0.7 are then added to the backbone to form the basic network model, which preliminarily extracts spectrogram features and generates a 1024 × 32 feature matrix, as shown in fig. 4.
Step (3), further optimizing the extracted spectrogram characteristics through two branches;
One branch uses the spectrogram enhancement module; the time-warping setting is removed, and only the time-domain and frequency-domain channels of the spectrogram are masked. Specifically, two frequency-domain masks are set, with random widths between 0 and 20, and two time-domain masks are set, with random widths between 10 and 30; the result of enhancing one of the spectrograms is shown in fig. 5.
The other branch uses the spatial and channel attention module. For the channel attention module, each spectrogram feature passes through a global maximum pooling (max pooling) layer and a two-layer MLP to obtain a channel attention feature map. The channel attention feature map is then multiplied by the original input features, and the result is taken as the input of the spatial attention module, which applies global maximum pooling at each point of the feature map; the resulting feature descriptors are sent to the two-layer perceptron; finally, the attention map is multiplied with the feature map to generate the output feature map.
Step (4), the two obtained representations are fused using the Transformer fusion module. In this embodiment the positional encoding in the Transformer module is removed, and one Transformer block fuses the optimized spectrogram features of the two branches; the sentence tokens are set to 49, each token being 1 × 1 × 128 (w × h × c) in size. Finally, the classification result is obtained through 2 fully connected layers and a softmax; the result of classifying one spectrogram is shown in fig. 6.
The classification result of this implementation includes the hungry, sleep and wakeup classes, so the meaning of the cry can be judged from the final classification of the cry audio: hungry indicates that the infant expresses hunger through crying, wakeup indicates that the infant is awake, and sleep indicates that the infant is sleepy, thereby revealing the infant's physiological needs.
In the iterative training of this embodiment, a stochastic gradient method is run for 200 iterations with a batch size of 32 and a model learning rate of 0.0001; once the result no longer changes over ten consecutive iterations, the final model is saved for model testing. The model is trained and tested with 5-fold cross validation, and the average test accuracy over the 5 folds is taken as the classification result.
Example 2:
To verify the rationality and effectiveness of the technical scheme, a subset of the Baby2020 data set was selected for the experiment, and acc (accuracy) was used as the objective evaluation index of the classification result. The embodiment is implemented with the deep learning framework PyTorch and accelerated with a graphics processing unit (GPU), using 12 GB of memory and an Nvidia GeForce GTX 2080Ti graphics card.
The Baby2020 data subset contains three types of samples from healthy infants aged 0 to 3 months: 1058 samples in the hungry category, 1257 in the sleep category and 949 in the wakeup category; 2790 audio clips are used as training data and 743 as test data. Table 1 lists the experimental results of the invention: acc reaches 83.14%, the classification performance is superior to other similar methods, and effective classification of the targets is achieved.
TABLE 1 classification results for the Baby2020 data subsets
(Table 1 is provided as an image in the original publication.)
The embodiment shows that the method greatly improves the accuracy and efficiency of infant cry audio classification and makes it convenient to understand in time the needs the infant expresses by crying.

Claims (7)

1. A method for classifying infant crying based on a Transformer fusion model, characterized by comprising the following steps:
step (1), inputting crying audio data of a baby to be classified, preprocessing the input audio data through an audio processing module, and generating a spectrogram;
step (2), constructing a basic network model, and preliminarily extracting spectrogram characteristics;
a pre-trained Resnet-50 model is taken as the basic network; the parameters of the first 8 groups of convolutional layers in the Resnet-50 model are fixed so that only the parameters of the last two layers participate in training; two fully connected layers with 1024 neurons and a dropout layer with a rate of 0.7 are then added to the basic network to obtain the basic network model;
step (3), for the spectrogram features obtained in step (2), a spectrogram enhancement module extracts a robust feature representation, while an attention mechanism module extracts discriminative feature representations within and between channels;
step (4), the two feature representations obtained in step (3) are fused through a Transformer fusion module; the fused feature map is then used to classify the infant cry, and the final classification result is obtained through multiple training iterations.
2. The method for classifying baby crying based on Transformer fusion model as claimed in claim 1, wherein: the preprocessing in the step (1) is to convert an audio file sample into a spectrogram with the size of 256 × 256 through an audio processing module, wherein the horizontal axis of the spectrogram of a single channel represents time, and the vertical axis represents frequency.
3. The method for classifying infant crying based on a Transformer fusion model as claimed in claim 1, wherein the basic network model in step (2) is constructed as follows:
a pre-trained Resnet-50 model is taken as the basic network; the parameters of the first 8 groups of convolutional layers in the Resnet-50 model are fixed so that only the parameters of the last two layers participate in training; two fully connected layers with 1024 neurons and a dropout layer with a rate of 0.7 are then added to the basic network to obtain the basic network model.
4. The method for classifying infant crying based on a Transformer fusion model as claimed in claim 1, wherein the specific method by which the spectrogram enhancement module in step (3) extracts the robust spectrogram features is as follows: masking operations are applied to the time-domain and frequency-domain channels of the spectrogram; two frequency-domain masks are set, with random widths between 0 and 20, and two time-domain masks are set, with random widths between 10 and 30.
5. The method for classifying infant crying based on a Transformer fusion model as claimed in claim 1, wherein the attention mechanism module in step (3) comprises a channel attention mechanism and a spatial attention mechanism, whose specific working process is as follows:
for the channel attention mechanism, channel information is first aggregated using global maximum pooling for each feature map, and the feature descriptors generated by the global maximum pooling are sent to a two-layer perceptron, as shown in equation (1):
P_c = M_mlp(M_max(P)),    (1)
the final feature map is generated by multiplying the channel attention map and the original feature map; as shown in equation (2):
P_f1 = P_c ⊗ P,    (2)
where M_max(·) denotes the global maximum pooling over the channel feature maps, M_mlp(·) is the two-layer perceptron, P denotes the input feature map, P_c is the generated channel attention map, and P_f1 denotes the feature map generated by the channel attention mechanism;
for the spatial attention mechanism, global maximum pooling is applied at each point of the feature map, the resulting feature descriptors are sent to the two-layer perceptron, and finally the attention map is multiplied with the feature map to generate the output feature map, as shown in equation (3):
P_f2 = M_mlp(M_max(P_f1)) ⊗ P_f1,    (3)
where M_max(·) in equation (3) denotes the global maximum pooling at each point of the feature map, and P_f2 denotes the feature map generated by the spatial attention mechanism.
6. The method for classifying infant crying based on a Transformer fusion model as claimed in claim 1, wherein the Transformer fusion module in step (4) does not use positional encoding, and a single Transformer block fuses the robust feature representation with the discriminative feature representations within and between channels;
the sentence tokens in the Transformer fusion module are set to 49, each token being 1 × 1 × 128 (w × h × c) in size; the final classification result is obtained through 2 fully connected layers and a softmax.
7. The method for classifying infant crying based on a Transformer fusion model as claimed in claim 1, wherein in the iterative training process of step (4), a stochastic gradient method is run for 200 iterations with a batch size of 32 and a model learning rate of 0.0001; once the result no longer changes over ten consecutive iterations, the final model is saved for model testing;
the model is trained and tested with 5-fold cross validation, and the average test accuracy over the 5 folds is taken as the classification result.
CN202210236093.2A 2022-03-11 2022-03-11 Infant crying classification method based on Transformer fusion model Pending CN114595772A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210236093.2A CN114595772A (en) 2022-03-11 2022-03-11 Infant crying classification method based on Transformer fusion model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210236093.2A CN114595772A (en) 2022-03-11 2022-03-11 Infant crying classification method based on Transformer fusion model

Publications (1)

Publication Number Publication Date
CN114595772A true CN114595772A (en) 2022-06-07

Family

ID=81818647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210236093.2A Pending CN114595772A (en) 2022-03-11 2022-03-11 Infant crying classification method based on Transformer fusion model

Country Status (1)

Country Link
CN (1) CN114595772A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116386661A (en) * 2023-06-05 2023-07-04 成都启英泰伦科技有限公司 Crying detection model training method based on dual attention and data enhancement
CN116386661B (en) * 2023-06-05 2023-08-08 成都启英泰伦科技有限公司 Crying detection model training method based on dual attention and data enhancement

Similar Documents

Publication Publication Date Title
Gong et al. Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation
Chatziagapi et al. Data Augmentation Using GANs for Speech Emotion Recognition.
Mei et al. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research
Ke et al. Speech emotion recognition based on SVM and ANN
CN112818861A (en) Emotion classification method and system based on multi-mode context semantic features
Li et al. An evaluation of deep neural network models for music classification using spectrograms
CN112102813B (en) Speech recognition test data generation method based on context in user comment
CN103605990A (en) Integrated multi-classifier fusion classification method and integrated multi-classifier fusion classification system based on graph clustering label propagation
CN115762536A (en) Small sample optimization bird sound recognition method based on bridge transform
CN111899766B (en) Speech emotion recognition method based on optimization fusion of depth features and acoustic features
CN111882042B (en) Neural network architecture automatic search method, system and medium for liquid state machine
CN116010874A (en) Emotion recognition method based on deep learning multi-mode deep scale emotion feature fusion
CN114595772A (en) Infant crying classification method based on Transformer fusion model
Wu Research on automatic classification method of ethnic music emotion based on machine learning
Li et al. Emotion recognition from speech with StarGAN and Dense‐DCNN
Sun et al. Multi-classification speech emotion recognition based on two-stage bottleneck features selection and MCJD algorithm
CN103871413A (en) Men and women speaking voice classification method based on SVM and HMM mixing model
Glickman et al. (A) Data in the Life: Authorship Attribution of Lennon-McCartney Songs
CN105632485A (en) Language distance relation obtaining method based on language identification system
Li et al. Audio recognition of Chinese traditional instruments based on machine learning
CN113806543B (en) Text classification method of gate control circulation unit based on residual jump connection
CN116052718A (en) Audio evaluation model training method and device and audio evaluation method and device
CN115455144A (en) Data enhancement method of completion type space filling type for small sample intention recognition
Martín-Morató et al. Adaptive distance-based pooling in convolutional neural networks for audio event classification
Vásquez et al. Tailed U-Net: Multi-Scale Music Representation Learning.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination