CN115188387A - Effective marine mammal sound automatic detection and classification method - Google Patents

Effective marine mammal sound automatic detection and classification method

Info

Publication number
CN115188387A
Authority
CN
China
Prior art keywords
probability
grained
layer
coarse
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210817343.1A
Other languages
Chinese (zh)
Other versions
CN115188387B (en)
Inventor
李丹阳
李军
蒋凯林
郑兴泽
李焦
明扬
李林成
谢天宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Agricultural University
Original Assignee
Sichuan Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Agricultural University filed Critical Sichuan Agricultural University
Priority to CN202210817343.1A priority Critical patent/CN115188387B/en
Publication of CN115188387A publication Critical patent/CN115188387A/en
Application granted granted Critical
Publication of CN115188387B publication Critical patent/CN115188387B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an effective automatic detection and classification method for marine mammal sounds. Marine mammal audio data are first augmented with a single-sample variational autoencoder. Mel cepstral coefficients are extracted from the audio data and spliced with the onset strength envelope to obtain the first input feature, while an audio fingerprint feature is extracted from the audio data, by means of audio fingerprint extraction, as the second input feature. The Mel frequency cepstral coefficients and the audio fingerprint features are fed into a two-way fusion network, which outputs two prediction results, and the two predictions are fused to obtain the final result for detecting and classifying the marine mammals. Through this two-way parallel fusion network structure, the fused network can simultaneously capture high-dimensional features and exploit timing information, and the complementarity between the models, arising from the different information each network attends to, is used to improve model performance efficiently.

Description

Effective marine mammal sound automatic detection and classification method
Technical Field
The invention relates to the technical field of marine acoustics application, in particular to an effective automatic detection and classification method for marine mammal sound.
Background
Statistics show that about 2.1% of the world's mammal species have gone extinct since the year 1600. Expert analyses indicate that the rate of species extinction keeps accelerating and is now roughly 100 to 1,000 times the previously estimated background rate. Of the roughly 1.739 million species recorded worldwide, about 130 are marine mammals; more than 90 of these are whales and dolphins, and nearly 40 belong to other groups such as the pinnipeds and sirenians. Of the 20,278 species recorded in Chinese sea areas, nearly 50 are aquatic mammals (including introduced species); up to 41 of these are whales and dolphins, 5 are pinnipeds (excluding introduced species), and 1 is a sirenian. Marine mammals are among the most endangered species in nature and are listed as protected animals in almost every country in the world.
Although marine mammals are few in number, the role they play in maintaining the balance of the marine ecosystem is far from trivial, and their protection is of paramount importance. In recent years, however, their survival has continued to face serious challenges due to poorly understood species resources, degraded habitats, water pollution and similar problems. The traditional approach of identifying these mammals manually is laborious, inefficient and time-consuming, incurs high material and labor costs, produces data that are huge and difficult to process, cannot monitor marine mammals continuously, suffers from time lag, and carries a certain amount of danger. Sample collection is highly contingent and random and requires long periods of accumulation; field surveys and passive acoustic monitoring demand large investments of manpower, finance and material, and are often difficult to carry out quickly, frequently and at scale. Because many marine mammals inhabit sea areas rarely visited by humans and are highly mobile, manual identification is very difficult. How to detect, identify and classify marine mammals automatically is therefore a pressing question.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing an effective automatic detection and classification method for marine mammal sounds that solves the problems of the traditional manual identification of mammals.
The purpose of the invention is realized by the following technical scheme: an effective marine mammal sound automatic detection and classification method, the automatic detection and classification method comprising:
performing data augmentation, with a single-sample variational autoencoder, on publicly available and field-collected marine mammal audio data;
extracting Mel cepstral coefficients and the onset strength envelope from the augmented audio data and splicing them to obtain a first input feature, and extracting an audio fingerprint feature as a second input feature from the augmented audio data by means of audio fingerprint extraction;
feeding the Mel frequency cepstral coefficients and the audio fingerprint features into a two-way fusion MG-ResFormer network, which outputs two sample-class probability predictions, and fusing the two predictions to obtain the final marine mammal detection and classification result.
Extracting the audio fingerprint feature (the second input feature) from the augmented audio data by means of audio fingerprint extraction comprises:
dividing the audio data into several original subframes of equal size, and applying a Fourier transform to the data of each original subframe to compute its spectrum;
dividing each computed atomic spectrum into several spectral bands, computing an energy block for each band, and combining all energy blocks into a two-dimensional matrix representing the atomic-spectrum energy information;
performing a difference calculation on the two-dimensional matrix and, by comparing each energy block with its neighbouring blocks, obtaining a 0-1 matrix containing only 0s and 1s;
splicing the two-dimensional matrix with the 0-1 matrix, which contains the animals' vocalization information, to obtain the audio fingerprint feature.
The two-way fusion MG-ResFormer network comprises an MG-Resnet network model, an MG-Transformer network model and a fusion layer. The Mel frequency cepstral coefficients are input into the MG-Transformer network model to obtain its probability matrix for the sample, the audio fingerprint features are input into the MG-Resnet network model to obtain its probability matrix for the sample, and the fusion layer fuses the probability matrices output by the two network models to obtain the final marine mammal detection and classification result.
The MG-Resnet network model comprises five convolutional modules, a pooling layer, two fully connected layers and a coarse-fine granularity combination module. The input audio fingerprint features first pass through a 7 x 7 convolution in the first convolutional module; the second to fifth convolutional modules, each containing two building blocks, perform residual convolutions followed by average pooling; the two outputs obtained after the two parallel fully connected layers are then fed into the coarse-fine granularity combination layer, finally yielding the probability matrix for the sample.
The MG-Transformer network model first pools the input Mel frequency cepstral coefficients to reduce the sensitivity of the features to position, then feeds them into an encoding layer where a multi-head attention mechanism extracts different feature signals; the features are segmented to strengthen the model's attention to global features, coarse-grained and fine-grained probabilities are extracted by two linear layers, and finally these are fed into the coarse-fine granularity combination layer to obtain the probability matrix for the sample.
The fusion layer fusing the probability matrices output by the two network models to obtain the final marine mammal detection and classification result comprises:
arranging 9 neurons in the fusion layer, the 9-class probability values output by the two network models passing respectively through the 9 neurons and 9 pseudo-neurons;
multiplying the class probability values output by one network model directly by the 9 neurons and the class probability values output by the other network model by the 9 pseudo-neurons, then adding the two resulting groups of probability values and normalizing them to obtain the final probabilities.
Obtaining the probability matrix of the sample through the coarse-fine granularity combination layer in the MG-Transformer network model and the MG-Resnet network model comprises:
mapping tensors whose lengths equal the number of fine-grained classes and the number of coarse-grained classes through two parallel fully connected layers in the MG-Resnet network model, and mapping tensors of the same lengths through two parallel linear layers in the MG-Transformer network model;
the coarse-grained layer of the coarse-fine combination layer applies a softmax function to its input to obtain the probability of the coarse-grained class to which the sample belongs; the fine-grained layer groups its input, placing fine-grained classes belonging to the same coarse-grained class in one group, applies a softmax operation to each group, and finally multiplies the obtained coarse-grained probability by the corresponding fine-grained probability to obtain the class probability information p1;
through a residual-like structure, a softmax operation is also applied directly to the input of the fine-grained layer, and the result is weighted by ε and added to the matrix containing p1, i.e. p2 = p1 + ε * p0, where p0 denotes the probability obtained by applying softmax directly to the input of the fine-grained layer;
finally p2 is normalized to obtain the final probability matrix of the sample, p3 = p2 / Σp2.
The invention has the following advantages. Unlike traditional features, the constructed audio fingerprint contains a large amount of energy, frequency and timing information of the sound signal and captures the distinctive vocal behaviour of different species, which proves advantageous in a convolutional neural network. A multi-granularity combination layer is built to assist the multi-class task: following the "kingdom-phylum-class-order-family-genus-species" taxonomy, a coarse-grained layer and a fine-grained layer are constructed for the species to be identified, with the coarse-grained layer corresponding to the "family" level and the fine-grained layer to the "genus" level, so that the prior judgment at coarse granularity consolidates the decision at fine granularity; this multi-granularity fusion layer is highly general and applicable to other studies. Through the two-way parallel fusion network structure, the fused network can both capture high-dimensional features and exploit timing information, and the complementarity between models, arising from the different information each network attends to, is used to improve model performance efficiently.
Drawings
FIG. 1 is a schematic structural diagram of a two-way MG-ResFormer network of the present invention;
FIG. 2 is a schematic view of a coarse-fine grain composite layer;
FIG. 3 is a schematic diagram of the MG-Resnet network model;
FIG. 4 is a schematic structural view of a fused layer;
FIG. 5 is a comparison of the effects of various network models.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, as generally described and illustrated in the figures herein, could be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application provided below in connection with the appended drawings is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, an effective automatic detection and classification method for marine mammal sounds comprises:
s1, carrying out data enhancement processing on the audio data of the existing open and field collected marine mammals through a single-sample variation self-encoder;
wherein the public marine mammal audio data cover 9 classes of marine mammals from the Watkins Marine Mammal Sound Database, which provides marine mammal sound recordings from 1940 to 2000; to facilitate audio fingerprint generation and keep the network input size uniform, audio of different lengths is split evenly into 2 s segments;
in order to prevent the Sample cloth distribution from causing negative influence on the learning of the model, the phenomena of blurring and chaos (image superposition is serious due to posterior collapse) which can occur in an image generated by a common VAE (Plain VAE) are solved by a single Sample variation auto-encoder (S3 VAE).
S2, extracting Mel cepstral coefficients and the onset strength envelope from the augmented audio data and splicing them to obtain the first input feature, and extracting an audio fingerprint feature as the second input feature from the augmented audio data by means of audio fingerprint extraction;
further, audio feature extraction can reduce the sampling signal of the original waveform, thereby accelerating the understanding of the semantic meaning in the audio by the machine. In order to obtain the audio features with the best effect, 9 features of the audio data mainstream are extracted: chromatographic information, constant Q chromatographic information, normalized chromatographic information, mel-frequency spectral information, mel-cepstral information, spectral contrast, tonal centroid, local autocorrelation of the initial intensity envelope, fourier-velocimetry.
First, pre-emphasis is applied to each sample to boost the energy of the high-frequency part of the signal. Given a time-domain input signal x[n], the pre-emphasized signal is y[n] = x[n] - αx[n-1], with 0.9 ≤ α ≤ 1.0.
To facilitate the extraction of the subsequent features, the signal values at the window boundaries should be close to 0, so that each frame approximates a periodic signal; a window function is applied for this purpose:
(The window function formula appears only as an image in the original document; it is a tapering window whose values approach zero at the frame boundaries.)
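A minimal sketch of this preprocessing step is given below. The frame length, hop size and the use of a Hann window are assumptions made for illustration, since the patent gives the window formula only as an image.

    import numpy as np

    def preemphasize(x, alpha=0.97):
        # Pre-emphasis: y[n] = x[n] - alpha * x[n-1], with 0.9 <= alpha <= 1.0.
        return np.append(x[0], x[1:] - alpha * x[:-1])

    def window_frames(x, frame_len=1024, hop=512):
        # Split the signal into frames and apply a tapering (Hann) window so that
        # values at the frame boundaries approach zero, as described above.
        n_frames = 1 + (len(x) - frame_len) // hop
        win = np.hanning(frame_len)
        return np.stack([x[i * hop:i * hop + frame_len] * win for i in range(n_frames)])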
After signal preprocessing, the nine features above are extracted with the open-source Python library librosa. Because each feature attends to different aspects of the signal, the features are combined, the combinations are trained, and their relevance is verified; in the end two features are selected and spliced together as one of the inputs: the Mel-frequency cepstral coefficients (MFCCs) and the onset strength envelope.
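As an illustration, the first input feature could be assembled with librosa roughly as follows; the sampling rate, number of MFCCs and hop length are illustrative assumptions rather than values stated in the patent.

    import librosa
    import numpy as np

    def first_input_feature(path, sr=22050, n_mfcc=40, hop_length=512):
        # Load a 2-second clip, extract MFCCs and the onset strength envelope,
        # and splice the envelope under the MFCC matrix as one extra feature row.
        y, sr = librosa.load(path, sr=sr, duration=2.0)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
        onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
        return np.vstack([mfcc, onset_env[np.newaxis, :]])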
S3, feeding the Mel frequency cepstral coefficients and the audio fingerprint features into the two-way fusion MG-ResFormer network, which outputs two sample-class probability predictions, and fusing the two predictions to obtain the final marine mammal detection and classification result.
Further, as shown in fig. 2, extracting the audio fingerprint feature (the second input feature) from the augmented audio data by means of audio fingerprint extraction comprises:
dividing the audio data into several original subframes of equal size, and applying a Fourier transform to the data of each original subframe to compute its spectrum;
dividing each computed atomic spectrum into several spectral bands, computing an energy block for each band, and combining all energy blocks into a two-dimensional matrix representing the atomic-spectrum energy information;
performing a difference calculation on the two-dimensional matrix and, by comparing each energy block with its neighbouring blocks, obtaining a 0-1 matrix containing only 0s and 1s;
splicing the two-dimensional matrix with the 0-1 matrix, which contains the animals' vocalization information, to obtain the audio fingerprint feature.
Specifically, conventional audio features such as the Mel spectrogram, MFCC and chromagram usually relate to only a single kind of information in the voiceprint. The invention therefore constructs a new voiceprint feature containing frequency, energy and timing information, which strengthens the weak points of any single signal and also reflects the distinctiveness of different species' vocalizations.
The invention also minimizes the influence of overly long or overly short audio on the final classification. In constructing the audio fingerprint it therefore uses an atomic frame stream strategy: the original audio is divided into original subframes of equal size, a series of transformations is applied to these atomic frames to obtain atomic features, and the final feature is obtained by combining them. A single atomic feature may not carry enough information for the model to make an identification on its own, but the audio to be identified normally consists of hundreds or thousands of original subframes, which together contain enough atomic frames for efficient and reliable identification.
Then, a Fourier transform is applied to the data of each original subframe to compute its spectrum:
F(e^jω) = ∫ x(t)·e^(-jωt) dt
F(e^jω) = a + ib
|F(e^jω)| = √(a² + b²)
where j is the imaginary unit, ω is the angular frequency and t is time. To make the energy information more precise, each atomic spectrum is divided into 65 spectral bands, and an energy block is obtained by a calculation over each spectral band (the energy-block formula appears only as an image in the original).
Combining the 65 energy blocks thus yields the energy information of the atomic spectrum. The energy information of all atomic spectra forms a two-dimensional matrix, which preserves a certain amount of timing information, an important property of the audio fingerprint.
A difference calculation is then performed on the two-dimensional matrix: by comparing each energy block with its neighbouring blocks, a matrix containing only 0s and 1s is obtained (the difference formula appears only as an image in the original).
Here m and n denote the coordinates of the two-dimensional matrix, n being the coordinate in the x direction and m the coordinate in the y direction, so that every energy block can be located in the matrix. The 0-1 matrix carries the animals' vocalization information, so it is spliced with the two-dimensional energy matrix to obtain the final fingerprint feature.
Further, as shown in fig. 1, the two-way fusion MG-ResFormer network comprises an MG-Resnet network model, an MG-Transformer network model and a fusion layer. The Mel frequency cepstral coefficients are fed into the MG-Transformer network model to obtain its probability matrix for the sample, the audio fingerprint features are fed into the MG-Resnet network model to obtain its probability matrix for the sample, and the fusion layer fuses the probability matrices output by the two network models to obtain the final marine mammal detection and classification result.
The audio fingerprint features of the audio are fed into the MG-Resnet network model, whose loss function is denoted Loss1; the Mel frequency cepstral coefficient features of the audio are fed into the MG-Transformer network model, whose loss function is denoted Loss2; in the fusion module, the probabilities output by the MG-Resnet and MG-Transformer network models are taken as input and fitted to the one-hot-encoded label information, and this loss is denoted Loss3. The final loss function is:
Loss = Loss1 + Loss2 + Loss3
The fingerprint feature of an audio sample and the spliced feature of the Mel cepstral coefficients (MFCC) and the onset strength envelope (as described above) are fed into MG-Resnet and MG-Transformer respectively. After forward propagation through the two-way network, two matrices predicting the sample class probabilities are output, and at this point the two 1 x 9 prediction matrices are frozen. Because these two matrices are produced by the computations of the two branches, back-propagating Loss3 through them would change the parameters of the two-branch network; Loss1 and Loss2 are responsible for updating the two branches, so freezing prevents the back-propagation of Loss3 from negatively influencing them.
This loss design is sound: the losses drive gradient back-propagation, and after working out the partial derivatives in detail, Loss1, Loss2 and Loss3 are each responsible only for the back-propagation of the module to which they belong.
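A minimal PyTorch sketch of one training step under this loss design is shown below. Here mg_resnet, mg_transformer and fusion denote the full branch models and fusion layer (each branch already outputting its probability matrix), and the use of negative-log-likelihood losses is an assumption for illustration; the point of the sketch is the detaching (freezing) of the two 1 x 9 prediction matrices before the fusion layer.

    import torch
    import torch.nn.functional as F

    def train_step(mg_resnet, mg_transformer, fusion, optimizer, fingerprint, mfcc_feat, labels):
        p_res = mg_resnet(fingerprint)        # 1 x 9 probability matrix from MG-Resnet
        p_trans = mg_transformer(mfcc_feat)   # 1 x 9 probability matrix from MG-Transformer

        loss1 = F.nll_loss(torch.log(p_res + 1e-8), labels)
        loss2 = F.nll_loss(torch.log(p_trans + 1e-8), labels)

        # Freeze (detach) the two prediction matrices so that back-propagating Loss3
        # cannot change the parameters of the two branch networks.
        p_fused = fusion(p_res.detach(), p_trans.detach())
        loss3 = F.nll_loss(torch.log(p_fused + 1e-8), labels)

        loss = loss1 + loss2 + loss3          # Loss = Loss1 + Loss2 + Loss3
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()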
As shown in fig. 3, the MG-Resnet network model comprises five convolutional modules, one pooling layer, two fully connected layers and one coarse-fine granularity combination module. The input audio fingerprint features first undergo a 7 x 7 convolution in the first convolutional module; the second to fifth convolutional modules, each containing two building blocks, perform residual convolutions followed by average pooling; the two outputs obtained after the two parallel fully connected layers are then fed into the coarse-fine granularity combination layer, finally yielding the probability matrix for the sample.
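A compact sketch of how such a branch could be laid out in PyTorch is shown below; a standard torchvision ResNet-18 is used as a stand-in for the five convolutional modules (its first convolution is 7 x 7 and its four residual stages each contain two building blocks), and the number of coarse-grained classes is an assumed value.

    import torch.nn as nn
    from torchvision.models import resnet18

    class MGResnetBackbone(nn.Module):
        # Backbone plus two parallel fully connected heads; their outputs are
        # fed to the coarse-fine granularity combination layer (sketched further below).
        def __init__(self, n_fine=9, n_coarse=3, in_channels=1):
            super().__init__()
            net = resnet18()
            net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
            self.features = nn.Sequential(*list(net.children())[:-1])   # conv stages + average pooling
            self.fc_fine = nn.Linear(512, n_fine)      # length = number of fine-grained classes
            self.fc_coarse = nn.Linear(512, n_coarse)  # length = number of coarse-grained classes

        def forward(self, x):
            h = self.features(x).flatten(1)
            return self.fc_fine(h), self.fc_coarse(h)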
The MG-Transformer network model first pools the input Mel frequency cepstral coefficients to reduce the sensitivity of the features to position, then feeds them into an encoding layer where a multi-head attention mechanism extracts different feature signals; the features are segmented to strengthen the model's attention to global features, coarse-grained and fine-grained probabilities are extracted by two linear layers, and finally these are fed into the coarse-fine granularity combination layer to obtain the probability matrix for the sample. This design keeps recognition efficient across a large number of call classification tasks, perceives global features better, and performs particularly well on multi-class tasks.
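A rough PyTorch sketch of this branch follows; the feature dimension (40 MFCCs plus the onset strength envelope), model width, number of heads and number of layers are all assumptions made for illustration.

    import torch.nn as nn

    class MGTransformerBackbone(nn.Module):
        # Pooling, a Transformer encoder with multi-head attention, and two parallel
        # linear heads whose outputs go to the coarse-fine granularity combination layer.
        def __init__(self, n_feat=41, d_model=128, n_heads=4, n_layers=2, n_fine=9, n_coarse=3):
            super().__init__()
            self.pool = nn.AvgPool1d(kernel_size=2)       # reduce sensitivity to position
            self.proj = nn.Linear(n_feat, d_model)
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.fc_fine = nn.Linear(d_model, n_fine)
            self.fc_coarse = nn.Linear(d_model, n_coarse)

        def forward(self, x):                  # x: (batch, n_feat, time)
            x = self.pool(x)                   # pool along the time axis
            h = self.proj(x.transpose(1, 2))   # (batch, time, d_model) token sequence
            h = self.encoder(h).mean(dim=1)    # attend over segments, then aggregate
            return self.fc_fine(h), self.fc_coarse(h)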
As shown in fig. 2, obtaining the probability matrix of the sample through the coarse-fine granularity combination layer in the MG-Transformer network model and the MG-Resnet network model comprises:
Tensors whose lengths equal the number of fine-grained classes and the number of coarse-grained classes are mapped through two parallel fully connected layers in the MG-Resnet network model, and through two parallel linear layers in the MG-Transformer network model. The data used here still carry only one label, the fine-grained label; coarse-grained information is captured by means of the coarse-grained layer.
The coarse-grained layer of the coarse-fine combination layer applies a softmax function to its input to obtain the probability of the coarse-grained class to which the sample belongs; the fine-grained layer groups its input, placing fine-grained classes belonging to the same coarse-grained class in one group, applies a softmax operation to each group, and finally multiplies the obtained coarse-grained probability by the corresponding fine-grained probability to obtain the class probability information. The following formula shows the calculation for the killer whale class:
p1(Killer whale) = p(Killer whale | Whale) * p(Whale)
In addition, a special case must be considered: the coarse-grained layer may output an incorrect coarse-grained classification probability. Although the chance of this is extremely low, a residual-like structure is used to handle it: a softmax operation is applied directly to the input of the fine-grained layer, and the result, weighted by ε, is added to the matrix containing p1, i.e. p2(Killer whale) = p1(Killer whale) + ε * p0(Killer whale);
where p0(Killer whale) denotes the probability obtained by applying softmax directly to the input of the fine-grained layer;
finally p2(Killer whale) is normalized to obtain the final probability for the sample: p3(Killer whale) = p2(Killer whale) / Σp2.
In actual training, the neural network gradually learns the correct features: the more correctly a fine-grained class is assigned to its coarse-grained class, the more likely the fine-grained class itself is classified correctly. Although the invention uses only fine-grained labels, the final classification effect is still very satisfactory.
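The combination layer's computation can be sketched as follows; the value of ε and the mapping from fine-grained species to coarse-grained groups are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def coarse_fine_combine(fine_logits, coarse_logits, groups, eps=0.1):
        # groups[i] is the coarse-grained class of fine-grained class i
        # (e.g. killer whale -> whale).
        p_coarse = F.softmax(coarse_logits, dim=-1)   # probability of each coarse class
        p0 = F.softmax(fine_logits, dim=-1)           # direct softmax on the fine-layer input

        # Grouped softmax over fine classes sharing a coarse class, multiplied by the
        # coarse probability: p1(Killer whale) = p(Killer whale | Whale) * p(Whale).
        p1 = torch.zeros_like(p0)
        for c in range(coarse_logits.shape[-1]):
            idx = [i for i, g in enumerate(groups) if g == c]
            if idx:
                p1[..., idx] = F.softmax(fine_logits[..., idx], dim=-1) * p_coarse[..., c:c + 1]

        p2 = p1 + eps * p0                            # residual-like correction: p2 = p1 + eps * p0
        return p2 / p2.sum(dim=-1, keepdim=True)      # p3 = p2 / sum(p2)

For instance, with 9 fine-grained species divided into 3 coarse-grained groups one might pass groups = [0, 0, 0, 0, 1, 1, 1, 2, 2]; this grouping is purely illustrative.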
As shown in fig. 4, the fusion layer fuses the probability matrices output by the two network models to obtain the final marine mammal detection and classification result as follows:
9 neurons are arranged in the fusion layer, and the 9-class probability values output by the two network models (denoted input1 and input2) pass respectively through the 9 neurons and 9 pseudo-neurons;
the class probability values output by one network model are multiplied directly by the 9 neurons, the class probability values output by the other network model are multiplied by the 9 pseudo-neurons, and the two resulting groups of probability values are added and normalized to obtain the final probabilities.
There are likewise 9 pseudo-neurons, the value of each depending on the value of the neuron at the corresponding position, as given by the following formula, where β_Ti denotes a neuron that participates in training and β_Fi denotes a pseudo-neuron that does not participate in training:
β_Ti = 1 - β_Fi
To ensure that the fusion structure has a correct influence, a residual-like structure is again used: input1 and input2 are added to the final output probabilities, and a normalization operation is performed to obtain the final output.
The two-way fusion MG-ResFormer network was tested, and the table below shows that it achieves excellent performance on the nine-class marine mammal audio classification task: ACC, AUC, mAP and F1-score reach 99.09, 99.99, 99.97 and 99.24 respectively. Compared with classification networks commonly used in this field, the network makes clear progress.
Network effect comparison table
Network/index ACC AUC mAP f1_score
MGResFormer 99.09 99.99 99.97 99.24
MGResnet18 97.27 99.93 99.35 96.54
MGTrans 96.36 99.85 98.63 95.76
InceptionV3 94.62 98.97 99.03 94.43
EfficientNet 95.39 99.14 99.02 95.78
As shown in fig. 5, the accuracy of MGResFormer almost always lies above that of MGResnet and MGTransformer, showing that the fusion of the two networks provides complementary gains.
The foregoing describes preferred embodiments of the invention. It should be understood that the invention is not limited to the precise forms disclosed herein, and that various other combinations, modifications and environments falling within the scope of the inventive concept, whether described above or apparent to those skilled in the relevant art, may be adopted. Modifications and variations made by those skilled in the art without departing from the spirit and scope of the invention shall fall within the protection scope of the appended claims.

Claims (7)

1. An effective marine mammal sound automatic detection and classification method, wherein the automatic detection and classification method comprises:
performing data augmentation, with a single-sample variational autoencoder, on publicly available and field-collected marine mammal audio data;
extracting Mel cepstral coefficients and the onset strength envelope from the augmented audio data and splicing them to obtain a first input feature, and extracting an audio fingerprint feature as a second input feature from the augmented audio data by means of audio fingerprint extraction;
feeding the Mel frequency cepstral coefficients and the audio fingerprint features into a two-way fusion MG-ResFormer network, which outputs two sample-class probability predictions, and fusing the two predictions to obtain the final marine mammal detection and classification result.
2. An effective method for automatic detection and classification of marine mammal sounds according to claim 1, wherein: extracting the audio fingerprint feature (the second input feature) from the augmented audio data by means of audio fingerprint extraction comprises:
dividing the audio data into several original subframes of equal size, and applying a Fourier transform to the data of each original subframe to compute its spectrum;
dividing each computed atomic spectrum into several spectral bands, computing an energy block for each band, and combining all energy blocks into a two-dimensional matrix representing the atomic-spectrum energy information;
performing a difference calculation on the two-dimensional matrix and, by comparing each energy block with its neighbouring blocks, obtaining a 0-1 matrix containing only 0s and 1s;
splicing the two-dimensional matrix with the 0-1 matrix, which contains the animals' vocalization information, to obtain the audio fingerprint feature.
3. An effective method for automatic detection and classification of marine mammal sounds according to claim 1, wherein: the two-way fusion MG-ResFormer network comprises an MG-Resnet network model, an MG-Transformer network model and a fusion layer; the Mel frequency cepstral coefficients are input into the MG-Transformer network model to obtain its probability matrix for the sample, the audio fingerprint features are input into the MG-Resnet network model to obtain its probability matrix for the sample, and the fusion layer fuses the probability matrices output by the two network models to obtain the final marine mammal detection and classification result.
4. An effective method for automatic detection and classification of marine mammal sounds according to claim 3, wherein: the MG-Resnet network model comprises five convolutional modules, a pooling layer, two fully connected layers and a coarse-fine granularity combination module; the input audio fingerprint features first pass through a 7 x 7 convolution in the first convolutional module, the second to fifth convolutional modules, each containing two building blocks, perform residual convolutions followed by average pooling, and the two outputs obtained after the two parallel fully connected layers are fed into the coarse-fine granularity combination layer, finally yielding the probability matrix for the sample.
5. An effective method for automatic detection and classification of marine mammal sounds according to claim 4, wherein: the MG-Transformer network model first pools the input Mel frequency cepstral coefficients to reduce the sensitivity of the features to position, then feeds them into an encoding layer where a multi-head attention mechanism extracts different feature signals, segments the features to strengthen the model's attention to global features, extracts coarse-grained and fine-grained probabilities through two linear layers, and finally feeds these into the coarse-fine granularity combination layer to obtain the probability matrix for the sample.
6. An effective marine mammal sound automatic detection and classification method according to claim 3, characterized in that: the fusion layer fusing the probability matrices output by the two network models to obtain the final marine mammal detection and classification result comprises:
arranging 9 neurons in the fusion layer, the 9-class probability values output by the two network models passing respectively through the 9 neurons and 9 pseudo-neurons;
multiplying the class probability values output by one network model directly by the 9 neurons and the class probability values output by the other network model by the 9 pseudo-neurons, then adding the two resulting groups of probability values and normalizing them to obtain the final probabilities.
7. An effective method for automatic detection and classification of marine mammal sounds according to claim 5, wherein: obtaining the probability matrix of the sample through the coarse-fine granularity combination layer in the MG-Transformer network model and the MG-Resnet network model comprises:
mapping tensors whose lengths equal the number of fine-grained classes and the number of coarse-grained classes through two parallel fully connected layers in the MG-Resnet network model, and mapping tensors of the same lengths through two parallel linear layers in the MG-Transformer network model;
the coarse-grained layer of the coarse-fine combination layer applies a softmax function to its input to obtain the probability of the coarse-grained class to which the sample belongs; the fine-grained layer groups its input, placing fine-grained classes belonging to the same coarse-grained class in one group, applies a softmax operation to each group, and finally multiplies the obtained coarse-grained probability by the corresponding fine-grained probability to obtain the class probability information p1;
through a residual-like structure, a softmax operation is also applied directly to the input of the fine-grained layer, and the result is weighted by ε and added to the matrix containing p1, i.e. p2 = p1 + ε * p0, where p0 denotes the probability obtained by applying softmax directly to the input of the fine-grained layer;
finally p2 is normalized to obtain the final probability matrix of the sample, p3 = p2 / Σp2.
CN202210817343.1A 2022-07-12 2022-07-12 Effective marine mammal sound automatic detection and classification method Active CN115188387B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210817343.1A CN115188387B (en) 2022-07-12 2022-07-12 Effective marine mammal sound automatic detection and classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210817343.1A CN115188387B (en) 2022-07-12 2022-07-12 Effective marine mammal sound automatic detection and classification method

Publications (2)

Publication Number Publication Date
CN115188387A true CN115188387A (en) 2022-10-14
CN115188387B CN115188387B (en) 2023-04-07

Family

ID=83517853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210817343.1A Active CN115188387B (en) 2022-07-12 2022-07-12 Effective marine mammal sound automatic detection and classification method

Country Status (1)

Country Link
CN (1) CN115188387B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116318867A (en) * 2023-02-15 2023-06-23 四川农业大学 Resource platform data transmission method based on out-of-order encryption and decryption
CN117174109A (en) * 2023-11-03 2023-12-05 青岛科技大学 Feature extraction-based marine mammal sound signal imitation hidden scoring method
CN117275491A (en) * 2023-11-17 2023-12-22 青岛科技大学 Sound classification method based on audio conversion and time diagram neural network

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030125946A1 (en) * 2002-01-03 2003-07-03 Wen-Hao Hsu Method and apparatus for recognizing animal species from an animal voice
CN103117061A (en) * 2013-02-05 2013-05-22 广东欧珀移动通信有限公司 Method and device for identifying animals based on voice
US9558762B1 (en) * 2011-07-03 2017-01-31 Reality Analytics, Inc. System and method for distinguishing source from unconstrained acoustic signals emitted thereby in context agnostic manner
CN109800700A (en) * 2019-01-15 2019-05-24 哈尔滨工程大学 A kind of underwater sound signal target classification identification method based on deep learning
CN110120224A (en) * 2019-05-10 2019-08-13 平安科技(深圳)有限公司 Construction method, device, computer equipment and the storage medium of bird sound identification model
CN110390952A (en) * 2019-06-21 2019-10-29 江南大学 City sound event classification method based on bicharacteristic 2-DenseNet parallel connection
CN110827837A (en) * 2019-10-18 2020-02-21 中山大学 Whale activity audio classification method based on deep learning
WO2021075709A1 (en) * 2019-10-14 2021-04-22 고려대학교 산학협력단 Apparatus and method for identifying animal species robustly against noisy environment
CN112750442A (en) * 2020-12-25 2021-05-04 浙江弄潮儿智慧科技有限公司 Nipponia nippon population ecosystem monitoring system with wavelet transformation and wavelet transformation method thereof
CN112802484A (en) * 2021-04-12 2021-05-14 四川大学 Panda sound event detection method and system under mixed audio frequency
CN112927694A (en) * 2021-03-08 2021-06-08 中国地质大学(武汉) Voice instruction validity judging method based on fusion voiceprint features
CN113345443A (en) * 2021-04-22 2021-09-03 西北工业大学 Marine mammal vocalization detection and identification method based on mel-frequency cepstrum coefficient
US20220036053A1 (en) * 2018-10-16 2022-02-03 Korea University Research And Business Foundation Method and apparatus for identifying animal species

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030125946A1 (en) * 2002-01-03 2003-07-03 Wen-Hao Hsu Method and apparatus for recognizing animal species from an animal voice
US9558762B1 (en) * 2011-07-03 2017-01-31 Reality Analytics, Inc. System and method for distinguishing source from unconstrained acoustic signals emitted thereby in context agnostic manner
CN103117061A (en) * 2013-02-05 2013-05-22 广东欧珀移动通信有限公司 Method and device for identifying animals based on voice
US20220036053A1 (en) * 2018-10-16 2022-02-03 Korea University Research And Business Foundation Method and apparatus for identifying animal species
CN109800700A (en) * 2019-01-15 2019-05-24 哈尔滨工程大学 A kind of underwater sound signal target classification identification method based on deep learning
CN110120224A (en) * 2019-05-10 2019-08-13 平安科技(深圳)有限公司 Construction method, device, computer equipment and the storage medium of bird sound identification model
CN110390952A (en) * 2019-06-21 2019-10-29 江南大学 City sound event classification method based on bicharacteristic 2-DenseNet parallel connection
WO2021075709A1 (en) * 2019-10-14 2021-04-22 고려대학교 산학협력단 Apparatus and method for identifying animal species robustly against noisy environment
CN110827837A (en) * 2019-10-18 2020-02-21 中山大学 Whale activity audio classification method based on deep learning
CN112750442A (en) * 2020-12-25 2021-05-04 浙江弄潮儿智慧科技有限公司 Nipponia nippon population ecosystem monitoring system with wavelet transformation and wavelet transformation method thereof
CN112927694A (en) * 2021-03-08 2021-06-08 中国地质大学(武汉) Voice instruction validity judging method based on fusion voiceprint features
CN112802484A (en) * 2021-04-12 2021-05-14 四川大学 Panda sound event detection method and system under mixed audio frequency
CN113345443A (en) * 2021-04-22 2021-09-03 西北工业大学 Marine mammal vocalization detection and identification method based on mel-frequency cepstrum coefficient

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHE YONG YEO ET AL: "Animal voice recognition for identification (ID) detection system", 2011 IEEE 7TH INTERNATIONAL COLLOQUIUM ON SIGNAL PROCESSING AND ITS APPLICATIONS *
SONJA SCHALL ET AL: "Voice Identity Recognition: Functional Division of the Right STS and Its Behavioral Relevance", JOURNAL OF COGNITIVE NEUROSCIENCE *
曹晏飞 (CAO Yanfei): "Research on sound classification and extraction methods for laying hens under complex background", CNKI Doctoral Dissertation Full-text Database *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116318867A (en) * 2023-02-15 2023-06-23 四川农业大学 Resource platform data transmission method based on out-of-order encryption and decryption
CN116318867B (en) * 2023-02-15 2023-11-28 四川农业大学 Resource platform data transmission method based on out-of-order encryption and decryption
CN117174109A (en) * 2023-11-03 2023-12-05 青岛科技大学 Feature extraction-based marine mammal sound signal imitation hidden scoring method
CN117174109B (en) * 2023-11-03 2024-02-02 青岛科技大学 Feature extraction-based marine mammal sound signal imitation hidden scoring method
CN117275491A (en) * 2023-11-17 2023-12-22 青岛科技大学 Sound classification method based on audio conversion and time diagram neural network
CN117275491B (en) * 2023-11-17 2024-01-30 青岛科技大学 Sound classification method based on audio conversion and time attention seeking neural network

Also Published As

Publication number Publication date
CN115188387B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN115188387B (en) Effective marine mammal sound automatic detection and classification method
Stöter et al. Countnet: Estimating the number of concurrent speakers using supervised learning
CN107731233B (en) Voiceprint recognition method based on RNN
Cakir et al. Multi-label vs. combined single-label sound event detection with deep neural networks
Samizade et al. Adversarial example detection by classification for deep speech recognition
CN110148408A (en) A kind of Chinese speech recognition method based on depth residual error
CN117095694B (en) Bird song recognition method based on tag hierarchical structure attribute relationship
Hussain et al. Swishnet: A fast convolutional neural network for speech, music and noise classification and segmentation
Leonid et al. Classification of Elephant Sounds Using Parallel Convolutional Neural Network.
CN113191178A (en) Underwater sound target identification method based on auditory perception feature deep learning
Yang et al. Classification of odontocete echolocation clicks using convolutional neural network
Qiao et al. Sub-spectrogram segmentation for environmental sound classification via convolutional recurrent neural network and score level fusion
Hou et al. Transfer learning for improving singing-voice detection in polyphonic instrumental music
CN114863938A (en) Bird language identification method and system based on attention residual error and feature fusion
Liu Deep convolutional and LSTM neural networks for acoustic modelling in automatic speech recognition
Bai et al. Multimodal urban sound tagging with spatiotemporal context
Xiao et al. AMResNet: An automatic recognition model of bird sounds in real environment
CN116771662A (en) Machine pump fault diagnosis method based on multi-feature fusion
Pellegrini Deep-learning-based central African primate species classification with MixUp and SpecAugment
Xie et al. Acoustic feature extraction using perceptual wavelet packet decomposition for frog call classification
Singh et al. Speaker Identification Analysis Based on Long-Term Acoustic Characteristics with Minimal Performance
Kek et al. Acoustic scene classification using bilinear pooling on time-liked and frequency-liked convolution neural network
Bai et al. A multi-feature fusion based method for urban sound tagging
Bai et al. CIAIC-BAD system for DCASE2018 challenge task 3
Segarceanu et al. Environmental acoustics modelling techniques for forest monitoring

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant