CN115188387B - Effective marine mammal sound automatic detection and classification method - Google Patents

Effective marine mammal sound automatic detection and classification method

Info

Publication number
CN115188387B
CN115188387B (application CN202210817343.1A; also published as CN115188387A)
Authority
CN
China
Prior art keywords
probability
layer
grained
coarse
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210817343.1A
Other languages
Chinese (zh)
Other versions
CN115188387A (en)
Inventor
李丹阳
李军
蒋凯林
郑兴泽
李焦
明扬
李林成
谢天宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Agricultural University
Original Assignee
Sichuan Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Agricultural University filed Critical Sichuan Agricultural University
Priority to CN202210817343.1A priority Critical patent/CN115188387B/en
Publication of CN115188387A publication Critical patent/CN115188387A/en
Application granted granted Critical
Publication of CN115188387B publication Critical patent/CN115188387B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L17/26 - Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an effective automatic detection and classification method for marine mammal sounds. The method performs data enhancement on marine mammal audio data with a single-sample variational autoencoder; extracts Mel cepstrum coefficients from the audio data and splices them with the initial intensity envelope to obtain the first input feature, and extracts an audio fingerprint feature as the second input feature by means of audio fingerprint extraction; the Mel frequency cepstrum coefficients and the audio fingerprint features are then input into a two-way fusion network to obtain two branch predictions, and the two branch predictions are fused to obtain the final prediction result for marine mammal detection and classification. Through the two-way parallel fusion network structure, the fused network can simultaneously capture high-dimensional features and exploit time-sequence information, and the complementarity between models, each attending to different information, is exploited to improve model performance efficiently.

Description

Effective marine mammal sound automatic detection and classification method
Technical Field
The invention relates to the technical field of marine acoustics application, in particular to an effective automatic detection and classification method for marine mammal sound.
Background
Statistically, 2.1% of the world's mammal species have become extinct since the year 1600. Expert statistical analysis shows that the rate of species extinction keeps accelerating and is now roughly 100 to 1,000 times the previously estimated background rate. Among the roughly 1.739 million species recorded worldwide, there are about 130 species of marine mammals, of which more than 90 are cetaceans (whales and dolphins) and nearly 40 belong to other groups such as the pinnipeds and sirenians. Of the 20,278 species of organisms recorded in China's sea areas, nearly 50 species of aquatic mammals (including introduced species) have been found, among which 41 cetacean species, 5 pinniped species (excluding introduced species) and 1 sirenian species have been recorded. Marine mammals are among the most endangered species in nature and are listed as protected animals in almost every country in the world.
Although marine mammals are few in number, the role they play in maintaining the balance of the marine ecosystem is far from trivial, and their protection is of paramount importance. In recent years the survival of marine mammals has continued to face severe challenges owing to poorly understood species resources, degraded habitats, water pollution and similar problems. Traditional manual identification of these mammals is laborious, inefficient and time-consuming, incurs high material and labour costs, produces data that are huge and difficult to process, cannot monitor marine mammals continuously, suffers from time lag and carries a degree of danger. Sample collection is highly contingent and random and requires long periods of sample accumulation; field surveys and passive acoustic monitoring demand large investments of manpower, money and material, and are often difficult to carry out quickly, frequently and on a large scale. Since many marine mammals inhabit sea areas rarely visited by humans and are highly mobile, identification by humans is very difficult. How to detect, identify and classify marine mammals is therefore a pressing question.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing an effective automatic detection and classification method for marine mammal sounds, thereby addressing the problems of traditional manual mammal identification.
The purpose of the invention is realized by the following technical scheme: an efficient automatic detection and classification method for marine mammal sounds, said automatic detection and classification method comprising:
carrying out data enhancement processing, by means of a single-sample variational autoencoder, on publicly available and field-collected marine mammal audio data;
extracting a Mel cepstrum coefficient and an initial intensity envelope from the audio data after data enhancement, performing feature splicing to obtain a first input feature, and extracting an audio fingerprint feature of a second input feature from the audio data after data enhancement in an audio fingerprint extraction mode;
inputting the Mel frequency cepstrum coefficients and the audio fingerprint features into a two-way fusion MG-ResFormer network, obtaining two branch predictions of the sample class probability, and fusing the two branch predictions to obtain the final prediction result for marine mammal detection and classification.
The extracting of the audio fingerprint feature of the second input feature from the data-enhanced audio data in the audio fingerprint extraction manner includes:
dividing audio data into a plurality of original subframes with the same size, and carrying out Fourier transform on the data of the original subframes to calculate the frequency spectrum information of the original subframes;
dividing the atomic spectrum obtained by calculation into a plurality of spectral bands, calculating each spectral band to obtain an energy block, and combining all the energy blocks to obtain a two-dimensional matrix representing atomic spectrum energy information;
performing differential calculation on the two-dimensional matrix, and acquiring a 01 matrix only containing 0 and 1 by capturing each energy block and adjacent energy blocks;
and splicing the two-dimensional matrix and the 01 matrix containing the biological sound production information to obtain the audio fingerprint characteristics.
The two-way fusion MG-ResFormer network comprises an MG-Resnet network model, an MG-Transformer network model and a fusion layer; the Mel frequency cepstrum coefficients are input into the MG-Transformer network model to obtain that model's probability matrix for the sample, the audio fingerprint features are input into the MG-Resnet network to obtain that model's probability matrix for the sample, and the fusion layer fuses the probability matrices output by the two network models to obtain the final prediction result for marine mammal detection and classification.
The MG-Resnet network model comprises five convolutional layer modules, a pooling layer, two fully connected layers and a coarse-fine granularity combination module. The input audio fingerprint features first undergo a 7 x 7 convolution in the first convolutional layer module; the second to fifth convolutional layers, each containing two basic residual blocks (build-blocks), then perform residual convolution, followed by average pooling; the two resulting outputs pass through two parallel fully connected layers and are fed into the coarse-fine granularity combination layer, which finally yields the probability matrix for the sample.
The MG-Transformer network model firstly pools input Mel frequency cepstrum coefficients to reduce the sensitivity of characteristics to different positions, then inputs the coefficients to an encoding layer to extract different characteristic signals through a multi-head attention mechanism, segments the characteristics to enhance the attention of the model to global characteristics, respectively extracts coarse granularity and fine granularity probabilities through two linear layers, and finally inputs the probabilities to a coarse and fine granularity combination layer to obtain a probability matrix of a sample.
The fusion layer fuses probability matrixes output by the two network models to obtain a final prediction result of the detection and classification of the marine mammals, and the final prediction result comprises the following steps:
9 neurons are arranged in the fusion layer, and the probability values of 9 types output by the two network models respectively pass through the 9 neurons and the 9 pseudo neurons;
the class probability value output by one network model is directly multiplied by 9 neurons, the class probability value output by the other network model is multiplied by 9 pseudo neurons, and the two groups of obtained probability values are added to carry out normalization operation to obtain the final probability.
The obtaining of the probability matrix of the sample through the coarse-fine granularity combination layer in the MG-Transformer network model and the MG-Resnet network model comprises the following steps:
tensors with the lengths of the fine-grained type quantity and the coarse-grained type quantity are respectively mapped through two parallel full connection layers in the MG-Resnet network model, and tensors with the lengths of the fine-grained type quantity and the coarse-grained type quantity are respectively mapped through two parallel linear layers in the MG-Transformer network model;
the coarse-grained layer in the coarse-fine granularity combination layer applies a softmax function to its input data to obtain the probability of the coarse-grained category to which the sample belongs; the fine-grained layer groups its input data, placing fine-grained classes that belong to the same coarse-grained category in one group, applies a softmax operation within each group, and finally multiplies the obtained coarse-grained probability by the corresponding fine-grained probability to obtain the category probability information p1;
the input of the fine-grained layer is also passed directly through a softmax operation via a residual-like structure, a weight ε is assigned, and the result is combined with the matrix containing p1 as p2 = p1 + ε * p0, where p0 denotes the probability obtained by applying softmax directly to the input of the fine-grained layer;
finally p2 is normalized to obtain the final probability matrix of the sample, p3 = p2 / Σp2.
The invention has the following advantages. Unlike traditional features, the constructed audio fingerprint information contains a large amount of energy, frequency and time-sequence information of the sound signal together with the distinctive vocalisation patterns of different species, and shows its advantage within a convolutional neural network. A multi-granularity combination layer is constructed to assist the multi-class task: following the taxonomic division of species (kingdom, phylum, class, order, family, genus, species), a coarse-grained layer and a fine-grained layer are built for the species to be identified, the coarse-grained layer corresponding to the "family" level and the fine-grained layer to the "genus" level, so that the decision of the fine-grained layer is consolidated by the prior judgment of the coarse granularity; this multi-granularity fusion layer is highly general and applicable to other studies. Through the two-way parallel fusion network structure, the fused network can simultaneously capture high-dimensional features and exploit time-sequence information, and the complementarity between models, each attending to different information, is exploited to improve model performance efficiently.
Drawings
FIG. 1 is a schematic structural diagram of a two-way MG-ResFormer network of the present invention;
FIG. 2 is a schematic diagram of a coarse-fine grain size composite layer;
FIG. 3 is a schematic diagram of the MG-Resnet network model;
FIG. 4 is a schematic structural view of a fused layer;
FIG. 5 is a comparison of the effects of various network models.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application provided below in connection with the appended drawings is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, an effective automatic detection and classification method for marine mammal sounds comprises:
s1, carrying out data enhancement processing on the audio data of the existing open and field collected marine mammals through a single-sample variation self-encoder;
wherein the disclosed marine mammal audio data is class 9 marine mammal information disclosed in a waters marine mammal sound database providing marine mammal sound recordings from 1940 to 2000; in order to facilitate the generation of audio fingerprints and the same network input data scale, audio with different lengths is averagely divided into 2s of audio;
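By way of illustration only, the following sketch splits a recording into non-overlapping 2-second clips with librosa and NumPy; the file name, sampling rate and handling of the trailing remainder are assumptions, not the patented implementation.

```python
import librosa
import numpy as np

def split_into_clips(path, clip_seconds=2.0, sr=22050):
    """Load a recording and cut it into equal, non-overlapping clips."""
    audio, sr = librosa.load(path, sr=sr)          # resample to a common rate
    clip_len = int(clip_seconds * sr)              # samples per 2-second clip
    n_clips = len(audio) // clip_len               # drop the trailing remainder
    return np.reshape(audio[:n_clips * clip_len], (n_clips, clip_len))

# clips = split_into_clips("humpback_example.wav")  # hypothetical file name
```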
in order to prevent the Sample distribution from causing negative influence on the learning of the model, the phenomena of blurring and chaos (meaning that the image is seriously overlapped due to posterior collapse) which may occur in an image generated by a common VAE (Plain VAE) are solved by a single Sample variation auto-encoder (S3 VAE).
S2, extracting a Mel cepstrum coefficient and an initial intensity envelope from the audio data after data enhancement, performing feature splicing to obtain a first input feature, and extracting an audio fingerprint feature of a second input feature from the audio data after data enhancement in an audio fingerprint extraction mode;
further, audio feature extraction can reduce the sampling signal of the original waveform, thereby accelerating the understanding of the semantic meaning in the audio by the machine. In order to obtain the audio features with the best effect, 9 features of the audio data mainstream are extracted: chromatographic information, constant Q chromatographic information, normalized chromatographic information, mel-frequency spectral information, mel-cepstral information, spectral contrast, tonal centroid, local autocorrelation of the initial intensity envelope, fourier-velocimetry.
First, pre-emphasis is applied to the samples to raise the energy of the high-frequency part of the signal; given a time-domain input signal x[n], the pre-emphasized signal is y[n] = x[n] - α·x[n-1], with 0.9 ≤ α ≤ 1.0.
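A minimal NumPy rendering of this pre-emphasis step (α is a free parameter within the stated range) is:

```python
import numpy as np

def pre_emphasize(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1], with 0.9 <= alpha <= 1.0."""
    y = np.copy(x).astype(float)
    y[1:] -= alpha * x[:-1]
    return y
```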
To facilitate the extraction of the subsequent features, the signal value at the window boundary should be brought close to 0 so that each frame approximates a periodic signal; a tapering window function is therefore applied (its formula is given as an image in the original publication).
After signal preprocessing, the nine features above are extracted with the open-source library librosa in Python. Because each feature attends to different dimensions of the signal, the features were combined, the combinations were trained and their relevance verified, and two features were finally selected and spliced as one of the inputs: the Mel-frequency cepstral coefficients (MFCCs) and the initial intensity (onset strength) envelope.
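A sketch of how the two selected features could be extracted and spliced with librosa is given below; the frame parameters, the number of MFCCs and the splicing axis are assumptions rather than the patented configuration.

```python
import librosa
import numpy as np

def first_input_feature(clip, sr=22050, n_mfcc=20):
    """Splice MFCCs with the onset strength ("initial intensity") envelope."""
    mfcc = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    onset = librosa.onset.onset_strength(y=clip, sr=sr)         # (frames,)
    frames = min(mfcc.shape[1], onset.shape[0])                 # align frame counts
    return np.vstack([mfcc[:, :frames], onset[None, :frames]])  # (n_mfcc+1, frames)
```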
S3, inputting the Mel frequency cepstrum coefficients and the audio fingerprint features into the two-way fusion MG-ResFormer network, obtaining two branch predictions of the sample class probability, and fusing the two branch predictions to obtain the final prediction result for marine mammal detection and classification.
Further, as shown in fig. 2, extracting the audio fingerprint feature of the second input feature from the data-enhanced audio data by means of audio fingerprint extraction includes:
dividing audio data into a plurality of original subframes with the same size, and carrying out Fourier transform on the data of the original subframes to calculate the frequency spectrum information of the original subframes;
dividing the atomic spectrum obtained by calculation into a plurality of spectral bands, calculating each spectral band to obtain an energy block, and combining all the energy blocks to obtain a two-dimensional matrix representing atomic spectrum energy information;
performing differential calculation on the two-dimensional matrix, and acquiring a 01 matrix only containing 0 and 1 by capturing each energy block and adjacent energy blocks;
and splicing the two-dimensional matrix and the 01 matrix containing the biological sound production information to obtain the audio fingerprint characteristics.
Specifically, conventional audio features such as the Mel spectrogram, MFCC and chromagram are usually related to only a single piece of information in the voiceprint. The invention therefore constructs a new voiceprint feature that contains frequency, energy and time-sequence information, so as to strengthen the weak terms of a single signal while also reflecting the distinctiveness of different species' vocalisations.
The invention also reduces, as far as possible, the influence of audio that is too long or too short on the final classification. Therefore, in constructing the audio fingerprint, an atomic frame stream strategy is used: the original audio is divided into original subframes of the same size, a series of transformations is applied to these atomic frames to obtain atomic features, and the final feature is obtained by combining the atomic features. A single atomic feature may not contain enough information to support identification on its own, but in normal circumstances the audio to be identified is made up of hundreds or thousands of original subframes, which contain enough atomic frames for efficient and reliable identification.
Then a Fourier transform is applied to the data of each original subframe to calculate its spectrum information:

F(e^{jω}) = ∫ x(t)·e^{-jωt} dt = a + ib,  |F(e^{jω})| = √(a² + b²)

where j is the imaginary unit, ω is the angular frequency and t is time. To make the energy information more accurate, the atomic spectrum is divided into 65 spectral bands, and an energy block is computed for each band by accumulating the spectral energy within it (the exact band formula is given as an image in the original publication).
The energy information of the atomic spectrum is then obtained by combining the 65 energy blocks. The energy information of successive atomic spectra forms a two-dimensional matrix, so a certain amount of time-sequence information is preserved, which is an important characteristic of the audio fingerprint.
A differential calculation is then performed on the two-dimensional matrix, and a matrix containing only 0s and 1s is obtained by comparing each energy block with its adjacent blocks (the differential thresholding formula is given as an image in the original publication); here m and n denote the coordinates of the two-dimensional matrix, n being the coordinate in the x direction and m the coordinate in the y direction, so that the position of each energy block can be located. This 01 matrix contains the vocalisation information of the animals, so it is spliced with the two-dimensional energy matrix to obtain the final fingerprint feature.
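Assuming equal-width spectral bands and a Haitsma-Kalker-style differential threshold for the 0/1 matrix (neither of which is spelled out above), the fingerprint construction could be sketched as follows:

```python
import numpy as np

def audio_fingerprint(clip, frame_len=1024, n_bands=65):
    """Energy matrix over 65 spectral bands plus a 0/1 differential matrix."""
    n_frames = len(clip) // frame_len
    frames = clip[:n_frames * frame_len].reshape(n_frames, frame_len)  # original subframes
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2                # atomic spectra
    bands = np.array_split(spectrum, n_bands, axis=1)
    energy = np.stack([b.sum(axis=1) for b in bands], axis=1)          # (frames, 65)
    # Differential 0/1 matrix: compare each energy block with its time and band neighbours.
    diff = (energy[1:, :-1] - energy[1:, 1:]) - (energy[:-1, :-1] - energy[:-1, 1:])
    bits = (diff > 0).astype(np.float32)                               # (frames-1, 64)
    return energy, bits  # the two matrices are spliced to form the fingerprint feature
```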
Further, as shown in fig. 1, the two-way fusion MG-ResFormer network includes an MG-Resnet network model, an MG-Transformer network model and a fusion layer. The Mel frequency cepstrum coefficients are input into the MG-Transformer network model to obtain that model's probability matrix for the sample, the audio fingerprint features are input into the MG-Resnet network to obtain that model's probability matrix for the sample, and the fusion layer fuses the two probability matrices to obtain the final prediction result for marine mammal detection and classification.
The audio fingerprint features of the audio are input into the MG-Resnet network model, whose loss function is recorded as Loss1; the Mel frequency cepstrum coefficient features are input into the MG-Transformer network model, whose loss function is recorded as Loss2. In the fusion module, the probabilities output by the MG-Resnet and MG-Transformer models are used as input and fitted to the one-hot encoded label information, and this loss is recorded as Loss3. The overall loss function is

Loss = Loss1 + Loss2 + Loss3

The fingerprint feature of an audio sample and the spliced feature of Mel cepstrum coefficients (MFCC) and the initial intensity envelope are input into MG-Resnet and MG-Transformer respectively; after forward propagation through the two branches, two 1 x 9 matrices predicting the sample class probability are output, and at this point the two prediction matrices are frozen. Because the two matrices are produced by the computations of the two branches, backpropagating Loss3 would otherwise change the branch parameters; since Loss1 and Loss2 already update the two branch networks, freezing prevents the backpropagation of Loss3 from adversely affecting them.
The loss gradients are obtained by backpropagation, so the loss calculation process remains correct; after the partial derivatives are taken, Loss1, Loss2 and Loss3 are each responsible for backpropagation through the module to which they belong, as sketched below.
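A rough sketch of this training arrangement, in which the fusion layer receives the two branch outputs detached (frozen) so that Loss3 does not backpropagate into the branches, is given below. The criterion and the forward interfaces of the three modules are assumptions; here the branch outputs are treated as class scores against integer labels rather than the exact one-hot fitting used in the patent.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # stand-in for the loss fitted to the labels

def training_step(mg_resnet, mg_transformer, fusion, fingerprint, mfcc_feat, labels):
    """Loss = Loss1 + Loss2 + Loss3, with the fusion inputs detached."""
    p_res = mg_resnet(fingerprint)        # 1 x 9 prediction, fingerprint branch
    p_trans = mg_transformer(mfcc_feat)   # 1 x 9 prediction, MFCC branch
    loss1 = criterion(p_res, labels)
    loss2 = criterion(p_trans, labels)
    # Freeze (detach) the two prediction matrices before fusion so that loss3's
    # backpropagation does not disturb the branch parameters.
    p_final = fusion(p_res.detach(), p_trans.detach())
    loss3 = criterion(p_final, labels)
    return loss1 + loss2 + loss3
```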
As shown in fig. 3, the MG-Resnet network model includes five convolutional layer modules, a pooling layer, two fully connected layers and a coarse-fine granularity combination module. The input audio fingerprint features first undergo a 7 x 7 convolution in the first convolutional layer module; the second to fifth convolutional layers, each containing two basic residual blocks (build-blocks), then perform residual convolution, followed by average pooling; the two resulting outputs pass through two parallel fully connected layers and are fed into the coarse-fine granularity combination layer, which finally yields the probability matrix for the sample.
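A condensed PyTorch sketch of this layout follows; the channel width, the number of coarse-grained classes and the constant-resolution stages are simplifying assumptions, and BasicBlock is the standard two-convolution residual block rather than the exact patented module.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Standard two-convolution residual block (a 'build-block')."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm2d(ch), nn.BatchNorm2d(ch)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)

class MGResnetSketch(nn.Module):
    """ResNet-18-style trunk with two parallel heads (fine / coarse granularity)."""
    def __init__(self, n_fine=9, n_coarse=3, ch=64):
        super().__init__()
        self.stem = nn.Conv2d(1, ch, kernel_size=7, stride=2, padding=3)   # 7x7 convolution
        self.stages = nn.Sequential(*[nn.Sequential(BasicBlock(ch), BasicBlock(ch))
                                      for _ in range(4)])                   # conv2 .. conv5
        self.pool = nn.AdaptiveAvgPool2d(1)                                 # average pooling
        self.fc_fine = nn.Linear(ch, n_fine)      # parallel fully connected heads
        self.fc_coarse = nn.Linear(ch, n_coarse)

    def forward(self, x):                          # x: (batch, 1, H, W) fingerprint matrix
        h = self.pool(self.stages(torch.relu(self.stem(x)))).flatten(1)
        return self.fc_fine(h), self.fc_coarse(h)  # fed to the coarse-fine combination layer
```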
The MG-Transformer network model first pools the input Mel frequency cepstrum coefficients to reduce the sensitivity of the features to position, then feeds them into an encoding layer that extracts different feature signals through a multi-head attention mechanism; the features are segmented to strengthen the model's attention to global features, coarse-grained and fine-grained probabilities are extracted through two linear layers, and finally these are input into the coarse-fine granularity combination layer to obtain the probability matrix of the sample. This design keeps the model's recognition performance high on large call-classification tasks, allows it to perceive global features better, and gives it superior performance on multi-class tasks.
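A comparable sketch of the MG-Transformer branch is given below; the model width, number of heads and layers, and the input feature dimension (20 MFCCs plus the envelope) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class MGTransformerSketch(nn.Module):
    """Pooling + Transformer encoder + two parallel linear heads (a rough sketch)."""
    def __init__(self, feat_dim=21, d_model=64, n_fine=9, n_coarse=3):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=2)            # reduce positional sensitivity
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # multi-head attention
        self.fine = nn.Linear(d_model, n_fine)              # fine-grained scores
        self.coarse = nn.Linear(d_model, n_coarse)          # coarse-grained scores

    def forward(self, x):                                   # x: (batch, feat_dim, frames)
        x = self.pool(x).transpose(1, 2)                    # -> (batch, frames/2, feat_dim)
        h = self.encoder(self.proj(x)).mean(dim=1)          # pooled over time
        return self.fine(h), self.coarse(h)                 # fed to the combination layer
```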
As shown in fig. 2, obtaining the probability matrix of the sample through the coarse-fine granularity combination layer in the MG-Transformer network model and the MG-Resnet network model comprises the following steps:
tensors whose lengths equal the number of fine-grained classes and the number of coarse-grained classes are mapped through two parallel fully connected layers in the MG-Resnet network model, and through two parallel linear layers in the MG-Transformer network model; the data used here still carry only one label, the fine-grained label, and coarse-grained information is captured by means of the coarse-grained layer.
The coarse-grained layer in the coarse-fine granularity combination layer applies a softmax function to its input data to obtain the probability of the coarse-grained category to which the sample belongs; the fine-grained layer groups its input data, placing fine-grained classes that belong to the same coarse-grained category in one group, applies a softmax operation within each group, and finally multiplies the coarse-grained probability by the corresponding fine-grained probability to obtain the category probability information. For example, the probability of the killer-whale class is computed as

p1(Killer whale) = p(Killer whale | Whale) * p(Whale)

In addition, the special case must be considered in which the coarse-grained layer outputs an incorrect coarse-grained classification probability. Although this probability is extremely low, a residual-like structure is used: a softmax operation is applied directly to the input of the fine-grained layer, a weight ε is assigned, and the result is added to the matrix containing p1, i.e.

p2(Killer whale) = p1(Killer whale) + ε * p0(Killer whale)

where p0(Killer whale) denotes the probability obtained by applying softmax directly to the input of the fine-grained layer.
Finally, p2(Killer whale) is normalized to obtain the final probability of the sample, p3(Killer whale) = p2(Killer whale) / Σp2.
In the actual training process the neural network gradually learns the correct features: the more correctly a sample's coarse-grained category is classified, the more likely its fine-grained category is assigned correctly. Although the invention only uses fine-grained labels, the final classification effect is still very satisfactory. A sketch of this combination step follows.
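The sketch below implements the combination described above, under the assumption of 9 fine-grained and 3 coarse-grained classes, an illustrative grouping, and an illustrative value of ε.

```python
import torch

def coarse_fine_combine(fine_scores, coarse_scores, groups, eps=0.1):
    """Combine coarse- and fine-grained scores; groups[i] is the coarse class of fine class i."""
    groups = torch.as_tensor(groups)
    p_coarse = torch.softmax(coarse_scores, dim=-1)              # P(coarse class)
    p_fine_given_coarse = torch.zeros_like(fine_scores)
    for g in groups.unique():                                    # softmax within each group
        idx = (groups == g)
        p_fine_given_coarse[..., idx] = torch.softmax(fine_scores[..., idx], dim=-1)
    p1 = p_fine_given_coarse * p_coarse[..., groups]             # p1 = P(fine|coarse) * P(coarse)
    p0 = torch.softmax(fine_scores, dim=-1)                      # residual-like direct softmax
    p2 = p1 + eps * p0                                           # p2 = p1 + eps * p0
    return p2 / p2.sum(dim=-1, keepdim=True)                     # p3 = p2 / sum(p2)

# Example (illustrative grouping of 9 fine classes into 3 coarse classes):
# p3 = coarse_fine_combine(fine_scores, coarse_scores, groups=[0, 0, 0, 1, 1, 1, 2, 2, 2])
```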
As shown in fig. 4, the fusion layer fusing the probability matrices output by the two network models to obtain the final prediction result for marine mammal detection and classification comprises the following steps:
9 neurons are arranged in the fusion layer, and the probability values of the 9 categories output by the two network models (denoted input1 and input2 respectively) pass through the 9 neurons and the 9 pseudo-neurons respectively;
the class probability value output by one network model is directly multiplied by 9 neurons, the class probability value output by the other network model is multiplied by 9 pseudo neurons, and the two groups of obtained probability values are added to carry out normalization operation to obtain the final probability.
There are also 9 pseudo-neurons, the value of each of which depends on the value of the neuron at the corresponding position, according to the formula below, where β_Ti denotes a neuron that participates in training and β_Fi a pseudo-neuron that does not participate in training:

β_Ti = 1 - β_Fi
in order to ensure that the fusion structure has correct influence, a similar residual error structure is also used, the final output probability is added to the input1 and input2, and the normalization operation is performed to obtain the final output.
The two-way fusion MG-ResFormer network was tested; the table below shows that it performs extremely well on the nine-class marine mammal audio classification task. On this task the ACC, AUC, mAP and f1_score reach 99.09, 99.99, 99.97 and 99.24 respectively. Compared with classification networks commonly used in the same field, the network makes obvious progress.
Network effect comparison table
Network/index | ACC | AUC | mAP | f1_score
MGResFormer | 99.09 | 99.99 | 99.97 | 99.24
MGResnet18 | 97.27 | 99.93 | 99.35 | 96.54
MGTrans | 96.36 | 99.85 | 98.63 | 95.76
InceptionV3 | 94.62 | 98.97 | 99.03 | 94.43
EfficientNet | 95.39 | 99.14 | 99.02 | 95.78
As shown in fig. 5, the accuracy of MGResFormer lies almost everywhere above that of MGResnet and MGTransformer, which proves that the two fused networks complement and promote each other.
The foregoing is illustrative of the preferred embodiments of the present invention, and it is to be understood that the invention is not limited to the precise form disclosed herein and is not to be construed as limited to the exclusion of other embodiments, and that various other combinations, modifications, and environments may be used and modifications may be made within the scope of the concepts described herein, either by the above teachings or the skill or knowledge of the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. An effective method for automatically detecting and classifying marine mammal sounds, comprising: the automatic detection and classification method comprises the following steps:
carrying out data enhancement processing, by a single-sample variational autoencoder, on audio data in the Watkins Marine Mammal Sound Database and on marine mammal audio data collected in the field;
extracting a Mel cepstrum coefficient and an initial intensity envelope from the audio data after data enhancement, performing feature splicing to obtain a first input feature, and extracting an audio fingerprint feature of a second input feature from the audio data after data enhancement in an audio fingerprint extraction mode;
inputting the Mel frequency cepstrum coefficient and the audio fingerprint characteristics into a two-way fusion MG-ResFormer network, outputting to obtain two ways of results for predicting the sample class probability, and fusing the two ways of prediction results to obtain the final prediction result for detecting and classifying the marine mammals.
2. An effective method for automatic detection and classification of marine mammal sounds according to claim 1, wherein: the extracting of the audio fingerprint feature of the second input feature from the data-enhanced audio data in the audio fingerprint extraction manner includes:
dividing audio data into a plurality of original subframes with the same size, and performing Fourier transform on data of the original subframes to calculate frequency spectrum information of the data;
dividing the atomic spectrum obtained by calculation into a plurality of spectral bands, calculating each spectral band to obtain an energy block, and combining all the energy blocks to obtain a two-dimensional matrix representing atomic spectrum energy information;
performing differential calculation on the two-dimensional matrix, and acquiring a 01 matrix only containing 0 and 1 by capturing each energy block and adjacent energy blocks;
and splicing the two-dimensional matrix and the 01 matrix containing the biological sound production information to obtain the audio fingerprint characteristics.
3. An effective method for automatic detection and classification of marine mammal sounds according to claim 1, wherein: the two-way fusion MG-ResFormer network comprises an MG-Resnet network model, an MG-Transformer network model and a fusion layer; the Mel frequency cepstrum coefficient is input into an MG-transform network model to obtain a probability matrix of the MG-transform network model to a sample, the audio fingerprint characteristics are input into an MG-Resnet network to obtain a probability matrix of the MG-Resnet network model to the sample, and the fusion layer fuses the probability matrices output by the two network models to obtain a final prediction result of detection and classification of the marine mammals.
4. An effective marine mammal sound automatic detection and classification method according to claim 3, characterized in that: the MG-Resnet network model comprises five convolutional layer modules, a pooling layer, two full-connection layers and a coarse-fine granularity combination module, wherein the input audio fingerprint characteristics are subjected to 7 x 7 convolution through the first convolutional layer module, the average pooling is carried out after residual convolution is carried out on a second convolutional layer to a fifth convolutional layer respectively containing two build-blocks, then two obtained outputs are input to the coarse-fine granularity combination layer after the two parallel full-connection layers are passed, and finally a probability matrix of a sample is obtained.
5. An effective marine mammal sound automatic detection and classification method according to claim 4, wherein: the MG-Transformer network model firstly pools input Mel frequency cepstrum coefficients to reduce the sensitivity of characteristics to different positions, then inputs the coefficients to an encoding layer to extract different characteristic signals through a multi-head attention mechanism, segments the characteristics to enhance the attention of the model to global characteristics, respectively extracts coarse granularity and fine granularity probabilities through two linear layers, and finally inputs the probabilities to a coarse and fine granularity combination layer to obtain a probability matrix of a sample.
6. An effective method for automatic detection and classification of marine mammal sounds according to claim 3, wherein: the fusion layer fuses the probability matrixes output by the two network models to obtain the final prediction result of the detection and classification of the marine mammals, and the final prediction result comprises the following steps:
9 neurons are arranged in the fusion layer, and the probability values of 9 types output by the two network models respectively pass through the 9 neurons and the 9 pseudo neurons;
the class probability value output by one network model is directly multiplied by 9 neurons, the class probability value output by the other network model is multiplied by 9 pseudo neurons, and the two groups of obtained probability values are added to carry out normalization operation to obtain the final probability.
7. An effective method for automatic detection and classification of marine mammal sounds according to claim 5, wherein: the obtaining of the sample quota probability matrix through the thickness granularity combination layer in the MG-Transformer network model and the MG-Resnet network model comprises the following steps:
tensors with the lengths of the fine-grained category number and the coarse-grained category number are respectively mapped through two parallel full-connection layers in the MG-Resnet network model, and tensors with the lengths of the fine-grained category number and the coarse-grained category number are respectively mapped through two parallel linear layers in the MG-Transformer network model;
the coarse-grained layer in the coarse-fine granularity combination layer applies a softmax function to its input data to obtain the probability of the coarse-grained category to which the sample belongs; the fine-grained layer groups its input data, placing fine-grained data that belong to the same coarse-grained category in one group, applies a softmax operation to each group, and finally multiplies the obtained coarse-grained probability by the corresponding fine-grained probability to obtain the category probability information p1;
a softmax operation is applied directly to the input of the fine-grained layer through a residual-like structure, a weight ε is assigned, and the result is combined with the matrix containing p1 as p2 = p1 + ε * p0, where p0 represents the probability obtained by directly applying the softmax operation to the input of the fine-grained layer;
finally p2 is normalized to obtain the final probability matrix of the sample, p3 = p2 / Σp2.
CN202210817343.1A 2022-07-12 2022-07-12 Effective marine mammal sound automatic detection and classification method Active CN115188387B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210817343.1A CN115188387B (en) 2022-07-12 2022-07-12 Effective marine mammal sound automatic detection and classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210817343.1A CN115188387B (en) 2022-07-12 2022-07-12 Effective marine mammal sound automatic detection and classification method

Publications (2)

Publication Number Publication Date
CN115188387A CN115188387A (en) 2022-10-14
CN115188387B true CN115188387B (en) 2023-04-07

Family

ID=83517853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210817343.1A Active CN115188387B (en) 2022-07-12 2022-07-12 Effective marine mammal sound automatic detection and classification method

Country Status (1)

Country Link
CN (1) CN115188387B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116318867B (en) * 2023-02-15 2023-11-28 四川农业大学 Resource platform data transmission method based on out-of-order encryption and decryption
CN117174109B (en) * 2023-11-03 2024-02-02 青岛科技大学 Feature extraction-based marine mammal sound signal imitation hidden scoring method
CN117275491B (en) * 2023-11-17 2024-01-30 青岛科技大学 Sound classification method based on audio conversion and time attention seeking neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9558762B1 (en) * 2011-07-03 2017-01-31 Reality Analytics, Inc. System and method for distinguishing source from unconstrained acoustic signals emitted thereby in context agnostic manner
CN109800700A (en) * 2019-01-15 2019-05-24 哈尔滨工程大学 A kind of underwater sound signal target classification identification method based on deep learning
CN110827837A (en) * 2019-10-18 2020-02-21 中山大学 Whale activity audio classification method based on deep learning
WO2021075709A1 (en) * 2019-10-14 2021-04-22 고려대학교 산학협력단 Apparatus and method for identifying animal species robustly against noisy environment
CN112927694A (en) * 2021-03-08 2021-06-08 中国地质大学(武汉) Voice instruction validity judging method based on fusion voiceprint features

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030125946A1 (en) * 2002-01-03 2003-07-03 Wen-Hao Hsu Method and apparatus for recognizing animal species from an animal voice
CN103117061B (en) * 2013-02-05 2016-01-20 广东欧珀移动通信有限公司 A kind of voice-based animals recognition method and device
KR102092475B1 (en) * 2018-10-16 2020-03-23 고려대학교 산학협력단 Method and application for animal species classification
CN110120224B (en) * 2019-05-10 2023-01-20 平安科技(深圳)有限公司 Method and device for constructing bird sound recognition model, computer equipment and storage medium
CN110390952B (en) * 2019-06-21 2021-10-22 江南大学 City sound event classification method based on dual-feature 2-DenseNet parallel connection
CN112750442B (en) * 2020-12-25 2023-08-08 浙江弄潮儿智慧科技有限公司 Crested ibis population ecological system monitoring system with wavelet transformation and method thereof
CN112802484B (en) * 2021-04-12 2021-06-18 四川大学 Panda sound event detection method and system under mixed audio frequency
CN113345443A (en) * 2021-04-22 2021-09-03 西北工业大学 Marine mammal vocalization detection and identification method based on mel-frequency cepstrum coefficient

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9558762B1 (en) * 2011-07-03 2017-01-31 Reality Analytics, Inc. System and method for distinguishing source from unconstrained acoustic signals emitted thereby in context agnostic manner
CN109800700A (en) * 2019-01-15 2019-05-24 哈尔滨工程大学 A kind of underwater sound signal target classification identification method based on deep learning
WO2021075709A1 (en) * 2019-10-14 2021-04-22 고려대학교 산학협력단 Apparatus and method for identifying animal species robustly against noisy environment
CN110827837A (en) * 2019-10-18 2020-02-21 中山大学 Whale activity audio classification method based on deep learning
CN112927694A (en) * 2021-03-08 2021-06-08 中国地质大学(武汉) Voice instruction validity judging method based on fusion voiceprint features

Also Published As

Publication number Publication date
CN115188387A (en) 2022-10-14

Similar Documents

Publication Publication Date Title
CN115188387B (en) Effective marine mammal sound automatic detection and classification method
CN107731233B (en) Voiceprint recognition method based on RNN
Cakir et al. Multi-label vs. combined single-label sound event detection with deep neural networks
CN108875592A (en) A kind of convolutional neural networks optimization method based on attention
CN110148408A (en) A kind of Chinese speech recognition method based on depth residual error
CN117095694B (en) Bird song recognition method based on tag hierarchical structure attribute relationship
Leonid et al. Classification of Elephant Sounds Using Parallel Convolutional Neural Network.
Passricha et al. A comparative analysis of pooling strategies for convolutional neural network based Hindi ASR
Yang et al. Classification of odontocete echolocation clicks using convolutional neural network
CN113191178A (en) Underwater sound target identification method based on auditory perception feature deep learning
Qiao et al. Sub-spectrogram segmentation for environmental sound classification via convolutional recurrent neural network and score level fusion
Hou et al. Transfer learning for improving singing-voice detection in polyphonic instrumental music
Bai et al. Multimodal urban sound tagging with spatiotemporal context
Liu Deep convolutional and LSTM neural networks for acoustic modelling in automatic speech recognition
Brutti et al. Optimizing phinet architectures for the detection of urban sounds on low-end devices
Xiao et al. AMResNet: An automatic recognition model of bird sounds in real environment
Liu et al. Birdsong classification based on multi feature channel fusion
CN116884435A (en) Voice event detection method and device based on audio prompt learning
Bedyakin et al. Low-resource spoken language identification using self-attentive pooling and deep 1D time-channel separable convolutions
Xie et al. Acoustic feature extraction using perceptual wavelet packet decomposition for frog call classification
Singh et al. Speaker Identification Analysis Based on Long-Term Acoustic Characteristics with Minimal Performance
Bai et al. A multi-feature fusion based method for urban sound tagging
Bai et al. CIAIC-BAD system for DCASE2018 challenge task 3
Kek et al. Acoustic scene classification using bilinear pooling on time-liked and frequency-liked convolution neural network
Segarceanu et al. Environmental acoustics modelling techniques for forest monitoring

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant