CN117831572A - Automatic underwater target sound classification method based on lightweight multi-scale convolution attention neural network

Info

Publication number
CN117831572A
Authority
CN
China
Prior art keywords: neural network, layer, feature, convolution, scale
Prior art date
Legal status: Pending
Application number
CN202410031610.1A
Other languages
Chinese (zh)
Inventor
杨俊杰
张镇宇
谢峰
古宇宏
丁家辉
李彦杨
徐梧高
钟浩华
杨超
Current Assignee: Guangdong University of Technology
Original Assignee: Guangdong University of Technology
Priority date: 2024-01-09
Filing date: 2024-01-09
Publication date: 2024-04-05
Application filed by Guangdong University of Technology
Priority to CN202410031610.1A
Publication of CN117831572A


Classifications

    • G10L25/51 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00, specially adapted for comparison or discrimination
    • G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique, using neural networks


Abstract

The invention belongs to the technical field of underwater target classification and provides an automatic underwater target sound classification method based on a lightweight multi-scale convolution attention neural network, comprising the following steps. S1, convert the original underwater target signal into a single channel, and perform sampling and length-standardization processing. S2, extract the MFCC and GFCC features of the underwater target signal, compute second-order differences of each, and splice them; after splicing the processed MFCC and GFCC features along the first dimension, split them into a training set and a test set. S3, construct a lightweight multi-scale convolution attention neural network model and optimize its neuron parameters on the training set; the trained network is then used to classify the underwater target test set. Compared with underwater target classification methods using conventional training and classification frameworks, the proposed method achieves higher accuracy, wider applicability, and lower consumption of computing resources.

Description

Automatic underwater target sound classification method based on lightweight multi-scale convolution attention neural network
Technical Field
The invention belongs to the technical field of underwater target classification, and particularly relates to an automatic underwater target sound classification method based on a lightweight multi-scale convolution attention neural network.
Background
Because of the wide application of sonar, there is great interest in the sound classification of underwater targets, in particular in identifying the type of a ship from the sound of its underwater propeller. Rapid acoustic identification of ship types improves the efficiency of port management; it also supports ship designs that are quieter or more efficient, reducing the impact on the marine ecological environment. For example, using intelligent equipment to distinguish ships passing through a sea area can greatly improve the efficiency of marine workers. It is therefore important to develop high-precision underwater target recognition technology. In the current state of the art, deep-learning-based underwater target identification is the mainstream approach, but two main challenges remain: 1. underwater target types are complex, the original features of the underwater noise signals of different ships have low discriminability, and ambient ocean noise varies strongly with ocean currents, monsoons, and other seasonal effects, which complicates the subsequent label classification task; 2. the neural networks used for most underwater target classification have large parameter scales and complex operations, and are not suitable for deployment on terminal devices.
Therefore, those skilled in the art have proposed an automatic underwater target sound classification method based on a lightweight multi-scale convolution attention neural network to solve the problems described above.
Disclosure of Invention
In order to solve the above technical problems, the invention provides an automatic underwater target sound classification method based on a lightweight multi-scale convolution attention neural network, addressing the deficiencies of the prior art.
An automatic underwater target sound classification method based on a lightweight multi-scale convolution attention neural network comprises the following steps:
S1, audio preprocessing: converting the multichannel underwater target signals into a fixed number of channels, and performing sampling and length-standardization processing to obtain processed audio data;
S2, feature extraction: performing Mel filtering and cepstrum transformation on the audio data output in step S1 and outputting MFCC feature vectors; performing Gammatone filtering on the audio data output in step S1 and outputting GFCC feature vectors; then processing the MFCC and GFCC feature vectors, and finally establishing and loading a feature data set;
S3, constructing a module of 4 multi-scale convolution layers with different convolution kernel sizes, a batch normalization layer, and a convolution layer: inputting the underwater target feature data set extracted in step S2 into the multi-scale convolution layers for feature extraction, and then processing the extracted feature data;
S4, constructing a ConvNextV2 layer-convolution layer module and a maximum pooling layer: inputting the data extracted in step S3 into the ConvNextV2 layer-convolution layer for more complex and abstract feature extraction; constructing a maximum pooling layer, performing feature dimension reduction on the extracted feature data, and outputting the result to the next layer;
S5, constructing a multi-scale attention convolution module, and processing the feature data output by step S4 with a maximum pooling layer, an average pooling layer, and a fully connected layer module;
S6, training phase: training on the output of step S5 in combination with label information, and completing the optimization of the structure and parameters of the multi-scale convolutional neural network classifier after the maximum number of iterations;
S7, testing: classifying the underwater sound features in the test sample data set with the trained multi-scale convolutional neural network classifier to obtain the test classification results.
Preferably, the target signals comprise the underwater sounds of fishing boats, trawlers, mussel boats, tugboats, dredgers, motorboats, pilot boats, sailing boats, passenger ferries, ocean liners, and roll-on/roll-off ships, as well as marine environmental noise.
Preferably, the audio preprocessing is specified as follows:
S11, converting the original underwater target signal into a single-channel signal;
S12, setting the audio sampling rate to 22050 Hz;
S13, standardizing the audio length: truncating all audio signals to a length of 1 s and zero-filling the parts that fall short.
Preferably, in step S2, the processing of the MFCC and GFCC feature vectors is specified as follows: second-order differences of the obtained MFCC and GFCC feature vectors are computed respectively and spliced, and the spliced MFCC and GFCC feature vectors are normalized; the processed MFCC and GFCC feature vectors are finally spliced along the first dimension.
Preferably, the MFCC uses a Mel-scale filter bank, and the extraction process includes:
1) Preprocessing the data;
2) Performing a fast Fourier transform on the signal;
3) Creating a group of triangular Mel filters that uniformly cover the spectrum range, and computing the overlap of each filter with the spectrum to obtain a series of energy values;
4) Taking the logarithm of the Mel spectrogram to generate a Mel log-power spectrogram;
5) Applying a discrete cosine transform (DCT) to the log filter-bank energies to obtain the MFCC features.
Preferably, in step S3, the extracted feature data are processed as follows: the extracted feature data are combined and batch-normalized; after batch normalization, the data enter a convolution layer for feature dimension reduction and are output to the next layer.
Preferably, step S4 is specified as follows: the data extracted in step S3 pass through three stacked ConvNextV2 layer-convolution layer modules for more complex and abstract feature extraction, and a maximum pooling layer is constructed for feature dimension reduction.
Preferably, step S5 is specified as follows: a multi-scale attention convolution module is constructed together with a maximum pooling layer, an average pooling layer, and a fully connected layer module, which perform key-feature selection, dimension reduction, and nonlinear transformation on the feature data output by step S4.
Preferably, the number of training iterations for optimizing the multi-scale convolutional neural network classifier structure and parameters in step S6 is 300.
Preferably, the training loss function of the network classifier is defined as the cross-entropy loss
$$L = -\sum_{i=1}^{K} y_i \log \hat{y}_i$$
where $y_i$ is the one-hot label and $\hat{y}_i$ the predicted class probability; when the number of training iterations reaches 300, the loss and accuracy gradually converge.
Compared with the prior art, the invention has the following beneficial effects:
according to the invention, the MFCC (frequency division multiplexing) characteristic and the GFCC characteristic are extracted by carrying out the extraction of the MFCC characteristic and the GFCC characteristic on each type of different underwater target signals, the extracted MFCC characteristic and GFCC characteristic are respectively subjected to second-order differential processing, and then original characteristic data and characteristic data of each differential are respectively spliced; splicing the processed MFCC features and GFCC features along the 1 st dimension, establishing a training set and a data set, and loading; constructing a multi-scale convolution attention neural network model, training a multi-scale convolution attention neural network classification model, and completing classification recognition; the method has the advantages that the processed MFCC characteristics and GFCC characteristics are extracted from the underwater target sound and are used for training the multi-scale convolution attention neural network to classify, and compared with the underwater target sound classification method using the traditional training classification framework, the method provided by the invention has higher accuracy and less calculation resource consumption.
Drawings
FIG. 1 is a system flow diagram of an embodiment of the automatic underwater target sound classification method based on a lightweight multi-scale convolution attention neural network;
FIG. 2 shows 5 s time-domain waveforms of five different classes of underwater target signals according to an embodiment of the invention;
FIG. 3 shows 5 s Mel spectrograms of five different classes of underwater target signals according to an embodiment of the invention;
FIG. 4 shows 5 s GFCC feature maps of five different classes of underwater target signals according to an embodiment of the invention;
FIG. 5 is a framework diagram of the multi-scale convolution attention neural network according to an embodiment of the invention;
FIG. 6 is a framework diagram of the multi-scale convolution layer according to an embodiment of the invention;
FIG. 7 is a framework diagram of the ConvNextV2 module according to an embodiment of the invention;
FIG. 8 is a framework diagram of the multi-scale convolution attention module according to an embodiment of the invention;
FIG. 9 shows the variation and convergence of the loss function value according to an embodiment of the invention;
FIG. 10 shows the error-rate variation of underwater target classification and identification;
FIG. 11 shows the confusion matrix for identifying the various underwater targets.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are illustrative of the invention but are not intended to limit the scope of the invention.
As shown in fig. 1 to 11:
the invention provides an automatic underwater target sound classification method based on a light-weight multi-scale convolution attention neural network, which comprises the following steps of:
s1, audio pretreatment:
s11, converting an original underwater target signal into a single-channel signal;
s12, setting the audio sampling rate as 22050;
s13, standardizing the audio length, intercepting all audio signals with the length of 1S, and filling redundant parts;
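As an illustrative sketch of steps S11–S13 (not part of the original disclosure; librosa is assumed for audio I/O, and the function name is made up here):

```python
import librosa
import numpy as np

TARGET_SR = 22050           # sampling rate from step S12
TARGET_LEN = TARGET_SR * 1  # 1 s of audio from step S13

def preprocess(path: str) -> np.ndarray:
    """Load audio as mono at 22050 Hz and force a 1 s length."""
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)  # S11 + S12
    if len(y) >= TARGET_LEN:
        y = y[:TARGET_LEN]                              # truncate the excess part
    else:
        y = np.pad(y, (0, TARGET_LEN - len(y)))         # zero-fill short clips
    return y
```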
s2, feature extraction:
S21, computing the MFCC features of the audio signal processed in S1, computing their first- and second-order differences, and keeping the dimensions consistent by appending all-zero rows to the results; splicing the MFCC features with the first- and second-order difference features; normalizing the spliced feature data. The MFCC uses a Mel-scale filter bank, and the extraction process includes: 1) preprocessing the data; 2) performing a fast Fourier transform on the signal; 3) creating a group of triangular Mel filters that uniformly cover the spectrum range and computing the overlap of each filter with the spectrum to obtain a series of energy values; 4) taking the logarithm of the Mel spectrogram to generate a Mel log-power spectrogram; 5) applying a discrete cosine transform (DCT) to the log filter-bank energies to obtain the MFCC features, namely:
$$\mathrm{MFCC}(i,n) = \sum_{m=1}^{M} X_m(n)\cos\!\left(\frac{\pi i\,(m-0.5)}{M}\right) \quad (1)$$
where $M$ is the number of triangular filters, $X_m(n)$ is the $n$-th frame of the log-Mel spectrogram in the $m$-th channel, and $i$ is the index of the cepstral coefficient;
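A hedged sketch of S21 using librosa; note that librosa.feature.delta preserves the feature shape, so the zero-row padding mentioned above is not needed in this variant, and the coefficient count of 13 is an assumption:

```python
import librosa
import numpy as np

def mfcc_with_deltas(y: np.ndarray, sr: int = 22050, n_mfcc: int = 13) -> np.ndarray:
    """MFCC plus first- and second-order differences, spliced and normalized."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # steps 1)-5), eq. (1)
    d1 = librosa.feature.delta(mfcc, order=1)               # first-order difference
    d2 = librosa.feature.delta(mfcc, order=2)               # second-order difference
    feat = np.concatenate([mfcc, d1, d2], axis=0)           # splice along the feature axis
    return (feat - feat.mean()) / (feat.std() + 1e-8)       # normalize the spliced features
```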
S22, computing the GFCC features of the audio signal processed in S1, computing their first- and second-order differences, and keeping the dimensions consistent by appending all-zero rows to the results; splicing the GFCC features with the first- and second-order difference features; normalizing the spliced feature data. The Gammatone filter underlying the GFCC is defined in the time domain as
$$g(t) = a\,t^{\,p-1} e^{-2\pi b t}\cos(2\pi f t + \phi) \quad (2)$$
where $f$ is the frequency, $a$ is a constant, $\phi$ is the phase, and $p$ is the order of the filter. The factor $b$ in equation (2) is the filter bandwidth, computed as
$$b = 1.019\,\mathrm{ERB}(f) \quad (3)$$
where
$$\mathrm{ERB}(f) = 24.7 + 0.108 f \quad (4)$$
is the equivalent rectangular bandwidth. The Gammatone filter response of each channel is computed from equation (2) and its absolute value taken to obtain the time-frequency representation $G_c(n)$, where $c$ is the channel index and $n$ is the frame index. The GFCC are then obtained by applying a discrete cosine transform to this time-frequency representation, namely
$$\mathrm{GFCC}(i,n) = \sum_{c=1}^{C} G_c(n)\cos\!\left(\frac{\pi i\,(c-0.5)}{C}\right) \quad (5)$$
where $C$ is the number of Gammatone filters, $G_c(n)$ is the $n$-th frame of the cochlear spectrogram in the $c$-th channel, and $i$ is the index of the cepstral coefficient;
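A self-contained sketch of the Gammatone/GFCC pipeline of equations (2)–(5); the filter count, the ERB-spaced centre frequencies, the framing parameters, and the choices a = 1, p = 4, φ = 0 are assumptions, not values from the patent:

```python
import numpy as np
from scipy.fftpack import dct
from scipy.signal import fftconvolve

def gfcc(y: np.ndarray, sr: int = 22050, n_filters: int = 64,
         n_coeffs: int = 13, frame: int = 512, hop: int = 256) -> np.ndarray:
    """GFCC sketch following eqs. (2)-(5)."""
    t = np.arange(int(0.05 * sr)) / sr                  # 50 ms impulse responses
    cfs = np.geomspace(50.0, 0.45 * sr, n_filters)      # centre frequencies (assumed spacing)
    cochleagram = []
    for f in cfs:
        erb = 24.7 + 0.108 * f                          # eq. (4)
        b = 1.019 * erb                                 # eq. (3), standard bandwidth factor
        g = t ** 3 * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f * t)  # eq. (2): a=1, p=4, phi=0
        resp = np.abs(fftconvolve(y, g, mode="same"))   # |filter response| per channel
        # average each frame to one value: the cochleagram row G_c(n)
        cochleagram.append([resp[i:i + frame].mean()
                            for i in range(0, len(resp) - frame, hop)])
    G = np.log(np.asarray(cochleagram) + 1e-8)          # log compression
    return dct(G, axis=0, norm="ortho")[:n_coeffs]      # eq. (5): DCT across channels
```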
S23, combining the processed MFCC and GFCC features;
S24, label processing: the experiments use the public ShipsEar data set, which was recorded in different areas of the Atlantic coast of northwest Spain; most of the data were collected at or near the port of Vigo (42°14.5′N 008°43.4′W). This port features heavy traffic, a rich variety of ship types, and rich environmental noise, so the sounds of ships and ambient noise can be collected well, including fishing boats, ocean liners, ferries of various sizes, container ships, roll-on/roll-off ships, tugboats, sailing vessels, yachts, small sailboats, and so on. The data set records sounds of 11 vessel types and 1 kind of ambient noise, in a total of 90 recordings of different durations, about 3.13 hours of audio. Classification is based on ship size, and the data are divided into 4 classes plus ambient noise as follows: 1: fishing boats, trawlers, mussel boats, tugboats, and dredgers; 2: motorboats, pilot boats, and sailboats; 3: passenger ferries; 4: ocean liners and roll-on/roll-off ships; 5: background noise;
S25, dividing the data set: the data set contains 11270 samples, with 1871 in class 1, 1554 in class 2, 4256 in class 3, 2453 in class 4, and 1139 in class 5; 80% are randomly selected as the training set, 10% as the validation set, and 10% as the test set;
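The split of S25 could be realized with scikit-learn as follows; the feature shape and the use of stratification (to preserve the class ratios of the imbalanced set) are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels standing in for the 11270-sample feature
# data set of step S25 (real loading code is not shown here).
X = np.random.randn(11270, 78, 44)      # (samples, feature rows, frames) - illustrative shape
y = np.random.randint(1, 6, size=11270)

# 80 % train, 10 % validation, 10 % test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)
```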
S3, constructing a module of 4 multi-scale convolution layers with different convolution kernel sizes, a batch normalization layer, and a convolution layer:
S31, the underwater target signal features extracted in step S2 are input into the multi-scale convolution layer and then enter four branch convolution layers with different kernel sizes for feature extraction. Each convolution layer operates according to
$$\mathrm{out}(N_i, C^{\mathrm{out}}_j) = b(C^{\mathrm{out}}_j) + \sum_{k=0}^{C_{\mathrm{in}}-1} w(C^{\mathrm{out}}_j, k) \star \mathrm{input}(N_i, k) \quad (6)$$
where $N$ is the batch index, $C^{\mathrm{out}}_j$ is the $j$-th output channel, $b$ is the bias, and $w$ is the weight.
The input is $x_{\mathrm{in}}$ with shape
$$x_{\mathrm{in}} = (N, C, H, W) \quad (7)$$
where $N$ is the batch size, $C$ the number of channels, $H$ the height, and $W$ the width. In branch 1, $x_{\mathrm{in}}$ passes through a convolution layer with kernel size (3, 3), padding 1, and stride 1 to give $x1 = (N, 4C, H, W)$; in branch 2, through a convolution layer with kernel size (9, 9), padding 4, and stride 1 to give $x2 = (N, 4C, H, W)$; in branch 3, through a convolution layer with kernel size (15, 15), padding 7, and stride 1 to give $x3 = (N, 4C, H, W)$; in branch 4, through a convolution layer with kernel size (21, 21), padding 10, and stride 1 to give $x4 = (N, 4C, H, W)$;
The features of the four branches are combined to obtain $x_{1,2,3,4} = (N, 16C, H, W)$ and then pass through a normalization layer, for which GRN (Global Response Normalization) is adopted; this handles key feature information better and improves the discriminability of the features. The principle is as follows: a given input feature $X \in \mathbb{R}^{H\times W\times C}$ ($H$, $W$, and $C$ are the height, width, and channel count of the feature map) is first aggregated into channel statistics,
$$\mathcal{G}(X) := X \in \mathbb{R}^{H\times W\times C} \rightarrow gx \in \mathbb{R}^{C} \quad (8)$$
using norm-based feature aggregation, $\mathcal{G}(X) = gx = \{\lVert X_1\rVert, \lVert X_2\rVert, \ldots, \lVert X_C\rVert\} \in \mathbb{R}^{C}$, where $\mathcal{G}(X)_i = \lVert X_i\rVert$ is a scalar aggregating the statistics of the $i$-th channel; using the L2-norm in particular yields better network performance. A response normalization function $\mathcal{N}(\cdot)$ is then applied to the aggregated values, using standard divisive normalization:
$$\mathcal{N}(\lVert X_i\rVert) := \frac{\lVert X_i\rVert}{\sum_{j=1}^{C} \lVert X_j\rVert} \quad (9)$$
where $\lVert X_i\rVert$ is the L2-norm of the $i$-th channel. Finally, the original input response is calibrated with the computed feature normalization score:
$$X_i = X_i * \mathcal{N}(\mathcal{G}(X)_i) \in \mathbb{R}^{H\times W} \quad (10)$$
The result then passes through a convolution layer with kernel size (1, 1) and stride 1, and $x_{\mathrm{out}}$ is output to the next layer;
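For illustration only (not part of the original disclosure), the multi-scale convolution layer of S31 together with the GRN of equations (8)–(10) could be sketched in PyTorch as follows; the learnable gamma/beta parameters and the residual term in GRN follow the ConvNeXtV2 paper rather than the text above, the mean-based denominator is proportional to the sum in equation (9), and the output width of the final (1, 1) convolution is an assumption:

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization for NCHW tensors, eqs. (8)-(10)."""
    def __init__(self, channels: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gx = torch.norm(x, p=2, dim=(2, 3), keepdim=True)   # eq. (8): L2-norm per channel
        nx = gx / (gx.mean(dim=1, keepdim=True) + 1e-6)     # eq. (9), mean-based variant
        return self.gamma * (x * nx) + self.beta + x        # eq. (10) plus affine + residual

class MultiScaleConv(nn.Module):
    """Four parallel branches with kernels 3/9/15/21, as in S31."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, 4 * in_ch, k, stride=1, padding=p)
            for k, p in [(3, 1), (9, 4), (15, 7), (21, 10)]  # padding keeps H, W unchanged
        ])
        self.grn = GRN(16 * in_ch)
        self.reduce = nn.Conv2d(16 * in_ch, 16 * in_ch, 1)   # the (1, 1) convolution

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.cat([b(x) for b in self.branches], dim=1)  # concat -> (N, 16C, H, W)
        return self.reduce(self.grn(x))
```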
s4, constructing a ConvNextV2 layer-convolution layer module and a maximum pooling layer module:
S41, through three stacked convolution layer + ConvNextV2 layer stages, the 32-dimensional feature data output by S3 are up-scaled toward 512 dimensions. In the first stage, the feature data are normalized by LN and pass through a ConvNextV2 layer, inside which the GRN technique is used to improve network performance. The output of the first stage is then taken as the input of the second stage: a convolution layer with kernel (3, 3), stride 2, and padding (1, 1) raises the feature dimension to 64, followed by an LN layer and a ConvNextV2 layer; finally, a convolution layer with kernel (3, 3), stride 2, and padding (1, 1) followed by a ConvNextV2 layer raises the feature dimension to 128, giving 128-dimensional output feature data;
S42, a convolution layer with kernel (3, 3), stride 2, and padding (1, 1) raises the 128-dimensional feature data to 512 dimensions, after which max pooling with kernel (3, 3) and stride 2 reduces the feature dimensions;
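As a hedged sketch of the ConvNextV2 layer used in S41 (the patent does not spell out the block internals), a ConvNeXtV2-style block after the original ConvNeXtV2 paper might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class GRNLast(nn.Module):
    """GRN for channels-last tensors (N, H, W, C), as used inside ConvNeXtV2."""
    def __init__(self, dim: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)    # L2-norm over H, W
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)     # divisive normalization
        return self.gamma * (x * nx) + self.beta + x

class ConvNeXtV2Block(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise conv
        self.norm = nn.LayerNorm(dim)            # applied channels-last
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # pointwise expansion
        self.act = nn.GELU()
        self.grn = GRNLast(4 * dim)
        self.pwconv2 = nn.Linear(4 * dim, dim)   # pointwise projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x).permute(0, 2, 3, 1)                # -> (N, H, W, C)
        x = self.pwconv2(self.grn(self.act(self.pwconv1(self.norm(x)))))
        return shortcut + x.permute(0, 3, 1, 2)               # residual, back to NCHW
```

A downsampling stage as described in S41 could then chain normalization, this block, and `nn.Conv2d(dim, 2 * dim, 3, stride=2, padding=1)`.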
S5, constructing a multi-scale attention convolution module, a maximum pooling layer, an average pooling layer, and a fully connected layer module:
S51, an Inception multi-scale convolution layer is combined with ECA (Efficient Channel Attention) to form the multi-scale attention module. ECA obtains channel weights by pooling the features of each channel and applying a one-dimensional convolution, strengthening important channel information for the next layer. The update of each channel weight is defined as
$$\omega_i = \sigma\!\Bigl(\sum_{j=1}^{k} w^j f_i^j\Bigr), \quad f_i^j \in \Omega_i^k \quad (11)$$
where $\omega_i$ is the weight of channel $i$, $w^j$ are the parameters used to compute each channel weight, $k$ ($1 \le k \le C$) is the cross-channel interaction range during the process, $f_i^j$ is a feature value associated with channel $i$ ($j$ is the index), $\Omega_i^k$ is the set of features associated with each channel $i$, and $\sigma$ denotes the Sigmoid activation. The interaction range $k$ is determined adaptively from the number of channels as
$$k = \psi(C) = \left|\frac{\log_2 C}{\gamma} + \frac{b}{\gamma}\right|_{\mathrm{odd}} \quad (12)$$
where $C$ is the number of input channels, $\gamma$ and $b$ are hyper-parameters, and $|n|_{\mathrm{odd}}$ denotes the odd number closest to the real number $n$;
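A minimal PyTorch sketch of the ECA module of equations (11)–(12); the defaults γ = 2 and b = 1 come from the ECA paper and are assumptions here:

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention, eqs. (11)-(12)."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        k = int(abs(math.log2(channels) / gamma + b / gamma))   # eq. (12)
        k = k if k % 2 else k + 1                               # nearest odd number
        self.pool = nn.AdaptiveAvgPool2d(1)                     # per-channel pooling
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)  # 1-D cross-channel conv
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:         # x: (N, C, H, W)
        w = self.pool(x).squeeze(-1).transpose(1, 2)             # (N, 1, C)
        w = self.sigmoid(self.conv(w)).transpose(1, 2).unsqueeze(-1)  # eq. (11)
        return x * w                                             # reweight the channels
```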
The depth and complexity of the neural network are extended by stacking four multi-scale convolution attention modules, which select the key features of the input. First, the feature data output by S4 undergo more comprehensive and richer feature extraction and key-feature selection through two multi-scale attention blocks, followed by feature dimension reduction through a maximum pooling layer with kernel size (3, 3) and stride 2;
S52, two further multi-scale convolution attention blocks applied to the dimension-reduced feature data improve the model's ability to perceive and understand features of different scales, so that it captures global feature information better and the accuracy and robustness of target classification improve; feature dimension reduction by an average pooling layer with kernel (1, 1) then reduces the parameter count and computation of the model, prevents overfitting, and improves the computational efficiency of the neural network; finally, a nonlinear transformation is applied through a linear layer;
S6, training the multi-scale convolutional neural network classification model: the processed MFCC and GFCC features of the various underwater target signals are input into the network for label training of the multi-scale convolutional neural network model, using the Adam optimizer with an initial learning rate of 0.1;
S7, classification test: the trained multi-scale convolutional neural network structure and parameters are used to classify the underwater target features in the test sample data set, obtaining the classification results and identifying the various underwater targets;
S8, adopting Accuracy, Precision, Recall, and F1-score as evaluation indexes, which require the following basic concepts: TP, true positive, predicted true and actually true; FP, false positive, predicted true but actually false; FN, false negative, predicted false but actually true; TN, true negative, predicted false and actually false;
The Accuracy, Precision, Recall, and F1-score formulas are as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (13)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (14)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (15)$$
$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (16)$$
The invention also evaluates the complexity and computational efficiency of the model by comparing its parameter count (params) and floating-point operations (FLOPs). The parameter count of each convolution layer is computed as
$$\mathrm{params} = C_{\mathrm{out}} \times (\kappa_\omega \times \kappa_h \times C_{\mathrm{in}} + 1) \quad (17)$$
where $C_{\mathrm{out}}$ is the number of output channels, $C_{\mathrm{in}}$ is the number of input channels, and $\kappa_\omega$ and $\kappa_h$ are the convolution kernel width and height. The operation count FLOPs of each convolution layer is computed as
$$\mathrm{FLOPs} = \left[(\kappa_\omega \times \kappa_h \times C_{\mathrm{in}}) + (\kappa_\omega \times \kappa_h \times C_{\mathrm{in}} - 1) + 1\right] \times C_{\mathrm{out}} \times W \times H \quad (18)$$
where $W$ and $H$ are the width and height of the feature map;
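Equations (17) and (18) translate directly into a small helper, shown here as an illustrative sketch (the example channel and feature-map sizes are not from the patent):

```python
def conv2d_params(c_in: int, c_out: int, k_w: int, k_h: int) -> int:
    """Parameter count of one convolution layer, eq. (17)."""
    return c_out * (k_w * k_h * c_in + 1)   # +1 for the bias term

def conv2d_flops(c_in: int, c_out: int, k_w: int, k_h: int, w: int, h: int) -> int:
    """Floating-point operations of one convolution layer, eq. (18)."""
    mults = k_w * k_h * c_in                # multiplications per output element
    adds = (k_w * k_h * c_in - 1) + 1       # additions per output element (incl. bias)
    return (mults + adds) * c_out * w * h

# Example: a 3x3 branch with 1 input and 4 output channels on a 44x39 map.
print(conv2d_params(1, 4, 3, 3))            # 40 parameters
print(conv2d_flops(1, 4, 3, 3, 44, 39))     # FLOPs for one forward pass
```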
In step S2, feature extraction is performed on the underwater target signal from step S1: the data undergo Mel filtering and cepstrum transformation to output MFCC feature vectors, whose second-order differences are computed and spliced; the data also undergo Gammatone filtering to output GFCC feature vectors, whose second-order differences are computed and spliced; the two features are then spliced along the first dimension;
In step S3, a module of 4 multi-scale convolution layers with different convolution kernel sizes, a batch normalization layer, and a convolution layer is constructed; the multi-scale convolution captures feature information at different scales and enriches the expressive power of the features; the normalization layer normalizes each input batch, making the network more stable and convergence faster;
In step S4, a ConvNextV2 layer-convolution layer module and a maximum pooling layer module are constructed; the ConvNextV2 layer-convolution layer module increases the depth and complexity of the network, extracts more comprehensive and abstract feature data, and prevents overfitting; the maximum pooling layer reduces the dimensions of the extracted features, lowering the computational load of the network and improving its efficiency;
The multi-scale attention convolution module constructed in step S5 models multi-scale features more comprehensively: it fuses spatial and channel information better, strengthening the expressive power of the features, and also improves the richness and robustness of feature extraction;
The multi-scale convolutional neural network classification model in step S6 is trained for 300 iterations; the experimental parameters were continuously adjusted and optimized to obtain the most suitable values. In the experiments, the loss and accuracy of the model were found to converge gradually after about 300 training iterations, so the number of training iterations was set to 300; in each run, the model parameters obtained from the last epoch were used for the next training, and after retraining the framework the loss remained small, so 300 iterations were chosen;
When the multi-scale convolutional neural network classification model is trained for 300 iterations, the loss and accuracy gradually converge. The loss function is defined as the cross-entropy loss $L = -\sum_{i} y_i \log \hat{y}_i$; the loss is zero only when the predicted probability of the true class equals 1, i.e., when the classification output matches the label exactly;
Step one, the audio preprocessing stage: channel conversion, sampling, and truncation are applied to the underwater target data downloaded from the data set provided by the paper "ShipsEar: An underwater vessel noise database", constructing underwater target data as heard in a real underwater scene;
Step two, feature extraction and labeling: the MFCC and GFCC features of the preprocessed underwater target data are extracted and transformed with second-order differencing; the original spectra are spliced with the first- and second-order features respectively; the MFCC and GFCC features extracted from the processed underwater target signal are then spliced and labeled, constructing the data set;
Step three, the underwater target classification process: the training set from the data set obtained in step two is input into the multi-scale convolutional neural network for classification, finally classifying five different underwater targets and training the classification framework.
Embodiment one:
the feasibility and superiority of the algorithm are described below by classifying different underwater targets based on the designed underwater target classification model under the strategy.
Firstly, carrying out data processing, wherein the selected data set is from the data set provided by the paper of 'shipear: an underwater vessel noise database', and the underwater targets comprise fishing boats, ocean going passenger ships, ferries of various sizes, containers, roll-on and roll-off ships, tugboats, sailing vessels, yachts, sailing vessels and the like; the dataset recorded 11 sounds of the vessel and 1 ambient noise, for a total of 90 different time periods, about 3.13 hours of audio files; classification is based on the size of the ship, and the data is classified into 4 classes and ambient noise as follows: 1: fishing boats, trawl boats, mussel boats, tug boats and dredger boats; 2: motorboats, sailing vessels and sailing vessels; 3: passenger ferry; 4: ocean going passenger vessels and roll-on-roll-off vessels; 5: background noise; firstly, converting an original underwater target signal into a single-channel signal, setting an audio sampling rate to 22050Hz, then standardizing audio lengths, intercepting the parts with the time length longer than 1 second of all audio signals for audio data with different time lengths, filling the redundant parts, and continuing the experiment;
Using the MFCC feature map or the GFCC feature map alone for underwater target classification is less discriminative, so the method combines the MFCC and GFCC features. The MFCC features of the audio signal processed in step S1 are computed along with their first- and second-order differences, keeping the dimensions consistent by appending all-zero rows to the results, and the MFCC features are spliced with the first- and second-order difference features; the main idea of MFCC feature extraction is to obtain a feature well adapted to the human auditory scale. The GFCC features of the audio signal processed in step S1 are likewise computed along with their first- and second-order differences, keeping the dimensions consistent by appending all-zero rows, and the GFCC features are spliced with the first- and second-order difference features; the spliced feature data are normalized;
Taking class 2 as an example, the multi-scale convolutional neural network classifier module is constructed. The extracted features of the passenger ferry acoustic signal first enter the input layer of the multi-scale convolution layer and then pass through convolution layers with four branches of different kernel sizes for feature extraction, where: branch 1 is a convolution layer with kernel size (3, 3), padding 1, and stride 1; branch 2 is a convolution layer with kernel size (9, 9), padding 4, and stride 1; branch 3 is a convolution layer with kernel size (15, 15), padding 7, and stride 1; branch 4 is a convolution layer with kernel size (21, 21), padding 10, and stride 1. The features of the four branches are combined and passed through a batch normalization layer and an activation function; a convolution layer with kernel size (1, 1) and stride 1 then performs feature dimension reduction and outputs the feature $x_{\mathrm{out}}$;
$x_{\mathrm{out}}$ then passes through three stacked ConvNextV2 layer-convolution layer modules and a maximum pooling layer module, extracting more comprehensive and abstract feature data while preventing network overfitting; the maximum pooling layer reduces the dimensions of the extracted features, lowering the computational load of the network and improving its efficiency. The output of this neural network stage is $x_2$;
The passenger ferry audio feature $x_2$ then enters the multi-scale attention module, formed by combining an Inception multi-scale convolution layer with the ECA attention mechanism. The depth and complexity of the neural network are extended by stacking four multi-scale convolution attention modules, which select the key features of the input. First, feature extraction and key-feature selection are performed on the feature data output by step S4 through two multi-scale attention blocks, followed by feature dimension reduction in a maximum pooling layer; the dimension-reduced feature data then pass through two further multi-scale convolution attention blocks, improving the model's perception and understanding of features at different scales and capturing global feature information; feature dimension reduction through an average pooling layer then reduces the parameter count and computation, improving efficiency; finally, a nonlinear transformation is applied through the linear layer;
training a multi-scale convolutional neural network classification model: the processed passenger ferry underwater acoustic signal characteristic is input into a network to carry out label training on a multi-scale convolutional neural network model, input is defined as input minus an input average value and divided by an input standard deviation when each iteration is carried out, a Loss function value (using a cross entropy Loss function) is calculated after the input average value passes through a model classification layer, an Adam optimizer is used, and an initial learning rate is set to be 0.1, so that gradient descent speed is accelerated; when the number of iterations reaches 300, the network is considered to complete the training process for the underwater target.
In the invention, the classification performance of the proposed algorithm and other algorithms on underwater targets is evaluated with the four indexes above. Table 1 compares the accuracy of the automatic underwater target sound classification method based on the lightweight multi-scale convolution attention neural network with the existing method of the paper "Joint learning model for underwater acoustic target recognition":
Table 1: Comparison results of the proposed model with other current algorithms
In summary, the automatic underwater target sound classification method based on the lightweight multi-scale convolution attention neural network has the following effects:
1. The proposed neural network model classifies underwater target audio data and can judge the target type quickly and accurately, providing sonar information processing personnel with objective and accurate ship-type identification, reducing unnecessary inspections, improving working efficiency, and lowering time consumption;
2. The invention reduces the model's parameter count and inference time while preserving its classification performance, effectively improving the accuracy and real-time performance of underwater target identification, increasing the efficiency and robustness of the algorithm, and providing an effective classification strategy for sounds collected by sonar.
Anything not described in detail in this specification is well known to those skilled in the art.
Although the present invention has been described with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the described embodiments or equivalents substituted for some of their elements; any modification, equivalent substitution, or improvement made without departing from the spirit and principles of the present invention shall fall within its scope.

Claims (7)

1. An automatic underwater target sound classification method based on a lightweight multi-scale convolution attention neural network, characterized in that the method comprises the following steps:
S1, audio preprocessing: converting the multichannel underwater target signals into a fixed number of channels, and performing sampling and length-standardization processing to obtain processed audio data;
S2, feature extraction: performing Mel filtering and cepstrum transformation on the audio data output in step S1 and outputting MFCC feature vectors; performing Gammatone filtering on the audio data output in step S1 and outputting GFCC feature vectors; then processing the MFCC and GFCC feature vectors, and finally establishing and loading a feature data set;
S3, constructing a module of 4 multi-scale convolution layers with different convolution kernel sizes, a batch normalization layer, and a convolution layer: inputting the underwater target feature data set extracted in step S2 into the multi-scale convolution layers for feature extraction, and then processing the extracted feature data;
S4, constructing a ConvNextV2 layer-convolution layer module and a maximum pooling layer: inputting the data extracted in step S3 into the ConvNextV2 layer-convolution layer for more complex and abstract feature extraction; constructing a maximum pooling layer, performing feature dimension reduction on the extracted feature data, and outputting the result to the next layer;
S5, constructing a multi-scale attention convolution module, and processing the feature data output by step S4 with a maximum pooling layer, an average pooling layer, and a fully connected layer module;
S6, training phase: training on the output of step S5 in combination with label information, and completing the optimization of the structure and parameters of the multi-scale convolutional neural network classifier after the maximum number of iterations;
S7, testing: classifying the underwater sound features in the test sample data set with the trained multi-scale convolutional neural network classifier to obtain the test classification results.
2. The automatic underwater target sound classification method based on a lightweight multi-scale convolution attention neural network according to claim 1, characterized in that: in step S2, the processing of the MFCC and GFCC feature vectors is specified as follows: second-order differences of the obtained MFCC and GFCC feature vectors are computed respectively and spliced, and the spliced MFCC and GFCC feature vectors are normalized; the processed MFCC and GFCC feature vectors are spliced along the first dimension.
3. The automatic underwater target sound classification method based on a lightweight multi-scale convolution attention neural network according to claim 1, characterized in that: in step S3, the extracted feature data are processed as follows: the extracted feature data are combined and batch-normalized; after batch normalization, the data enter a convolution layer for feature dimension reduction and are output to the next layer.
4. The automatic underwater target sound classification method based on a lightweight multi-scale convolution attention neural network according to claim 1, characterized in that: step S4 is specified as follows: the data extracted in step S3 pass through three stacked ConvNextV2 layer-convolution layer modules for more complex and abstract feature extraction, and a maximum pooling layer is constructed for feature dimension reduction.
5. The automatic underwater target sound classification method based on a lightweight multi-scale convolution attention neural network according to claim 1, characterized in that: step S5 is specified as follows: a multi-scale attention convolution module is constructed together with a maximum pooling layer, an average pooling layer, and a fully connected layer module, which perform key-feature selection, dimension reduction, and nonlinear transformation on the feature data output by step S4.
6. The automatic underwater target sound classification method based on a lightweight multi-scale convolution attention neural network according to claim 1, characterized in that: the number of training iterations for optimizing the multi-scale convolutional neural network classifier structure and parameters in step S6 is 300.
7. The automatic underwater target sound classification method based on a lightweight multi-scale convolution attention neural network according to claim 6, characterized in that: the training loss function of the network classifier is defined as the cross-entropy loss $L = -\sum_{i} y_i \log \hat{y}_i$; when the number of training iterations is 300, the loss and accuracy gradually converge.
CN202410031610.1A (priority 2024-01-09, filed 2024-01-09) — Automatic underwater target sound classification method based on lightweight multi-scale convolution attention neural network — Pending — CN117831572A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410031610.1A | 2024-01-09 | 2024-01-09 | Automatic underwater target sound classification method based on lightweight multi-scale convolution attention neural network

Publications (1)

Publication Number | Publication Date
CN117831572A (en) | 2024-04-05

Family ID: 90517296

Family Applications (1)

Application Number | Title
CN202410031610.1A | Automatic underwater target sound classification method based on lightweight multi-scale convolution attention neural network

Country Status (1)

CN: CN117831572A (en)

Cited By (1)

Publication number | Priority date | Publication date | Assignee | Title
CN118051831A * | 2024-04-16 | 2024-05-17 | 西北工业大学 (Northwestern Polytechnical University) | Underwater sound target identification method based on CNN-Transformer cooperative network model

* Cited by examiner, † Cited by third party


Legal Events

Date | Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination