CN117854540A - Underwater sound target identification method and system based on neural network and multidimensional feature fusion - Google Patents

Underwater sound target identification method and system based on neural network and multidimensional feature fusion

Info

Publication number
CN117854540A
CN117854540A
Authority
CN
China
Prior art keywords
spectrum
feature
delta
characteristic
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410264622.9A
Other languages
Chinese (zh)
Other versions
CN117854540B (en)
Inventor
Zhang Wen
Lin Bin
Yan Yulin
Xiao Bo
Yang Pengfei
Ye Yanqing
Ma Li
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Original Assignee
Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences filed Critical Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Priority to CN202410264622.9A priority Critical patent/CN117854540B/en
Publication of CN117854540A publication Critical patent/CN117854540A/en
Application granted granted Critical
Publication of CN117854540B publication Critical patent/CN117854540B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention discloses a method and a system for identifying underwater sound targets based on a neural network and multidimensional feature fusion, which relate to the technical field of underwater sound target identification and comprise the following steps: extracting the spectral features of the underwater sound signal of the underwater sound target to be identified, the spectral features including STFT spectral features, Mel spectral features, and MFCC features; performing feature fusion on the Mel spectral features and the MFCC features to obtain a primary fusion feature; inputting the primary fusion feature and the STFT spectral features into a trained convolutional neural network to obtain a first embedded feature vector and a second embedded feature vector, respectively; performing feature fusion on the first and second embedded feature vectors to obtain a target fusion feature; and inputting the target fusion feature into a trained classification model, which identifies the underwater sound target to be identified. The method and system address the technical problem that traditional underwater sound target identification methods struggle to separate the target signal from disordered ocean noise.

Description

Underwater sound target identification method and system based on neural network and multidimensional feature fusion
Technical Field
The invention relates to the technical field of underwater sound target recognition, and in particular to an underwater sound target recognition method and system based on a neural network and multidimensional feature fusion.
Background
Underwater acoustic target recognition is an information processing technique that extracts target features from passive target radiated noise, active target echo signals, and other sensor signals received by an acoustic receiver, and identifies or classifies target types. Because of the complexity of the marine sound field environment, the time-, space- and frequency-variant nature of the underwater acoustic channel, and interference from other noise sources, the difficulty of this problem is internationally recognized, and the task of underwater acoustic target identification remains a significant challenge.
The underwater acoustic signals observed by passive sonar generally fall into two types. One is natural environment noise, including wind noise, rain noise, marine organism sounds, and the like; the other is artificial noise, including radiated noise from vessels, drilling platforms, and submarines, as well as various interference noise generated by the receiving equipment. Just as people recognize different sounds by their unique voiceprint characteristics, underwater vessel targets can be recognized by their radiated noise, which is diverse and interacting; its main sound source components consist of mechanical noise, propeller noise, hydrodynamic noise, and the like.
Signal features commonly used in underwater acoustic target recognition tasks are generally classified into three types, namely time-waveform features, time-frequency features, and auditory perception features, including the zero-crossing rate, peak-to-peak value, short-time Fourier transform (STFT) spectrum, discrete wavelet transform, LOFAR spectrum, DEMON spectrum, Hilbert-Huang transform, Mel spectrum, Mel-frequency cepstral coefficients (MFCC), and the like.
Waveform features adopted by traditional methods are usually first- or second-order statistics of the time-series waveform, and often serve as auxiliary rather than primary features in underwater target identification tasks. Among the time-frequency features, the short-time Fourier transform time-frequency spectrum retains the most comprehensive time-frequency information of the target waveform signal, so it is widely applied in research on speech signal processing and underwater acoustic signal processing. Essentially, the LOFAR spectrum is almost identical to the STFT spectrum; both are time-spectral features. To extract the DEMON spectrum, the received broadband signal is first demodulated to obtain a low-frequency envelope spectrogram, from which low-frequency physical characteristics of the target signal, such as the shaft frequency and blade frequency, are extracted through further transformation. Wavelet transforms are typically combined with the Hilbert-Huang transform for feature extraction, but methods based on empirical mode decomposition have difficulty separating the target signal from disordered ocean noise due to the lack of accurate a priori knowledge.
Disclosure of Invention
The invention aims to solve at least one of the above technical problems by providing a method and a system for identifying underwater sound targets based on a neural network and multidimensional feature fusion.
In a first aspect, an embodiment of the present invention provides a method for identifying an underwater sound target based on a neural network and multidimensional feature fusion, including: extracting the spectrum characteristics of the underwater sound signal of the underwater sound target to be identified; the spectral features include: STFT spectral features, mel spectral features, and MFCC features; performing feature fusion on the Mel frequency spectrum feature and the MFCC feature to obtain a primary fusion feature; inputting the primary fusion feature and the STFT spectrum feature into a trained convolutional neural network to respectively obtain a first embedded feature vector and a second embedded feature vector; performing feature fusion on the first embedded feature vector and the second embedded feature vector to obtain target fusion features; and inputting the target fusion characteristics into a trained classification model, and identifying the underwater sound target to be identified by using the trained classification model.
Further, the classification model includes a long and short term memory network layer.
Further, the spectral features also include a delta spectrum and a delta-delta spectrum, and extracting the spectral features of the underwater sound signal of the underwater sound target to be identified comprises: respectively performing pre-emphasis, framing, windowing and discrete Fourier transform processing on the underwater sound signal to obtain the STFT spectral features, the Mel spectral features and the MFCC features; calculating, based on the Mel spectral features, the delta spectrum and delta-delta spectrum corresponding to the Mel spectral features; and calculating, based on the STFT spectral features, the delta spectrum corresponding to the STFT spectral features; wherein the delta spectrum is calculated as

$$\Delta_t = \frac{\sum_{m=1}^{M} m\,(S_{t+m} - S_{t-m})}{2\sum_{m=1}^{M} m^{2}}$$

and the delta-delta spectrum is calculated as

$$\Delta\Delta_t = \frac{\sum_{m=1}^{M} m\,(\Delta_{t+m} - \Delta_{t-m})}{2\sum_{m=1}^{M} m^{2}}$$

where $S_{t+m}$ is the Mel or STFT spectral feature of the $t+m$ time frame, $S_{t-m}$ is the Mel or STFT spectral feature of the $t-m$ time frame, $M$ represents the number of adjacent frames taken (the difference dimension of the time spectrum), $\Delta_{t+m}$ is the delta spectral feature coefficient of the $t+m$ time frame, and $\Delta_{t-m}$ is the delta spectral feature coefficient of the $t-m$ time frame.
Further, performing feature fusion on the Mel spectral features and the MFCC features to obtain the primary fusion feature includes: performing feature fusion on the Mel spectral features, their corresponding delta and delta-delta spectra, and the MFCC features to obtain the primary fusion feature.
Further, inputting the STFT spectral features into the trained convolutional neural network comprises: feature-fusing the STFT spectral features with their corresponding delta spectrum, and inputting the fused features into the trained convolutional neural network.
Further, the loss function that optimizes the classification model includes a multi-class cross entropy loss function.
In a second aspect, an embodiment of the present invention further provides an underwater sound target recognition system based on neural network and multidimensional feature fusion, comprising: a feature extraction module, an embedded vector generation module, a multidimensional feature fusion module, and an underwater sound target recognition module. The feature extraction module is used to extract the spectral features of the underwater sound signal of the underwater sound target to be identified, the spectral features including STFT spectral features, Mel spectral features, and MFCC features; the feature extraction module is further configured to perform feature fusion on the Mel spectral features and the MFCC features to obtain a primary fusion feature. The embedded vector generation module is used to input the primary fusion feature and the STFT spectral features into a trained convolutional neural network to obtain a first embedded feature vector and a second embedded feature vector, respectively. The multidimensional feature fusion module is used to perform feature fusion on the first and second embedded feature vectors to obtain a target fusion feature. The underwater sound target recognition module is used to input the target fusion feature into a trained classification model and identify the underwater sound target to be identified using the trained classification model.
Further, the spectral features also include a delta spectrum and a delta-delta spectrum, and the feature extraction module is further configured to: respectively perform pre-emphasis, framing, windowing and discrete Fourier transform processing on the underwater sound signal to obtain the STFT spectral features, the Mel spectral features and the MFCC features; calculate, based on the Mel spectral features, the delta spectrum and delta-delta spectrum corresponding to the Mel spectral features; and calculate, based on the STFT spectral features, the delta spectrum corresponding to the STFT spectral features; wherein the delta spectrum is calculated as

$$\Delta_t = \frac{\sum_{m=1}^{M} m\,(S_{t+m} - S_{t-m})}{2\sum_{m=1}^{M} m^{2}}$$

and the delta-delta spectrum is calculated as

$$\Delta\Delta_t = \frac{\sum_{m=1}^{M} m\,(\Delta_{t+m} - \Delta_{t-m})}{2\sum_{m=1}^{M} m^{2}}$$

where $S_{t+m}$ is the Mel or STFT spectral feature of the $t+m$ time frame, $S_{t-m}$ is the Mel or STFT spectral feature of the $t-m$ time frame, $M$ represents the number of adjacent frames taken (the difference dimension of the time spectrum), $\Delta_{t+m}$ is the delta spectral feature coefficient of the $t+m$ time frame, and $\Delta_{t-m}$ is the delta spectral feature coefficient of the $t-m$ time frame.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to the first aspect when executing the computer program.
In a fourth aspect, embodiments of the present invention also provide a computer readable storage medium storing computer instructions which, when executed by a processor, implement a method as described in the first aspect.
The invention provides a method and a system for identifying underwater sound targets based on a neural network and multidimensional feature fusion, which use a convolutional neural network layer to process the fused multidimensional time-frequency-domain features and then use a classification model to identify the underwater sound target from the multidimensional fused features. The method can effectively separate the target signal from disordered ocean noise, and has obvious advantages in extracting features with strong characterization capability and realizing a robust target classification model.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the detailed description are briefly introduced below. It is obvious that the drawings described below illustrate only some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of an underwater sound target recognition method based on neural network and multidimensional feature fusion provided by an embodiment of the invention;
FIG. 2 is a diagram of a model architecture of a method for identifying underwater acoustic targets based on neural network and multidimensional feature fusion, provided by an embodiment of the invention;
fig. 3 is a schematic diagram of a spectrum feature extraction flow provided in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a spectral feature according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a primary feature fusion provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of another method for identifying underwater acoustic targets based on neural network and multidimensional feature fusion according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the training process of a method for identifying an underwater sound target based on neural network and multidimensional feature fusion on the ShipEar data set according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a confusion matrix of an underwater sound target recognition method based on neural network and multidimensional feature fusion according to an embodiment of the present invention;
fig. 9 is a schematic diagram of an underwater sound target recognition system based on neural network and multidimensional feature fusion according to an embodiment of the present invention.
Detailed Description
The following describes the technical solutions in the embodiments of the present invention clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the present invention.
Example One
Fig. 1 is a flowchart of an underwater sound target recognition method based on neural network and multidimensional feature fusion, which is provided by an embodiment of the invention. As shown in fig. 1, the method specifically includes the following steps:
step S102, extracting the spectrum characteristics of the underwater sound signal of the underwater sound target to be identified; the spectral features include: STFT spectral features, mel spectral features, and MFCC features. Specifically, the underwater sound signal is a time-series waveform signal.
And step S104, carrying out feature fusion on the Mel frequency spectrum feature and the MFCC feature to obtain a primary fusion feature.
And S106, inputting the primary fusion feature and the STFT spectrum feature into a trained convolutional neural network to respectively obtain a first embedded feature vector and a second embedded feature vector.
And S108, performing feature fusion on the first embedded feature vector and the second embedded feature vector to obtain target fusion features.
Step S110, inputting the target fusion characteristics into a trained classification model, and identifying the underwater sound target to be identified by using the trained classification model.
Fig. 2 is a schematic diagram of the model structure of the underwater sound target recognition method based on neural network and multidimensional feature fusion according to an embodiment of the present invention. As shown in fig. 2, three types of spectral features of the underwater sound target time-series signal are first extracted: Short-Time Fourier Transform (STFT) spectral features, Mel spectral features, and Mel Frequency Cepstrum Coefficient (MFCC) features. The embedding-vector generation stage then comprises two symmetric sub-networks (i.e., convolutional neural networks) with different input feature data, denoted sub-network1 and sub-network2; their input features are, respectively, the STFT spectral features and the fusion of the Mel spectral features with the MFCC features, denoted Input1 and Input2. The outputs of the two sub-networks are embedded feature vectors, denoted Embedding1 and Embedding2, respectively.
Specifically, as shown in fig. 2, the convolutional neural network provided by the embodiment of the invention includes two-dimensional convolutional layers (Conv2D_1 and Conv2D_2), max pooling layers (MaxPooling2D_1 and MaxPooling2D_2), batch normalization layers (BatchNormalization_1 and BatchNormalization_2), Dropout layers (Dropout_1 and Dropout_2), and fully connected layers (Dense_1 and Dense_2). The convolutional neural network can mine local structural features of image-like data while preserving the positional relationship of the features relative to the original image, and the batch normalization and Dropout layers improve the generalization capability and robustness of the network.
Preferably, the classification model includes a Long Short-Term Memory (LSTM) network layer. Specifically, as shown in fig. 2, the classification model further includes an activation function (ReLU), a Dropout layer (Dropout_3), and two fully connected layers (Dense_3 and Dense_4). The LSTM layer operates on the audio frame by frame over past history information, which improves generalization in the target recognition task; the batch normalization layer, Dropout layer and activation function further improve the generalization capability and robustness of the classification model.
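For illustration, one of the embedding sub-networks and its layer ordering can be sketched with the TensorFlow/Keras toolkit named later in Example Two. This is a minimal sketch only: the filter count, dropout rate and padding are assumptions not stated here, while the 5 × 5 kernel, stride 1 and the length-12800 embedding are taken from the architecture details given later in this embodiment (fig. 6); the name build_subnetwork is illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_subnetwork(input_shape, name):
    """One symmetric embedding sub-network: Conv2D -> MaxPooling2D ->
    BatchNormalization -> Dropout -> Dense, as enumerated in Fig. 2."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(64, (5, 5), strides=1, padding="same",
                      activation="relu")(inp)    # filter count is an assumption
    x = layers.MaxPooling2D((3, 3))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.25)(x)                  # rate is an assumption
    x = layers.Flatten()(x)
    x = layers.Dense(12800)(x)                   # length-12800 embedding vector
    return models.Model(inp, x, name=name)
```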
Fig. 3 is a schematic diagram of the spectral feature extraction flow according to an embodiment of the present invention. As shown in fig. 3, the spectral features also include the delta spectrum (Δ in fig. 3) and the delta-delta spectrum (Δ² in fig. 3). Specifically, step S102 further includes:
step S1021, respectively carrying out pre-emphasis, framing, windowing and discrete Fourier transform (Discrete Fourier Transform, DFT) processing on the underwater sound signals to obtain STFT spectrum characteristics, mel spectrum characteristics and MFCC characteristics;
step S1022, based on the Mel spectrum characteristics, calculating delta spectrum and delta-delta spectrum corresponding to the Mel spectrum characteristics;
step S1023, based on the STFT spectrum characteristics, delta spectrum corresponding to the STFT spectrum characteristics is calculated;
wherein the delta spectrum is calculated as

$$\Delta_t = \frac{\sum_{m=1}^{M} m\,(S_{t+m} - S_{t-m})}{2\sum_{m=1}^{M} m^{2}}$$

and the delta-delta spectrum is calculated as

$$\Delta\Delta_t = \frac{\sum_{m=1}^{M} m\,(\Delta_{t+m} - \Delta_{t-m})}{2\sum_{m=1}^{M} m^{2}}$$

where $S_{t+m}$ is the Mel or STFT spectral feature of the $t+m$ time frame, $S_{t-m}$ is the Mel or STFT spectral feature of the $t-m$ time frame, $M$ represents the number of adjacent frames taken (the difference dimension of the time spectrum), $\Delta_{t+m}$ is the delta spectral feature coefficient of the $t+m$ time frame, and $\Delta_{t-m}$ is the delta spectral feature coefficient of the $t-m$ time frame.
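This regression is simple to implement directly. The following NumPy sketch, assuming a spectrogram laid out as (frequency bins × time frames) and M = 2, mirrors the two formulas; librosa.feature.delta offers an equivalent built-in.

```python
import numpy as np

def delta(spec, M=2):
    """Delta spectrum of a (bins x frames) spectrogram via the regression above."""
    T = spec.shape[1]
    padded = np.pad(spec, ((0, 0), (M, M)), mode="edge")  # replicate edge frames
    num = sum(m * (padded[:, M + m:M + m + T] - padded[:, M - m:M - m + T])
              for m in range(1, M + 1))
    return num / (2 * sum(m * m for m in range(1, M + 1)))

# The delta-delta spectrum is the same regression applied to the delta spectrum:
# delta_delta = delta(delta(mel_spec))
```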
In the embodiment of the invention, the STFT spectral features, Mel spectral features and MFCC features are obtained through pre-emphasis, framing, windowing and DFT processing. Because the original spectral features can only reflect the static characteristics of the signal, the embodiment of the invention also computes the time derivative across adjacent frames of the spectrogram, thereby extracting the delta spectrum (Δ) and delta-delta spectrum (Δ²) corresponding to the spectral features, which is more advantageous for identifying moving targets.
For example, at a sampling frequency of 16000 Hz, the STFT spectral features are extracted with a frame length of 1024 and a frame shift of 512, so the generated STFT spectral feature is a stack of 513-dimensional spectral frames along the time dimension. In the model architecture shown in fig. 2, Input1 is a three-dimensional tensor of shape (513 × 95 × N), whose dimensions from left to right are the number of frequency points, the number of frames, and the number of channels. The Mel spectral features and MFCC features have shape (128 × 265 × N); a frame length of 1024 and a frame shift of 256 are used during extraction, and 128 Mel filter banks are set. Fig. 4 is a schematic diagram of the spectral features: in fig. 4, (a) is the original STFT spectral feature, (b) is its Δ spectrum, (c) is the original Mel spectral feature, (d) is its Δ spectrum, (e) is its ΔΔ spectrum, and (f) is the MFCC feature.
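A sketch of how these three features could be extracted with librosa is given below; it is illustrative only: the file name is hypothetical, the exact frame counts (95 and 265 above) depend on the precise segment length, and the parameters shown merely reproduce the frame length, frame shift and filter-bank settings stated in this paragraph.

```python
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=16000)        # hypothetical audio segment

# STFT spectral features: frame length 1024, frame shift 512 -> 513 bins.
stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))

# Mel spectral features: frame length 1024, frame shift 256, 128 Mel banks.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=128)

# MFCC features from the log-Mel spectrum; 128 coefficients are kept here so
# the MFCC branch matches the Mel branch shape for channel-wise fusion.
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=128)
```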
Fig. 5 is a schematic diagram of a primary feature fusion provided in accordance with an embodiment of the present invention. As shown in fig. 5, step S104 includes:
and carrying out feature fusion on the Mel frequency spectrum features, delta frequency spectrum corresponding to the Mel frequency spectrum features, delta-delta frequency spectrum and MFCC features to obtain primary fusion features.
As shown in fig. 5, step S106 further includes:
and taking the characteristic after characteristic fusion of the STFT frequency spectrum characteristic and the delta frequency spectrum corresponding to the STFT frequency spectrum characteristic as input, and inputting the characteristic into the trained convolutional neural network.
Fig. 6 is a model architecture diagram of another underwater sound target recognition method based on neural network and multidimensional feature fusion according to an embodiment of the present invention. As shown in fig. 6, the two-dimensional convolutional layer (Conv2D) and the long short-term memory layer (LSTM) are the key network layers. The Conv2D structure can mine local structural features of image data and preserve the positional relationship of the features relative to the original image; for example, the embodiment of the invention uses a 5 × 5 convolution kernel with a stride of 1 for the convolution operation. After Input1 and Input2 are processed by sub-network1 and sub-network2, each sub-network outputs a one-dimensional embedded feature vector of length 12800; the two vectors are concatenated and reshaped into a two-dimensional matrix of shape (256, 100), which is then fed into the subsequent LSTM layer. The LSTM operates on the audio frame by frame over past history information, improving generalization in the target recognition task; in the embodiment of the invention, the LSTM has 64 units. Two fully connected layers with 32 and 5 nodes respectively follow, and the final network output is a one-dimensional vector of length 5. The network hyperparameters and input/output data formats provided by the embodiment of the invention are shown in Table 1.
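Following the dimensions in this paragraph, the assembly of the two sub-networks, the reshaped embedding and the LSTM classifier can be sketched as below, reusing build_subnetwork from the earlier sketch. The channel counts of the two inputs follow the channel-stacking assumption made above, and the dropout rate is again an assumption.

```python
from tensorflow.keras import layers, models

sub1 = build_subnetwork(input_shape=(513, 95, 2), name="sub_network1")
sub2 = build_subnetwork(input_shape=(128, 265, 4), name="sub_network2")

# Concatenate the two length-12800 embeddings (25600 values in total) and
# reshape into the (256, 100) matrix fed to the LSTM layer.
merged = layers.Concatenate()([sub1.output, sub2.output])
seq = layers.Reshape((256, 100))(merged)

x = layers.LSTM(64)(seq)                         # 64 LSTM units
x = layers.ReLU()(x)
x = layers.Dropout(0.25)(x)                      # rate is an assumption
x = layers.Dense(32, activation="relu")(x)       # Dense_3: 32 nodes
out = layers.Dense(5, activation="softmax")(x)   # Dense_4: 5 target classes

model = models.Model([sub1.input, sub2.input], out)
```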
form-network super-parameter and input-output data format
Preferably, in the embodiment of the present invention, the loss function used to optimize the classification model is the multi-class cross entropy (MCE) loss function. Specifically:

$$L_{MCE} = -\sum_{i=1}^{C} y_i \log(pred_i)$$

where C represents the number of underwater target types, pred represents the predicted result, and y represents the ground-truth result. Preferably, in the embodiment of the present invention, C = 5, and pred and y are both 5-dimensional one-hot vectors.
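For one sample this reduces to a few lines; a NumPy rendering under the stated C = 5 one-hot setting:

```python
import numpy as np

def mce_loss(y, pred, eps=1e-12):
    """Multi-class cross entropy: L = -sum_i y_i * log(pred_i)."""
    return -np.sum(y * np.log(pred + eps))

y = np.array([0, 0, 1, 0, 0])                    # one-hot ground truth, C = 5
pred = np.array([0.05, 0.05, 0.80, 0.05, 0.05])  # softmax prediction
print(mce_loss(y, pred))                         # -log(0.8) ~= 0.223
```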
Preferably, the evaluation indexes of the underwater sound target recognition method based on the neural network and multidimensional feature fusion include accuracy, precision, recall and F1-score, computed as:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \qquad Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}, \qquad F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

where TP (True Positive) means a positive determination is made and is correct; TN (True Negative) means a negative determination is made and is correct; FP (False Positive) means a positive determination is made but is erroneous; and FN (False Negative) means a negative determination is made but is erroneous.
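These four indices follow directly from the confusion-matrix counts; a small helper makes the formulas concrete:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from TP/TN/FP/FN counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```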
As can be seen from the above description, the embodiment of the invention provides a method for identifying underwater sound targets based on neural network and multidimensional feature fusion, which uses a convolutional neural network layer to process the fused multidimensional time-frequency-domain features and then uses a classification model to identify the underwater sound target from the multidimensional fused features. The target signal can thus be effectively separated from disordered ocean noise, and the method has obvious advantages in extracting features with strong characterization capability and realizing a robust target classification model.
Example Two
The performance of the underwater sound target recognition method (FCRN) based on the neural network and the multidimensional feature fusion provided by the embodiment of the invention is verified through a set of comparison experiments.
(1) Data set
ShipEar is a public data set containing underwater audio recordings of various types of ships and of natural environment noise. It was collected by hydrophones deployed at a wharf and comprises 90 audio recordings of different lengths; the durations vary from 15 seconds to 15 minutes, and the sampling frequency is 52734 Hz. The data were collected using several hydrophones with different gains, so the signal-to-noise ratio varies across the data. Specifically, the data set records 11 types of ship radiated noise and 5 types of natural environment background noise, such as wind noise, rain noise, wave noise, waterfall noise, and other environmental noise. All of these audio recordings are roughly classified into 5 types: A, B, C, D and E. The invention divides the audio in the data set into segments of 3 seconds duration with a 50% overlap rate between segments, obtaining a total of 6644 labeled clean underwater sound segments; the specific categories and the audio segment counts are detailed in Table 2.
Table 2: Target classes and labeled audio segments in the dataset
The samples were randomly shuffled and then split into training, validation and test sets at a ratio of 7:1.5:1.5, containing 4651, 996 and 997 audio segments, respectively.
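The 3-second segmentation with 50% overlap and the 7:1.5:1.5 split can be sketched as follows; the random seed is an assumption, and the clip counts reproduce those stated above.

```python
import numpy as np

def segment(y, sr, seg_sec=3.0, overlap=0.5):
    """Cut one recording into fixed-length clips with the given overlap."""
    seg = int(seg_sec * sr)
    hop = int(seg * (1 - overlap))
    return [y[i:i + seg] for i in range(0, len(y) - seg + 1, hop)]

# Shuffle the 6644 labeled clips and split 7 : 1.5 : 1.5.
rng = np.random.default_rng(0)                        # seed is an assumption
idx = rng.permutation(6644)
train, val, test = np.split(idx, [4651, 4651 + 996])  # 4651 / 996 / 997 clips
```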
(2) Experimental setup
1) Contrast model
Underwater target recognition model based on a convolutional neural network (CNN): consists of two convolutional blocks, each containing one Conv2D(64, (5, 5)) layer, one MaxPooling2D((3, 3)) layer and one batch normalization layer, followed by one Dropout(0.25) layer and two fully connected layers. The first fully connected layer has 64 nodes and the second has 5 nodes.
Underwater target recognition model based on a convolutional recurrent network (CRN): the structure of the convolutional blocks is consistent with the CNN-based model, followed by one Dropout(0.25) layer and two LSTM layers, each with 64 nodes. Two fully connected layers and a Dropout(0.25) layer follow, and finally the data is fed into one fully connected layer with 5 nodes.
For both the above CNN-based model and the CRN-based model, the input features include the STFT-2D spectrum, Mel-3D spectrum, and MFCC spectrum.
2) Optimizer parameters
All the comparison models and the underwater sound target recognition method (FCRN) based on neural network and multidimensional feature fusion are implemented with the TensorFlow toolkit; the network parameters are trained and optimized on the training set using the Adam optimizer, and the learning rate is controlled using the validation set. Further, an initial learning rate is set, the batch size is 256, and the training period is set to 100 epochs.
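Expressed with the TensorFlow toolkit named above, the training setup for the model assembled in the earlier sketch might look as follows; the initial learning-rate value does not survive in this text, so Keras' default of 1e-3 stands in as a labeled placeholder.

```python
import tensorflow as tf

# Adam optimizer; learning rate 1e-3 is a placeholder assumption, since the
# stated initial value is not recoverable from this text.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy",    # the MCE loss above
              metrics=["accuracy"])

# Batch size 256 and 100 training epochs, as stated; the feature arrays and
# one-hot labels below are hypothetical names.
# model.fit([x1_train, x2_train], y_train, batch_size=256, epochs=100,
#           validation_data=([x1_val, x2_val], y_val))
```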
3) Results and discussion
Table 3 summarizes the performance of the underwater sound target classifiers implemented in this experiment, with the best results shown in bold:
results of three different models
As can be seen from Table 3: (1) the underwater sound target recognition method (FCRN) based on neural network and multidimensional feature fusion provided by the embodiment of the invention obtains the best recognition performance on the ShipEar dataset, followed by the CRN-Mel3D model; this result clearly shows that fused features are more beneficial for recognizing targets than single features. (2) Models based on the CRN structure are generally superior to models based on the CNN structure, indicating that the CRN network structure has stronger time-series modeling capability. (3) For underwater sound target recognition tasks, Mel spectral features are more discriminative than the other features.
Fig. 7 is a schematic diagram of the training process of the underwater sound target recognition method based on neural network and multidimensional feature fusion on the ShipEar dataset according to an embodiment of the present invention. The left plot of fig. 7 shows the accuracy on the training and validation sets, and the right plot shows the loss on the training and validation sets. Fig. 8 is a schematic diagram of the confusion matrix of the underwater sound target recognition method based on neural network and multidimensional feature fusion according to an embodiment of the present invention. As shown in fig. 8, the accuracy on the 997 test-set samples is 96.67%.
The embodiment of the invention provides an underwater sound target recognition method based on neural network and multidimensional feature fusion to address the underwater sound target recognition problem. Evaluated on the ShipEar measured underwater sound data set, the method shows obvious advantages in extracting features with strong characterization capability and realizing a robust target classification model. The experimental results show that multidimensional feature fusion and the integration of different network structure layers help improve classification performance.
Example Three
Fig. 9 is a schematic diagram of an underwater sound target recognition system based on neural network and multidimensional feature fusion according to an embodiment of the present invention. As shown in fig. 9, the system includes: the system comprises a feature extraction module 10, an embedded vector generation module 20, a multi-dimensional feature fusion module 30 and an underwater sound target recognition module 40.
Specifically, the feature extraction module 10 is configured to extract a spectrum feature of an underwater sound signal of an underwater sound target to be identified; the spectral features include: STFT spectral features, mel spectral features, and MFCC features.
The feature extraction module 10 is further configured to perform feature fusion on the Mel spectrum feature and the MFCC feature, so as to obtain a primary fusion feature.
The embedded vector generation module 20 is configured to input the primary fusion feature and the STFT spectrum feature into a trained convolutional neural network, so as to obtain a first embedded feature vector and a second embedded feature vector respectively.
Specifically, the convolutional neural network provided by the embodiment of the invention comprises a two-dimensional convolutional neural network layer, a max pooling layer, a batch normalization layer, a Dropout layer and a full connection layer.
The multidimensional feature fusion module 30 is configured to perform feature fusion on the first embedded feature vector and the second embedded feature vector to obtain a target fusion feature.
The underwater sound target recognition module 40 is used for inputting the target fusion characteristics into the trained classification model, and recognizing the underwater sound target to be recognized by using the trained classification model. Preferably, the classification model includes a long and short term memory network layer, and the loss function for optimizing the classification model includes a multi-class cross entropy loss function.
Specifically, the spectral characteristics further include delta spectrum and delta-delta spectrum; the feature extraction module 10 is further configured to:
respectively carrying out pre-emphasis, framing, windowing and discrete Fourier transform processing on the underwater sound signals to obtain STFT spectrum characteristics, mel spectrum characteristics and MFCC characteristics;
based on the Mel spectrum characteristics, delta spectrum and delta-delta spectrum corresponding to the Mel spectrum characteristics are calculated;
based on the STFT spectrum characteristics, delta spectrum corresponding to the STFT spectrum characteristics is calculated;
wherein the delta spectrum is calculated as

$$\Delta_t = \frac{\sum_{m=1}^{M} m\,(S_{t+m} - S_{t-m})}{2\sum_{m=1}^{M} m^{2}}$$

and the delta-delta spectrum is calculated as

$$\Delta\Delta_t = \frac{\sum_{m=1}^{M} m\,(\Delta_{t+m} - \Delta_{t-m})}{2\sum_{m=1}^{M} m^{2}}$$

where $S_{t+m}$ is the Mel or STFT spectral feature of the $t+m$ time frame, $S_{t-m}$ is the Mel or STFT spectral feature of the $t-m$ time frame, $M$ represents the number of adjacent frames taken (the difference dimension of the time spectrum), $\Delta_{t+m}$ is the delta spectral feature coefficient of the $t+m$ time frame, and $\Delta_{t-m}$ is the delta spectral feature coefficient of the $t-m$ time frame.
The feature extraction module 10 is further configured to: and carrying out feature fusion on the Mel frequency spectrum features, delta frequency spectrum corresponding to the Mel frequency spectrum features, delta-delta frequency spectrum and MFCC features to obtain primary fusion features.
The embedded vector generation module 20 is further configured to input, as an input, features after feature fusion of the STFT spectrum features and delta spectrums corresponding to the STFT spectrum features, to a trained convolutional neural network.
The embodiment of the invention also provides electronic equipment, which comprises: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as in embodiment one when executing the computer program.
The embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores computer instructions which, when executed by a processor, implement the method as in the first embodiment.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present specification describes the embodiments separately, each embodiment does not necessarily constitute an independent technical solution; this manner of description is adopted for clarity only. Those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments may be appropriately combined to form other embodiments that will be apparent to those skilled in the art.

Claims (10)

1. The underwater sound target identification method based on the neural network and the multidimensional feature fusion is characterized by comprising the following steps of:
extracting the spectrum characteristics of the underwater sound signal of the underwater sound target to be identified; the spectral features include: STFT spectral features, mel spectral features, and MFCC features;
performing feature fusion on the Mel frequency spectrum feature and the MFCC feature to obtain a primary fusion feature;
inputting the primary fusion feature and the STFT spectrum feature into a trained convolutional neural network to respectively obtain a first embedded feature vector and a second embedded feature vector;
performing feature fusion on the first embedded feature vector and the second embedded feature vector to obtain target fusion features;
and inputting the target fusion characteristics into a trained classification model, and identifying the underwater sound target to be identified by using the trained classification model.
2. The method according to claim 1, characterized in that: the classification model includes a long and short term memory network layer.
3. The method according to claim 1, characterized in that: the spectral characteristics further include delta spectrum and delta-delta spectrum; extracting the spectral features of the underwater sound signal of the underwater sound target to be identified comprises:
respectively carrying out pre-emphasis, framing, windowing and discrete Fourier transform processing on the underwater sound signals to obtain the STFT spectrum characteristic, the Mel spectrum characteristic and the MFCC characteristic;
based on the Mel spectrum characteristics, delta spectrum and delta-delta spectrum corresponding to the Mel spectrum characteristics are calculated;
based on the STFT spectrum characteristics, delta spectrum corresponding to the STFT spectrum characteristics is calculated;
wherein the delta spectrum calculation formula comprises:

$$\Delta_t = \frac{\sum_{m=1}^{M} m\,(S_{t+m} - S_{t-m})}{2\sum_{m=1}^{M} m^{2}}$$

and the delta-delta spectrum calculation formula comprises:

$$\Delta\Delta_t = \frac{\sum_{m=1}^{M} m\,(\Delta_{t+m} - \Delta_{t-m})}{2\sum_{m=1}^{M} m^{2}}$$

wherein $S_{t+m}$ is the Mel spectral feature or STFT spectral feature of the $t+m$ time frame, $S_{t-m}$ is the Mel spectral feature or STFT spectral feature of the $t-m$ time frame, $M$ represents the number of adjacent frames taken (the difference dimension of the time spectrum), $\Delta_{t+m}$ is the delta spectral feature coefficient of the $t+m$ time frame, and $\Delta_{t-m}$ is the delta spectral feature coefficient of the $t-m$ time frame.
4. A method according to claim 3, characterized in that: performing feature fusion on the Mel spectrum feature and the MFCC feature to obtain a primary fusion feature, including:
and carrying out feature fusion on the Mel frequency spectrum feature, the delta frequency spectrum and delta-delta frequency spectrum corresponding to the Mel frequency spectrum feature and the MFCC feature to obtain the primary fusion feature.
5. A method according to claim 3, characterized in that: inputting the STFT spectrum features into a trained convolutional neural network, comprising:
and taking the characteristic after characteristic fusion of the STFT frequency spectrum characteristic and the delta frequency spectrum corresponding to the STFT frequency spectrum characteristic as input, and inputting the characteristic into the trained convolutional neural network.
6. The method according to claim 1, characterized in that: the loss function that optimizes the classification model includes a multi-class cross entropy loss function.
7. An underwater sound target recognition system based on neural network and multidimensional feature fusion, comprising: the device comprises a feature extraction module, an embedded vector generation module, a multidimensional feature fusion module and an underwater sound target recognition module; wherein,
the characteristic extraction module is used for extracting the spectrum characteristics of the underwater sound signals of the underwater sound targets to be identified; the spectral features include: STFT spectral features, mel spectral features, and MFCC features;
the feature extraction module is further configured to perform feature fusion on the Mel spectrum feature and the MFCC feature to obtain a primary fusion feature;
the embedded vector generation module is used for inputting the primary fusion characteristic and the STFT spectrum characteristic into a trained convolutional neural network to respectively obtain a first embedded characteristic vector and a second embedded characteristic vector;
the multidimensional feature fusion module is used for carrying out feature fusion on the first embedded feature vector and the second embedded feature vector to obtain target fusion features;
the underwater sound target recognition module is used for inputting the target fusion characteristics into a trained classification model, and recognizing the underwater sound target to be recognized by utilizing the trained classification model.
8. The system according to claim 7, wherein: the spectral characteristics further include delta spectrum and delta-delta spectrum; the feature extraction module is further configured to:
respectively carrying out pre-emphasis, framing, windowing and discrete Fourier transform processing on the underwater sound signals to obtain the STFT spectrum characteristic, the Mel spectrum characteristic and the MFCC characteristic;
based on the Mel spectrum characteristics, delta spectrum and delta-delta spectrum corresponding to the Mel spectrum characteristics are calculated;
based on the STFT spectrum characteristics, delta spectrum corresponding to the STFT spectrum characteristics is calculated;
wherein the delta spectrum calculation formula comprises:

$$\Delta_t = \frac{\sum_{m=1}^{M} m\,(S_{t+m} - S_{t-m})}{2\sum_{m=1}^{M} m^{2}}$$

and the delta-delta spectrum calculation formula comprises:

$$\Delta\Delta_t = \frac{\sum_{m=1}^{M} m\,(\Delta_{t+m} - \Delta_{t-m})}{2\sum_{m=1}^{M} m^{2}}$$

wherein $S_{t+m}$ is the Mel spectral feature or STFT spectral feature of the $t+m$ time frame, $S_{t-m}$ is the Mel spectral feature or STFT spectral feature of the $t-m$ time frame, $M$ represents the number of adjacent frames taken (the difference dimension of the time spectrum), $\Delta_{t+m}$ is the delta spectral feature coefficient of the $t+m$ time frame, and $\Delta_{t-m}$ is the delta spectral feature coefficient of the $t-m$ time frame.
9. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any of claims 1-6 when the computer program is executed.
10. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the method of any one of claims 1-6.
CN202410264622.9A 2024-03-08 2024-03-08 Underwater sound target identification method and system based on neural network and multidimensional feature fusion Active CN117854540B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410264622.9A CN117854540B (en) 2024-03-08 2024-03-08 Underwater sound target identification method and system based on neural network and multidimensional feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410264622.9A CN117854540B (en) 2024-03-08 2024-03-08 Underwater sound target identification method and system based on neural network and multidimensional feature fusion

Publications (2)

Publication Number Publication Date
CN117854540A true CN117854540A (en) 2024-04-09
CN117854540B CN117854540B (en) 2024-05-17

Family

ID=90542221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410264622.9A Active CN117854540B (en) 2024-03-08 2024-03-08 Underwater sound target identification method and system based on neural network and multidimensional feature fusion

Country Status (1)

Country Link
CN (1) CN117854540B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200072030A (en) * 2018-12-12 2020-06-22 충남대학교산학협력단 Apparatus and method for detecting multimodal cough using audio and acceleration data
CN113724712A (en) * 2021-08-10 2021-11-30 南京信息工程大学 Bird sound identification method based on multi-feature fusion and combination model
CN115081473A (en) * 2022-05-31 2022-09-20 同济大学 Multi-feature fusion brake noise classification and identification method
CN115862684A (en) * 2022-08-01 2023-03-28 常州大学 Audio-based depression state auxiliary detection method for dual-mode fusion type neural network
WO2023245991A1 (en) * 2022-06-24 2023-12-28 南方电网调峰调频发电有限公司储能科研院 Power plant equipment state auditory monitoring method merging frequency band top-down attention mechanism

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200072030A (en) * 2018-12-12 2020-06-22 충남대학교산학협력단 Apparatus and method for detecting multimodal cough using audio and acceleration data
CN113724712A (en) * 2021-08-10 2021-11-30 南京信息工程大学 Bird sound identification method based on multi-feature fusion and combination model
CN115081473A (en) * 2022-05-31 2022-09-20 同济大学 Multi-feature fusion brake noise classification and identification method
WO2023245991A1 (en) * 2022-06-24 2023-12-28 南方电网调峰调频发电有限公司储能科研院 Power plant equipment state auditory monitoring method merging frequency band top-down attention mechanism
CN115862684A (en) * 2022-08-01 2023-03-28 常州大学 Audio-based depression state auxiliary detection method for dual-mode fusion type neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
REN CHENXI: "Research on underwater acoustic target recognition technology based on joint neural networks", China Masters' Theses Full-text Database, Basic Sciences, no. 01, 15 January 2023 (2023-01-15) *
SHEN LINGJIE; WANG WEI: "Automatic recognition method for Chinese tones in short speech based on fused features", Technical Acoustics, no. 02, 15 April 2018 (2018-04-15) *

Also Published As

Publication number Publication date
CN117854540B (en) 2024-05-17

Similar Documents

Publication Publication Date Title
CN108597496B (en) Voice generation method and device based on generation type countermeasure network
CN105448302B (en) A kind of the speech reverberation removing method and system of environment self-adaption
CN109584904B (en) Video-song audio-song name recognition modeling method applied to basic music video-song education
US6502067B1 (en) Method and apparatus for processing noisy sound signals
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
CN110647656B (en) Audio retrieval method utilizing transform domain sparsification and compression dimension reduction
Ting Yuan et al. Frog sound identification system for frog species recognition
CN112786059A (en) Voiceprint feature extraction method and device based on artificial intelligence
CN112183582A (en) Multi-feature fusion underwater target identification method
CN110610722B (en) Short-time energy and Mel cepstrum coefficient combined novel low-complexity dangerous sound scene discrimination method based on vector quantization
CN114333865A (en) Model training and tone conversion method, device, equipment and medium
Xu et al. U-former: Improving monaural speech enhancement with multi-head self and cross attention
KR102204975B1 (en) Method and apparatus for speech recognition using deep neural network
CN112735466B (en) Audio detection method and device
CN117310668A (en) Underwater sound target identification method integrating attention mechanism and depth residual error shrinkage network
KR101802444B1 (en) Robust speech recognition apparatus and method for Bayesian feature enhancement using independent vector analysis and reverberation parameter reestimation
CN112466276A (en) Speech synthesis system training method and device and readable storage medium
CN117854540B (en) Underwater sound target identification method and system based on neural network and multidimensional feature fusion
CN115910091A (en) Method and device for separating generated voice by introducing fundamental frequency clues
CN115909040A (en) Underwater sound target identification method based on self-adaptive multi-feature fusion model
CN113488069B (en) Speech high-dimensional characteristic rapid extraction method and device based on generation type countermeasure network
Korany Application of wavelet transform for classification of underwater acoustic signals
CN111785262A (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Ouyang Single-Channel Speech Enhancement Based on Deep Neural Networks
Wang et al. Underwater Acoustic Target Recognition Combining Multi-scale Features and Attention Mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant