CN117454240A - Ship target identification method and system based on underwater acoustic signals - Google Patents

Ship target identification method and system based on underwater acoustic signals

Info

Publication number
CN117454240A
CN117454240A CN202311192186.0A
Authority
CN
China
Prior art keywords
scale
convolution
features
block
ship
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311192186.0A
Other languages
Chinese (zh)
Inventor
周翱隆
李小勇
宋君强
任开军
冷洪泽
邓科峰
金文婧
李平政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202311192186.0A
Publication of CN117454240A
Legal status: Pending

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/08Feature extraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/12Classification; Matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention discloses a ship target recognition method and system based on underwater acoustic signals. The method first extracts multiple representation forms of the underwater acoustic signal, each focusing on different frequency components, thereby enriching the feature representation; it then extracts time-frequency spectrum features that focus on these different frequency components and serve as the input data for subsequent model recognition. In model recognition, a plurality of consecutive attention-based multi-scale convolution blocks are deployed to learn the key features relevant to identifying the underwater acoustic signals of different categories of ships: specifically, a group of parallel residual convolution blocks with different kernels captures multi-scale features so as to further learn spectral-spatial discrimination features of different scales, and an adaptive channel attention module then highlights the dominant parts of the global features and suppresses interference noise. Finally, the categories of the ships present in the underwater acoustic signals are output accurately and efficiently based on the model. The invention can improve the accuracy of ship identification.

Description

Ship target identification method and system based on underwater acoustic signals
Technical Field
The invention relates to the technical field of underwater acoustic signal identification, in particular to a ship target identification method and system based on underwater acoustic signals.
Background
Underwater acoustic target recognition (UATR) is an important research direction and technical problem in the field of underwater acoustic signal processing, especially for remote target detection and decision information transmission. Acoustic waves are at present the only effective information carrier for underwater long-distance propagation, and a ship moving in the ocean inevitably emits acoustic signals. The ship-radiated signal generally reflects type information such as the hull structure, the propeller structure, the engine power, and the ship's condition, and is important identity information for ship type assessment and identification. However, the conventional UATR approach has always depended on the manual judgment of trained sonar operators, whose recognition accuracy is affected not only by subjective factors such as psychology and physiology, but also by objective conditions such as the harsh underwater environment and low signal strength. Therefore, exploring an accurate and reliable automatic underwater acoustic target identification method has important practical significance.
Currently, many efforts have been devoted to the problem of automatic recognition of underwater acoustic targets. However, the following two problems in the current technology still need to be solved:
(1) Different ship-radiated signals usually have distinguishable characteristics owing to factors such as ship size, sailing speed, and propulsion system. Existing UATR feature extraction methods are usually focused on processing a specific type of feature; however, a single feature is usually insufficient to comprehensively describe the unique characteristics of ship-radiated signals, particularly for solving the UATR problem under complex environmental conditions;
(2) Due to the complexity of the marine environment and the interference of background noise, the existing UATR feature extraction method is greatly influenced by noise, which influences the accuracy of ship target identification.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. Therefore, the invention provides a ship target identification method, system, device, and storage medium based on underwater acoustic signals, which can improve the accuracy of ship identification.
According to an embodiment of the first aspect of the present invention, a ship target recognition method based on underwater acoustic signals includes the following steps:
extracting a multi-view time domain signal representation of an underwater acoustic signal of a ship, and converting the multi-view time domain signal representation into a time-frequency spectrum characteristic;
inputting the time-frequency spectrum features into the first of a plurality of multi-scale convolution blocks connected in series to obtain the final features output by the last multi-scale convolution block, and inputting the final features into an average pooling layer and a full connection layer to obtain a ship target recognition result output by the full connection layer; any two multi-scale convolution blocks have the same structure. Any one multi-scale convolution block inputs, in parallel, the time-frequency spectrum features or the features output by the preceding multi-scale convolution block in the series into a group of residual convolution blocks with different kernels to obtain the multi-scale feature map output by each residual convolution block, combines the multi-scale feature maps of the residual convolution blocks to obtain a combined feature map, and inputs the combined feature map into an adaptive channel attention block to obtain the intermediate features or final features output by the adaptive channel attention block; the intermediate features serve as the input features of the next multi-scale convolution block in the series.
The ship target recognition method according to the embodiment of the invention has at least the following beneficial effects:
firstly, the multi-view time domain signal representation of the underwater acoustic signal is extracted, yielding multiple representation forms of the same data, each focusing on different frequency components and enriching the feature representation; secondly, time-frequency spectrum features focusing on different frequency components are extracted and used in subsequent model recognition; then, in model recognition, a plurality of consecutive attention-based multi-scale convolution blocks are deployed to learn the key features relevant to identifying the underwater acoustic signals of different categories of ships: specifically, a group of parallel residual convolution blocks with different kernels captures multi-scale features so as to further learn spectral-spatial discrimination features of different scales, and an adaptive channel attention module then highlights the dominant parts of the global features and suppresses interference noise; finally, the categories of the ships in the underwater acoustic signals are output accurately and efficiently based on the model.
According to some embodiments of the present invention, any one of the residual convolution blocks includes a plurality of two-dimensional convolution layers connected in series, with a 1x1 convolution layer used for the channel connection; the convolution kernels of any two of the two-dimensional convolution layers are the same, and batch normalization and a rectified linear unit follow each two-dimensional convolution layer. The residual convolution block inputs the time-frequency spectrum features, or the features output by the preceding multi-scale convolution block in the series, into the first of the series-connected two-dimensional convolution layers so as to obtain the multi-scale feature map output by the last two-dimensional convolution layer.
According to some embodiments of the invention, the adaptive channel attention block aggregates the global statistical features of each channel in the combined feature map using average pooling and reshapes them into an intermediate tensor; the intermediate tensor is input into an MLP module with two fully connected layers, which captures the nonlinear inter-channel relationships and generates attention weights for all channels; the attention weights are multiplied element-wise with the combined feature map, and the intermediate features or the final features are output.
According to some embodiments of the invention, the multi-scale convolution block comprises three parallel residual convolution blocks and the convolution kernel sizes of the three parallel residual convolution blocks are 3, 5, and 7.
According to some embodiments of the invention, the extracting a multi-view time domain signal representation of the underwater sound signal and converting the multi-view time domain signal representation into time-spectral features comprises:
extracting a multi-view time domain signal representation of the underwater acoustic signal using a set of band-pass filters, wherein the frequency bands of the band-pass filters do not overlap with one another;
the multi-view time domain signal representation is converted to time-spectral features using a mel filter.
According to some embodiments of the invention, before extracting the multi-view time domain signal representation of the underwater acoustic signal using the set of bandpass filters, the method for identifying a ship target based on the underwater acoustic signal further comprises:
Gaussian noise is added to the underwater acoustic signal.
According to some embodiments of the invention, before the converting the multi-view time domain signal representation into time-frequency spectrum features using a mel filter, the method for ship target identification based on underwater acoustic signals further comprises:
and carrying out data enhancement on the time spectrum characteristics by adopting a SpecAugment method.
According to a second aspect of the present invention, a ship target recognition system based on an underwater acoustic signal includes:
the characteristic extraction unit is used for extracting a multi-view time domain signal representation of the underwater sound signal of the ship and converting the multi-view time domain signal representation into a time spectrum characteristic;
the target identification unit is used for inputting the time-frequency spectrum features into the first of a plurality of multi-scale convolution blocks connected in series to obtain the final features output by the last multi-scale convolution block, and inputting the final features into an average pooling layer and a full connection layer to obtain the ship target identification result output by the full connection layer; any two multi-scale convolution blocks have the same structure. Any one multi-scale convolution block inputs, in parallel, the time-frequency spectrum features or the features output by the preceding multi-scale convolution block in the series into a group of residual convolution blocks with different kernels to obtain the multi-scale feature map output by each residual convolution block, combines the multi-scale feature maps of the residual convolution blocks to obtain a combined feature map, and inputs the combined feature map into an adaptive channel attention block to obtain the intermediate features or final features output by the adaptive channel attention block; the intermediate features serve as the input features of the next multi-scale convolution block in the series.
An electronic device according to an embodiment of the third aspect of the present invention comprises at least one control processor and a memory communicatively connected to the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform the above-described ship target recognition method based on underwater acoustic signals.
A computer-readable storage medium according to an embodiment of the fourth aspect of the present invention stores computer-executable instructions for causing a computer to perform the above-described ship target recognition method based on underwater sound signals.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a flow chart of a method for identifying a ship target based on underwater acoustic signals according to an embodiment of the present invention;
FIG. 2 is a block diagram of a ship target recognition method based on underwater acoustic signals according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of a multi-scale convolutional network based on an attention mechanism, in accordance with one embodiment of the present invention;
FIG. 4 is an architecture diagram of a residual convolution block provided by an embodiment of the present invention;
FIG. 5 is a block diagram of an adaptive channel attention block according to one embodiment of the present invention;
FIG. 6 is a schematic diagram of the accuracy, precision, recall, and F1 score values of different feature extraction methods trained on the ShipsEar dataset, provided by an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
In the description of the present invention, the description of first, second, etc. is for the purpose of distinguishing between technical features only and should not be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated or implicitly indicating the precedence of the technical features indicated.
In the description of the present invention, it should be understood that the direction or positional relationship indicated with respect to the description of the orientation, such as up, down, etc., is based on the direction or positional relationship shown in the drawings, is merely for convenience of describing the present invention and simplifying the description, and does not indicate or imply that the apparatus or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
In the description of the present invention, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present invention can be determined reasonably by a person skilled in the art in combination with the specific content of the technical solution.
Description of the background
Underwater acoustic target recognition is an important research direction and technical problem in the field of underwater acoustic signal processing, particularly in the aspects of remote target detection and decision information transmission. Acoustic waves are the only effective information carrier for underwater long-distance propagation at present, and a ship moving in the ocean inevitably emits acoustic signals. The ship radiation signal generally reflects the type information such as the hull structure, the propeller structure, the engine power, the ship condition and the like, and is important identity information for ship type assessment and identification. However, the conventional UATR method is always dependent on manual judgment of a trained sonar operator, and the recognition accuracy is affected not only by subjective factors such as psychology and physiology, but also by objective conditions such as severe underwater environment and low signal strength.
Currently, many efforts have been devoted to the problem of automatic recognition of underwater acoustic targets. For example, the basic components of an underwater acoustic target recognition system include information acquisition, preprocessing, feature extraction, and a classifier. However, the following two problems in the current technology still need to be solved:
(1) Different ship-radiated signals often have distinguishable characteristics owing to factors such as ship size, sailing speed, and propulsion system; these characteristics are usually manifested as different spectral-spatial energy distribution structures in the time-frequency domain and can serve as important criteria for identifying ship targets. However, most prior art techniques input the entire spectral feature into the classifier model, which may result in over-reliance during training on the low frequency bands with higher energy and neglect of the contribution of the lower-energy high frequency bands (the low frequency components of an underwater acoustic signal are significantly stronger than the high frequency components). Furthermore, existing UATR feature extraction methods are typically focused on processing specific types of features; however, a single feature is often insufficient to fully describe the unique characteristics of ship-radiated signals, especially when solving the UATR problem under complex environmental conditions.
(2) Due to the complexity of the marine environment and the interference of background noise, the existing UATR feature extraction method is greatly influenced by noise, which influences the accuracy of ship target identification.
First embodiment
Referring to fig. 1, in one embodiment of the present application, there is provided a ship target recognition method based on an underwater acoustic signal, the ship target recognition method based on the underwater acoustic signal including the steps of:
step S101, extracting a multi-view time domain signal representation of the underwater sound signal of the ship, and converting the multi-view time domain signal representation into time spectrum characteristics.
Step S102, inputting the time spectrum characteristics into a first multi-scale convolution block in a plurality of multi-scale convolution blocks connected in series to obtain final characteristics output by a last multi-scale convolution block, and inputting the final characteristics into an average pooling layer and a full connection layer to obtain a ship target identification result output by the full connection layer; the method comprises the steps that the structures of any two multi-scale convolution blocks are the same, any one multi-scale convolution block parallelly inputs time spectrum characteristics or characteristics output by the last multi-scale convolution block connected in series to a group of residual convolution blocks with different kernels so as to obtain multi-scale characteristic diagrams output by each residual convolution block, the multi-scale characteristic diagrams of each residual convolution block are combined to obtain combined characteristic diagrams, the combined characteristic diagrams are input into an adaptive channel attention block, intermediate characteristics or final characteristics output by the adaptive channel attention block are obtained, and the intermediate characteristics serve as characteristics input by the next multi-scale convolution block connected in series.
In step S101, extracting a multi-view time domain signal representation of the underwater acoustic signal of the ship and converting the multi-view time domain signal representation into time-spectral features, comprising:
step S1011, extracting a multi-view time domain signal representation of the underwater sound signal by adopting a group of band-pass filters; each band pass filter has non-overlapping frequency bands.
Step S1012, converting the multi-view time domain signal representation into a time-spectral feature using a mel filter.
In step S1011, inspired by multi-view data representation techniques using spectral filtering, this embodiment proposes a band-pass filter bank as a preprocessing step to extract subsets of frequency components from the original ship's underwater acoustic signal; the filtered signals are then represented as multi-view data and used for subsequent feature extraction.
It is known that the audio signals radiated by different types of ships have different energy patterns within specific narrow frequency bands and present different frequency energy distribution structures, which can generally serve as an important basis for target recognition. However, most existing methods directly process the entire spectral features of the underwater acoustic signal; during training, such methods tend to pay more attention to the low frequency band with higher energy, while ignoring the contribution of the high frequency band to target recognition. Thus, in order to locate the discriminative information between different classes of vessels, a multi-view representation of the broadband ship-radiated signal is employed at the feature acquisition stage, each view representing a narrowband localized acoustic signal.
In one embodiment of the present application, the filter bank is made up of a plurality of bandpass filters having different cut-off frequencies, as the distinguishable information of the ship's radiation signal tends to concentrate in the low frequency part.
Existing UATR feature extraction methods are typically focused on isolating a particular type of feature. However, a single feature is often insufficient to fully describe the unique characteristics of a ship's radiated signal, particularly when addressing the UATR problem under complex environmental conditions. Thus, in step S102, this embodiment uses multi-scale convolution blocks with different kernel sizes to extract the valuable parts of the features from different angles, thereby enriching the expressive power of the features. In order to prepare the filtered multi-view representations as input to the recognition model, they are further converted into time-frequency spectrum features. This embodiment performs the time-frequency domain conversion using a mel filter bank that simulates the auditory characteristics of the human ear, which has been widely used in various fields such as speech recognition and underwater acoustic target recognition.
In step S102, the model includes a plurality of multi-scale convolution blocks connected in series, an average pooling layer, and full connection layers. The network structure of each multi-scale convolution block is the same; the purpose of arranging a plurality of multi-scale convolution blocks is to further extract features of different scales and to highlight the main components in the global features. The output of each multi-scale convolution block serves as the input of the next one. A batch normalization layer is additionally arranged before the first multi-scale convolution block, which helps the model handle the unbalanced data distribution within each view feature. After the last multi-scale convolution block outputs its features, the model inputs them into the average pooling layer for pooling; finally, the spectral features are aggregated and the target class is predicted through the full connection layers, of which the model comprises two.
The multi-scale convolution block is composed of a group of residual convolution blocks with different kernels and an adaptive channel attention block. The residual convolution blocks are arranged in parallel: the input features are fed into each residual convolution block in parallel, the features output by the residual convolution blocks are combined and then input into the adaptive channel attention block, and the features output by the adaptive channel attention block are the output of the multi-scale convolution block.
The first multi-scale convolution block performs the following steps: the time spectrum features are input into a group of residual convolution blocks with different kernels so as to obtain a multi-scale feature map output by each residual convolution block, the multi-scale feature maps of each residual convolution block are combined to obtain a combined feature map, the combined feature map is input into an adaptive channel attention block so as to obtain an intermediate feature output by the adaptive channel attention block, and the intermediate feature is used as the input of the next multi-scale convolution block connected in series.
The last multi-scale convolution block performs the following steps: the intermediate features output by the preceding multi-scale convolution block in the series are input in parallel into a group of residual convolution blocks with different kernels to obtain the multi-scale feature map output by each residual convolution block; the multi-scale feature maps of the residual convolution blocks are merged to obtain a merged feature map, which is input into the adaptive channel attention block to obtain the final features output by the adaptive channel attention block; the final features are input into the average pooling layer.
Large-scale convolutions have a wider receptive field and a stronger semantic representation capability, but the resolution of the resulting feature map is lower. Conversely, small-scale convolutions have smaller receptive fields, are good at representing detailed information, and yield higher feature map resolution. Thus, the multi-scale convolution block of this embodiment consists of a set of residual convolution layers with different kernel sizes and an adaptive channel attention module. On the one hand, the multi-scale residual convolution layers extract features with different receptive fields, thereby learning the discriminative features of the underwater acoustic signal from as many angles as possible. On the other hand, the channel attention module locates the most valuable parts of the multiple features and adaptively enhances their weights.
In some embodiments of the present application, any one residual convolution block includes a plurality of two-dimensional convolution layers connected in series, with a 1x1 convolution layer used for the channel connection; the convolution kernels of any two two-dimensional convolution layers are the same, and batch normalization and a rectified linear unit follow each two-dimensional convolution layer. The residual convolution block inputs the time-frequency spectrum features, or the features output by the preceding multi-scale convolution block in the series, into the first of the series-connected two-dimensional convolution layers so as to obtain the multi-scale feature map output by the last two-dimensional convolution layer.
The first two-dimensional convolution layer is used to increase the feature dimension while reducing the size of the feature map. In addition, a 1x1 convolution layer is added to the channel connection portion to maintain the consistency of the residual inputs. Each two-dimensional convolution layer is followed by batch normalization (BN) and a rectified linear unit (ReLU). The channel connection operation is used to fuse the output features of each residual two-dimensional convolution module. The combined features are passed through a 1x1 convolution layer and then sent to the subsequent channel attention module.
In some embodiments of the present application, the adaptive channel attention block aggregates the global statistical features of each channel in the combined feature map using average pooling and reshapes them into an intermediate tensor; the intermediate tensor is input into an MLP module with two fully connected layers, which captures the nonlinear inter-channel relationships and generates attention weights for all channels; the attention weights are multiplied element-wise with the combined feature map, and the intermediate features or the final features are output.
To capture more discriminative information from global features, this step draws on the successful application of channel attention in computer vision. Step S102 employs an adaptive channel attention mechanism to adaptively assign different weights to all channel features by capturing the relationships between features from different channels. The core idea of the channel attention mechanism is to enhance important channels and suppress unimportant ones, thereby improving the discriminability of the features. In this way, greater weights are assigned to the features that dominate recognition, helping to suppress interference noise and improve the recognition accuracy and robustness of the model.
In this embodiment, the multi-view time domain signal representation of the underwater acoustic signal is first extracted, yielding multiple representation forms of the same data, each focusing on different frequency components and enriching the feature representation; secondly, time-frequency spectrum features focusing on different frequency components are extracted and used in subsequent model recognition; then, in model recognition, a plurality of consecutive attention-based multi-scale convolution blocks are deployed to learn the key features relevant to identifying the underwater acoustic signals of different categories of ships: specifically, a group of parallel residual convolution blocks with different kernels captures multi-scale features so as to further learn spectral-spatial discrimination features of different scales, and an adaptive channel attention module then highlights the dominant parts of the global features and suppresses interference noise; finally, the categories of the ships in the underwater acoustic signals are output accurately and efficiently based on the model.
In some embodiments of the present application, the step S101 further includes enhancing the data, which specifically includes the following enhancing steps:
before extracting the multi-view time domain signal representation of the underwater sound signal using the set of band pass filters in step S101, further comprises: gaussian noise is added to the underwater acoustic signal.
Before converting the multi-view time domain signal representation into the time-frequency spectrum features using a mel filter in step S101, the method further comprises: carrying out data enhancement on the time-frequency spectrum features using the SpecAugment method.
Due to the complex underwater environment, it is difficult to acquire underwater acoustic signals in sufficient quantity, and the data are inevitably disturbed by background noise, reverberation, and other factors, which makes the task of underwater target identification very challenging. Data enhancement is an important technique in recognition tasks; it helps increase the quantity and diversity of the training data and improves the generalization capability and recognition performance of a model. In the above embodiment, different data enhancement strategies are adopted for the time-domain signals and the time-frequency-domain features. Specifically, Gaussian noise is randomly added to the original waveform signal to generate data with different signal-to-noise ratios (SNRs), simulating the various noise intensities common in underwater environments. For the time-frequency spectrum features, the mel-spectrum features are enhanced using SpecAugment (a simple data augmentation method for automatic speech recognition), which has proven effective in audio recognition tasks. Time masking and frequency masking are the two most representative strategies in SpecAugment.
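As a minimal illustration of these two strategies (the function names are hypothetical; the SNR range follows the noise levels used in the experiments below, and the mask sizes follow the reported settings of 15 and 25):

```python
import random
import torch
import torchaudio

def add_gaussian_noise(waveform: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Add white Gaussian noise so the result has the requested SNR in dB."""
    signal_power = waveform.pow(2).mean()
    noise_power = signal_power / (10 ** (snr_db / 10))
    return waveform + torch.randn_like(waveform) * noise_power.sqrt()

# SpecAugment-style masking on a mel spectrogram of shape (..., freq, time)
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=25)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=15)

def augment(waveform: torch.Tensor, mel_spec: torch.Tensor):
    wav = add_gaussian_noise(waveform, snr_db=random.uniform(-15, 15))
    spec = time_mask(freq_mask(mel_spec))
    return wav, spec
```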
Second embodiment
Referring to fig. 2 to 5, in one embodiment of the present application, there is provided a ship target recognition method based on underwater acoustic signals, the method including:
step S201, a feature extraction stage and a feature enhancement stage;
random gaussian noise is added to the original ship's underwater acoustic signal to generate data with different signal-to-noise ratios (SNRs), simulating various noise intensities common in underwater environments.
The original ship underwater acoustic signal (composed of the ship-radiated acoustic signal and noise) is spectrally filtered by a band-pass filter to obtain a multi-view representation of the ship-radiated acoustic signal (including mechanical vibration sound, propeller sound, and hydrodynamic sound). Consider an original underwater acoustic signal $x(m) \in \mathbb{R}^{1 \times M}$ ($m \in 1, \dots, M$) and its corresponding class label $y \in \{0, 1, \dots, N_c - 1\}$, where $M$ represents the number of time points in the signal and $N_c$ represents the total number of different categories.
The multi-view representation features $\{x_i\}_{i=1}^{N_b}$ are obtained by spectrally filtering the original ship-radiated signal $x$ with a band-pass filter bank $\{\mathcal{F}_i\}_{i=1}^{N_b}$, the filter bank comprising $N_b$ time-domain band-pass filters. In particular, the process of spectral filtering can be expressed as:

$$x_i = \mathcal{F}_i(x), \qquad i = 1, \dots, N_b$$

wherein $\mathcal{F}_i(\cdot)$ indicates the band-pass filter operation. After the filtering operation, each view representation focuses on a specific frequency band, serving the identification of different classes of ships at a finer granularity.
A filter bank with $N_b = 5$ band-pass filters is used; the filters have non-overlapping frequency bands, with bandwidths covering 10 to 8000 Hz (10-500, 500-1000, 1000-2000, 2000-4000, and 4000-8000 Hz). The filters are used to decompose each original acoustic signal into 5 different representations.
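A sketch of this decomposition, assuming SciPy Butterworth band-pass filters (the filter family and order are not specified in the text; the band edges follow the values above):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

BANDS = [(10, 500), (500, 1000), (1000, 2000), (2000, 4000), (4000, 8000)]  # Hz

def multi_view(x: np.ndarray, fs: int = 16000, order: int = 4) -> np.ndarray:
    """Decompose a 1-D signal into N_b = 5 band-limited views, shape (5, M)."""
    views = []
    for lo, hi in BANDS:
        hi = min(hi, fs / 2 - 1)  # keep the upper edge strictly below Nyquist
        sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
        views.append(sosfiltfilt(sos, x))
    return np.stack(views)
```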
The generated views are processed using a mel-filter bank to extract mel-spectrogram features that are focused on different frequency components.
The SpecAugment method is adopted to carry out data enhancement on the mel-spectrogram features.
The specific relation of the mel frequency $f_{\mathrm{mel}}$ to the actual frequency $f$ can be expressed as follows:

$$f_{\mathrm{mel}} = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right)$$

The scale value of the mel frequency $f_{\mathrm{mel}}$ corresponds to a logarithmic distribution of the actual frequency $f$: as the frequency increases, the bandwidths of the filters gradually widen. The mel-spectrum features corresponding to the multi-view data representation can be described as $X \in \mathbb{R}^{C \times F \times T}$, where $C$ represents the number of channels (set to $N_b$), $F$ denotes the number of mel filters, and $T$ denotes the number of time frames.
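A hedged sketch of this conversion using torchaudio (the window length, hop length, and number of mel filters follow the experimental settings reported below; the log compression is a common practice assumed here, not stated in the text):

```python
import torch
import torchaudio

# Each of the C = N_b filtered views becomes one channel of X in R^{C x F x T}.
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=512, win_length=512, hop_length=256, n_mels=80
)

def views_to_mel(views: torch.Tensor) -> torch.Tensor:
    """views: (C, M) band-limited waveforms -> X: (C, F=80, T) mel features."""
    return torch.log(to_mel(views) + 1e-6)  # log compression (assumption)
```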
Step S202, a feature recognition stage;
first, a multi-scale convolution network based on an attention mechanism is set, and the network structure sequentially comprises a Batch normalization Layer (such as Batch Norm in fig. 3), a plurality of multi-scale convolution blocks (four), a pooling Layer (such as Avgpooling in fig. 3) and 2 continuous full connection layers (such as Linear layers in fig. 3). The multi-scale convolution block is used as a main network, and the characteristic X epsilon R of the Mel spectrogram is used as a main network C×F×T And inputting, extracting features of different scales, and highlighting main components in the global features. Finally, two consecutive full-connected layers aggregate spectral features and produce a final class output. As shown in fig. 2, AMSC Block is used to represent a multi-scale convolution Block, and a plurality of multi-scale convolution blocks are connected in series.
The multi-scale convolution block is composed of a set of residual convolution blocks with different kernel sizes and an adaptive channel attention block. On the one hand, the residual convolution block extracts features with different receptive fields, so that the distinguishing features of the underwater sound signals are learned from different angles as much as possible. On the other hand, the adaptive channel attention block locates the most valuable parts from the plurality of features and adaptively enhances their weights. In the multi-scale convolution network of the embodiment, the multi-scale convolution blocks are repeated four times, each block is composed of three parallel residual convolution blocks, and the convolution kernel sizes are 3, 5 and 7. The number of output channels of all residual convolution blocks in each multi-scale convolution block is the same, and the number of channels corresponding to four different multi-scale convolution blocks is 16, 32, 64 and 128 respectively.
As in fig. 4, the residual convolution block consists of three two-dimensional convolution layers with the same kernel size. The first convolution layer is used to increase the feature dimension while decreasing the feature map size, so its stride is (2, 2); the strides of the next two convolution layers are (1, 1). In addition, a 1x1 convolution layer is added to the channel connection portion to maintain the consistency of the residual inputs. All convolution layers in the residual convolution block are followed by batch normalization (BN) and a rectified linear unit (ReLU). The channel connection operation is used to fuse the output features of each convolution layer. The combined features are passed through a 1x1 convolution layer and then sent to the subsequent channel attention module (Channel Attention in fig. 3).
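A minimal PyTorch sketch of such a residual convolution block, assuming 'same' padding (not specified in the text) and illustrative names:

```python
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    def __init__(self, c_in: int, c_out: int, kernel: int):
        super().__init__()
        p = kernel // 2  # 'same' padding, an assumption
        def conv(ci, co, stride):
            return nn.Sequential(
                nn.Conv2d(ci, co, kernel, stride=stride, padding=p),
                nn.BatchNorm2d(co), nn.ReLU(inplace=True))
        self.body = nn.Sequential(
            conv(c_in, c_out, (2, 2)),   # raises channels, halves the map
            conv(c_out, c_out, (1, 1)),
            conv(c_out, c_out, (1, 1)))
        # strided 1x1 projection keeps the residual shapes consistent
        self.shortcut = nn.Conv2d(c_in, c_out, 1, stride=(2, 2))

    def forward(self, x):
        return self.body(x) + self.shortcut(x)
```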
Consider an input feature $D$ with shape $B \times C' \times F \times T$, where $B$ is the batch size and $C' = 3C$ represents the number of channels of the combined feature. Average pooling is first applied to aggregate the global statistics of each channel, generating an intermediate tensor of shape $B \times C'$ by reshape. The aggregated features are then input into an MLP module with two fully connected layers to capture the nonlinear inter-channel relationships and generate attention weights $\omega \in \mathbb{R}^{B \times C'}$ for all channels. Finally, the attention weights are multiplied element-wise with the original input feature $D$ to recalibrate the corresponding channels. In this way, greater weights are assigned to the features that dominate recognition, helping to suppress interference noise and improve the recognition accuracy and robustness of the model.
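A hedged sketch of the adaptive channel attention block (the reduction ratio r and the sigmoid gating are assumptions; the text only specifies average pooling followed by a two-layer MLP producing per-channel weights):

```python
import torch
import torch.nn as nn

class AdaptiveChannelAttention(nn.Module):
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, d: torch.Tensor) -> torch.Tensor:  # d: (B, C', F, T)
        b, c, _, _ = d.shape
        w = self.mlp(d.mean(dim=(2, 3)))  # global average pool -> (B, C')
        return d * w.view(b, c, 1, 1)     # element-wise recalibration
```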
On the one hand, this embodiment provides two multidimensional feature extraction strategies to enrich the feature representation: a band-pass filter bank is applied to the original underwater acoustic signal to extract a multi-view data representation focusing on different frequency components, and a multi-scale convolution strategy is then applied to the extracted spectral features to further learn spectral-spatial discrimination features of different scales. On the other hand, a channel attention mechanism is added after the multi-scale convolution layers to select and highlight the dominant parts of the global features and suppress interference noise. Finally, the proposed attention-based multi-scale convolutional network is trained to learn the key features relevant to identifying different kinds of ship radiation signals.
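Tying the pieces together, a sketch of one AMSC block and the overall network, reusing the ResidualConvBlock and AdaptiveChannelAttention sketches above. Note that the text describes the attention input as the C' = 3C combined map in one passage and places the 1x1 fusion before the attention in another; this sketch applies attention first and then fuses. The hidden width of the first linear layer is also an assumption:

```python
import torch
import torch.nn as nn

class AMSCBlock(nn.Module):
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.branches = nn.ModuleList(
            ResidualConvBlock(c_in, c_out, k) for k in (3, 5, 7))
        self.attn = AdaptiveChannelAttention(3 * c_out)  # C' = 3C combined map
        self.fuse = nn.Conv2d(3 * c_out, c_out, 1)       # 1x1 fusion

    def forward(self, x):
        x = torch.cat([b(x) for b in self.branches], dim=1)  # channel concat
        return self.fuse(self.attn(x))

class AMSCNet(nn.Module):
    def __init__(self, n_views: int = 5, n_classes: int = 5):
        super().__init__()
        blocks, c_in = [], n_views
        for c in (16, 32, 64, 128):      # per-block channel counts from the text
            blocks.append(AMSCBlock(c_in, c)); c_in = c
        self.net = nn.Sequential(
            nn.BatchNorm2d(n_views), *blocks,
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 64), nn.ReLU(inplace=True),  # hidden width assumed
            nn.Linear(64, n_classes))

    def forward(self, x):  # x: (B, C=5, F, T) mel features
        return self.net(x)
```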
Third embodiment
Referring to fig. 6 and 7, in one embodiment of the present application, a set of experimental schemes based on the second embodiment is provided, specifically as follows:
1. setting experimental data;
the experiment was evaluated on two published underwater acoustic datasets: shipear and deeppclip.
ShipsEar: the dataset contains ship-radiated sound data and pure background noise data, comprising 90 pieces of audio from 11 ships together with various natural environmental noises. In order to address the problem of data imbalance between the different types of vessels, all records are re-classified into five categories according to the size of the vessel.
DeepShip: the dataset contains 47 hours and 4 minutes of recordings of real underwater targets from 256 different vessels in four classes. The sampling rate of the recordings was 3.2kHz.
For each dataset, all acoustic signals were resampled to 16 kHz and clipped into 3-second segments. The segmented audio clips were randomly divided into training, validation, and test sets, accounting for 70%, 15%, and 15%, respectively.
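A short sketch of this preprocessing, assuming torchaudio (all helper names are illustrative):

```python
import torch
import torchaudio

def load_segments(path: str, seg_sec: int = 3, target_sr: int = 16000):
    """Resample one recording to 16 kHz and cut it into 3-second segments."""
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, target_sr)
    seg = seg_sec * target_sr
    return [wav[..., i * seg:(i + 1) * seg] for i in range(wav.shape[-1] // seg)]

def split(items, fracs=(0.7, 0.15, 0.15), seed: int = 0):
    """Random 70/15/15 split into training, validation, and test sets."""
    idx = torch.randperm(len(items),
                         generator=torch.Generator().manual_seed(seed)).tolist()
    a = int(fracs[0] * len(items)); b = a + int(fracs[1] * len(items))
    return ([items[i] for i in idx[:a]],
            [items[i] for i in idx[a:b]],
            [items[i] for i in idx[b:]])
```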
2. Setting an experimental environment;
for input features, a window length of 512 samples (32 milliseconds) and a frame shift of 256 samples (16 milliseconds) are used to extract mel-frequency spectral features from time-domain audio. This process uses a total of 80 mel filters. The method is characterized in that a Pytorch is used for realizing a multi-scale convolution network model (hereinafter referred to as the network model) based on an attention mechanism, and an Adam optimizer is used for optimization. All models were trained with 100 epochs using NVIDIA GeForce RTX3090 GPU and Core i9-12900KF CPU, with an initial learning rate of 2e-4 and a batch size of 32. The cosineAnneanlingLR optimization strategy was used to adjust the learning rate training process during the training process, with the minimum learning rate set to 1e-7. For the data enhancement method, the time mask is set to 15 and the frequency mask is set to 25. The model is trained with minimal cross entropy loss. The loss function is shown below:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=0}^{N_c-1} y_{ic}\,\log(p_{ic})$$

wherein $N$ and $N_c$ represent the number of samples and the number of categories, respectively, with $i = 1, \dots, N$ and $c = 0, \dots, N_c - 1$; $y_{ic} \in \{0, 1\}$ is an indicator function that equals 1 if the true class of sample $i$ is $c$ and 0 otherwise; and $p_{ic}$ is the predicted probability that sample $i$ belongs to category $c$.
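A hedged sketch of this training setup in PyTorch, reusing the AMSCNet sketch above (the dummy batch stands in for a real DataLoader; nn.CrossEntropyLoss implements the formula above, with p_ic produced by an internal softmax):

```python
import torch
import torch.nn as nn

# Dummy batch in place of a real DataLoader: 5 views, 80 mel bins, 192 frames.
train_loader = [(torch.randn(32, 5, 80, 192), torch.randint(0, 5, (32,)))]

model = AMSCNet(n_views=5, n_classes=5)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-7)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)  # cross-entropy over N_c classes
        loss.backward()
        optimizer.step()
    scheduler.step()  # cosine annealing of the learning rate
```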
3. Experimental evaluation indexes;
to evaluate the proposed classification performance, accuracy, precision, recall, and F1 score were reported as objective indicators. Higher values correspond to better performance. The classification accuracy may be defined as follows:
for each category c (c e {0,., N.) c -1 }) precision, recall, and F1 score are calculated by the following equations:
here, n ij (wherein i=0,.. c -1 and j=0,.. c -1) represents the number of samples in class i predicted as class j. To obtain overall accuracy, recall and F1 score values, these indices are at all N c The class is averaged.
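These quantities can be computed directly from the confusion matrix; a small sketch (macro-averaging over the classes, as described):

```python
import numpy as np

def metrics(n: np.ndarray):
    """n[i, j] = number of samples of true class i predicted as class j."""
    accuracy = np.trace(n) / n.sum()
    precision = np.diag(n) / n.sum(axis=0).clip(min=1)  # n_cc / sum_i n_ic
    recall = np.diag(n) / n.sum(axis=1).clip(min=1)     # n_cc / sum_j n_cj
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    return accuracy, precision.mean(), recall.mean(), f1.mean()
```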
4. Experimental results;
(1) A multi-view representation;
it is experimentally demonstrated herein that the multi-view representation strategy of the present application is effective for enhancing the performance of the UATR system. In the feature extraction stage, the original ship radiation signal is first decomposed into five different data representations by a set of bandpass filters with non-overlapping frequency bands. These multiple view representations cover the frequency ranges of 0-500Hz, 500-1000Hz, 10002000Hz, 2000-4000Hz, and 4000-8000 Hz. To study the contributions of different view features in identifying various types of vessels and to demonstrate the superiority of the multi-view representation strategy, sub-band features, full-band features and multi-view features were evaluated separately. The experimental results on the shipear dataset are shown in the following table.
TABLE 1
For the single-view representations, the low-frequency features (0-500 Hz) are not only superior to the other features in overall accuracy (92.66%), but also achieve the best performance in identifying all classes of vessels. The results indicate that the identifiable information of ship-radiated signals tends to be concentrated in the low-frequency range. Furthermore, by comparing the experimental results of the different view representations, it can be observed that different frequency-band features exhibit particular advantages in identifying different types of vessels. For example, the features with the highest frequencies (4000 to 8000 Hz) are more effective for identifying class D, the features with frequencies between 2000 and 4000 Hz identify class C more accurately, and the features in the 1000-2000 Hz range have a relative advantage in identifying background noise data (class E). Moreover, although the full-band features have a higher overall accuracy (95.87%) than the sub-band features, the lowest-frequency features (0 to 500 Hz) remain more advantageous for identifying class D. This suggests that while the model may obtain global information from the wideband features, it may ignore some fine-grained components of the features. To address this limitation, the present application integrates features from different frequency bands. The experimental results indicate that the multi-view representation strategy yields competitive recognition performance and achieves the best accuracy in all categories.
(2) Evaluating a feature extraction method;
the filtered multi-view representation requires further conversion to time-spectral features before being input into the network. To verify the effect of different time-spectral features on model performance, a set of experiments were performed here comparing mel-spectrum, STFT and MFCC features. For the shipear dataset, the characteristic dimensions of mel-spectrum, STFT, and MFCC are (80,192), (257,192), and (13,192), respectively, where the first value represents the resolution of the frequency dimension and the second value represents the number of time ranges. The accuracy, precision, recall, and F1 score values of the different feature extraction methods trained on the shipear dataset are shown in fig. 6.
As can be seen from fig. 6, when the mel spectrum and the STFT spectrum are used as the input features of the present network, the evaluation results of the model on all indexes are significantly better than those using the MFCC features. Specifically, the mel-spectrum features with the present network obtain the best results on the ShipsEar dataset: an accuracy of 98.17%, a precision of 98.5%, a recall of 97.78%, and an F1 score of 98.14%.
(3) Ablation experiments;
to evaluate the effectiveness of the various components in the present network are presented. Two sets of design comparison experiments were performed on the shipear and deepphip datasets to evaluate the performance of the different techniques. The present network is taken as a baseline model. Experiments were then performed to determine the contribution of these methods to the model performance improvement by removing the data enhancement mechanism and adaptive channel attention block from the present network, respectively. Ablation study results for the shipear and deepphip datasets are shown in the following table.
TABLE 2
Without any data enhancement strategy, the accuracy of the network on the ShipsEar dataset decreases by 0.62%, the precision by 1.34%, the recall by 0.61%, and the F1 score by 0.98%. On the DeepShip dataset, a 1.38% decrease in the average performance across all evaluation metrics can be observed. The data enhancement strategies thus prove effective in improving recognition performance.
The channel attention module was further removed from the present network model to investigate its effect on model performance. Compared with the baseline, the accuracy, precision, recall, and F1 score of the simplified network model on the ShipsEar dataset drop notably, by 2.76%, 3.47%, 3.16%, and 3.32%, respectively. On the DeepShip dataset, the adaptive channel attention block improves the various indicators by 2.08% on average. These observations confirm that the adaptive channel attention mechanism proposed in the present network model is crucial for improving recognition capability.
(4) Comparing with a reference model;
the results were analyzed to show the advantage of the present network over the representative DNN-based approach in UATR. The comparative models were:
EfficientNet-b0: a simple but efficient convolutional neural network (CNN) model that has demonstrated excellent performance in a variety of computer vision tasks;
CRNN: a hybrid architecture that combines CNNs for local feature extraction with an LSTM for capturing relevant features;
MbNet-V2: a streamlined architecture that uses depthwise convolutions for object detection;
UATR-Transformer: a Transformer-based underwater acoustic target signal recognition network.
for fair comparison, all models described above were modified, taking one-dimensional Mel spectral features as input, and training from scratch on the same shipear and deeppclip datasets.
TABLE 3
The table above demonstrates that the present network achieves better evaluation scores on both datasets than all the other techniques, yielding recognition accuracies of 98.2% and 98.4%, respectively. This further demonstrates the strong generalization ability of the present network.
(5) Evaluating on different noise level data;
to demonstrate the robustness performance of the present network. The trained present network was reevaluated using different levels of gaussian noise on the shipear and deep ship datasets. Noise level is measured by signal-to-noise ratio (SNR). The experiment takes the original acoustic signals from both data sets as clean signals. For evaluation purposes, the signal and Gaussian noise are combined to synthesize noise samples at seven different noise levels (-15 dB, -10dB, -5dB, 0dB, 5dB, 10dB, and 15 dB).
For the ShipsEar dataset, the present network achieves an overall accuracy of 95.87% across all categories at a signal-to-noise ratio of 0 dB. Furthermore, it can be observed that the classification accuracy generally increases gradually with increasing signal-to-noise ratio, most noticeably at low signal-to-noise ratios (from -15 dB to 0 dB). In addition, the network always obtains the highest recognition rate on class E in all evaluation scenarios, owing to the distinguishable spectral characteristics between marine environmental noise and ship-radiated audio. The sample size of the DeepShip dataset is relatively more abundant than that of the ShipsEar dataset, with more pronounced regularity over the test sets at different noise levels. Recognition accuracy is observed to be affected by class sample size and strong noise interference. In particular, the total numbers of class A and B samples and of class B and D samples are relatively small compared with the other target classes; at low signal-to-noise ratios (SNR of 0 dB or below), the identification accuracy of these categories drops dramatically as the signal-to-noise ratio decreases.
In one embodiment of the present application, a ship target recognition system based on underwater acoustic signals is provided, and the system includes a feature extraction unit 1100 and a target recognition unit 1200, specifically as follows:
The feature extraction unit 1100 is configured to extract a multi-view time domain signal representation of an underwater acoustic signal of a ship and convert the multi-view time domain signal representation into a time-spectrum feature.
The target recognition unit 1200 is configured to input the time-frequency spectrum features into the first of a plurality of multi-scale convolution blocks connected in series to obtain the final features output by the last multi-scale convolution block, and to input the final features into an average pooling layer and a full connection layer to obtain the ship target recognition result output by the full connection layer. Any two multi-scale convolution blocks have the same structure. Any one multi-scale convolution block inputs, in parallel, the time-frequency spectrum features or the features output by the preceding multi-scale convolution block in the series into a group of residual convolution blocks with different kernels to obtain the multi-scale feature map output by each residual convolution block, combines the multi-scale feature maps of the residual convolution blocks to obtain a combined feature map, and inputs the combined feature map into an adaptive channel attention block to obtain the intermediate features or final features output by the adaptive channel attention block; the intermediate features serve as the input features of the next multi-scale convolution block in the series.
The system embodiment and the method embodiment are based on the same inventive concept, so the content described for the method embodiment also applies to the system embodiment and is not repeated here.
Referring to fig. 7, an embodiment of the present application further provides an electronic device, where the electronic device includes:
at least one memory;
at least one processor;
at least one program;
the program is stored in the memory, and the processor executes the at least one program to implement the above ship target identification method based on underwater acoustic signals.
The electronic device may be any intelligent terminal, including a mobile phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA), a vehicle-mounted computer, and the like.
The electronic device according to the embodiment of the present application is described in detail below.
A processor, which may be implemented by a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, for executing relevant programs to implement the technical solutions provided by the embodiments of the present disclosure;
the memory may be implemented in the form of read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or random access memory (Random Access Memory, RAM). The memory may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present disclosure are implemented in software or firmware, the relevant program code is stored in the memory, and the processor invokes it to perform the ship target identification method based on underwater acoustic signals of the embodiments of the present disclosure.
The input/output interface is used for realizing information input and output;
the communication interface is used for realizing communication interaction between this device and other devices, and communication may be realized in a wired manner (such as USB or a network cable) or in a wireless manner (such as a mobile network, Wi-Fi, or Bluetooth);
a bus that transfers information between the various components of the device (e.g., processor, memory, input/output interfaces, and communication interfaces);
wherein the processor, the memory, the input/output interface and the communication interface are communicatively coupled to each other within the device via a bus.
The embodiments of the present disclosure also provide a storage medium, which is a computer-readable storage medium storing computer-executable instructions for causing a computer to execute the above ship target identification method based on underwater acoustic signals.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described in the embodiments of the present disclosure are for more clearly describing the technical solutions of the embodiments of the present disclosure, and do not constitute a limitation on the technical solutions provided by the embodiments of the present disclosure, and as those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present disclosure are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not limit the embodiments of the present disclosure, and may include more or fewer steps than shown, or may combine certain steps, or different steps.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A is present, only B is present, or both A and B are present, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of" and similar expressions refer to any combination of the listed items, including any combination of single or plural items. For example, "at least one of a, b, or c" may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including multiple instructions for causing an electronic device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or another medium that can store program code.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of one of ordinary skill in the art without departing from the spirit of the present invention.

Claims (10)

1. The ship target identification method based on the underwater acoustic signal is characterized by comprising the following steps of:
Extracting a multi-view time domain signal representation of an underwater acoustic signal of a ship, and converting the multi-view time domain signal representation into time-frequency spectrum features;
inputting the time-frequency spectrum features into the first of a plurality of serially connected multi-scale convolution blocks to obtain the final features output by the last multi-scale convolution block, and inputting the final features into an average pooling layer and a fully connected layer to obtain a ship target recognition result output by the fully connected layer; wherein any two multi-scale convolution blocks have the same structure, and each multi-scale convolution block inputs the time-frequency spectrum features, or the features output by the preceding multi-scale convolution block in the series, in parallel to a group of residual convolution blocks with different kernels to obtain the multi-scale feature map output by each residual convolution block, merges the multi-scale feature maps of the residual convolution blocks into a combined feature map, and inputs the combined feature map into an adaptive channel attention block to obtain the intermediate features or the final features output by the adaptive channel attention block, the intermediate features serving as the input of the next multi-scale convolution block in the series.
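By way of illustration, and building on the MultiScaleConvBlock sketch given earlier in the description, the full pipeline of claim 1 (serially connected multi-scale blocks, an average pooling layer, and a fully connected layer) can be assembled as follows. The number of blocks and the per-branch channel widths are assumptions of this sketch, not features of the claim.

import torch.nn as nn

class ShipTargetNet(nn.Module):
    def __init__(self, num_classes, in_ch=1, branch_widths=(16, 32, 64)):
        super().__init__()
        blocks, ch = [], in_ch
        for w in branch_widths:
            blocks.append(MultiScaleConvBlock(ch, w))
            ch = 3 * w  # three parallel branches are concatenated
        self.blocks = nn.Sequential(*blocks)   # serially connected multi-scale blocks
        self.pool = nn.AdaptiveAvgPool2d(1)    # average pooling layer
        self.fc = nn.Linear(ch, num_classes)   # fully connected layer

    def forward(self, x):
        final_features = self.blocks(x)        # output of the last multi-scale block
        return self.fc(self.pool(final_features).flatten(1))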
2. The ship target identification method based on the underwater acoustic signal according to claim 1, wherein any one of the residual convolution blocks comprises a plurality of serially connected two-dimensional convolution layers, with a 1x1 convolution layer providing the channel connection across them; the convolution kernels of any two of the two-dimensional convolution layers are the same, and each two-dimensional convolution layer is followed by batch normalization and a rectified linear unit. The residual convolution block inputs the time-frequency spectrum features, or the features output by the preceding multi-scale convolution block in the series, into the first of the plurality of serially connected two-dimensional convolution layers, to obtain the multi-scale feature map output by the last two-dimensional convolution layer.
3. The ship target identification method based on the underwater acoustic signal according to claim 2, wherein the adaptive channel attention block aggregates the global statistical features of each channel of the combined feature map by average pooling, reshapes the result into an intermediate tensor, inputs the intermediate tensor to an MLP module with two fully connected layers to capture nonlinear inter-channel relationships and generate attention weights for all channels, multiplies the attention weights element-wise with the combined feature map, and outputs the intermediate features or the final features.
4. The ship target identification method based on the underwater acoustic signal according to claim 1, wherein the multi-scale convolution block comprises three parallel residual convolution blocks whose convolution kernel sizes are 3, 5, and 7, respectively.
5. The method of claim 1, wherein extracting a multi-view time domain signal representation of the underwater acoustic signal and converting the multi-view time domain signal representation to a time-frequency spectral feature comprises:
extracting a multi-view time domain signal representation of the underwater acoustic signal using a set of band-pass filters, wherein the frequency bands of the respective band-pass filters do not overlap;
converting the multi-view time domain signal representation into the time-frequency spectrum features using a mel filter.
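For illustration, the front end of claim 5 can be sketched as follows. The Butterworth filters, the specific band edges, the filter order, the sampling rate, and the use of SciPy and librosa are all assumptions of this sketch; the claim only requires a set of band-pass filters with non-overlapping bands followed by a mel filter.

import numpy as np
from scipy.signal import butter, sosfiltfilt
import librosa

def multi_view(signal, sr, bands=((50, 1000), (1000, 4000), (4000, 8000))):
    """One time-domain view per non-overlapping frequency band.
    Assumes the sampling rate sr exceeds twice the highest band edge."""
    views = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        views.append(sosfiltfilt(sos, signal))
    return views

def to_mel(view, sr):
    """Mel time-frequency spectrum (log scale) for one view."""
    mel = librosa.feature.melspectrogram(y=view, sr=sr, n_fft=1024,
                                         hop_length=512, n_mels=64)
    return librosa.power_to_db(mel)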
6. The method of claim 5, wherein prior to extracting the multi-view time domain signal representation of the underwater acoustic signal using the set of bandpass filters, the method further comprises:
gaussian noise is added to the underwater acoustic signal.
7. The method of claim 5, further comprising, after converting the multi-view time domain signal representation into the time-frequency spectrum features using the mel filter:
performing data enhancement on the time-frequency spectrum features by using the SpecAugment method.
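For illustration, a SpecAugment-style enhancement of the mel time-frequency features can be sketched with torchaudio's standard masking transforms. The choice of torchaudio and the mask sizes are assumptions of this sketch; the claim only names the SpecAugment method.

import torch
import torchaudio.transforms as T

augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=8),  # mask up to 8 consecutive mel bins
    T.TimeMasking(time_mask_param=16),      # mask up to 16 consecutive frames
)

spec = torch.randn(1, 64, 200)  # (channel, mel bins, frames), an example input
augmented = augment(spec)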
8. The ship target recognition system based on the underwater acoustic signal is characterized by comprising:
the feature extraction unit is used for extracting a multi-view time domain signal representation of the underwater acoustic signal of a ship and converting the multi-view time domain signal representation into time-frequency spectrum features;
the target recognition unit is used for inputting the time-frequency spectrum features into the first of a plurality of serially connected multi-scale convolution blocks to obtain the final features output by the last multi-scale convolution block, and inputting the final features into an average pooling layer and a fully connected layer to obtain a ship target recognition result output by the fully connected layer; wherein any two multi-scale convolution blocks have the same structure, and each multi-scale convolution block inputs the time-frequency spectrum features, or the features output by the preceding multi-scale convolution block in the series, in parallel to a group of residual convolution blocks with different kernels to obtain the multi-scale feature map output by each residual convolution block, merges the multi-scale feature maps of the residual convolution blocks into a combined feature map, and inputs the combined feature map into an adaptive channel attention block to obtain the intermediate features or the final features output by the adaptive channel attention block, the intermediate features serving as the input of the next multi-scale convolution block in the series.
9. An electronic device, characterized by comprising: at least one control processor and a memory communicatively connected to the at least one control processor; the memory stores instructions executable by the at least one control processor, the instructions being executed by the at least one control processor to enable the at least one control processor to perform the ship target identification method based on the underwater acoustic signal according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that: the computer-readable storage medium stores computer-executable instructions for causing a computer to perform the ship target identification method based on the underwater acoustic signal according to any one of claims 1 to 7.
CN202311192186.0A 2023-09-15 2023-09-15 Ship target identification method and system based on underwater acoustic signals Pending CN117454240A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311192186.0A CN117454240A (en) 2023-09-15 2023-09-15 Ship target identification method and system based on underwater acoustic signals

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311192186.0A CN117454240A (en) 2023-09-15 2023-09-15 Ship target identification method and system based on underwater acoustic signals

Publications (1)

Publication Number Publication Date
CN117454240A true CN117454240A (en) 2024-01-26

Family

ID=89589950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311192186.0A Pending CN117454240A (en) 2023-09-15 2023-09-15 Ship target identification method and system based on underwater acoustic signals

Country Status (1)

Country Link
CN (1) CN117454240A (en)

Similar Documents

Publication Publication Date Title
Lee et al. Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms
KR20190110939A (en) Environment sound recognition method based on convolutional neural networks, and system thereof
CN106682574A (en) One-dimensional deep convolution network underwater multi-target recognition method
Wang et al. ia-PNCC: Noise Processing Method for Underwater Target Recognition Convolutional Neural Network.
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN111261189B (en) Vehicle sound signal feature extraction method
CN108597505A (en) Audio recognition method, device and terminal device
CN112259120A (en) Single-channel human voice and background voice separation method based on convolution cyclic neural network
Wei et al. A method of underwater acoustic signal classification based on deep neural network
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN112735466B (en) Audio detection method and device
CN114692687A (en) Underwater sound signal identification method and training method of underwater sound signal identification model
CN113990303A (en) Environmental sound identification method based on multi-resolution cavity depth separable convolution network
Wang et al. Low pass filtering and bandwidth extension for robust anti-spoofing countermeasure against codec variabilities
Zhou et al. An attention-based multi-scale convolution network for intelligent underwater acoustic signal recognition
CN117454240A (en) Ship target identification method and system based on underwater acoustic signals
CN110808067A (en) Low signal-to-noise ratio sound event detection method based on binary multiband energy distribution
Lopatka et al. An attractive alternative for sperm whale click detection using the wavelet transform in comparison to the Fourier spectrogram
CN116386589A (en) Deep learning voice reconstruction method based on smart phone acceleration sensor
JP5772957B2 (en) Sound processing apparatus, sound processing system, video processing system, control method, and control program
Wang et al. Underwater Acoustic Target Recognition Combining Multi-scale Features and Attention Mechanism
CN117648851A (en) Sonar data simulation generation method based on generation countermeasure network
CN117854540B (en) Underwater sound target identification method and system based on neural network and multidimensional feature fusion
CN116405127B (en) Compression method and device of underwater acoustic communication preamble signal detection model
CN117591917A (en) Underwater target radiation noise identification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination