CN115238783A - Underwater sound target positioning method based on self-supervision learning - Google Patents

Underwater sound target positioning method based on self-supervision learning

Info

Publication number
CN115238783A
Authority
CN
China
Prior art keywords
module
self
training
sub
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210841975.1A
Other languages
Chinese (zh)
Inventor
毕然
姜龙玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202210841975.1A priority Critical patent/CN115238783A/en
Publication of CN115238783A publication Critical patent/CN115238783A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 17/18: Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/088: Non-supervised learning, e.g. competitive learning

Abstract

The invention discloses an underwater acoustic target positioning method based on self-supervised learning. Collected raw underwater acoustic data are preprocessed; the data set is divided into labeled and unlabeled data according to whether a position label is available, and the labeled data are divided into mutually independent training and test sets. The sampling covariance matrix of each unlabeled sample is randomly corrupted, and a self-supervision module based on the Transformer model is trained with the objective of reconstructing the corrupted part; the self-supervision module is an auto-encoder structure based on the Transformer model whose function is to reconstruct the corrupted sampling covariance matrix. After the auto-encoder training is finished, the encoder parameters are used as initialization parameters of the locator model, and the parameters of the locator module are fine-tuned with the training set of the labeled data. The model performance and generalization ability are thereby markedly improved on the related task, and the number of labeled samples required for training is reduced.

Description

Underwater sound target positioning method based on self-supervision learning
Technical Field
The invention belongs to the cross-disciplinary technical field of digital signal processing and oceanography, and particularly relates to an underwater acoustic target positioning method based on self-supervised learning.
Background
The propagation distance of sound waves in water is long and the transmission loss is small, so acoustic signals can be used to identify underwater targets, locate sound sources, and carry out underwater communication. Sonar systems are therefore applied in many areas; in the civilian domain, for example, they facilitate the tracking of marine life and the exploration of marine resources.
Passive sonar positioning uses a sensor array to receive the sonar signal emitted by a target sound source, and the received signal is processed and analyzed to obtain the position of the source. Traditional sound-source positioning methods rely on mathematical modeling: the physical environment is modeled, and the most likely target position is found by searching the target area. Such methods depend heavily on physical modeling of the real environment, generalize poorly, and the search procedure has low precision and efficiency. With the rapid development of deep learning, locating underwater sound sources with deep neural networks has achieved good results; compared with traditional methods, deep learning methods have lower estimation error and stronger generalization ability.
In underwater sound-source positioning it is often difficult to obtain enough labeled training samples, and insufficient training samples cause deep learning models to overfit: the model does not learn sufficiently general feature representations, so when it is applied to a real scene it cannot match its performance on the training samples. How to improve the generalization ability of deep learning models on small data sets and reduce the impact of overfitting has therefore become a bottleneck for related research.
Disclosure of Invention
Aiming at the problems in the prior art that models do not learn sufficiently general feature representations and perform poorly in practical applications, the invention provides an underwater acoustic target positioning method based on self-supervised learning. The collected raw underwater acoustic data are preprocessed; the data set is divided into labeled and unlabeled data according to whether a position label is available, and the labeled data are divided into mutually independent training and test sets. The sampling covariance matrix of each unlabeled sample is randomly corrupted, and a self-supervision module based on the Transformer model is trained with the objective of reconstructing the corrupted part; the self-supervision module is an auto-encoder structure based on the Transformer model and serves to reconstruct the corrupted sampling covariance matrix. After the auto-encoder training is finished, the encoder parameters are used as initialization parameters of the locator model, and finally the parameters of the locator module are fine-tuned with the training set of the labeled data. The model performance and generalization ability are thereby markedly improved on the related task, and the number of labeled samples required for training is reduced.
In order to achieve the purpose, the invention adopts the technical scheme that: an underwater sound target positioning method based on self-supervision learning comprises the following steps:
S1: collect raw underwater acoustic signal data and preprocess them; divide the data set into labeled and unlabeled data according to whether a position label is available, and divide the labeled data into mutually independent training and test sets; all data samples are represented by a standard-normalized sampling covariance matrix;
S2: randomly corrupt the sampling covariance matrix of each unlabeled data sample and train a self-supervision module based on the Transformer model with the objective of reconstructing the corrupted part, wherein the self-supervision module is an auto-encoder structure based on the Transformer model and serves to reconstruct the corrupted sampling covariance matrix, and the parameters of its encoder are used as initialization parameters of the locator model in the subsequent step S3; the goal of the model is to minimize the mean square error between the reconstructed input and the masked portion of the original input:

$$w^{*}=\underset{w}{\arg\min}\ \frac{1}{N}\sum_{i=1}^{N}\left\|f_{w}(x_{i})_{m}-x_{i,m}\right\|^{2}$$

where $w$ denotes the parameters of the self-supervision module $f$, $w^{*}$ the final parameters of the model, $x_{m}$ the masked part of the sampling covariance matrix, and $N$ the number of samples;
S3: after the auto-encoder training is finished, initialize the parameters of the locator module with the encoder parameters; the goal of the locator module is to minimize the error between the true and predicted values, i.e., to minimize the following two objective functions respectively:

$$L_{r}=\frac{1}{N}\sum_{i=1}^{N}\left(\hat{r}_{i}-r_{i}\right)^{2}$$

$$L_{d}=\frac{1}{N}\sum_{i=1}^{N}\left(\hat{d}_{i}-d_{i}\right)^{2}$$

where $L_{r}$ and $L_{d}$ are the distance and depth loss functions respectively, $r_{i}$ and $d_{i}$ are the true distance and depth of sample $i$, and $\hat{r}_{i}$ and $\hat{d}_{i}$ are the distance and depth predicted by the model;
S4: fine-tune the parameters of the locator module using the training set within the labeled data set of step S1.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention pre-trains the model by self-supervised learning. This pre-training learns feature representations with strong generalization ability without requiring labeled samples. Samples are randomly corrupted during self-supervised training, which makes the training task harder and enables the model to learn deeper feature representations.
(2) Compared with training without pre-training, training the downstream task in this way significantly reduces the positioning error of the model and improves the positioning accuracy. During training, reaching the same performance as direct training takes only half the training time, and after the model converges the positioning error is clearly smaller than that of direct training. Furthermore, under environmental mismatch (sensor array offset of 0.6°), the error of the pre-trained model is also significantly smaller than that of direct training.
(3) When the training set is large enough, the results with the pre-training module are only slightly better than those of direct training; but as the training set shrinks, performance without the pre-training module degrades drastically, while performance with the pre-training module degrades only within an acceptable range. Experiments show that the pre-training module saves about half of the labeled training samples.
Drawings
FIG. 1 is a schematic overall framework diagram of the process of the present invention;
FIG. 2 is a flow chart of the method training of the present invention;
FIG. 3 is a schematic diagram of a model structure of the self-supervision module according to the present invention;
FIG. 4 is a schematic view of a model structure of a positioner module according to the present invention;
FIG. 5 is a graph showing the variation trend of the accuracy of a test set with the increase of the number of iterations in two training strategies according to the test example of the present invention;
FIG. 6 is a diagram illustrating the positioning result of two training strategies for sound sources under ideal environment in the testing example of the present invention;
the present invention will be further illustrated with reference to the accompanying drawings and specific embodiments, which are to be understood as illustrative only and not limiting of the scope of the invention.
Detailed Description
Example 1
An underwater acoustic target positioning method based on self-supervised learning. FIG. 1 is the overall framework diagram of the method, in which ellipses represent data or outputs and rectangles represent frameworks or modules; FIG. 2 is the training flow chart of the framework, namely: first, the unlabeled and labeled data are preprocessed by the data preprocessing module; after data processing is finished, the self-supervision module is trained with the unlabeled data; the locator module is then initialized with the parameters of the self-supervision module, and finally the locator module is fine-tuned with the labeled data.
An underwater sound target positioning method based on self-supervision learning specifically comprises the following steps:
S1: collect raw underwater acoustic signal data and preprocess them; divide the data set into labeled and unlabeled data according to whether a position label is available, and divide the labeled data into mutually independent training and test sets;
the pretreatment operation is divided into a total of 3 steps: and (4) data normalization, calculating a sampling covariance matrix, and performing standard normalization. All samples described in this example are represented by a standard normalized sampling covariance matrix.
To make the processing independent of the complex source spectrum, the received array sound-field data are converted to a normalized sampling covariance matrix. Taking the discrete Fourier transform of the input data of the L receivers in the hydrophone array, the sound-field data at frequency f can be written as $p(f)=[p_{1}(f),\dots,p_{L}(f)]^{T}$. The sound field is modeled as:
p(f)=S(f)g(f,r)+ε
where ε is the noise, S(f) is the source term, and g is the Green's function. To reduce the effect of the sound-field amplitude |S(f)|, the complex sound field is normalized:
$$\tilde{p}(f)=\frac{p(f)}{\sqrt{\sum_{l=1}^{L}\left|p_{l}(f)\right|^{2}}}=\frac{p(f)}{\left\|p(f)\right\|_{2}}$$
then, calculating a sampling covariance matrix:
$$C(f)=\frac{1}{N_{s}}\sum_{s=1}^{N_{s}}\tilde{p}_{s}(f)\,\tilde{p}_{s}^{H}(f)$$

where $H$ denotes the conjugate transpose operation, so that $C(f)=C^{H}(f)$, and $N_{s}$ is the number of snapshots. After the sampling covariance computation is complete, standard normalization is performed on all samples.
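The sample-side preprocessing just described (snapshot normalization, sampling covariance, standard normalization) can be sketched as follows. This is a minimal illustration, not the patented implementation: the array size L = 20 matches the 2 × 20 × 20 sample shape used later, the snapshot count and all function names are assumptions, and random data stands in for real hydrophone recordings.

```python
import numpy as np

def sample_covariance(p_snapshots):
    """Normalized sampling covariance matrix from complex sound-field snapshots.

    p_snapshots: complex array of shape (N_s, L): N_s snapshots received by an
    L-element hydrophone array at one frequency bin.
    """
    # Normalize each snapshot to unit l2 norm to remove the source-amplitude effect.
    p_tilde = p_snapshots / np.linalg.norm(p_snapshots, axis=1, keepdims=True)
    # C(f) = (1/N_s) * sum_s p~_s p~_s^H  (Hermitian by construction).
    return np.einsum('sl,sm->lm', p_tilde, p_tilde.conj()) / p_snapshots.shape[0]

rng = np.random.default_rng(0)
L, N_s = 20, 50
p = rng.normal(size=(N_s, L)) + 1j * rng.normal(size=(N_s, L))
C = sample_covariance(p)

# Stack real and imaginary parts into a 2 x 20 x 20 sample representation,
# then apply standard normalization (zero mean, unit variance) as in step S1.
x = np.stack([C.real, C.imag])
x = (x - x.mean()) / x.std()
```

Stacking the real and imaginary parts of the Hermitian matrix into two channels is one plausible reading of the 2 × 20 × 20 sample shape described below; the patent text does not spell out the channel layout.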
The labels are preprocessed with 0-1 normalization, scaling the distance and depth of the target position into [0, 1] respectively:

$$\tilde{r}=\frac{r-r_{\min}}{r_{\max}-r_{\min}}$$

$$\tilde{d}=\frac{d-d_{\min}}{d_{\max}-d_{\min}}$$
the preprocessed data is a 2 × 20 × 20 matrix. The preprocessed data are divided into three independent data sets in total, 80% of the data sets are unlabeled samples, and the rest labeled samples are divided into training sets and testing sets. All unlabeled samples are used for training the self-supervision module, the labeled training set is used for training the locator module, and finally the performance of the model is evaluated on the locator module by using the labeled test set. In this embodiment, the unlabeled dataset includes 400000 unlabeled exemplars; the number of labeled data is 100000, and is divided into a training set and a test set which are independent of each other.
S2: and randomly destroying the sampling covariance matrix of the unlabeled data samples, taking the reconstructed destroyed part as a target, and training an automatic supervision module based on a Transformer model.
The pre-trained self-supervision module is a Transformer-based auto-encoder whose input is a corrupted sampling covariance matrix; the auto-encoder is used to reconstruct the corrupted matrix. Through this corrupt-and-reconstruct scheme, the self-supervised pre-trained model can learn features with stronger generalization ability. The basic structure of the self-supervision module is shown in fig. 3; it mainly consists of four parts: Patch embedding, Position embedding, Transformer encoder, and Transformer decoder. Table 1 below gives the structure of each part of the Transformer-based auto-encoder.
TABLE 1. Structure of each part of the Transformer-based auto-encoder (the table is reproduced as an image in the original publication).
Patch embedding splits the original input into sub-blocks of the same size and then maps each sub-block into a new feature space. In this embodiment this is implemented with a two-dimensional convolution whose kernel size and stride are both set to the sub-block size, which is 4 here. Position embedding adds position information to the feature vector of each sub-block; the method uses sincos (sine-cosine) positional coding. The Transformer encoder and Transformer decoder are identical in structure and both consist of a multi-head attention mechanism and an MLP, as shown in Tables 2 and 3:
TABLE 2. Transformer encoder structure (reproduced as an image in the original publication).
TABLE 3. Transformer decoder structure (reproduced as an image in the original publication).
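Patch embedding with kernel size and stride both equal to the sub-block size is equivalent to cutting the input into non-overlapping sub-blocks and applying one shared linear projection, and the sincos positional code has a fixed closed form. A minimal numpy sketch under those assumptions, with random weights standing in for the learned projection (function names are illustrative, not from the patent):

```python
import numpy as np

def patch_embed(x, patch=4, dim=768, W=None, rng=None):
    """Split a (C, H, W) input into non-overlapping patch x patch sub-blocks and
    linearly project each flattened sub-block to `dim` features.  With kernel
    size = stride = patch, this matches a 2D convolution used as patch embedding.
    """
    c, h, w = x.shape
    gh, gw = h // patch, w // patch
    # (C, H, W) -> (num_patches, C * patch * patch): one row per sub-block.
    patches = (x.reshape(c, gh, patch, gw, patch)
                .transpose(1, 3, 0, 2, 4)
                .reshape(gh * gw, c * patch * patch))
    if W is None:  # random weights stand in for the learned projection
        W = (rng or np.random.default_rng(0)).normal(size=(c * patch * patch, dim))
    return patches @ W

def sincos_position(n, dim):
    """Fixed sine-cosine positional encoding: one row per sub-block."""
    pos = np.arange(n)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / (10000.0 ** (2 * i / dim))
    pe = np.zeros((n, dim))
    pe[:, 0::2] = np.sin(angles)  # even columns: sine
    pe[:, 1::2] = np.cos(angles)  # odd columns: cosine
    return pe

x = np.random.default_rng(1).normal(size=(2, 20, 20))   # one preprocessed sample
tokens = patch_embed(x) + sincos_position(25, 768)      # 25 sub-blocks of 2 x 4 x 4
```

With a 2 × 20 × 20 sample and sub-block size 4, this yields exactly the 25 × 768 feature matrix described in the forward-propagation steps below.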
First, the sample is divided into mutually exclusive sub-blocks of the same size, and a certain proportion of the sub-blocks is randomly selected and masked. The self-supervision module takes the visible sub-blocks as input and reconstructs the original sample. The features of the visible sub-blocks are obtained after feature extraction by the Transformer encoder. By the nature of the Transformer encoder, the first dimensions of input and output are generally the same: the first input dimension of the auto-encoder model is the number of visible sub-blocks, while the first output dimension is the number of all sub-blocks. Therefore, to reconstruct the whole original sample, the Transformer decoder cannot take only the visible sub-block features as input; additional features must be added to represent the invisible sub-blocks. These invisible sub-block features are randomly generated, shared among the invisible sub-blocks, and not updated. After the invisible sub-block features are added, the Transformer decoder decodes the features back into the original sample.
In this embodiment, the width of the encoder is 768 and the width of the decoder is 512. Attention in both the encoder and the decoder is eight-head attention. The forward propagation of the input data through the module is as follows:
1. The initial sample has size 2 × 20 × 20 and is divided into 25 sub-blocks of the same size, 2 × 4 × 4;
2. After the embedding layer, each 2 × 4 × 4 sub-block is represented as a 768-dimensional feature vector, so the whole input is mapped to a 25 × 768 feature matrix in which each row is the feature vector of one sub-block;
3. A sincos positional code with fixed weights is added to the 25 × 768 feature matrix so that each feature vector contains its original position information;
4. 40% of the feature vectors are randomly selected and masked, leaving a visible feature matrix of size 15 × 768;
5. The size of the feature matrix is unchanged after propagation through the 12-layer Transformer encoder, remaining 15 × 768;
6. A random, shared feature vector is generated for each discarded sub-block; 10 sub-blocks are masked in the invention, so 10 weight-shared feature vectors are generated in total. These weight-shared intermediate codes, together with the visible sub-block feature vectors produced by the encoder, form a new 25 × 768 feature matrix;
7. After an embedding operation, the feature matrix is reduced to a 25 × 512 two-dimensional matrix; as before, each row is the feature vector of one sub-block;
8. Sincos positional coding is added to all feature vectors again so that the feature vectors of both invisible and visible sub-blocks contain position information;
9. The dimensionality of the feature matrix is unchanged after decoding by the 8-layer Transformer decoder; it is then mapped by a fully connected layer to 25 × (2 × 4 × 4) and rearranged into a reconstructed sampling covariance matrix of the original size 2 × 20 × 20.
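The nine steps above are largely bookkeeping over tensor shapes. The following shape-only walkthrough (random matrices in place of the learned embedding, encoder, decoder, and output head; variable names are illustrative) traces the same 25 → 15 → 25 masking logic:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, enc_dim, dec_dim, mask_ratio = 25, 768, 512, 0.4

# Steps 1-3: embedded sub-block features with positions already added.
tokens = rng.normal(size=(n_patches, enc_dim))

# Step 4: mask 40% of the sub-blocks (10 of 25), keeping 15 visible.
n_masked = int(n_patches * mask_ratio)
perm = rng.permutation(n_patches)
masked_idx = np.sort(perm[:n_masked])
visible_idx = np.sort(perm[n_masked:])

# Step 5: the encoder only sees the visible sub-blocks (shape preserved).
encoded = tokens[visible_idx]                      # 15 x 768

# Step 6: one shared, random, frozen vector stands in for every masked sub-block.
mask_token = rng.normal(size=(enc_dim,))
full = np.empty((n_patches, enc_dim))
full[visible_idx] = encoded
full[masked_idx] = mask_token                      # broadcast the shared token

# Steps 7-9: project to decoder width, decode (shape preserved), then map each
# row back to a flattened 2 x 4 x 4 sub-block.  Random matrices replace the
# learned layers purely to demonstrate the shapes.
dec_in = full @ rng.normal(size=(enc_dim, dec_dim))        # 25 x 512
recon = (dec_in @ rng.normal(size=(dec_dim, 2 * 4 * 4))).reshape(n_patches, 2, 4, 4)
```

Rearranging the 25 reconstructed 2 × 4 × 4 sub-blocks onto the 5 × 5 patch grid then recovers the 2 × 20 × 20 sample.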
During training, the goal of the model is to minimize the mean square error between the reconstructed input and the masked portion of the original input:

$$w^{*}=\underset{w}{\arg\min}\ \frac{1}{N}\sum_{i=1}^{N}\left\|f_{w}(x_{i})_{m}-x_{i,m}\right\|^{2}$$

where $w$ denotes the parameters of the self-supervision module $f$, $w^{*}$ the final parameters of the model, $x_{m}$ the masked part of the sampling covariance matrix, and $N$ the number of samples.
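A small sketch of this objective: the mean squared error is computed only over the masked sub-blocks, so perfect reconstruction of the visible part contributes nothing. Function and variable names are illustrative:

```python
import numpy as np

def masked_mse(recon, target, masked_idx):
    """Mean squared error restricted to the masked sub-blocks: only the part
    of the input hidden from the encoder contributes to the objective."""
    diff = recon[masked_idx] - target[masked_idx]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
target = rng.normal(size=(25, 2, 4, 4))       # original sub-blocks
recon = target.copy()
recon[:5] += 1.0                              # imperfect reconstruction of 5 blocks
loss_masked = masked_mse(recon, target, np.arange(5))      # sees the errors
loss_rest = masked_mse(recon, target, np.arange(5, 25))    # perfect region
```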
S3: after the self-encoder training is finished, the parameters of the locator module are initialized by the encoder parameters,
the model structure of the locator module is shown in fig. 4. The self-supervision module mainly comprises two parts, wherein the first part is a Transformer encoder, and the structure of the first part is the same as that of the Transformer encoder in the self-supervision module. And after the self-monitoring module is trained, initializing by using the parameters of the self-monitoring module.
The second part is a plurality of full connection layers, and the features extracted by the encoder are mapped into distances and depths. In this embodiment, the depth and distance are mapped to between 0-1, respectively. The goal of the locator is to minimize the error of the true and predicted values, i.e., minimize the following two objective functions, respectively:
Figure BDA0003751497590000082
Figure BDA0003751497590000083
with the joint training approach, the final loss function is as follows:
L=(1-α)L r +αL d
where α balances the weights of the distance and depth terms; in this embodiment, α is set to 0.5.
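With α = 0.5 the distance and depth terms contribute equally. A minimal sketch of the joint loss (variable names and the toy values are assumptions):

```python
import numpy as np

def joint_loss(pred_r, pred_d, true_r, true_d, alpha=0.5):
    """L = (1 - alpha) * L_r + alpha * L_d, where L_r and L_d are the mean
    squared errors of the normalized distance and depth predictions."""
    L_r = float(np.mean((pred_r - true_r) ** 2))
    L_d = float(np.mean((pred_d - true_d) ** 2))
    return (1.0 - alpha) * L_r + alpha * L_d

# Two samples whose labels are already scaled into [0, 1] by 0-1 normalization.
pred_r = np.array([0.50, 0.30]); true_r = np.array([0.40, 0.30])
pred_d = np.array([0.20, 0.25]); true_d = np.array([0.20, 0.05])
loss = joint_loss(pred_r, pred_d, true_r, true_d)   # alpha = 0.5 as in the text
```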
S4: the parameters of the locator module are finely adjusted by using the training set in the labeled data set, so that the positioning error of the model is obviously reduced, and the positioning precision is improved.
Test example
In the experiments, the performance of the locator module without and with the pre-training module is compared for training sets of different sizes. The model training steps when the self-supervision module is used are:
1. train the self-supervision module with 400000 unlabeled samples;
2. initialize the encoder part of the locator module with the encoder parameters of the self-supervision module;
3. train the locator with the labeled training set.
Model training without the self-supervision module trains the locator directly on the labeled training set. FIG. 5 shows the accuracy of the model as a function of the number of training iterations for the two training strategies, with the training set accounting for 5% of the total data volume (25000 labeled samples). Under the same number of iterations, the accuracy with the pre-training module is clearly higher than without it. The run using the self-supervision module reaches the highest accuracy achieved without it in only about 50% of the training time. Moreover, after convergence, the accuracy without the pre-training module is only 61.63%, while with the pre-training module it reaches 89.14%.
Tables 5 and 6 below give the depth and distance prediction errors of the two models in the ideal environment and in the mismatched environment (sensor array offset of 0.6°).
Table 5. Errors under ideal conditions when training with 5% of the data (25000 labeled samples) (reproduced as an image in the original publication).
Table 6. Errors under mismatch conditions when training with 5% of the data (25000 labeled samples) (reproduced as an image in the original publication).
As can be seen from the tables, the errors with the pre-training module are significantly smaller than those without it. Fig. 6 and 7 show the positioning results of the model in an ideal environment and in a mismatched environment (sensor array offset of 0.6°), respectively. Without the pre-training module there is a significant error in the sound-source estimates at a distance of about 5500 m from the sensor array; this does not occur when the pre-training module is used. When the environment is mismatched, the errors of both methods increase markedly, but the positioning error with the pre-training module is clearly lower and mispredictions are clearly fewer.
The invention introduces self-supervised learning into underwater sound-source positioning. In self-supervised learning, a model is pre-trained with unlabeled samples, and the resulting parameters initialize the model for a specific downstream task; when training the downstream task, fine-tuning is performed on this basis using labeled data. Because the pre-trained model is trained on unlabeled samples without any annotation, the cost of obtaining large-scale labeled training data is greatly reduced. Meanwhile, generalizable features of a large number of samples are learned during pre-training, which effectively alleviates the overfitting problem when fine-tuning the specific task, improves the generalization ability of the model, and reduces the number of labeled samples required for training.
It should be noted that the above-mentioned contents only illustrate the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and it is obvious to those skilled in the art that several modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations fall within the protection scope of the claims of the present invention.

Claims (7)

1. An underwater sound target positioning method based on self-supervision learning is characterized by comprising the following steps:
S1: collect raw underwater acoustic signal data and preprocess them; divide the data set into labeled and unlabeled data according to whether position labels are present, and further divide the labeled data into mutually independent training and test sets; all data samples are represented by a standard-normalized sampling covariance matrix;
S2: randomly corrupt the sampling covariance matrix of each unlabeled data sample and train a self-supervision module based on the Transformer model with the objective of reconstructing the corrupted part, wherein the self-supervision module is an auto-encoder structure based on the Transformer model and serves to reconstruct the corrupted sampling covariance matrix, and the parameters of its encoder are used as initialization parameters of the locator model in the subsequent step S3; the goal of the model is to minimize the mean square error between the reconstructed input and the masked portion of the original input:

$$w^{*}=\underset{w}{\arg\min}\ \frac{1}{N}\sum_{i=1}^{N}\left\|f_{w}(x_{i})_{m}-x_{i,m}\right\|^{2}$$

where $w$ denotes the parameters of the self-supervision module $f$; $w^{*}$ the final parameters of the model; $x_{m}$ the masked part of the sampling covariance matrix; and $N$ the number of samples;
S3: after the auto-encoder training is finished, initialize the parameters of the locator module with the encoder parameters, wherein the locator module aims to minimize the error between the true and predicted values, i.e., to minimize a distance loss function and a depth loss function respectively:

$$L_{r}=\frac{1}{N}\sum_{i=1}^{N}\left(\hat{r}_{i}-r_{i}\right)^{2}$$

$$L_{d}=\frac{1}{N}\sum_{i=1}^{N}\left(\hat{d}_{i}-d_{i}\right)^{2}$$

where $L_{r}$ and $L_{d}$ are the distance and depth loss functions respectively; $r_{i}$ and $d_{i}$ are the true distance and depth of sample $i$; and $\hat{r}_{i}$ and $\hat{d}_{i}$ are the distance and depth predicted by the model;
s4: and (4) utilizing the training set in the labeled data set in the step (S1) to finely adjust the parameters of the locator module.
2. The underwater sound target positioning method based on the self-supervised learning as recited in claim 1, wherein: the preprocessing step of the step S1 comprises the following steps: transforming the time domain signal to the frequency domain by Fourier transform, respectively calculating a sampling covariance matrix of the received signal in each frequency band, and performing standard normalization on the calculated sampling covariance matrix.
3. The underwater sound target positioning method based on self-supervised learning as recited in claim 2, wherein: in step S1, 80% of the samples are unlabeled; all unlabeled samples are used to train the self-supervision module, the labeled training set is used to train the locator module, and the performance of the final model is evaluated on the locator module using the labeled test set.
4. An underwater sound target positioning method based on self-supervised learning as recited in claim 2 or 3, wherein: the step S2 is an auto-supervision module based on a Transformer model, and comprises the following steps:
the first sub-module, Patch embedding, is composed of convolution layers; it splits the original input into sub-blocks of the same size, maps each sub-block into a new feature space, and stretches each input block into a vector of the same dimension;
the second sub-module, Position embedding, has fixed parameters and adds position information to the features;
the third sub-module, a Transformer encoder, and the fourth sub-module, a Transformer decoder, have the same structure and consist of a multi-head attention mechanism and an MLP (multi-layer perceptron); the encoder maps the original input to a feature space, and the decoder decodes the features back to the original input.
5. The underwater sound target positioning method based on self-supervised learning as recited in claim 4, wherein the working principle of the sub-modules in the self-supervision module of step S2 is as follows: the data sample is divided into mutually exclusive sub-blocks of the same size; a certain proportion of the sub-blocks is randomly selected and masked; the self-supervision module takes the visible part of the sample as input, and the features of the visible sub-blocks are extracted by the third sub-module, the Transformer encoder; by the nature of the third sub-module Transformer encoder and the fourth sub-module Transformer decoder, the first dimensions of input and output are the same, so the first input dimension of the self-supervision module is the number of visible sub-blocks and the first output dimension is the number of all sub-blocks; the fourth sub-module Transformer decoder therefore needs additional features to represent the invisible sub-blocks, and through decoding by the fourth sub-module Transformer decoder the features are decoded back into the original sample.
6. The underwater sound target positioning method based on self-supervised learning as recited in claim 5, wherein the model structure of the locator module in step S3 comprises at least a Transformer encoder sub-module and a fully connected layer; the Transformer encoder sub-module has the same structure as the Transformer encoder sub-module in the self-supervision module, and the fully connected layer maps the features extracted by the Transformer encoder sub-module to distance and depth.
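A minimal sketch of the locator head of claim 6, assuming mean pooling over encoder tokens before the fully connected layer (the pooling choice and all names are illustrative and not taken from the patent):

```python
import numpy as np

def locate(features, W, b):
    # features: (N, D) token features from the Transformer encoder sub-module.
    # Average-pool over the N tokens, then apply a fully connected layer that
    # maps the pooled feature to the two regression targets: distance and depth.
    pooled = features.mean(axis=0)       # (D,)
    return pooled @ W + b                # (2,): [distance, depth]

rng = np.random.default_rng(2)
feats = rng.standard_normal((16, 32))    # 16 tokens, 32-dim encoder features
W = rng.standard_normal((32, 2)) * 0.1   # fully connected weights
b = np.zeros(2)
pred = locate(feats, W, b)
print(pred.shape)  # (2,)
```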
7. The underwater sound target positioning method based on the self-supervised learning as recited in claim 6, wherein: the method adopts a joint training mode, and the final loss function is as follows:
L = (1 - α)L_r + αL_d
where L_r is the distance loss, L_d is the depth loss, and α is a weight parameter balancing the two.
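Assuming mean-squared-error terms for both L_r and L_d (the claim does not fix the form of the individual losses), the joint loss can be written as:

```python
def joint_loss(pred_r, true_r, pred_d, true_d, alpha):
    # Weighted combination of a distance loss L_r and a depth loss L_d,
    # here both squared error: L = (1 - alpha) * L_r + alpha * L_d.
    L_r = (pred_r - true_r) ** 2
    L_d = (pred_d - true_d) ** 2
    return (1 - alpha) * L_r + alpha * L_d

print(joint_loss(5.0, 4.0, 100.0, 90.0, 0.5))  # 0.5 * 1.0 + 0.5 * 100.0 = 50.5
```

With alpha = 0 only distance is trained; with alpha = 1 only depth; intermediate values trade the two objectives off during joint training.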
CN202210841975.1A 2022-07-18 2022-07-18 Underwater sound target positioning method based on self-supervision learning Pending CN115238783A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210841975.1A CN115238783A (en) 2022-07-18 2022-07-18 Underwater sound target positioning method based on self-supervision learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210841975.1A CN115238783A (en) 2022-07-18 2022-07-18 Underwater sound target positioning method based on self-supervision learning

Publications (1)

Publication Number Publication Date
CN115238783A true CN115238783A (en) 2022-10-25

Family

ID=83672707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210841975.1A Pending CN115238783A (en) 2022-07-18 2022-07-18 Underwater sound target positioning method based on self-supervision learning

Country Status (1)

Country Link
CN (1) CN115238783A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116483036A (en) * 2023-04-25 2023-07-25 西北师范大学 Transformer-based self-encoder soft measurement modeling method
CN116483036B (en) * 2023-04-25 2023-10-03 西北师范大学 Transformer-based self-encoder soft measurement modeling method
CN116628421A (en) * 2023-05-19 2023-08-22 北京航空航天大学 IMU (inertial measurement Unit) original data denoising method based on self-supervision learning neural network model
CN116628421B (en) * 2023-05-19 2024-01-30 北京航空航天大学 IMU (inertial measurement Unit) original data denoising method based on self-supervision learning neural network model
CN117198331A (en) * 2023-11-08 2023-12-08 东南大学 Underwater target intelligent identification method and system based on logarithmic ratio adjustment
CN117198331B (en) * 2023-11-08 2024-03-15 东南大学 Underwater target intelligent identification method and system based on logarithmic ratio adjustment

Similar Documents

Publication Publication Date Title
CN115238783A (en) Underwater sound target positioning method based on self-supervision learning
Stowell Computational bioacoustics with deep learning: a review and roadmap
WO2022057305A1 (en) Signal processing method and apparatus, terminal device and storage medium
CN114549925A (en) Sea wave effective wave height time sequence prediction method based on deep learning
Yang et al. A new cooperative deep learning method for underwater acoustic target recognition
Luo et al. An underwater acoustic target recognition method based on combined feature with automatic coding and reconstruction
CN115602152B (en) Voice enhancement method based on multi-stage attention network
CN110717513A (en) Zero-sample deep-sea biological picture classification method based on multiple classifiers
Žust et al. HIDRA 1.0: deep-learning-based ensemble sea level forecasting in the northern Adriatic
CN113219404B (en) Underwater acoustic array signal two-dimensional direction of arrival estimation method based on deep learning
CN114565828A (en) Feature countermeasure enhancement underwater target recognition method based on acoustic embedded memory space encoder model
CN114509731A (en) Radar main lobe anti-interference method based on double-stage deep network
Song et al. A novel noise reduction technique for underwater acoustic signals based on dual‐path recurrent neural network
Chen et al. Underwater target recognition method based on convolution autoencoder
CN112949628A (en) Track data enhancement and track identification method based on embedding-mixing
Grumiaux et al. A review of sound source localization with deep learning methods
Song et al. Underwater acoustic signal noise reduction based on fully convolutional time domain separation network
Zhu et al. Feature selection based on principal component analysis for underwater source localization by deep learning
CN115146667A (en) Multi-scale seismic noise suppression method based on curvelet transform and multi-branch deep self-coding
CN114863209A (en) Class proportion guided unsupervised domain adaptive modeling method, system, device and medium
Zhang et al. A Joint Denoising Learning Model for Weight Update Space–Time Diversity Method
Zhou et al. DBSA-net: Dual branch self-attention network for underwater acoustic signal denoising
Xie et al. Data-driven parameter estimation of contaminated damped exponentials
Zainab et al. LightEQ: On-Device Earthquake Detection with Embedded Machine Learning
Xu et al. Deep Learning-Based Classification of Raw Hydroacoustic Signal: A Review

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination