CN113537113B - Underwater sound target identification method based on composite neural network

Underwater sound target identification method based on composite neural network

Info

Publication number
CN113537113B
CN113537113B (application CN202110844909.5A)
Authority
CN
China
Prior art keywords
layer
network
data
input
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110844909.5A
Other languages
Chinese (zh)
Other versions
CN113537113A (en)
Inventor
徐丽
钱婧捷
李柏宽
申林山
闫鑫
娄茹珍
贾我欢
李悦齐
张立国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN202110844909.5A
Publication of CN113537113A
Application granted
Publication of CN113537113B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Image Analysis (AREA)

Abstract

An underwater sound target identification method based on a composite neural network belongs to the technical field of underwater acoustic signal identification. The invention aims to solve the problem of low underwater acoustic target identification accuracy with existing methods. The invention designs a base-layer network structure based on a composite neural network: the time-sequence characteristics of the input audio sample data are first learned through an LSTM algorithm, and the state information updated by the algorithm is obtained as an intermediate vector; this state information is then propagated onward within the layer through a CNN network, whose convolution and pooling operations yield the spatial characteristics of the input audio sample data; finally, an underwater acoustic target recognition result is obtained through a softmax function in the last layer of the CNN network. The invention can be applied to underwater acoustic signal identification.

Description

Underwater sound target identification method based on composite neural network
Technical Field
The invention belongs to the technical field of underwater acoustic signal identification, and particularly relates to an underwater acoustic target identification method based on a composite neural network.
Background
In recent years, with the development of machine learning, deep learning and related technologies, underwater acoustic target recognition has achieved new progress and research results. The detection and identification of underwater acoustic targets play a key role in underwater operations and underwater target perception, and with the informatization and intellectualization of naval equipment, underwater acoustic target identification is a prerequisite for future underwater and surface operations; whether underwater acoustic targets can be identified and analyzed timely and accurately is therefore an important factor in seizing the initiative in naval warfare. Because the purity of the data acquired from marine audio is not high, models trained with some conventional algorithms do not predict the data accurately enough and cannot recognize the sample data accurately. Although the CNN algorithm can identify data information relatively well, its structure causes it to omit some time-related data information; the LSTM identifies the temporal characteristics of data information well but, unlike the CNN, does not handle data with spatial characteristics well.
In summary, no existing method achieves a good processing effect on both time-related data information and spatial-feature data, and the accuracy of underwater acoustic target identification with existing methods therefore remains low.
Disclosure of Invention
The invention aims to solve the problem of low underwater acoustic target identification accuracy with existing methods, and provides an underwater acoustic target identification method based on a composite neural network.
The technical scheme adopted by the invention for solving the technical problems is as follows:
an underwater acoustic target identification method based on a composite neural network specifically comprises the following steps:
step 1, segmenting an input sound signal through a window function to obtain a plurality of signals with the same length, and then respectively carrying out short-time Fourier transform on each signal to obtain a short-time Fourier transform result;
after the short-time Fourier transform result is converted into an energy spectrum, performing Mel filtering on the energy spectrum to obtain a Mel filtering result;
then, discrete cosine transform is carried out on the Mel filtering result to obtain the MFCC characteristics of the input sound signal;
step 2, inputting the MFCC characteristics obtained in the step 1 into an LSTM network to obtain an output result of the LSTM network;
step 3, inputting the output result of the LSTM network into the CNN network, and outputting the target identification result through the CNN network.
The beneficial effects of the invention are as follows: the invention designs a base-layer network structure based on a composite neural network: the time-sequence characteristics of the input audio sample data are first learned through an LSTM algorithm, and the state information updated by the algorithm is obtained as an intermediate vector; this state information is then propagated onward within the layer through a CNN network, whose convolution and pooling operations yield the spatial characteristics of the input audio sample data; finally, an underwater acoustic target recognition result is obtained through a softmax function in the last layer of the CNN network.
The target identification accuracy of the composite neural network algorithm reaches 73%, higher than that of the LSTM or CNN algorithm used alone, and the initial identification accuracy and convergence speed of the composite neural network are also better than those of the LSTM and CNN algorithms.
Drawings
FIG. 1 is a diagram of sound waveforms;
FIG. 2 is a feature extraction flow diagram;
FIG. 3 is a composite network identification flow diagram;
FIG. 4 is a comparison graph of the recognition accuracy of the three network models based on the deep learning method.
Detailed Description
First embodiment: this embodiment is described with reference to FIG. 3. The underwater acoustic target identification method based on the composite neural network specifically comprises the following steps:
step 1, segmenting an input sound signal through a window function to obtain a plurality of signal segments of the same length, and then respectively carrying out short-time Fourier transform on each segment to obtain a short-time Fourier transform result;
after converting the short-time Fourier transform result into an energy spectrum, performing Mel filtering on the energy spectrum to obtain a Mel filtering result;
then, discrete cosine transform is carried out on the Mel filtering result to obtain the MFCC (Mel-frequency cepstral coefficient) characteristics of the input sound signal;
step 2, inputting the MFCC characteristics obtained in the step 1 into an LSTM network to obtain an output result of the LSTM network;
the LSTM network learns the time sequence characteristics of the audio sample data to obtain an intermediate vector with the time sequence characteristics; the specific process comprises the following steps:
step 2.1: the input data interacts with a sigmoid function, which judges the degree to which the data is retained, so that only data meeting the requirement enters the network;
step 2.2: the retention coefficient, obtained by passing the hidden-layer input and the previous layer's output through a sigmoid, is multiplied with the data obtained by passing the hidden-layer input and the previous layer's output through tanh, yielding the state information of the current data;
step 2.3: the previous layer's data-information state is combined with the current input signal and passed through a sigmoid to obtain a weight; the current data-information state is passed through tanh to obtain a value, which is multiplied with that weight; the result is a 256-dimensional intermediate vector describing the original audio sample data.
Step 3, inputting the output result of the LSTM network into the CNN network, and outputting the target identification result through the CNN network.
The method analyzes the noise characteristics of the underwater target, applies the auditory-perception characteristics of the Mel-frequency cepstral coefficient (MFCC) to underwater acoustic target recognition, studies it in combination with machine learning, obtains the LOFAR spectrogram from the time-domain signal of the data, and thereby performs feature preprocessing on the data. Based on research into two deep-learning algorithms, the Convolutional Neural Network (CNN) and the Long Short-Term Memory network (LSTM), the invention proposes an algorithm based on a composite neural network: the LSTM first learns the time-sequence characteristics of the sound to obtain an intermediate vector; the CNN then learns the spatial characteristics of the samples on the basis of this intermediate vector; finally, a softmax function in the last layer of the CNN network outputs the target identification result, which is either a ship target or marine sound.
The second embodiment is as follows: the difference between this embodiment and the first embodiment is that, in step 1, a short-time Fourier transform is performed on each segment of the signal to obtain a short-time Fourier transform result; the specific process comprises the following steps:
$$\mathrm{STFT}(t,f)=\int_{-\infty}^{+\infty}s(\tau)\,h^{*}(\tau-t)\,e^{-j2\pi f\tau}\,d\tau$$
in the formula: STFT(t, f) is the short-time Fourier transform result, t is time, s(τ) is the input sound signal, h(·) represents the window function, * denotes the complex conjugate, f is frequency in Hz, e is the base of the natural logarithm, j is the imaginary unit, and τ is the integration variable.
For window functions of different lengths, the time-frequency resolution varies inversely with the coverage of the time window: a longer window improves frequency resolution at the cost of time resolution.
Other steps and parameters are the same as those in the first embodiment.
The third concrete implementation mode: the difference between this embodiment and the first or second embodiment is that, in step 1, the short-time fourier transform result is converted into an energy spectrum, and the specific process is as follows:
$$\mathrm{SPEC}(t,f)=\left|\mathrm{STFT}(t,f)\right|^{2}$$
where SPEC (t, f) is the energy spectrum.
Other steps and parameters are the same as those in the first or second embodiment.
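A short numerical sketch of the two formulas above, not part of the patent, may help: the test tone, Hann window, frame length and hop size below are illustrative assumptions.

```python
import numpy as np

def stft(s, win_len=256, hop=128):
    """Discrete STFT: slide the window h over s, then take a DFT per frame."""
    h = np.hanning(win_len)                        # window function h(.)
    n_frames = 1 + (len(s) - win_len) // hop
    frames = np.stack([s[i * hop : i * hop + win_len] * h
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)             # one spectrum per windowed frame

fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)                 # a 440 Hz test tone as s(tau)
S = stft(tone)                                     # STFT(t, f)
spec = np.abs(S) ** 2                              # SPEC(t, f) = |STFT(t, f)|^2
print(S.shape, spec.shape)                         # (61, 129): frames x frequency bins
```

Lengthening win_len in this sketch narrows the frequency bins while coarsening the time frames, illustrating the inverse time-frequency trade-off noted above.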
The fourth concrete implementation mode is as follows: the difference between this embodiment and one of the first to third embodiments is that the structure of the CNN network specifically includes:
from the input layer, the CNN network sequentially includes the input layer, a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer, a third pooling layer, a fourth convolutional layer, a fully-connected layer, and a softmax classification layer.
The invention designs a ten-layer network structure: the more layers the convolutional neural network has, the stronger its nonlinear fitting capability and the more complex the features it can recognize; the more convolution neurons within a layer, the richer the details of the extracted target. In this network structure, ReLU is used as the activation function, and the final model is obtained by superimposing the different characteristics of the data. The softmax function in the last layer expresses the classification result intuitively, making the result more convincing.
Other steps and parameters are the same as those in one of the first to third embodiments.
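To make the ten-layer sequence concrete, the following is a minimal PyTorch sketch. It is an illustration under stated assumptions rather than the patented implementation: the channel counts, 3x3 kernels, 2x2 pooling, the two output classes (ship target or marine sound) and the reshape of the 256-dimensional intermediate vector into a 16x16 map are all assumptions, since the patent's Table 1 is available only as an image.

```python
import torch
import torch.nn as nn

class CompositeCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),   # first convolutional layer
            nn.MaxPool2d(2),                             # first pooling layer
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),  # second convolutional layer
            nn.MaxPool2d(2),                             # second pooling layer
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),  # third convolutional layer
            nn.MaxPool2d(2),                             # third pooling layer
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),  # fourth convolutional layer
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 2 * 2, n_classes),            # fully-connected layer
            nn.Softmax(dim=1),                           # softmax classification layer
        )

    def forward(self, x):          # x: (batch, 1, 16, 16)
        return self.classifier(self.features(x))

v = torch.randn(8, 256).view(8, 1, 16, 16)  # 256-dim intermediate vector as a 2-D map
print(CompositeCNN()(v).shape)              # torch.Size([8, 2])
```

For training one would normally drop the Softmax module and feed the fully-connected logits to a cross-entropy loss; the softmax is kept here to mirror the layer sequence described above.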
The fifth concrete implementation mode is as follows: this embodiment differs from the first to fourth embodiments in that the activation function adopted by the CNN network is ReLU.
Other steps and parameters are the same as in one of the first to fourth embodiments.
Examples
The invention is further described below with reference to the accompanying drawings.
The invention provides an underwater acoustic target identification method based on a composite neural network. A base-layer network structure based on the composite neural network is designed: the time-sequence characteristics of the audio sample data are first learned through the LSTM algorithm, and the state information updated by the algorithm is obtained as an intermediate vector; this state information is then propagated onward within the layer through the CNN network, whose convolution, pooling and related operations yield the spatial characteristics of the audio sample data.
The invention specifically comprises the following steps:
step 1: performing MFCC feature extraction on original sample data, and inputting an obtained result into a designed LSTM network;
step 1.1: in sound recognition methods, Mel-frequency cepstral coefficients, the cepstral parameters extracted in the Mel-scale frequency domain, are a commonly used feature, and the relation between the Mel scale and frequency is expressed as:
$$\mathrm{Mel}(f)=2595\log_{10}\left(1+\frac{f}{700}\right)\tag{1}$$
in the formula: f is frequency in Hz;
step 1.2: the input data signal is divided by a window function to obtain a number of short signals of the same length, and the data signal is then analyzed by Fourier transform to obtain a spectrogram meeting the requirements;
within each small portion of the data information, the signal appears stationary, so the Fourier transform is expressed as:
$$S(f)=\int_{-\infty}^{+\infty}s(t)\,e^{-j2\pi ft}\,dt\tag{2}$$
the short-time fourier expression is:
$$\mathrm{STFT}(t,f)=\int_{-\infty}^{+\infty}s(\tau)\,h^{*}(\tau-t)\,e^{-j2\pi f\tau}\,d\tau\tag{3}$$
in the formula: h(t) represents the window function, and * denotes the complex conjugate;
transforming the time-frequency function into an energy spectrum:
$$\mathrm{SPEC}(t,f)=\left|\mathrm{STFT}(t,f)\right|^{2}\tag{4}$$
for window functions of different lengths, the time-frequency resolution varies inversely with the coverage of the time window: a longer window improves frequency resolution at the cost of time resolution;
step 1.3: the data is then passed onward through Mel filtering;
step 1.4: the MFCC features of the original sample data are obtained through further computations such as the discrete cosine transform; a minimal sketch of this pipeline is given below.
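The following is a hedged sketch of this step-1 pipeline (windowing and STFT, energy spectrum, Mel filtering, discrete cosine transform) using librosa; the bundled example clip, FFT size, hop length, number of Mel bands and number of coefficients are illustrative assumptions, not values specified by the invention.

```python
import numpy as np
import librosa

# Stand-in audio clip (downloaded on first use); a real application would
# load hydrophone recordings instead.
y, sr = librosa.load(librosa.ex('trumpet'), sr=None)

S = librosa.stft(y, n_fft=512, hop_length=256)                  # windowed STFT
spec = np.abs(S) ** 2                                           # energy spectrum |STFT|^2
mel = librosa.feature.melspectrogram(S=spec, sr=sr, n_mels=40)  # Mel filtering
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=13)  # DCT -> MFCC
print(mfcc.shape)                                               # (13, number of frames)
```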
Step 2: the time sequence characteristics of the audio sample data are learned on the basis of basic characteristics by passing the input sample data information through an LSTM network, and an intermediate vector with the time sequence characteristics is obtained;
step 2.1: the input data interacts with a sigmoid function, which judges the degree to which the data is retained, so that only data meeting the requirement enters the network;
step 2.2: the retention coefficient, obtained by passing the hidden-layer input and the previous layer's output through a sigmoid, is multiplied with the data obtained by passing the hidden-layer input and the previous layer's output through tanh, yielding the state information of the current data;
step 2.3: the previous layer's data-information state is combined with the current input signal and passed through a sigmoid to obtain a weight; the current data-information state is passed through tanh to obtain a value, which is multiplied with that weight; the result is a 256-dimensional intermediate vector describing the original audio sample data (a minimal sketch of this step is given below).
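A minimal PyTorch sketch of this step follows; the sigmoid and tanh gate interactions of steps 2.1 to 2.3 are what torch.nn.LSTM implements internally. The MFCC dimension of 13 and the 100-frame sequence length are assumptions; the 256-unit hidden state matches the intermediate vector described above.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=13, hidden_size=256, batch_first=True)

mfcc_seq = torch.randn(8, 100, 13)    # (batch, frames, MFCC coefficients)
outputs, (h_n, c_n) = lstm(mfcc_seq)  # the gates of steps 2.1-2.3 run inside each LSTM cell
intermediate = h_n[-1]                # final hidden state: the 256-dim intermediate vector
print(intermediate.shape)             # torch.Size([8, 256])
```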
And step 3: on the basis of the intermediate vector, the CNN network applies convolution, pooling and related operations, combining the intermediate vector of the audio data with the convolutional neural network to learn the spatial characteristics of the samples and obtain the final training model;
step 3.1: by virtue of its structure, the CNN captures spatial characteristics well; convolution operations on the input data yield the corresponding data features, and pooling operations retain the main features of the data;
step 3.2: the invention designs a ten-layer network structure comprising, from the input layer, four convolutional layers, three pooling layers and one fully-connected layer, with a softmax function as the last layer; the design of the CNN in the composite network structure is shown in Table 1:
table 1 design of CNN in composite network architecture
(Table 1 is reproduced as images in the original publication. The recoverable structure is the ten-layer sequence: input layer; first convolutional layer; first pooling layer; second convolutional layer; second pooling layer; third convolutional layer; third pooling layer; fourth convolutional layer; fully-connected layer; softmax classification layer. The per-layer kernel sizes and channel counts appear only in the image.)
Step 3.3: in this network structure, ReLU is used as the activation function, and the final model is obtained by superimposing the different characteristics of the data.
Experimental verification is carried out respectively on a traditional machine learning method (using MFCC features and an SVM classifier), a convolutional neural network-based method (using MFCC features and a traditional CNN network), a long short-term memory network-based method (using MFCC features and an RNN network) and the composite neural network-based method, and the experimental results are compared;
experiments based on the traditional machine learning method: MFCC features and an SVM classifier are used as the traditional machine learning method, the SVM adopting a one-versus-one classification mode;
the Mel-frequency cepstral coefficient is a commonly applied feature extraction technique for audio signal identification: filtering and related operations on the original audio data yield the corresponding feature output, the waveform data of the audio is converted into tensor data containing the time sequence, and the cepstral parameters extracted in the Mel-scale frequency domain combine analysis in both the time and frequency domains;
from the finite, countable sample information data, the SVM learns the model that can accurately recognize new information data and seeks the solution with the best generalization; in the one-versus-one classification mode, the accuracy of the model is obtained by counting all correctly predicted data; a minimal sketch of this baseline follows;
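The baseline can be sketched with scikit-learn; the synthetic feature vectors, three classes and RBF kernel below are placeholders for illustration. Note that sklearn's SVC trains its multiclass model one-versus-one internally, matching the classification mode described above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X = np.random.randn(200, 13)            # stand-in MFCC feature vectors
y = np.random.randint(0, 3, size=200)   # stand-in class labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel='rbf', decision_function_shape='ovo').fit(X_tr, y_tr)
print('accuracy:', clf.score(X_te, y_te))  # fraction of correctly predicted data
```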
experiments based on the convolutional neural network method: in the traditional CNN experiment, the network structure has nine layers in total: the initial layer is the input layer, followed by three convolutional layers, three pooling layers and one fully-connected layer, and the last layer ends with the softmax function; in this network structure, ReLU is used as the activation function, and the softmax function expresses the classification result intuitively;
experiment based on the long short-term memory network (LSTM) method: after MFCC feature extraction is performed on the input original audio data, a gate structure is added, on the basis of an RNN, to the neurons between the layers of the network's hidden layers, giving the neurons new relations; a sigmoid operation on the current-stage hidden-layer input data and the previous-stage hidden-layer output data yields a parameter that discriminates what is retained; this parameter is multiplied with the value obtained by passing the same two items of data through a tanh function, giving the state information of the current data; weighting the current input signal and the previous-stage data-information state through the sigmoid function gives a corresponding result; combining the current data-information state with the tanh function gives another value, and the product of this value with the previous-stage weighted result gives the output data;
experiment based on the composite neural network method: MFCC feature extraction is first performed on the original audio data information; a 256-dimensional intermediate vector describing the original audio sample data is then obtained through the LSTM network structure, and this intermediate vector is spatially trained through the designed CNN network;
the network structure of the CNN algorithm has ten layers in total: starting from the input layer, there follow four convolutional layers, three pooling layers and one fully-connected layer, and finally a layer ending with the softmax function; in this network structure, ReLU is used as the activation function, and the softmax function expresses the classification result intuitively;
experiments on these different algorithms yield the accuracy of each method in identifying the underwater acoustic target, and expressing the accuracy of the models in the form of an image allows the results to be analyzed more clearly and intuitively.
Referring to FIG. 1, a sound waveform diagram is shown; the flowchart of MFCC feature extraction in the traditional machine-learning underwater acoustic target recognition method is shown in FIG. 2.
Referring to FIG. 3, the composite-network identification flow of the invention is as follows: MFCC feature extraction is first performed on the original audio data information; the time-sequence characteristics of the audio sample data are then learned through the LSTM algorithm within the LSTM network structure to obtain a 256-dimensional intermediate vector describing the original audio sample data; this intermediate vector is then spatially trained through the designed CNN network, whose convolution, pooling and related operations yield the spatial characteristics of the audio sample data; finally, the softmax function in the CNN algorithm outputs the classification result. The network structure of the CNN algorithm has ten layers in total: starting from the input layer, there follow four convolutional layers, three pooling layers and one fully-connected layer, and finally a layer ending with the softmax function; in this network structure, ReLU is used as the activation function, and the softmax function expresses the classification result intuitively. A hedged end-to-end sketch of this flow is given below.
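The following minimal sketch chains the step-2 LSTM into the step-3 CNN, reusing the CompositeCNN class from the earlier sketch; the MFCC dimension, sequence length, two output classes and the reshape of the 256-dimensional intermediate vector into a single-channel 16x16 map are illustrative assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class CompositeNet(nn.Module):
    """LSTM temporal features -> CNN spatial features -> softmax, per FIG. 3."""
    def __init__(self, n_mfcc=13, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, 256, batch_first=True)  # step 2: temporal features
        self.cnn = CompositeCNN(n_classes)          # step 3: ten-layer CNN sketched earlier

    def forward(self, mfcc_seq):                    # mfcc_seq: (batch, frames, n_mfcc)
        _, (h_n, _) = self.lstm(mfcc_seq)
        v = h_n[-1].view(-1, 1, 16, 16)             # 256-dim intermediate vector as a 2-D map
        return self.cnn(v)                          # class probabilities from the softmax layer

probs = CompositeNet()(torch.randn(4, 100, 13))
print(probs.shape)                                  # torch.Size([4, 2])
```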
The experimental performance of the CNN and LSTM algorithms and the composite neural network algorithm is compared using the identification accuracy (the accuracy after model training), the initial identification accuracy (the identification accuracy of the untrained model) and the convergence speed. The identification-accuracy curves of the three deep-learning network models are shown in FIG. 4; representing the accuracy of the models in the form of an image allows the results to be analyzed more clearly and intuitively:
The images show that the recognition accuracy of the CNN network model is about 63%, that of the LSTM network model about 67%, and that of the composite neural network model about 73%. In accuracy, the LSTM algorithm is superior to the CNN algorithm, and the composite neural network algorithm is higher than the LSTM algorithm; in convergence, the composite neural network algorithm converges faster than the LSTM algorithm, while the CNN algorithm converges slightly faster than the composite neural network. The composite neural network model exhibits a good recognition effect from the start, better than the LSTM and CNN models.
The above calculation examples of the invention merely describe its calculation model and calculation flow in detail and are not intended to limit its embodiments. Other variations and modifications can be made by those skilled in the art on the basis of the above description; it is not possible to exhaust all embodiments here, and all obvious variations and modifications falling within the scope of the invention are intended to be included within that scope.

Claims (4)

1. An underwater acoustic target identification method based on a composite neural network is characterized by comprising the following steps:
step 1, segmenting an input sound signal through a window function to obtain a plurality of signals with the same length, and then respectively carrying out short-time Fourier transform on each signal to obtain a short-time Fourier transform result;
after the short-time Fourier transform result is converted into an energy spectrum, performing Mel filtering on the energy spectrum to obtain a Mel filtering result;
then, discrete cosine transform is carried out on the Mel filtering result to obtain the MFCC characteristics of the input sound signal;
step 2, inputting the MFCC characteristics obtained in the step 1 into an LSTM network to obtain an output result of the LSTM network;
the input sample data information is processed through an LSTM network, and the time sequence characteristics of the audio sample data are learned on the basis of the basic characteristics to obtain an intermediate vector with the time sequence characteristics; the specific process comprises the following steps:
step 2.1: the input data interacts with a sigmoid function, which judges the degree to which the data is retained, so that only data meeting the requirement enters the network;
step 2.2: the retention coefficient, obtained by passing the hidden-layer input and the previous layer's output through a sigmoid, is multiplied with the data obtained by passing the hidden-layer input and the previous layer's output through tanh, yielding the state information of the current data;
step 2.3: the previous layer's data-information state is combined with the current input signal and passed through a sigmoid to obtain a weight; the current data-information state is passed through tanh to obtain a value, which is multiplied with that weight; the result is a 256-dimensional intermediate vector describing the original audio sample data;
step 3, inputting the output result of the LSTM network into the CNN network, and outputting a target identification result through the CNN network;
the structure of the CNN network is specifically as follows:
from the input layer, the CNN network sequentially comprises the input layer, a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer, a third pooling layer, a fourth convolutional layer, a fully-connected layer and a softmax classification layer.
2. The underwater sound target identification method based on the composite neural network as claimed in claim 1, wherein in step 1 a short-time Fourier transform is performed on each segment of the signal to obtain a short-time Fourier transform result; the specific process comprises the following steps:
$$\mathrm{STFT}(t,f)=\int_{-\infty}^{+\infty}s(\tau)\,h^{*}(\tau-t)\,e^{-j2\pi f\tau}\,d\tau$$
in the formula: STFT(t, f) is the short-time Fourier transform result, t is time, s(τ) is the input sound signal, h(·) represents the window function, * denotes the complex conjugate, f is frequency in Hz, e is the base of the natural logarithm, j is the imaginary unit, and τ is the integration variable.
3. The underwater acoustic target identification method based on the composite neural network as claimed in claim 2, wherein in the step 1, the short-time fourier transform result is converted into an energy spectrum, and the specific process is as follows:
$$\mathrm{SPEC}(t,f)=\left|\mathrm{STFT}(t,f)\right|^{2}$$
where SPEC (t, f) is the energy spectrum.
4. The underwater acoustic target recognition method based on the composite neural network as claimed in claim 3, wherein the activation function adopted by the CNN network is ReLU.
CN202110844909.5A 2021-07-26 2021-07-26 Underwater sound target identification method based on composite neural network Active CN113537113B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110844909.5A CN113537113B (en) 2021-07-26 2021-07-26 Underwater sound target identification method based on composite neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110844909.5A CN113537113B (en) 2021-07-26 2021-07-26 Underwater sound target identification method based on composite neural network

Publications (2)

Publication Number Publication Date
CN113537113A CN113537113A (en) 2021-10-22
CN113537113B true CN113537113B (en) 2022-10-25

Family

ID=78088998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110844909.5A Active CN113537113B (en) 2021-07-26 2021-07-26 Underwater sound target identification method based on composite neural network

Country Status (1)

Country Link
CN (1) CN113537113B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114636995A (en) * 2022-03-16 2022-06-17 中国水产科学研究院珠江水产研究所 Underwater sound signal detection method and system based on deep learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846323A (en) * 2018-05-28 2018-11-20 哈尔滨工程大学 A kind of convolutional neural networks optimization method towards Underwater Targets Recognition
CN112329524A (en) * 2020-09-25 2021-02-05 泰山学院 Signal classification and identification method, system and equipment based on deep time sequence neural network
CN112615804A (en) * 2020-12-12 2021-04-06 中国人民解放军战略支援部队信息工程大学 Short burst underwater acoustic communication signal modulation identification method based on deep learning
CN112887239A (en) * 2021-02-15 2021-06-01 青岛科技大学 Method for rapidly and accurately identifying underwater sound signal modulation mode based on deep hybrid neural network

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200348662A1 (en) * 2016-05-09 2020-11-05 Strong Force Iot Portfolio 2016, Llc Platform for facilitating development of intelligence in an industrial internet of things system
CN110807365B (en) * 2019-09-29 2022-02-11 浙江大学 Underwater target identification method based on fusion of GRU and one-dimensional CNN neural network
US11885907B2 (en) * 2019-11-21 2024-01-30 Nvidia Corporation Deep neural network for detecting obstacle instances using radar sensors in autonomous machine applications
CN112329819A (en) * 2020-10-20 2021-02-05 中国海洋大学 Underwater target identification method based on multi-network fusion
CN112364779B (en) * 2020-11-12 2022-10-21 中国电子科技集团公司第五十四研究所 Underwater sound target identification method based on signal processing and deep-shallow network multi-model fusion
CN112927723A (en) * 2021-04-20 2021-06-08 东南大学 High-performance anti-noise speech emotion recognition method based on deep neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846323A (en) * 2018-05-28 2018-11-20 哈尔滨工程大学 A kind of convolutional neural networks optimization method towards Underwater Targets Recognition
CN112329524A (en) * 2020-09-25 2021-02-05 泰山学院 Signal classification and identification method, system and equipment based on deep time sequence neural network
CN112615804A (en) * 2020-12-12 2021-04-06 中国人民解放军战略支援部队信息工程大学 Short burst underwater acoustic communication signal modulation identification method based on deep learning
CN112887239A (en) * 2021-02-15 2021-06-01 青岛科技大学 Method for rapidly and accurately identifying underwater sound signal modulation mode based on deep hybrid neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Underwater Acoustic Target Recognition with ResNet18 on ShipsEar Dataset;Feng Hong 等;《2021 IEEE 4th International Conference on Electronics Technology》;20210616;第1240-1244页 *
Research on Voiceprint Recognition Based on a CNN-LSTM Network; 闫河 et al.; Computer Applications and Software; 2019-04-12; Vol. 36, No. 4; pp. 166-170 *
Research on Feature Extraction and Classification Recognition Technology for Underwater Acoustic Targets; 连梓旭; China Master's Theses Full-text Database, Basic Sciences; 2020-02-15 (No. 2); A005-102 *

Also Published As

Publication number Publication date
CN113537113A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
US11908455B2 (en) Speech separation model training method and apparatus, storage medium and computer device
CN110400579B (en) Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network
Ariav et al. An end-to-end multimodal voice activity detection using wavenet encoder and residual networks
CN110827804B (en) Sound event labeling method from audio frame sequence to event label sequence
CN109637522B (en) Speech emotion recognition method for extracting depth space attention features based on spectrogram
CN112581979B (en) Speech emotion recognition method based on spectrogram
CN110853656B (en) Audio tampering identification method based on improved neural network
CN113571067A (en) Voiceprint recognition countermeasure sample generation method based on boundary attack
CN115862684A (en) Audio-based depression state auxiliary detection method for dual-mode fusion type neural network
Zhou et al. A denoising representation framework for underwater acoustic signal recognition
CN112541533A (en) Modified vehicle identification method based on neural network and feature fusion
CN113537113B (en) Underwater sound target identification method based on composite neural network
Zhao et al. A survey on automatic emotion recognition using audio big data and deep learning architectures
Cheng et al. DNN-based speech enhancement with self-attention on feature dimension
Yechuri et al. A nested U-net with efficient channel attention and D3Net for speech enhancement
CN112329819A (en) Underwater target identification method based on multi-network fusion
Hu et al. Speech Emotion Recognition Based on Attention MCNN Combined With Gender Information
Parekh et al. Tackling interpretability in audio classification networks with non-negative matrix factorization
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN115171878A (en) Depression detection method based on BiGRU and BiLSTM
Qiu et al. Adversarial Latent Representation Learning for Speech Enhancement.
Li et al. MPAF-CNN: Multiperspective aware and fine-grained fusion strategy for speech emotion recognition
Li et al. Multi-layer attention mechanism based speech separation model
Sushma et al. Emotion analysis using signal and image processing approach by implementing deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant