CN111627419B - Sound generation method based on underwater target and environmental information characteristics


Info

Publication number
CN111627419B
Authority
CN
China
Prior art keywords: underwater, sound signal, sound, auditory, feature
Legal status: Active
Application number: CN202010387814.0A
Other languages: Chinese (zh)
Other versions: CN111627419A
Inventor
王红滨
沙忠澄
何鸣
王念滨
周连科
张毅
何茜茜
Current Assignee: Harbin Engineering University
Original Assignee: Harbin Engineering University
Priority date
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN202010387814.0A
Publication of CN111627419A
Application granted
Publication of CN111627419B
Legal status: Active
Anticipated expiration


Classifications

    • G10L13/047: Architecture of speech synthesisers (G10L13/00 Speech synthesis; text-to-speech systems; G10L13/04 Details of speech synthesis systems)
    • G06N3/045: Combinations of networks (G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/08: Learning methods (G06N3/02 Neural networks)
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks

Abstract

A sound generation method based on underwater target and environmental information features, belonging to the field of underwater sound signal generation research. The invention addresses two problems: underwater acoustic signals generated from target and environment sound signal feature dictionaries built with traditional feature extraction methods are of poor quality, and existing TTS sound generation models are of limited use for underwater acoustic signal generation. The method incorporates an auditory attention mechanism so that, during feature extraction, the features of the underwater target sound signal can be distinguished from those of the environmental sound signal, which improves the accuracy of the resulting feature dictionaries. The feature dictionary is embedded in the sound generation model as its pronunciation dictionary, improving the quality of the generated underwater sound signal. The method can be applied to the generation of underwater acoustic signals.

Description

Sound generation method based on underwater target and environmental information characteristics
Technical Field
The invention belongs to the field of underwater sound signal generation research, and particularly relates to a sound generation method based on underwater target and environmental information characteristics.
Background
Existing speech feature extraction methods achieve reasonably good results on speech audio, but the audio data sets they target are all speech-like, i.e. recordings of human speech, and separating human speech from non-speech background noise is comparatively easy. Underwater sound signals, however, are not human speech: they may come from a vessel's propellers, from motors, or from environmental noise. During feature extraction, traditional speech feature extraction methods have difficulty separating the features of an underwater target's radiated noise from those of other noise. Ordinary speech feature extraction is therefore poorly suited to distinguishing the underwater sound signal of interest from other noise signals.
The traditional audio feature dictionary construction process mainly comprises two steps: feature extraction and dictionary generation. If the collected audio data set mixes underwater sound signals with other background noise, then even though traditional feature extraction can still build an underwater target sound signal feature dictionary and an underwater environment sound signal feature dictionary, the accuracy of the generated sounds in classification or recognition experiments drops sharply, because traditional feature extraction cannot reliably separate the target's radiated noise from other noise. Consequently, when such dictionaries are used to generate underwater sound signals, the quality of the generated signals is poor.
Meanwhile, although existing TTS sound generation models have proven effective for speech generation, their pronunciation dictionaries are phoneme dictionaries and are therefore limited to human speech. Their applicability to underwater sound signal generation is thus limited, and they cannot be used directly to generate underwater sound signals.
Disclosure of Invention
The invention aims to solve two problems: the poor quality of underwater sound signals generated from underwater target and environment sound signal feature dictionaries built with traditional feature extraction methods, and the limited applicability of existing TTS sound generation models to underwater sound signal generation. To this end, it provides a sound generation method based on underwater target and environment information features.
The technical solution adopted by the invention to solve these problems is as follows: a sound generation method based on underwater target and environmental information features, comprising the following steps:
step one, for an underwater target S1, after a sound signal sample of the underwater target is collected, processing the collected sound signal sample in parallel by frequency channel and constructing an auditory saliency map based on frequency-channel processing;
step two, framing the collected sound signal sample and the constructed auditory saliency map in the time domain to obtain a plurality of groups, each containing a sound signal frame and an auditory saliency map frame of the same duration;
inputting the framed sound signals and auditory saliency maps into a convolutional neural network model and extracting the multi-channel features corresponding to each frame of the sound signal; linearly combining the multi-channel features of the frames in time order to generate a feature matrix for the collected sound signal sample;
step three, collecting M-1 further sound signal samples of the underwater target S1 and applying the processing of steps one and two to each collected sample to obtain the feature matrix corresponding to each sound signal sample;
stacking the feature matrices of all sound signal samples of the underwater target S1 along the channel dimension to obtain the feature matrix corresponding to the underwater target S1;
step four, repeating steps one to three for the other underwater targets and for each underwater environment condition to obtain the feature matrix corresponding to each of the other underwater targets and to each underwater environment condition;
step five, establishing the mapping from each underwater target feature word to its feature matrix and from each underwater environment feature word to its feature matrix, and forming a feature dictionary from all of these mappings;
step six, establishing a sound generation model comprising an encoder, a decoder and a post-processing network;
for an underwater sound signal of a given target in a given environment to be generated, looking up in the feature dictionary the feature matrices corresponding to the underwater target feature word and the underwater environment feature word of the signal to be generated;
inputting the underwater target feature word and the underwater environment feature word into the encoder; the encoder, combining them with the retrieved feature matrices, extracts high-level features from the feature words corresponding to the underwater acoustic signal to be generated and processes them to obtain the final encoder representation;
inputting the final encoder representation into the decoder, which outputs a Mel-scale spectrogram; the Mel-scale spectrogram is then input into the post-processing network, which generates the waveform of the underwater acoustic signal.
The beneficial effects of the invention are as follows. The proposed sound generation method based on underwater target and environment information features incorporates an auditory attention mechanism: an auditory saliency computation model generates auditory saliency maps of the underwater target and environment sound signals, which serve as prior knowledge for auditory attention and make the relevant features stand out during feature extraction, thereby improving the accuracy of the underwater target and environment sound signal feature dictionaries. The feature dictionary is embedded in the sound generation model as its pronunciation dictionary, which improves the quality of the generated underwater sound signal.
Combining the features of the underwater target and the underwater environment, the invention builds a sound generation model that produces the sound signal of a given target in a given environment. The feature dictionary serves as the model's pronunciation dictionary and is embedded in the model. The feature matrices corresponding to the underwater target and environment features of the signal to be generated are looked up in the feature dictionary and input into an encoder based on SEQ2FEA, which extracts high-level features and converts them into the final representation. This representation is input into a decoder based on causal convolution, which produces the corresponding mel output; the vocoder's generation parameters are then predicted from the mel output to generate the waveform of the underwater sound signal. The sound generation model can thus be applied effectively to underwater acoustic signal generation, extending the application field of TTS from human speech to underwater acoustic signals.
Drawings
FIG. 1 is a diagram of an auditory significance calculation model based on frequency channel processing;
FIG. 2 is a schematic diagram of feature extraction for underwater target sound signals and underwater environment sound signals;
FIG. 3 is a diagram of a convolutional neural network architecture;
FIG. 4 is a flow chart of feature dictionary generation for underwater target sound signals and underwater environment signals;
FIG. 5 is a diagram of a feature dictionary structure;
FIG. 6 is a block diagram of the SEQ2FEA network;
FIG. 7 is a schematic diagram of feature extraction performed by the groups of multi-scale one-dimensional convolution filters of the SEQ2FEA network;
FIG. 8 is an architectural diagram of a sound generation model based on underwater target-environment characteristics;
FIG. 9 is a diagram of the encoder structure based on the SEQ2FEA network;
FIG. 10 is a block diagram of a decoder;
FIG. 11 is a diagram of a post-processing network architecture;
FIG. 12 is a schematic drawing of dropout;
FIG. 13 is a time-frequency diagram of the sound signal under the "90% power 150 m" condition;
FIG. 14 is the auditory saliency map of the sound signal under the "90% power 150 m" condition;
FIG. 15 is the feature dictionary constructed from the "8 horsepower 100 m" sound signal;
FIG. 16 is the feature dictionary constructed from the "8 horsepower 350 m" sound signal;
FIG. 17 is the feature dictionary constructed from the "90% power 150 m" sound signal;
FIG. 18 is the feature dictionary constructed from the "large ship salvage buoy" sound signal;
FIG. 19 is a time-frequency diagram of the sound signal under the "50power" condition;
FIG. 20 is the auditory saliency map of the sound signal under the "50power" condition;
FIG. 21 is the feature dictionary constructed from the "0power" sound signal;
FIG. 22 is the feature dictionary constructed from the "80power" sound signal;
FIG. 23 is the feature dictionary constructed from the "work0" sound signal;
FIG. 24 is the feature dictionary constructed from the "work80" sound signal.
Detailed Description
The first embodiment is as follows: the sound generation method based on underwater target and environmental information features of this embodiment comprises the following steps:
step one, for an underwater target S1, after a sound signal sample of the underwater target is collected (the sound signal sample contains both the underwater target signal and noise signals), the collected sound signal sample is processed in parallel by frequency channel, and an auditory saliency map based on frequency-channel processing is constructed;
step two, framing the collected sound signal sample and the constructed auditory saliency map in the time domain to obtain a plurality of groups, each containing a sound signal frame and an auditory saliency map frame of the same duration;
the sound signal sample and the auditory saliency map are framed starting from the initial time instant of the time domain, each group containing one frame of the sound signal and one frame of the auditory saliency map; the number of groups depends on the length of the collected sound signal sample and on the length of each frame;
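A minimal numpy sketch of this framing step is given below; it uses the 50000 Hz sampling rate and 100 ms frame length reported later for the Songhua River data, and the array layout of the saliency map (one row per frequency channel, time along the columns) is an assumption.

    import numpy as np

    def frame_pairs(signal, saliency_map, sample_rate=50000, frame_ms=100):
        """Split a sound signal and its auditory saliency map into
        time-aligned groups of equal duration (illustrative values)."""
        frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
        n_frames = len(signal) // frame_len              # trailing remainder dropped
        groups = []
        for k in range(n_frames):
            s = signal[k * frame_len:(k + 1) * frame_len]
            # saliency_map: one row per frequency channel, columns aligned in time
            a = saliency_map[:, k * frame_len:(k + 1) * frame_len]
            groups.append((s, a))
        return groups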
inputting the framed sound signals and auditory saliency maps into a convolutional neural network model and extracting the multi-channel features corresponding to each frame of the sound signal; the multi-channel features of the frames are linearly combined in time order to generate a feature matrix for the collected sound signal sample;
step three, collecting M-1 further sound signal samples of the underwater target S1 and applying the processing of steps one and two to each collected sample to obtain the feature matrix corresponding to each sound signal sample;
M denotes the total number of sound signal samples collected for the underwater target S1;
stacking the feature matrices of all sound signal samples of the underwater target S1 along the channel dimension to obtain the feature matrix corresponding to the underwater target S1;
step four, repeating steps one to three for the other underwater targets and for each underwater environment condition to obtain the feature matrix corresponding to each underwater target and each underwater environment condition;
step five, establishing the mapping from each underwater target feature word to its feature matrix and from each underwater environment feature word to its feature matrix, and forming a feature dictionary from all of these mappings, as shown in fig. 4 and 5.
The underwater target feature words are obtained directly from the corresponding sound signal samples, as are the underwater environment feature words; the feature dictionary fully covers the target features and the environment features and contains all established mappings;
step six, establishing a sound generation model comprising an encoder, a decoder and a post-processing network;
for an underwater sound signal of a given target in a given environment to be generated, looking up in the feature dictionary the feature matrices corresponding to the underwater target feature word and the underwater environment feature word of the signal to be generated;
inputting the underwater target feature word and the underwater environment feature word into the encoder; the encoder, combining them with the retrieved feature matrices, extracts high-level features from the feature words corresponding to the underwater acoustic signal to be generated and processes them to obtain the final encoder representation;
inputting the final encoder representation into the decoder, which outputs a Mel-scale spectrogram; the Mel-scale spectrogram is then input into the post-processing network, which generates the waveform of the underwater acoustic signal.
The overall architecture of the target-environment feature-based sound generation model is shown in fig. 8. The main differences between this model and existing TTS sound generation models are that the phoneme dictionary is replaced by an acoustic signal feature dictionary based on the auditory attention mechanism, and that the model structure uses a fully convolutional neural network capable of parallel computation.
The target-environment feature-based sound generation model can be roughly divided into an encoder based on SEQ2FEA, a decoder based on causal convolution, and a post-processing network. The model embeds the underwater sound signal feature dictionary in the encoder as its pronunciation dictionary; an underwater target feature word (id) and an environment feature word (id) are input into the encoder, and the corresponding feature matrices are looked up in the feature dictionary by the given ids. The SEQ2FEA-based encoder converts the multi-channel feature matrices into high-level features that serve as the encoder's internal representation, and the decoder is trained with causal convolutions to generate the corresponding mel feature spectrogram. Finally, the post-processing network converts this into the waveform of the output audio. The components of the target-environment feature-based sound generation model are briefly described below.
Encoder based on SEQ2FEA: the input is a character sequence, namely the aforementioned underwater target (or environment) name id. The character sequence and the feature matrix from the feature dictionary are first projected from the input dimension to the target dimension through a fully connected layer. The SEQ2FEA network then combines the feature matrix to extract the corresponding high-level features from the input character sequence, processes them through a nonlinearity and a series of convolution blocks, and projects them back to the embedding dimension through a fully connected layer to generate a vector h_k; a vector h_v is computed from h_k according to a formula (given only as an image in the original document). The vectors h_v and h_k together form the final representation of the encoder, a robust, trainable, continuous vector. Compared with conventional encoders, this SEQ2FEA-based encoder reduces overfitting. The encoder structure is shown in fig. 9.
Decoder based on causal dilated convolution: the decoder models the signal with causal convolutions, feeding the sample points at times 1 to t-1 back into the network as conditioning to predict the sample point at time t. Dilated convolution injects holes into the standard convolution kernel, thereby enlarging the model's receptive field. The mel-scale spectrum was observed to perform better than raw audio features in the experiments, so the mel-scale spectrum is chosen as the low-dimensional audio representation. The internal representation corresponding to the vectors generated by the encoder is input into the decoder. The decoder network starts with dropout (a weight-dropping strategy), which helps prevent the neural network from overfitting. A nonlinearity and a fully connected layer then lead to a rectified linear unit (ReLU), which preprocesses the input mel-scale spectrum. A causal convolution and an attention block then generate the query used to process the encoder's hidden state. Finally, after a series of nonlinear transformations and product operations, a fully connected layer outputs the mel-scale spectrogram. The decoder structure is shown in fig. 10.
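A minimal TensorFlow sketch of a causal dilated convolution stack of the kind such a decoder relies on is shown below; filter count, kernel size and dilation rates are illustrative assumptions, and the attention block and output projection are omitted.

    import tensorflow as tf

    def causal_dilated_stack(x, filters=256, kernel_size=3, dilations=(1, 2, 4, 8)):
        """Stack of causal dilated 1-D convolutions: the output at step t only
        sees inputs at steps 1..t (padding='causal'), and dilation widens the
        receptive field without adding layers. Sizes are illustrative."""
        h = x
        for d in dilations:
            h = tf.keras.layers.Conv1D(filters, kernel_size,
                                       padding='causal', dilation_rate=d,
                                       activation='relu')(h)
        return h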
Post-processing (converter) network: the post-processing network uses WaveNet as the vocoder. The mel-scale spectrum output by the decoder is the input to the post-processing network; the audio-generation parameters of the vocoder are computed through a series of convolution blocks and two fully connected layers, and the vocoder parameters, after re-weighting, are fed into the network as external conditioning. The WaveNet vocoder is trained with true mel-scale spectra together with the audio waveforms, and learns to predict mel-scale spectral magnitudes sampled on a linear frequency scale. Because the post-processing network takes the activation of the decoder's last hidden state as input when predicting the parameters of the subsequent WaveNet vocoder, it can see the complete decoded sequence; the vocoder generation parameters are predicted through a series of non-causal convolution blocks, and the waveform samples of the output audio are generated directly. The post-processing network architecture is shown in fig. 11.
The invention adopts a fully convolutional network architecture in place of a recurrent neural network architecture, alleviating the problem of slow model training.
Model training process
For the training of the encoder, given a target id and an environment id, the corresponding feature matrices are looked up in the feature dictionary using those ids. Each feature matrix has size 120 × 256. The feature matrices of the target and the environment are then each input into a fully connected layer, whose computation is given by equation (5):
y_j = δ( Σ_i w_ij · x_i + b_j )   (5)
Here x_i denotes the i-th element of the input, y_j the output of the j-th neuron of the fully connected layer, w_ij the weight between the i-th input element and the j-th neuron, b_j the bias of the j-th neuron, and δ the activation function of the fully connected layer, for which ReLU is used; the ReLU function is given by equation (6):
ReLU(x) = max(0, x)   (6)
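As a minimal numpy illustration of equations (5) and (6) (the weight layout, one column per neuron, is an assumption):

    import numpy as np

    def dense_relu(x, W, b):
        """Equations (5)/(6): y_j = ReLU(sum_i w_ij * x_i + b_j).
        W has shape (len(x), n_neurons); b has shape (n_neurons,)."""
        return np.maximum(0.0, W.T @ x + b)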
The fully connected layer resizes the feature matrix to 13 × 13, which is then input into the SEQ2FEA network. SEQ2FEA computes the high-level features of the target and environment feature matrices; from the SEQ2FEA parameter table, the feature size after SEQ2FEA is 256 × 1. The high-level features of the target and the environment are then fused to obtain a feature of size 256 × 1. The fused result is input into several convolution blocks to further extract the combined features, and a feature output is obtained through a fully connected layer; the convolution blocks, a fully connected layer and a residual connection together form a residual block. The result after the residual connection is used to generate the vector h_k, and the vector h_v is computed from h_k according to a formula (given only as an image in the original document). The vectors h_v and h_k together form the final representation of the encoder, a robust, trainable, continuous vector.
For the training of the decoder, the decoder's input is the encoder's internal representation; the decoder first passes it through a fully connected layer with a dropout weight-dropping strategy. As shown in fig. 12, dropout makes it possible to train neural networks on a limited amount of data: for each training case, randomly dropping a fixed fraction of a given layer's outputs can significantly improve test-set performance. It prevents joint adaptation between neurons in the network, which limits the capacity of the network and prevents overfitting.
The output of the fully connected layer is further processed by a ReLU function, which adjusts the output values and adds nonlinearity to the model. The result is then input into a causal dilated convolution. Causal convolution is used to speed up model training and prediction and therefore cannot use data from later time instants; this is the defining property of causal convolution: y_t is predicted from x_1, ..., x_t and y_1, ..., y_{t-1} so that y_t is close to the actual value. The computation is given by equation (7) (shown only as an image in the original document). Dilated convolution injects holes into the standard convolution kernel, thereby enlarging the model's receptive field, so that a larger receptive field is obtained with fewer network layers. The query matrix transformed from the mel spectrum is obtained after the convolution; the attention module uses dot-product attention, computing the attention weights from the convolved query matrix and the encoder's internal representation (a matrix), and obtains the context vector by weighted summation. Finally, the result is multiplied with the corresponding term (given only as an image in the original document), and the corresponding mel representation is obtained through a fully connected layer.
For the training of the post-processing (converter) network, its input is the mel representation output by the decoder. The mel representation passes through a series of convolution blocks for feature extraction, and the audio-generation parameters of the vocoder are predicted through two fully connected layers. Related studies show that WaveNet gives the best results as a vocoder, so in the experiments the post-processing network uses WaveNet as the vocoder and generates the final sound signal waveform from the vocoder and the corresponding audio-generation parameters.
The second embodiment is as follows (see fig. 1). It differs from the first embodiment in that, in step one, the collected sound signal sample is processed in parallel by frequency channel and an auditory saliency map based on frequency-channel processing is constructed, the specific process being as follows:
convolving the collected sound signal sample with a band-pass filter formed by superimposing 64 Gamma filters to obtain the sound signal responses of 64 frequency channels;
the sound signal response of each frequency channel is then convolved with 8 one-dimensional Gaussian smoothing filters, and the convolution results are down-sampled to obtain representations F_i, i = 1, 2, ..., 8, of each frequency channel's sound signal response on 8 scales, where F_i is the representation on the i-th scale; the F_i are then used to compute the auditory saliency of each frequency channel's sound signal response on the different scales;
the auditory saliency of each frequency channel's sound signal response on the different scales is amplified and normalized to obtain the normalized auditory saliency on each scale;
the normalized auditory saliencies of each frequency channel's sound signal response on the different scales are integrated across scales to construct the auditory saliency map based on frequency-channel processing.
In this embodiment, the computation of the auditory saliency of each frequency channel's sound signal response on the different scales from the F_i uses the existing center-surround difference algorithm.
Let F_cen, cen ∈ {1, 2, 3, 4}, denote the representation of each frequency channel's sound signal response on a center (fine) scale, and F_sur, sur = cen + diff, diff ∈ {2, 3, 4}, denote the representation on a surround (coarse) scale; then
F(cen, sur) = |F_cen - F_sur|
The computed F(cen, sur) are analysed together to obtain the auditory saliency of each frequency channel's sound signal response on the different scales.
the global non-linear amplification algorithm is shown in table 1.
TABLE 1
Figure BDA0002484723730000081
The third embodiment differs from the second in the way the normalized auditory saliencies of each frequency channel's sound signal response on the different scales are integrated across scales to construct the auditory saliency map based on frequency-channel processing. The specific process is as follows.
The cross-scale integration of each channel is computed by a formula (given only as an image in the original document) in which Map'_j denotes the auditory saliency of the j-th frequency channel's sound signal response after cross-scale integration, Map_j denotes the normalized auditory saliency of the j-th channel's response on the different scales, DoG denotes a one-dimensional difference-of-Gaussians filter of length 50 ms, the convolution symbol denotes convolution, and M_nor,j denotes the auditory saliency of the j-th channel's response on the different scales before normalization, j = 1, 2, ..., 64.
The per-channel results are then linearly integrated to obtain the auditory saliency map A based on frequency-channel processing (the formula is given only as an image in the original document).
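The following Python sketch illustrates the saliency computation of the second and third embodiments under simplifying assumptions: the 64 frequency-channel responses produced by the band-pass filter bank are taken as already computed, Gaussian smoothing with increasing widths stands in for the smoothing-and-down-sampling to 8 scales, a simple min-max rescaling stands in for the global non-linear amplification of Table 1, the DoG convolution of the cross-scale integration step is omitted, and the final linear integration is shown as a plain channel average. All names and parameters are illustrative.

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def auditory_saliency(channel_responses, n_scales=8):
        """channel_responses: array of shape (64, T), one row per frequency
        channel. Returns per-channel saliency maps and their linear combination."""
        n_ch, T = channel_responses.shape
        # 1) smooth each channel with Gaussians of increasing width -> 8 scales F_1..F_8
        scales = [gaussian_filter1d(channel_responses, sigma=2 ** i, axis=1)
                  for i in range(n_scales)]
        # 2) center-surround differences F(cen, sur) = |F_cen - F_sur|
        maps = np.zeros((n_ch, T))
        for cen in range(4):                       # fine (center) scales
            for diff in (2, 3, 4):                 # surround offsets
                maps += np.abs(scales[cen] - scales[cen + diff])
        # 3) per-channel rescaling to [0, 1] (stand-in for the global
        #    non-linear amplification of Table 1)
        maps = (maps - maps.min(axis=1, keepdims=True)) / \
               (np.ptp(maps, axis=1, keepdims=True) + 1e-12)
        # 4) linear integration over the 64 channels -> final saliency map A
        return maps, maps.mean(axis=0)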
the fourth concrete implementation mode: the third difference between the present embodiment and the specific embodiment is that: in the second step, the structure of the convolutional neural network model is as follows:
starting from the input end of the convolutional neural network model, the convolutional neural network model sequentially comprises a PCA layer, a convolutional layer 1, a pooling layer 1, a convolutional layer 2, a pooling layer 2 and a full-connection layer.
The fifth embodiment is as follows (see figs. 2 and 3). It differs from the fourth embodiment in the process of inputting the framed sound signals and auditory saliency maps into the convolutional neural network model, extracting the multi-channel features of each frame of the sound signal (the number of feature channels depends on the number of hydrophones used to collect the signal samples), and linearly combining the multi-channel features of the frames in time order to generate the feature matrix of the collected sound signal sample. The specific process is as follows:
the framed auditory saliency maps pass through the PCA layer to obtain the one-dimensional features corresponding to each frame of the auditory saliency map; these features are spliced with the sound signal of the same time span (i.e. the underwater sound signal in the same group), and the spliced features pass in turn through convolution layer 1, pooling layer 1, convolution layer 2, pooling layer 2 and the fully connected layer to obtain the multi-channel features of each frame of the sound signal;
the multi-channel features of the frames are linearly combined in time order to generate the feature matrix of the collected sound signal sample, and the feature matrix is normalized to obtain the normalized feature matrix.
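A hedged TensorFlow sketch of the per-frame feature extractor of this embodiment is given below. The PCA of the saliency frame is assumed to be computed beforehand and aligned to the signal's time length with an extra dense projection (an assumption, since the patent does not state how the PCA output and the signal frame are spliced); layer widths, kernel sizes, the 4-hydrophone channel count and all names are illustrative.

    import tensorflow as tf

    def build_feature_cnn(frame_len=5000, pca_dim=128, n_channels=4, feat_dim=256):
        """Per-frame feature extractor: PCA-reduced saliency frame spliced with
        the raw signal frame, then conv1-pool1-conv2-pool2 and a fully
        connected layer. All sizes are illustrative assumptions."""
        sig_in = tf.keras.Input(shape=(frame_len, n_channels))   # one signal frame
        sal_in = tf.keras.Input(shape=(pca_dim,))                # PCA of saliency frame
        sal = tf.keras.layers.Dense(frame_len)(sal_in)           # align time dimension
        sal = tf.keras.layers.Reshape((frame_len, 1))(sal)
        x = tf.keras.layers.Concatenate(axis=-1)([sig_in, sal])  # feature splicing
        x = tf.keras.layers.Conv1D(32, 9, activation='relu')(x)  # convolution layer 1
        x = tf.keras.layers.MaxPool1D(4)(x)                      # pooling layer 1
        x = tf.keras.layers.Conv1D(64, 9, activation='relu')(x)  # convolution layer 2
        x = tf.keras.layers.MaxPool1D(4)(x)                      # pooling layer 2
        x = tf.keras.layers.Flatten()(x)
        out = tf.keras.layers.Dense(feat_dim, activation='tanh')(x)  # fully connected
        return tf.keras.Model([sig_in, sal_in], out)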
The advantage of this design is that it combines the principal component analysis result of the auditory saliency map with the underwater acoustic signal and performs feature extraction with the convolutional neural network: through two convolutions, two pooling operations and one fully connected operation, the principal component analysis result of the auditory saliency map is used to adjust the feature extraction, and the final output of the fully connected layer is the multi-channel feature of each framed underwater acoustic signal.
Each frame of the underwater sound signal's auditory saliency map contains the sound-producing characteristics of the underwater target along the time dimension. Using the convolution-based network, the multi-channel features of each frame extracted by the convolution layers and the corresponding segment of the auditory saliency map are input into the fully connected layer, which adjusts the convolved multi-channel features according to the saliency distribution of the underwater target's auditory saliency map. Regions with larger values in the two-dimensional time-frequency saliency representation (salient regions) correspond to more prominent features, and the features of these regions of interest are given larger weights during feature extraction, strengthening the underwater sound signal's features; regions with smaller values (smooth regions) correspond to less prominent features, and the features of these non-interest regions are given smaller weights, weakening other noise features. The auditory saliency map thus serves as prior knowledge of auditory attention for the underwater sound signal, and adjusting the convolved multi-channel features with it effectively improves the accuracy of feature extraction. The final output of the fully connected layer is the multi-channel feature of each frame of the underwater target sound signal, adjusted according to the auditory saliency map. Finally, the weighted multi-channel features of all frames of the underwater acoustic signal are linearly combined in time order to generate the final feature matrix, which highlights the characteristics of the underwater acoustic signal well.
The meaning of the value at each position in the matrix varies with its magnitude. Because the values in the obtained feature matrix range from very small to very large, with no fixed value range, experimental computation becomes difficult. In order to scale all values of the feature matrix into a given range, the invention normalizes the feature matrix in the experiments so that all its values lie in the range [0, 1].
The last fully connected layer uses an activation function based on a normalizing function, and its output is the normalized feature matrix. Since the range of the hyperbolic tangent function (tanh) is [-1, 1], it must be adjusted to [0, 1]; the normalizing function obtained by modifying tanh is given by a formula shown only as an image in the original document.
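The modified tanh itself is not reproduced in the text; a plausible form consistent with the description, mapping tanh's [-1, 1] range onto [0, 1], is sketched below as an assumption rather than the patent's exact formula.

    import numpy as np

    def normalized_tanh(x):
        """Assumed form of the modified tanh: rescales the usual [-1, 1]
        output of tanh to the [0, 1] range required for the feature matrix."""
        return (np.tanh(x) + 1.0) / 2.0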
The sixth embodiment differs from the fifth in that the encoder is based on the SEQ2FEA network and the decoder is based on causal convolution.
The structure of the SEQ2FEA network is shown in fig. 6. It consists of several groups of multi-scale one-dimensional convolution filters (convolution kernels), a max-pooling layer, a fixed-size one-dimensional convolution layer, a residual connection and a highway network. The SEQ2FEA network is used to extract high-level feature information from the input text sequence (i.e. the underwater target feature word and environment feature word). The input text sequence is first convolved with several groups of multi-scale one-dimensional convolution kernels, which extract multi-channel features on different scales. To preserve the translation invariance of the feature matrix, the convolution results are fed into a pooling layer for a max-pooling operation. The pooled results are passed to a one-dimensional convolution layer, and the original input is then joined to the multi-channel feature map by a residual connection. Finally, the connected result is input into a highway network to extract higher-level features.
The parameter table of the SEQ2FEA network is shown in Table 2; in conv-k-c-ReLU, conv denotes one-dimensional convolution, k the kernel width, c the number of output channels, and ReLU the activation function of the fully connected layer.
TABLE 2 SEQ2FEA parameters (reproduced only as an image in the original document)
First, the original sequence is input into the multi-scale convolution filter bank. The multi-scale filter bank has K groups, each consisting of N filters F_ij, i ∈ [1, K], j ∈ [1, N], with the same number of channels and the same length. The filters of the first group all have length 1, those of the second group length 2, and so on. Multi-scale convolution filters are used instead of the traditional fixed-scale filter because the filter size is normally a hyper-parameter set before the experiment, and a good size has to be found by manually adjusting it over repeated experiments; a fixed-scale convolution filter rarely reaches an optimal result this way. To address this, the invention uses several groups of multi-scale one-dimensional convolution filters, avoiding manual tuning of the filter sizes: the network learns, in a data-driven way, which filter sizes and how many filters to use, strengthening its ability to adapt to the data. The convolution filters of different sizes are arranged in 3 groups, each consisting of 3 convolution filters of the same size. As an example, feature extraction on two-channel data is illustrated in fig. 7.
Feature extraction on the input sequence with the multi-scale convolution filter bank yields a multi-channel feature map. A max-pooling layer performs max pooling on this feature map to strengthen its translation invariance. The max-pooled multi-channel feature map is then input into the one-dimensional convolution layer, which combines the multi-dimensional features to compute higher-dimensional features and produce a multi-channel feature map. The original input is then joined to this multi-channel feature map by a residual connection; the residual connection is used to counter the vanishing-gradient problem caused by increasing network depth. A sketch of this filter bank is given below.
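A hedged TensorFlow sketch of the multi-scale convolution filter bank, with 3 groups of 3 filters as described above, follows; the channel counts, the pooling configuration and the assumption that the channel dimension of the input is statically known are illustrative choices, not details stated in the patent.

    import tensorflow as tf

    def multiscale_conv_bank(x, n_groups=3, filters_per_group=3, channels=64):
        """K groups of 1-D convolutions; the k-th group uses kernel width k
        (1, 2, 3, ...). Outputs are concatenated, max-pooled, passed through a
        fixed-size 1-D convolution and joined to the input by a residual
        connection. Sizes are illustrative."""
        branches = []
        for k in range(1, n_groups + 1):              # kernel widths 1, 2, 3
            for _ in range(filters_per_group):
                branches.append(
                    tf.keras.layers.Conv1D(channels, k, padding='same',
                                           activation='relu')(x))
        h = tf.keras.layers.Concatenate(axis=-1)(branches)
        h = tf.keras.layers.MaxPool1D(pool_size=2, strides=1, padding='same')(h)
        h = tf.keras.layers.Conv1D(x.shape[-1], 3, padding='same')(h)  # fixed-size conv
        return tf.keras.layers.Add()([x, h])           # residual connection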
The result of the one-dimensional convolution layer, joined with the original input sequence, is then input into a highway network to extract higher-level features. Deep neural networks perform better than shallow ones and have achieved good results in many areas, but as network depth increases so do their problems, notably the well-known vanishing-gradient problem, which makes deep networks difficult to train. A highway network addresses this: it lets information flow smoothly through every layer, effectively alleviating the gradient problem, so that a deep network no longer behaves merely like a shallow one. The highway network is used here precisely to counter the vanishing gradients that come with increasing network depth.
A typical feed-forward neural network consists of L layers, where the l-th layer (l ∈ {1, 2, ..., L}) applies a nonlinear transformation H (parameterized by W_H,l) to its input x_l to produce an output y_l; x_1 is the input to the network and y_L is its output. Omitting the layer index and the bias, the computation is given by equation (1):
y = H(x, W_H)   (1)
H is typically an affine transformation followed by a nonlinear activation function, although other forms are possible in general. For a highway network, two additional nonlinear transformations T(x, W_T) and C(x, W_C) are defined, as shown in equation (2):
y = H(x, W_H)·T(x, W_T) + x·C(x, W_C)   (2)
T can be understood as a transform gate and C as a carry gate, since they determine how much of the output is produced by transforming the input and how much by carrying it through. For simplicity, the invention sets C = 1 - T, as shown in equation (3):
y = H(x, W_H)·T(x, W_T) + x·(1 - T(x, W_T))   (3)
The dimensions of x, y, H(x, W_H) and T(x, W_T) must be identical for equation (4) to hold:
y = x, if T(x, W_T) = 0;   y = H(x, W_H), if T(x, W_T) = 1   (4)
Depending on the output of its transform gate, a highway layer can therefore vary its behaviour smoothly between an ordinary layer and simply passing its input through. Just as an ordinary layer consists of multiple computing units, the i-th of which computes y_i = H_i(x), a highway network consists of multiple blocks, the i-th of which computes a block state H_i(x) and a transform gate output T_i(x) and finally produces the block output y_i = H_i(x)·T_i(x) + x_i·(1 - T_i(x)), which is connected to the next layer. Convolutional highway layers are constructed analogously to fully connected ones, using weight sharing and local receptive fields in both the H and T transformations, with zero padding so that the block state and transform-gate feature maps keep the same size as the input. After this series of feature-extraction steps, the output of the highway network is the high-level feature of the feature matrix.
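A minimal TensorFlow sketch of a highway layer implementing equation (3) is given below; the negative gate-bias initialisation is a common choice from the highway network literature rather than something the patent specifies.

    import tensorflow as tf

    class HighwayLayer(tf.keras.layers.Layer):
        """Highway layer per equation (3): y = H(x)*T(x) + x*(1 - T(x)), with
        C = 1 - T. `units` must equal the last dimension of x so that the
        carried input and the transformed output have matching shapes."""
        def __init__(self, units):
            super().__init__()
            self.H = tf.keras.layers.Dense(units, activation='relu')
            # negative bias so the gate initially favours carrying the input
            self.T = tf.keras.layers.Dense(
                units, activation='sigmoid',
                bias_initializer=tf.keras.initializers.Constant(-1.0))

        def call(self, x):
            t = self.T(x)
            return self.H(x) * t + x * (1.0 - t)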
The seventh embodiment differs from the sixth in the processing of the extracted high-level features to obtain the final encoder representation. The specific process is as follows: after the extracted high-level features are processed by a nonlinearity and convolutions, the result passes through a fully connected layer to generate the vector h_k, and the vector h_v is computed from h_k according to a formula (given only as an image in the original document). The vectors h_k and h_v together form the final representation of the encoder.
The eighth embodiment differs from the seventh in the process of inputting the Mel-scale spectrogram into the post-processing network and generating the waveform of the underwater acoustic signal. The specific process is as follows:
the post-processing network uses WaveNet as the vocoder; the Mel-scale spectrogram output by the decoder is the input of the post-processing network, the audio-generation parameters of the vocoder are computed from it through convolution blocks and a fully connected layer, and the waveform of the generated underwater sound signal is output by the vocoder.
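A hedged TensorFlow sketch of such a post-processing network is shown below; the number of convolution blocks, layer widths and the dimensionality of the vocoder parameters are assumptions, and the WaveNet vocoder itself is treated as an external module that consumes the predicted parameters.

    import tensorflow as tf

    def build_postnet(n_mels=80, n_blocks=3, vocoder_param_dim=512):
        """Post-processing network sketch: the decoder's mel-scale spectrogram
        passes through a series of (non-causal) convolution blocks and two
        fully connected layers that predict the vocoder's audio-generation
        parameters. All sizes are assumptions."""
        mel = tf.keras.Input(shape=(None, n_mels))
        x = mel
        for _ in range(n_blocks):
            x = tf.keras.layers.Conv1D(256, 5, padding='same', activation='relu')(x)
        x = tf.keras.layers.Dense(256, activation='relu')(x)     # fully connected 1
        params = tf.keras.layers.Dense(vocoder_param_dim)(x)     # fully connected 2
        # `params` conditions an external WaveNet vocoder, which then emits the
        # underwater-sound waveform (the vocoder itself is not shown here).
        return tf.keras.Model(mel, params)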
Experimental verification
1. Experimental data set
The experiments were evaluated and validated on different types of audio data sets: a Songhua River data set, an anechoic pool data set and an estuary (river-mouth) reservoir data set. For each data set, an underwater target feature dictionary was constructed with the proposed method and the environmental-information acoustic model was trained; the experimentally generated results were then compared with the original audio under a common metric.
(1) Songhua river data set
The Songhua River data set contains hull sound data recorded under the conditions "8 horsepower 100 m", "8 horsepower 160 m", "8 horsepower 200 m", "8 horsepower 260 m", "8 horsepower 350 m", "90% power 150 m", "90% power 260 m", "90% power 350 m" and "large ship salvage buoy". The experimental conditions being limited, the audio of the fishing-boat motor under the different conditions is used as the different underwater target sounds. At the 50000 Hz sampling rate, repeated experiments showed a 100 ms frame length to work best. The audio is therefore divided into 100 ms segments: the "8 horsepower 100 m" condition contains 9214 sub-audio segments, "8 horsepower 160 m" 12280, "8 horsepower 200 m" 9216, "8 horsepower 260 m" 12288, "8 horsepower 350 m" 15360, "90% power 150 m" 12288, "90% power 260 m" 12288, "90% power 350 m" 21504, and "large ship salvage buoy" 9216.
In total there are 113654 segments of 100 ms audio data, split into 95000 training samples and 18654 test samples; each audio sample is labeled from 1 to 9 according to its condition. The audio samples are then fed into the model for training, and the model is evaluated on the test set.
(2) Anechoic pool data set
The anechoic pool data set contains hull sound data recorded under the conditions "0power", "20power", "50power", "80power", "work0", "work20", "work50" and "work80". The experimental conditions being limited, this data set was collected in the anechoic pool of an underwater acoustics laboratory, and the audio data under the different conditions are used as the different underwater target sounds. At the 65536 Hz sampling rate, repeated experiments showed an 80 ms frame length to work best. The audio is therefore divided into 80 ms segments, and the audio under each condition yields 4500 sub-audio segments of 80 ms.
In total there are 36000 segments of 80 ms audio data, split into 30600 training samples and 5400 test samples; each audio sample is labeled from 1 to 8 according to its condition. The audio samples are then fed into the model for training, and the model is evaluated on the test set.
(3) River mouth reservoir data set
The estuary reservoir data set consists of audio data collected by the laboratory at an estuary reservoir in Deqing County, Zhejiang Province. According to the recording conditions, the audio is divided into "rainy morning", "clear morning", "rainy afternoon" and "clear afternoon" sets; since water temperature also strongly affects the ambient sound, the recording time is likewise used as a criterion for the different environmental conditions. At the 48000 Hz sampling rate, repeated experiments showed a 100 ms frame length to work best. The audio is therefore divided into 100 ms segments, and the ambient sound collected under each of the four conditions contains 12000 sub-audio segments of 100 ms.
In total there are 48000 segments of 100 ms audio data; each audio sample is labeled from 1 to 4 according to its condition. The environment audio samples are fed into the model for training, and the model is evaluated on the test set.
Besides the pure environmental sound, the sound of a small electric wooden boat sailing in the reservoir on the same days was also collected. The boat has two speeds, high and low, of 2 m/s and 1 m/s respectively, and the sound emitted at these two speeds is used as the underwater target sound data sets for the different conditions. After framing, the high-speed and low-speed data sets each contain 12000 sub-audio segments of 100 ms, giving 24000 segments of 100 ms audio data in total, split into 20400 training samples and 3600 test samples. Each audio sample is labeled 1 or 2 according to its condition.
2. Experiment platform and environment
The feature extraction model used in the experiments is built with Google's open-source tool TensorFlow, and the auditory saliency computation model is added to the feature extraction process to improve the auditory saliency of the features. The model used in the experiments is the target-environment feature-based sound generation model, improved on the basis of existing models. The software environment used for the experiments is shown in Table 3.
TABLE 3 Software environment (reproduced only as an image in the original document)
The hardware environment used for the experiment is shown in table 4.
TABLE 4 Hardware environment (reproduced only as an image in the original document)
3.1 Underwater target acoustic feature dictionary construction experiment process
The main experimental process for constructing the underwater sound signal feature dictionary based on the auditory attention mechanism is as follows:
(1) the underwater sound signal data is input to the auditory significance calculation model through a band-pass filter.
(2) An auditory saliency map is generated from the saliency computation model.
(3) Feature extraction is performed by CNN convolution using the auditory saliency map as a priori knowledge.
(4) And generating a feature dictionary according to the feature matrix.
In the experiment, the underwater target sound signal is first divided into audio frames under the given conditions; repeated experimental comparison led to a frame length of 100 ms per audio frame. The framed underwater sound signals are convolved with the band-pass filter to obtain the sound signal responses of the individual frequency channels. The auditory saliency of each frequency channel's response on the different scales is then computed with the center-surround difference algorithm, the saliencies on each frequency channel are integrated across scales, and finally the per-channel saliencies are combined by linear integration into the auditory saliency map of the underwater sound signal.
The next step of the experiment is to input the generated auditory saliency map of the underwater sound signal into the convolutional neural network built for the experiment; after the convolution operations on the saliency map, which expresses the characteristics of the underwater sound signal, the feature matrix of the underwater sound signal is generated. For convenience of computation, the matrix is normalized and then multiplied by 255 to obtain a feature matrix with values between 0 and 255.
3.2 Songhuajiang data set Experimental results and analysis
Aiming at the pinhuajiang data set, a sound signal with the power of 90% being 150m is selected as the underwater sound signal of the experiment, the time-frequency graph and the auditory saliency map of the underwater sound signal are shown in fig. 13 and 14, and as is obvious from fig. 13 and 14, the auditory saliency map strengthens the time-frequency characteristics of the noticed target on the basis of the time-frequency graph and weakens other meaningless characteristics.
In the experiment, sound signals under four different conditions of 8 horsepower 100m, 8 horsepower 350m, 90% power 150m and a large ship fishing buoy are respectively selected to construct feature dictionaries, and the feature dictionaries constructed by different types of sound signals are respectively shown in fig. 15, 16, 17 and 18.
As can be seen from fig. 15 to 18, the underwater acoustic signal feature dictionary visually reflects the time-frequency structure of different underwater target sound signals, and each feature word (atom) in the feature dictionary represents the local time-domain and frequency-domain features of the corresponding sound signal. Different types of sound signals have different acoustic feature structures, so the feature dictionaries obtained by calculation also differ.
In order to verify that the feature dictionary based on the auditory attention mechanism outperforms the conventional feature dictionary, the audio data set collected on the Songhua River is used as the experimental training data set, and the underwater sound signals under the conditions of "8 horsepower 100 m", "8 horsepower 160 m", "8 horsepower 200 m", "8 horsepower 260 m", "8 horsepower 350 m", "90% power 150 m", "90% power 260 m", "90% power 350 m" and "large boat fishing buoy" are labeled with "1" to "9". The accuracy of underwater target multi-classification is then compared between the underwater sound signal feature dictionary constructed by the invention based on the auditory attention mechanism and the feature dictionary constructed by the traditional method; two-class and three-class experiments are selected.
The feature dictionary constructed by the traditional method achieves slightly lower underwater target recognition accuracy than the acoustic signal feature dictionary based on the auditory attention mechanism. The experiment demonstrates that introducing an auditory saliency computation model, which simulates the auditory attention processing mechanism of the human ear, into the construction of the feature dictionary focuses attention on the underwater target sound of interest and effectively distinguishes the target sound from the other noise in the collected underwater target audio data. This reduces the interference of the other noise on the experiment and verifies the feasibility of introducing an auditory attention mechanism into the construction of the underwater sound signal feature dictionary.
3.3 Anechoic pool data set experimental results and analysis
For the anechoic pool data set, the sound signal under the "50 power" condition is selected as the underwater target sound signal for the experiment. Its time-frequency diagram and auditory saliency map are shown in fig. 19 and 20; as is evident from the two figures, the auditory saliency map strengthens the time-frequency features of the attended target on the basis of the time-frequency diagram and weakens other meaningless features.
In the experiment, sound signals under four different conditions ("0power", "80power", "work0" and "work80") are selected to construct feature dictionaries; the feature dictionaries constructed from the different types of sound signals are shown in fig. 21 to 24.
As can be seen from fig. 21 to 24, for the anechoic pool data set, the underwater sound signal feature dictionaries obtained under different conditions also reflect the time-frequency structure of the different underwater target sound signals.
In order to verify once more the superiority of the feature dictionary based on the auditory attention mechanism over the conventional feature dictionary, the audio data set collected in the anechoic pool is used as the experimental training data set, and the sound signals under the conditions of "0 power", "20 power", "50 power", "80 power", "work 0", "work 20", "work 50" and "work 80" are labeled with "1" to "8". The accuracy of underwater target multi-classification is compared between the underwater sound signal feature dictionary constructed based on the auditory attention mechanism and the feature dictionary constructed by the traditional method. The experimental model is a LeNet model whose input is a 13 × 13 feature matrix from the feature dictionary.
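A minimal LeNet-style classifier sketch in TensorFlow/Keras for the comparison experiment described above; it assumes 13 × 13 single-channel feature matrices and eight classes, and the filter counts and dense-layer size are illustrative rather than the exact configuration used in the experiments.

```python
import tensorflow as tf

def build_lenet(num_classes=8):
    """LeNet-style network for 13 x 13 single-channel feature matrices."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(13, 13, 1)),
        tf.keras.layers.Conv2D(6, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(84, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_lenet()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```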
Compared with the underwater sound signal feature dictionary based on the auditory attention mechanism, the feature dictionary constructed by the traditional method does not differ greatly in this case: some experimental results have equal accuracy and others are slightly lower. Because the audio collected in the anechoic pool records ship radiation noise after interfering noise has been suppressed, and the data set was noise-filtered in advance, the classification results are already helped by the data themselves, so the effect of the auditory attention mechanism is not especially pronounced, although it is still present. The experiment therefore still demonstrates the feasibility of adding an auditory attention mechanism to the construction of the underwater sound signal feature dictionary.
4.1 Experimental process of sound generation based on target-environment information features
The experimental flow of sound generation based on underwater target-environment information characteristics is as follows:
(1) An underwater target id and an environment id are input.
(2) The corresponding feature matrices are found in the feature dictionary according to the ids.
(3) The SEQ2FEA network converts the multi-channel feature matrices into high-level features that serve as the internal representation of the encoder, and the decoder is trained with causal convolution to generate the corresponding Mel feature spectrogram.
(4) Waveform data fitting the underwater target and environmental information features are output.
In the experiment, the underwater acoustic signal feature dictionary based on the auditory attention mechanism is adopted as the pronunciation dictionary of the constructed sound generation model. Because the dictionary contains the name of each underwater target (environment) and its corresponding feature matrix, once the text information of an underwater target and an environment is input, the model can look up the sounding feature matrices corresponding to the underwater target and the environment in the feature dictionary according to the input text. These feature matrices are passed through the SEQ2FEA network to the decoder, which predicts the Mel representation of the corresponding features through causal convolution, predicts the parameters of the WaveNet vocoder, and generates the final sound signal waveform.
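A minimal sketch of the lookup-and-generate flow described above. The feature dictionary is modelled as a plain Python dictionary mapping target and environment identifiers to feature matrices; the SEQ2FEA encoder, the causal-convolution decoder and the WaveNet vocoder are represented by placeholder callables, which are hypothetical stand-ins rather than the patented networks.

```python
import numpy as np

feature_dictionary = {
    "target:electric_boat_high_speed": np.random.rand(13, 13),   # placeholder matrices
    "env:clear_morning": np.random.rand(13, 13),
}

def lookup_features(target_id, env_id, dictionary):
    """Fetch the feature matrices for a target id and an environment id and
    stack them into one multi-channel input for the encoder."""
    return np.stack([dictionary[target_id], dictionary[env_id]], axis=-1)  # (13, 13, 2)

def generate_waveform(target_id, env_id, dictionary, encoder, decoder, vocoder):
    feats = lookup_features(target_id, env_id, dictionary)
    hidden = encoder(feats)   # SEQ2FEA stand-in: multi-channel features -> high-level features
    mel = decoder(hidden)     # causal-convolution stand-in: high-level features -> Mel spectrogram
    return vocoder(mel)       # WaveNet stand-in: Mel spectrogram -> waveform samples

# Usage with trivial stand-in callables in place of trained networks.
waveform = generate_waveform(
    "target:electric_boat_high_speed", "env:clear_morning", feature_dictionary,
    encoder=lambda f: f.mean(axis=-1),
    decoder=lambda h: np.tile(h.mean(axis=0), (80, 1)),
    vocoder=lambda m: np.random.randn(16000),
)
print(waveform.shape)
```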
4.2 Experimental results and analysis
In the experiment, the sound signal collected in the estuary reservoir data set under the "clear morning" condition is used as the environmental characteristic sound, and the sound signals of an electric small wooden boat under the "high-speed" and "low-speed" conditions are used as two groups of different underwater target sounds. Two groups of cross-validation experiments are carried out with the two groups of underwater target sounds and the one group of environmental characteristic sounds, simulating and generating sounds under the "clear morning-high speed" and "clear morning-low speed" conditions, respectively.
Because the weather on the day the estuary reservoir data were collected was changeable, sound data were gathered under many conditions; original audio exists for the two conditions above, so a similarity comparison experiment can be carried out between the generated audio and the original audio. The similarity comparison consists of two experiments: a time-domain comparison, which uses the DTW (dynamic time warping) algorithm to compare the time-domain features of the original audio and the generated audio, and a frequency-domain comparison, which compares the frequency-domain features on the Lofar images of the original audio and the generated audio.
The time-domain comparison experiment has two steps. In the first step, the original audio and the generated audio are converted into feature matrices, and each column of a feature matrix is reduced to one-dimensional sequences of its maximum, minimum and mean values for plotting and comparison. In the second step, the DTW algorithm calculates the distance between the two groups of one-dimensional sequences; if the distance is smaller than a certain threshold (here 0.9), the two time series are considered similar. The threshold in the DTW distance experiment is determined dynamically from the values of the two sequences, and the value of 0.9 was obtained through experimental verification.
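A minimal sketch of the time-domain comparison, assuming placeholder feature matrices: each column is collapsed to its maximum, minimum and mean to form one-dimensional sequences, and a textbook DTW implementation measures the distance between the original and generated sequences against the 0.9 threshold mentioned above.

```python
import numpy as np

def column_profiles(feature_matrix):
    """Return the per-column maximum, minimum and mean as three 1-D sequences."""
    return (feature_matrix.max(axis=0),
            feature_matrix.min(axis=0),
            feature_matrix.mean(axis=0))

def dtw_distance(a, b):
    """Classic dynamic time warping distance, length-normalized."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)

original = np.random.rand(13, 13)                     # placeholder feature matrices
generated = original + 0.05 * np.random.rand(13, 13)

threshold = 0.9
for name, (o, g) in zip(("max", "min", "mean"),
                        zip(column_profiles(original), column_profiles(generated))):
    d = dtw_distance(o, g)
    print(f"{name}: DTW distance = {d:.3f}, similar = {d < threshold}")
```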
Experiments show that the one-dimensional representation of the generated audio feature matrix is approximately consistent with that of the original audio feature matrix, although small differences remain in local regions; these may be caused by the lossy feature extraction involved in the series of operations that convert the audio image into a matrix during the experiment.
Table 5. DTW-based one-dimensional sequence distance measurement results
As can be seen from Table 5, the distances between the generated audio and the corresponding original audio under the "clear morning-high speed" and "clear morning-low speed" conditions are both much smaller than the 0.9 threshold on the one-dimensional sequences, so the similarity between the two groups of sequences is very high. This further verifies the feasibility of the method of the invention.
The frequency-domain comparison experiment also has two steps: in the first step, the original audio and the generated audio are converted into the corresponding Lofar images; in the second step, the feature distributions of the two in the frequency domain are observed through image comparison.
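A minimal sketch of the Lofar-image step, assuming that a Lofar display can be approximated by a short-time spectrogram shown in dB; the sampling rate, window length and test signals are illustrative placeholders.

```python
import numpy as np
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

fs = 16000                                              # assumed sampling rate
t = np.arange(fs * 2) / fs
original = np.sin(2 * np.pi * 300 * t) + 0.1 * np.random.randn(t.size)   # placeholder signals
generated = np.sin(2 * np.pi * 300 * t) + 0.2 * np.random.randn(t.size)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, sig, title in zip(axes, (original, generated), ("original", "generated")):
    f, tt, Sxx = spectrogram(sig, fs=fs, nperseg=1024, noverlap=512)
    ax.pcolormesh(tt, f, 10 * np.log10(Sxx + 1e-12), shading="auto")      # dB scale
    ax.set_title(f"Lofar image ({title})")
    ax.set_xlabel("time [s]")
axes[0].set_ylabel("frequency [Hz]")
plt.tight_layout()
plt.show()
```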
The experiments show that the frequency components of the generated signal and the original signal are substantially matched, although the generated signal contains some noise frequency components that are not present in the original signal. These noise components are few and have little influence on the sound signal as a whole, which shows that the sound generation model performs well in sound generation and further verifies the feasibility of the method of the invention.
The above-described calculation examples of the present invention are merely to explain the calculation model and the calculation flow of the present invention in detail, and are not intended to limit the embodiments of the present invention. It will be apparent to those skilled in the art that other variations and modifications of the present invention can be made based on the above description, and it is not intended to be exhaustive or to limit the invention to the precise form disclosed, and all such modifications and variations are possible and contemplated as falling within the scope of the invention.

Claims (7)

1. A sound generation method based on underwater target and environmental information features is characterized by comprising the following steps:
step one, for an underwater target S1, after a sound signal sample of the underwater target is collected, the collected sound signal sample is processed in parallel according to a frequency channel, and an auditory significance map based on frequency channel processing is constructed;
step two, framing the collected sound signal samples and the constructed auditory saliency maps according to time domain to obtain a plurality of groups of sound signals and auditory saliency maps, wherein the time lengths of the sound signals in each group are the same as those of the auditory saliency maps;
inputting the framed sound signals and the auditory saliency map into a convolutional neural network model, and extracting multi-channel characteristics corresponding to each frame of sound signals; carrying out linear combination on the multi-channel characteristics corresponding to each frame of sound signals according to a time sequence to generate a characteristic matrix of the collected sound signal samples;
step three, collecting M-1 sound signal samples of the underwater target S1, and respectively carrying out the processing of the step one and the step two on each collected sound signal sample to obtain a characteristic matrix corresponding to each sound signal sample;
superposing feature matrixes corresponding to all sound signal samples of the underwater target S1 according to channels to obtain a feature matrix corresponding to the underwater target S1;
step four, repeating the process from the step one to the step three for other underwater targets and underwater environment conditions to obtain a characteristic matrix corresponding to other underwater targets and a characteristic matrix corresponding to each underwater environment condition;
step five, respectively establishing the mapping relation between each underwater target characteristic word and the corresponding characteristic matrix and the mapping relation between each underwater environment characteristic word and the corresponding characteristic matrix, and forming a characteristic dictionary according to all the mapping relations;
step six, establishing a sound generation model, wherein the sound generation model comprises an encoder, a decoder and a post-processing network;
for underwater sound signals of a certain target to be generated in a certain environment, respectively finding out feature matrixes corresponding to underwater target feature words and underwater environment feature words from a feature dictionary according to underwater target feature words and underwater environment feature words corresponding to the underwater sound signals to be generated;
inputting the underwater target feature words and the underwater environment feature words into an encoder, combining the encoder with the found feature matrix, extracting high-level features from the underwater target feature words and the underwater environment feature words corresponding to the underwater acoustic signals to be generated, and processing the extracted high-level features to obtain the final representation of the encoder;
inputting the final representation of the encoder into a decoder, and outputting a Mel scale spectrogram by the decoder; and inputting the Mel scale spectrogram into a post-processing network, and generating the waveform of the underwater acoustic signal through the post-processing network.
2. The method for generating the sound based on the underwater target and the environmental information characteristics according to claim 1, wherein in the first step, the collected sound signal samples are processed in parallel according to frequency channels to construct the auditory saliency map based on the frequency channel processing, and the specific process is as follows:
convolving the collected sound signal sample with a band-pass filter formed by superposing 64 Gamma filters to obtain the sound signal responses of 64 frequency channels;
then convolving the sound signal response of each frequency channel with 8 one-dimensional Gaussian smoothing filters to obtain convolution results; down-sampling the convolution results to obtain representations F_i, i = 1, 2, …, 8, of the sound signal response of each frequency channel on 8 scales, and then using F_i to calculate the auditory significance of the sound signal response of each frequency channel on different scales;
amplifying and normalizing the auditory significance of the sound signal response of each frequency channel on different scales to obtain the auditory significance of the sound signal response of each frequency channel on different scales after normalization;
and performing cross-scale integration on the auditory significance of each normalized frequency channel sound signal response on different scales to construct an auditory significance map based on frequency channel processing.
3. The method for generating sound based on underwater target and environmental information features according to claim 2, wherein the auditory significance of each normalized frequency channel sound signal response on different scales is integrated across scales to construct an auditory significance map based on frequency channel processing, and the specific process is as follows:
Map_j' = Map_j ⊗ DoG,   j = 1, 2, …, 64
in the formula, Map_j' represents the auditory significance of the j-th frequency channel sound signal response after cross-scale integration, Map_j represents the auditory significance of the normalized j-th frequency channel sound signal response on different scales, obtained from M_nor,j through the amplification and normalization of claim 2, DoG represents a one-dimensional difference-of-Gaussians filter with a length of 50 ms, ⊗ stands for convolution, and M_nor,j represents the auditory significance of the j-th frequency channel sound signal response on different scales before normalization, where j = 1, 2, …, 64;
the Map_j' of all frequency channels are then linearly integrated to obtain the auditory saliency map A based on frequency channel processing:
A = Σ_{j=1}^{64} Map_j'.
4. the method according to claim 3, wherein in the second step, the convolutional neural network model has a structure:
starting from the input end of the convolutional neural network model, the convolutional neural network model sequentially comprises a PCA layer, a convolutional layer 1, a pooling layer 1, a convolutional layer 2, a pooling layer 2 and a full-connection layer.
5. The method for generating sound based on underwater target and environmental information characteristics according to claim 4, wherein in the second step, the framed sound signals and the auditory saliency map are input into a convolutional neural network model, multichannel characteristics corresponding to each frame of sound signals are extracted, and the multichannel characteristics corresponding to each frame of sound signals are linearly combined according to a time sequence to generate a characteristic matrix of the collected sound signal samples; the specific process comprises the following steps:
the framed auditory saliency maps pass through a PCA layer to obtain one-dimensional characteristics corresponding to each frame of auditory saliency map, the one-dimensional characteristics and sound signals with the same time dimension are subjected to characteristic splicing, and the spliced characteristics are sequentially subjected to operations of a convolutional layer 1, a pooling layer 1, a convolutional layer 2, a pooling layer 2 and a full connection layer to obtain multi-channel characteristics corresponding to each frame of sound signals;
and carrying out linear combination on the multi-channel characteristics corresponding to each frame of sound signals according to a time sequence to generate a characteristic matrix of the collected sound signal samples, and carrying out normalization processing on the characteristic matrix to obtain a normalized characteristic matrix.
6. The method as claimed in claim 5, wherein the encoder is an encoder based on the SEQ2FEA network, and the decoder is a decoder based on causal convolution.
7. The method as claimed in claim 6, wherein the Mel scale spectrogram is input into a post-processing network, and the post-processing network generates the waveform of the underwater acoustic signal by the following steps:
the post-processing network adopts WaveNet as the vocoder; the Mel scale spectrogram output by the decoder is used as the input of the post-processing network, the audio generation parameters of the vocoder are obtained from the Mel scale spectrogram through the calculation of a convolution block and a fully-connected layer, and the waveform of the generated underwater sound signal is output through the vocoder.
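A minimal sketch of the post-processing step described in claim 7, assuming a Keras model in which convolution blocks and a fully-connected layer map the decoder's Mel-scale spectrogram to vocoder conditioning parameters; wavenet_vocoder is a hypothetical stand-in callable rather than a real library API, and the frame counts and layer sizes are illustrative.

```python
import numpy as np
import tensorflow as tf

N_MELS, N_FRAMES, N_PARAMS = 80, 200, 64        # assumed dimensions

post_net = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_FRAMES, N_MELS)),
    tf.keras.layers.Conv1D(128, 5, padding="same", activation="tanh"),   # convolution block 1
    tf.keras.layers.Conv1D(128, 5, padding="same", activation="tanh"),   # convolution block 2
    tf.keras.layers.Dense(N_PARAMS),             # fully-connected layer -> vocoder parameters
])

def wavenet_vocoder(params):
    """Hypothetical stand-in for a WaveNet vocoder: returns a dummy waveform."""
    return np.random.randn(params.shape[1] * 256)

mel = np.random.rand(1, N_FRAMES, N_MELS).astype("float32")   # decoder output placeholder
params = post_net(mel)                                         # (1, N_FRAMES, N_PARAMS)
waveform = wavenet_vocoder(params.numpy())
print(waveform.shape)
```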
CN202010387814.0A 2020-05-09 2020-05-09 Sound generation method based on underwater target and environmental information characteristics Active CN111627419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010387814.0A CN111627419B (en) 2020-05-09 2020-05-09 Sound generation method based on underwater target and environmental information characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010387814.0A CN111627419B (en) 2020-05-09 2020-05-09 Sound generation method based on underwater target and environmental information characteristics

Publications (2)

Publication Number Publication Date
CN111627419A CN111627419A (en) 2020-09-04
CN111627419B true CN111627419B (en) 2022-03-22

Family

ID=72259889

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010387814.0A Active CN111627419B (en) 2020-05-09 2020-05-09 Sound generation method based on underwater target and environmental information characteristics

Country Status (1)

Country Link
CN (1) CN111627419B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163461B (en) * 2020-09-07 2022-07-05 中国海洋大学 Underwater target identification method based on multi-mode fusion
CN112652293A (en) * 2020-12-24 2021-04-13 上海优扬新媒信息技术有限公司 Speech synthesis model training and speech synthesis method, device and speech synthesizer
CN113096692A (en) * 2021-03-19 2021-07-09 招商银行股份有限公司 Voice detection method and device, equipment and storage medium
CN113299298B (en) * 2021-05-06 2022-09-13 成都数联云算科技有限公司 Residual error unit, network and target identification method, system, device and medium
CN113327604A (en) * 2021-07-02 2021-08-31 因诺微科技(天津)有限公司 Ultrashort speech language identification method
CN114091531A (en) * 2021-11-12 2022-02-25 哈尔滨工程大学 Multi-scale-based environmental feature extraction method
CN114636995A (en) * 2022-03-16 2022-06-17 中国水产科学研究院珠江水产研究所 Underwater sound signal detection method and system based on deep learning
CN117711423A (en) * 2024-02-05 2024-03-15 西北工业大学 Mixed underwater sound signal separation method combining auditory scene analysis and deep learning


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7711322B2 (en) * 2005-06-15 2010-05-04 Wireless Fibre Systems Underwater communications system and method
US9715873B2 (en) * 2014-08-26 2017-07-25 Clearone, Inc. Method for adding realism to synthetic speech

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120087099A (en) * 2011-01-27 2012-08-06 엘아이지넥스원 주식회사 Ocean/Sound environmental data generating Apparatus and Method to generate vertical sound velocity structure of the ocean
CN107332642A (en) * 2017-07-11 2017-11-07 厦门声戎科技有限公司 The low probability of intercept waveform design method synthesized based on ocean mammal signal
CN107545890A (en) * 2017-08-31 2018-01-05 桂林电子科技大学 A kind of sound event recognition method
CN110931045A (en) * 2019-12-20 2020-03-27 重庆大学 Audio feature generation method based on convolutional neural network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
a-PNCC: Noise Processing Method For Underwater Target Recognition; Nianbin Wang; Computers, Materials & Continua; 2019-12-31; 169-181 *
Multi-target echo synthesis in underwater acoustic environment; 2013 Ninth International Conference on Natural Computation (ICNC); 2014-05-19; 1403-1407 *
Synthesis from underwater data: application to the oceanic discrete tomography; L. Cros; Europe Oceans 2005; 2005-10-03; 507-510 *
Deep cross-modal environmental sound synthesis; Cheng Haonan; Journal of Computer-Aided Design & Computer Graphics; 2019-12-31; 2047-2055 *
Fast dimensionality-reduction convolution model for underwater target recognition; Wang Nianbin; Journal of Harbin Engineering University; 2019-07-31; 1327-1333 *

Also Published As

Publication number Publication date
CN111627419A (en) 2020-09-04

Similar Documents

Publication Publication Date Title
CN111627419B (en) Sound generation method based on underwater target and environmental information characteristics
CN109410917B (en) Voice data classification method based on improved capsule network
CN110188343B (en) Multi-mode emotion recognition method based on fusion attention network
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN110245608B (en) Underwater target identification method based on half tensor product neural network
CN112364779B (en) Underwater sound target identification method based on signal processing and deep-shallow network multi-model fusion
CN110992987B (en) Parallel feature extraction system and method for general specific voice in voice signal
CN110751044B (en) Urban noise identification method based on deep network migration characteristics and augmented self-coding
CN109637522B (en) Speech emotion recognition method for extracting depth space attention features based on spectrogram
CN110600047A (en) Perceptual STARGAN-based many-to-many speaker conversion method
CN111540367B (en) Voice feature extraction method and device, electronic equipment and storage medium
CN112151030B (en) Multi-mode-based complex scene voice recognition method and device
CN110853656B (en) Audio tampering identification method based on improved neural network
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
Wei et al. A method of underwater acoustic signal classification based on deep neural network
CN114973062A (en) Multi-modal emotion analysis method based on Transformer
CN115602152B (en) Voice enhancement method based on multi-stage attention network
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN112183582A (en) Multi-feature fusion underwater target identification method
CN113129908B (en) End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion
Ju et al. A new low SNR underwater acoustic signal classification method based on intrinsic modal features maintaining dimensionality reduction
CN113593537B (en) Voice emotion recognition method and device based on complementary feature learning framework
CN115472182A (en) Attention feature fusion-based voice emotion recognition method and device of multi-channel self-encoder
CN114818789A (en) Ship radiation noise identification method based on data enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant