CN117612513A - Deep learning-based dolphin sound generation method - Google Patents

Deep learning-based dolphin sound generation method

Info

Publication number
CN117612513A
CN117612513A (application number CN202410091532.4A)
Authority
CN
China
Prior art keywords
output
feature
time
characteristic
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410091532.4A
Other languages
Chinese (zh)
Other versions
CN117612513B (en)
Inventor
冯子仪
尹晓峰
张培珍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Ocean University
Original Assignee
Guangdong Ocean University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Ocean University filed Critical Guangdong Ocean University
Priority to CN202410091532.4A priority Critical patent/CN117612513B/en
Publication of CN117612513A publication Critical patent/CN117612513A/en
Application granted granted Critical
Publication of CN117612513B publication Critical patent/CN117612513B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a dolphin sound generation method based on deep learning, belonging to the technical field of sound synthesis. By decomposing the original dolphin sound, each EMD original signal contains only part of the characteristics of the original dolphin sound, the data volume is smaller, and the complexity is lower, so the precision of the generated dolphin sound can be improved.

Description

Deep learning-based dolphin sound generation method
Technical Field
The invention relates to the technical field of sound synthesis, in particular to a dolphin sound generation method based on deep learning.
Background
In recent years, studies have attempted to simulate dolphin sound generation using artificial intelligence technology. For example, a dolphin call generation method based on a generative adversarial network (GAN) has been proposed, which generates dolphin calls by training a discriminator model and a generator model. This technology not only helps to deepen the understanding of the acoustic language of dolphins, but also provides a direction for artificially synthesizing dolphin sounds.
The sound emitted by dolphins has the following characteristics: it is high-frequency and wide-band, consists of a series of complex and varying tones, and serves sonar and echolocation functions. Dolphin sound therefore carries a large amount of information and has high-frequency, wide-band characteristics; if the existing dolphin sound generation method is adopted, with dolphin sound samples used directly as training samples, the dolphin sound generated by the trained generator model has low precision.
Disclosure of Invention
Aiming at the above defects in the prior art, the deep learning-based dolphin sound generation method provided by the invention solves the problem that the dolphin sound generated by existing dolphin sound generation methods has low precision.
In order to achieve the aim of the invention, the invention adopts the following technical scheme: a dolphin sound generation method based on deep learning comprises the following steps:
s1, decomposing original dolphin sound by adopting an EMD decomposition algorithm to obtain a plurality of EMD original signals;
s2, extracting singular values from each EMD original signal, and constructing a singular value feature matrix;
s3, extracting time-frequency characteristics of each EMD original signal, and constructing a time-frequency characteristic matrix;
s4, processing the singular value feature matrix and the time-frequency feature matrix by adopting a deep learning model to generate an EMD estimation signal;
s5, combining the EMD estimation signals to generate dolphin sound.
The beneficial effects of the invention are as follows: the method decomposes the original dolphin sound with an EMD decomposition algorithm to obtain a plurality of EMD original signals, each of which contains part of the sound characteristics; singular values and time-frequency features are extracted to obtain the sound characteristics of each EMD original signal; a deep learning model processes the singular value feature matrix and the time-frequency feature matrix to generate EMD estimation signals; and the EMD estimation signals are superposed and combined to obtain the dolphin sound. Because the original dolphin sound is decomposed, each EMD original signal contains only part of the characteristics of the original dolphin sound, the data volume is smaller, and the complexity is lower, so the precision of the generated dolphin sound can be improved.
Further, the step S2 includes the following sub-steps:
s21, constructing a corresponding track matrix according to each EMD original signal;
s22, performing singular value decomposition on the track matrix to obtain a singular value eigenvector;
s23, constructing a singular value feature matrix according to the singular value feature vector, wherein A = a^T a, A is the singular value feature matrix, a is the singular value feature vector, and T is the transposition operation.
Further, the step S3 includes the following sub-steps:
s31, extracting time domain features of each EMD original signal, wherein the time domain features comprise: peak-to-peak, skewness, kurtosis, and form factor;
s32, carrying out frequency domain transformation on each EMD original signal to obtain a frequency domain signal;
s33, extracting frequency domain characteristics of the frequency domain signals, wherein the frequency domain characteristics comprise: spectrum amplitude mean value, spectrum amplitude center of gravity, power spectrum density and cepstrum coefficient;
s34, constructing a time-frequency feature vector by taking the time-domain features and the frequency-domain features as elements;
s35, constructing a time-frequency feature matrix according to the time-frequency feature vector, wherein B = b^T b, B is the time-frequency feature matrix, b is the time-frequency feature vector, and T is the transposition operation.
The beneficial effects of the above further scheme are: according to the invention, the sound characteristics of the original dolphin sound are expressed through the singular value feature matrix and the time-frequency feature matrix, so that the data volume of the EMD original signals is reduced.
Further, the deep learning model in S4 includes: a singular value feature processing network, a time-frequency feature processing network, a feature splicing unit, a first LSTM unit, a Concat layer, a second LSTM unit and a plurality of output units;
The input end of the singular value feature processing network is used for inputting the singular value feature matrix, and its output end is connected with the first input end of the feature splicing unit; the input end of the time-frequency feature processing network is used for inputting the time-frequency feature matrix, and its output end is connected with the second input end of the feature splicing unit; the output end of the feature splicing unit is connected with the input ends of the plurality of Cell modules in the first LSTM unit; the input ends of the Concat layer are connected with the output ends of the plurality of Cell modules in the first LSTM unit, and the output ends of the Concat layer are connected with the input ends of the plurality of Cell modules in the second LSTM unit; the input end of each output unit is connected with the output end of one Cell module in the second LSTM unit, and the output end of each output unit serves as an output end of the deep learning model; each Cell module in the first LSTM unit is used for inputting one feature value from the output features of the feature splicing unit; and each Cell module in the second LSTM unit is used for inputting the output feature of the Concat layer.
The beneficial effects of the above further scheme are: according to the method, a singular value feature processing network and a time-frequency feature processing network are arranged to process the singular value feature matrix and the time-frequency feature matrix respectively, realizing further feature extraction, and a feature splicing unit extracts and splices the features of the two networks. Each Cell module in the first LSTM unit processes one feature value, so that, using the memory of the LSTM, the first LSTM unit better captures the relationship between the feature values in the output features of the feature splicing unit. The Concat layer splices the outputs of the first LSTM unit into one vector, which is input to every Cell module in the second LSTM unit, so the second LSTM unit considers not only the output of the previous Cell module in the second LSTM unit but also the combined output of the first LSTM unit, which improves the precision of the generated dolphin sound.
Further, the singular value feature processing network and the time-frequency feature processing network have the same structure, and both the singular value feature processing network and the time-frequency feature processing network comprise: a first convolution block, a second convolution block, a third convolution block, a first upsampling layer, a second upsampling layer, a third upsampling layer, an adder A1 and a feature saliency processing layer;
The input end of the first convolution block serves as the input end of the singular value feature processing network or the time-frequency feature processing network, and its output end is connected with the input end of the first up-sampling layer and the input end of the second convolution block respectively; the output end of the second convolution block is connected with the input end of the second up-sampling layer and the input end of the third convolution block respectively; the output end of the third convolution block is connected with the input end of the third up-sampling layer; the input ends of the adder A1 are connected with the output ends of the first, second and third up-sampling layers respectively, and the output end of the adder A1 is connected with the input end of the feature saliency processing layer; and the output end of the feature saliency processing layer serves as the output end of the singular value feature processing network or the time-frequency feature processing network.
The beneficial effects of the above further scheme are: in the invention, three convolution blocks extract features step by step, up-sampling layers are arranged at the features of different depths to perform up-sampling and enrich the data volume of the features at each depth, the adder A1 then performs fusion processing, and the feature saliency processing layer highlights the salient features.
Further, the expression of the feature saliency processing layer is:
wherein x_{i,z} is the i-th feature value output by the feature saliency processing layer, x_i is the i-th feature value input to the feature saliency processing layer, x_max is the maximum feature value among the output features of the adder A1, and i is a positive integer.
The beneficial effects of the above further scheme are: the feature saliency processing layer normalizes the input feature values on the one hand and highlights the salient features on the other, so that large feature values are distinguished from small feature values more markedly.
Further, the expression of the feature splicing unit is:
wherein H is the output feature of the feature splicing unit, Maxpool is the max pooling operation, Avgpool is the average pooling operation, X_Q is the output feature of the singular value feature processing network, X_S is the output feature of the time-frequency feature processing network, and the product in the expression is the Hadamard product.
The beneficial effects of the above further scheme are: when the features are spliced, the max pooling operation and the average pooling operation extract the maximum feature and the average feature respectively, further simplifying the features.
Further, the expression of the output unit is:
wherein y_k is the output value of the k-th output unit, h_{j,k} is the j-th feature value output by the corresponding Cell module in the second LSTM unit, w_{j,k} is the weight of h_{j,k}, b_{j,k} is the bias of h_{j,k}, N is the number of feature values output by a Cell module in the second LSTM unit, j and k are positive integers, and sigmoid is the activation function.
The beneficial effects of the above further scheme are: in the invention, the output unit considers all the feature values output by its Cell module in the second LSTM unit and applies a sigmoid activation function, outputting the amplitude value of one discrete point of the EMD estimation signal.
Further, the loss function of the deep learning model in the step S4 during training is as follows:
wherein L is the loss function, y_k is the output value of the k-th output unit, g_k is the k-th label value, the plurality of output values y_k form an EMD estimation signal, r_k is the k-th training parameter, M is the number of output values, and k is a positive integer.
Further, the expression of the kth training parameter is:
where exp is an exponential function based on a natural constant.
The beneficial effects of the above further scheme are: the deep learning model outputs a plurality of values y_k, and these output values together form an EMD estimation signal. Because the deep learning model uses two layers of LSTM and LSTM training takes a long time, the training parameters are set to strengthen the loss function and accelerate the training of the deep learning model, thereby shortening the training time.
Drawings
FIG. 1 is a flow chart of a dolphin sound generation method based on deep learning;
FIG. 2 is a schematic diagram of a deep learning model;
fig. 3 is a schematic structural diagram of a singular value feature processing network or a time-frequency feature processing network.
Detailed Description
The following description of the embodiments of the invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments; for those of ordinary skill in the art, all inventions that make use of the inventive concept are within the protection scope defined by the spirit and scope of the invention in the appended claims.
As shown in fig. 1, a dolphin sound generation method based on deep learning includes the following steps:
s1, decomposing original dolphin sound by adopting an EMD decomposition algorithm to obtain a plurality of EMD original signals;
s2, extracting singular values from each EMD original signal, and constructing a singular value feature matrix;
s3, extracting time-frequency characteristics of each EMD original signal, and constructing a time-frequency characteristic matrix;
s4, processing the singular value feature matrix and the time-frequency feature matrix by adopting a deep learning model to generate an EMD estimation signal;
s5, combining the EMD estimation signals to generate dolphin sound.
In this embodiment, a singular value feature matrix and a time-frequency feature matrix of an EMD original signal correspondingly generate an EMD estimation signal.
In this embodiment, a plurality of EMD estimation signals are combined, specifically: the co-located amplitude values are added.
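As an illustration of S1 and S5, the following Python sketch decomposes a recorded signal into EMD original signals and recombines estimated signals by adding co-located amplitude values. The use of the third-party PyEMD package (EMD-signal) is an assumption; the patent does not name a specific EMD implementation.

```python
# Sketch of S1 (decomposition) and S5 (combination). PyEMD is an assumed dependency.
import numpy as np
from PyEMD import EMD  # assumed: pip install EMD-signal

def decompose_dolphin_sound(signal: np.ndarray) -> np.ndarray:
    """Decompose the original dolphin sound into EMD original signals (IMFs)."""
    emd = EMD()
    imfs = emd.emd(signal)          # shape: (num_imfs, num_samples)
    return imfs

def combine_estimated_signals(estimated_imfs: np.ndarray) -> np.ndarray:
    """Combine the EMD estimation signals by adding co-located amplitude values."""
    return np.sum(estimated_imfs, axis=0)
```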
The step S2 comprises the following sub-steps:
s21, constructing a corresponding track matrix according to each EMD original signal;
s22, performing singular value decomposition on the track matrix to obtain a singular value eigenvector;
s23, constructing a singular value feature matrix according to the singular value feature vector, wherein A = a^T a, A is the singular value feature matrix, a is the singular value feature vector, and T is the transposition operation.
The singular value feature vector consists of the singular values.
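A minimal numpy sketch of s21 to s23 is given below. The window length of the trajectory (Hankel) matrix is an assumption made for illustration only, since the embodiment does not fix it.

```python
# Sketch of s21-s23 for one EMD original signal.
import numpy as np

def singular_value_feature_matrix(imf: np.ndarray, window: int = 64) -> np.ndarray:
    # s21: trajectory matrix whose rows are lagged segments of the signal
    n_rows = len(imf) - window + 1
    track = np.stack([imf[i:i + window] for i in range(n_rows)])
    # s22: singular value decomposition; the singular values form the feature vector a
    a = np.linalg.svd(track, compute_uv=False)
    # s23: singular value feature matrix A = a^T a (outer product of the vector)
    return np.outer(a, a)
```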
The step S3 comprises the following substeps:
s31, extracting time domain features of each EMD original signal, wherein the time domain features comprise: peak-to-peak, skewness, kurtosis, and form factor;
s32, carrying out frequency domain transformation on each EMD original signal to obtain a frequency domain signal;
s33, extracting frequency domain characteristics of the frequency domain signals, wherein the frequency domain characteristics comprise: spectrum amplitude mean value, spectrum amplitude center of gravity, power spectrum density and cepstrum coefficient;
s34, constructing a time-frequency feature vector by taking the time-domain features and the frequency-domain features as elements;
s35, constructing a time-frequency feature matrix according to the time-frequency feature vector, wherein B = b^T b, B is the time-frequency feature matrix, b is the time-frequency feature vector, and T is the transposition operation.
According to the invention, the sound characteristics of the original dolphin sound are expressed through the singular value feature matrix and the time-frequency feature matrix, so that the data volume of the EMD original signals is reduced.
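The following numpy/scipy sketch illustrates s31 to s35 for a single EMD original signal. The patent names the features but not their exact formulas, so the concrete definitions used here (form factor as RMS divided by mean absolute value, a mean power spectral density value, and the first cepstral coefficient) are assumptions.

```python
# Sketch of s31-s35; feature definitions are common-usage assumptions.
import numpy as np
from scipy import signal as sps
from scipy.stats import skew, kurtosis

def time_frequency_feature_matrix(imf: np.ndarray, fs: float) -> np.ndarray:
    # s31: time-domain features
    peak_to_peak = np.ptp(imf)
    form_factor = np.sqrt(np.mean(imf ** 2)) / (np.mean(np.abs(imf)) + 1e-12)
    time_feats = [peak_to_peak, skew(imf), kurtosis(imf), form_factor]

    # s32: frequency-domain transform
    spectrum = np.abs(np.fft.rfft(imf))
    freqs = np.fft.rfftfreq(len(imf), d=1.0 / fs)

    # s33: frequency-domain features
    amp_mean = spectrum.mean()                              # spectrum amplitude mean
    amp_centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)
    _, psd = sps.welch(imf, fs=fs)
    psd_mean = psd.mean()                                   # scalar PSD summary (assumption)
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-12))
    cep_coeff = cepstrum[1]                                 # first cepstral coefficient (assumption)
    freq_feats = [amp_mean, amp_centroid, psd_mean, cep_coeff]

    # s34-s35: time-frequency feature vector b and matrix B = b^T b
    b = np.array(time_feats + freq_feats)
    return np.outer(b, b)
```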
As shown in fig. 2, the deep learning model in S4 includes: a singular value feature processing network, a time-frequency feature processing network, a feature splicing unit, a first LSTM unit, a Concat layer, a second LSTM unit and a plurality of output units;
The input end of the singular value feature processing network is used for inputting the singular value feature matrix, and its output end is connected with the first input end of the feature splicing unit; the input end of the time-frequency feature processing network is used for inputting the time-frequency feature matrix, and its output end is connected with the second input end of the feature splicing unit; the output end of the feature splicing unit is connected with the input ends of the plurality of Cell modules in the first LSTM unit; the input ends of the Concat layer are connected with the output ends of the plurality of Cell modules in the first LSTM unit, and the output ends of the Concat layer are connected with the input ends of the plurality of Cell modules in the second LSTM unit; the input end of each output unit is connected with the output end of one Cell module in the second LSTM unit, and the output end of each output unit serves as an output end of the deep learning model; each Cell module in the first LSTM unit is used for inputting one feature value from the output features of the feature splicing unit; and each Cell module in the second LSTM unit is used for inputting the output feature of the Concat layer.
According to the method, a singular value feature processing network and a time-frequency feature processing network are arranged to process the singular value feature matrix and the time-frequency feature matrix respectively, realizing further feature extraction, and a feature splicing unit extracts and splices the features of the two networks. Each Cell module in the first LSTM unit processes one feature value, so that, using the memory of the LSTM, the first LSTM unit better captures the relationship between the feature values in the output features of the feature splicing unit. The Concat layer splices the outputs of the first LSTM unit into one vector, which is input to every Cell module in the second LSTM unit, so the second LSTM unit considers not only the output of the previous Cell module in the second LSTM unit but also the combined output of the first LSTM unit, which improves the precision of the generated dolphin sound.
In the invention, the Cell modules in each LSTM unit are connected in the same manner as in the prior art. The input of the current Cell module in the first LSTM unit includes: one feature value from the output features of the feature splicing unit and the output of the previous Cell module connected with the current Cell module in the first LSTM unit.
The input of the current Cell module in the second LSTM unit includes: the output of the Concat layer and the output of the previous Cell module connected with the current Cell module in the second LSTM unit.
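A minimal PyTorch sketch of this first LSTM unit, Concat layer and second LSTM unit chain is shown below. The hidden sizes, the number of output points, the use of nn.LSTMCell for the unrolled Cell modules, and the shared output-unit weights are illustrative assumptions, not details taken from the patent.

```python
# Sketch of the two-stage LSTM chain; sizes and sharing are assumptions.
import torch
import torch.nn as nn

class TwoStageLSTM(nn.Module):
    def __init__(self, seq_len: int, hidden1: int = 32, hidden2: int = 64, m_outputs: int = 256):
        super().__init__()
        self.cell1 = nn.LSTMCell(input_size=1, hidden_size=hidden1)                 # first LSTM unit
        self.cell2 = nn.LSTMCell(input_size=seq_len * hidden1, hidden_size=hidden2)  # second LSTM unit
        self.m_outputs = m_outputs
        # one output unit per step of the second LSTM (weights shared here for brevity)
        self.out = nn.Sequential(nn.Linear(hidden2, 1), nn.Sigmoid())

    def forward(self, spliced: torch.Tensor) -> torch.Tensor:
        # spliced: (batch, seq_len) feature values from the feature splicing unit
        batch, seq_len = spliced.shape
        h1 = spliced.new_zeros(batch, self.cell1.hidden_size)
        c1 = spliced.new_zeros(batch, self.cell1.hidden_size)
        states = []
        for t in range(seq_len):                        # one Cell module per feature value
            h1, c1 = self.cell1(spliced[:, t:t + 1], (h1, c1))
            states.append(h1)
        concat = torch.cat(states, dim=1)               # Concat layer: one vector

        h2 = spliced.new_zeros(batch, self.cell2.hidden_size)
        c2 = spliced.new_zeros(batch, self.cell2.hidden_size)
        outputs = []
        for _ in range(self.m_outputs):                 # every Cell gets the same Concat vector
            h2, c2 = self.cell2(concat, (h2, c2))
            outputs.append(self.out(h2))                # amplitude of one discrete point
        return torch.cat(outputs, dim=1)                # (batch, M) EMD estimation signal
```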
As shown in fig. 3, the singular value feature processing network and the time-frequency feature processing network have the same structure, and each of them includes: a first convolution block, a second convolution block, a third convolution block, a first upsampling layer, a second upsampling layer, a third upsampling layer, an adder A1 and a feature saliency processing layer;
The input end of the first convolution block serves as the input end of the singular value feature processing network or the time-frequency feature processing network, and its output end is connected with the input end of the first up-sampling layer and the input end of the second convolution block respectively; the output end of the second convolution block is connected with the input end of the second up-sampling layer and the input end of the third convolution block respectively; the output end of the third convolution block is connected with the input end of the third up-sampling layer; the input ends of the adder A1 are connected with the output ends of the first, second and third up-sampling layers respectively, and the output end of the adder A1 is connected with the input end of the feature saliency processing layer; and the output end of the feature saliency processing layer serves as the output end of the singular value feature processing network or the time-frequency feature processing network.
In this embodiment, each convolution block includes a convolutional layer, a BN layer, and a ReLU layer.
In the invention, three convolution blocks extract features step by step, up-sampling layers are arranged at the features of different depths to perform up-sampling and enrich the data volume of the features at each depth, the adder A1 then performs fusion processing, and the feature saliency processing layer highlights the salient features.
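The following PyTorch sketch mirrors the structure of fig. 3. The channel counts, kernel sizes, strides, the common up-sampled size, and the placeholder saliency operation (division by the maximum feature value, since the patent's exact saliency expression is not reproduced here) are assumptions.

```python
# Sketch of the singular value / time-frequency feature processing network.
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # convolution block = convolutional layer + BN layer + ReLU layer
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class FeatureProcessingNet(nn.Module):
    def __init__(self, out_size: int = 16):
        super().__init__()
        self.block1 = conv_block(1, 8)
        self.block2 = conv_block(8, 8)
        self.block3 = conv_block(8, 8)
        # up-sampling layers bring the three depths back to a common spatial size
        self.up1 = nn.Upsample(size=(out_size, out_size))
        self.up2 = nn.Upsample(size=(out_size, out_size))
        self.up3 = nn.Upsample(size=(out_size, out_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.block1(x)
        f2 = self.block2(f1)
        f3 = self.block3(f2)
        fused = self.up1(f1) + self.up2(f2) + self.up3(f3)       # adder A1
        # placeholder saliency step: normalise by the maximum feature value of A1's output
        return fused / (fused.amax(dim=(1, 2, 3), keepdim=True) + 1e-12)
```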
The expression of the feature saliency processing layer is as follows:
wherein x_{i,z} is the i-th feature value output by the feature saliency processing layer, x_i is the i-th feature value input to the feature saliency processing layer, x_max is the maximum feature value among the output features of the adder A1, and i is a positive integer.
The feature saliency processing layer normalizes the input feature values on the one hand and highlights the salient features on the other, so that large feature values are distinguished from small feature values more markedly.
The expression of the feature splicing unit is as follows:
wherein H is the output feature of the feature splicing unit, Maxpool is the max pooling operation, Avgpool is the average pooling operation, X_Q is the output feature of the singular value feature processing network, X_S is the output feature of the time-frequency feature processing network, and the product in the expression is the Hadamard product.
When the features are spliced, the max pooling operation and the average pooling operation extract the maximum feature and the average feature respectively, further simplifying the features.
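Because the exact splicing expression is not reproduced here, the sketch below shows one plausible reading as an assumption only: the Hadamard product of the two network outputs, followed by max pooling and average pooling, flattened into the feature-value sequence fed to the first LSTM unit.

```python
# Assumed reading of the feature splicing unit, for illustration only.
import torch
import torch.nn.functional as F

def feature_splice(x_q: torch.Tensor, x_s: torch.Tensor) -> torch.Tensor:
    # x_q: output of the singular value feature processing network
    # x_s: output of the time-frequency feature processing network
    hadamard = x_q * x_s                                   # Hadamard product
    pooled = F.max_pool2d(hadamard, 2) + F.avg_pool2d(hadamard, 2)
    return pooled.flatten(start_dim=1)                     # feature value sequence H
```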
The expression of the output unit is:
wherein y_k is the output value of the k-th output unit, h_{j,k} is the j-th feature value output by the corresponding Cell module in the second LSTM unit, w_{j,k} is the weight of h_{j,k}, b_{j,k} is the bias of h_{j,k}, N is the number of feature values output by a Cell module in the second LSTM unit, j and k are positive integers, and sigmoid is the activation function.
In the invention, the output unit considers all the feature values output by its Cell module in the second LSTM unit and applies a sigmoid activation function, outputting the amplitude value of one discrete point of the EMD estimation signal.
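A sketch of one output unit under this description follows; treating b_{j,k} as a simple additive term inside the weighted sum is an assumption made for illustration.

```python
# Sketch of a single output unit: weighted sum of the N feature values plus sigmoid.
import torch
import torch.nn as nn

class OutputUnit(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.w = nn.Parameter(torch.randn(n_features))   # w_{j,k}
        self.b = nn.Parameter(torch.zeros(n_features))   # b_{j,k}

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, N) feature values from one Cell module of the second LSTM unit
        return torch.sigmoid((self.w * h + self.b).sum(dim=1))   # y_k
```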
In this embodiment, the deep learning model needs to be trained before use. The training samples are the singular value feature matrices and time-frequency feature matrices of the EMD original signals, and the training sample labels are the amplitude value vectors formed by the amplitude values of each EMD original signal obtained after decomposing the original dolphin sound.
The loss function of the deep learning model during training is as follows:
wherein L is the loss function, y_k is the output value of the k-th output unit, g_k is the k-th label value, the plurality of output values y_k form an EMD estimation signal, r_k is the k-th training parameter, M is the number of output values, and k is a positive integer.
The expression of the kth training parameter is:
where exp is an exponential function based on a natural constant.
The deep learning model outputs a plurality of values y_k, and these output values together form an EMD estimation signal. Because the deep learning model of the invention uses two layers of LSTM and LSTM training takes a long time, the training parameters are set to strengthen the loss function and accelerate the training of the deep learning model, thereby shortening the training time.
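The exact expressions for L and r_k are not reproduced here, so the sketch below is purely an assumed form consistent with the description: a weighted squared error whose training parameter r_k grows exponentially with the current error, strengthening the loss.

```python
# Assumed loss form, for illustration only; not the patent's exact expressions.
import torch

def assumed_loss(y: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    # y: (batch, M) output values y_k, g: (batch, M) label values g_k
    r = torch.exp(torch.abs(y - g)).detach()     # assumed training parameters r_k
    return torch.mean(r * (y - g) ** 2)          # assumed form of L
```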
According to the method, an EMD decomposition algorithm is adopted to decompose the original dolphin sound into a plurality of EMD original signals, each containing part of the sound characteristics; singular values and time-frequency features are extracted to obtain the sound characteristics of each EMD original signal; a deep learning model processes the singular value feature matrix and the time-frequency feature matrix to generate EMD estimation signals; and the EMD estimation signals are superposed and combined to obtain the dolphin sound. Because the original dolphin sound is decomposed, each EMD original signal contains only part of the characteristics of the original dolphin sound, the data volume is smaller, and the complexity is lower, so the precision of the generated dolphin sound can be improved.
The above is only a preferred embodiment of the invention and is not intended to limit the invention; various modifications and variations can be made to the invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the invention shall be included in the protection scope of the invention.

Claims (10)

1. A dolphin sound generation method based on deep learning, characterized by comprising the following steps:
s1, decomposing original dolphin sound by adopting an EMD decomposition algorithm to obtain a plurality of EMD original signals;
s2, extracting singular values from each EMD original signal, and constructing a singular value feature matrix;
s3, extracting time-frequency characteristics of each EMD original signal, and constructing a time-frequency characteristic matrix;
s4, processing the singular value feature matrix and the time-frequency feature matrix by adopting a deep learning model to generate an EMD estimation signal;
s5, combining the EMD estimation signals to generate dolphin sound.
2. The deep learning based dolphin sound generation method of claim 1 wherein said S2 comprises the sub-steps of:
s21, constructing a corresponding track matrix according to each EMD original signal;
s22, performing singular value decomposition on the track matrix to obtain a singular value eigenvector;
s23, constructing a singular value feature matrix according to the singular value feature vector, wherein A = a^T a, A is the singular value feature matrix, a is the singular value feature vector, and T is the transposition operation.
3. The deep learning based dolphin sound generation method of claim 1 wherein said S3 comprises the sub-steps of:
s31, extracting time domain features of each EMD original signal, wherein the time domain features comprise: peak-to-peak, skewness, kurtosis, and form factor;
s32, carrying out frequency domain transformation on each EMD original signal to obtain a frequency domain signal;
s33, extracting frequency domain characteristics of the frequency domain signals, wherein the frequency domain characteristics comprise: spectrum amplitude mean value, spectrum amplitude center of gravity, power spectrum density and cepstrum coefficient;
s34, constructing a time-frequency feature vector by taking the time-domain features and the frequency-domain features as elements;
s35, constructing a time-frequency feature matrix according to the time-frequency feature vector, wherein B = b^T b, B is the time-frequency feature matrix, b is the time-frequency feature vector, and T is the transposition operation.
4. The deep learning-based dolphin sound generation method of claim 1, wherein said deep learning model in S4 comprises: a singular value feature processing network, a time-frequency feature processing network, a feature splicing unit, a first LSTM unit, a Concat layer, a second LSTM unit and a plurality of output units;
the input end of the singular value feature processing network is used for inputting a singular value feature matrix, and the output end of the singular value feature processing network is connected with the first input end of the feature splicing unit; the input end of the time-frequency characteristic processing network is used for inputting a time-frequency characteristic matrix, and the output end of the time-frequency characteristic processing network is connected with the second input end of the characteristic splicing unit; the output ends of the characteristic splicing units are respectively connected with the input ends of a plurality of Cell modules in the first LSTM unit; the input ends of the Concat layer are respectively connected with the output ends of the plurality of Cell modules in the first LSTM unit, and the output ends of the Concat layer are respectively connected with the input ends of the plurality of Cell modules in the second LSTM unit; the input end of each output unit is connected with the output end of one Cell module in the second LSTM unit, and the output end of each output unit is used as the output end of the deep learning model; each Cell module in the first LSTM unit is used for inputting a characteristic value in the output characteristics of the characteristic splicing unit; each Cell module in the second LSTM unit is configured to input an output feature of the Concat layer.
5. The deep learning-based dolphin sound generation method of claim 4, wherein said singular value feature processing network and said time-frequency feature processing network are identical in structure, each comprising: a first convolution block, a second convolution block, a third convolution block, a first upsampling layer, a second upsampling layer, a third upsampling layer, an adder A1 and a feature saliency processing layer;
the input end of the first convolution block is used as the input end of a singular value feature processing network or a time-frequency feature processing network, and the output end of the first convolution block is respectively connected with the input end of the first up-sampling layer and the input end of the second convolution block; the output end of the second convolution block is respectively connected with the input end of the second up-sampling layer and the input end of the third convolution block; the output end of the third convolution block is connected with the input end of the third up-sampling layer; the input end of the adder A1 is respectively connected with the output end of the first upsampling layer, the output end of the second upsampling layer and the output end of the third upsampling layer, and the output end of the adder A1 is connected with the input end of the characteristic significant processing layer; and the output end of the characteristic significant processing layer is used as the output end of a singular value characteristic processing network or a time-frequency characteristic processing network.
6. The deep learning based dolphin sound generation method of claim 5 wherein said feature saliency processing layer is expressed as:
wherein x_{i,z} is the i-th feature value output by the feature saliency processing layer, x_i is the i-th feature value input to the feature saliency processing layer, x_max is the maximum feature value among the output features of the adder A1, and i is a positive integer.
7. The deep learning based dolphin sound generation method of claim 4, wherein said feature splicing unit has the expression:
wherein H is the output feature of the feature splicing unit, Maxpool is the max pooling operation, Avgpool is the average pooling operation, X_Q is the output feature of the singular value feature processing network, X_S is the output feature of the time-frequency feature processing network, and the product in the expression is the Hadamard product.
8. The deep learning based dolphin sound generation method of claim 4 wherein said output unit has the expression:
wherein y_k is the output value of the k-th output unit, h_{j,k} is the j-th feature value output by the corresponding Cell module in the second LSTM unit, w_{j,k} is the weight of h_{j,k}, b_{j,k} is the bias of h_{j,k}, N is the number of feature values output by a Cell module in the second LSTM unit, j and k are positive integers, and sigmoid is the activation function.
9. The deep learning-based dolphin sound generation method of claim 1, wherein said deep learning model in S4 has a loss function during training of:
wherein L is the loss function, y_k is the output value of the k-th output unit, g_k is the k-th label value, the plurality of output values y_k form an EMD estimation signal, r_k is the k-th training parameter, M is the number of output values, and k is a positive integer.
10. The deep learning based dolphin sound generation method of claim 9 wherein said kth training parameter is expressed as:
where exp is an exponential function based on a natural constant.
CN202410091532.4A 2024-01-23 2024-01-23 Deep learning-based dolphin sound generation method Active CN117612513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410091532.4A CN117612513B (en) 2024-01-23 2024-01-23 Deep learning-based dolphin sound generation method

Publications (2)

Publication Number Publication Date
CN117612513A true CN117612513A (en) 2024-02-27
CN117612513B CN117612513B (en) 2024-04-26

Family

ID=89960238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410091532.4A Active CN117612513B (en) 2024-01-23 2024-01-23 Deep learning-based dolphin sound generation method

Country Status (1)

Country Link
CN (1) CN117612513B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192504A (en) * 2021-04-29 2021-07-30 浙江大学 Domain-adaptation-based silent voice attack detection method
CN116863959A (en) * 2023-09-04 2023-10-10 哈尔滨工业大学(威海) Dolphin sound generating method based on generating countermeasure network
KR20230167617A (en) * 2022-06-02 2023-12-11 계명대학교 산학협력단 Method and apparatus for generating multi-scale feature-based transformer models

Also Published As

Publication number Publication date
CN117612513B (en) 2024-04-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant