CN117612513A - Deep learning-based dolphin sound generation method - Google Patents
- Publication number: CN117612513A (application CN202410091532.4A)
- Authority: CN (China)
- Prior art keywords: output, feature, time, characteristic, frequency
- Legal status: Granted (the legal status is an assumption and not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L13/027: Concept to speech synthesisers; generation of natural phrases from machine-based concepts
- G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
- G10L25/03: Speech or voice analysis characterised by the type of extracted parameters
- G10L25/30: Speech or voice analysis characterised by the analysis technique using neural networks
- G06N3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/045: Combinations of networks
- G06N3/0464: Convolutional networks [CNN, ConvNet]
- G06N3/08: Learning methods
Abstract
The invention discloses a deep learning-based dolphin sound generation method, belonging to the technical field of sound synthesis. By decomposing the original dolphin sound, each EMD original signal contains only part of the characteristics of the original sound; the data volume is therefore smaller and the complexity lower, which improves the precision of the generated dolphin sound.
Description
Technical Field
The invention relates to the technical field of sound synthesis, and in particular to a deep learning-based dolphin sound generation method.
Background
In recent years, studies have attempted to simulate dolphin sound generation using artificial intelligence. For example, a dolphin call generation method based on a generative adversarial network has been proposed, which generates dolphin calls by training a discriminator model and a generator model. This technology not only helps researchers understand the dolphin's acoustic language in depth, but also points a direction for artificially synthesizing dolphin sounds.
The sounds emitted by dolphins have the following characteristics: they are high-frequency and wide-band, consist of a series of complex and varying tones, and serve sonar and echolocation. Dolphin sound therefore carries a large amount of information across a high, wide frequency band; if an existing generation method is used and the dolphin sound samples are fed in directly as training samples, the dolphin sound produced by the trained generator model has low precision.
Disclosure of Invention
To address the defects in the prior art, the deep learning-based dolphin sound generation method provided by the invention solves the problem that dolphin sounds generated by existing methods have low precision.
To achieve this aim, the invention adopts the following technical scheme. A deep learning-based dolphin sound generation method comprises the following steps:
s1, decomposing original dolphin sound by adopting an EMD decomposition algorithm to obtain a plurality of EMD original signals;
s2, extracting singular values from each EMD original signal, and constructing a singular value feature matrix;
s3, extracting time-frequency characteristics of each EMD original signal, and constructing a time-frequency characteristic matrix;
s4, processing the singular value feature matrix and the time-frequency feature matrix by adopting a deep learning model to generate an EMD estimation signal;
s5, combining the EMD estimation signals to generate dolphin sound.
The beneficial effects of the invention are as follows. The method decomposes the original dolphin sound with an EMD decomposition algorithm to obtain a plurality of EMD original signals, each containing part of the sound characteristics. Singular values and time-frequency features are extracted to capture the sound characteristics of each EMD original signal; a deep learning model processes the singular value feature matrix and the time-frequency feature matrix to generate EMD estimation signals, which are superposed and combined into the dolphin sound. Because the original dolphin sound is decomposed, each EMD original signal contains only part of its characteristics, so the data volume is smaller, the complexity is lower, and the precision of the generated dolphin sound can be improved.
Further, the step S2 includes the following sub-steps:
s21, constructing a corresponding track matrix according to each EMD original signal;
s22, performing singular value decomposition on the track matrix to obtain a singular value eigenvector;
s23, constructing a singular value feature matrix according to the singular value feature vector, wherein A=a T a, wherein A is a singular value eigenvector, a is a singular value eigenvector, and T is a transposition operation.
Further, the step S3 includes the following sub-steps:
s31, extracting time domain features of each EMD original signal, wherein the time domain features comprise: peak-to-peak, skewness, kurtosis, and form factor;
s32, carrying out frequency domain transformation on each EMD original signal to obtain a frequency domain signal;
s33, extracting frequency domain characteristics of the frequency domain signals, wherein the frequency domain characteristics comprise: spectrum amplitude mean value, spectrum amplitude center of gravity, power spectrum density and cepstrum coefficient;
s34, constructing a time-frequency feature vector by taking the time-domain features and the frequency-domain features as elements;
s35, constructing a time-frequency characteristic matrix according to the time-frequency characteristic vector, wherein B=b T B, wherein B is a time-frequency characteristic matrix, and B is a time-frequency characteristic vectorT is a transpose operation.
The beneficial effects of the above further scheme are: the singular value feature matrix and the time-frequency feature matrix express the sound characteristics of the original dolphin sound, reducing the data volume of the EMD original signals.
Further, the deep learning model in S4 includes: the system comprises a singular value feature processing network, a time-frequency feature processing network, a feature splicing unit, a first LSTM unit, a Concat layer, a second LSTM unit and a plurality of output units;
the input end of the singular value feature processing network is used for inputting a singular value feature matrix, and the output end of the singular value feature processing network is connected with the first input end of the feature splicing unit; the input end of the time-frequency characteristic processing network is used for inputting a time-frequency characteristic matrix, and the output end of the time-frequency characteristic processing network is connected with the second input end of the characteristic splicing unit; the output ends of the characteristic splicing units are respectively connected with the input ends of a plurality of Cell modules in the first LSTM unit; the input ends of the Concat layer are respectively connected with the output ends of the plurality of Cell modules in the first LSTM unit, and the output ends of the Concat layer are respectively connected with the input ends of the plurality of Cell modules in the second LSTM unit; the input end of each output unit is connected with the output end of one Cell module in the second LSTM unit, and the output end of each output unit is used as the output end of the deep learning model; each Cell module in the first LSTM unit is used for inputting a characteristic value in the output characteristics of the characteristic splicing unit; each Cell module in the second LSTM unit is configured to input an output feature of the Concat layer.
The beneficial effects of the above further scheme are as follows. The singular value feature processing network and the time-frequency feature processing network process the singular value feature matrix and the time-frequency feature matrix respectively, achieving further feature extraction, and the feature splicing unit splices the features extracted by the two networks. Each Cell module in the first LSTM unit processes one feature value, so the memory of the LSTM lets the first LSTM unit better model the relationships among the feature values output by the feature splicing unit. The Concat layer splices the outputs of the first LSTM unit into one vector and feeds it to every Cell module in the second LSTM unit; each Cell module in the second LSTM unit therefore considers not only the output of its preceding Cell module but also the combined output of the first LSTM unit, improving the precision of the generated dolphin sound.
Further, the singular value feature processing network and the time-frequency feature processing network have the same structure, and both the singular value feature processing network and the time-frequency feature processing network comprise: a first convolution block, a second convolution block, a third convolution block, a first upsampling layer, a second upsampling layer, a third upsampling layer, an adder A1 and a feature saliency processing layer;
the input end of the first convolution block is used as the input end of a singular value feature processing network or a time-frequency feature processing network, and the output end of the first convolution block is respectively connected with the input end of the first up-sampling layer and the input end of the second convolution block; the output end of the second convolution block is respectively connected with the input end of the second up-sampling layer and the input end of the third convolution block; the output end of the third convolution block is connected with the input end of the third up-sampling layer; the input end of the adder A1 is respectively connected with the output end of the first upsampling layer, the output end of the second upsampling layer and the output end of the third upsampling layer, and the output end of the adder A1 is connected with the input end of the characteristic significant processing layer; and the output end of the characteristic significant processing layer is used as the output end of a singular value characteristic processing network or a time-frequency characteristic processing network.
The beneficial effects of the above further scheme are: three convolution blocks extract features step by step; an up-sampling layer at each depth up-samples the features, enriching the data volume of the features at different depths; the adder A1 then fuses them, and the feature saliency processing layer emphasizes the salient features.
Further, the expression of the feature saliency processing layer is:
,
where x_{i,z} is the i-th eigenvalue output by the feature saliency processing layer, x_i is the i-th eigenvalue input to the feature saliency processing layer, x_max is the maximum eigenvalue in the output features of the adder A1, and i is a positive integer.
The beneficial effects of the above further scheme are: the feature saliency processing layer normalizes the input feature values on the one hand, and emphasizes the salient features on the other, so that large feature values are distinguished more sharply from small ones.
Further, the expression of the feature stitching unit is:
,
where H is the output feature of the feature splicing unit, maxpool is the maximum pooling operation, avgpool is the average pooling operation, X_Q is the output feature of the singular value feature processing network, X_S is the output feature of the time-frequency feature processing network, and ⊙ denotes the Hadamard product.
The beneficial effects of the above further scheme are: during feature splicing, the maximum pooling and average pooling operations extract the maximum feature and the average feature respectively, further condensing the features.
Further, the expression of the output unit is:
,
where y_k is the output value of the k-th output unit, h_{j,k} is the j-th eigenvalue output by the Cell module in the second LSTM unit, w_{j,k} is the weight of h_{j,k}, b_{j,k} is the bias of h_{j,k}, N is the number of eigenvalues output by a Cell module in the second LSTM unit, j and k are positive integers, and sigmoid is the activation function.
The beneficial effects of the above further scheme are: the output unit considers all the feature values output by each Cell module in the second LSTM unit, applies the sigmoid activation function, and outputs the amplitude value of each discrete point on the EMD estimation signal.
Further, the loss function of the deep learning model in the step S4 during training is as follows:
,
where L is the loss function, y_k is the output value of the k-th output unit, g_k is the k-th label value, the output values y_k together form an EMD estimation signal, r_k is the k-th training parameter, M is the number of output values, and k is a positive integer.
Further, the expression of the kth training parameter is:
,
where exp is the exponential function with the natural constant e as its base.
The beneficial effects of the above further scheme are: the deep learning model outputs a plurality of values y_k, which together form an EMD estimation signal. Because the model uses two layers of LSTM, training takes a long time; to shorten it, training parameters are set to strengthen the loss function and accelerate training of the deep learning model.
Drawings
FIG. 1 is a flow chart of a dolphin sound generation method based on deep learning;
FIG. 2 is a schematic diagram of a deep learning model;
fig. 3 is a schematic structural diagram of a singular value feature processing network or a time-frequency feature processing network.
Detailed Description
The following description of the embodiments is provided to help those skilled in the art understand the invention, but the invention is not limited to the scope of these embodiments: to those skilled in the art, all inventions that make use of the inventive concept fall within the protection scope defined by the appended claims.
As shown in fig. 1, a dolphin sound generation method based on deep learning includes the following steps:
s1, decomposing original dolphin sound by adopting an EMD decomposition algorithm to obtain a plurality of EMD original signals;
s2, extracting singular values from each EMD original signal, and constructing a singular value feature matrix;
s3, extracting time-frequency characteristics of each EMD original signal, and constructing a time-frequency characteristic matrix;
s4, processing the singular value feature matrix and the time-frequency feature matrix by adopting a deep learning model to generate an EMD estimation signal;
s5, combining the EMD estimation signals to generate dolphin sound.
In this embodiment, the singular value feature matrix and the time-frequency feature matrix of one EMD original signal together generate one corresponding EMD estimation signal.
In this embodiment, the plurality of EMD estimation signals are combined by adding the amplitude values at the same positions.
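The superposition described above can be sketched in a few lines of numpy; `combine_emd_estimates` is an illustrative helper name, not from the patent, and it assumes the estimation signals are equal-length arrays.

```python
import numpy as np

def combine_emd_estimates(estimates):
    """Combine EMD estimation signals by adding co-located amplitude values (step S5)."""
    stacked = np.vstack(estimates)   # shape: (num_signals, num_samples)
    return stacked.sum(axis=0)       # element-wise sum across signals

# toy example with three "EMD estimation signals"
e1 = np.array([0.1, 0.2, 0.3])
e2 = np.array([0.0, 0.1, 0.0])
e3 = np.array([0.2, 0.0, 0.1])
combined = combine_emd_estimates([e1, e2, e3])
```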
The step S2 comprises the following sub-steps:
s21, constructing a corresponding track matrix according to each EMD original signal;
s22, performing singular value decomposition on the track matrix to obtain a singular value eigenvector;
s23, constructing a singular value feature matrix according to the singular value feature vector, wherein A=a T a, wherein A is a singular value eigenvector, a is a singular value eigenvector, and T is a transposition operation.
The singular value eigenvector consists of the singular values.
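Steps S21 to S23 can be sketched as follows. The patent text does not spell out the trajectory-matrix construction, so the Hankel-style windowing and the window length used here are assumptions; only the pattern (trajectory matrix, singular value decomposition, then A = aᵀa) follows the description.

```python
import numpy as np

def singular_value_feature_matrix(signal, window=4):
    """S21-S23: trajectory matrix -> singular values -> feature matrix A = a^T a.
    The Hankel-style construction and window length are illustrative assumptions."""
    n = len(signal) - window + 1
    X = np.stack([signal[i:i + window] for i in range(n)])  # trajectory matrix
    a = np.linalg.svd(X, compute_uv=False)                  # singular value feature vector
    return np.outer(a, a)                                   # A = a^T a (symmetric matrix)

sig = np.sin(np.linspace(0, 2 * np.pi, 16))
A = singular_value_feature_matrix(sig)
```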
The step S3 comprises the following substeps:
s31, extracting time domain features of each EMD original signal, wherein the time domain features comprise: peak-to-peak, skewness, kurtosis, and form factor;
s32, carrying out frequency domain transformation on each EMD original signal to obtain a frequency domain signal;
s33, extracting frequency domain characteristics of the frequency domain signals, wherein the frequency domain characteristics comprise: spectrum amplitude mean value, spectrum amplitude center of gravity, power spectrum density and cepstrum coefficient;
s34, constructing a time-frequency feature vector by taking the time-domain features and the frequency-domain features as elements;
s35, constructing a time-frequency characteristic matrix according to the time-frequency characteristic vector, wherein B=b T B, wherein B is a time-frequency characteristic matrix, B is a time-frequency characteristic vector, and T is transposition operation.
In the invention, the singular value feature matrix and the time-frequency feature matrix express the sound characteristics of the original dolphin sound, reducing the data volume of the EMD original signals.
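A sketch of steps S31 to S35, using standard textbook definitions of the named features; the patent lists only the feature names, so the exact formulas below (for example, taking a single low-order cepstral coefficient) are assumptions.

```python
import numpy as np

def time_frequency_vector(x, fs=1.0):
    """S31-S35: time-domain and frequency-domain features gathered into one vector b,
    then the matrix B = b^T b. Feature definitions are standard textbook forms."""
    mu, sigma = x.mean(), x.std()
    peak_to_peak = x.max() - x.min()
    skewness = np.mean((x - mu) ** 3) / sigma ** 3
    kurtosis = np.mean((x - mu) ** 4) / sigma ** 4
    form_factor = np.sqrt(np.mean(x ** 2)) / np.mean(np.abs(x))

    spec = np.abs(np.fft.rfft(x))                    # frequency-domain transform (S32)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec_mean = spec.mean()                          # spectrum amplitude mean value
    spec_centroid = (freqs * spec).sum() / spec.sum()  # spectrum amplitude centre of gravity
    psd_mean = np.mean(spec ** 2 / len(x))           # crude power spectral density summary
    cepstrum = np.real(np.fft.irfft(np.log(spec + 1e-12)))
    cep_coeff = cepstrum[1]                          # one low-order cepstral coefficient

    b = np.array([peak_to_peak, skewness, kurtosis, form_factor,
                  spec_mean, spec_centroid, psd_mean, cep_coeff])
    return b, np.outer(b, b)                         # time-frequency vector b, matrix B

x = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 64, endpoint=False))
b_vec, B = time_frequency_vector(x, fs=64)
```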
As shown in fig. 2, the deep learning model in S4 includes: the system comprises a singular value feature processing network, a time-frequency feature processing network, a feature splicing unit, a first LSTM unit, a Concat layer, a second LSTM unit and a plurality of output units;
the input end of the singular value feature processing network is used for inputting a singular value feature matrix, and the output end of the singular value feature processing network is connected with the first input end of the feature splicing unit; the input end of the time-frequency characteristic processing network is used for inputting a time-frequency characteristic matrix, and the output end of the time-frequency characteristic processing network is connected with the second input end of the characteristic splicing unit; the output ends of the characteristic splicing units are respectively connected with the input ends of a plurality of Cell modules in the first LSTM unit; the input ends of the Concat layer are respectively connected with the output ends of the plurality of Cell modules in the first LSTM unit, and the output ends of the Concat layer are respectively connected with the input ends of the plurality of Cell modules in the second LSTM unit; the input end of each output unit is connected with the output end of one Cell module in the second LSTM unit, and the output end of each output unit is used as the output end of the deep learning model; each Cell module in the first LSTM unit is used for inputting a characteristic value in the output characteristics of the characteristic splicing unit; each Cell module in the second LSTM unit is configured to input an output feature of the Concat layer.
The singular value feature processing network and the time-frequency feature processing network process the singular value feature matrix and the time-frequency feature matrix respectively, achieving further feature extraction, and the feature splicing unit splices the features extracted by the two networks. Each Cell module in the first LSTM unit processes one feature value, so the memory of the LSTM lets the first LSTM unit better model the relationships among the feature values output by the feature splicing unit. The Concat layer splices the outputs of the first LSTM unit into one vector and feeds it to every Cell module in the second LSTM unit; each Cell module in the second LSTM unit therefore considers not only the output of its preceding Cell module but also the combined output of the first LSTM unit, improving the precision of the generated dolphin sound.
In the invention, the Cell modules inside an LSTM unit are connected as in the prior art. The input of the current Cell module in the first LSTM unit includes one feature value from the output features of the feature splicing unit and the output of the previous Cell module connected to the current Cell module.
The input of the current Cell module in the second LSTM unit includes the output of the Concat layer and the output of the previous Cell module connected to the current Cell module.
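The Cell modules described above follow the conventional LSTM wiring, which can be sketched as a single gate-update step fed one feature value at a time; the dimensions and random weights below are purely illustrative, not part of the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_cell_step(x, h_prev, c_prev, W, U, bias):
    """One standard LSTM Cell step (input/forget/output gates plus candidate state)."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    z = W @ x + U @ h_prev + bias                       # stacked pre-activations
    hdim = h_prev.shape[0]
    i, f, o, g = z[:hdim], z[hdim:2*hdim], z[2*hdim:3*hdim], z[3*hdim:]
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)   # new cell state
    h = sigmoid(o) * np.tanh(c)                         # new hidden state
    return h, c

hidden, in_dim = 3, 1
W = rng.normal(size=(4 * hidden, in_dim))
U = rng.normal(size=(4 * hidden, hidden))
bias = np.zeros(4 * hidden)

# feed the feature values one per Cell step, as the first LSTM unit does
features = np.array([0.5, -0.2, 0.8])
h, c = np.zeros(hidden), np.zeros(hidden)
outputs = []
for v in features:
    h, c = lstm_cell_step(np.array([v]), h, c, W, U, bias)
    outputs.append(h)
concat = np.concatenate(outputs)   # the Concat layer splices the outputs into one vector
```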
As shown in fig. 3, the singular value feature processing network and the time-frequency feature processing network have the same structure, and each of them includes: a first convolution block, a second convolution block, a third convolution block, a first upsampling layer, a second upsampling layer, a third upsampling layer, an adder A1 and a feature saliency processing layer;
the input end of the first convolution block is used as the input end of a singular value feature processing network or a time-frequency feature processing network, and the output end of the first convolution block is respectively connected with the input end of the first up-sampling layer and the input end of the second convolution block; the output end of the second convolution block is respectively connected with the input end of the second up-sampling layer and the input end of the third convolution block; the output end of the third convolution block is connected with the input end of the third up-sampling layer; the input end of the adder A1 is respectively connected with the output end of the first upsampling layer, the output end of the second upsampling layer and the output end of the third upsampling layer, and the output end of the adder A1 is connected with the input end of the characteristic significant processing layer; and the output end of the characteristic significant processing layer is used as the output end of a singular value characteristic processing network or a time-frequency characteristic processing network.
In this embodiment, the convolution block includes: convolutional layer, BN layer, and ReLU layer.
In the invention, three convolution blocks extract features step by step; an up-sampling layer at each depth up-samples the features, enriching the data volume of the features at different depths; the adder A1 then fuses them, and the feature saliency processing layer emphasizes the salient features.
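The multi-scale fuse-by-addition wiring can be illustrated in one dimension. Average pooling stands in for the convolution blocks here (the real blocks contain a convolution layer, BN layer, and ReLU layer), so this shows only the skip-to-upsample-to-adder topology, not the actual network.

```python
import numpy as np

def avg_downsample(x):
    """Stand-in for a convolution block: halve resolution (illustrative only)."""
    return x.reshape(-1, 2).mean(axis=1)

def upsample_to(x, length):
    """Nearest-neighbour up-sampling back to the full length."""
    return np.repeat(x, length // len(x))

x = np.arange(8, dtype=float)   # input feature
f1 = avg_downsample(x)          # first "convolution block" output, length 4
f2 = avg_downsample(f1)         # second block, length 2
f3 = avg_downsample(f2)         # third block, length 1
# adder A1: sum the up-sampled features from all three depths
fused = upsample_to(f1, 8) + upsample_to(f2, 8) + upsample_to(f3, 8)
```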
The expression of the characteristic significant processing layer is as follows:
,
where x_{i,z} is the i-th eigenvalue output by the feature saliency processing layer, x_i is the i-th eigenvalue input to the feature saliency processing layer, x_max is the maximum eigenvalue in the output features of the adder A1, and i is a positive integer.
The feature saliency processing layer normalizes the input feature values on the one hand, and emphasizes the salient features on the other, so that large feature values are distinguished more sharply from small ones.
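The exact expression of the feature saliency processing layer is not reproduced in this text. One form consistent with the stated variables (x_i, x_max) and the stated goals (normalization plus emphasis of large values) is the exponential normalization x_{i,z} = exp(x_i - x_max); this is an illustrative guess, not the patent's formula.

```python
import numpy as np

def feature_saliency(x):
    """Assumed form x_{i,z} = exp(x_i - x_max): outputs lie in (0, 1], equal 1 at the
    maximum, and decay quickly for small eigenvalues. Guess, not the patent's formula."""
    x_max = x.max()          # maximum eigenvalue from the adder A1 output
    return np.exp(x - x_max)

vals = np.array([1.0, 3.0, 5.0])
sal = feature_saliency(vals)
```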
The expression of the characteristic splicing unit is as follows:
,
where H is the output feature of the feature splicing unit, maxpool is the maximum pooling operation, avgpool is the average pooling operation, X_Q is the output feature of the singular value feature processing network, X_S is the output feature of the time-frequency feature processing network, and ⊙ denotes the Hadamard product.
During feature splicing, the maximum pooling and average pooling operations extract the maximum feature and the average feature respectively, further condensing the features.
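The expression of the feature splicing unit is likewise not reproduced in this text. The sketch below combines the named operations (Hadamard product, max pooling, average pooling) in one plausible wiring; the concatenation order and pooling width are assumptions.

```python
import numpy as np

def feature_splice(xq, xs, pool=2):
    """Assumed splicing: Hadamard product of the two network outputs, then the
    max-pooled and average-pooled results concatenated. Illustrative wiring only."""
    prod = xq * xs                      # Hadamard product X_Q (.) X_S
    blocks = prod.reshape(-1, pool)
    return np.concatenate([blocks.max(axis=1), blocks.mean(axis=1)])

xq = np.array([1.0, 2.0, 3.0, 4.0])   # singular value network output (toy values)
xs = np.array([0.5, 0.5, 2.0, 1.0])   # time-frequency network output (toy values)
H = feature_splice(xq, xs)
```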
The expression of the output unit is:
,
where y_k is the output value of the k-th output unit, h_{j,k} is the j-th eigenvalue output by the Cell module in the second LSTM unit, w_{j,k} is the weight of h_{j,k}, b_{j,k} is the bias of h_{j,k}, N is the number of eigenvalues output by a Cell module in the second LSTM unit, j and k are positive integers, and sigmoid is the activation function.
In the invention, the output unit considers all the feature values output by each Cell module in the second LSTM unit, applies the sigmoid activation function, and outputs the amplitude value of each discrete point on the EMD estimation signal.
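From the variable definitions above, the output unit can be reconstructed as y_k = sigmoid(Σ_j (w_{j,k} h_{j,k} + b_{j,k})); the summation form is an inference from the text, sketched below with toy values.

```python
import numpy as np

def output_unit(h, w, b):
    """y_k = sigmoid(sum_j (w_{j,k} * h_{j,k} + b_{j,k})), reconstructed from the
    variable definitions; the summation form is an inference, not a quoted formula."""
    s = np.sum(w * h + b)
    return 1.0 / (1.0 + np.exp(-s))

h = np.array([0.2, -0.1, 0.4])  # eigenvalues from one Cell module (toy values)
w = np.ones(3)                  # weights w_{j,k}
b = np.zeros(3)                 # biases b_{j,k}
y = output_unit(h, w, b)        # one amplitude value of the EMD estimation signal
```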
In this embodiment, the deep learning model needs to be trained before use; the training sample labels are the amplitude value vectors formed from the amplitude values of each EMD original signal obtained by decomposing the original dolphin sound.
The loss function of the deep learning model during training is as follows:
,
where L is the loss function, y_k is the output value of the k-th output unit, g_k is the k-th label value, the output values y_k together form an EMD estimation signal, r_k is the k-th training parameter, M is the number of output values, and k is a positive integer.
The expression of the kth training parameter is:
,
where exp is the exponential function with the natural constant e as its base.
The deep learning model outputs a plurality of values y_k, which together form an EMD estimation signal. Because the deep learning model of the invention uses two layers of LSTM, LSTM training takes a long time; to shorten it, the training parameters are set to strengthen the loss function and accelerate training of the deep learning model.
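Neither the loss function nor the training-parameter expression survives in this text. The sketch below assumes a weighted squared error with r_k = exp(|y_k - g_k|), chosen only to illustrate how an exp-based training parameter strengthens the loss on badly-fit points; both formulas are assumptions, not the patent's.

```python
import numpy as np

def weighted_loss(y, g):
    """Assumed L = (1/M) * sum_k r_k * (y_k - g_k)^2 with r_k = exp(|y_k - g_k|).
    The exp-based r_k grows with the error, enlarging the penalty on bad points."""
    r = np.exp(np.abs(y - g))        # training parameters r_k
    return np.mean(r * (y - g) ** 2)

y = np.array([0.2, 0.9])   # model outputs (toy values)
g = np.array([0.2, 0.1])   # label values (toy values)
L = weighted_loss(y, g)
```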
In summary, the method decomposes the original dolphin sound with an EMD decomposition algorithm to obtain a plurality of EMD original signals, each containing part of the sound characteristics. Singular values and time-frequency features are extracted to capture the sound characteristics of each EMD original signal; the deep learning model processes the singular value feature matrix and the time-frequency feature matrix to generate EMD estimation signals, which are superposed and combined into the dolphin sound. Because the original dolphin sound is decomposed, each EMD original signal contains only part of its characteristics, so the data volume is smaller, the complexity is lower, and the precision of the generated dolphin sound can be improved.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. The dolphin sound generation method based on deep learning is characterized by comprising the following steps of:
s1, decomposing original dolphin sound by adopting an EMD decomposition algorithm to obtain a plurality of EMD original signals;
s2, extracting singular values from each EMD original signal, and constructing a singular value feature matrix;
s3, extracting time-frequency characteristics of each EMD original signal, and constructing a time-frequency characteristic matrix;
s4, processing the singular value feature matrix and the time-frequency feature matrix by adopting a deep learning model to generate an EMD estimation signal;
s5, combining the EMD estimation signals to generate dolphin sound.
2. The deep learning based dolphin sound generation method of claim 1 wherein said S2 comprises the sub-steps of:
s21, constructing a corresponding track matrix according to each EMD original signal;
s22, performing singular value decomposition on the track matrix to obtain a singular value eigenvector;
s23, constructing a singular value feature matrix according to the singular value feature vector, wherein A=a T a, wherein A is a singular value eigenvector, a is a singular value eigenvector, and T is a transposition operation.
3. The deep learning based dolphin sound generation method of claim 1 wherein said S3 comprises the sub-steps of:
s31, extracting time domain features of each EMD original signal, wherein the time domain features comprise: peak-to-peak, skewness, kurtosis, and form factor;
s32, carrying out frequency domain transformation on each EMD original signal to obtain a frequency domain signal;
s33, extracting frequency domain characteristics of the frequency domain signals, wherein the frequency domain characteristics comprise: spectrum amplitude mean value, spectrum amplitude center of gravity, power spectrum density and cepstrum coefficient;
s34, constructing a time-frequency feature vector by taking the time-domain features and the frequency-domain features as elements;
s35, constructing a time-frequency characteristic matrix according to the time-frequency characteristic vector, wherein B=b T B, wherein B is a time-frequency characteristic matrix, B is a time-frequency characteristic vector, and T is transposition operation.
4. The deep learning-based dolphin sound generation method of claim 1 wherein said deep learning model in S4 comprises: the system comprises a singular value feature processing network, a time-frequency feature processing network, a feature splicing unit, a first LSTM unit, a Concat layer, a second LSTM unit and a plurality of output units;
the input end of the singular value feature processing network is used for inputting a singular value feature matrix, and the output end of the singular value feature processing network is connected with the first input end of the feature splicing unit; the input end of the time-frequency characteristic processing network is used for inputting a time-frequency characteristic matrix, and the output end of the time-frequency characteristic processing network is connected with the second input end of the characteristic splicing unit; the output ends of the characteristic splicing units are respectively connected with the input ends of a plurality of Cell modules in the first LSTM unit; the input ends of the Concat layer are respectively connected with the output ends of the plurality of Cell modules in the first LSTM unit, and the output ends of the Concat layer are respectively connected with the input ends of the plurality of Cell modules in the second LSTM unit; the input end of each output unit is connected with the output end of one Cell module in the second LSTM unit, and the output end of each output unit is used as the output end of the deep learning model; each Cell module in the first LSTM unit is used for inputting a characteristic value in the output characteristics of the characteristic splicing unit; each Cell module in the second LSTM unit is configured to input an output feature of the Concat layer.
5. The deep learning-based dolphin sound generation method of claim 4, wherein said singular value feature processing network and said time-frequency feature processing network are identical in structure, each comprising: a first convolution block, a second convolution block, a third convolution block, a first upsampling layer, a second upsampling layer, a third upsampling layer, an adder A1 and a feature saliency processing layer;
the input end of the first convolution block is used as the input end of a singular value feature processing network or a time-frequency feature processing network, and the output end of the first convolution block is respectively connected with the input end of the first up-sampling layer and the input end of the second convolution block; the output end of the second convolution block is respectively connected with the input end of the second up-sampling layer and the input end of the third convolution block; the output end of the third convolution block is connected with the input end of the third up-sampling layer; the input end of the adder A1 is respectively connected with the output end of the first upsampling layer, the output end of the second upsampling layer and the output end of the third upsampling layer, and the output end of the adder A1 is connected with the input end of the characteristic significant processing layer; and the output end of the characteristic significant processing layer is used as the output end of a singular value characteristic processing network or a time-frequency characteristic processing network.
6. The deep learning based dolphin sound generation method of claim 5 wherein said feature saliency processing layer is expressed as:
x_{i,z} = x_i / x_max ,
wherein x_{i,z} is the ith feature value output by the feature saliency processing layer, x_i is the ith feature value input to the feature saliency processing layer, x_max is the maximum feature value among the output features of the adder A1, and i is a positive integer.
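The formula image for this layer is not reproduced in the source. If it is read as normalisation by the largest feature value from adder A1, so that the most salient feature maps to 1, it reduces to:

```python
import numpy as np

# One plausible reading of the feature-saliency processing layer: each
# feature value output by adder A1 is scaled by the maximum feature
# value.  The exact formula image is not reproduced in the source text.
x = np.array([0.5, 2.0, 1.0, 4.0])   # features from adder A1
x_max = x.max()
x_z = x / x_max                      # x_{i,z} = x_i / x_max
print(x_z)  # [0.125 0.5   0.25  1.   ]
```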
7. The deep learning based dolphin sound generation method of claim 4 wherein said feature stitching unit has the expression:
H = Maxpool(X_Q ⊙ X_S) + Avgpool(X_Q ⊙ X_S) ,
wherein H is the output feature of the feature splicing unit, Maxpool is the maximum pooling operation, Avgpool is the average pooling operation, X_Q is the output feature of the singular value feature processing network, X_S is the output feature of the time-frequency feature processing network, and ⊙ denotes the Hadamard product.
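The formula image is not reproduced in the source; one plausible reading, in which the Hadamard product of the two processed feature maps is summarised by column-wise max and average pooling, can be sketched as:

```python
import numpy as np

def splice(Xq, Xs):
    """One plausible reading of the feature splicing unit: the Hadamard
    product of the singular value features Xq and the time-frequency
    features Xs, pooled column-wise by max and average and summed.
    The patent's formula image is not reproduced in the text."""
    had = Xq * Xs                          # Hadamard product X_Q (.) X_S
    return had.max(axis=0) + had.mean(axis=0)   # Maxpool + Avgpool

H = splice(np.ones((3, 4)), 2.0 * np.ones((3, 4)))
print(H.shape)  # (4,)
```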
8. The deep learning based dolphin sound generation method of claim 4 wherein said output unit has the expression:
y_k = sigmoid( ∑_{j=1}^{n} ( w_{j,k} h_{j,k} + b_{j,k} ) ) ,
wherein y_k is the output value of the kth output unit, h_{j,k} is the jth feature value output by a Cell module in the second LSTM unit, w_{j,k} is the weight of h_{j,k}, b_{j,k} is the bias of h_{j,k}, n is the number of feature values output by a Cell module in the second LSTM unit, j and k are positive integers, and sigmoid is the activation function.
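Under the reading that each output unit is a weighted, biased sum of the n Cell-module feature values squashed by a sigmoid, the unit is a one-liner:

```python
import numpy as np

def output_unit(h, w, b):
    """Output unit sketch: y_k = sigmoid(sum_j (w_{j,k} h_{j,k} + b_{j,k})),
    a weighted, biased sum of the n feature values from one Cell module
    of the second LSTM unit, squashed to (0, 1)."""
    s = np.sum(w * h + b)
    return 1.0 / (1.0 + np.exp(-s))

# With zero inputs and biases, the sigmoid of 0 gives exactly 0.5.
y = output_unit(np.zeros(5), np.ones(5), np.zeros(5))
print(y)  # 0.5
```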
9. The deep learning-based dolphin sound generation method of claim 1, wherein said deep learning model in S4 has a loss function during training of:
L = ∑_{k=1}^{M} r_k (y_k − g_k)² ,
wherein L is the loss function, y_k is the output value of the kth output unit, g_k is the kth tag value, the output values y_k together form the EMD estimation signal, r_k is the kth training parameter, M is the number of output values, and k is a positive integer.
10. The deep learning based dolphin sound generation method of claim 9 wherein said kth training parameter is expressed as:
r_k = exp(|y_k − g_k|) ,
where exp is the exponential function with the natural constant e as its base.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410091532.4A CN117612513B (en) | 2024-01-23 | 2024-01-23 | Deep learning-based dolphin sound generation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117612513A true CN117612513A (en) | 2024-02-27 |
CN117612513B CN117612513B (en) | 2024-04-26 |
Family
ID=89960238
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410091532.4A Active CN117612513B (en) | 2024-01-23 | 2024-01-23 | Deep learning-based dolphin sound generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117612513B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113192504A (en) * | 2021-04-29 | 2021-07-30 | 浙江大学 | Domain-adaptation-based silent voice attack detection method |
CN116863959A (en) * | 2023-09-04 | 2023-10-10 | 哈尔滨工业大学(威海) | Dolphin sound generating method based on generating countermeasure network |
KR20230167617A (en) * | 2022-06-02 | 2023-12-11 | 계명대학교 산학협력단 | Method and apparatus for generating multi-scale feature-based transformer models |
Also Published As
Publication number | Publication date |
---|---|
CN117612513B (en) | 2024-04-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||