CN117612513A - Deep learning-based dolphin sound generation method - Google Patents

Deep learning-based dolphin sound generation method

Info

Publication number
CN117612513A
CN117612513A (application number CN202410091532.4A)
Authority
CN
China
Prior art keywords
output
feature
time
characteristic
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410091532.4A
Other languages
Chinese (zh)
Other versions
CN117612513B (en)
Inventor
冯子仪
尹晓峰
张培珍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Ocean University
Original Assignee
Guangdong Ocean University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Ocean University filed Critical Guangdong Ocean University
Priority to CN202410091532.4A priority Critical patent/CN117612513B/en
Publication of CN117612513A publication Critical patent/CN117612513A/en
Application granted granted Critical
Publication of CN117612513B publication Critical patent/CN117612513B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a dolphin sound generation method based on deep learning, belonging to the technical field of sound synthesis. By decomposing the original dolphin sound, each EMD original signal contains only part of the characteristics of the original dolphin sound, the data volume is smaller, and the complexity is lower, so the precision of the generated dolphin sound can be improved.

Description

Deep learning-based dolphin sound generation method
Technical Field
The invention relates to the technical field of sound synthesis, in particular to a dolphin sound generation method based on deep learning.
Background
In recent years, studies have attempted to simulate dolphin sound generation using artificial intelligence technology. For example, a dolphin call generation method based on a generative adversarial network (GAN) has been proposed, which generates dolphin calls by training a discriminator model and a generator model. This technology not only helps to deepen the understanding of the acoustic language of dolphins, but also provides a direction for artificially synthesizing dolphin sounds.
The sound emitted by dolphins has the following characteristics: it is high-frequency and wide-band, consists of a series of complex and varying tones, and serves sonar and echolocation functions. Dolphin sound therefore carries a large amount of information and has high-frequency, wide-band characteristics; if the existing dolphin sound generation method is adopted, with dolphin sound samples used directly as training samples, the dolphin sound generated by the trained generator model has low precision.
Disclosure of Invention
Aiming at the above defects in the prior art, the deep learning-based dolphin sound generation method provided by the invention solves the problem that the dolphin sound generated by existing dolphin sound generation methods has low precision.
In order to achieve the aim of the invention, the invention adopts the following technical scheme: a dolphin sound generation method based on deep learning comprises the following steps:
s1, decomposing original dolphin sound by adopting an EMD decomposition algorithm to obtain a plurality of EMD original signals;
s2, extracting singular values from each EMD original signal, and constructing a singular value feature matrix;
s3, extracting time-frequency characteristics of each EMD original signal, and constructing a time-frequency characteristic matrix;
s4, processing the singular value feature matrix and the time-frequency feature matrix by adopting a deep learning model to generate an EMD estimation signal;
s5, combining the EMD estimation signals to generate dolphin sound.
The beneficial effects of the invention are as follows: the method decomposes the original dolphin sound with an EMD decomposition algorithm to obtain a plurality of EMD original signals, each of which contains part of the sound characteristics; singular values and time-frequency features are extracted to obtain the sound characteristics of each EMD original signal; a deep learning model processes the singular value feature matrix and the time-frequency feature matrix to generate EMD estimation signals; and the EMD estimation signals are superposed and combined to obtain the dolphin sound. Because the original dolphin sound is decomposed, each EMD original signal contains only part of the characteristics of the original dolphin sound, the data volume is smaller, and the complexity is lower, so the precision of the generated dolphin sound can be improved.
Further, the step S2 includes the following sub-steps:
s21, constructing a corresponding track matrix according to each EMD original signal;
s22, performing singular value decomposition on the track matrix to obtain a singular value eigenvector;
s23, constructing a singular value feature matrix according to the singular value feature vector, wherein A = a^T a, A is the singular value feature matrix, a is the singular value feature vector, and T is the transposition operation.
Further, the step S3 includes the following sub-steps:
s31, extracting time domain features of each EMD original signal, wherein the time domain features comprise: peak-to-peak, skewness, kurtosis, and form factor;
s32, carrying out frequency domain transformation on each EMD original signal to obtain a frequency domain signal;
s33, extracting frequency domain characteristics of the frequency domain signals, wherein the frequency domain characteristics comprise: spectrum amplitude mean value, spectrum amplitude center of gravity, power spectrum density and cepstrum coefficient;
s34, constructing a time-frequency feature vector by taking the time-domain features and the frequency-domain features as elements;
s35, constructing a time-frequency feature matrix according to the time-frequency feature vector, wherein B = b^T b, B is the time-frequency feature matrix, b is the time-frequency feature vector, and T is the transposition operation.
The beneficial effects of the above further scheme are: according to the invention, the sound characteristics of the original dolphin sound are expressed through the singular value feature matrix and the time-frequency feature matrix, so that the data volume of the EMD original signals is reduced.
Further, the deep learning model in S4 includes: a singular value feature processing network, a time-frequency feature processing network, a feature splicing unit, a first LSTM unit, a Concat layer, a second LSTM unit and a plurality of output units;
The input end of the singular value feature processing network is used for inputting the singular value feature matrix, and its output end is connected with the first input end of the feature splicing unit; the input end of the time-frequency feature processing network is used for inputting the time-frequency feature matrix, and its output end is connected with the second input end of the feature splicing unit; the output end of the feature splicing unit is connected with the input ends of the plurality of Cell modules in the first LSTM unit; the input ends of the Concat layer are connected with the output ends of the plurality of Cell modules in the first LSTM unit, and the output ends of the Concat layer are connected with the input ends of the plurality of Cell modules in the second LSTM unit; the input end of each output unit is connected with the output end of one Cell module in the second LSTM unit, and the output end of each output unit serves as an output end of the deep learning model; each Cell module in the first LSTM unit is used for inputting one feature value from the output features of the feature splicing unit; and each Cell module in the second LSTM unit is used for inputting the output feature of the Concat layer.
The beneficial effects of the above further scheme are: according to the method, a singular value feature processing network and a time-frequency feature processing network are arranged to process the singular value feature matrix and the time-frequency feature matrix respectively, realizing further feature extraction, and a feature splicing unit extracts and splices the features of the two networks. Each Cell module in the first LSTM unit processes one feature value, so that, using the memory of the LSTM, the first LSTM unit better captures the relationship between the feature values in the output features of the feature splicing unit. The Concat layer splices the outputs of the first LSTM unit into one vector, which is input to every Cell module in the second LSTM unit, so the second LSTM unit considers not only the output of the previous Cell module in the second LSTM unit but also the combined output of the first LSTM unit, which improves the precision of the generated dolphin sound.
Further, the singular value feature processing network and the time-frequency feature processing network have the same structure, and both the singular value feature processing network and the time-frequency feature processing network comprise: a first convolution block, a second convolution block, a third convolution block, a first upsampling layer, a second upsampling layer, a third upsampling layer, an adder A1 and a feature saliency processing layer;
The input end of the first convolution block serves as the input end of the singular value feature processing network or the time-frequency feature processing network, and its output end is connected with the input end of the first up-sampling layer and the input end of the second convolution block respectively; the output end of the second convolution block is connected with the input end of the second up-sampling layer and the input end of the third convolution block respectively; the output end of the third convolution block is connected with the input end of the third up-sampling layer; the input ends of the adder A1 are connected with the output ends of the first, second and third up-sampling layers respectively, and the output end of the adder A1 is connected with the input end of the feature saliency processing layer; and the output end of the feature saliency processing layer serves as the output end of the singular value feature processing network or the time-frequency feature processing network.
The beneficial effects of the above further scheme are: in the invention, three convolution blocks extract features step by step, up-sampling layers are arranged at the features of different depths to perform up-sampling and enrich the data volume of the features at each depth, the adder A1 then performs fusion processing, and the feature saliency processing layer highlights the salient features.
Further, the expression of the feature saliency processing layer is:
wherein x_{i,z} is the i-th feature value output by the feature saliency processing layer, x_i is the i-th feature value input to the feature saliency processing layer, x_max is the maximum feature value among the output features of the adder A1, and i is a positive integer.
The beneficial effects of the above further scheme are: the feature saliency processing layer normalizes the input feature values on the one hand and highlights the salient features on the other, so that large feature values are distinguished from small feature values more markedly.
Further, the expression of the feature splicing unit is:
wherein H is the output feature of the feature splicing unit, Maxpool is the max pooling operation, Avgpool is the average pooling operation, X_Q is the output feature of the singular value feature processing network, X_S is the output feature of the time-frequency feature processing network, and the product in the expression is the Hadamard product.
The beneficial effects of the above further scheme are: when the features are spliced, the max pooling operation and the average pooling operation extract the maximum feature and the average feature respectively, further simplifying the features.
Further, the expression of the output unit is:
wherein y_k is the output value of the k-th output unit, h_{j,k} is the j-th feature value output by the corresponding Cell module in the second LSTM unit, w_{j,k} is the weight of h_{j,k}, b_{j,k} is the bias of h_{j,k}, N is the number of feature values output by a Cell module in the second LSTM unit, j and k are positive integers, and sigmoid is the activation function.
The beneficial effects of the above further scheme are: in the invention, the output unit considers all the feature values output by its Cell module in the second LSTM unit and applies a sigmoid activation function, outputting the amplitude value of one discrete point of the EMD estimation signal.
Further, the loss function of the deep learning model in the step S4 during training is as follows:
wherein L is the loss function, y_k is the output value of the k-th output unit, g_k is the k-th label value, the plurality of output values y_k form an EMD estimation signal, r_k is the k-th training parameter, M is the number of output values, and k is a positive integer.
Further, the expression of the kth training parameter is:
where exp is an exponential function based on a natural constant.
The beneficial effects of the above further scheme are: the deep learning model outputs a plurality of values y_k, and these output values together form an EMD estimation signal. Because the deep learning model uses two layers of LSTM and LSTM training takes a long time, the training parameters are set to strengthen the loss function and accelerate the training of the deep learning model, thereby shortening the training time.
Drawings
FIG. 1 is a flow chart of a dolphin sound generation method based on deep learning;
FIG. 2 is a schematic diagram of a deep learning model;
fig. 3 is a schematic structural diagram of a singular value feature processing network or a time-frequency feature processing network.
Detailed Description
The following description of the embodiments of the invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments; for those of ordinary skill in the art, all inventions that make use of the inventive concept are within the protection scope defined by the spirit and scope of the invention in the appended claims.
As shown in fig. 1, a dolphin sound generation method based on deep learning includes the following steps:
s1, decomposing original dolphin sound by adopting an EMD decomposition algorithm to obtain a plurality of EMD original signals;
s2, extracting singular values from each EMD original signal, and constructing a singular value feature matrix;
s3, extracting time-frequency characteristics of each EMD original signal, and constructing a time-frequency characteristic matrix;
s4, processing the singular value feature matrix and the time-frequency feature matrix by adopting a deep learning model to generate an EMD estimation signal;
s5, combining the EMD estimation signals to generate dolphin sound.
In this embodiment, a singular value feature matrix and a time-frequency feature matrix of an EMD original signal correspondingly generate an EMD estimation signal.
In this embodiment, a plurality of EMD estimation signals are combined, specifically: the co-located amplitude values are added.
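As an illustration of S1 and S5, the following Python sketch decomposes a recorded signal into EMD original signals and recombines estimated signals by adding co-located amplitude values. The use of the third-party PyEMD package (EMD-signal) is an assumption; the patent does not name a specific EMD implementation.

```python
# Sketch of S1 (decomposition) and S5 (combination). PyEMD is an assumed dependency.
import numpy as np
from PyEMD import EMD  # assumed: pip install EMD-signal

def decompose_dolphin_sound(signal: np.ndarray) -> np.ndarray:
    """Decompose the original dolphin sound into EMD original signals (IMFs)."""
    emd = EMD()
    imfs = emd.emd(signal)          # shape: (num_imfs, num_samples)
    return imfs

def combine_estimated_signals(estimated_imfs: np.ndarray) -> np.ndarray:
    """Combine the EMD estimation signals by adding co-located amplitude values."""
    return np.sum(estimated_imfs, axis=0)
```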
The step S2 comprises the following sub-steps:
s21, constructing a corresponding track matrix according to each EMD original signal;
s22, performing singular value decomposition on the track matrix to obtain a singular value eigenvector;
s23, constructing a singular value feature matrix according to the singular value feature vector, wherein A = a^T a, A is the singular value feature matrix, a is the singular value feature vector, and T is the transposition operation.
The singular value feature vector consists of the singular values.
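A minimal numpy sketch of s21 to s23 is given below. The window length of the trajectory (Hankel) matrix is an assumption made for illustration only, since the embodiment does not fix it.

```python
# Sketch of s21-s23 for one EMD original signal.
import numpy as np

def singular_value_feature_matrix(imf: np.ndarray, window: int = 64) -> np.ndarray:
    # s21: trajectory matrix whose rows are lagged segments of the signal
    n_rows = len(imf) - window + 1
    track = np.stack([imf[i:i + window] for i in range(n_rows)])
    # s22: singular value decomposition; the singular values form the feature vector a
    a = np.linalg.svd(track, compute_uv=False)
    # s23: singular value feature matrix A = a^T a (outer product of the vector)
    return np.outer(a, a)
```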
The step S3 comprises the following substeps:
s31, extracting time domain features of each EMD original signal, wherein the time domain features comprise: peak-to-peak, skewness, kurtosis, and form factor;
s32, carrying out frequency domain transformation on each EMD original signal to obtain a frequency domain signal;
s33, extracting frequency domain characteristics of the frequency domain signals, wherein the frequency domain characteristics comprise: spectrum amplitude mean value, spectrum amplitude center of gravity, power spectrum density and cepstrum coefficient;
s34, constructing a time-frequency feature vector by taking the time-domain features and the frequency-domain features as elements;
s35, constructing a time-frequency feature matrix according to the time-frequency feature vector, wherein B = b^T b, B is the time-frequency feature matrix, b is the time-frequency feature vector, and T is the transposition operation.
According to the invention, the sound characteristics of the original dolphin sound are expressed through the singular value feature matrix and the time-frequency feature matrix, so that the data volume of the EMD original signals is reduced.
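The following numpy/scipy sketch illustrates s31 to s35 for a single EMD original signal. The patent names the features but not their exact formulas, so the concrete definitions used here (form factor as RMS divided by mean absolute value, a mean power spectral density value, and the first cepstral coefficient) are assumptions.

```python
# Sketch of s31-s35; feature definitions are common-usage assumptions.
import numpy as np
from scipy import signal as sps
from scipy.stats import skew, kurtosis

def time_frequency_feature_matrix(imf: np.ndarray, fs: float) -> np.ndarray:
    # s31: time-domain features
    peak_to_peak = np.ptp(imf)
    form_factor = np.sqrt(np.mean(imf ** 2)) / (np.mean(np.abs(imf)) + 1e-12)
    time_feats = [peak_to_peak, skew(imf), kurtosis(imf), form_factor]

    # s32: frequency-domain transform
    spectrum = np.abs(np.fft.rfft(imf))
    freqs = np.fft.rfftfreq(len(imf), d=1.0 / fs)

    # s33: frequency-domain features
    amp_mean = spectrum.mean()                              # spectrum amplitude mean
    amp_centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)
    _, psd = sps.welch(imf, fs=fs)
    psd_mean = psd.mean()                                   # scalar PSD summary (assumption)
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-12))
    cep_coeff = cepstrum[1]                                 # first cepstral coefficient (assumption)
    freq_feats = [amp_mean, amp_centroid, psd_mean, cep_coeff]

    # s34-s35: time-frequency feature vector b and matrix B = b^T b
    b = np.array(time_feats + freq_feats)
    return np.outer(b, b)
```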
As shown in fig. 2, the deep learning model in S4 includes: a singular value feature processing network, a time-frequency feature processing network, a feature splicing unit, a first LSTM unit, a Concat layer, a second LSTM unit and a plurality of output units;
The input end of the singular value feature processing network is used for inputting the singular value feature matrix, and its output end is connected with the first input end of the feature splicing unit; the input end of the time-frequency feature processing network is used for inputting the time-frequency feature matrix, and its output end is connected with the second input end of the feature splicing unit; the output end of the feature splicing unit is connected with the input ends of the plurality of Cell modules in the first LSTM unit; the input ends of the Concat layer are connected with the output ends of the plurality of Cell modules in the first LSTM unit, and the output ends of the Concat layer are connected with the input ends of the plurality of Cell modules in the second LSTM unit; the input end of each output unit is connected with the output end of one Cell module in the second LSTM unit, and the output end of each output unit serves as an output end of the deep learning model; each Cell module in the first LSTM unit is used for inputting one feature value from the output features of the feature splicing unit; and each Cell module in the second LSTM unit is used for inputting the output feature of the Concat layer.
According to the method, a singular value feature processing network and a time-frequency feature processing network are arranged to process the singular value feature matrix and the time-frequency feature matrix respectively, realizing further feature extraction, and a feature splicing unit extracts and splices the features of the two networks. Each Cell module in the first LSTM unit processes one feature value, so that, using the memory of the LSTM, the first LSTM unit better captures the relationship between the feature values in the output features of the feature splicing unit. The Concat layer splices the outputs of the first LSTM unit into one vector, which is input to every Cell module in the second LSTM unit, so the second LSTM unit considers not only the output of the previous Cell module in the second LSTM unit but also the combined output of the first LSTM unit, which improves the precision of the generated dolphin sound.
In the invention, the Cell modules in each LSTM unit are connected in the same manner as in the prior art. The input of the current Cell module in the first LSTM unit includes: one feature value from the output features of the feature splicing unit and the output of the previous Cell module connected with the current Cell module in the first LSTM unit.
The input of the current Cell module in the second LSTM unit includes: the output of the Concat layer and the output of the previous Cell module connected with the current Cell module in the second LSTM unit.
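A minimal PyTorch sketch of this first LSTM unit, Concat layer and second LSTM unit chain is shown below. The hidden sizes, the number of output points, the use of nn.LSTMCell for the unrolled Cell modules, and the shared output-unit weights are illustrative assumptions, not details taken from the patent.

```python
# Sketch of the two-stage LSTM chain; sizes and sharing are assumptions.
import torch
import torch.nn as nn

class TwoStageLSTM(nn.Module):
    def __init__(self, seq_len: int, hidden1: int = 32, hidden2: int = 64, m_outputs: int = 256):
        super().__init__()
        self.cell1 = nn.LSTMCell(input_size=1, hidden_size=hidden1)                 # first LSTM unit
        self.cell2 = nn.LSTMCell(input_size=seq_len * hidden1, hidden_size=hidden2)  # second LSTM unit
        self.m_outputs = m_outputs
        # one output unit per step of the second LSTM (weights shared here for brevity)
        self.out = nn.Sequential(nn.Linear(hidden2, 1), nn.Sigmoid())

    def forward(self, spliced: torch.Tensor) -> torch.Tensor:
        # spliced: (batch, seq_len) feature values from the feature splicing unit
        batch, seq_len = spliced.shape
        h1 = spliced.new_zeros(batch, self.cell1.hidden_size)
        c1 = spliced.new_zeros(batch, self.cell1.hidden_size)
        states = []
        for t in range(seq_len):                        # one Cell module per feature value
            h1, c1 = self.cell1(spliced[:, t:t + 1], (h1, c1))
            states.append(h1)
        concat = torch.cat(states, dim=1)               # Concat layer: one vector

        h2 = spliced.new_zeros(batch, self.cell2.hidden_size)
        c2 = spliced.new_zeros(batch, self.cell2.hidden_size)
        outputs = []
        for _ in range(self.m_outputs):                 # every Cell gets the same Concat vector
            h2, c2 = self.cell2(concat, (h2, c2))
            outputs.append(self.out(h2))                # amplitude of one discrete point
        return torch.cat(outputs, dim=1)                # (batch, M) EMD estimation signal
```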
As shown in fig. 3, the singular value feature processing network and the time-frequency feature processing network have the same structure, and each of them includes: a first convolution block, a second convolution block, a third convolution block, a first upsampling layer, a second upsampling layer, a third upsampling layer, an adder A1 and a feature saliency processing layer;
The input end of the first convolution block serves as the input end of the singular value feature processing network or the time-frequency feature processing network, and its output end is connected with the input end of the first up-sampling layer and the input end of the second convolution block respectively; the output end of the second convolution block is connected with the input end of the second up-sampling layer and the input end of the third convolution block respectively; the output end of the third convolution block is connected with the input end of the third up-sampling layer; the input ends of the adder A1 are connected with the output ends of the first, second and third up-sampling layers respectively, and the output end of the adder A1 is connected with the input end of the feature saliency processing layer; and the output end of the feature saliency processing layer serves as the output end of the singular value feature processing network or the time-frequency feature processing network.
In this embodiment, each convolution block includes a convolutional layer, a BN layer, and a ReLU layer.
In the invention, three convolution blocks extract features step by step, up-sampling layers are arranged at the features of different depths to perform up-sampling and enrich the data volume of the features at each depth, the adder A1 then performs fusion processing, and the feature saliency processing layer highlights the salient features.
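The following PyTorch sketch mirrors the structure of fig. 3. The channel counts, kernel sizes, strides, the common up-sampled size, and the placeholder saliency operation (division by the maximum feature value, since the patent's exact saliency expression is not reproduced here) are assumptions.

```python
# Sketch of the singular value / time-frequency feature processing network.
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # convolution block = convolutional layer + BN layer + ReLU layer
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class FeatureProcessingNet(nn.Module):
    def __init__(self, out_size: int = 16):
        super().__init__()
        self.block1 = conv_block(1, 8)
        self.block2 = conv_block(8, 8)
        self.block3 = conv_block(8, 8)
        # up-sampling layers bring the three depths back to a common spatial size
        self.up1 = nn.Upsample(size=(out_size, out_size))
        self.up2 = nn.Upsample(size=(out_size, out_size))
        self.up3 = nn.Upsample(size=(out_size, out_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.block1(x)
        f2 = self.block2(f1)
        f3 = self.block3(f2)
        fused = self.up1(f1) + self.up2(f2) + self.up3(f3)       # adder A1
        # placeholder saliency step: normalise by the maximum feature value of A1's output
        return fused / (fused.amax(dim=(1, 2, 3), keepdim=True) + 1e-12)
```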
The expression of the feature saliency processing layer is as follows:
wherein x_{i,z} is the i-th feature value output by the feature saliency processing layer, x_i is the i-th feature value input to the feature saliency processing layer, x_max is the maximum feature value among the output features of the adder A1, and i is a positive integer.
The feature saliency processing layer normalizes the input feature values on the one hand and highlights the salient features on the other, so that large feature values are distinguished from small feature values more markedly.
The expression of the feature splicing unit is as follows:
wherein H is the output feature of the feature splicing unit, Maxpool is the max pooling operation, Avgpool is the average pooling operation, X_Q is the output feature of the singular value feature processing network, X_S is the output feature of the time-frequency feature processing network, and the product in the expression is the Hadamard product.
When the features are spliced, the max pooling operation and the average pooling operation extract the maximum feature and the average feature respectively, further simplifying the features.
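Because the exact splicing expression is not reproduced here, the sketch below shows one plausible reading as an assumption only: the Hadamard product of the two network outputs, followed by max pooling and average pooling, flattened into the feature-value sequence fed to the first LSTM unit.

```python
# Assumed reading of the feature splicing unit, for illustration only.
import torch
import torch.nn.functional as F

def feature_splice(x_q: torch.Tensor, x_s: torch.Tensor) -> torch.Tensor:
    # x_q: output of the singular value feature processing network
    # x_s: output of the time-frequency feature processing network
    hadamard = x_q * x_s                                   # Hadamard product
    pooled = F.max_pool2d(hadamard, 2) + F.avg_pool2d(hadamard, 2)
    return pooled.flatten(start_dim=1)                     # feature value sequence H
```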
The expression of the output unit is:
wherein y_k is the output value of the k-th output unit, h_{j,k} is the j-th feature value output by the corresponding Cell module in the second LSTM unit, w_{j,k} is the weight of h_{j,k}, b_{j,k} is the bias of h_{j,k}, N is the number of feature values output by a Cell module in the second LSTM unit, j and k are positive integers, and sigmoid is the activation function.
In the invention, the output unit considers all the feature values output by its Cell module in the second LSTM unit and applies a sigmoid activation function, outputting the amplitude value of one discrete point of the EMD estimation signal.
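A sketch of one output unit under this description follows; treating b_{j,k} as a simple additive term inside the weighted sum is an assumption made for illustration.

```python
# Sketch of a single output unit: weighted sum of the N feature values plus sigmoid.
import torch
import torch.nn as nn

class OutputUnit(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.w = nn.Parameter(torch.randn(n_features))   # w_{j,k}
        self.b = nn.Parameter(torch.zeros(n_features))   # b_{j,k}

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, N) feature values from one Cell module of the second LSTM unit
        return torch.sigmoid((self.w * h + self.b).sum(dim=1))   # y_k
```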
In this embodiment, the deep learning model needs to be trained before use. The training samples are the singular value feature matrices and time-frequency feature matrices of the EMD original signals, and the training sample labels are the amplitude value vectors formed by the amplitude values of each EMD original signal obtained after decomposing the original dolphin sound.
The loss function of the deep learning model during training is as follows:
wherein L is the loss function, y_k is the output value of the k-th output unit, g_k is the k-th label value, the plurality of output values y_k form an EMD estimation signal, r_k is the k-th training parameter, M is the number of output values, and k is a positive integer.
The expression of the kth training parameter is:
where exp is an exponential function based on a natural constant.
The deep learning model outputs a plurality of values y_k, and these output values together form an EMD estimation signal. Because the deep learning model of the invention uses two layers of LSTM and LSTM training takes a long time, the training parameters are set to strengthen the loss function and accelerate the training of the deep learning model, thereby shortening the training time.
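The exact expressions for L and r_k are not reproduced here, so the sketch below is purely an assumed form consistent with the description: a weighted squared error whose training parameter r_k grows exponentially with the current error, strengthening the loss.

```python
# Assumed loss form, for illustration only; not the patent's exact expressions.
import torch

def assumed_loss(y: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    # y: (batch, M) output values y_k, g: (batch, M) label values g_k
    r = torch.exp(torch.abs(y - g)).detach()     # assumed training parameters r_k
    return torch.mean(r * (y - g) ** 2)          # assumed form of L
```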
According to the method, an EMD decomposition algorithm is adopted to decompose the original dolphin sound into a plurality of EMD original signals, each containing part of the sound characteristics; singular values and time-frequency features are extracted to obtain the sound characteristics of each EMD original signal; a deep learning model processes the singular value feature matrix and the time-frequency feature matrix to generate EMD estimation signals; and the EMD estimation signals are superposed and combined to obtain the dolphin sound. Because the original dolphin sound is decomposed, each EMD original signal contains only part of the characteristics of the original dolphin sound, the data volume is smaller, and the complexity is lower, so the precision of the generated dolphin sound can be improved.
The above is only a preferred embodiment of the invention and is not intended to limit the invention; various modifications and variations can be made to the invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the invention shall be included in the protection scope of the invention.

Claims (10)

1. A dolphin sound generation method based on deep learning, characterized by comprising the following steps:
s1, decomposing original dolphin sound by adopting an EMD decomposition algorithm to obtain a plurality of EMD original signals;
s2, extracting singular values from each EMD original signal, and constructing a singular value feature matrix;
s3, extracting time-frequency characteristics of each EMD original signal, and constructing a time-frequency characteristic matrix;
s4, processing the singular value feature matrix and the time-frequency feature matrix by adopting a deep learning model to generate an EMD estimation signal;
s5, combining the EMD estimation signals to generate dolphin sound.
2. The deep learning based dolphin sound generation method of claim 1 wherein said S2 comprises the sub-steps of:
s21, constructing a corresponding track matrix according to each EMD original signal;
s22, performing singular value decomposition on the track matrix to obtain a singular value eigenvector;
s23, constructing a singular value feature matrix according to the singular value feature vector, wherein A = a^T a, A is the singular value feature matrix, a is the singular value feature vector, and T is the transposition operation.
3. The deep learning based dolphin sound generation method of claim 1 wherein said S3 comprises the sub-steps of:
s31, extracting time domain features of each EMD original signal, wherein the time domain features comprise: peak-to-peak, skewness, kurtosis, and form factor;
s32, carrying out frequency domain transformation on each EMD original signal to obtain a frequency domain signal;
s33, extracting frequency domain characteristics of the frequency domain signals, wherein the frequency domain characteristics comprise: spectrum amplitude mean value, spectrum amplitude center of gravity, power spectrum density and cepstrum coefficient;
s34, constructing a time-frequency feature vector by taking the time-domain features and the frequency-domain features as elements;
s35, constructing a time-frequency feature matrix according to the time-frequency feature vector, wherein B = b^T b, B is the time-frequency feature matrix, b is the time-frequency feature vector, and T is the transposition operation.
4. The deep learning-based dolphin sound generation method of claim 1, wherein said deep learning model in S4 comprises: a singular value feature processing network, a time-frequency feature processing network, a feature splicing unit, a first LSTM unit, a Concat layer, a second LSTM unit and a plurality of output units;
the input end of the singular value feature processing network is used for inputting a singular value feature matrix, and the output end of the singular value feature processing network is connected with the first input end of the feature splicing unit; the input end of the time-frequency characteristic processing network is used for inputting a time-frequency characteristic matrix, and the output end of the time-frequency characteristic processing network is connected with the second input end of the characteristic splicing unit; the output ends of the characteristic splicing units are respectively connected with the input ends of a plurality of Cell modules in the first LSTM unit; the input ends of the Concat layer are respectively connected with the output ends of the plurality of Cell modules in the first LSTM unit, and the output ends of the Concat layer are respectively connected with the input ends of the plurality of Cell modules in the second LSTM unit; the input end of each output unit is connected with the output end of one Cell module in the second LSTM unit, and the output end of each output unit is used as the output end of the deep learning model; each Cell module in the first LSTM unit is used for inputting a characteristic value in the output characteristics of the characteristic splicing unit; each Cell module in the second LSTM unit is configured to input an output feature of the Concat layer.
5. The deep learning-based dolphin sound generation method of claim 4, wherein said singular value feature processing network and said time-frequency feature processing network are identical in structure, each comprising: a first convolution block, a second convolution block, a third convolution block, a first upsampling layer, a second upsampling layer, a third upsampling layer, an adder A1 and a feature saliency processing layer;
the input end of the first convolution block is used as the input end of a singular value feature processing network or a time-frequency feature processing network, and the output end of the first convolution block is respectively connected with the input end of the first up-sampling layer and the input end of the second convolution block; the output end of the second convolution block is respectively connected with the input end of the second up-sampling layer and the input end of the third convolution block; the output end of the third convolution block is connected with the input end of the third up-sampling layer; the input end of the adder A1 is respectively connected with the output end of the first upsampling layer, the output end of the second upsampling layer and the output end of the third upsampling layer, and the output end of the adder A1 is connected with the input end of the characteristic significant processing layer; and the output end of the characteristic significant processing layer is used as the output end of a singular value characteristic processing network or a time-frequency characteristic processing network.
6. The deep learning based dolphin sound generation method of claim 5 wherein said feature saliency processing layer is expressed as:
wherein x_{i,z} is the i-th feature value output by the feature saliency processing layer, x_i is the i-th feature value input to the feature saliency processing layer, x_max is the maximum feature value among the output features of the adder A1, and i is a positive integer.
7. The deep learning based dolphin sound generation method of claim 4, wherein said feature splicing unit has the expression:
wherein H is the output feature of the feature splicing unit, Maxpool is the max pooling operation, Avgpool is the average pooling operation, X_Q is the output feature of the singular value feature processing network, X_S is the output feature of the time-frequency feature processing network, and the product in the expression is the Hadamard product.
8. The deep learning based dolphin sound generation method of claim 4 wherein said output unit has the expression:
wherein y_k is the output value of the k-th output unit, h_{j,k} is the j-th feature value output by the corresponding Cell module in the second LSTM unit, w_{j,k} is the weight of h_{j,k}, b_{j,k} is the bias of h_{j,k}, N is the number of feature values output by a Cell module in the second LSTM unit, j and k are positive integers, and sigmoid is the activation function.
9. The deep learning-based dolphin sound generation method of claim 1, wherein said deep learning model in S4 has a loss function during training of:
wherein L is the loss function, y_k is the output value of the k-th output unit, g_k is the k-th label value, the plurality of output values y_k form an EMD estimation signal, r_k is the k-th training parameter, M is the number of output values, and k is a positive integer.
10. The deep learning based dolphin sound generation method of claim 9 wherein said kth training parameter is expressed as:
where exp is an exponential function based on a natural constant.
CN202410091532.4A 2024-01-23 2024-01-23 Deep learning-based dolphin sound generation method Active CN117612513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410091532.4A CN117612513B (en) 2024-01-23 2024-01-23 Deep learning-based dolphin sound generation method

Publications (2)

Publication Number Publication Date
CN117612513A true CN117612513A (en) 2024-02-27
CN117612513B CN117612513B (en) 2024-04-26

Family

ID=89960238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410091532.4A Active CN117612513B (en) 2024-01-23 2024-01-23 Deep learning-based dolphin sound generation method

Country Status (1)

Country Link
CN (1) CN117612513B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192504A (en) * 2021-04-29 2021-07-30 浙江大学 Domain-adaptation-based silent voice attack detection method
CN116863959A (en) * 2023-09-04 2023-10-10 哈尔滨工业大学(威海) Dolphin sound generating method based on generating countermeasure network
KR20230167617A (en) * 2022-06-02 2023-12-11 계명대학교 산학협력단 Method and apparatus for generating multi-scale feature-based transformer models

Also Published As

Publication number Publication date
CN117612513B (en) 2024-04-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant