CN115188387A - Effective marine mammal sound automatic detection and classification method - Google Patents

Effective marine mammal sound automatic detection and classification method

Info

Publication number
CN115188387A
Authority
CN
China
Prior art keywords
probability
grained
layer
coarse
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210817343.1A
Other languages
Chinese (zh)
Other versions
CN115188387B (en)
Inventor
李丹阳
李军
蒋凯林
郑兴泽
李焦
明扬
李林成
谢天宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Agricultural University
Original Assignee
Sichuan Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Agricultural University filed Critical Sichuan Agricultural University
Priority to CN202210817343.1A priority Critical patent/CN115188387B/en
Publication of CN115188387A publication Critical patent/CN115188387A/en
Application granted granted Critical
Publication of CN115188387B publication Critical patent/CN115188387B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an effective automatic detection and classification method for marine mammal sounds. Marine mammal audio data are first augmented with a single-sample variational autoencoder. Mel cepstral coefficients are extracted from the audio data and spliced with the onset strength envelope to obtain the first input feature, while an audio fingerprint feature is extracted from the audio data, by means of audio fingerprint extraction, as the second input feature. The Mel frequency cepstral coefficients and the audio fingerprint features are fed into a two-way fusion network, which outputs two prediction results, and the two predictions are fused to obtain the final result for detecting and classifying the marine mammals. Through this two-way parallel fusion network structure, the fused network can simultaneously capture high-dimensional features and exploit timing information, and the complementarity between the models, arising from the different information each network attends to, is used to improve model performance efficiently.

Description

Effective marine mammal sound automatic detection and classification method
Technical Field
The invention relates to the technical field of marine acoustics application, in particular to an effective automatic detection and classification method for marine mammal sound.
Background
Statistics show that about 2.1% of the world's mammal species have gone extinct since the year 1600. Expert analyses indicate that the rate of species extinction keeps accelerating and is now roughly 100 to 1,000 times the previously estimated background rate. Of the roughly 1.739 million species recorded worldwide, about 130 are marine mammals; more than 90 of these are whales and dolphins, and nearly 40 belong to other groups such as the pinnipeds and sirenians. Of the 20,278 species recorded in Chinese sea areas, nearly 50 are aquatic mammals (including introduced species); up to 41 of these are whales and dolphins, 5 are pinnipeds (excluding introduced species), and 1 is a sirenian. Marine mammals are among the most endangered species in nature and are listed as protected animals in almost every country in the world.
Although marine mammals are few in number, the role they play in maintaining the balance of the marine ecosystem is far from trivial, and their protection is of paramount importance. In recent years, however, their survival has continued to face serious challenges due to poorly understood species resources, degraded habitats, water pollution and similar problems. The traditional approach of identifying these mammals manually is laborious, inefficient and time-consuming, incurs high material and labor costs, produces data that are huge and difficult to process, cannot monitor marine mammals continuously, suffers from time lag, and carries a certain amount of danger. Sample collection is highly contingent and random and requires long periods of accumulation; field surveys and passive acoustic monitoring demand large investments of manpower, finance and material, and are often difficult to carry out quickly, frequently and at scale. Because many marine mammals inhabit sea areas rarely visited by humans and are highly mobile, manual identification is very difficult. How to detect, identify and classify marine mammals automatically is therefore a pressing question.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing an effective automatic detection and classification method for marine mammal sounds that solves the problems of the traditional manual identification of mammals.
The purpose of the invention is realized by the following technical scheme: an effective marine mammal sound automatic detection and classification method, the automatic detection and classification method comprising:
performing data augmentation, with a single-sample variational autoencoder, on publicly available and field-collected marine mammal audio data;
extracting Mel cepstral coefficients and the onset strength envelope from the augmented audio data and splicing them to obtain a first input feature, and extracting an audio fingerprint feature as a second input feature from the augmented audio data by means of audio fingerprint extraction;
feeding the Mel frequency cepstral coefficients and the audio fingerprint features into a two-way fusion MG-ResFormer network, which outputs two sample-class probability predictions, and fusing the two predictions to obtain the final marine mammal detection and classification result.
Extracting the audio fingerprint feature (the second input feature) from the augmented audio data by means of audio fingerprint extraction comprises:
dividing the audio data into several original subframes of equal size, and applying a Fourier transform to the data of each original subframe to compute its spectrum;
dividing each computed atomic spectrum into several spectral bands, computing an energy block for each band, and combining all energy blocks into a two-dimensional matrix representing the atomic-spectrum energy information;
performing a difference calculation on the two-dimensional matrix and, by comparing each energy block with its neighbouring blocks, obtaining a 0-1 matrix containing only 0s and 1s;
splicing the two-dimensional matrix with the 0-1 matrix, which contains the animals' vocalization information, to obtain the audio fingerprint feature.
The two-way fusion MG-ResFormer network comprises an MG-Resnet network model, an MG-Transformer network model and a fusion layer. The Mel frequency cepstral coefficients are input into the MG-Transformer network model to obtain its probability matrix for the sample, the audio fingerprint features are input into the MG-Resnet network model to obtain its probability matrix for the sample, and the fusion layer fuses the probability matrices output by the two network models to obtain the final marine mammal detection and classification result.
The MG-Resnet network model comprises five convolutional modules, a pooling layer, two fully connected layers and a coarse-fine granularity combination module. The input audio fingerprint features first pass through a 7 x 7 convolution in the first convolutional module; the second to fifth convolutional modules, each containing two building blocks, perform residual convolutions followed by average pooling; the two outputs obtained after the two parallel fully connected layers are then fed into the coarse-fine granularity combination layer, finally yielding the probability matrix for the sample.
The MG-Transformer network model first pools the input Mel frequency cepstral coefficients to reduce the sensitivity of the features to position, then feeds them into an encoding layer where a multi-head attention mechanism extracts different feature signals; the features are segmented to strengthen the model's attention to global features, coarse-grained and fine-grained probabilities are extracted by two linear layers, and finally these are fed into the coarse-fine granularity combination layer to obtain the probability matrix for the sample.
The fusion layer fusing the probability matrices output by the two network models to obtain the final marine mammal detection and classification result comprises:
arranging 9 neurons in the fusion layer, the 9-class probability values output by the two network models passing respectively through the 9 neurons and 9 pseudo-neurons;
multiplying the class probability values output by one network model directly by the 9 neurons and the class probability values output by the other network model by the 9 pseudo-neurons, then adding the two resulting groups of probability values and normalizing them to obtain the final probabilities.
Obtaining the probability matrix of the sample through the coarse-fine granularity combination layer in the MG-Transformer network model and the MG-Resnet network model comprises:
mapping tensors whose lengths equal the number of fine-grained classes and the number of coarse-grained classes through two parallel fully connected layers in the MG-Resnet network model, and mapping tensors of the same lengths through two parallel linear layers in the MG-Transformer network model;
the coarse-grained layer of the coarse-fine combination layer applies a softmax function to its input to obtain the probability of the coarse-grained class to which the sample belongs; the fine-grained layer groups its input, placing fine-grained classes belonging to the same coarse-grained class in one group, applies a softmax operation to each group, and finally multiplies the obtained coarse-grained probability by the corresponding fine-grained probability to obtain the class probability information p1;
through a residual-like structure, a softmax operation is also applied directly to the input of the fine-grained layer, and the result is weighted by ε and added to the matrix containing p1, i.e. p2 = p1 + ε * p0, where p0 denotes the probability obtained by applying softmax directly to the input of the fine-grained layer;
finally p2 is normalized to obtain the final probability matrix of the sample, p3 = p2 / Σp2.
The invention has the following advantages. Unlike traditional features, the constructed audio fingerprint contains a large amount of energy, frequency and timing information of the sound signal and captures the distinctive vocal behaviour of different species, which proves advantageous in a convolutional neural network. A multi-granularity combination layer is built to assist the multi-class task: following the "kingdom-phylum-class-order-family-genus-species" taxonomy, a coarse-grained layer and a fine-grained layer are constructed for the species to be identified, with the coarse-grained layer corresponding to the "family" level and the fine-grained layer to the "genus" level, so that the prior judgment at coarse granularity consolidates the decision at fine granularity; this multi-granularity fusion layer is highly general and applicable to other studies. Through the two-way parallel fusion network structure, the fused network can both capture high-dimensional features and exploit timing information, and the complementarity between models, arising from the different information each network attends to, is used to improve model performance efficiently.
Drawings
FIG. 1 is a schematic structural diagram of a two-way MG-ResFormer network of the present invention;
FIG. 2 is a schematic view of a coarse-fine grain composite layer;
FIG. 3 is a schematic diagram of the MG-Resnet network model;
FIG. 4 is a schematic structural view of a fused layer;
FIG. 5 is a comparison of the effects of various network models.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, as generally described and illustrated in the figures herein, could be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application provided below in connection with the appended drawings is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, an effective automatic detection and classification method for marine mammal sounds comprises:
s1, carrying out data enhancement processing on the audio data of the existing open and field collected marine mammals through a single-sample variation self-encoder;
wherein the public marine mammal audio data cover 9 classes of marine mammals from the Watkins Marine Mammal Sound Database, which provides marine mammal sound recordings from 1940 to 2000; to facilitate audio fingerprint generation and keep the network input size uniform, audio of different lengths is split evenly into 2 s segments;
in order to prevent the Sample cloth distribution from causing negative influence on the learning of the model, the phenomena of blurring and chaos (image superposition is serious due to posterior collapse) which can occur in an image generated by a common VAE (Plain VAE) are solved by a single Sample variation auto-encoder (S3 VAE).
S2, extracting Mel cepstral coefficients and the onset strength envelope from the augmented audio data and splicing them to obtain the first input feature, and extracting an audio fingerprint feature as the second input feature from the augmented audio data by means of audio fingerprint extraction;
further, audio feature extraction can reduce the sampling signal of the original waveform, thereby accelerating the understanding of the semantic meaning in the audio by the machine. In order to obtain the audio features with the best effect, 9 features of the audio data mainstream are extracted: chromatographic information, constant Q chromatographic information, normalized chromatographic information, mel-frequency spectral information, mel-cepstral information, spectral contrast, tonal centroid, local autocorrelation of the initial intensity envelope, fourier-velocimetry.
First, pre-emphasis is applied to each sample to boost the energy of the high-frequency part of the signal. Given a time-domain input signal x[n], the pre-emphasized signal is y[n] = x[n] - αx[n-1], with 0.9 ≤ α ≤ 1.0.
To facilitate the extraction of the subsequent features, the signal values at the window boundaries should be close to 0, so that each frame approximates a periodic signal; a window function is applied for this purpose:
(The window function formula appears only as an image in the original document; it is a tapering window whose values approach zero at the frame boundaries.)
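A minimal sketch of this preprocessing step is given below. The frame length, hop size and the use of a Hann window are assumptions made for illustration, since the patent gives the window formula only as an image.

    import numpy as np

    def preemphasize(x, alpha=0.97):
        # Pre-emphasis: y[n] = x[n] - alpha * x[n-1], with 0.9 <= alpha <= 1.0.
        return np.append(x[0], x[1:] - alpha * x[:-1])

    def window_frames(x, frame_len=1024, hop=512):
        # Split the signal into frames and apply a tapering (Hann) window so that
        # values at the frame boundaries approach zero, as described above.
        n_frames = 1 + (len(x) - frame_len) // hop
        win = np.hanning(frame_len)
        return np.stack([x[i * hop:i * hop + frame_len] * win for i in range(n_frames)])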
After signal preprocessing, the nine features above are extracted with the open-source Python library librosa. Because each feature attends to different aspects of the signal, the features are combined, the combinations are trained, and their relevance is verified; in the end two features are selected and spliced together as one of the inputs: the Mel-frequency cepstral coefficients (MFCCs) and the onset strength envelope.
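As an illustration, the first input feature could be assembled with librosa roughly as follows; the sampling rate, number of MFCCs and hop length are illustrative assumptions rather than values stated in the patent.

    import librosa
    import numpy as np

    def first_input_feature(path, sr=22050, n_mfcc=40, hop_length=512):
        # Load a 2-second clip, extract MFCCs and the onset strength envelope,
        # and splice the envelope under the MFCC matrix as one extra feature row.
        y, sr = librosa.load(path, sr=sr, duration=2.0)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
        onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
        return np.vstack([mfcc, onset_env[np.newaxis, :]])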
S3, feeding the Mel frequency cepstral coefficients and the audio fingerprint features into the two-way fusion MG-ResFormer network, which outputs two sample-class probability predictions, and fusing the two predictions to obtain the final marine mammal detection and classification result.
Further, as shown in fig. 2, extracting the audio fingerprint feature (the second input feature) from the augmented audio data by means of audio fingerprint extraction comprises:
dividing the audio data into several original subframes of equal size, and applying a Fourier transform to the data of each original subframe to compute its spectrum;
dividing each computed atomic spectrum into several spectral bands, computing an energy block for each band, and combining all energy blocks into a two-dimensional matrix representing the atomic-spectrum energy information;
performing a difference calculation on the two-dimensional matrix and, by comparing each energy block with its neighbouring blocks, obtaining a 0-1 matrix containing only 0s and 1s;
splicing the two-dimensional matrix with the 0-1 matrix, which contains the animals' vocalization information, to obtain the audio fingerprint feature.
Specifically, conventional audio features such as the Mel spectrogram, MFCC and chromagram usually relate to only a single kind of information in the voiceprint. The invention therefore constructs a new voiceprint feature containing frequency, energy and timing information, which strengthens the weak points of any single signal and also reflects the distinctiveness of different species' vocalizations.
The invention also minimizes the influence of overly long or overly short audio on the final classification. In constructing the audio fingerprint it therefore uses an atomic frame stream strategy: the original audio is divided into original subframes of equal size, a series of transformations is applied to these atomic frames to obtain atomic features, and the final feature is obtained by combining them. A single atomic feature may not carry enough information for the model to make an identification on its own, but the audio to be identified normally consists of hundreds or thousands of original subframes, which together contain enough atomic frames for efficient and reliable identification.
Then, a Fourier transform is applied to the data of each original subframe to compute its spectrum:
F(e^jω) = ∫ x(t)·e^(-jωt) dt
F(e^jω) = a + ib
|F(e^jω)| = √(a² + b²)
where j is the imaginary unit, ω is the angular frequency and t is time. To make the energy information more precise, each atomic spectrum is divided into 65 spectral bands, and an energy block is obtained by a calculation over each spectral band (the energy-block formula appears only as an image in the original).
Combining the 65 energy blocks thus yields the energy information of the atomic spectrum. The energy information of all atomic spectra forms a two-dimensional matrix, which preserves a certain amount of timing information, an important property of the audio fingerprint.
A difference calculation is then performed on the two-dimensional matrix: by comparing each energy block with its neighbouring blocks, a matrix containing only 0s and 1s is obtained (the difference formula appears only as an image in the original).
Here m and n denote the coordinates of the two-dimensional matrix, n being the coordinate in the x direction and m the coordinate in the y direction, so that every energy block can be located in the matrix. The 0-1 matrix carries the animals' vocalization information, so it is spliced with the two-dimensional energy matrix to obtain the final fingerprint feature.
Further, as shown in fig. 1, the two-way fusion MG-ResFormer network comprises an MG-Resnet network model, an MG-Transformer network model and a fusion layer. The Mel frequency cepstral coefficients are fed into the MG-Transformer network model to obtain its probability matrix for the sample, the audio fingerprint features are fed into the MG-Resnet network model to obtain its probability matrix for the sample, and the fusion layer fuses the probability matrices output by the two network models to obtain the final marine mammal detection and classification result.
The audio fingerprint features of the audio are fed into the MG-Resnet network model, whose loss function is denoted Loss1; the Mel frequency cepstral coefficient features of the audio are fed into the MG-Transformer network model, whose loss function is denoted Loss2; in the fusion module, the probabilities output by the MG-Resnet and MG-Transformer network models are taken as input and fitted to the one-hot-encoded label information, and this loss is denoted Loss3. The final loss function is:
Loss = Loss1 + Loss2 + Loss3
The fingerprint feature of an audio sample and the spliced feature of the Mel cepstral coefficients (MFCC) and the onset strength envelope (as described above) are fed into MG-Resnet and MG-Transformer respectively. After forward propagation through the two-way network, two matrices predicting the sample class probabilities are output, and at this point the two 1 x 9 prediction matrices are frozen. Because these two matrices are produced by the computations of the two branches, back-propagating Loss3 through them would change the parameters of the two-branch network; Loss1 and Loss2 are responsible for updating the two branches, so freezing prevents the back-propagation of Loss3 from negatively influencing them.
This loss design is sound: the losses drive gradient back-propagation, and after working out the partial derivatives in detail, Loss1, Loss2 and Loss3 are each responsible only for the back-propagation of the module to which they belong.
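A minimal PyTorch sketch of one training step under this loss design is shown below. Here mg_resnet, mg_transformer and fusion denote the full branch models and fusion layer (each branch already outputting its probability matrix), and the use of negative-log-likelihood losses is an assumption for illustration; the point of the sketch is the detaching (freezing) of the two 1 x 9 prediction matrices before the fusion layer.

    import torch
    import torch.nn.functional as F

    def train_step(mg_resnet, mg_transformer, fusion, optimizer, fingerprint, mfcc_feat, labels):
        p_res = mg_resnet(fingerprint)        # 1 x 9 probability matrix from MG-Resnet
        p_trans = mg_transformer(mfcc_feat)   # 1 x 9 probability matrix from MG-Transformer

        loss1 = F.nll_loss(torch.log(p_res + 1e-8), labels)
        loss2 = F.nll_loss(torch.log(p_trans + 1e-8), labels)

        # Freeze (detach) the two prediction matrices so that back-propagating Loss3
        # cannot change the parameters of the two branch networks.
        p_fused = fusion(p_res.detach(), p_trans.detach())
        loss3 = F.nll_loss(torch.log(p_fused + 1e-8), labels)

        loss = loss1 + loss2 + loss3          # Loss = Loss1 + Loss2 + Loss3
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()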
As shown in fig. 3, the MG-Resnet network model comprises five convolutional modules, one pooling layer, two fully connected layers and one coarse-fine granularity combination module. The input audio fingerprint features first undergo a 7 x 7 convolution in the first convolutional module; the second to fifth convolutional modules, each containing two building blocks, perform residual convolutions followed by average pooling; the two outputs obtained after the two parallel fully connected layers are then fed into the coarse-fine granularity combination layer, finally yielding the probability matrix for the sample.
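A compact sketch of how such a branch could be laid out in PyTorch is shown below; a standard torchvision ResNet-18 is used as a stand-in for the five convolutional modules (its first convolution is 7 x 7 and its four residual stages each contain two building blocks), and the number of coarse-grained classes is an assumed value.

    import torch.nn as nn
    from torchvision.models import resnet18

    class MGResnetBackbone(nn.Module):
        # Backbone plus two parallel fully connected heads; their outputs are
        # fed to the coarse-fine granularity combination layer (sketched further below).
        def __init__(self, n_fine=9, n_coarse=3, in_channels=1):
            super().__init__()
            net = resnet18()
            net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
            self.features = nn.Sequential(*list(net.children())[:-1])   # conv stages + average pooling
            self.fc_fine = nn.Linear(512, n_fine)      # length = number of fine-grained classes
            self.fc_coarse = nn.Linear(512, n_coarse)  # length = number of coarse-grained classes

        def forward(self, x):
            h = self.features(x).flatten(1)
            return self.fc_fine(h), self.fc_coarse(h)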
The MG-Transformer network model first pools the input Mel frequency cepstral coefficients to reduce the sensitivity of the features to position, then feeds them into an encoding layer where a multi-head attention mechanism extracts different feature signals; the features are segmented to strengthen the model's attention to global features, coarse-grained and fine-grained probabilities are extracted by two linear layers, and finally these are fed into the coarse-fine granularity combination layer to obtain the probability matrix for the sample. This design keeps recognition efficient across a large number of call classification tasks, perceives global features better, and performs particularly well on multi-class tasks.
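A rough PyTorch sketch of this branch follows; the feature dimension (40 MFCCs plus the onset strength envelope), model width, number of heads and number of layers are all assumptions made for illustration.

    import torch.nn as nn

    class MGTransformerBackbone(nn.Module):
        # Pooling, a Transformer encoder with multi-head attention, and two parallel
        # linear heads whose outputs go to the coarse-fine granularity combination layer.
        def __init__(self, n_feat=41, d_model=128, n_heads=4, n_layers=2, n_fine=9, n_coarse=3):
            super().__init__()
            self.pool = nn.AvgPool1d(kernel_size=2)       # reduce sensitivity to position
            self.proj = nn.Linear(n_feat, d_model)
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.fc_fine = nn.Linear(d_model, n_fine)
            self.fc_coarse = nn.Linear(d_model, n_coarse)

        def forward(self, x):                  # x: (batch, n_feat, time)
            x = self.pool(x)                   # pool along the time axis
            h = self.proj(x.transpose(1, 2))   # (batch, time, d_model) token sequence
            h = self.encoder(h).mean(dim=1)    # attend over segments, then aggregate
            return self.fc_fine(h), self.fc_coarse(h)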
As shown in fig. 2, obtaining the probability matrix of the sample through the coarse-fine granularity combination layer in the MG-Transformer network model and the MG-Resnet network model comprises:
Tensors whose lengths equal the number of fine-grained classes and the number of coarse-grained classes are mapped through two parallel fully connected layers in the MG-Resnet network model, and through two parallel linear layers in the MG-Transformer network model. The data used here still carry only one label, the fine-grained label; coarse-grained information is captured by means of the coarse-grained layer.
The coarse-grained layer of the coarse-fine combination layer applies a softmax function to its input to obtain the probability of the coarse-grained class to which the sample belongs; the fine-grained layer groups its input, placing fine-grained classes belonging to the same coarse-grained class in one group, applies a softmax operation to each group, and finally multiplies the obtained coarse-grained probability by the corresponding fine-grained probability to obtain the class probability information. The following formula shows the calculation for the killer whale class:
p1(Killer whale) = p(Killer whale | Whale) * p(Whale)
In addition, a special case must be considered: the coarse-grained layer may output an incorrect coarse-grained classification probability. Although the chance of this is extremely low, a residual-like structure is used to handle it: a softmax operation is applied directly to the input of the fine-grained layer, and the result, weighted by ε, is added to the matrix containing p1, i.e. p2(Killer whale) = p1(Killer whale) + ε * p0(Killer whale);
where p0(Killer whale) denotes the probability obtained by applying softmax directly to the input of the fine-grained layer;
finally p2(Killer whale) is normalized to obtain the final probability for the sample: p3(Killer whale) = p2(Killer whale) / Σp2.
In actual training, the neural network gradually learns the correct features: the more correctly a fine-grained class is assigned to its coarse-grained class, the more likely the fine-grained class itself is classified correctly. Although the invention uses only fine-grained labels, the final classification effect is still very satisfactory.
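The combination layer's computation can be sketched as follows; the value of ε and the mapping from fine-grained species to coarse-grained groups are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def coarse_fine_combine(fine_logits, coarse_logits, groups, eps=0.1):
        # groups[i] is the coarse-grained class of fine-grained class i
        # (e.g. killer whale -> whale).
        p_coarse = F.softmax(coarse_logits, dim=-1)   # probability of each coarse class
        p0 = F.softmax(fine_logits, dim=-1)           # direct softmax on the fine-layer input

        # Grouped softmax over fine classes sharing a coarse class, multiplied by the
        # coarse probability: p1(Killer whale) = p(Killer whale | Whale) * p(Whale).
        p1 = torch.zeros_like(p0)
        for c in range(coarse_logits.shape[-1]):
            idx = [i for i, g in enumerate(groups) if g == c]
            if idx:
                p1[..., idx] = F.softmax(fine_logits[..., idx], dim=-1) * p_coarse[..., c:c + 1]

        p2 = p1 + eps * p0                            # residual-like correction: p2 = p1 + eps * p0
        return p2 / p2.sum(dim=-1, keepdim=True)      # p3 = p2 / sum(p2)

For instance, with 9 fine-grained species divided into 3 coarse-grained groups one might pass groups = [0, 0, 0, 0, 1, 1, 1, 2, 2]; this grouping is purely illustrative.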
As shown in fig. 4, the fusion layer fuses the probability matrices output by the two network models to obtain the final marine mammal detection and classification result as follows:
9 neurons are arranged in the fusion layer, and the 9-class probability values output by the two network models (denoted input1 and input2) pass respectively through the 9 neurons and 9 pseudo-neurons;
the class probability values output by one network model are multiplied directly by the 9 neurons, the class probability values output by the other network model are multiplied by the 9 pseudo-neurons, and the two resulting groups of probability values are added and normalized to obtain the final probabilities.
There are likewise 9 pseudo-neurons, the value of each depending on the value of the neuron at the corresponding position, as given by the following formula, where β_Ti denotes a neuron that participates in training and β_Fi denotes a pseudo-neuron that does not participate in training:
β_Ti = 1 - β_Fi
To ensure that the fusion structure has a correct influence, a residual-like structure is again used: input1 and input2 are added to the final output probabilities, and a normalization operation is performed to obtain the final output.
The two-way fusion MG-ResFormer network was tested, and the table below shows that it achieves excellent performance on the nine-class marine mammal audio classification task: ACC, AUC, mAP and F1-score reach 99.09, 99.99, 99.97 and 99.24 respectively. Compared with classification networks commonly used in this field, the network makes clear progress.
Network effect comparison table
Network/index ACC AUC mAP f1_score
MGResFormer 99.09 99.99 99.97 99.24
MGResnet18 97.27 99.93 99.35 96.54
MGTrans 96.36 99.85 98.63 95.76
InceptionV3 94.62 98.97 99.03 94.43
EfficientNet 95.39 99.14 99.02 95.78
As shown in fig. 5, the accuracy of MGResFormer almost always lies above that of MGResnet and MGTransformer, showing that the fusion of the two networks provides complementary gains.
The foregoing describes preferred embodiments of the invention. It should be understood that the invention is not limited to the precise forms disclosed herein, and that various other combinations, modifications and environments falling within the scope of the inventive concept, whether described above or apparent to those skilled in the relevant art, may be adopted. Modifications and variations made by those skilled in the art without departing from the spirit and scope of the invention shall fall within the protection scope of the appended claims.

Claims (7)

1. An effective marine mammal sound automatic detection and classification method, wherein the automatic detection and classification method comprises:
performing data augmentation, with a single-sample variational autoencoder, on publicly available and field-collected marine mammal audio data;
extracting Mel cepstral coefficients and the onset strength envelope from the augmented audio data and splicing them to obtain a first input feature, and extracting an audio fingerprint feature as a second input feature from the augmented audio data by means of audio fingerprint extraction;
feeding the Mel frequency cepstral coefficients and the audio fingerprint features into a two-way fusion MG-ResFormer network, which outputs two sample-class probability predictions, and fusing the two predictions to obtain the final marine mammal detection and classification result.
2. An effective method for automatic detection and classification of marine mammal sounds according to claim 1, wherein: extracting the audio fingerprint feature (the second input feature) from the augmented audio data by means of audio fingerprint extraction comprises:
dividing the audio data into several original subframes of equal size, and applying a Fourier transform to the data of each original subframe to compute its spectrum;
dividing each computed atomic spectrum into several spectral bands, computing an energy block for each band, and combining all energy blocks into a two-dimensional matrix representing the atomic-spectrum energy information;
performing a difference calculation on the two-dimensional matrix and, by comparing each energy block with its neighbouring blocks, obtaining a 0-1 matrix containing only 0s and 1s;
splicing the two-dimensional matrix with the 0-1 matrix, which contains the animals' vocalization information, to obtain the audio fingerprint feature.
3. An effective method for automatic detection and classification of marine mammal sounds according to claim 1, wherein: the two-way fusion MG-ResFormer network comprises an MG-Resnet network model, an MG-Transformer network model and a fusion layer; the Mel frequency cepstral coefficients are input into the MG-Transformer network model to obtain its probability matrix for the sample, the audio fingerprint features are input into the MG-Resnet network model to obtain its probability matrix for the sample, and the fusion layer fuses the probability matrices output by the two network models to obtain the final marine mammal detection and classification result.
4. An effective method for automatic detection and classification of marine mammal sounds according to claim 3, wherein: the MG-Resnet network model comprises five convolutional modules, a pooling layer, two fully connected layers and a coarse-fine granularity combination module; the input audio fingerprint features first pass through a 7 x 7 convolution in the first convolutional module, the second to fifth convolutional modules, each containing two building blocks, perform residual convolutions followed by average pooling, and the two outputs obtained after the two parallel fully connected layers are fed into the coarse-fine granularity combination layer, finally yielding the probability matrix for the sample.
5. An effective method for automatic detection and classification of marine mammal sounds according to claim 4, wherein: the MG-Transformer network model first pools the input Mel frequency cepstral coefficients to reduce the sensitivity of the features to position, then feeds them into an encoding layer where a multi-head attention mechanism extracts different feature signals, segments the features to strengthen the model's attention to global features, extracts coarse-grained and fine-grained probabilities through two linear layers, and finally feeds these into the coarse-fine granularity combination layer to obtain the probability matrix for the sample.
6. An effective marine mammal sound automatic detection and classification method according to claim 3, characterized in that: the fusion layer fusing the probability matrices output by the two network models to obtain the final marine mammal detection and classification result comprises:
arranging 9 neurons in the fusion layer, the 9-class probability values output by the two network models passing respectively through the 9 neurons and 9 pseudo-neurons;
multiplying the class probability values output by one network model directly by the 9 neurons and the class probability values output by the other network model by the 9 pseudo-neurons, then adding the two resulting groups of probability values and normalizing them to obtain the final probabilities.
7. An effective method for automatic detection and classification of marine mammal sounds according to claim 5, wherein: obtaining the probability matrix of the sample through the coarse-fine granularity combination layer in the MG-Transformer network model and the MG-Resnet network model comprises:
mapping tensors whose lengths equal the number of fine-grained classes and the number of coarse-grained classes through two parallel fully connected layers in the MG-Resnet network model, and mapping tensors of the same lengths through two parallel linear layers in the MG-Transformer network model;
the coarse-grained layer of the coarse-fine combination layer applies a softmax function to its input to obtain the probability of the coarse-grained class to which the sample belongs; the fine-grained layer groups its input, placing fine-grained classes belonging to the same coarse-grained class in one group, applies a softmax operation to each group, and finally multiplies the obtained coarse-grained probability by the corresponding fine-grained probability to obtain the class probability information p1;
through a residual-like structure, a softmax operation is also applied directly to the input of the fine-grained layer, and the result is weighted by ε and added to the matrix containing p1, i.e. p2 = p1 + ε * p0, where p0 denotes the probability obtained by applying softmax directly to the input of the fine-grained layer;
finally p2 is normalized to obtain the final probability matrix of the sample, p3 = p2 / Σp2.
CN202210817343.1A 2022-07-12 2022-07-12 Effective marine mammal sound automatic detection and classification method Active CN115188387B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210817343.1A CN115188387B (en) 2022-07-12 2022-07-12 Effective marine mammal sound automatic detection and classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210817343.1A CN115188387B (en) 2022-07-12 2022-07-12 Effective marine mammal sound automatic detection and classification method

Publications (2)

Publication Number Publication Date
CN115188387A true CN115188387A (en) 2022-10-14
CN115188387B CN115188387B (en) 2023-04-07

Family

ID=83517853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210817343.1A Active CN115188387B (en) 2022-07-12 2022-07-12 Effective marine mammal sound automatic detection and classification method

Country Status (1)

Country Link
CN (1) CN115188387B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116318867A (en) * 2023-02-15 2023-06-23 四川农业大学 Resource platform data transmission method based on out-of-order encryption and decryption
CN117174109A (en) * 2023-11-03 2023-12-05 青岛科技大学 Feature extraction-based marine mammal sound signal imitation hidden scoring method
CN117275491A (en) * 2023-11-17 2023-12-22 青岛科技大学 Sound classification method based on audio conversion and time diagram neural network

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030125946A1 (en) * 2002-01-03 2003-07-03 Wen-Hao Hsu Method and apparatus for recognizing animal species from an animal voice
CN103117061A (en) * 2013-02-05 2013-05-22 广东欧珀移动通信有限公司 Method and device for identifying animals based on voice
US9558762B1 (en) * 2011-07-03 2017-01-31 Reality Analytics, Inc. System and method for distinguishing source from unconstrained acoustic signals emitted thereby in context agnostic manner
CN109800700A (en) * 2019-01-15 2019-05-24 哈尔滨工程大学 A kind of underwater sound signal target classification identification method based on deep learning
CN110120224A (en) * 2019-05-10 2019-08-13 平安科技(深圳)有限公司 Construction method, device, computer equipment and the storage medium of bird sound identification model
CN110390952A (en) * 2019-06-21 2019-10-29 江南大学 City sound event classification method based on bicharacteristic 2-DenseNet parallel connection
CN110827837A (en) * 2019-10-18 2020-02-21 中山大学 Whale activity audio classification method based on deep learning
WO2021075709A1 (en) * 2019-10-14 2021-04-22 고려대학교 산학협력단 Apparatus and method for identifying animal species robustly against noisy environment
CN112750442A (en) * 2020-12-25 2021-05-04 浙江弄潮儿智慧科技有限公司 Nipponia nippon population ecosystem monitoring system with wavelet transformation and wavelet transformation method thereof
CN112802484A (en) * 2021-04-12 2021-05-14 四川大学 Panda sound event detection method and system under mixed audio frequency
CN112927694A (en) * 2021-03-08 2021-06-08 中国地质大学(武汉) Voice instruction validity judging method based on fusion voiceprint features
CN113345443A (en) * 2021-04-22 2021-09-03 西北工业大学 Marine mammal vocalization detection and identification method based on mel-frequency cepstrum coefficient
US20220036053A1 (en) * 2018-10-16 2022-02-03 Korea University Research And Business Foundation Method and apparatus for identifying animal species

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030125946A1 (en) * 2002-01-03 2003-07-03 Wen-Hao Hsu Method and apparatus for recognizing animal species from an animal voice
US9558762B1 (en) * 2011-07-03 2017-01-31 Reality Analytics, Inc. System and method for distinguishing source from unconstrained acoustic signals emitted thereby in context agnostic manner
CN103117061A (en) * 2013-02-05 2013-05-22 广东欧珀移动通信有限公司 Method and device for identifying animals based on voice
US20220036053A1 (en) * 2018-10-16 2022-02-03 Korea University Research And Business Foundation Method and apparatus for identifying animal species
CN109800700A (en) * 2019-01-15 2019-05-24 哈尔滨工程大学 A kind of underwater sound signal target classification identification method based on deep learning
CN110120224A (en) * 2019-05-10 2019-08-13 平安科技(深圳)有限公司 Construction method, device, computer equipment and the storage medium of bird sound identification model
CN110390952A (en) * 2019-06-21 2019-10-29 江南大学 City sound event classification method based on bicharacteristic 2-DenseNet parallel connection
WO2021075709A1 (en) * 2019-10-14 2021-04-22 고려대학교 산학협력단 Apparatus and method for identifying animal species robustly against noisy environment
CN110827837A (en) * 2019-10-18 2020-02-21 中山大学 Whale activity audio classification method based on deep learning
CN112750442A (en) * 2020-12-25 2021-05-04 浙江弄潮儿智慧科技有限公司 Nipponia nippon population ecosystem monitoring system with wavelet transformation and wavelet transformation method thereof
CN112927694A (en) * 2021-03-08 2021-06-08 中国地质大学(武汉) Voice instruction validity judging method based on fusion voiceprint features
CN112802484A (en) * 2021-04-12 2021-05-14 四川大学 Panda sound event detection method and system under mixed audio frequency
CN113345443A (en) * 2021-04-22 2021-09-03 西北工业大学 Marine mammal vocalization detection and identification method based on mel-frequency cepstrum coefficient

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHE YONG YEO ET AL: "Animal voice recognition for identification (ID) detection system", 2011 IEEE 7TH INTERNATIONAL COLLOQUIUM ON SIGNAL PROCESSING AND ITS APPLICATIONS *
SONJA SCHALL ET AL: "Voice Identity Recognition: Functional Division of the Right STS and Its Behavioral Relevance", JOURNAL OF COGNITIVE NEUROSCIENCE *
曹晏飞 (CAO Yanfei): "Research on sound classification and extraction methods for laying hens under complex background", CNKI Doctoral Dissertation Full-text Database *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116318867A (en) * 2023-02-15 2023-06-23 四川农业大学 Resource platform data transmission method based on out-of-order encryption and decryption
CN116318867B (en) * 2023-02-15 2023-11-28 四川农业大学 Resource platform data transmission method based on out-of-order encryption and decryption
CN117174109A (en) * 2023-11-03 2023-12-05 青岛科技大学 Feature extraction-based marine mammal sound signal imitation hidden scoring method
CN117174109B (en) * 2023-11-03 2024-02-02 青岛科技大学 Feature extraction-based marine mammal sound signal imitation hidden scoring method
CN117275491A (en) * 2023-11-17 2023-12-22 青岛科技大学 Sound classification method based on audio conversion and time diagram neural network
CN117275491B (en) * 2023-11-17 2024-01-30 青岛科技大学 Sound classification method based on audio conversion and time attention seeking neural network

Also Published As

Publication number Publication date
CN115188387B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN115188387B (en) Effective marine mammal sound automatic detection and classification method
Stöter et al. Countnet: Estimating the number of concurrent speakers using supervised learning
CN107731233B (en) Voiceprint recognition method based on RNN
Cakir et al. Multi-label vs. combined single-label sound event detection with deep neural networks
Samizade et al. Adversarial example detection by classification for deep speech recognition
CN110148408A (en) A kind of Chinese speech recognition method based on depth residual error
CN117095694B (en) Bird song recognition method based on tag hierarchical structure attribute relationship
Hussain et al. Swishnet: A fast convolutional neural network for speech, music and noise classification and segmentation
Leonid et al. Classification of Elephant Sounds Using Parallel Convolutional Neural Network.
CN113191178A (en) Underwater sound target identification method based on auditory perception feature deep learning
Yang et al. Classification of odontocete echolocation clicks using convolutional neural network
Qiao et al. Sub-spectrogram segmentation for environmental sound classification via convolutional recurrent neural network and score level fusion
Hou et al. Transfer learning for improving singing-voice detection in polyphonic instrumental music
CN114863938A (en) Bird language identification method and system based on attention residual error and feature fusion
Liu Deep convolutional and LSTM neural networks for acoustic modelling in automatic speech recognition
Bai et al. Multimodal urban sound tagging with spatiotemporal context
Xiao et al. AMResNet: An automatic recognition model of bird sounds in real environment
CN116771662A (en) Machine pump fault diagnosis method based on multi-feature fusion
Pellegrini Deep-learning-based central African primate species classification with MixUp and SpecAugment
Xie et al. Acoustic feature extraction using perceptual wavelet packet decomposition for frog call classification
Singh et al. Speaker Identification Analysis Based on Long-Term Acoustic Characteristics with Minimal Performance
Kek et al. Acoustic scene classification using bilinear pooling on time-liked and frequency-liked convolution neural network
Bai et al. A multi-feature fusion based method for urban sound tagging
Bai et al. CIAIC-BAD system for DCASE2018 challenge task 3
Segarceanu et al. Environmental acoustics modelling techniques for forest monitoring

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant