CN115188387B - Effective marine mammal sound automatic detection and classification method - Google Patents

Effective marine mammal sound automatic detection and classification method

Info

Publication number
CN115188387B
CN115188387B (application CN202210817343.1A; also published as CN115188387A)
Authority
CN
China
Prior art keywords
probability
layer
grained
coarse
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210817343.1A
Other languages
Chinese (zh)
Other versions
CN115188387A (en)
Inventor
李丹阳
李军
蒋凯林
郑兴泽
李焦
明扬
李林成
谢天宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Agricultural University
Original Assignee
Sichuan Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Agricultural University filed Critical Sichuan Agricultural University
Priority to CN202210817343.1A priority Critical patent/CN115188387B/en
Publication of CN115188387A publication Critical patent/CN115188387A/en
Application granted granted Critical
Publication of CN115188387B publication Critical patent/CN115188387B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L17/26 - Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an effective automatic detection and classification method for marine mammal sounds. The method performs data enhancement on marine mammal audio data with a single-sample variational autoencoder; extracts Mel cepstrum coefficients from the audio data and splices them with the initial intensity envelope to obtain the first input feature, and extracts an audio fingerprint feature as the second input feature by means of audio fingerprint extraction; the Mel frequency cepstrum coefficients and the audio fingerprint features are then input into a two-way fusion network to obtain two branch predictions, and the two branch predictions are fused to obtain the final prediction result for marine mammal detection and classification. Through the two-way parallel fusion network structure, the fused network can simultaneously capture high-dimensional features and exploit time-sequence information, and the complementarity between models, each attending to different information, is exploited to improve model performance efficiently.

Description

Effective marine mammal sound automatic detection and classification method
Technical Field
The invention relates to the technical field of marine acoustics application, in particular to an effective automatic detection and classification method for marine mammal sound.
Background
Statistically, 2.1% of the world's mammal species have become extinct since the year 1600. Expert statistical analysis shows that the rate of species extinction keeps accelerating and is now roughly 100 to 1,000 times the previously estimated background rate. Among the roughly 1.739 million species recorded worldwide, there are about 130 species of marine mammals, of which more than 90 are cetaceans (whales and dolphins) and nearly 40 belong to other groups such as the pinnipeds and sirenians. Of the 20,278 species of organisms recorded in China's sea areas, nearly 50 species of aquatic mammals (including introduced species) have been found, among which 41 cetacean species, 5 pinniped species (excluding introduced species) and 1 sirenian species have been recorded. Marine mammals are among the most endangered species in nature and are listed as protected animals in almost every country in the world.
Although marine mammals are few in number, the role they play in maintaining the balance of the marine ecosystem is far from trivial, and their protection is of paramount importance. In recent years the survival of marine mammals has continued to face severe challenges owing to poorly understood species resources, degraded habitats, water pollution and similar problems. Traditional manual identification of these mammals is laborious, inefficient and time-consuming, incurs high material and labour costs, produces data that are huge and difficult to process, cannot monitor marine mammals continuously, suffers from time lag and carries a degree of danger. Sample collection is highly contingent and random and requires long periods of sample accumulation; field surveys and passive acoustic monitoring demand large investments of manpower, money and material, and are often difficult to carry out quickly, frequently and on a large scale. Since many marine mammals inhabit sea areas rarely visited by humans and are highly mobile, identification by humans is very difficult. How to detect, identify and classify marine mammals is therefore a pressing question.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing an effective automatic detection and classification method for marine mammal sounds, thereby addressing the problems of traditional manual mammal identification.
The purpose of the invention is realized by the following technical scheme: an efficient automatic detection and classification method for marine mammal sounds, said automatic detection and classification method comprising:
carrying out data enhancement processing, by means of a single-sample variational autoencoder, on publicly available and field-collected marine mammal audio data;
extracting a Mel cepstrum coefficient and an initial intensity envelope from the audio data after data enhancement, performing feature splicing to obtain a first input feature, and extracting an audio fingerprint feature of a second input feature from the audio data after data enhancement in an audio fingerprint extraction mode;
inputting the Mel frequency cepstrum coefficients and the audio fingerprint features into a two-way fusion MG-ResFormer network, obtaining two branch predictions of the sample class probability, and fusing the two branch predictions to obtain the final prediction result for marine mammal detection and classification.
The extracting of the audio fingerprint feature of the second input feature from the data-enhanced audio data in the audio fingerprint extraction manner includes:
dividing audio data into a plurality of original subframes with the same size, and carrying out Fourier transform on the data of the original subframes to calculate the frequency spectrum information of the original subframes;
dividing the atomic spectrum obtained by calculation into a plurality of spectral bands, calculating each spectral band to obtain an energy block, and combining all the energy blocks to obtain a two-dimensional matrix representing atomic spectrum energy information;
performing differential calculation on the two-dimensional matrix, and acquiring a 01 matrix only containing 0 and 1 by capturing each energy block and adjacent energy blocks;
and splicing the two-dimensional matrix and the 01 matrix containing the biological sound production information to obtain the audio fingerprint characteristics.
The two-way fusion MG-ResFormer network comprises an MG-Resnet network model, an MG-Transformer network model and a fusion layer; the Mel frequency cepstrum coefficients are input into the MG-Transformer network model to obtain that model's probability matrix for the sample, the audio fingerprint features are input into the MG-Resnet network to obtain that model's probability matrix for the sample, and the fusion layer fuses the probability matrices output by the two network models to obtain the final prediction result for marine mammal detection and classification.
The MG-Resnet network model comprises five convolutional layer modules, a pooling layer, two fully connected layers and a coarse-fine granularity combination module. The input audio fingerprint features first undergo a 7 x 7 convolution in the first convolutional layer module; the second to fifth convolutional layers, each containing two basic residual blocks (build-blocks), then perform residual convolution, followed by average pooling; the two resulting outputs pass through two parallel fully connected layers and are fed into the coarse-fine granularity combination layer, which finally yields the probability matrix for the sample.
The MG-Transformer network model firstly pools input Mel frequency cepstrum coefficients to reduce the sensitivity of characteristics to different positions, then inputs the coefficients to an encoding layer to extract different characteristic signals through a multi-head attention mechanism, segments the characteristics to enhance the attention of the model to global characteristics, respectively extracts coarse granularity and fine granularity probabilities through two linear layers, and finally inputs the probabilities to a coarse and fine granularity combination layer to obtain a probability matrix of a sample.
The fusion layer fuses probability matrixes output by the two network models to obtain a final prediction result of the detection and classification of the marine mammals, and the final prediction result comprises the following steps:
9 neurons are arranged in the fusion layer, and the probability values of 9 types output by the two network models respectively pass through the 9 neurons and the 9 pseudo neurons;
the class probability value output by one network model is directly multiplied by 9 neurons, the class probability value output by the other network model is multiplied by 9 pseudo neurons, and the two groups of obtained probability values are added to carry out normalization operation to obtain the final probability.
The obtaining of the probability matrix of the sample through the coarse-fine granularity combination layer in the MG-Transformer network model and the MG-Resnet network model comprises the following steps:
tensors with the lengths of the fine-grained type quantity and the coarse-grained type quantity are respectively mapped through two parallel full connection layers in the MG-Resnet network model, and tensors with the lengths of the fine-grained type quantity and the coarse-grained type quantity are respectively mapped through two parallel linear layers in the MG-Transformer network model;
the coarse-grained layer in the coarse-fine granularity combination layer applies a softmax function to its input data to obtain the probability of the coarse-grained category to which the sample belongs; the fine-grained layer groups its input data, placing fine-grained classes that belong to the same coarse-grained category in one group, applies a softmax operation within each group, and finally multiplies the obtained coarse-grained probability by the corresponding fine-grained probability to obtain the category probability information p1;
the input of the fine-grained layer is also passed directly through a softmax operation via a residual-like structure, a weight ε is assigned, and the result is combined with the matrix containing p1 as p2 = p1 + ε * p0, where p0 denotes the probability obtained by applying softmax directly to the input of the fine-grained layer;
finally p2 is normalized to obtain the final probability matrix of the sample, p3 = p2 / Σp2.
The invention has the following advantages. Unlike traditional features, the constructed audio fingerprint information contains a large amount of energy, frequency and time-sequence information of the sound signal together with the distinctive vocalisation patterns of different species, and shows its advantage within a convolutional neural network. A multi-granularity combination layer is constructed to assist the multi-class task: following the taxonomic division of species (kingdom, phylum, class, order, family, genus, species), a coarse-grained layer and a fine-grained layer are built for the species to be identified, the coarse-grained layer corresponding to the "family" level and the fine-grained layer to the "genus" level, so that the decision of the fine-grained layer is consolidated by the prior judgment of the coarse granularity; this multi-granularity fusion layer is highly general and applicable to other studies. Through the two-way parallel fusion network structure, the fused network can simultaneously capture high-dimensional features and exploit time-sequence information, and the complementarity between models, each attending to different information, is exploited to improve model performance efficiently.
Drawings
FIG. 1 is a schematic structural diagram of a two-way MG-ResFormer network of the present invention;
FIG. 2 is a schematic diagram of a coarse-fine grain size composite layer;
FIG. 3 is a schematic diagram of the MG-Resnet network model;
FIG. 4 is a schematic structural view of a fused layer;
FIG. 5 is a comparison of the effects of various network models.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application provided below in connection with the appended drawings is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, an effective automatic detection and classification method for marine mammal sounds comprises:
s1, carrying out data enhancement processing on the audio data of the existing open and field collected marine mammals through a single-sample variation self-encoder;
wherein the disclosed marine mammal audio data is class 9 marine mammal information disclosed in a waters marine mammal sound database providing marine mammal sound recordings from 1940 to 2000; in order to facilitate the generation of audio fingerprints and the same network input data scale, audio with different lengths is averagely divided into 2s of audio;
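By way of illustration only, the following sketch splits a recording into non-overlapping 2-second clips with librosa and NumPy; the file name, sampling rate and handling of the trailing remainder are assumptions, not the patented implementation.

```python
import librosa
import numpy as np

def split_into_clips(path, clip_seconds=2.0, sr=22050):
    """Load a recording and cut it into equal, non-overlapping clips."""
    audio, sr = librosa.load(path, sr=sr)          # resample to a common rate
    clip_len = int(clip_seconds * sr)              # samples per 2-second clip
    n_clips = len(audio) // clip_len               # drop the trailing remainder
    return np.reshape(audio[:n_clips * clip_len], (n_clips, clip_len))

# clips = split_into_clips("humpback_example.wav")  # hypothetical file name
```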
in order to prevent the Sample distribution from causing negative influence on the learning of the model, the phenomena of blurring and chaos (meaning that the image is seriously overlapped due to posterior collapse) which may occur in an image generated by a common VAE (Plain VAE) are solved by a single Sample variation auto-encoder (S3 VAE).
S2, extracting a Mel cepstrum coefficient and an initial intensity envelope from the audio data after data enhancement, performing feature splicing to obtain a first input feature, and extracting an audio fingerprint feature of a second input feature from the audio data after data enhancement in an audio fingerprint extraction mode;
further, audio feature extraction can reduce the sampling signal of the original waveform, thereby accelerating the understanding of the semantic meaning in the audio by the machine. In order to obtain the audio features with the best effect, 9 features of the audio data mainstream are extracted: chromatographic information, constant Q chromatographic information, normalized chromatographic information, mel-frequency spectral information, mel-cepstral information, spectral contrast, tonal centroid, local autocorrelation of the initial intensity envelope, fourier-velocimetry.
First, pre-emphasis is applied to the samples to raise the energy of the high-frequency part of the signal; given a time-domain input signal x[n], the pre-emphasized signal is y[n] = x[n] - α·x[n-1], with 0.9 ≤ α ≤ 1.0.
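A minimal NumPy rendering of this pre-emphasis step (α is a free parameter within the stated range) is:

```python
import numpy as np

def pre_emphasize(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1], with 0.9 <= alpha <= 1.0."""
    y = np.copy(x).astype(float)
    y[1:] -= alpha * x[:-1]
    return y
```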
To facilitate the extraction of the subsequent features, the signal value at the window boundary should be brought close to 0 so that each frame approximates a periodic signal; a tapering window function is therefore applied (its formula is given as an image in the original publication).
After signal preprocessing, the nine features above are extracted with the open-source library librosa in Python. Because each feature attends to different dimensions of the signal, the features were combined, the combinations were trained and their relevance verified, and two features were finally selected and spliced as one of the inputs: the Mel-frequency cepstral coefficients (MFCCs) and the initial intensity (onset strength) envelope.
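A sketch of how the two selected features could be extracted and spliced with librosa is given below; the frame parameters, the number of MFCCs and the splicing axis are assumptions rather than the patented configuration.

```python
import librosa
import numpy as np

def first_input_feature(clip, sr=22050, n_mfcc=20):
    """Splice MFCCs with the onset strength ("initial intensity") envelope."""
    mfcc = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    onset = librosa.onset.onset_strength(y=clip, sr=sr)         # (frames,)
    frames = min(mfcc.shape[1], onset.shape[0])                 # align frame counts
    return np.vstack([mfcc[:, :frames], onset[None, :frames]])  # (n_mfcc+1, frames)
```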
S3, inputting the Mel frequency cepstrum coefficients and the audio fingerprint features into the two-way fusion MG-ResFormer network, obtaining two branch predictions of the sample class probability, and fusing the two branch predictions to obtain the final prediction result for marine mammal detection and classification.
Further, as shown in fig. 2, extracting the audio fingerprint feature of the second input feature from the data-enhanced audio data by means of audio fingerprint extraction includes:
dividing audio data into a plurality of original subframes with the same size, and carrying out Fourier transform on the data of the original subframes to calculate the frequency spectrum information of the original subframes;
dividing the atomic spectrum obtained by calculation into a plurality of spectral bands, calculating each spectral band to obtain an energy block, and combining all the energy blocks to obtain a two-dimensional matrix representing atomic spectrum energy information;
performing differential calculation on the two-dimensional matrix, and acquiring a 01 matrix only containing 0 and 1 by capturing each energy block and adjacent energy blocks;
and splicing the two-dimensional matrix and the 01 matrix containing the biological sound production information to obtain the audio fingerprint characteristics.
Specifically, conventional audio features such as the Mel spectrogram, MFCC and chromagram are usually related to only a single piece of information in the voiceprint. The invention therefore constructs a new voiceprint feature that contains frequency, energy and time-sequence information, so as to strengthen the weak terms of a single signal while also reflecting the distinctiveness of different species' vocalisations.
The invention also reduces, as far as possible, the influence of audio that is too long or too short on the final classification. Therefore, in constructing the audio fingerprint, an atomic frame stream strategy is used: the original audio is divided into original subframes of the same size, a series of transformations is applied to these atomic frames to obtain atomic features, and the final feature is obtained by combining the atomic features. A single atomic feature may not contain enough information to support identification on its own, but in normal circumstances the audio to be identified is made up of hundreds or thousands of original subframes, which contain enough atomic frames for efficient and reliable identification.
Then a Fourier transform is applied to the data of each original subframe to calculate its spectrum information:

F(e^{jω}) = ∫ x(t)·e^{-jωt} dt = a + ib,  |F(e^{jω})| = √(a² + b²)

where j is the imaginary unit, ω is the angular frequency and t is time. To make the energy information more accurate, the atomic spectrum is divided into 65 spectral bands, and an energy block is computed for each band by accumulating the spectral energy within it (the exact band formula is given as an image in the original publication).
The energy information of the atomic spectrum is then obtained by combining the 65 energy blocks. The energy information of successive atomic spectra forms a two-dimensional matrix, so a certain amount of time-sequence information is preserved, which is an important characteristic of the audio fingerprint.
A differential calculation is then performed on the two-dimensional matrix, and a matrix containing only 0s and 1s is obtained by comparing each energy block with its adjacent blocks (the differential thresholding formula is given as an image in the original publication); here m and n denote the coordinates of the two-dimensional matrix, n being the coordinate in the x direction and m the coordinate in the y direction, so that the position of each energy block can be located. This 01 matrix contains the vocalisation information of the animals, so it is spliced with the two-dimensional energy matrix to obtain the final fingerprint feature.
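Assuming equal-width spectral bands and a Haitsma-Kalker-style differential threshold for the 0/1 matrix (neither of which is spelled out above), the fingerprint construction could be sketched as follows:

```python
import numpy as np

def audio_fingerprint(clip, frame_len=1024, n_bands=65):
    """Energy matrix over 65 spectral bands plus a 0/1 differential matrix."""
    n_frames = len(clip) // frame_len
    frames = clip[:n_frames * frame_len].reshape(n_frames, frame_len)  # original subframes
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2                # atomic spectra
    bands = np.array_split(spectrum, n_bands, axis=1)
    energy = np.stack([b.sum(axis=1) for b in bands], axis=1)          # (frames, 65)
    # Differential 0/1 matrix: compare each energy block with its time and band neighbours.
    diff = (energy[1:, :-1] - energy[1:, 1:]) - (energy[:-1, :-1] - energy[:-1, 1:])
    bits = (diff > 0).astype(np.float32)                               # (frames-1, 64)
    return energy, bits  # the two matrices are spliced to form the fingerprint feature
```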
Further, as shown in fig. 1, the two-way fusion MG-ResFormer network includes an MG-Resnet network model, an MG-Transformer network model and a fusion layer. The Mel frequency cepstrum coefficients are input into the MG-Transformer network model to obtain that model's probability matrix for the sample, the audio fingerprint features are input into the MG-Resnet network to obtain that model's probability matrix for the sample, and the fusion layer fuses the two probability matrices to obtain the final prediction result for marine mammal detection and classification.
The audio fingerprint features of the audio are input into the MG-Resnet network model, whose loss function is recorded as Loss1; the Mel frequency cepstrum coefficient features are input into the MG-Transformer network model, whose loss function is recorded as Loss2. In the fusion module, the probabilities output by the MG-Resnet and MG-Transformer models are used as input and fitted to the one-hot encoded label information, and this loss is recorded as Loss3. The overall loss function is

Loss = Loss1 + Loss2 + Loss3

The fingerprint feature of an audio sample and the spliced feature of Mel cepstrum coefficients (MFCC) and the initial intensity envelope are input into MG-Resnet and MG-Transformer respectively; after forward propagation through the two branches, two 1 x 9 matrices predicting the sample class probability are output, and at this point the two prediction matrices are frozen. Because the two matrices are produced by the computations of the two branches, backpropagating Loss3 would otherwise change the branch parameters; since Loss1 and Loss2 already update the two branch networks, freezing prevents the backpropagation of Loss3 from adversely affecting them.
The loss gradients are obtained by backpropagation, so the loss calculation process remains correct; after the partial derivatives are taken, Loss1, Loss2 and Loss3 are each responsible for backpropagation through the module to which they belong, as sketched below.
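A rough sketch of this training arrangement, in which the fusion layer receives the two branch outputs detached (frozen) so that Loss3 does not backpropagate into the branches, is given below. The criterion and the forward interfaces of the three modules are assumptions; here the branch outputs are treated as class scores against integer labels rather than the exact one-hot fitting used in the patent.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # stand-in for the loss fitted to the labels

def training_step(mg_resnet, mg_transformer, fusion, fingerprint, mfcc_feat, labels):
    """Loss = Loss1 + Loss2 + Loss3, with the fusion inputs detached."""
    p_res = mg_resnet(fingerprint)        # 1 x 9 prediction, fingerprint branch
    p_trans = mg_transformer(mfcc_feat)   # 1 x 9 prediction, MFCC branch
    loss1 = criterion(p_res, labels)
    loss2 = criterion(p_trans, labels)
    # Freeze (detach) the two prediction matrices before fusion so that loss3's
    # backpropagation does not disturb the branch parameters.
    p_final = fusion(p_res.detach(), p_trans.detach())
    loss3 = criterion(p_final, labels)
    return loss1 + loss2 + loss3
```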
As shown in fig. 3, the MG-Resnet network model includes five convolutional layer modules, a pooling layer, two fully connected layers and a coarse-fine granularity combination module. The input audio fingerprint features first undergo a 7 x 7 convolution in the first convolutional layer module; the second to fifth convolutional layers, each containing two basic residual blocks (build-blocks), then perform residual convolution, followed by average pooling; the two resulting outputs pass through two parallel fully connected layers and are fed into the coarse-fine granularity combination layer, which finally yields the probability matrix for the sample.
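A condensed PyTorch sketch of this layout follows; the channel width, the number of coarse-grained classes and the constant-resolution stages are simplifying assumptions, and BasicBlock is the standard two-convolution residual block rather than the exact patented module.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Standard two-convolution residual block (a 'build-block')."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm2d(ch), nn.BatchNorm2d(ch)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)

class MGResnetSketch(nn.Module):
    """ResNet-18-style trunk with two parallel heads (fine / coarse granularity)."""
    def __init__(self, n_fine=9, n_coarse=3, ch=64):
        super().__init__()
        self.stem = nn.Conv2d(1, ch, kernel_size=7, stride=2, padding=3)   # 7x7 convolution
        self.stages = nn.Sequential(*[nn.Sequential(BasicBlock(ch), BasicBlock(ch))
                                      for _ in range(4)])                   # conv2 .. conv5
        self.pool = nn.AdaptiveAvgPool2d(1)                                 # average pooling
        self.fc_fine = nn.Linear(ch, n_fine)      # parallel fully connected heads
        self.fc_coarse = nn.Linear(ch, n_coarse)

    def forward(self, x):                          # x: (batch, 1, H, W) fingerprint matrix
        h = self.pool(self.stages(torch.relu(self.stem(x)))).flatten(1)
        return self.fc_fine(h), self.fc_coarse(h)  # fed to the coarse-fine combination layer
```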
The MG-Transformer network model first pools the input Mel frequency cepstrum coefficients to reduce the sensitivity of the features to position, then feeds them into an encoding layer that extracts different feature signals through a multi-head attention mechanism; the features are segmented to strengthen the model's attention to global features, coarse-grained and fine-grained probabilities are extracted through two linear layers, and finally these are input into the coarse-fine granularity combination layer to obtain the probability matrix of the sample. This design keeps the model's recognition performance high on large call-classification tasks, allows it to perceive global features better, and gives it superior performance on multi-class tasks.
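A comparable sketch of the MG-Transformer branch is given below; the model width, number of heads and layers, and the input feature dimension (20 MFCCs plus the envelope) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class MGTransformerSketch(nn.Module):
    """Pooling + Transformer encoder + two parallel linear heads (a rough sketch)."""
    def __init__(self, feat_dim=21, d_model=64, n_fine=9, n_coarse=3):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=2)            # reduce positional sensitivity
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # multi-head attention
        self.fine = nn.Linear(d_model, n_fine)              # fine-grained scores
        self.coarse = nn.Linear(d_model, n_coarse)          # coarse-grained scores

    def forward(self, x):                                   # x: (batch, feat_dim, frames)
        x = self.pool(x).transpose(1, 2)                    # -> (batch, frames/2, feat_dim)
        h = self.encoder(self.proj(x)).mean(dim=1)          # pooled over time
        return self.fine(h), self.coarse(h)                 # fed to the combination layer
```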
As shown in fig. 2, obtaining the probability matrix of the sample through the coarse-fine granularity combination layer in the MG-Transformer network model and the MG-Resnet network model comprises the following steps:
tensors whose lengths equal the number of fine-grained classes and the number of coarse-grained classes are mapped through two parallel fully connected layers in the MG-Resnet network model, and through two parallel linear layers in the MG-Transformer network model; the data used here still carry only one label, the fine-grained label, and coarse-grained information is captured by means of the coarse-grained layer.
The coarse-grained layer in the coarse-fine granularity combination layer applies a softmax function to its input data to obtain the probability of the coarse-grained category to which the sample belongs; the fine-grained layer groups its input data, placing fine-grained classes that belong to the same coarse-grained category in one group, applies a softmax operation within each group, and finally multiplies the coarse-grained probability by the corresponding fine-grained probability to obtain the category probability information. For example, the probability of the killer-whale class is computed as

p1(Killer whale) = p(Killer whale | Whale) * p(Whale)

In addition, the special case must be considered in which the coarse-grained layer outputs an incorrect coarse-grained classification probability. Although this probability is extremely low, a residual-like structure is used: a softmax operation is applied directly to the input of the fine-grained layer, a weight ε is assigned, and the result is added to the matrix containing p1, i.e.

p2(Killer whale) = p1(Killer whale) + ε * p0(Killer whale)

where p0(Killer whale) denotes the probability obtained by applying softmax directly to the input of the fine-grained layer.
Finally, p2(Killer whale) is normalized to obtain the final probability of the sample, p3(Killer whale) = p2(Killer whale) / Σp2.
In the actual training process the neural network gradually learns the correct features: the more correctly a sample's coarse-grained category is classified, the more likely its fine-grained category is assigned correctly. Although the invention only uses fine-grained labels, the final classification effect is still very satisfactory. A sketch of this combination step follows.
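The sketch below implements the combination described above, under the assumption of 9 fine-grained and 3 coarse-grained classes, an illustrative grouping, and an illustrative value of ε.

```python
import torch

def coarse_fine_combine(fine_scores, coarse_scores, groups, eps=0.1):
    """Combine coarse- and fine-grained scores; groups[i] is the coarse class of fine class i."""
    groups = torch.as_tensor(groups)
    p_coarse = torch.softmax(coarse_scores, dim=-1)              # P(coarse class)
    p_fine_given_coarse = torch.zeros_like(fine_scores)
    for g in groups.unique():                                    # softmax within each group
        idx = (groups == g)
        p_fine_given_coarse[..., idx] = torch.softmax(fine_scores[..., idx], dim=-1)
    p1 = p_fine_given_coarse * p_coarse[..., groups]             # p1 = P(fine|coarse) * P(coarse)
    p0 = torch.softmax(fine_scores, dim=-1)                      # residual-like direct softmax
    p2 = p1 + eps * p0                                           # p2 = p1 + eps * p0
    return p2 / p2.sum(dim=-1, keepdim=True)                     # p3 = p2 / sum(p2)

# Example (illustrative grouping of 9 fine classes into 3 coarse classes):
# p3 = coarse_fine_combine(fine_scores, coarse_scores, groups=[0, 0, 0, 1, 1, 1, 2, 2, 2])
```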
As shown in fig. 4, the fusion layer fusing the probability matrices output by the two network models to obtain the final prediction result for marine mammal detection and classification comprises the following steps:
9 neurons are arranged in the fusion layer, and the probability values of the 9 categories output by the two network models (denoted input1 and input2 respectively) pass through the 9 neurons and the 9 pseudo-neurons respectively;
the class probability value output by one network model is directly multiplied by 9 neurons, the class probability value output by the other network model is multiplied by 9 pseudo neurons, and the two groups of obtained probability values are added to carry out normalization operation to obtain the final probability.
There are also 9 pseudo-neurons, the value of each of which depends on the value of the neuron at the corresponding position, according to the formula below, where β_Ti denotes a neuron that participates in training and β_Fi a pseudo-neuron that does not participate in training:

β_Ti = 1 - β_Fi
in order to ensure that the fusion structure has correct influence, a similar residual error structure is also used, the final output probability is added to the input1 and input2, and the normalization operation is performed to obtain the final output.
The two-way fusion MG-ResFormer network was tested; the table below shows that it performs extremely well on the nine-class marine mammal audio classification task. On this task the ACC, AUC, mAP and f1_score reach 99.09, 99.99, 99.97 and 99.24 respectively. Compared with classification networks commonly used in the same field, the network makes obvious progress.
Network effect comparison table
Network/index | ACC | AUC | mAP | f1_score
MGResFormer | 99.09 | 99.99 | 99.97 | 99.24
MGResnet18 | 97.27 | 99.93 | 99.35 | 96.54
MGTrans | 96.36 | 99.85 | 98.63 | 95.76
InceptionV3 | 94.62 | 98.97 | 99.03 | 94.43
EfficientNet | 95.39 | 99.14 | 99.02 | 95.78
As shown in fig. 5, the accuracy of MGResFormer lies almost everywhere above that of MGResnet and MGTransformer, which proves that the two fused networks complement and promote each other.
The foregoing is illustrative of the preferred embodiments of the present invention, and it is to be understood that the invention is not limited to the precise form disclosed herein and is not to be construed as limited to the exclusion of other embodiments, and that various other combinations, modifications, and environments may be used and modifications may be made within the scope of the concepts described herein, either by the above teachings or the skill or knowledge of the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. An effective method for automatically detecting and classifying marine mammal sounds, comprising: the automatic detection and classification method comprises the following steps:
carrying out data enhancement processing, by a single-sample variational autoencoder, on audio data in the Watkins Marine Mammal Sound Database and on marine mammal audio data collected in the field;
extracting a Mel cepstrum coefficient and an initial intensity envelope from the audio data after data enhancement, performing feature splicing to obtain a first input feature, and extracting an audio fingerprint feature of a second input feature from the audio data after data enhancement in an audio fingerprint extraction mode;
inputting the Mel frequency cepstrum coefficient and the audio fingerprint characteristics into a two-way fusion MG-ResFormer network, outputting to obtain two ways of results for predicting the sample class probability, and fusing the two ways of prediction results to obtain the final prediction result for detecting and classifying the marine mammals.
2. An effective method for automatic detection and classification of marine mammal sounds according to claim 1, wherein: the extracting of the audio fingerprint feature of the second input feature from the data-enhanced audio data in the audio fingerprint extraction manner includes:
dividing audio data into a plurality of original subframes with the same size, and performing Fourier transform on data of the original subframes to calculate frequency spectrum information of the data;
dividing the atomic spectrum obtained by calculation into a plurality of spectral bands, calculating each spectral band to obtain an energy block, and combining all the energy blocks to obtain a two-dimensional matrix representing atomic spectrum energy information;
performing differential calculation on the two-dimensional matrix, and acquiring a 01 matrix only containing 0 and 1 by capturing each energy block and adjacent energy blocks;
and splicing the two-dimensional matrix and the 01 matrix containing the biological sound production information to obtain the audio fingerprint characteristics.
3. An effective method for automatic detection and classification of marine mammal sounds according to claim 1, wherein: the two-way fusion MG-ResFormer network comprises an MG-Resnet network model, an MG-Transformer network model and a fusion layer; the Mel frequency cepstrum coefficient is input into an MG-transform network model to obtain a probability matrix of the MG-transform network model to a sample, the audio fingerprint characteristics are input into an MG-Resnet network to obtain a probability matrix of the MG-Resnet network model to the sample, and the fusion layer fuses the probability matrices output by the two network models to obtain a final prediction result of detection and classification of the marine mammals.
4. An effective marine mammal sound automatic detection and classification method according to claim 3, characterized in that: the MG-Resnet network model comprises five convolutional layer modules, a pooling layer, two full-connection layers and a coarse-fine granularity combination module, wherein the input audio fingerprint characteristics are subjected to 7 x 7 convolution through the first convolutional layer module, the average pooling is carried out after residual convolution is carried out on a second convolutional layer to a fifth convolutional layer respectively containing two build-blocks, then two obtained outputs are input to the coarse-fine granularity combination layer after the two parallel full-connection layers are passed, and finally a probability matrix of a sample is obtained.
5. An effective marine mammal sound automatic detection and classification method according to claim 4, wherein: the MG-Transformer network model firstly pools input Mel frequency cepstrum coefficients to reduce the sensitivity of characteristics to different positions, then inputs the coefficients to an encoding layer to extract different characteristic signals through a multi-head attention mechanism, segments the characteristics to enhance the attention of the model to global characteristics, respectively extracts coarse granularity and fine granularity probabilities through two linear layers, and finally inputs the probabilities to a coarse and fine granularity combination layer to obtain a probability matrix of a sample.
6. An effective method for automatic detection and classification of marine mammal sounds according to claim 3, wherein: the fusion layer fuses the probability matrixes output by the two network models to obtain the final prediction result of the detection and classification of the marine mammals, and the final prediction result comprises the following steps:
9 neurons are arranged in the fusion layer, and the probability values of 9 types output by the two network models respectively pass through the 9 neurons and the 9 pseudo neurons;
the class probability value output by one network model is directly multiplied by 9 neurons, the class probability value output by the other network model is multiplied by 9 pseudo neurons, and the two groups of obtained probability values are added to carry out normalization operation to obtain the final probability.
7. An effective method for automatic detection and classification of marine mammal sounds according to claim 5, wherein: the obtaining of the sample quota probability matrix through the thickness granularity combination layer in the MG-Transformer network model and the MG-Resnet network model comprises the following steps:
tensors with the lengths of the fine-grained category number and the coarse-grained category number are respectively mapped through two parallel full-connection layers in the MG-Resnet network model, and tensors with the lengths of the fine-grained category number and the coarse-grained category number are respectively mapped through two parallel linear layers in the MG-Transformer network model;
the coarse-grained layer in the coarse-fine granularity combination layer applies a softmax function to its input data to obtain the probability of the coarse-grained category to which the sample belongs; the fine-grained layer groups its input data, placing fine-grained data that belong to the same coarse-grained category in one group, applies a softmax operation to each group, and finally multiplies the obtained coarse-grained probability by the corresponding fine-grained probability to obtain the category probability information p1;
a softmax operation is applied directly to the input of the fine-grained layer through a residual-like structure, a weight ε is assigned, and the result is combined with the matrix containing p1 as p2 = p1 + ε * p0, where p0 represents the probability obtained by directly applying the softmax operation to the input of the fine-grained layer;
finally p2 is normalized to obtain the final probability matrix of the sample, p3 = p2 / Σp2.
CN202210817343.1A 2022-07-12 2022-07-12 Effective marine mammal sound automatic detection and classification method Active CN115188387B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210817343.1A CN115188387B (en) 2022-07-12 2022-07-12 Effective marine mammal sound automatic detection and classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210817343.1A CN115188387B (en) 2022-07-12 2022-07-12 Effective marine mammal sound automatic detection and classification method

Publications (2)

Publication Number Publication Date
CN115188387A CN115188387A (en) 2022-10-14
CN115188387B true CN115188387B (en) 2023-04-07

Family

ID=83517853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210817343.1A Active CN115188387B (en) 2022-07-12 2022-07-12 Effective marine mammal sound automatic detection and classification method

Country Status (1)

Country Link
CN (1) CN115188387B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116318867B (en) * 2023-02-15 2023-11-28 四川农业大学 Resource platform data transmission method based on out-of-order encryption and decryption
CN117174109B (en) * 2023-11-03 2024-02-02 青岛科技大学 Feature extraction-based marine mammal sound signal imitation hidden scoring method
CN117275491B (en) * 2023-11-17 2024-01-30 青岛科技大学 Sound classification method based on audio conversion and time attention seeking neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9558762B1 (en) * 2011-07-03 2017-01-31 Reality Analytics, Inc. System and method for distinguishing source from unconstrained acoustic signals emitted thereby in context agnostic manner
CN109800700A (en) * 2019-01-15 2019-05-24 哈尔滨工程大学 A kind of underwater sound signal target classification identification method based on deep learning
CN110827837A (en) * 2019-10-18 2020-02-21 中山大学 Whale activity audio classification method based on deep learning
WO2021075709A1 (en) * 2019-10-14 2021-04-22 고려대학교 산학협력단 Apparatus and method for identifying animal species robustly against noisy environment
CN112927694A (en) * 2021-03-08 2021-06-08 中国地质大学(武汉) Voice instruction validity judging method based on fusion voiceprint features

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030125946A1 (en) * 2002-01-03 2003-07-03 Wen-Hao Hsu Method and apparatus for recognizing animal species from an animal voice
CN103117061B (en) * 2013-02-05 2016-01-20 广东欧珀移动通信有限公司 A kind of voice-based animals recognition method and device
KR102092475B1 (en) * 2018-10-16 2020-03-23 고려대학교 산학협력단 Method and application for animal species classification
CN110120224B (en) * 2019-05-10 2023-01-20 平安科技(深圳)有限公司 Method and device for constructing bird sound recognition model, computer equipment and storage medium
CN110390952B (en) * 2019-06-21 2021-10-22 江南大学 City sound event classification method based on dual-feature 2-DenseNet parallel connection
CN112750442B (en) * 2020-12-25 2023-08-08 浙江弄潮儿智慧科技有限公司 Crested ibis population ecological system monitoring system with wavelet transformation and method thereof
CN112802484B (en) * 2021-04-12 2021-06-18 四川大学 Panda sound event detection method and system under mixed audio frequency
CN113345443A (en) * 2021-04-22 2021-09-03 西北工业大学 Marine mammal vocalization detection and identification method based on mel-frequency cepstrum coefficient

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9558762B1 (en) * 2011-07-03 2017-01-31 Reality Analytics, Inc. System and method for distinguishing source from unconstrained acoustic signals emitted thereby in context agnostic manner
CN109800700A (en) * 2019-01-15 2019-05-24 哈尔滨工程大学 A kind of underwater sound signal target classification identification method based on deep learning
WO2021075709A1 (en) * 2019-10-14 2021-04-22 고려대학교 산학협력단 Apparatus and method for identifying animal species robustly against noisy environment
CN110827837A (en) * 2019-10-18 2020-02-21 中山大学 Whale activity audio classification method based on deep learning
CN112927694A (en) * 2021-03-08 2021-06-08 中国地质大学(武汉) Voice instruction validity judging method based on fusion voiceprint features

Also Published As

Publication number Publication date
CN115188387A (en) 2022-10-14

Similar Documents

Publication Publication Date Title
CN115188387B (en) Effective marine mammal sound automatic detection and classification method
CN107731233B (en) Voiceprint recognition method based on RNN
Cakir et al. Multi-label vs. combined single-label sound event detection with deep neural networks
CN108875592A (en) A kind of convolutional neural networks optimization method based on attention
CN110148408A (en) A kind of Chinese speech recognition method based on depth residual error
CN117095694B (en) Bird song recognition method based on tag hierarchical structure attribute relationship
Leonid et al. Classification of Elephant Sounds Using Parallel Convolutional Neural Network.
Passricha et al. A comparative analysis of pooling strategies for convolutional neural network based Hindi ASR
Yang et al. Classification of odontocete echolocation clicks using convolutional neural network
CN113191178A (en) Underwater sound target identification method based on auditory perception feature deep learning
Qiao et al. Sub-spectrogram segmentation for environmental sound classification via convolutional recurrent neural network and score level fusion
Hou et al. Transfer learning for improving singing-voice detection in polyphonic instrumental music
Bai et al. Multimodal urban sound tagging with spatiotemporal context
Liu Deep convolutional and LSTM neural networks for acoustic modelling in automatic speech recognition
Brutti et al. Optimizing phinet architectures for the detection of urban sounds on low-end devices
Xiao et al. AMResNet: An automatic recognition model of bird sounds in real environment
Liu et al. Birdsong classification based on multi feature channel fusion
CN116884435A (en) Voice event detection method and device based on audio prompt learning
Bedyakin et al. Low-resource spoken language identification using self-attentive pooling and deep 1D time-channel separable convolutions
Xie et al. Acoustic feature extraction using perceptual wavelet packet decomposition for frog call classification
Singh et al. Speaker Identification Analysis Based on Long-Term Acoustic Characteristics with Minimal Performance
Bai et al. A multi-feature fusion based method for urban sound tagging
Bai et al. CIAIC-BAD system for DCASE2018 challenge task 3
Kek et al. Acoustic scene classification using bilinear pooling on time-liked and frequency-liked convolution neural network
Segarceanu et al. Environmental acoustics modelling techniques for forest monitoring

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant