CN112735444A - Chinese phoenix gull recognition system with model matching function and model matching method thereof

Info

Publication number
CN112735444A
CN112735444A (application CN202011567949.1A)
Authority
CN
China
Prior art keywords
audio
bird
data
sound
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011567949.1A
Other languages
Chinese (zh)
Other versions
CN112735444B (en)
Inventor
刘妙燕
田元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Nongchaoer Wisdom Technology Co ltd
Original Assignee
Zhejiang Nongchaoer Wisdom Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Nongchaoer Wisdom Technology Co ltd filed Critical Zhejiang Nongchaoer Wisdom Technology Co ltd
Priority to CN202011567949.1A priority Critical patent/CN112735444B/en
Publication of CN112735444A publication Critical patent/CN112735444A/en
Application granted granted Critical
Publication of CN112735444B publication Critical patent/CN112735444B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a recognition system with model matching for the Chinese crested tern (Chinese phoenix gull), which comprises a service layer, a data layer and a presentation layer, wherein the service layer consists of a user system, a business system and an algorithm system. The business system acquires the audio of various birds on an island through audio extraction equipment and provides a spectrum-time-space interface for recording, positioning, analyzing, annotating and classifying; the algorithm system, as a background system, realizes the corresponding functions through artificial-intelligence voiceprint recognition, which comprises noise removal, multi-audio separation, automatic classification and individual recognition. The model matching identification method comprises the following steps: step 1, collecting data; step 2, matching the model; and step 3, identifying the audio. By integrating the acquisition, detection, denoising, separation, classification and identification of bird audio into one system, the bird artificial-intelligence identification system realizes intelligent identification of the Chinese crested tern.

Description

Chinese phoenix gull recognition system with model matching function and model matching method thereof
Technical Field
The invention belongs to the field of artificial-intelligence recognition of bird voiceprints, and particularly relates to a recognition system with model matching for the Chinese crested tern (Chinese phoenix gull) and a model matching method thereof.
Background
The Chinese crested tern is a medium-sized waterbird with a body length of 38-42 cm. The forehead, crown and crest are black, the upper body is light gray, the wings are gray, and the outer primary feathers are black. The tail is white and deeply forked. The lower body is white. The bill is yellow with a broad black tip. The feet are black. Non-breeding plumage is similar to summer plumage, but the forehead and the top of the head are white. The iris is brown. The bill is slightly thicker than that of typical terns and slightly curved, yellow, with a black subterminal spot at the tip. Identification of the Chinese crested tern is unreliable when it relies on images alone. Bird sound is an important ornithological characteristic and carries rich ornithological meaning, such as species identity, mating and breeding behavior, social rank, individual condition and adaptability, so an audio-based identification technique can solve identification problems that images cannot. Bird voiceprints can also be used for biodiversity monitoring: competition and heterogeneity in habitat space reflect bird diversity, and the analysis of audio can provide diversity information at the individual, species, population, community and landscape levels. Rapid voiceprint surveys based on audio feature extraction enable fast, long-term, wide-range evaluation and monitoring of bird diversity, and once the species is identified, the structure of the population (age and sex ratio) and the state of individuals (emotion, disease, fighting) can be analyzed. In soundscape ecology, acoustic diversity indices derived from audio represent a new kind of species-diversity index and provide important data support. At present, however, there is no effective method covering the acquisition, detection, denoising, separation, classification and identification of bird voiceprints.
Identifying bird voiceprints requires clarifying the fine relationships between their characteristics (such as spectral characteristics and song or call type) and behavioral context (such as direction, location and neighborhood). Obtaining such data with conventional recordings or through human observation takes a great deal of time and effort, many identification methods are hard to implement because of hardware and software limitations, and the application of new non-invasive recording devices is an emphasis of eco-acoustics.
The noise present in most habitats and the simultaneous calling of many birds make recognition difficult, and more work is needed to identify all species, and the exact times at which they vocalize, in noisy recordings of many birds. Current techniques require substantial manual intervention, especially the manual segmentation of recordings into bird-audio syllables. Small audio data sets are usually processed with manual denoising and/or manual segmentation and cover only a small number of species; such techniques are only suitable for labeling recordings and are not sufficient to detect the exact time of a vocalization.
Most bird-audio identification techniques are based on visual inspection of audio spectrograms. Having human experts continuously inspect the spectrograms of large volumes of bird audio is extremely time-consuming and laborious, so automatic recognition of bird calls is urgently needed.
Identification of bird audio is becoming more important in bird acoustics and ecology as a tool for unattended monitoring, citizen science and other applications involving large amounts of audio data. Research tasks for bird audio include identifying species and individuals, yet many studies consider only the single-channel case and use recordings of individual birds that are isolated or have little background interference. Separating individual bird audio from mixed audio is a challenging task, and bird audio often contains rapid pitch modulations that carry information useful for automatic identification.
Bird audio is complex and variable, yet brief, repetitive and relatively fixed in structure, and usually consists of a series of notes.
Bird audio is typically divided into four levels: notes, syllables, phrases and songs, and syllables play an important role in bird species identification. Working at the syllable level addresses the problem of overlapping waveforms of many bird sounds. Existing techniques, however, extract the characteristics of a single syllable rather than the characteristics of a longer section of the bird call, so the identification is not accurate enough.
Disclosure of Invention
Aiming at the above problems, particularly the problem of rapid audio recognition of bird voiceprints, the invention provides a Chinese crested tern recognition system with model matching and a model matching method thereof. The technical scheme is as follows:
a Chinese water gull recognition system with model matching comprises a service layer, a data layer and a display layer,
the service layer comprises three systems, namely a user system, a service system and an algorithm system, wherein the user system is mainly used for managing platform user operation behaviors and information management; the service system is used for managing services among the whole platform modules, and comprises audio address management, bird voiceprint acquisition, bird audio identification information and the like; the algorithm system identifies and detects the type of birds in the audio through artificial intelligent voiceprints and provides reasonable detection feedback information;
the data layer is used for storing data and is divided into a data center, a system database and a voiceprint database, and the data center is used for storing various service data including bird identification types, quantity, date, positions and the like; the system database stores service relation data among system modules, including voiceprints, audio storage addresses and the like; the voiceprint database stores all bird audio data;
the display layer outputs the interactive returned result among the functional modules through the WEB end, and the open API interface calling method developer can call according to the provided calling rule through the related open interface address.
The business system obtains various bird audios on the island through the audio extraction equipment and provides a frequency spectrum-time space interface for recording, positioning, analyzing, annotating and classifying, the algorithm system is used as a background system to realize corresponding functions through artificial intelligent voiceprint recognition, and the artificial intelligent voiceprint recognition comprises noise removal, multi-audio separation, automatic classification and single recognition.
A model matching method of the Chinese crested tern recognition system comprises the following steps:
Step 1, data collection:
individual audio recordings of birds at different places and in different periods are acquired and submitted to the voiceprint database, and the data are processed with a 44.1 kHz sampling rate, 1024-sample frames and 50% overlapping time windows to obtain a standard spectrum;
Step 2, model matching:
the task of identifying multiple sources in a sound field is accomplished using an established multiple-identification paradigm; in order to identify different numbers of bird audio sources, a multiple-identification model is introduced,
the current state of the multiple sound-source observations randomly determines the following state, and the time intervals between them are represented as follows:
P(τ_{n+1} ≤ t, X_{n+1} = Y_j | (X_0, T_0), (X_1, T_1), ..., (X_n, T_n)) = P(τ_{n+1} ≤ t, X_{n+1} = Y_j | X_n), 1 ≤ j ≤ C,
wherein P represents a conditional probability, t represents a specific time, Y_i represents the i-th standard spectrum, (X_n, T_n) represents the observation sequence, X_n denotes the n-th state, T_n denotes the time of the n-th state (event), τ_{n+1} represents the time difference T_{n+1} − T_n, and C represents a positive integer,
if the observed values represent a single sequence, then τ_{n+1} is known and fixed; but if the observed values may represent multiple sequences plus clutter noise, the causal structure is unknown and τ_{n+1} is hidden; in this case the structure is estimated by choosing the division of the data into K clusters plus H noise events that maximizes the likelihood:
L = Σ_{k=1}^{K} log p_MRP(k) + Σ_{η=1}^{H} log p_NOISE(η),
wherein L represents the likelihood of the estimated structure, p_MRP(k) denotes the probability of the observed subsequence in group k, and p_NOISE(η) represents the likelihood of the η-th noise datum;
Step 3, audio recognition:
single syllables of the bird audio are detected with a cross-correlation template-matching paradigm; the syllables are detected from the standard spectrum and the maximum-likelihood solution is computed, thereby realizing the identification of the bird audio.
The invention has the beneficial effects that:
(1) The bird artificial-intelligence identification system integrates the acquisition, detection, denoising, separation, classification and identification of bird audio into one system, thereby realizing intelligent identification of the Chinese crested tern.
(2) The spectrum-time-space interface provides a complete framework based on ecological data analysis and, combined with a feature-mapping technique, realizes an annotation tool, so that the necessary sound sources can be extracted, the time cost of classification can be reduced, the soundscape around the microphone array can be understood, and the song and behavior of the tern can be understood in more detail.
(3) Noise removal is realized in two steps. The first step, segmentation, applies a fully automatic spectrogram-segmentation method that extracts the corresponding audio from each recording; event detection then uses the information provided by a set of weak labels of the recordings, i.e., the labeled bird calls, to automatically detect the calls of each bird and classify them into the labels, thereby realizing accurate vocalization annotation of the Chinese crested tern.
(4) In the second step of noise removal, event detection, the good bird-classification results obtained by the method are used to annotate recordings completely at the unit level; instead of searching known species, cross-correlation is used to find the best visual match of a vocalization, and through multiple matching passes the classification process searches the whole data set for the best visual-similarity match of a segment and refines its possible labels, so that the possible labels of each detected vocalization are reduced. Experiments show that, evaluated by correct classification, the success rate of detecting the Chinese crested tern in a synthetic bird audio data set reaches 75.4%.
(5) The automatic classification method uses a feature set of two-dimensional Mel-spectral coefficients and dynamic two-dimensional Mel-spectral coefficients as vocalization features to classify each syllable in continuous bird-audio recordings, with test syllables and training syllables taken from different recordings. Combining the two-dimensional Mel coefficients with the dynamic two-dimensional Mel coefficients, the classification accuracy for 28 bird species reaches 84.06%, and the Chinese crested tern can be easily identified.
(6) An improved spectrogram representation is used to improve the performance of bird-audio separation: it tracks vocalization patterns, operates in the same paradigm, and demonstrates that improving the underlying representation improves the quality of tracking. A simple bird-audio dictionary is used to analyze the signals, and a powerful parametric technique estimates the characteristics of non-stationary signals; the accurate representation improves the tracking of various birds. The sequential structure in recordings containing several birds is inferred by a multiple-tracking technique; the tracking procedure is applied to a data set of bird-audio recordings and analyzed with a standard spectrogram, which shows that the method benefits the analysis of bird-audio recordings.
(7) The invention provides a wavelet-transform multi-syllable bird-audio feature extraction method, which extracts not only the features of single syllables but also their variation: instead of single syllables, bird-audio segments containing one syllable period are used to extract the feature vectors.
Drawings
FIG. 1 is a block diagram of an artificial intelligent bird identification system according to the present invention;
FIG. 2 is a flow chart of segmentation in noise removal according to the present invention;
FIG. 3 is a flow chart of event detection in noise removal according to the present invention;
FIG. 4 is a flow chart of the automatic classification of the present invention;
FIG. 5 is a flow chart of audio separation according to the present invention;
FIG. 6 is a flow chart of model matching of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
Embodiments of the present invention are illustrated with reference to fig. 1-6.
A Chinese crested tern recognition system with model matching comprises a service layer, a data layer and a display layer,
the service layer comprises three systems, namely a user system, a business system and an algorithm system; the user system is mainly used for managing platform users' operation behavior and information; the business system manages the services between the platform modules, including audio address management, bird voiceprint acquisition, bird audio identification information and the like; the algorithm system identifies and detects the bird species in the audio through artificial-intelligence voiceprint recognition and provides reasonable detection feedback;
the data layer is used for storing data and is divided into a data center, a system database and a voiceprint database; the data center stores various business data, including identified bird species, quantity, date, position and the like; the system database stores the business relation data between system modules, including voiceprints, audio storage addresses and the like; the voiceprint database stores all bird audio data;
the display layer outputs the results returned by the interaction between the functional modules through the WEB end, and developers can call the open API through the related open interface addresses according to the provided calling rules.
The business system acquires the audio of various birds on the island through the audio extraction equipment and provides a spectrum-time-space interface for recording, positioning, analyzing, annotating and classifying; the algorithm system, as a background system, realizes the corresponding functions through artificial-intelligence voiceprint recognition, which comprises noise removal, multi-audio separation, automatic classification and individual recognition.
The system service adopts the lightweight Flask web application framework with the Werkzeug WSGI toolkit; Flask has a built-in server and unit testing, adapts to RESTful interfaces and supports secure cookies. A Keras deep-learning artificial neural network and OpenCV machine-learning algorithms capture dynamic voiceprints in real time for recognition. Voiceprint data are collected automatically, realizing accurate and intelligent identification of the Chinese crested tern.
The business system realizes bird voiceprint acquisition. Audio extraction equipment, consisting of a microphone array and data processing equipment, is used to extract sound sources and their directions; the collected audio files are edited with the spectrum-time-space interface on the data processing equipment, the distribution of sound sources can be observed in a two-dimensional feature space and the sound types in the recording can be understood, so that the components of the soundscape are known and the soundscape is classified by grouping similar sounds in the space. The user records, positions, analyzes, annotates and classifies the sound sources on the visual spectrum-time-space interface; files or folders to be operated on can be selected on the left side of the window, and operation settings can be changed or various functions executed on the right side.
In the recording-selection part, the user starts recording in 16 kHz, 16-bit format with the microphone array and plays back or splits the recordings. The system supports simultaneous recording from several microphone arrays connected to the data processing equipment and supports two-dimensional localization for synchronized recordings; one recording file can be divided into several files by setting the number of partitions or the recording time of each file, so that parameter settings suitable for localization can be found before long-term recordings are analyzed.
In the localization part, sound-source localization is performed on several short-time-Fourier-transform spectrograms using a multiple-signal-classification (MUSIC) method, and the separated sound of each localized source is extracted as a waveform file; the basic parameter values for bird-audio localization and separation are set in the list on the right, and additional parameters can be added to the list by defining parameter names and corresponding flags in a network file.
In the analysis part, the time distribution and directivity of the sounds are analyzed visually; the spectrogram and localization results are output in PDF format with a specified total number of pages through the export-file buttons, which helps summarize the results on an appropriate time scale, and the data of all sound sources, including their directions and durations, are output as data-interchange-format files, loaded into the annotation tool and saved in the voiceprint database.
In the annotation part, the spectrogram of a recording is displayed in the panel at the top of the annotation window, with the time scale and a focus time scale on the x-axis (both adjustable) and the audio and the direction of the corresponding sound source on the y-axis. Each box in the annotation window marks, on the x-axis, the start (left edge) and end (right edge) times and, on the y-axis, the direction of the corresponding source at its start time; the color of each box represents the class of the audio/sound source. Clicking the box of a localized sound displays its localization information on the right, where the information can be edited manually; the separated sound, or the corresponding span of the original recording, can be played back; editing operations can be undone; the position of each source can be modified by dragging the corresponding box; and the modified data are stored in the voiceprint database as data-interchange-format files.
In the classification part, the spectrograms (100 × 64 pixels) of all separated sounds are used as a data set; the localized sound sources are reduced in dimension with a learning library, drawn on a two-dimensional plane and inspected visually, and a grid search over parameter settings (complexity, learning rate, number of iterations and the like) is performed to classify the localized sounds. After a suitable dimensionality-reduction result is extracted, the sound sources are visualized in the feature space through the interface: the separated sources are displayed as nodes in the classification-tool interface, clicking each node displays its spectrogram in another window and plays back the separated sound, and a group of nodes can be classified into one class by surrounding them with a frame. This grouping can be accomplished with simple keyboard and mouse operations, allowing the user to classify similar sounds at one time; the user can also remove noise while editing the spectrogram, and the classified data are stored in the voiceprint database as data-interchange-format files when the window is closed.
The spectrum-time-space interface provides a complete framework based on ecological data analysis and, combined with a feature-mapping technique, realizes the annotation tool, so that the necessary sound sources are extracted, the time cost of classification is reduced, the soundscape around the microphone array is understood, and the song and behavior of the birds can be understood in more detail.
The specific process of noise removal, which includes segment segmentation and event detection, is as follows:
The segment segmentation specifically comprises the following steps:
step 1, the audio data is processed by a short-time Fourier transform;
step 2, segmentation and detection are carried out;
step 3, normalization is performed according to the absolute maximum value;
step 4, audio at frequencies above 20 kHz and below 340 Hz is removed; no bird calls occur at these frequencies in nature, so noise is filtered out;
step 5, median clipping is performed on the spectrogram for each frequency and each time frame to obtain a binary image and eliminate noise, specifically: if a pixel value in the spectrogram is larger than 3 times the median of its corresponding row and column, the pixel is set to 1, otherwise it is set to 0;
step 6, a closing operation is applied in a rectangular neighborhood of size (3, 3) to fill any small holes in the detected regions;
step 7, connected components with fewer than 5 pixels are removed;
step 8, dilation is applied in a rectangular neighborhood of size (7, 7): the dilation algorithm sets the pixel at (i, j) to the maximum value of all pixels in the neighborhood centered at (i, j), enlarging the regions containing features (i.e., vocalizations) relative to the small objects regarded as noise;
step 9, filtering the image by using a median filter;
step 10, removing a part smaller than 150 pixels, and accordingly segmenting the binary spectrum image;
step 11, dilation with a circular area of radius 3 is applied again;
step 12, defining all connected pixels as a segment, and carrying out segment segmentation;
and step 13, calculating the size and the position of each segment.
The method generates fewer noise segments and generates larger sounding segments.
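As a concrete illustration of steps 1-13, the sketch below implements the median-clipping segmentation pipeline with common Python signal- and image-processing libraries; the library choice and default values are assumptions for illustration, not the patented implementation.
import numpy as np
from scipy import signal, ndimage
from skimage import morphology, measure

def segment_spectrogram(x, fs=44100):
    # Step 1: short-time Fourier transform
    f, t, S = signal.stft(x, fs=fs, nperseg=1024, noverlap=512)
    mag = np.abs(S)
    # Step 3: normalise by the absolute maximum
    mag /= mag.max() + 1e-12
    # Step 4: keep only 340 Hz - 20 kHz, where bird calls occur
    band = (f >= 340) & (f <= 20000)
    mag, f = mag[band], f[band]
    # Step 5: median clipping -> binary image
    row_med = np.median(mag, axis=1, keepdims=True)
    col_med = np.median(mag, axis=0, keepdims=True)
    binary = (mag > 3 * row_med) & (mag > 3 * col_med)
    # Step 6: closing with a (3, 3) rectangle fills small holes
    binary = morphology.binary_closing(binary, np.ones((3, 3), dtype=bool))
    # Step 7: remove connected components smaller than 5 pixels
    binary = morphology.remove_small_objects(binary, min_size=5)
    # Step 8: dilation with a (7, 7) rectangle enlarges vocalisation regions
    binary = morphology.binary_dilation(binary, np.ones((7, 7), dtype=bool))
    # Step 9: median filtering
    binary = ndimage.median_filter(binary, size=3)
    # Step 10: drop regions smaller than 150 pixels
    binary = morphology.remove_small_objects(binary, min_size=150)
    # Step 11: dilate again with a disk of radius 3
    binary = morphology.binary_dilation(binary, morphology.disk(3))
    # Steps 12-13: label connected pixels as segments; report size and position
    labels = measure.label(binary)
    return [(region.area, region.bbox) for region in measure.regionprops(labels)]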
The event detection specifically comprises the following steps:
for each fragment, creating a taggable list, initializing to a weak tag containing a record of the fragment, the classification process will eliminate the tags unlikely to appear in the fragment by deduction, shorten the list of the fragment to one or more tags, each fragment to be marked is normalized by a matching template function, matching with different records to obtain all possible tag matches, normalized correlation is used to match the template (utterance) with a two-dimensional target image (spectrogram of the recording), a response image of the same size as the target image, the correlation coefficient between the template and the target image is between-1, 0 and 1, 0, by searching for the largest peak in the response image, finding the matching value between the fragment and a specific record, similar bird calls should appear at similar frequencies, applying the matching template to a smaller frequency range (5 below the fragment frequency or above the fragment frequency), thereby reducing the amount of computation.
In a single training set, no single training requires classification. The performance of this approach increases as the number of records per species increases. The chances of finding a segment match in the classification process increase as the voicing of each species changes. This process is divided into three different processes, namely a first track, a second track and a third track, which are applied to the recording in sequence, as follows:
step 1, first matching
Creating a set of records for each segment to find matches, indicating different combinations of tags generated from the initialization list, the records having tags in their weak tags, for each segment for which a tag is needed, searching the list of records, increasing the number of weak tags until a match is found or there are no more records remaining, the matching template returning the maximum peak in the response image, and when the similarity ratio returned by the matching template is 0, 4 or greater, in order to find a match.
Step 2, second matching
The second matching solves the first matching of the unmatched segments, all tags of the audio recording are assigned to at least one segment, and when the unmatched segments and tags of the corresponding segments are not in the audio recording, the unassigned tags are assigned to all unmatched segments.
Step 3, matching for the third time
After reducing the two matches, there may still be unassigned tags in the audio recording, all tags of the audio recording need to be assigned to at least one segment, in a recording where all segments have tags but some weak tags are not assigned to any segment, there must be some tags assigned to multiple segments (likely erroneous), possibly more than one segment having this tag, but when a tag is unassigned, one of the segments that matches the same tag is assumed to be misclassified, and the segments remaining for any unassigned tags are searched for the best match. If a match is found, the label of the segment derived from it will be changed to an unassigned label. The marking of the spectrogram is realized through the three-time matching, and noise and non-bird cry are removed.
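The matching used in all three passes is ordinary normalized cross-correlation of spectrogram patches. The sketch below shows one way to realize it with OpenCV's matchTemplate (TM_CCOEFF_NORMED yields the −1.0 to 1.0 coefficient described above); the 0.4 threshold follows the text, while function and variable names are illustrative assumptions.
import cv2
import numpy as np

MATCH_THRESHOLD = 0.4  # a segment matches a recording when the peak correlation is >= 0.4

def best_match(segment_patch, recording_spectrogram):
    """Slide the segment spectrogram (template) over a recording spectrogram and
    return the largest normalised correlation coefficient; the recording spectrogram
    must be at least as large as the template in both dimensions."""
    response = cv2.matchTemplate(recording_spectrogram.astype(np.float32),
                                 segment_patch.astype(np.float32),
                                 cv2.TM_CCOEFF_NORMED)
    return float(response.max())

def first_pass(segment_patch, candidate_recordings):
    """First matching pass: scan recordings whose weak labels could explain the segment
    and return the label of the first recording whose peak correlation reaches the threshold."""
    for label, spectrogram in candidate_recordings:
        if best_match(segment_patch, spectrogram) >= MATCH_THRESHOLD:
            return label
    return None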
Wherein, the automatic classification specifically is:
step 1, feature extraction
For most bird calls there is more or less temporal variation between adjacent analysis frames within each syllable; in syllable recognition the audio portion with the largest spectral transition carries the most important information, and dynamic two-dimensional Mel-spectral coefficients are used to describe the rapid transitions within a syllable.
Step 1.1, calculating two-dimensional plum spectral coefficient
The two-dimensional plum spectrum implicitly represents static characteristics and dynamic characteristics of a voice signal in a matrix form, a two-dimensional plum spectrum matrix T (q, n) can be obtained by applying two-dimensional discrete cosine transform to a continuous logarithmic spectrum sequence, a first dimension q of the two-dimensional plum spectrum matrix T (q, n) represents a cepstrum, a second dimension n represents time variation of each cepstrum coefficient, each syllable of a bird singing is modeled by adopting the two-dimensional plum spectrum coefficient, and the two-dimensional discrete cosine transform is applied to logarithmic energy of a plum spectrum scale band-pass filter defined according to a human auditory perception model to obtain a two-dimensional plum spectrum coefficient matrix C (q, n):
Figure BDA0002861546090000101
in the formula, Et(b) Is the energy of the B-th me spectral scale band pass filter of the t-th frame, q is the frequency index, n is the modulation frequency index, B is the number of me spectral scale band pass filters, L is the number of frames within a syllable, the two-dimensional discrete cosine transform is decomposed into two one-dimensional discrete cosine transforms, C (q, n) applies the one-dimensional discrete cosine transform to a continuous sequence of L MFCC coefficients along the time axis, expressed as:
Figure BDA0002861546090000102
the first row of the two-dimensional meissner coefficient matrix with frequency index q equal to 0 preserves the temporal variation of the energy in short time, each element in the first column with modulation frequency index n equal to 0 represents the average of the cepstral coefficients of all the analysis frames, on the frequency axis the lower coefficients represent the spectral envelope, the higher coefficients represent the pitch and excitation, on the time axis the lower coefficients represent the overall variation of the frequency and the higher coefficients represent the local variation of the frequency.
The analyzed frame number is different according to syllables due to different durations of different syllables, the number of columns in C (q, n) is different according to syllables, more useful information is provided for audio recognition by coefficients of a lower half part along a frequency axis q and a time axis n than coefficients of a higher part, the coefficients of the first 15 rows and the first 5 columns of C (q, n) are used for not comprising the coefficient C (0, 0) as the initial pronunciation characteristic of the syllable, 74 coefficients are selected from a two-dimensional Meme spectral coefficient matrix C (q, n) to form a two-dimensional Meme spectral coefficient characteristic vector of the syllable, the dimension of the characteristic vector is fixed, and the two-dimensional Meme spectral coefficient characteristic vector F is fixedTDExpressed as:
FTD=[C(0,1),...,C(0,4),C(1,0),...,C(1,4),...,C(14,0),...,C(14,4)]T
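A minimal sketch of computing F_TD for one syllable is given below; it assumes librosa for the Mel filterbank and SciPy for the DCT, and the frame settings are illustrative.
import numpy as np
import librosa
from scipy.fftpack import dct

def two_dim_mel_cepstrum(syllable, fs=44100, n_mels=40, n_q=15, n_n=5):
    # Log energies of the B Mel-scale band-pass filters for the L frames of the syllable
    E = librosa.feature.melspectrogram(y=syllable, sr=fs, n_fft=1024,
                                       hop_length=512, n_mels=n_mels)
    logE = np.log(E + 1e-12)                         # shape (B, L)
    # Two-dimensional DCT: along the filter axis, then along the time axis
    C = dct(dct(logE, axis=0, norm='ortho'), axis=1, norm='ortho')
    # Keep the first 15 rows and 5 columns and drop C(0, 0): 74 coefficients.
    # The syllable is assumed to span at least 5 analysis frames.
    block = C[:n_q, :n_n].flatten()
    return np.delete(block, 0)

# F_TD = two_dim_mel_cepstrum(syllable_samples)      # 74-dimensional feature vector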
Step 1.2, calculating the dynamic two-dimensional Mel-spectral coefficients
Dynamic two-dimensional Mel-spectral coefficients are used to emphasize sharp transitions within syllables. The dynamic two-dimensional Mel-spectral coefficient is a recognition feature based on the combination of the instantaneous and dynamic characteristics of the speech spectrum: within a syllable, the portion with the largest spectral transition carries the most important speech information. For isolated-word recognition the dynamic characteristics are defined by regression coefficients, i.e., first-order orthogonal polynomial coefficients, which represent the slope of the time function of each cepstral coefficient over the measured speech segment. Dynamic two-dimensional Mel-spectral coefficients are extracted to highlight the portion of the syllable with the largest spectral transition; the regression coefficient r_t(b) of the b-th Mel-scale band of the t-th frame is:
r_t(b) = (Σ_{k=−n_0}^{n_0} k·log E_{t+k}(b)) / (Σ_{k=−n_0}^{n_0} k²),
in the formula, n_0 is the interval length over which the transition information is measured, and r_t(b) reflects the energy transition around the t-th frame. The regression coefficient r_t(b) output for the b-th Mel-scale band-pass filter is applied to E_t(b) to obtain the enhanced energy
Ê_t(b) = r_t(b)·E_t(b),
the logarithm of the enhanced energy, log Ê_t(b), is taken for emphasis, and a two-dimensional discrete cosine transform is applied to obtain the cosine-transform matrix
C_D(q, n) = Σ_{t=1}^{L} Σ_{b=1}^{B} log Ê_t(b)·cos[πq(2b − 1)/(2B)]·cos[πn(2t − 1)/(2L)].
From C_D(q, n), the coefficients in the first 15 rows and first 5 columns (excluding C_D(0, 0)) are selected as the dynamic two-dimensional Mel-spectral coefficients of the syllable, and the dynamic two-dimensional Mel-spectral feature vector F_DT is expressed as
F_DT = [C_D(0,1), ..., C_D(0,4), C_D(1,0), ..., C_D(1,4), ..., C_D(14,0), ..., C_D(14,4)]^T.
Step 1.3, feature-vector combination
To obtain better classification results, the two feature vectors F_DT and F_TD are combined to obtain a larger feature vector, the combined feature vector F_SD, which describes the static, dynamic and spectral-transition information within a syllable; F_SD is formed from F_DT and F_TD as
F_SD = [F_TD^T, F_DT^T]^T.
Step 1.4, normalization of feature values
Without loss of generality, let F denote a computed syllable feature vector (F_DT, F_TD or F_SD); each feature value is normalized to the range 0 to 1, expressed as:
x(m) = (F(m) − Q1(m)) / (Q3(m) − Q1(m)),
wherein F(m) is the m-th feature value, x(m) is the normalized m-th feature value, and Q1(m) and Q3(m) are the first and third quartiles, defined such that 25% (respectively 75%) of the m-th feature values of all training syllables are less than or equal to that value; values outside this range are clipped, so very high and very low feature values are normalized to 1 and 0 and the normalized feature values are not affected by noise. The first quartile Q1(m) and the third quartile Q3(m) of each feature value are calculated on the training syllables; in the classification stage, each feature value extracted from the input syllable is normalized with the reference quartile values (Q1(m) and Q3(m)) to obtain its normalized value.
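A short sketch of this quartile normalization, with NumPy percentiles standing in for the training-set quartiles; names are illustrative.
import numpy as np

def fit_quartiles(training_features):
    """training_features: (num_syllables, num_features) matrix of F_DT/F_TD/F_SD vectors."""
    q1 = np.percentile(training_features, 25, axis=0)
    q3 = np.percentile(training_features, 75, axis=0)
    return q1, q3

def normalise(feature_vector, q1, q3):
    x = (feature_vector - q1) / (q3 - q1 + 1e-12)
    return np.clip(x, 0.0, 1.0)   # extreme values map to 0 or 1, limiting the influence of noise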
Step 2, principal component analysis, which is defined as the orthogonal projection of the data onto a lower-dimensional vector space such that the variance of the projected data is maximized.
Step 2.1, for the set of D-dimensional training vectors X = {x_j, j = 1, ..., N}, calculate the D-dimensional mean vector μ and the D × D covariance matrix Σ:
μ = (1/N)·Σ_{j=1}^{N} x_j,
Σ = (1/N)·Σ_{j=1}^{N} (x_j − μ)(x_j − μ)^T.
Step 2.2, calculate the eigenvectors v_i and the corresponding eigenvalues λ_i (1 ≤ i ≤ D) of the covariance matrix Σ and sort them in descending order of eigenvalue; the first d eigenvectors with the largest eigenvalues form the columns of the D × d transform matrix A_PCA,
A_PCA = [v_1, v_2, ..., v_d].
The number of eigenvectors d is determined by finding the smallest integer that meets the following criterion,
(Σ_{i=1}^{d} λ_i) / (Σ_{i=1}^{D} λ_i) ≥ α,
where α is the percentage of information that needs to be retained; based on the transform matrix A_PCA, the projection vector x_PCA is calculated as
x_PCA = A_PCA^T·(x − μ).
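The PCA step can be written directly from the formulas above; the following NumPy sketch is illustrative (the α = 0.95 default is an assumption).
import numpy as np

def fit_pca(X, alpha=0.95):
    """X: (N, D) training matrix; alpha: fraction of variance to retain."""
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False)                 # D x D covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)               # ascending eigenvalues
    order = np.argsort(eigval)[::-1]                   # sort by descending eigenvalue
    eigval, eigvec = eigval[order], eigvec[:, order]
    ratio = np.cumsum(eigval) / eigval.sum()
    d = int(np.searchsorted(ratio, alpha) + 1)         # smallest d retaining alpha of the variance
    A_pca = eigvec[:, :d]                              # D x d transform matrix
    return mu, A_pca

def project(x, mu, A_pca):
    return A_pca.T @ (x - mu)                          # d-dimensional projection x_PCA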
Step 3, prototype vector generation
The audio of each bird consists of several syllables with different characteristics, and two syllables taken from the same bird's song may differ greatly. Prototype vectors cluster syllables with similar feature vectors together by dividing the syllables of the same bird into several sub-categories. The procedure comprises the following steps:
Step 3.1, model selection
The Gaussian-Bayesian model of each bird species is as follows: the training vector set X = {x_j, 1 ≤ j ≤ N} is modeled by a Gaussian mixture with parameter set
Θ = {w_i, μ_i, Σ_i, 1 ≤ i ≤ M},
where M is the number of mixture components, d is the dimension of each feature vector and N is the number of training vectors; the mixture weights w_i are identically distributed. The covariance matrix of each Gaussian component is calculated and may be replaced by the average covariance matrix of the Gaussians of all birds,
Σ̄ = (1/S)·Σ_{s=1}^{S} (1/N_s)·Σ_{j=1}^{N_s} Σ_{s,j},
wherein S represents the total number of bird species, N_s is the number of Gaussian components selected for bird species s, and Σ_{s,j} is the covariance matrix of the j-th Gaussian component of species s. The Bayesian criterion of the averaged-covariance model, in which the M Gaussian components with d-dimensional mean vectors share a common diagonal covariance matrix, is computed and compared with that of the model with individual covariance matrices: the model with the larger criterion value is selected as the best model of the bird. When the training data are limited, the model with the shared (average) covariance matrix tends to be selected; when a large amount of training data is available, the model with individual covariance matrices is expected to be selected.
Step 3.2, component number selection
Each training sample is assigned to the Gaussian component most likely to have produced it, grouping the training data into clusters. The number of clusters used to model the audio of a bird must be species-specific and depends on the acoustic variation of each bird, and the choice of this number affects the classification accuracy. Starting with a single Gaussian component, one selected component is successively split into two new Gaussian components, and the selection and splitting process is repeated until the most appropriate number of components is found; a Bayesian criterion is used to choose the component to be split and to determine the appropriate number of components.
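The split-and-select procedure is approximated below with scikit-learn's GaussianMixture, choosing the component count per species by the Bayesian information criterion; this is a rough stand-in for the patented splitting scheme, and the parameter names are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def prototype_vectors(species_features, max_components=8):
    """species_features: (N, d) matrix of transformed syllable features for one species.
    Returns the component means, used as the species' prototype vectors."""
    best_gmm, best_bic = None, np.inf
    for m in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=m, covariance_type='diag',
                              random_state=0).fit(species_features)
        bic = gmm.bic(species_features)
        if bic < best_bic:                 # keep the component count favoured by the criterion
            best_gmm, best_bic = gmm, bic
    return best_gmm.means_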
Step 4, linear discriminant analysis
Linear discriminant analysis is used to provide greater discrimination between the various birds and to further improve the classification accuracy in the low-dimensional feature space; it attempts to minimize the intra-class distance while maximizing the inter-class distance. In linear discriminant analysis an optimal transform matrix is determined, corresponding to a mapping from the d-dimensional feature space to a k-dimensional space with k < d; the linear mapping to be maximized is
J_F(A) = tr((A^T·S_W·A)^{−1}·(A^T·S_B·A)),
where A is the mapping matrix and S_W and S_B denote the intra-class and inter-class scatter matrices respectively. The intra-class scatter matrix S_W is:
S_W = Σ_{s=1}^{S} Σ_{x∈C_s} (x − μ_s)(x − μ_s)^T,
wherein S represents the total number of bird species, C_s is the set of feature vectors assigned to bird species s, and μ_s is the mean vector of bird species s;
the inter-class scatter matrix S_B is:
S_B = Σ_{s=1}^{S} N_s·(μ_s − μ)(μ_s − μ)^T,
wherein N_s denotes the number of feature vectors of the s-th bird species and μ is the mean vector of all training vectors. The multivariate normal distribution of the training vector set is converted into a spherical normal distribution: the eigenvectors and corresponding eigenvalues of S_W are calculated, Φ denotes the transformation matrix whose columns are the eigenvectors of S_W, and Λ is the diagonal matrix of the corresponding eigenvalues;
each training vector x is transformed to obtain x′,
x′ = Λ^{−1/2}·Φ^T·x,
so that the intra-class scatter matrix of the whitened vectors, S′_W, becomes the identity matrix, and the inter-class scatter matrix of the whitened vectors,
S′_B = Λ^{−1/2}·Φ^T·S_B·Φ·Λ^{−1/2},
contains all the discriminative information. The transformation matrix ψ is obtained by finding the eigenvectors of S′_B; assuming the eigenvalues are arranged in descending order, the eigenvectors corresponding to the largest k = S − 1 eigenvalues constitute the columns of ψ. The optimal transformation matrix A_LDA is defined as:
A_LDA = Φ·Λ^{−1/2}·ψ,
wherein A_LDA is used to transform each d-dimensional feature vector produced by the principal component analysis into a low-dimensional vector; the k-dimensional feature vector obtained from the d-dimensional principal-component-analysis vector x_PCA by the linear discriminant analysis transform is calculated as
x_LDA = A_LDA^T·x_PCA.
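A NumPy sketch of this whitening-based LDA is given below; variable names are illustrative and it assumes the species labels are available for each PCA-transformed syllable vector.
import numpy as np

def fit_lda(X, y):
    """X: (N, d) PCA-transformed vectors, y: species labels. Returns A_LDA (d x k), k = S - 1."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for s in classes:
        Xs = X[y == s]
        mu_s = Xs.mean(axis=0)
        S_W += (Xs - mu_s).T @ (Xs - mu_s)                 # within-class scatter
        S_B += len(Xs) * np.outer(mu_s - mu, mu_s - mu)    # between-class scatter
    # Whitening transform: makes S_W the identity matrix
    lam, Phi = np.linalg.eigh(S_W)
    W = Phi @ np.diag(1.0 / np.sqrt(lam + 1e-12))
    # Eigenvectors of the whitened between-class scatter carry the discriminative directions
    Sb_w = W.T @ S_B @ W
    lam_b, Psi = np.linalg.eigh(Sb_w)
    k = len(classes) - 1
    Psi = Psi[:, np.argsort(lam_b)[::-1][:k]]
    return W @ Psi                                          # A_LDA, maps d -> k dimensions

# final feature for classification: f = A_LDA.T @ x_pca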
Step 5, classification
Each syllable is classified with a nearest-neighbor classifier: the feature vector of each input syllable is calculated, the same normalization is applied to each feature value, and the normalized feature vector is transformed by the principal component analysis transform matrix A_PCA and the linear discriminant analysis transform matrix A_LDA to obtain the final feature vector f,
f = A_LDA^T·A_PCA^T·(x − μ),
where x is the normalized feature vector of the input syllable. The distance to the prototype vectors of each bird is measured by the Euclidean distance, and sc, the index used to classify the bird species, is determined by finding the prototype vector with the shortest distance to f:
sc = arg min d(f, f_{s,j}), 1 ≤ s ≤ S, 1 ≤ j ≤ N_s,
wherein f_{s,j} denotes the j-th prototype vector of the s-th bird species and N_s is the number of prototype vectors of the s-th bird species; the bird species of the audio is determined by sc.
The method provides a new tool for classifying or distinguishing birds by their audio: bird audio differs between species, and even within one species a bird can produce several different types of audio. The automatic classification method uses a feature set of two-dimensional Mel-spectral coefficients and dynamic two-dimensional Mel-spectral coefficients as vocalization features, classifies each syllable in continuous bird-audio recordings, and takes test syllables and training syllables from different recordings. Combining the two-dimensional Mel coefficients and the dynamic two-dimensional Mel coefficients, the classification accuracy for 28 bird species reaches 84.06%, and the Chinese crested tern can be easily identified.
According to a specific embodiment of the present invention, the specific process of audio separation is as follows:
step 1, Fourier transform
For an arbitrary distribution x and a test function ψ, the inner product ⟨·,·⟩ satisfies ⟨x′, ψ⟩ = −⟨x, ψ′⟩; then, for the bird audio signal s, a distribution is considered which satisfies:
⟨s′, w·e^{−jωt}⟩ = −⟨s, w′·e^{−jωt}⟩ + jω·⟨s, w·e^{−jωt}⟩,
wherein ⟨·,·⟩ denotes the inner product, ′ denotes the derivative, w is a finite time-window function and s is the bird audio signal; the Fourier transform at frequency ω through the window w is written as:
S_w(ω) = ⟨s, w·e^{−jωt}⟩.
Step 2, the sinusoid model
A generalized sinusoid is written as
s(t) = e^{r(t)}, r(t) = Σ_{k=0}^{C} r_k·t^k,
wherein s(t) represents the sinusoid function, t represents time, r(t) represents the non-stationary function, r_k represents a non-stationary parameter, and k represents the order, a positive integer up to C. Substituting this model into the relation above yields the following equation:
Σ_{k=1}^{C} k·r_k·S_{t^{k−1}w}(ω) = −⟨s, w′·e^{−jωt}⟩ + jω·S_w(ω),
wherein
S_{t^{k−1}w}(ω) = ⟨s, t^{k−1}·w·e^{−jωt}⟩.
For any finite time-window function w, this can be used to estimate the non-stationary parameters r_k with k > 0.
Step 3, parameter estimation
After the non-stationary parameters r_k, k > 0, are estimated, the complex stationary parameter r_0 is estimated according to
r̂_0 = log S_w(ω) − log ⟨e^{P(r(t))}, w·e^{−jωt}⟩,
where P(r(t)) represents the estimated non-stationary part of the function r(t), i.e. Σ_{k=1}^{C} r̂_k·t^k.
Step 4, estimating the frequency variation of bird audio
The estimated values r̂_k, k > 0, are used instead of the parameters r_k, k > 0, to obtain the estimate of r_0. The linear system is evaluated at different frequencies with the windows w, w′ and t·w, giving the values S_w, S_w′ and S_tw, among which the window w(t)·t forms the widest main-lobe width; the recording is divided into 5 segments in total, and the frequency variation of typical bird audio is estimated from real recordings.
Step 5, audio separation
A lower frequency limit ω_L and an upper frequency limit ω_H divide the spectrum of the frequency and amplitude estimates S_w, S_w′ and S_tw, so that separate single-bird audio is obtained; identification is then performed, the identification methods for single-bird audio being model matching and wavelet recognition.
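A minimal sketch of the band-limited separation, assuming the limits ω_L and ω_H have already been obtained from the frequency-variation estimates; the STFT masking shown here is an illustrative simplification, not the patented procedure.
import numpy as np
from scipy import signal

def extract_band(x, fs, w_lo, w_hi, nperseg=1024):
    f, t, S = signal.stft(x, fs=fs, nperseg=nperseg)
    mask = (f >= w_lo) & (f <= w_hi)
    S_band = np.where(mask[:, None], S, 0.0)       # keep only bins between the band limits
    _, x_band = signal.istft(S_band, fs=fs, nperseg=nperseg)
    return x_band                                   # candidate single-bird audio for recognition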
The specific process of model matching is as follows:
step 1, collecting data
Individual audio recordings of birds at different places and in different periods are acquired and submitted to the voiceprint database, and the data are processed with a 44.1 kHz sampling rate, 1024-sample frames and 50% overlapping time windows to obtain a standard spectrum.
Step 2, matching the model
The task of identifying multiple sound sources in a sound field is accomplished using an established multiple-identification paradigm; in order to identify different numbers of bird audio sources, a multiple-identification model is introduced. The current state of the multiple sound-source observations randomly determines the following state, and the time intervals between them are represented as follows:
P(τ_{n+1} ≤ t, X_{n+1} = Y_j | (X_0, T_0), (X_1, T_1), ..., (X_n, T_n)) = P(τ_{n+1} ≤ t, X_{n+1} = Y_j | X_n), 1 ≤ j ≤ C,
wherein P represents a conditional probability, t represents a specific time, Y_i represents the i-th standard spectrum, (X_n, T_n) represents the observation sequence, X_n denotes the n-th state, T_n denotes the time of the n-th state (event), τ_{n+1} represents the time difference T_{n+1} − T_n, and C represents a positive integer. If the observed values represent a single sequence, then τ_{n+1} is known and fixed; but if the observed values may represent multiple sequences plus clutter noise, the causal structure is unknown and τ_{n+1} is hidden. In this case the structure is estimated by choosing the division of the data into K clusters plus H noise events that maximizes the likelihood
L = Σ_{k=1}^{K} log p_MRP(k) + Σ_{η=1}^{H} log p_NOISE(η),
wherein L represents the likelihood of the estimated structure, p_MRP(k) represents the probability of the observed subsequence in the k-th group generated by a single MRP, and p_NOISE(η) represents the probability of the η-th noise datum.
Step 3, audio recognition
Single syllables of the bird audio are detected with a cross-correlation template-matching paradigm: the syllables are detected from the standard spectrum and the maximum-likelihood solution is computed, thereby realizing the identification of the bird audio. The technique uses a series of spectral bins from the improved basic spectral representation to infer detailed information about the modulated sinusoids, which is particularly useful for bird audio and enables fast audio recognition.
The wavelet transformation process comprises preprocessing, feature extraction and identification, and specifically comprises the following steps:
step 1, pretreatment
Through preprocessing, a section of syllables is appropriately segmented so that features can be extracted, specifically:
step 1.1, syllable endpoint detection, as follows:
Step 1.1.1, calculate the short-time Fourier transform X[m, k] of x[n] with a frame size N of 512,
X[m, k] = Σ_{n=0}^{N−1} x[n]·w_m[n]·e^{−j2πkn/N},
where m is the frame number and the Hamming window w_m[n] used for the short-time analysis has the form
w_m[n] = 0.54 − 0.46·cos(2πn/(N − 1)).
Step 1.1.2, form the spectrogram of the signal by aligning the spectra of all frames, X[m, k], m = 1, 2, ..., M.
Step 1.1.3, for each frame m, find the frequency bin with the largest amplitude, bin_m:
bin_m = arg max_k |X[m, k]|.
Step 1.1.4, initialize the syllable index j = 1.
Step 1.1.5, find the frame t with the maximum amplitude,
t = arg max_m |X[m, bin_m]|,
where the amplitude of syllable j is A_j, given by
A_j = 20·log10 |X[t, bin_t]| (dB).
Step 1.1.6, starting from the t-th frame, move backwards and forwards to the h_j-th frame and the t_j-th frame; if both amplitudes
20·log10 |X[h_j, bin_{h_j}]| and 20·log10 |X[t_j, bin_{t_j}]|
are less than (A_j − 20), then the h_j-th and t_j-th frames are called the head and end frames of syllable j.
Step 1.1.7, set |X[m, bin_m]| = 0 for m = h_j, h_j + 1, ..., t_j − 1, t_j.
Step 1.1.8, set j = j + 1.
Step 1.1.9, return to step 1.1.5, until A_j < A_{j−1} − 20; through the above steps, the boundary of each syllable is obtained.
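The endpoint-detection loop of step 1.1 can be sketched as follows; the frame settings follow the text (N = 512, Hamming window, 20 dB drop), while the function and variable names are illustrative.
import numpy as np
from scipy import signal

def detect_syllables(x, fs, drop_db=20.0, n_fft=512):
    f, t, S = signal.stft(x, fs=fs, nperseg=n_fft, window='hamming')
    mag = np.abs(S)
    bins = mag.argmax(axis=0)                               # bin_m: strongest bin of frame m
    peak = 20 * np.log10(mag[bins, np.arange(mag.shape[1])] + 1e-12)
    syllables, prev_amp = [], None
    while True:
        m = int(peak.argmax())                              # frame t with the maximum amplitude
        amp = peak[m]                                       # A_j
        if prev_amp is not None and amp < prev_amp - drop_db:
            break                                           # stop when A_j < A_(j-1) - 20 dB
        lo = m
        while lo > 0 and peak[lo - 1] >= amp - drop_db:
            lo -= 1                                         # head frame h_j
        hi = m
        while hi < len(peak) - 1 and peak[hi + 1] >= amp - drop_db:
            hi += 1                                         # end frame t_j
        syllables.append((lo, hi))
        peak[lo:hi + 1] = -np.inf                           # clear the detected syllable and repeat
        prev_amp = amp
    return syllables                                        # frame boundaries of each syllable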
Step 1.2, normalization and pre-emphasis
Differences in amplitude caused by the diversity of recording environments are adjusted by a normalization process: the amplitude is linearly normalized to the range [−1, 1]. Since the amplitude of high-frequency signals is usually much smaller than that of low-frequency signals, a pre-emphasis technique is used to enhance the high-frequency signals. The enhancement is realized by a finite impulse response (FIR) filter H(z) of the form
H(z) = 1 − a·z^{−1};
filtering the signal x[n] with H(z) gives
y[n] = x[n] − a·x[n − 1],
where a is the pre-emphasis coefficient, between 0.9 and 1, which is set to 0.95 in the present invention.
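A short sketch of the normalization and pre-emphasis (a = 0.95, assuming the filter form given above):
import numpy as np

def normalise_and_preemphasise(x, a=0.95):
    x = x / (np.max(np.abs(x)) + 1e-12)      # linear normalisation to [-1, 1]
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - a * x[:-1]               # y[n] = x[n] - a * x[n-1], boosts high frequencies
    return y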
Step 1.3, segmentation
The segmentation is centered on a section of syllables rather than on single syllables; since the syllables of bird audio are usually repeated, the feature vector of a section of syllables is very practical for bird-audio identification. After endpoint detection, normalization and pre-emphasis, the segmentation is completed by detecting the repetition of syllables, as follows:
Step 1.3.1, set i = 1, the index of the first syllable of the section.
Step 1.3.2, find the syllable j such that the similarity between syllables i and j falls below the threshold a, where j is the last syllable of the section.
Step 1.3.3, set the section length l = j.
Step 1.3.4, set k = j + 1.
Step 1.3.5, set i = 1 and l = j.
Step 1.3.6, calculate the similarity sim_ki between syllable k and syllable i.
Step 1.3.7, if sim_ki > a (same type) and l = k − j, stop the segmentation; the section runs from syllable 1 to syllable l. Otherwise, if l = j, set j = j + 1 and go to step 1.3.5; otherwise set i = i + 1 and k = k + 1 and go to step 1.3.6.
Step 1.3.8, set i = i + 1 and j = j + 1, and go to step 1.3.5.
Step 1.3.9, set k = 1 and l = 1, and go to step 1.3.6. The similarity between two syllables is determined by calculating the difference between the amplitudes of their corresponding frequency bins; a is set so that l satisfies 2 < l < 8, since the number of syllable types in bird audio is typically within 6. After segmentation, the segmented syllables are aligned for feature extraction.
Step 2, feature extraction
After syllable segmentation, the feature vector of the bird audio is calculated from the aligned syllables; the wavelet cepstral transform used to obtain the feature vector is as follows:
step 2.1, calculating the cepstrum coefficient of each frame, wherein the step of calculating the cepstrum coefficient of each frame is as follows:
step 2.1.1, calculating the fast Fourier transform of each frame signal,
Figure BDA0002861546090000191
step 2.1.2, calculate the energy of each triangular filter band,
Figure BDA0002861546090000192
in the formula, phij[k]Denotes the amplitude of the jth triangular filter at frequency k, EjRepresenting the energy of the jth filter band, J being the number of triangular filters.
Step 2.1.3, calculate the cepstrum coefficients by using the cosine transform,
c_i(m) = Σ_(j=1)^(J) cos(m·(j − 0.5)·π/J)·log(E_j), m = 0, 1, ..., L − 1,
wherein c_i(m) represents the m-th order cepstrum coefficient of the i-th frame.
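A minimal sketch of steps 2.1.1 to 2.1.3 for a single frame. The construction of the triangular filter bank is not detailed in the description, so here it is passed in as a (J × bins) weight matrix; the names and the small epsilon are illustrative assumptions.

```python
import numpy as np

def frame_cepstrum(frame, filterbank, L=5):
    """Compute L cepstrum coefficients of one frame: FFT, triangular
    filter-band energies, then a cosine transform of the log energies.
    `filterbank` must have shape (J, len(frame)//2 + 1)."""
    spectrum = np.abs(np.fft.rfft(frame))          # step 2.1.1: |FFT| of the frame
    energies = filterbank @ (spectrum ** 2)        # step 2.1.2: band energies E_j
    log_e = np.log(energies + 1e-12)
    J = len(log_e)
    j = np.arange(1, J + 1)
    coeffs = np.array([np.sum(log_e * np.cos(m * (j - 0.5) * np.pi / J))
                       for m in range(L)])         # step 2.1.3: cosine transform
    return coeffs
```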
Step 2.2, after obtaining the cepstrum coefficients of each frame of the aligned bird audio signal, the feature vector of the bird audio is obtained by calculating the wavelet cepstrum transform, as follows (see the sketch after step 2.2.5):
step 2.2.1, cepstrum coefficients of all frames of the alignment signal are collected,
{c1(0),c1(1),...,c1(L-1),...,ci(0),...,ci(L-1),...},
wherein L is the total order of the cepstral coefficients;
step 2.2.2, align the cepstrum coefficients of the same order,
s_m[n] = [c_1(m), c_2(m), ..., c_i(m), ...], m = 0, ..., L − 1,
step 2.2.3, calculate the three-level wavelet transform of s_m[n],
δ[n] = Σ_k h_0[k]·s_m[2n − k], d[n] = Σ_k h_1[k]·s_m[2n − k],
wherein δ[n] and d[n] denote the low-frequency and high-frequency components of s_m[n], and h_0[k] and h_1[k] are the low-pass and high-pass filters applied in the transform, as:
h0[k]=[0.3327,0.8069,0.4599,0.1350,-0.0854,0.0352];
h1[k]=[0.0352,0.0854,-0.1350,-0.4599,0.8069,-0.3327];
wherein the wavelet cepstrum transform of s_m[n] is expressed by the six sequences obtained from the three decomposition levels,
δ_1[n], d_1[n], δ_2[n], d_2[n], δ_3[n], d_3[n].
Step 2.2.4, calculate the average of each of the six sequences, expressed as
mean(δ_1(m)), mean(d_1(m)), mean(δ_2(m)), mean(d_2(m)), mean(δ_3(m)), mean(d_3(m)), for each cepstrum order m.
Step 2.2.5, form the feature vector from the six average values of each of the first five orders of cepstrum coefficient sequences,
v = [mean(δ_1(m)), mean(d_1(m)), mean(δ_2(m)), mean(d_2(m)), mean(δ_3(m)), mean(d_3(m))], m = 0, ..., 4,
giving a feature vector of 5 × 6 = 30 values.
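A minimal sketch of steps 2.2.1 to 2.2.5, assuming the per-frame cepstra are stacked into a (frames × 5) matrix. The convolution/downsampling convention of the wavelet step and the function names are illustrative assumptions; the filter taps are the h_0 and h_1 values listed above.

```python
import numpy as np

H0 = np.array([0.3327, 0.8069, 0.4599, 0.1350, -0.0854, 0.0352])    # low-pass h_0[k]
H1 = np.array([0.0352, 0.0854, -0.1350, -0.4599, 0.8069, -0.3327])  # high-pass h_1[k]

def wavelet_step(s):
    """One decomposition level: filter and downsample by 2."""
    low = np.convolve(s, H0, mode="full")[::2]    # low-frequency component
    high = np.convolve(s, H1, mode="full")[::2]   # high-frequency component
    return low, high

def wavelet_cepstrum_features(cepstra, L=5, levels=3):
    """Steps 2.2.1-2.2.5: for each cepstral order m, decompose the sequence
    s_m[n] over frames for three levels and average the six resulting
    sequences, yielding a 5 x 6 = 30 dimensional feature vector."""
    cepstra = np.asarray(cepstra)                 # shape (num_frames, L)
    features = []
    for m in range(L):
        s = cepstra[:, m]                         # s_m[n] = [c_1(m), c_2(m), ...]
        for _ in range(levels):
            s, d = wavelet_step(s)
            features.append(np.mean(s))           # average of the low component
            features.append(np.mean(d))           # average of the high component
    return np.array(features)
```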
Step 3, identification by using a BP neural network
Most current bird call recognition techniques use the features of single syllables to form a feature vector for each bird. The present invention does not use single syllables; instead, it extracts the feature vector from a bird audio segment containing one syllable cycle. Experimental results show that, compared with the traditional method, detecting the range of each syllable and then segmenting the bird audio clip containing one syllable cycle significantly improves the recognition rate of the Chinese phoenix gull.
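The description names a BP (back-propagation) neural network as the classifier but does not give its architecture or training settings, so the following is only a hedged sketch using scikit-learn's MLPClassifier on the 30-dimensional wavelet cepstrum feature vectors; the layer size, learning rate, iteration count and the placeholder data are all assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder training data: 200 feature vectors of 30 values and 5 species labels.
X_train = np.random.rand(200, 30)
y_train = np.random.randint(0, 5, size=200)

bp_net = MLPClassifier(hidden_layer_sizes=(32,),   # one hidden layer (assumed size)
                       solver="sgd",               # gradient descent with back-propagation
                       learning_rate_init=0.01,
                       max_iter=2000)
bp_net.fit(X_train, y_train)

# Predict the species of new bird audio segments from their feature vectors.
X_test = np.random.rand(10, 30)
predicted_species = bp_net.predict(X_test)
```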
The above-described embodiment merely represents one embodiment of the present invention and is not to be construed as limiting its scope. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention.

Claims (9)

1. A Chinese phoenix gull recognition system with model matching, comprising a service layer, a data layer and a display layer, wherein the service layer comprises a user system, a service system and an algorithm system; the user system is mainly used for managing platform users' operation behaviors and information; the business system is used for managing the business among the modules of the whole platform, including audio address management, bird voiceprint acquisition and bird audio identification information; the algorithm system identifies and detects the type of birds in the audio through artificial-intelligence voiceprints and provides reasonable detection feedback information;
the data layer is used for storing data and is divided into a data center, a system database and a voiceprint database; the data center stores various service data, including bird identification types, quantity, date and position; the system database stores service relation data among the system modules, including voiceprint and audio storage addresses; the voiceprint database stores all bird audio data;
the display layer outputs the results returned by the interaction among the functional modules through a WEB end, and an open API interface is provided so that developers can make calls through the relevant open interface address according to the provided calling rules; the business system acquires various bird audios on the island through audio extraction equipment and provides a frequency spectrum-time space interface for recording, positioning, analysis, annotation and classification; the algorithm system, as a background system, performs voiceprint recognition through artificial intelligence, and the artificial-intelligence voiceprint recognition comprises noise removal, multi-audio separation, automatic classification and single recognition;
the single identification method is a model matching identification method and comprises the following steps:
step 1, collecting data;
step 2, matching the model;
and 3, identifying the audio.
2. The model matching method of the Chinese phoenix gull recognition system with model matching according to claim 1, wherein step 1 is specifically: collect data, obtain separate audio recordings of birds at different places and in different periods, and submit them to the voiceprint database; data processing is carried out according to the standard of a 44.1 kHz sampling rate, 1024-sample frames and a finite time window with 50% overlap, obtaining a standard spectrum.
3. The model matching method according to claim 2, wherein step 2 is specifically: match the model, identify multiple sound sources in the sound field by using the established multiple recognition paradigm, and identify different numbers of bird audio sources by introducing a multiple recognition model; the current states of the multiple sound source observations randomly determine the following states and the time intervals between them, as follows:
P{X_(n+1) = Y_i, T_(n+1) ≤ t | (X_0, T_0), (X_1, T_1), ..., (X_n, T_n)} = P{X_(n+1) = Y_i, T_(n+1) ≤ t | X_n}, n ∈ C,
wherein P represents a conditional probability, t represents a specific time, Y_i represents the i-th standard spectrum, (X_n, T_n) represents the observation sequence, X_n denotes the n-th state, T_n denotes the time of the n-th state, T_(n+1) here denotes the time difference T_(n+1) − T_n, and C represents the positive integers;
if the observed values represent a single sequence, then T_(n+1) is known and fixed; but if the observed values may represent multiple sequences plus clutter noise, the causal structure is unknown and T_(n+1) is hidden; in this case the structure is estimated by choosing a division of the data into K clusters plus H noise events that maximizes the following probability:
L = Π_(k=1)^(K) p_MRP(k) · Π_(η=1)^(H) p_NOISE(η),
wherein L represents the estimated structure, p_MRP(k) denotes the probability of observing the subsequence in group k, and p_NOISE(η) represents the probability of the η-th noise datum.
4. The model matching method according to claim 2, wherein step 3 is specifically: audio identification, namely detecting single syllables of the bird audio by using a cross-correlation template matching paradigm, detecting the syllables from the standard spectrum, and solving a maximum likelihood solution, thereby realizing identification of the bird audio.
5. The model matching method according to claim 2, characterized in that: the business system collects the voiceprints of the birds and uses audio extraction equipment to extract sound sources and their directions; the audio extraction equipment comprises a microphone array and data processing equipment; a frequency spectrum-time space interface on the data processing equipment is used to edit the collected audio files, observe the distribution of sound sources in a two-dimensional feature space, learn the sound types in the recording, acquire the components of the soundscape, and classify the soundscape by grouping similar sounds in that space; the sound sources are recorded, positioned, analyzed, annotated and classified on the visual frequency spectrum-time space interface, and operation settings are changed and/or the corresponding functions are executed on the right side of the window by selecting the file or folder to be operated on the left side of the window.
6. The model matching method according to claim 5, characterized in that: in the recording part, recording is carried out in a 16 kHz, 16-bit format by using the microphone array, and the recording can be played back or divided; a plurality of microphone arrays connected to the data processing equipment are supported so as to record simultaneously, two-dimensional positioning with synchronous recording is supported, and a recording file is divided into a plurality of recording files by setting the number of partitions of the file or the recording time of each file, so that parameter settings suitable for localization are found before analyzing a long-term recording.
7. The model matching method according to claim 5, characterized in that: in the localization part, sound source localization is performed by using a plurality of spectrograms with short-time Fourier transform based on a multiple signal classification method, and the separated sounds are extracted as a waveform file for each localized sound; basic parameter values related to bird audio localization and separation are set in a list on the right side of the window, and additional parameters are added to the list by defining parameter names and corresponding marks in a network file.
8. The model matching method according to claim 6, characterized in that: in the analysis part, the time distribution and directivity of the sounds are analyzed; the spectrogram and the localization result are output through an export file button as a PDF with the specified total number of pages, and the results are summarized at the corresponding time scale, output in the form of a data exchange format file, loaded into the annotation tool and stored in the voiceprint database.
9. The model matching method according to claim 6, characterized in that: in the annotation part, the spectrogram of the recording is displayed on the panel at the top of the annotation window; the time scale and the focus time period are displayed on the x axis, and both the focus time period and the displayed time scale can be adjusted; the corresponding audio and sound sources are displayed on the y axis; each frame in the annotation window represents the start and end time on the x axis, and the y axis represents the direction of the corresponding source at its start time; the color of each frame represents the class; sound localization information is displayed by clicking each frame of localized sound, and the information of the separated sound, or of the corresponding duration in the original recording, can be edited manually; the position of each source is modified by dragging the corresponding frame, and the modified data are stored in the voiceprint database in the form of a data exchange format file.
CN202011567949.1A 2020-12-25 2020-12-25 Chinese phoenix head and gull recognition system with model matching and model matching method thereof Active CN112735444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011567949.1A CN112735444B (en) 2020-12-25 2020-12-25 Chinese phoenix head and gull recognition system with model matching and model matching method thereof

Publications (2)

Publication Number Publication Date
CN112735444A true CN112735444A (en) 2021-04-30
CN112735444B CN112735444B (en) 2024-01-09

Family

ID=75616699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011567949.1A Active CN112735444B (en) 2020-12-25 2020-12-25 Chinese phoenix head and gull recognition system with model matching and model matching method thereof

Country Status (1)

Country Link
CN (1) CN112735444B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6535131B1 (en) * 1998-08-26 2003-03-18 Avshalom Bar-Shalom Device and method for automatic identification of sound patterns made by animals
US20050049877A1 (en) * 2003-08-28 2005-03-03 Wildlife Acoustics, Inc. Method and apparatus for automatically identifying animal species from their vocalizations
US20070033031A1 (en) * 1999-08-30 2007-02-08 Pierre Zakarauskas Acoustic signal classification system
US20110082574A1 (en) * 2009-10-07 2011-04-07 Sony Corporation Animal-machine audio interaction system
CN104700829A (en) * 2015-03-30 2015-06-10 中南民族大学 System and method for recognizing voice emotion of animal
CN104882144A (en) * 2015-05-06 2015-09-02 福州大学 Animal voice identification method based on double sound spectrogram characteristics
CN108898164A (en) * 2018-06-11 2018-11-27 南京理工大学 A kind of chirping of birds automatic identifying method based on Fusion Features
CN110246504A (en) * 2019-05-20 2019-09-17 平安科技(深圳)有限公司 Birds sound identification method, device, computer equipment and storage medium
CN111028845A (en) * 2019-12-06 2020-04-17 广州国音智能科技有限公司 Multi-audio recognition method, device, equipment and readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116128330A (en) * 2022-11-18 2023-05-16 中国人民解放军陆军装甲兵学院 Air-ground unmanned system combat effectiveness evaluation method based on machine learning
CN116128330B (en) * 2022-11-18 2024-04-26 中国人民解放军陆军装甲兵学院 Air-ground unmanned system combat effectiveness evaluation method based on machine learning

Also Published As

Publication number Publication date
CN112735444B (en) 2024-01-09

Similar Documents

Publication Publication Date Title
CN112289326B (en) Noise removal method using bird identification integrated management system with noise removal function
CN112750442B (en) Crested mill population ecological system monitoring system with wavelet transformation and method thereof
CN105023573B (en) It is detected using speech syllable/vowel/phone boundary of auditory attention clue
Tzanetakis et al. Marsyas: A framework for audio analysis
Barchiesi et al. Acoustic scene classification: Classifying environments from the sounds they produce
Mesgarani et al. Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations
US8676574B2 (en) Method for tone/intonation recognition using auditory attention cues
Dennis Sound event recognition in unstructured environments using spectrogram image processing
US9558762B1 (en) System and method for distinguishing source from unconstrained acoustic signals emitted thereby in context agnostic manner
Xie et al. A review of automatic recognition technology for bird vocalizations in the deep learning era
Wang et al. Playing technique recognition by joint time–frequency scattering
Fagerlund et al. New parametric representations of bird sounds for automatic classification
Ranjard et al. Integration over song classification replicates: Song variant analysis in the hihi
Praksah et al. Analysis of emotion recognition system through speech signal using KNN, GMM & SVM classifier
CN112735444B (en) Chinese phoenix head and gull recognition system with model matching and model matching method thereof
Xiao et al. AMResNet: An automatic recognition model of bird sounds in real environment
CN112687280B (en) Biodiversity monitoring system with frequency spectrum-time space interface
CN112735442B (en) Wetland ecology monitoring system with audio separation voiceprint recognition function and audio separation method thereof
Garg et al. RETRACTED: Urban Sound Classification Using Convolutional Neural Network Model
Ruiz-Muñoz et al. Enhancing the dissimilarity-based classification of birdsong recordings
Adiban et al. Statistical feature embedding for heart sound classification
CN112735443B (en) Ocean space resource management system with automatic classification function and automatic classification method thereof
Mohammed Overlapped speech and music segmentation using singular spectrum analysis and random forests
Marck et al. Identification, analysis and characterization of base units of bird vocal communication: The white spectacled bulbul (Pycnonotus xanthopygos) as a case study
CN112735443A (en) Ocean space resource management system with automatic classification function and automatic classification method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant