CN114863939B - Panda attribute identification method and system based on sound - Google Patents

Panda attribute identification method and system based on sound Download PDF

Info

Publication number
CN114863939B
Authority
CN
China
Prior art keywords
data
sound data
panda
frequency
mel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210791585.8A
Other languages
Chinese (zh)
Other versions
CN114863939A (en
Inventor
赵启军
张艳秋
陈鹏
侯蓉
刘鹏
唐金龙
何梦楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHENGDU RESEARCH BASE OF GIANT PANDA BREEDING
Sichuan University
Original Assignee
CHENGDU RESEARCH BASE OF GIANT PANDA BREEDING
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHENGDU RESEARCH BASE OF GIANT PANDA BREEDING, Sichuan University filed Critical CHENGDU RESEARCH BASE OF GIANT PANDA BREEDING
Priority to CN202210791585.8A priority Critical patent/CN114863939B/en
Publication of CN114863939A publication Critical patent/CN114863939A/en
Application granted granted Critical
Publication of CN114863939B publication Critical patent/CN114863939B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification
    • G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 — Training, enrolment or model building
    • G10L17/18 — Artificial neural networks; Connectionist approaches
    • G10L17/26 — Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 — Noise filtering
    • G10L21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 — Processing in the frequency domain
    • G10L21/0264 — Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 — characterised by the type of extracted parameters
    • G10L25/21 — the extracted parameters being power information
    • G10L25/24 — the extracted parameters being the cepstrum
    • G10L25/48 — specially adapted for particular use
    • G10L25/51 — specially adapted for comparison or discrimination
    • G10L25/63 — specially adapted for estimating an emotional state

Abstract

The invention discloses a sound-based panda attribute identification method and system, comprising the following steps: collecting daily panda vocalizations; preprocessing the collected data; extracting the Mel cepstrum coefficients, the flipped Mel cepstrum coefficients and the perceptual linear prediction coefficients as acoustic features and combining them into a fusion feature; performing cascade-based data enhancement on the fusion feature while constraining the distribution consistency before and after enhancement; and training a convolutional neural network based on an attention mechanism on the enhanced fusion features to predict the final result. The fusion feature proposed by the invention extracts the high-frequency and low-frequency information in panda sound more effectively, reduces noise and improves the identification result. Meanwhile, the proposed cascade-based data enhancement effectively constrains the consistency of the features before and after enhancement. The invention identifies age group and gender from panda vocalizations and obtains good results.

Description

Panda attribute identification method and system based on sound
Technical Field
The invention relates to a panda voice data processing technology, in particular to a panda attribute identification method and system based on voice.
Background
Pandas (Ailuropoda melanoleuca) are a rare and vulnerable species unique to China, and researchers have invested considerable effort in surveying panda population size in order to develop appropriate protection programs. Among the many attributes of pandas, age structure and sex ratio are of primary importance: they are two key factors affecting the dynamics of panda populations. Studying the age structure of a panda population helps practitioners predict its population dynamics, and changes in the sex ratio often alter intraspecific relationships and mating behavior, which in turn affects the population growth rate. Because of the low population density and wide field distribution of pandas, it is difficult for researchers to determine the age structure and sex ratio of panda populations.
Although conventional methods such as DNA analysis and the bamboo-stem bite-fragment method can be used to determine the age group and gender of pandas, the former places strict freshness requirements on the collected panda secretions: it is difficult to extract useful DNA information from feces more than three days old. The latter method is only suitable for adult pandas. Both methods require researchers to go deep into the field and search continuously for panda feces, which is time-consuming, labor-intensive and dangerous.
Although researchers have begun to apply speech-based recognition algorithms to animal vocalizations, there has so far been no method for sound-based recognition of panda age group and gender. At the same time, it remains an open question how to extract acoustic features that capture the high-frequency and low-frequency information contained in panda sounds while avoiding interference from other noise; moreover, because the vocalization period of pandas is short, performing reliable recognition with limited data is also a great challenge.
Disclosure of Invention
In order to overcome at least the above disadvantages in the prior art, the present application aims to provide a panda attribute identification method and system based on voice.
In a first aspect, an embodiment of the present application provides a panda attribute identification method based on sound, including:
acquiring sound data of a panda as first sound data;
preprocessing the first sound data, and labeling age information and gender information on the preprocessed first sound data to generate second sound data;
acquiring a Mel cepstrum coefficient and a perceptual linear prediction coefficient from the second sound data, and combining the perceptual linear prediction coefficient, the Mel cepstrum coefficient and the inverted Mel cepstrum coefficient to form a fusion feature;
performing data enhancement on the fusion features by taking the consistent distribution as a constraint condition to form first sample data;
carrying out convolutional neural network training on the first sample data to form a panda attribute identification model;
and inputting target sound data into the panda attribute recognition model, and generating age information and gender information corresponding to the target sound data.
In the prior art, the vocalization period of pandas is short, so few samples are available for training a recognition model, which greatly complicates panda sound recognition. The inventors found that when the sample amount is insufficient, the recognition accuracy of the trained model is also insufficient; acquiring sufficient feature samples is therefore a very important step in panda sound recognition research.
In the embodiment of the application, the sound data of pandas can be acquired through sound acquisition equipment as the first sound data. Because the first sound data contains a large amount of noise, preprocessing is needed, and the data are labeled for subsequent model training. In order to increase the utilization of the second sound data and of the sample information, the high-frequency and low-frequency information of panda sound can be fully extracted through the fusion feature formed by combining the flipped Mel cepstrum coefficients and the perceptual linear prediction coefficients. In the embodiment of the application, the Mel cepstrum coefficients are flipped so that the flipped Mel cepstrum coefficients focus on high-frequency information; the fused feature therefore contains both low-frequency and high-frequency information, and the perceptual linear prediction coefficients markedly reduce noise, so the information content of a sample is increased while noise is reduced. In order to further improve sample diversity, data enhancement with distribution consistency as a constraint condition is adopted; this constraint ensures that the sample data after enhancement effectively improves the recognition accuracy of the trained model. The first sample data is the data after enhancement, and the improved feature diversity yields a more accurate trained recognition model. The convolutional neural network training can be performed with existing techniques and is not repeated here.
In one possible implementation manner, performing data enhancement on the fusion feature with the distribution consistency as a constraint condition to form sample data includes:
sequentially enhancing frequency mask data and enhancing time mask data on the fusion features to form second sample data;
acquiring weighting parameters corresponding to the fusion characteristics and the second sample data according to beta distribution;
and performing weighted fusion on the fusion characteristics and the second sample data according to the weighted parameters to form the first sample data.
In a possible implementation manner, sequentially performing frequency mask data enhancement and time mask data enhancement on the fusion feature to form second sample data includes:
adding a preset mask to a frequency axis of the Mel spectrum within the Mel spectrum range to complete a frequency mask;
adding the preset mask to a time axis of the Mel spectrum within the Mel spectrum range to complete a time mask; the number and width of the preset masks are determined according to the Mel frequency spectrum.
In one possible implementation, preprocessing the first sound data includes:
removing collision noise from the first sound data to generate pure sound data;
extracting background sound data from the adult panda sound data, and integrating the background sound data into the juvenile panda sound data;
and taking the adult panda voice data and the juvenile panda voice data integrated with the background voice data as first voice data after preprocessing.
In one possible implementation, the obtaining mel-frequency cepstral coefficients and perceptual linear prediction coefficients from the second sound data comprises:
calculating a power spectrum estimate of the second sound data, and integrating the overlapping critical-band filter responses into the power spectrum estimate to form power spectrum integration data;
convolving the power spectrum integration data over a symmetric frequency domain so that low frequencies cover high frequencies and the spectrum is smoothed, then pre-emphasizing the spectrum and compressing the spectrum amplitude to form preprocessed power spectrum data;
and performing an inverse discrete Fourier transform on the preprocessed power spectrum data to obtain autocorrelation coefficients, performing spectral smoothing, solving the autoregressive equations, and converting the autoregressive coefficients into cepstral variables to obtain the perceptual linear prediction coefficients.
In a second aspect, an embodiment of the present application provides a system for identifying panda attributes based on voice, including:
an acquisition unit configured to acquire sound data of a panda as first sound data;
the preprocessing unit is configured to preprocess the first sound data, label age information and gender information on the preprocessed first sound data and generate second sound data;
the fusion unit is used for acquiring a Mel cepstrum coefficient and a perception linear prediction coefficient from the second sound data and combining the perception linear prediction coefficient, the Mel cepstrum coefficient and the inverted Mel cepstrum coefficient to form a fusion characteristic;
the enhancement unit is configured to perform data enhancement on the fusion feature by taking the distribution consistency as a constraint condition to form first sample data;
the training unit is configured to conduct convolutional neural network training on the first sample data to form a panda attribute identification model;
and the identification unit is configured to input target sound data into the identification model according to the attributes of the pandas and generate age information and gender information corresponding to the target sound data.
In one possible implementation, the enhancing unit is further configured to:
sequentially enhancing frequency mask data and enhancing time mask data on the fusion features to form second sample data;
acquiring a weighting parameter corresponding to the fusion feature and the second sample data according to beta distribution;
and performing weighted fusion on the fusion characteristics and the second sample data according to the weighting parameters to form the first sample data.
In one possible implementation, the enhancing unit is further configured to:
adding a preset mask to a frequency axis of the Mel spectrum within the Mel spectrum range to complete a frequency mask;
adding the preset mask to a time axis of the Mel spectrum within the Mel spectrum range to complete a time mask; the number and width of the preset masks are determined according to the Mel frequency spectrum.
In one possible implementation, the preprocessing unit is further configured to:
removing collision noise from the first sound data to generate pure sound data;
extracting background sound data from the adult panda sound data, and integrating the background sound data into the juvenile panda sound data;
and taking the adult panda voice data and the juvenile panda voice data integrated with the background voice data as first voice data after preprocessing.
In one possible implementation, the preprocessing unit is further configured to:
calculating power spectrum estimation data of the second sound data, and integrating the overlapped critical band filter responses into the power spectrum estimation data to form power spectrum integration data;
convolving the power spectrum integration data on a symmetrical frequency domain on the frequency to allow the low frequency to cover the high frequency and smooth the frequency spectrum, then pre-emphasizing the frequency spectrum, and compressing the frequency spectrum amplitude to form pre-processing power spectrum data;
and performing inverse discrete Fourier transform on the preprocessed power spectrum data to obtain an autocorrelation coefficient, performing spectrum smoothing, solving an autoregressive equation, and converting the autoregressive coefficient into a cepstrum variable to obtain the perceptual linear prediction coefficient.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the panda attribute identification method and system based on sound make up the blank of the current panda audio identification field, solve the problem that how to extract panda acoustic features can only make full use of panda high and low frequency information and reduce noise, simultaneously solve the problem that panda sound data is less and not beneficial to deep learning training, and finally obtain higher attribute identification accuracy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
fig. 1 is a schematic flow chart of a voice-based panda attribute identification method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a voice-based panda attribute recognition system according to an embodiment of the present application;
fig. 3 is a schematic diagram of performing data enhancement on the fusion feature by using the distribution consistency as a constraint condition to form first sample data according to the embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application. Additionally, it should be understood that the schematic drawings are not necessarily drawn to scale. The flowcharts used in this application illustrate operations implemented according to some of the embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. One skilled in the art, under the guidance of this application, may add one or more other operations to, or remove one or more operations from, the flowchart.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, as generally described and illustrated in the figures herein, could be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
To facilitate explanation of the panda attribute identification method based on sound in the embodiment of the present application, please refer to fig. 1, which is a schematic flow chart of the panda attribute identification method based on sound according to the embodiment of the present invention, the panda attribute identification method based on sound may be applied to the panda attribute identification system based on sound in fig. 2, and further, the panda attribute identification method based on sound may specifically include the contents described in the following steps S1 to S6.
S1: acquiring sound data of a panda as first sound data;
s2: preprocessing the first sound data, and labeling age information and gender information on the preprocessed first sound data to generate second sound data;
s3: acquiring a Mel cepstrum coefficient and a perceptual linear prediction coefficient from the second sound data, and combining the perceptual linear prediction coefficient, the Mel cepstrum coefficient and the inverted Mel cepstrum coefficient to form a fusion feature;
s4: performing data enhancement on the fusion features by taking the consistent distribution as a constraint condition to form first sample data;
s5: performing convolutional neural network training on the first sample data to form a panda attribute identification model;
s6: and inputting the target sound data into the panda attribute recognition model, and generating age information and gender information corresponding to the target sound data.
In the prior art, the vocalization period of pandas is short, so few samples are available for training a recognition model, which greatly complicates panda sound recognition. The inventors found that when the sample amount is insufficient, the recognition accuracy of the trained model is also insufficient; acquiring sufficient feature samples is therefore a very important step in panda sound recognition research.
In the embodiment of the application, the sound data of pandas can be acquired through a sound acquisition device as the first sound data. Because the first sound data contains a large amount of noise, preprocessing is needed, and the data are labeled for subsequent model training. In order to increase the utilization of the second sound data and of the sample information, the high-frequency and low-frequency information of panda sound can be fully extracted through the fusion feature formed by combining the flipped Mel cepstrum coefficients and the perceptual linear prediction coefficients. In the embodiment of the application, the Mel cepstrum coefficients are flipped so that the flipped Mel cepstrum coefficients focus on high-frequency information; the fused feature therefore contains both low-frequency and high-frequency information, and the perceptual linear prediction coefficients markedly reduce noise, so the information content of a sample is increased while noise is reduced. In order to further improve sample diversity, data enhancement with distribution consistency as a constraint condition is adopted; this constraint ensures that the sample data after enhancement effectively improves the recognition accuracy of the trained model. The first sample data is the data after enhancement, and the improved feature diversity yields a more accurate trained recognition model. The convolutional neural network training can be performed with existing techniques and is not repeated here. In the embodiment of the application, the Mel cepstrum coefficients and the flipped Mel cepstrum coefficients are mirror-symmetric in the frequency domain.
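To make the fusion of the three acoustic features concrete, the following is a minimal Python sketch, not the patented implementation. It uses librosa for the ordinary Mel cepstrum coefficients and assumes hypothetical helpers extract_imfcc and extract_plp (names introduced for illustration only; they are sketched in the later examples) that return per-frame coefficient matrices.

```python
import numpy as np
import librosa

def build_fusion_feature(wav_path, sr=16000, n_coeff=13):
    """Sketch: concatenate MFCC, flipped-MFCC (IMFCC) and PLP along the
    coefficient axis to form the fusion feature described above.
    extract_imfcc / extract_plp are assumed helpers sketched later."""
    y, sr = librosa.load(wav_path, sr=sr)

    # Ordinary MFCC: Mel filters dense at low frequencies.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coeff).T   # (frames, n_coeff)

    # Flipped MFCC: filters dense at high frequencies (assumed helper).
    imfcc = extract_imfcc(y, sr, n_coeff=n_coeff)               # (frames, n_coeff)

    # Perceptual linear prediction coefficients (assumed helper).
    plp = extract_plp(y, sr, order=n_coeff)                     # (frames, order)

    # Align frame counts, since framing conventions may differ slightly.
    n = min(len(mfcc), len(imfcc), len(plp))
    return np.concatenate([mfcc[:n], imfcc[:n], plp[:n]], axis=1)
```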
For example, obtaining the flipped Mel cepstrum coefficients includes the following steps (a minimal sketch is given after this list):
dividing the second sound data into a plurality of frames, with an overlapping area between adjacent frames so that two adjacent frames do not change too abruptly;
multiplying each frame by a Hamming window to increase the continuity at the left and right ends of the frame;
converting the continuous frame signals from the time domain to the frequency domain by a fast Fourier transform, and passing the transformed spectrum through a flipped Mel filter bank to compute the log energy output by each filter;
obtaining the flipped Mel cepstrum coefficients by a Discrete Cosine Transform (DCT).
For example, the Mel filters are denser in the low-frequency part, which means that low-frequency information can be extracted better. The conversion relationship between the actual frequency $f$ and the Mel frequency $f_{\mathrm{mel}}$ can be expressed by the following formula:

$$f_{\mathrm{mel}} = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right)$$

The Mel filter bank is defined mathematically as the set of triangular filters

$$H_m(k)=\begin{cases}0, & k<f(m-1)\\ \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le k\le f(m)\\ \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)\le k\le f(m+1)\\ 0, & k>f(m+1)\end{cases}$$

with center frequencies

$$f(m)=\left(\frac{N}{f_s}\right)B^{-1}\!\left(B(f_l)+m\,\frac{B(f_h)-B(f_l)}{M+1}\right)$$

where $M$ is the number of Mel filters, $f(m)$ is the center frequency, $f_s$ is the sampling frequency, $B^{-1}$ is the inverse of the Mel-scale mapping $B$, $f_l$ is the lowest frequency in the filter frequency range, $f_h$ is the highest frequency, and $N$ is the frame length. In the present embodiment, fixed values are set for $f_l$, $f_h$ and the number of filters.
Similarly, the relationship between the actual frequency $f$ and the flipped Mel frequency $\hat{f}_{\mathrm{mel}}$ is commonly described by the following formula:

$$\hat{f}_{\mathrm{mel}} = 2595\,\log_{10}\!\left(1+\frac{f_s/2}{700}\right) - 2595\,\log_{10}\!\left(1+\frac{f_s/2-f}{700}\right)$$

The extraction of the flipped Mel cepstrum coefficients is the same as the extraction of the Mel cepstrum coefficients, except that the flipped Mel filter bank is used.
By way of example, the model is trained using an attention-based convolutional neural network, wherein the structure of the attention-based convolutional neural network comprises: convolutional layer, attention mechanism module, average pooling layer, ReLU activation function, Dense layer, softmax classifier.
The attention mechanism module comprises a 3 × 3 convolution kernel, a residual layer with a skip connection, a 2 × 2 global average pooling layer, a Dense layer and a sigmoid activation function, and is used to emphasize learning of the most important information in the feature channels.
The residual layer increases the network depth and gives better fitting capability; the global average pooling layer reduces the network feature dimensions; the Dense layer learns the important information of each channel.
The sigmoid activation function is used for increasing the nonlinear change of the network and selecting network characteristics, and the expression is as follows:
$$y = \frac{1}{1 + e^{-x}}$$

where $x$ represents the input feature, $e$ is the base of the natural logarithm, and $y$ represents the output feature after the nonlinear transformation.
Wherein the softmax classifier predicts per-frame outcome probabilities for each panda audio.
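As an illustration of the network described above, here is a minimal Keras sketch of an attention-based convolutional network with a squeeze-and-excitation-style channel attention module (3 × 3 convolution, residual skip connection, global average pooling, Dense layer and sigmoid gating), followed by pooling, ReLU, Dense layers and a softmax classifier. Layer sizes, channel counts, input shape and the number of classes are assumptions of this sketch, not the exact configuration of the embodiment.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def channel_attention(x):
    """Attention module sketch: 3x3 conv + residual skip connection, then a
    global-average-pooling -> Dense -> sigmoid gate that re-weights channels."""
    channels = x.shape[-1]
    h = layers.Conv2D(channels, 3, padding='same', activation='relu')(x)
    h = layers.Add()([h, x])                                  # residual (skip) connection
    w = layers.GlobalAveragePooling2D()(h)                    # per-channel statistics
    w = layers.Dense(channels, activation='sigmoid')(w)       # channel importance weights
    w = layers.Reshape((1, 1, channels))(w)
    return layers.Multiply()([h, w])                          # emphasize important channels

def build_model(input_shape=(128, 64, 1), n_classes=2):
    """n_classes = 2 for gender, or the number of age groups for age prediction."""
    inp = layers.Input(shape=input_shape)                     # (freq, time, 1) feature map
    x = layers.Conv2D(32, 3, padding='same', activation='relu')(inp)
    x = channel_attention(x)
    x = layers.AveragePooling2D(2)(x)
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    x = channel_attention(x)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(64, activation='relu')(x)
    out = layers.Dense(n_classes, activation='softmax')(x)    # class probabilities
    return Model(inp, out)

model = build_model()
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```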
In one possible implementation manner, labeling the age information and the gender information on the preprocessed first sound data to generate second sound data includes:
pre-analyzing the first sound data according to the age information and the gender information to generate a plurality of groups of pre-analysis features, the pre-analysis features being sound data features whose differences between different age information and/or gender information meet expectations;
constructing corresponding kernel functions and clustering radii according to each group of pre-analysis features to form a plurality of groups of clustering parameters;
performing clustering analysis on the first sound data by using each group of clustering parameters to form a plurality of groups of clustering results;
comparing the clustering results with the classification conditions of the age information and the gender information, and taking at least one group of clustering results closest to the classification conditions as selected clustering results;
and labeling the first sound data according to the pre-analysis features corresponding to the selected clustering results.
In the embodiment of the application, the inventors found in practice that, because panda sound samples are few, the labeling of each sample directly influences the accuracy of the subsequently generated model. In order to label the features more accurately, the limited first sound data is pre-analyzed, and the sound data features that differ most between different age information and/or gender information are taken as the pre-analysis features. Corresponding clustering parameters are established for each group of pre-analysis features and cluster analysis is performed; because cluster analysis is an unsupervised classification algorithm, the first sound data can be effectively divided into several categories purely according to the pre-analysis features. Since the first sound data can also be divided by age and/or gender according to the sound source, the closest clustering result can be found simply by comparing the actual classification of the first sound data with each clustering result; the pre-analysis features corresponding to at least one group of clustering results closest to that classification are then used as the features for labeling, which improves the correspondence between the features and age or gender.
In particular, the pre-analysis feature may include various sound features or combinations of sound features, which are not enumerated here. With the labeling approach of the embodiment of the application, the accuracy of the trained model can be effectively improved even with small samples.
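The sketch below is one hedged way to realize this step. Because the patent does not name the kernel function or clustering algorithm, the sketch stands in with an RBF kernel and DBSCAN (whose eps parameter plays the role of the clustering radius), and scores each candidate feature group by how closely its unsupervised clusters match the known age/gender partition; the candidate feature groups, the scoring metric and all parameter values are assumptions of this sketch.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.metrics import adjusted_rand_score

def select_labeling_features(feature_groups, known_labels, radii, gamma=0.5):
    """feature_groups: dict name -> (n_samples, d) pre-analysis feature matrix.
    known_labels: age/gender class per sample (known from the recording source).
    radii: dict name -> clustering radius (DBSCAN eps) for that group.
    Returns the feature-group name whose clusters best match the known split."""
    scores = {}
    for name, feats in feature_groups.items():
        # Kernel transform, then cluster with the group's own radius.
        k = rbf_kernel(feats, gamma=gamma)              # stands in for the kernel function
        dist = 1.0 - k                                  # pseudo-distance in [0, 1]
        clusters = DBSCAN(eps=radii[name], min_samples=3,
                          metric='precomputed').fit_predict(dist)
        scores[name] = adjusted_rand_score(known_labels, clusters)
    best = max(scores, key=scores.get)
    return best, scores

# Usage sketch with random placeholder data:
rng = np.random.default_rng(0)
groups = {'pitch_stats': rng.normal(size=(40, 4)),
          'energy_stats': rng.normal(size=(40, 3))}
labels = np.repeat([0, 1], 20)                          # e.g. adult vs juvenile
best, scores = select_labeling_features(
    groups, labels, radii={'pitch_stats': 0.3, 'energy_stats': 0.3})
```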
In one possible implementation manner, performing data enhancement on the fusion feature with the distribution consistency as a constraint condition to form sample data includes:
sequentially enhancing frequency mask data and enhancing time mask data on the fusion features to form second sample data;
acquiring a weighting parameter corresponding to the fusion feature and the second sample data according to beta distribution;
and performing weighted fusion on the fusion characteristics and the second sample data according to the weighted parameters to form the first sample data.
In the embodiment of the application, in order to ensure the consistency of the data distribution before and after enhancement, the weighting parameters are generated from a beta distribution, and the fusion feature and the second sample data are weighted and combined, so that the distribution of the enhanced sample data remains consistent with the distribution before enhancement.
Illustratively, the beta distribution is a density function that is the conjugate prior of the Bernoulli and binomial distributions; it is a probability distribution that describes probabilities and has important applications in machine learning and mathematical statistics. A random variable $X$ obeying a beta distribution with parameters $\alpha$ and $\beta$ is generally written as:

$$X \sim B(\alpha, \beta)$$

where $B$ represents the beta distribution.

In the present invention, fixed values are set for $\alpha$ and $\beta$. Let the weighting parameter generated by the beta distribution be $\lambda$; then the finally output weighted feature $\tilde{x}$ can be expressed by the following formula:

$$\tilde{x} = \lambda\, x' + (1-\lambda)\, x$$

where $x'$ is the second sample data and $x$ is the fusion feature.
In a possible implementation manner, sequentially performing frequency mask data enhancement and time mask data enhancement on the fusion feature to form second sample data includes:
adding a preset mask to a frequency axis of the Mel spectrum within the Mel spectrum range to complete a frequency mask;
adding the preset mask to a time axis of the Mel spectrum within the Mel spectrum range to complete a time mask; the number and width of the preset masks are determined according to the Mel frequency spectrum.
When the embodiment of the application is implemented, performing frequency mask data enhancement and time mask data enhancement in sequence can greatly improve the diversity of the data. The preset mask can be determined according to the needs of those skilled in the art, for example by choosing the number of masks and the width of the masks, and by determining which part of the Mel spectrum needs to be masked.
Referring to fig. 3, it is illustrated that a frequency mask is applied by adding a mask having a value of 0 to a frequency axis of a mel-frequency spectrum, and a time mask is applied by adding a mask having a value of 0 to a time axis. The frequency masks are set to add five frequency masks over the entire mel-frequency spectrum, and the time masks are set to add two time masks over the entire mel-frequency spectrum. The enhanced cascade sequence is to apply a frequency mask first and then a time mask.
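The cascaded masking of fig. 3 can be sketched as follows (zero-valued masks applied first along the frequency axis, then along the time axis); the mask counts — five frequency masks then two time masks — follow the description above, while the maximum mask widths are assumptions of this sketch.

```python
import numpy as np

def apply_masks(mel, n_freq_masks=5, n_time_masks=2, max_f=8, max_t=10, rng=None):
    """mel: (n_mels, n_frames) spectrogram-like feature.
    Apply frequency masks first, then time masks, all with value 0."""
    rng = rng or np.random.default_rng()
    out = mel.copy()
    n_mels, n_frames = out.shape
    for _ in range(n_freq_masks):                       # masks along the frequency axis
        w = rng.integers(1, max_f + 1)
        f0 = rng.integers(0, max(n_mels - w, 1))
        out[f0:f0 + w, :] = 0.0
    for _ in range(n_time_masks):                       # masks along the time axis
        w = rng.integers(1, max_t + 1)
        t0 = rng.integers(0, max(n_frames - w, 1))
        out[:, t0:t0 + w] = 0.0
    return out
```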
In one possible implementation, preprocessing the first sound data includes:
removing collision noise from the first sound data to generate pure sound data;
extracting background sound data from the adult panda sound data, and integrating the background sound data into the juvenile panda sound data;
and taking the adult panda voice data and the juvenile panda voice data integrated with the background voice data as first voice data after preprocessing.
In the implementation of the embodiment of the application, the inventors found in scientific practice that, to ensure a suitable growth environment, juvenile pandas and adult pandas need to be raised separately in different environments, which causes a difference between the backgrounds of their sound data. For example, many pandas are active at the same time in the environment where juvenile pandas are raised, whereas adult pandas are usually raised individually, so the collected background sounds differ.
In order to ensure the accuracy of the labeling of the first sound data and the training of the subsequent second sound data, the embodiment of the application extracts the background sound data from the adult panda sound data, integrates the background sound data into the juvenile panda sound data, and ensures the consistency of the background sound of the adult panda sound data and the juvenile panda sound data.
Specifically, only the portion of adult panda sound data in which background sound appears is collected as background sound data, and then they are integrated into juvenile panda data by mixing the sounds together.
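A simple sketch of this background-sound integration is shown below; the file paths, the background segment boundaries and the mixing gain are assumptions of this sketch.

```python
import numpy as np
import librosa
import soundfile as sf

def mix_background(juvenile_path, adult_path, bg_start_s, bg_end_s,
                   out_path="juvenile_with_bg.wav", sr=16000, gain=0.5):
    """Cut a background-only segment out of an adult recording and mix it
    into a juvenile recording so both share a similar acoustic background."""
    juv, _ = librosa.load(juvenile_path, sr=sr)
    adult, _ = librosa.load(adult_path, sr=sr)

    bg = adult[int(bg_start_s * sr):int(bg_end_s * sr)]     # background-only portion
    reps = int(np.ceil(len(juv) / max(len(bg), 1)))
    bg = np.tile(bg, reps)[:len(juv)]                       # loop to cover the whole clip

    mixed = juv + gain * bg
    mixed /= max(np.max(np.abs(mixed)), 1e-9)               # normalize to avoid clipping
    sf.write(out_path, mixed, sr)
    return mixed
```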
In one possible implementation, the obtaining mel-frequency cepstral coefficients and perceptual linear prediction coefficients from the second sound data comprises:
calculating a power spectrum estimate of the second sound data, and integrating the overlapping critical-band filter responses into the power spectrum estimate to form power spectrum integration data;
convolving the power spectrum integration data over a symmetric frequency domain so that low frequencies cover high frequencies and the spectrum is smoothed, then pre-emphasizing the spectrum and compressing the spectrum amplitude to form preprocessed power spectrum data;
and performing an inverse discrete Fourier transform on the preprocessed power spectrum data to obtain autocorrelation coefficients, performing spectral smoothing, solving the autoregressive equations, and converting the autoregressive coefficients into cepstral variables to obtain the perceptual linear prediction coefficients (a minimal sketch of these final steps is given below).
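The numpy sketch below covers only the final steps: inverse DFT of the already perceptually processed power spectrum to obtain autocorrelation coefficients, the Levinson-Durbin recursion for the autoregressive (all-pole) model, and the standard conversion from AR coefficients to cepstral coefficients. The earlier critical-band integration, pre-emphasis and amplitude compression are assumed to have produced power_spec, and the model order is an assumption of this sketch.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for A(z) = 1 + a1 z^-1 + ... + ap z^-p."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        new_a = a.copy()
        new_a[1:i] = a[1:i] + k * a[i - 1:0:-1]   # update previous coefficients
        new_a[i] = k                              # new reflection coefficient
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum of the all-pole model 1/A(z) via the standard recursion."""
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n < len(a) else 0.0
        for k in range(1, n):
            if n - k < len(a):
                acc -= (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]

def plp_from_power_spectrum(power_spec, order=12):
    """power_spec: perceptually processed power spectrum of one frame
    (critical-band integration, pre-emphasis and compression already applied)."""
    r = np.fft.irfft(power_spec)          # autocorrelation via inverse DFT
    a, _ = levinson_durbin(r[:order + 1], order)
    return lpc_to_cepstrum(a, order)      # perceptual linear prediction cepstra
```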
In a second aspect, an embodiment of the present application provides a system for identifying panda attributes based on voice, including:
an acquisition unit configured to acquire sound data of a panda as first sound data;
the preprocessing unit is configured to preprocess the first sound data, label age information and gender information on the preprocessed first sound data and generate second sound data;
a fusion unit configured to obtain mel cepstrum coefficients and perceptual linear prediction coefficients from the second sound data and combine the perceptual linear prediction coefficients, the mel cepstrum coefficients and the flipped mel cepstrum coefficients to form a fusion feature;
the enhancement unit is configured to perform data enhancement on the fusion feature by taking the distribution consistency as a constraint condition to form first sample data;
the training unit is configured to conduct convolutional neural network training on the first sample data to form a panda attribute identification model;
and the identification unit is configured to input target sound data into the identification model according to the attributes of the pandas and generate age information and gender information corresponding to the target sound data.
In one possible implementation, the enhancing unit is further configured to:
sequentially enhancing frequency mask data and enhancing time mask data on the fusion features to form second sample data;
acquiring a weighting parameter corresponding to the fusion feature and the second sample data according to beta distribution;
and performing weighted fusion on the fusion characteristics and the second sample data according to the weighting parameters to form the first sample data.
In one possible implementation, the enhancing unit is further configured to:
adding a preset mask to a frequency axis of the Mel spectrum within the Mel spectrum range to complete a frequency mask;
adding the preset mask to a time axis of the Mel spectrum within the Mel spectrum range to complete a time mask; the number and width of the preset masks are determined according to the Mel frequency spectrum.
In one possible implementation, the preprocessing unit is further configured to:
removing collision noise from the first sound data to generate pure sound data;
extracting background sound data from the adult panda sound data, and integrating the background sound data into the juvenile panda sound data;
and taking the adult panda voice data and the juvenile panda voice data integrated with the background voice data as first voice data after the preprocessing is finished.
In one possible implementation, the preprocessing unit is further configured to:
calculating power spectrum estimation data of the second sound data, and integrating the overlapped critical band filter responses into the power spectrum estimation data to form power spectrum integration data;
convolving the power spectrum integration data on a symmetrical frequency domain on the frequency to allow the low frequency to cover the high frequency and smooth the frequency spectrum, then pre-emphasizing the frequency spectrum, and compressing the frequency spectrum amplitude to form pre-processing power spectrum data;
and performing inverse discrete Fourier transform on the preprocessed power spectrum data to obtain an autocorrelation coefficient, performing spectrum smoothing, solving an autoregressive equation, and converting the autoregressive coefficient into a cepstrum variable to obtain the perceptual linear prediction coefficient.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The elements described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a grid device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A panda attribute identification method based on voice is characterized by comprising the following steps:
acquiring sound data of a panda as first sound data;
preprocessing the first sound data, and labeling age information and gender information on the preprocessed first sound data to generate second sound data;
acquiring a Mel cepstrum coefficient and a perceptual linear prediction coefficient from the second sound data, and combining the perceptual linear prediction coefficient, the Mel cepstrum coefficient and the inverted Mel cepstrum coefficient to form a fusion feature;
performing data enhancement on the fusion characteristics by taking the consistent distribution as a constraint condition to form first sample data;
performing convolutional neural network training on the first sample data to form a panda attribute identification model;
inputting target sound data into a recognition model according to the attributes of the pandas, and generating age information and gender information corresponding to the target sound data;
labeling the age information and the gender information of the preprocessed first sound data to generate second sound data comprises the following steps:
pre-analyzing the first sound data according to the age information and the gender information to generate a plurality of groups of pre-analysis features; the pre-analysis characteristics are voice data characteristics with the difference between different age information and/or gender information meeting expectations;
constructing corresponding kernel functions and clustering radii according to each group of pre-analysis features to form a plurality of groups of clustering parameters;
performing clustering analysis on the first sound data by using each group of clustering parameters to form a plurality of groups of clustering results;
comparing the clustering results with the classification conditions of the age information and the gender information, and taking at least one group of clustering results closest to the classification conditions as selected clustering results;
and labeling the first sound data according to the pre-analysis features corresponding to the selected clustering results.
2. The method of claim 1, wherein the data enhancement of the fusion features with the consistent distribution as a constraint condition to form sample data comprises:
sequentially enhancing frequency mask data and enhancing time mask data on the fusion features to form second sample data;
acquiring a weighting parameter corresponding to the fusion feature and the second sample data according to beta distribution;
and performing weighted fusion on the fusion characteristics and the second sample data according to the weighted parameters to form the first sample data.
3. The method of claim 2, wherein the sequentially performing frequency mask data enhancement and time mask data enhancement on the fusion features to form second sample data comprises:
adding a preset mask to a frequency axis of the Mel spectrum within the Mel spectrum range to complete a frequency mask;
adding the preset mask to a time axis of the Mel spectrum within the Mel spectrum range to complete a time mask; the number and width of the preset masks are determined according to the Mel frequency spectrum.
4. The method of claim 1, wherein preprocessing the first sound data comprises:
removing collision noise from the first sound data to generate pure sound data;
dividing the pure sound data into adult panda sound data and juvenile panda sound data according to the age information;
extracting background sound data from the adult panda sound data, and integrating the background sound data into the juvenile panda sound data;
and taking the adult panda voice data and the juvenile panda voice data integrated with the background voice data as first voice data after preprocessing.
5. The method for panda attribute recognition based on voice according to claim 4, wherein the obtaining of mel cepstral coefficients and perceptual linear prediction coefficients from the second voice data comprises:
calculating power spectrum estimation data of the second sound data, and integrating the overlapped critical band filter response into the power spectrum estimation data to form power spectrum integration data;
convolving the power spectrum integration data on a symmetrical frequency domain on the frequency to allow the low frequency to cover the high frequency and smooth the frequency spectrum, then pre-emphasizing the frequency spectrum, and compressing the frequency spectrum amplitude to form pre-processing power spectrum data;
and performing inverse discrete Fourier transform on the preprocessed power spectrum data to obtain an autocorrelation coefficient, performing spectrum smoothing, solving an autoregressive equation, and converting the autoregressive coefficient into a cepstrum variable to obtain the perceptual linear prediction coefficient.
6. A voice-based panda attribute identification system using the method of any one of claims 1 to 5, comprising:
an acquisition unit configured to acquire sound data of pandas as first sound data;
the preprocessing unit is configured to preprocess the first sound data, label age information and gender information on the preprocessed first sound data and generate second sound data;
a fusion unit configured to obtain mel cepstral coefficients and perceptual linear prediction coefficients from the second sound data, and combine the perceptual linear prediction coefficients, the mel cepstral coefficients and the flipped mel cepstral coefficients to form a fusion feature;
the enhancement unit is configured to perform data enhancement on the fusion feature by taking the distribution consistency as a constraint condition to form first sample data;
the training unit is configured to conduct convolutional neural network training on the first sample data to form a panda attribute identification model;
and the identification unit is configured to input target sound data into the identification model according to the attributes of the pandas and generate age information and gender information corresponding to the target sound data.
7. The system of claim 6, wherein the enhancement unit is further configured to:
sequentially enhancing frequency mask data and enhancing time mask data on the fusion features to form second sample data;
acquiring a weighting parameter corresponding to the fusion feature and the second sample data according to beta distribution;
and performing weighted fusion on the fusion characteristics and the second sample data according to the weighted parameters to form the first sample data.
8. The system of claim 7, wherein the enhancement unit is further configured to:
adding a preset mask to a frequency axis of the Mel spectrum within the Mel spectrum range to complete a frequency mask;
adding the preset mask to a time axis of the Mel spectrum in a Mel frequency spectrum range to complete a time mask; the number and width of the preset masks are determined according to the Mel frequency spectrum.
9. The system according to claim 6, wherein the preprocessing unit is further configured to:
removing collision noise from the first sound data to generate pure sound data;
dividing the pure voice data into adult panda voice data and juvenile panda voice data according to the age information;
extracting background sound data from the adult panda sound data, and integrating the background sound data into the juvenile panda sound data;
and taking the adult panda voice data and the juvenile panda voice data integrated with the background voice data as first voice data after the preprocessing is finished.
10. The system of claim 9, wherein the preprocessing unit is further configured to:
calculating power spectrum estimation data of the second sound data, and integrating the overlapped critical band filter responses into the power spectrum estimation data to form power spectrum integration data;
convolving the power spectrum integration data on a symmetrical frequency domain on the frequency to allow the low frequency to cover the high frequency and smooth the frequency spectrum, then pre-emphasizing the frequency spectrum, and compressing the frequency spectrum amplitude to form pre-processing power spectrum data;
and performing inverse discrete Fourier transform on the preprocessed power spectrum data to obtain an autocorrelation coefficient, performing spectrum smoothing, solving an autoregressive equation, and converting the autoregressive coefficient into a cepstrum variable to obtain the perceptual linear prediction coefficient.
CN202210791585.8A 2022-07-07 2022-07-07 Panda attribute identification method and system based on sound Active CN114863939B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210791585.8A CN114863939B (en) 2022-07-07 2022-07-07 Panda attribute identification method and system based on sound

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210791585.8A CN114863939B (en) 2022-07-07 2022-07-07 Panda attribute identification method and system based on sound

Publications (2)

Publication Number Publication Date
CN114863939A CN114863939A (en) 2022-08-05
CN114863939B true CN114863939B (en) 2022-09-13

Family

ID=82625654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210791585.8A Active CN114863939B (en) 2022-07-07 2022-07-07 Panda attribute identification method and system based on sound

Country Status (1)

Country Link
CN (1) CN114863939B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10200824B2 (en) * 2015-05-27 2019-02-05 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on a touch-sensitive device
US20210264939A1 (en) * 2018-06-21 2021-08-26 Nec Corporation Attribute identifying device, attribute identifying method, and program storage medium
CN112530409B (en) * 2020-12-01 2024-01-23 平安科技(深圳)有限公司 Speech sample screening method and device based on geometry and computer equipment
CN113807455A (en) * 2021-09-26 2021-12-17 北京有竹居网络技术有限公司 Method, apparatus, medium, and program product for constructing clustering model

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110189757A (en) * 2019-06-27 2019-08-30 电子科技大学 Giant panda individual identification method, device and computer-readable storage medium
CN111091840A (en) * 2019-12-19 2020-05-01 浙江百应科技有限公司 Method for establishing gender identification model and gender identification method
CN111312292A (en) * 2020-02-18 2020-06-19 北京三快在线科技有限公司 Emotion recognition method and device based on voice, electronic equipment and storage medium
CN111461252A (en) * 2020-04-13 2020-07-28 中国地质大学(武汉) Chick sex detector and detection method
CN111640438A (en) * 2020-05-26 2020-09-08 同盾控股有限公司 Audio data processing method and device, storage medium and electronic equipment
CN112802484A (en) * 2021-04-12 2021-05-14 四川大学 Panda sound event detection method and system under mixed audio frequency
CN113345422A (en) * 2021-04-23 2021-09-03 北京巅峰科技有限公司 Voice data processing method, device, equipment and storage medium
CN114582360A (en) * 2022-02-23 2022-06-03 腾讯音乐娱乐科技(深圳)有限公司 Method, apparatus and computer program product for identifying audio sensitive content

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Recognition of ewe vocalization signals based on power spectrum and formants; Xuan Chuanzhong et al.; Transactions of the Chinese Society of Agricultural Engineering (《农业工程学报》); 2015-12-31; full text *
Analysis and research of a voiceprint-based giant panda individual identification system; Lu Hongkun; China Masters' Theses Full-text Database, Agricultural Science and Technology; 2020-01-15; pages 4, 21, 27, 29, 35 *
Audio-based prediction of giant panda mating outcomes; Yan Weiran; China Masters' Theses Full-text Database, Basic Sciences; 2022-02-15; page 15 *

Also Published As

Publication number Publication date
CN114863939A (en) 2022-08-05

Similar Documents

Publication Publication Date Title
Venkataramanan et al. Emotion recognition from speech
Bhavan et al. Bagged support vector machines for emotion recognition from speech
Serizel et al. Acoustic features for environmental sound analysis
Stöter et al. Countnet: Estimating the number of concurrent speakers using supervised learning
Cheng et al. A call-independent and automatic acoustic system for the individual recognition of animals: A novel model using four passerines
Sultana et al. Bangla speech emotion recognition and cross-lingual study using deep CNN and BLSTM networks
Peng et al. Multi-resolution modulation-filtered cochleagram feature for LSTM-based dimensional emotion recognition from speech
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN111754988A (en) Sound scene classification method based on attention mechanism and double-path depth residual error network
Avci An expert system for speaker identification using adaptive wavelet sure entropy
Vignolo et al. Feature optimisation for stress recognition in speech
Dua et al. A hybrid noise robust model for multireplay attack detection in Automatic speaker verification systems
CN109584904A Sight-singing audio solfège recognition modeling method applied to basic music education
Sunny et al. Recognition of speech signals: an experimental comparison of linear predictive coding and discrete wavelet transforms
Rajeswari et al. Dysarthric speech recognition using variational mode decomposition and convolutional neural networks
Dua et al. Optimizing integrated features for Hindi automatic speech recognition system
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
Lai et al. RPCA-DRNN technique for monaural singing voice separation
CN114863939B (en) Panda attribute identification method and system based on sound
Sunny et al. Feature extraction methods based on linear predictive coding and wavelet packet decomposition for recognizing spoken words in malayalam
CN115881156A (en) Multi-scale-based multi-modal time domain voice separation method
CN115116469A (en) Feature representation extraction method, feature representation extraction device, feature representation extraction apparatus, feature representation extraction medium, and program product
CN113707172A (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
Gul et al. Single channel speech enhancement by colored spectrograms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant