CN114595772A - Infant crying classification method based on Transformer fusion model - Google Patents

Infant crying classification method based on Transformer fusion model Download PDF

Info

Publication number
CN114595772A
Authority
CN
China
Prior art keywords
model
spectrogram
feature
transformer
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210236093.2A
Other languages
Chinese (zh)
Inventor
Li Bin (李彬)
Jiang Bo (江波)
Wang Yan (王妍)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202210236093.2A priority Critical patent/CN114595772A/en
Publication of CN114595772A publication Critical patent/CN114595772A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an infant cry classification method based on a Transformer fusion model. An input audio sample is transformed by an audio processing module to obtain a spectrogram; the spectrogram is passed through a pre-trained Resnet-50 model to obtain spectrogram features; the spectrogram features are fed into a spectrogram enhancement module and an attention mechanism module, which respectively extract data-augmented feature representations and discriminative feature representations within and between channels; the two bilinear feature representations are fused by a Transformer fusion module, which highlights useful information, suppresses redundant information and further strengthens the representational capacity of the feature map; the fused feature map is then used to classify the infant cry, and the final classification result is obtained through multiple training iterations.

Description

Infant crying classification method based on Transformer fusion model
Technical Field
The invention relates to computer speech technology, and in particular to an infant cry classification method based on a Transformer fusion model.
Background
Automatic classification of infant crying is an important research field in bioengineering: medical and engineering techniques are used to analyze the cry signal and distinguish the infant's physiological and pathological states. Unlike the verbal information in adult speech, it is difficult to identify what infants are trying to convey through their cries, so it is important to design an effective infant cry classification model that efficiently obtains and recognizes these physiological and pathological states.
Traditional infant cry classification models are mostly single-branch classification models. They mainly fall into two groups: methods based on traditional machine learning classifiers, including MLP, SVM and decision-tree models, and methods based on deep learning classifiers, including Resnet-50, a transfer-learned Resnet-50 combined with an SVM, graph convolution models and the R-CNN series. Traditional machine learning classifiers have many limitations, such as small data scale and poor generalization ability, and are difficult to apply to complex and changeable real-world scenes. Deep learning classifiers, on the other hand, mostly focus only on deeply extracting complex cry feature representations; they cannot simultaneously expand the limited cry samples and fully mine the discriminative feature representations within and between the cry channels.
In summary, existing infant cry classification has the following problems:
(1) Existing methods cannot adaptively acquire discriminative features, and they ignore the information interaction of features within and between spectrogram feature channels.
(2) For cry data with limited labels, existing methods cannot effectively increase the robustness of spectrogram features, because cry data are sensitive and transcribing the original recordings is time-consuming.
Disclosure of Invention
Purpose of the invention: the invention aims to overcome the defects of the prior art and provides an infant cry classification method based on a Transformer fusion model. Built on the idea of information fusion, on the one hand a spectrogram enhancement module is used to extract robust feature representations and thereby expand the cry data; on the other hand a spatial and channel attention module is used to extract discriminative feature representations within and between channels; finally the two feature representations are fused through a Transformer module, making full use of the fusion idea to further improve classification performance.
The technical scheme is as follows: the invention relates to a method for classifying infant crying based on a Transformer fusion model, comprising the following steps:
step (1), inputting crying audio data of a baby to be classified, preprocessing the input audio data through an audio processing module, and generating a spectrogram;
step (2), constructing a training model, and preliminarily extracting spectrogram characteristics;
a pre-trained Resnet-50 model is taken as the basic network; the parameters of the first 8 groups of convolutional layers in the Resnet-50 model are fixed so that only the parameters of the last two layers participate in training; two fully connected layers with 1024 neurons and a dropout layer with a rate of 0.7 are then added to the basic network to obtain the training model;
step (3), for the spectrogram features obtained in step (2), a spectrogram enhancement module extracts a robust feature representation, while an attention mechanism module extracts discriminative feature representations within and between channels;
step (4), the two feature representations obtained in step (3) are fused through a Transformer fusion module; the fused feature map is then used to classify the infant cry, and the final classification result is obtained through multiple training iterations.
Further, the preprocessing in step (1) converts the audio file samples (in formats such as wav, pcm or mp3) into 256 × 256 spectrograms using audio processing software (e.g., Sound eXchange); in the single-channel spectrogram, the horizontal axis represents time and the vertical axis represents frequency.
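To make this step concrete, a minimal Python sketch that produces a 256 × 256 spectrogram image from an audio file is given below. The preprocessing above uses the Sound eXchange (SoX) tool, so the librosa/Pillow-based code is only an illustrative equivalent; the STFT parameters and the file name are assumptions (the 44100 Hz sampling rate follows the embodiment).

```python
# Hedged sketch: the preprocessing described above uses SoX, not this code;
# the STFT parameters below are illustrative assumptions.
import numpy as np
import librosa
from PIL import Image

def audio_to_spectrogram(path, sr=44100, out_size=(256, 256)):
    y, _ = librosa.load(path, sr=sr)                        # load and resample the cry audio
    stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
    spec_db = librosa.amplitude_to_db(stft, ref=np.max)     # log-magnitude spectrogram
    # Normalize to 0-255 and resize to the 256 x 256 input expected by the model
    norm = (spec_db - spec_db.min()) / (np.ptp(spec_db) + 1e-8)
    img = Image.fromarray((255 * norm).astype(np.uint8))
    return np.array(img.resize(out_size))                   # rows: frequency, columns: time

spec = audio_to_spectrogram("cry_sample.wav")               # "cry_sample.wav" is a hypothetical file
```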
Further, the basic training network model in step (2) is constructed as follows:
a pre-trained Resnet-50 model is taken as the backbone network, the parameters of the first 8 groups of convolutional layers in Resnet-50 (Conv1 to Conv8) are fixed, and only the parameters of the last two layers are updated during training; two fully connected layers with 1024 neurons and a dropout layer with a rate of 0.7 are then added to the backbone to form the basic network model used to preliminarily extract spectrogram features.
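A minimal PyTorch sketch of this basic network is shown below. The description does not name the framework layers that make up the "first 8 groups of convolutional layers", so the freezing boundary, the ReLU activations and the final classification layer here are assumptions.

```python
# Hedged sketch of the base network: pretrained ResNet-50, early layers frozen,
# two 1024-unit fully connected layers and dropout with rate 0.7.
import torch.nn as nn
from torchvision import models

def build_base_network(num_classes=3):
    backbone = models.resnet50(pretrained=True)              # pre-trained Resnet-50 backbone
    for name, param in backbone.named_parameters():
        # Assumption: freeze everything except the last residual stage and the head.
        if not (name.startswith("layer4") or name.startswith("fc")):
            param.requires_grad = False
    in_features = backbone.fc.in_features                    # 2048 for ResNet-50
    backbone.fc = nn.Sequential(
        nn.Linear(in_features, 1024), nn.ReLU(),
        nn.Linear(1024, 1024), nn.ReLU(),
        nn.Dropout(p=0.7),                                   # dropout rate 0.7 as described above
        nn.Linear(1024, num_classes),
    )
    return backbone
```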
Further, the specific method by which the spectrogram enhancement module in step (3) extracts the robust spectrogram features is as follows: masking operations are applied to the time-domain and frequency-domain channels of the spectrogram; specifically, two frequency-domain masks are set, with random widths between 0 and 20, and two time-domain masks are set, with random widths between 10 and 30. The time-warping operation is removed, because infant cries carry no strong semantic information in the speech time sequence.
When training data are insufficient, the spectrogram enhancement module augments the spectrum of the audio data and dynamically expands the data set; spectrogram enhancement helps the neural network learn better spectrogram feature representations, increases the robustness of the trained network against time-domain deformation and loss of frequency-domain segments, and ultimately improves the performance of the final audio classifier.
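As an illustration of the masking described above (SpecAugment-style, with time warping deliberately omitted), a small sketch follows; the zero fill value and the array layout (frequency on axis 0, time on axis 1) are assumptions.

```python
# Hedged sketch of the spectrogram enhancement module: two frequency masks of
# random width 0-20 and two time masks of random width 10-30, no time warping.
import numpy as np

def mask_spectrogram(spec, n_freq_masks=2, n_time_masks=2, rng=None):
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(n_freq_masks):
        width = int(rng.integers(0, 21))                     # frequency mask width in [0, 20]
        start = int(rng.integers(0, max(1, n_freq - width)))
        spec[start:start + width, :] = 0
    for _ in range(n_time_masks):
        width = int(rng.integers(10, 31))                    # time mask width in [10, 30]
        start = int(rng.integers(0, max(1, n_time - width)))
        spec[:, start:start + width] = 0
    return spec
```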
Further, the attention mechanism module in step (3) comprises a channel attention mechanism and a spatial attention mechanism; the specific working process is as follows:
for the channel attention mechanism, channel information is first aggregated using global maximum pooling for each feature map, and the feature descriptors generated by the global maximum pooling are sent to a two-layer perceptron, as shown in equation (1):
P_c = M_mlp(M_max(P)),    (1)
the final feature map is generated by multiplying the channel attention map and the original feature map; as shown in equation (2):
P_f1 = P_c ⊗ P,    (2)
where M_max(·) denotes the global maximum pooling over the channel feature maps, M_mlp(·) is the two-layer perceptron, P denotes the input feature map, P_c is the generated channel attention map, and P_f1 denotes the feature map generated by the channel attention mechanism.
For the spatial attention mechanism, global maximum pooling is applied at each point of the feature map, and the resulting feature descriptors are sent to the two-layer perceptron; finally, the attention map is multiplied with the feature map to generate the output feature map, as shown in equation (3):
P_f2 = M_mlp(M_max(P_f1)) ⊗ P_f1,    (3)
where M_max(·) in equation (3) denotes the global maximum pooling at each point of the feature map, and P_f2 denotes the feature map generated by the spatial attention mechanism.
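A hedged PyTorch sketch of the attention branch corresponding to equations (1)-(3) follows; the perceptron sizes and the sigmoid normalization of the attention maps are assumptions, since the description only specifies global maximum pooling followed by a two-layer perceptron.

```python
# Hedged sketch of the channel + spatial attention module of equations (1)-(3).
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Two-layer perceptron for the channel attention map (eq. 1); sizes are assumptions.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        # Pointwise two-layer perceptron (1x1 convolutions) for the spatial attention map.
        self.spatial_mlp = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=1), nn.ReLU(),
            nn.Conv2d(8, 1, kernel_size=1),
        )

    def forward(self, p):                                     # p: (B, C, H, W)
        # Eq. (1): global max pooling of each feature map, then the two-layer perceptron.
        pc = self.channel_mlp(p.amax(dim=(2, 3)))             # (B, C)
        pc = torch.sigmoid(pc).unsqueeze(-1).unsqueeze(-1)    # sigmoid normalization is an assumption
        pf1 = pc * p                                          # eq. (2): P_f1 = P_c ⊗ P
        # Eq. (3): max pooling over channels at each point, perceptron, then multiply.
        ps = self.spatial_mlp(pf1.amax(dim=1, keepdim=True))  # (B, 1, H, W)
        pf2 = torch.sigmoid(ps) * pf1
        return pf2
```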
Compared with a single-branch spectrogram enhancement module, this dual-branch design can obtain more robust feature embeddings.
Further, the Transformer fusion module in step (4) does not use positional encoding, and a single Transformer block fuses the robust feature representation with the discriminative feature representations within and between channels; the sentence tokens in the Transformer fusion module are set to 49, each token being 1 × 1 × 128 (w × h × c) in size; the final classification result is then obtained through 2 fully connected layers and a softmax.
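The fusion stage might be sketched as below, assuming each branch yields a 7 × 7 × 128 feature map that is flattened into the 49 tokens of dimension 128 mentioned above; the number of attention heads, the hidden width of the head and the element-wise addition used to merge the two branches are assumptions.

```python
# Hedged sketch of the Transformer fusion module: one Transformer block, no
# positional encoding, 49 tokens of dimension 128, two FC layers and softmax.
import torch
import torch.nn as nn

class TransformerFusion(nn.Module):
    def __init__(self, dim=128, num_classes=3):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),                       # second FC layer outputs class scores
        )

    def forward(self, feat_a, feat_b):                         # each branch: (B, 128, 7, 7)
        tokens = (feat_a + feat_b).flatten(2).transpose(1, 2)  # (B, 49, 128); merging by addition is an assumption
        fused = self.block(tokens)                             # one Transformer block, no positional encoding
        return self.head(fused.mean(dim=1))                    # pool tokens, then 2 FC layers (logits)

    def predict(self, feat_a, feat_b):
        return torch.softmax(self.forward(feat_a, feat_b), dim=-1)  # softmax gives the final classification
```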
On the one hand, the invention considers learning features within channels, which benefits the information interaction between channels; on the other hand, discriminative feature representations between channels are learned through the channel and spatial attention modules. A Transformer is therefore used to fuse the two branch modules, organically combining the learned features.
Further, in the iterative training process of step (4), a stochastic gradient method is run for 200 iterations with a batch size of 32 and a model learning rate of 0.0001; once the result no longer changes over ten consecutive iterations, the final model is saved for model testing;
the model is trained and tested in a 5-fold cross validation mode, and finally the average test precision of 5 folds is used as a classification result.
Beneficial effects: given the limited number of infant cry samples and the difficulty of effectively mining discriminative cry feature representations, the invention provides a spectrogram enhancement random-mask module combined with a spatial and channel attention mechanism, obtaining better classification feature representations of infant cries. At the same time, a Transformer mechanism fuses the features, realizing their complementary effect and further improving the robustness of infant cry classification.
Compared with the prior art, the invention has the following advantages:
(1) The training model constructed by the invention is a novel bilinear fusion network that mines audio spectrogram features at multiple levels.
(2) The spectrogram enhancement module extracts robust feature representations of the spectrogram within channels, increases the number of audio samples and dynamically expands the data set.
(3) The attention mechanism module extracts discriminative feature representations and fully mines the discriminative features between channels.
(4) The feature fusion module realizes the complementary effect among features, highlights useful information, suppresses redundant information and further enhances the representational capacity of the feature map.
Drawings
FIG. 1 is a general classification flow diagram of the present invention;
FIG. 2 is a schematic diagram of a network model according to an embodiment;
FIG. 3 is a sample audio frequency spectrum of an embodiment;
FIG. 4 is a schematic diagram of a feature matrix of an embodiment;
FIG. 5 is a schematic diagram of the enhanced spectrum in the embodiment;
FIG. 6 is a diagram illustrating the classification result of baby cry types according to the embodiment.
Detailed Description
The technical solution of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.
As shown in fig. 1, the method for classifying baby crying based on the Transformer fusion model of the present invention includes the following steps:
step (1), the crying audio data of the infant to be classified are input, and the audio samples are converted into 256 × 256 spectrograms using Sound eXchange audio processing software; in the single-channel spectrogram, the horizontal axis represents time and the vertical axis represents frequency;
step (2), constructing a training model, and preliminarily extracting spectrogram characteristics;
as shown in fig. 2, a pre-trained Resnet-50 model is used as the basic network; the parameters of the first 8 groups of convolutional layers in the Resnet-50 model are fixed so that only the parameters of the last two layers participate in training; two fully connected layers with 1024 neurons and a dropout layer with a rate of 0.7 are then added to the basic network to obtain the training model;
step (3), for the spectrogram features obtained in step (2), a spectrogram enhancement module extracts a robust feature representation, while an attention mechanism module extracts discriminative feature representations within and between channels;
step (4), the two feature representations obtained in step (3) are fused through a Transformer fusion module; the fused feature map is then used to classify the infant cry, and the final classification result is obtained through multiple training iterations.
Example 1:
the embodiment comprises the following steps:
step (1), the training data are first processed into 256 × 256 spectrograms by the SoX tool and used as the input of the subsequent model; the audio parameters include a sampling rate of 44100 Hz, a sample size of 16 bits, a sample encoding of 16-bit signed-integer PCM, and 3 channels; a specific audio spectrogram sample is shown in fig. 3;
step (2), a pre-trained Resnet-50 model is used as the basic network and fine-tuned to construct the training model, specifically as follows:
the pre-trained Resnet-50 model is taken as the backbone network, the parameters of the first 8 groups of convolutional layers in Resnet-50 (Conv1 to Conv8) are fixed, and only the parameters of the last two layers are updated during training; two fully connected layers with 1024 neurons and a dropout layer with a rate of 0.7 are then added to the backbone to form the basic network model, which preliminarily extracts spectrogram features and generates a 1024 × 32 feature matrix, as shown in fig. 4.
Step (3), further optimizing the extracted spectrogram characteristics through two branches;
One branch uses the spectrogram enhancement module; the time-warping setting is removed, and only the time-domain and frequency-domain channels of the spectrogram are masked. Specifically, two frequency-domain masks are set, with random widths between 0 and 20, and two time-domain masks are set, with random widths between 10 and 30; the result of enhancing one of the spectrograms is shown in fig. 5.
The other branch uses the spatial and channel attention module. For the channel attention module, each spectrogram feature passes through a global maximum pooling (max pooling) layer and a two-layer MLP to obtain a channel attention feature map. The channel attention feature map is then multiplied by the original input features, and the result is taken as the input of the spatial attention module, which applies global maximum pooling at each point of the feature map; the resulting feature descriptors are sent to the two-layer perceptron; finally, the attention map is multiplied with the feature map to generate the output feature map.
Step (4), the two obtained representations are fused using the Transformer fusion module. In this embodiment the positional encoding in the Transformer module is removed, and one Transformer block fuses the optimized spectrogram features of the two branches; the sentence tokens are set to 49, each token being 1 × 1 × 128 (w × h × c) in size. Finally, the classification result is obtained through 2 fully connected layers and a softmax; the result of classifying one spectrogram is shown in fig. 6.
The classification result of this implementation includes the hungry, sleep and wakeup classes, so the meaning of the cry can be judged from the final classification of the cry audio: hungry indicates that the infant expresses hunger through crying, wakeup indicates that the infant is awake, and sleep indicates that the infant is sleepy, thereby revealing the infant's physiological needs.
In the iterative training of this embodiment, a stochastic gradient method is run for 200 iterations with a batch size of 32 and a model learning rate of 0.0001; once the result no longer changes over ten consecutive iterations, the final model is saved for model testing. The model is trained and tested with 5-fold cross validation, and the average test accuracy over the 5 folds is taken as the classification result.
Example 2:
To verify the rationality and effectiveness of the technical scheme, a subset of the Baby2020 data set was selected for the experiment, and acc (accuracy) was used as the objective evaluation index of the classification result. The embodiment is implemented with the deep learning framework PyTorch and accelerated with a graphics processing unit (GPU), using 12 GB of memory and an Nvidia GeForce GTX 2080Ti graphics card.
The Baby2020 data subset contains three types of samples from healthy infants aged 0 to 3 months: 1058 samples in the hungry category, 1257 in the sleep category and 949 in the wakeup category; 2790 audio clips are used as training data and 743 as test data. Table 1 lists the experimental results of the invention: acc reaches 83.14%, the classification performance is superior to other similar methods, and effective classification of the targets is achieved.
TABLE 1 classification results for the Baby2020 data subsets
(Table 1 is provided as an image in the original publication.)
The embodiment shows that the method greatly improves the accuracy and efficiency of infant cry audio classification and makes it convenient to understand in time the needs the infant expresses by crying.

Claims (7)

1. A method for classifying infant crying based on a Transformer fusion model, characterized by comprising the following steps:
step (1), inputting crying audio data of a baby to be classified, preprocessing the input audio data through an audio processing module, and generating a spectrogram;
step (2), constructing a basic network model, and preliminarily extracting spectrogram characteristics;
a pre-trained Resnet-50 model is taken as the basic network; the parameters of the first 8 groups of convolutional layers in the Resnet-50 model are fixed so that only the parameters of the last two layers participate in training; two fully connected layers with 1024 neurons and a dropout layer with a rate of 0.7 are then added to the basic network to obtain the basic network model;
step (3), for the spectrogram features obtained in step (2), a spectrogram enhancement module extracts a robust feature representation, while an attention mechanism module extracts discriminative feature representations within and between channels;
step (4), the two feature representations obtained in step (3) are fused through a Transformer fusion module; the fused feature map is then used to classify the infant cry, and the final classification result is obtained through multiple training iterations.
2. The method for classifying baby crying based on Transformer fusion model as claimed in claim 1, wherein: the preprocessing in the step (1) is to convert an audio file sample into a spectrogram with the size of 256 × 256 through an audio processing module, wherein the horizontal axis of the spectrogram of a single channel represents time, and the vertical axis represents frequency.
3. The method for classifying infant crying based on a Transformer fusion model as claimed in claim 1, wherein the basic network model in step (2) is constructed as follows:
a pre-trained Resnet-50 model is taken as the basic network; the parameters of the first 8 groups of convolutional layers in the Resnet-50 model are fixed so that only the parameters of the last two layers participate in training; two fully connected layers with 1024 neurons and a dropout layer with a rate of 0.7 are then added to the basic network to obtain the basic network model.
4. The method for classifying infant crying based on a Transformer fusion model as claimed in claim 1, wherein the specific method by which the spectrogram enhancement module in step (3) extracts the robust spectrogram features is as follows: masking operations are applied to the time-domain and frequency-domain channels of the spectrogram; two frequency-domain masks are set, with random widths between 0 and 20, and two time-domain masks are set, with random widths between 10 and 30.
5. The method for classifying infant crying based on a Transformer fusion model as claimed in claim 1, wherein the attention mechanism module in step (3) comprises a channel attention mechanism and a spatial attention mechanism, whose specific working process is as follows:
for the channel attention mechanism, channel information is first aggregated using global maximum pooling for each feature map, and the feature descriptors generated by the global maximum pooling are sent to a two-layer perceptron, as shown in equation (1):
P_c = M_mlp(M_max(P)),    (1)
the final feature map is generated by multiplying the channel attention map and the original feature map; as shown in equation (2):
P_f1 = P_c ⊗ P,    (2)
where M_max(·) denotes the global maximum pooling over the channel feature maps, M_mlp(·) is the two-layer perceptron, P denotes the input feature map, P_c is the generated channel attention map, and P_f1 denotes the feature map generated by the channel attention mechanism;
for the spatial attention mechanism, global maximum pooling is applied at each point of the feature map, the resulting feature descriptors are sent to the two-layer perceptron, and finally the attention map is multiplied with the feature map to generate the output feature map, as shown in equation (3):
P_f2 = M_mlp(M_max(P_f1)) ⊗ P_f1,    (3)
where M_max(·) in equation (3) denotes the global maximum pooling at each point of the feature map, and P_f2 denotes the feature map generated by the spatial attention mechanism.
6. The method for classifying infant crying based on a Transformer fusion model as claimed in claim 1, wherein the Transformer fusion module in step (4) does not use positional encoding, and a single Transformer block fuses the robust feature representation with the discriminative feature representations within and between channels;
the sentence tokens in the Transformer fusion module are set to 49, each token being 1 × 1 × 128 (w × h × c) in size; the final classification result is obtained through 2 fully connected layers and a softmax.
7. The method for classifying infant crying based on a Transformer fusion model as claimed in claim 1, wherein in the iterative training process of step (4), a stochastic gradient method is run for 200 iterations with a batch size of 32 and a model learning rate of 0.0001; once the result no longer changes over ten consecutive iterations, the final model is saved for model testing;
the model is trained and tested with 5-fold cross validation, and the average test accuracy over the 5 folds is taken as the classification result.
CN202210236093.2A 2022-03-11 2022-03-11 Infant crying classification method based on Transformer fusion model Pending CN114595772A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210236093.2A CN114595772A (en) 2022-03-11 2022-03-11 Infant crying classification method based on Transformer fusion model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210236093.2A CN114595772A (en) 2022-03-11 2022-03-11 Infant crying classification method based on Transformer fusion model

Publications (1)

Publication Number Publication Date
CN114595772A true CN114595772A (en) 2022-06-07

Family

ID=81818647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210236093.2A Pending CN114595772A (en) 2022-03-11 2022-03-11 Infant crying classification method based on Transformer fusion model

Country Status (1)

Country Link
CN (1) CN114595772A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116386661A (en) * 2023-06-05 2023-07-04 成都启英泰伦科技有限公司 Crying detection model training method based on dual attention and data enhancement
CN116386661B (en) * 2023-06-05 2023-08-08 成都启英泰伦科技有限公司 Crying detection model training method based on dual attention and data enhancement

Similar Documents

Publication Publication Date Title
Gong et al. Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation
Chatziagapi et al. Data Augmentation Using GANs for Speech Emotion Recognition.
Mei et al. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research
Ke et al. Speech emotion recognition based on SVM and ANN
CN112818861A (en) Emotion classification method and system based on multi-mode context semantic features
Li et al. An evaluation of deep neural network models for music classification using spectrograms
CN112102813B (en) Speech recognition test data generation method based on context in user comment
CN103605990A (en) Integrated multi-classifier fusion classification method and integrated multi-classifier fusion classification system based on graph clustering label propagation
CN115762536A (en) Small sample optimization bird sound recognition method based on bridge transform
CN111899766B (en) Speech emotion recognition method based on optimization fusion of depth features and acoustic features
CN111882042B (en) Neural network architecture automatic search method, system and medium for liquid state machine
CN116010874A (en) Emotion recognition method based on deep learning multi-mode deep scale emotion feature fusion
CN114595772A (en) Infant crying classification method based on Transformer fusion model
Wu Research on automatic classification method of ethnic music emotion based on machine learning
Li et al. Emotion recognition from speech with StarGAN and Dense‐DCNN
Sun et al. Multi-classification speech emotion recognition based on two-stage bottleneck features selection and MCJD algorithm
CN103871413A (en) Men and women speaking voice classification method based on SVM and HMM mixing model
Glickman et al. (A) Data in the Life: Authorship Attribution of Lennon-McCartney Songs
CN105632485A (en) Language distance relation obtaining method based on language identification system
Li et al. Audio recognition of Chinese traditional instruments based on machine learning
CN113806543B (en) Text classification method of gate control circulation unit based on residual jump connection
CN116052718A (en) Audio evaluation model training method and device and audio evaluation method and device
CN115455144A (en) Data enhancement method of completion type space filling type for small sample intention recognition
Martín-Morató et al. Adaptive distance-based pooling in convolutional neural networks for audio event classification
Vásquez et al. Tailed U-Net: Multi-Scale Music Representation Learning.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination