CN113506553A - Audio automatic labeling method based on transfer learning - Google Patents

Audio automatic labeling method based on transfer learning Download PDF

Info

Publication number
CN113506553A
Authority
CN
China
Prior art keywords
audio
data set
labeling
labels
automatic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110712420.2A
Other languages
Chinese (zh)
Other versions
CN113506553B (en)
Inventor
居辰 (Ju Chen)
韩立新 (Han Lixin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202110712420.2A priority Critical patent/CN113506553B/en
Publication of CN113506553A publication Critical patent/CN113506553A/en
Application granted granted Critical
Publication of CN113506553B publication Critical patent/CN113506553B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/036 Musical analysis of musical genre, i.e. analysing the style of musical pieces, usually for selection, filtering or classification

Abstract

The invention discloses an automatic audio labeling method based on transfer learning, which mainly comprises the following steps: data preprocessing, in which primary labels (such as music genre, instrument, mood, and singer information) are established according to the distribution of the audio labels, and the original data set is divided into data sets containing the primary labels as the input of the model; transfer learning, in which knowledge structures already learned in the field of image recognition are used to complete audio classification on the small audio data sets and to label the audio data not yet carrying a primary label; and automatic audio labeling, in which different audio representations are used as input to the label-reconstructed data set, and a classifier is constructed to learn the time-frequency domain features of the audio signal and label it automatically. The proposed method effectively alleviates the extreme sparsity of the original audio labeling data set, increases the diversity and balance of the audio labels, and improves the accuracy of automatic audio labeling.

Description

Audio automatic labeling method based on transfer learning
Technical Field
The invention relates to the fields of music information research and transfer learning, and in particular to an automatic audio labeling method based on transfer learning.
Background
In recent years, with the expansion of the Internet and the rapid development of digital multimedia technology, the digitization of music has become a trend. Online music resources are growing explosively, and users expect ever higher quality from music services. The quality of retrieval, recommendation, and page navigation on digital music websites depends largely on the quality of the music tags: music retrieval typically relies on the tags of songs as the basis for category-based retrieval, and music recommendation typically relies on a user's listening history to recommend similar songs. However, newly released and rarely played songs on digital music websites carry little label information, so unpopular long-tail music is seldom recommended or accessed and has little chance of receiving user or community tags, forming a negative feedback loop. Although expert annotation can address this long-tail problem, it is costly and cannot scale to large music libraries. Automatically labeling music from its audio therefore has very high research value.
Music tags are high-level descriptive words that express musical characteristics and are an important component of digital music services. Common tag categories include genre (classical, jazz, rock, country, etc.), instrument (guitar, strings, piano, drums, etc.), mood (happy, relaxed, angry, tense, sad, etc.), and singer information (gender, number of singers, etc.). Automatic music labeling predicts tags from audio and comprises two important subtasks: acquiring descriptive audio features that effectively represent the intrinsic attributes of the music, and learning the mapping from those features to high-level semantic tags. Traditional automatic labeling methods compute sound-related features (such as fundamental frequency, formants, and Mel-frequency cepstral coefficients) directly from the time or frequency domain through signal processing, and then use them as input for model training in a machine learning stage. However, hand-crafted feature design is laborious, requires substantial expertise, and struggles to fully describe the many facets of music. With the great success of deep learning across pattern recognition, newer labeling methods learn the mapping from audio features to text tags using network structures such as convolutional and recurrent neural networks, but the problems of monotonous labels and low accuracy remain.
The publicly available automatic audio labeling data sets contain tags for genre, mood, instrument, singer information, and so on, and are labeled mainly by hand, so most of them suffer from missing labels and data sparsity. To address this, a method is needed that reasonably expands the tag data and, to some extent, enriches the attributes of the music. Transfer learning carries labeled data or a learned knowledge structure over from a related field to complete or improve a target task, and is mainly divided into instance-based, feature-based, and shared-parameter-based transfer; for example, knowing how to ride a bicycle makes learning to ride a motorcycle easier. The original data set can be divided into several sub-datasets according to the primary labels, each corresponding to an automatic audio labeling task under a different primary label. VGG and ResNet perform well in image recognition; migrating them to the audio classification task improves the automatic labeling effect to a certain extent and allows missing labels in the data set to be predicted. Compared with other automatic audio labeling methods, this compensates for the missing data in the original data set, makes the data more balanced, and improves the diversity and balance of the labels.
Audio carries rich information in both the time and frequency domains: the time domain shows how amplitude (volume) changes over time, while the frequency domain reflects changes in sound frequency and relates to timbre. Different representations of the audio (the time-domain signal and the Mel spectrogram) are therefore used as input, automatic labeling classifiers are constructed to learn the audio's time-domain and frequency-domain features, and decision-level fusion is finally performed. Compared with automatic labeling research that uses the audio waveform or the spectrogram alone, this both improves the utilization of the audio information and improves the accuracy of automatic audio labeling.
Disclosure of Invention
Purpose of the invention: aiming at the missing labels and data sparsity in existing research on automatic music labeling, an automatic audio labeling method based on transfer learning is provided that improves the diversity and balance of the audio labels; meanwhile, different representations of the audio are used as input to construct an automatic audio labeling classifier that learns richer time-frequency domain features, thereby improving the utilization of the audio information and the accuracy of automatic audio labeling, and providing better data support for music retrieval and recommendation systems.
Technical scheme: an automatic audio labeling method based on transfer learning comprises the following specific steps:
step 1: data pre-processing
Read the original audio labeling data set, clean it by deleting problematic data files (for example, empty files), and sort the label data from high to low by frequency of occurrence. Resample the MP3 audio files at a sampling frequency of 11025 Hz, and map the audio signal from the time domain to the frequency domain through a short-time Fourier transform to obtain a spectrogram, which captures how the frequency content changes over time.
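As an illustration only, this preprocessing could look like the following Python sketch; the `preprocess` function and the window parameters `n_fft` and `hop_length` are assumptions, since the patent specifies only the 11025 Hz sampling rate and the short-time Fourier transform.

```python
# Hypothetical preprocessing sketch; window sizes are assumed, not from the patent.
import numpy as np
import librosa

def preprocess(path, sr=11025, n_fft=1024, hop_length=512):
    # librosa decodes the MP3 and resamples it to the requested rate
    y, _ = librosa.load(path, sr=sr, mono=True)
    # complex STFT -> magnitude spectrogram (frequency bins x time frames)
    spectrogram = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    return y, spectrogram
```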
According to the distribution of the labels in the original data set, construct m primary labels (such as genre, instrument, mood, and singer information) and divide the data set into m sub-datasets. Each sub-dataset corresponds to n secondary labels (for example, genre divides into classical, jazz, rock, and country), where the secondary labels are the labels originally present in the data set.
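A minimal sketch of this label reconstruction, assuming a simple in-memory data layout; the concrete label names below are only the examples the patent lists, and the `records` format is hypothetical.

```python
# Hypothetical primary -> secondary label map; names follow the patent's examples.
PRIMARY_TO_SECONDARY = {
    "genre":      ["classical", "jazz", "rock", "country"],
    "instrument": ["guitar", "strings", "piano", "drums"],
    "mood":       ["happy", "relaxed", "angry", "tense", "sad"],
    "singer":     ["male", "female"],
}

def split_by_primary(records):
    """records: iterable of (audio_path, set_of_secondary_labels).
    Returns one sub-dataset per primary label, keeping the clips that
    carry at least one of its secondary labels."""
    subsets = {primary: [] for primary in PRIMARY_TO_SECONDARY}
    for path, labels in records:
        for primary, secondaries in PRIMARY_TO_SECONDARY.items():
            hits = labels & set(secondaries)
            if hits:
                subsets[primary].append((path, sorted(hits)))
    return subsets
```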
Step 2: transfer learning
Through step 1, the automatic audio labeling problem is converted into several automatic labeling problems on audio subsets. Select a sub-dataset m1 with its corresponding n1 secondary labels, take the Mel spectrogram as the model input, build an end-to-end multi-class model, and perform knowledge learning with ROC-AUC as the evaluation metric. Network models such as VGG and ResNet perform well in the field of image recognition; they are migrated to the audio classification task and fine-tuned, converting them into models suitable for automatic audio labeling.
Select a sub-dataset mi (i > 1) with its corresponding ni secondary labels, transfer the image recognition network model to the data set mi, take the Mel spectrogram as the model input, and train and fine-tune the model to obtain training parameters on the mi data set; the fine-tuned model can then predict the corresponding ni secondary labels for the other sub-datasets mj (j ≠ i). In this way a more balanced labeled data set is obtained, in which each audio file carries at least m labels.
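A hedged sketch of this migration step using torchvision; the choice of ResNet-18 and the exact freezing scheme are assumptions, since the patent names VGG and ResNet but fixes neither the depth nor the layers to retrain.

```python
import torch.nn as nn
from torchvision import models

def build_subset_model(n_secondary_labels):
    # start from ImageNet-pretrained weights, as learned in image recognition
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    for param in model.parameters():
        param.requires_grad = False  # freeze the pretrained backbone
    # replace the final fully connected layer with a head sized to this
    # sub-dataset's n_i secondary labels; only this new layer is trained
    model.fc = nn.Linear(model.fc.in_features, n_secondary_labels)
    return model

# A 1-channel Mel spectrogram can be repeated to 3 channels, e.g. with
# tensor.repeat(1, 3, 1, 1), to match the network's expected input format.
```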
Step 3: audio automatic labeling
Perform multi-label classification on the expanded labeled audio data set. Considering that audio signals have different characteristics in the time and frequency domains, take the time-domain waveform and the frequency-domain spectrogram as inputs respectively and construct automatic labeling classifiers for knowledge learning: an LSTM learns the temporal features of the audio, a ResNet learns the frequency-domain features of the spectrogram, and the outputs of the two models are then fused at the decision level to produce the final automatic labeling classifier, as shown in formula (1):

$$P = \sum_{i=1}^{n} \mathrm{weight}_i \cdot P_i \qquad (1)$$

where n is the number of models, weight_i is the weight of model i, and P_i is the prediction probability of model i.
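Formula (1) expressed as a short sketch; the equal default weights are an assumption, since the patent does not say how weight_i is chosen.

```python
import numpy as np

def decision_level_fusion(probabilities, weights=None):
    """probabilities: list of n arrays of shape (num_tags,), the P_i terms.
    weights: list of n scalars, the weight_i terms (defaults to equal weights)."""
    if weights is None:
        weights = [1.0 / len(probabilities)] * len(probabilities)
    return sum(w * np.asarray(p) for w, p in zip(weights, probabilities))
```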
The beneficial effects of the invention are as follows:
1) The audio data is converted to the frequency domain through a short-time Fourier transform, yielding a frequency-domain representation of the audio.
2) Primary labels are constructed from the label distribution of the original data set, each corresponding to several secondary labels, and the data set is divided into several sub-datasets according to the primary labels.
3) Network models that perform well in image recognition are migrated to the automatic audio labeling task, fine-tuned, and optimized, improving the accuracy of audio classification under a single primary label.
4) The model from 3) is applied to classify each audio subset and to label the data missing a primary label, expanding the audio labeling data set and improving the diversity and balance of the labels.
5) Multi-label classification is performed on the expanded data set: since the audio signal carries useful information in both the time and frequency domains, the time-domain signal and the spectrogram are taken as inputs respectively, automatic labeling classifiers are constructed, their outputs are fused at the decision level, and the model is trained and tested against the ground-truth labeled data set, improving both the utilization of the audio information and the accuracy of automatic audio labeling.
Drawings
FIG. 1 is a flowchart of an audio automatic labeling method based on transfer learning according to the present invention.
FIG. 2 is a schematic diagram of an automatic audio annotation process according to the present invention.
Detailed Description
The present invention is further illustrated below with reference to the drawings and a specific embodiment. It should be understood that the embodiment is intended only to illustrate the invention and not to limit its scope; after reading the present specification, equivalent modifications made by those skilled in the art fall within the scope defined by the appended claims.
Referring to FIG. 1, the algorithm flow of an embodiment of the present invention includes the following steps:
Step 1, audio preprocessing: perform data cleaning to remove invalid data; resample the MP3 audio data at a sampling rate of 11025 Hz, and convert the audio signal into a spectrogram using a short-time Fourier transform to obtain its frequency-domain representation; merge synonymous labels in the data set and sort the labels from high to low by frequency of occurrence.
Step2, dividing the data set: and checking the distribution condition of the labels of the data set, selecting the label with the frequency of 50 before as a secondary label, constructing a primary label (the information of the genre, the emotion, the musical instrument and the singer), and respectively corresponding to one part of the 50 secondary labels, so as to divide the data set into 4 sub-data sets (a genre data set, an emotion data set, a musical instrument data set and a singer information data set).
Step3, training a transfer learning model: the Resnet with better effect in the image field is migrated to the audio classification task, model fine-tuning is respectively carried out on the obtained 4 sub-data sets (network weights except for the full connection layer are frozen, image features are extracted, and full connection layer parameters are trained based on different sub-data sets), so that the accuracy of audio classification is improved.
Step 4, expanding the label data set: building on step 3, apply the models fine-tuned on the different sub-datasets to predict and add labels for the data sets that do not contain the corresponding primary label. For example, the model fine-tuned on the genre data set is applied to predict labels for the data that carries no genre label. Repeat this operation until every audio clip carries at least 4 labels, thereby expanding the labeled data set and improving its diversity and balance.
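A sketch of this expansion loop under assumed data structures; `predict_secondary` stands in for running the fine-tuned model on a clip and is hypothetical.

```python
def expand_labels(dataset, fine_tuned_models, predict_secondary):
    """dataset: dict audio_id -> {primary_label: [secondary labels]}.
    fine_tuned_models: dict primary_label -> model fine-tuned on that subset.
    predict_secondary: callable (model, audio_id) -> [secondary labels]."""
    for audio_id, tags in dataset.items():
        for primary, model in fine_tuned_models.items():
            if primary not in tags:  # this clip is missing the primary label
                tags[primary] = predict_secondary(model, audio_id)
    return dataset  # every clip now carries at least one label per primary
```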
Step 5, automatic audio labeling: construct the automatic audio labeling model, using the time-domain waveform and the spectrogram as the representations of the audio signal in the time and frequency domains respectively. Feed the audio signal into an LSTM network to learn the temporal features of the audio and output a probability distribution over the 50 labels; feed the spectrogram into a ResNet network to extract deep features and output a probability distribution over the 50 labels. Then fuse the outputs of the LSTM and the ResNet at the decision level, as in formula (2):

$$P = \sum_{i=1}^{2} \mathrm{weight}_i \cdot P_i \qquad (2)$$

where P_1 and P_2 are the prediction probabilities of the LSTM and the ResNet respectively.
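One possible two-branch layout for this step, written as a PyTorch sketch; the hidden sizes, frame length, and equal fusion weights are assumptions, and only the LSTM/ResNet pairing, the 50-label heads, and the weighted fusion come from the text.

```python
import torch
import torch.nn as nn
from torchvision import models

class TimeBranch(nn.Module):
    """LSTM over framed waveform chunks, learning temporal features."""
    def __init__(self, frame_len=512, hidden=128, n_tags=50):
        super().__init__()
        self.lstm = nn.LSTM(frame_len, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_tags)

    def forward(self, frames):                  # frames: (B, T, frame_len)
        _, (h, _) = self.lstm(frames)
        return torch.sigmoid(self.head(h[-1]))  # (B, n_tags) probabilities

class FreqBranch(nn.Module):
    """ResNet over the spectrogram, learning frequency-domain features."""
    def __init__(self, n_tags=50):
        super().__init__()
        self.resnet = models.resnet18(num_classes=n_tags)

    def forward(self, spec):                    # spec: (B, 3, H, W)
        return torch.sigmoid(self.resnet(spec))

def fuse(p_time, p_freq, w_time=0.5, w_freq=0.5):
    return w_time * p_time + w_freq * p_freq    # formula (2)
```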
the automatic labeling model is trained by taking the expanded label data set as a target to generate an audio automatic labeling model, the model not only improves the utilization rate of audio information, but also improves the accuracy rate of audio automatic labeling, and the model can be used for adding a plurality of labels to unknown labeled songs.

Claims (5)

1. An automatic audio labeling method based on transfer learning, mainly comprising the following steps:
Step 1: data preprocessing: clean the data set, and convert the audio information to the frequency domain through a short-time Fourier transform to obtain a spectrogram as the frequency-domain representation of the audio signal; screen primary labels according to the label distribution of the data set, and divide the data set into several primary-label data sets (such as a genre data set, a mood data set, and an instrument data set);
Step 2: transfer learning: migrate image recognition network models to the m sub-datasets obtained in step 1 for training and fine-tuning respectively, construct an automatic audio labeling classifier under each primary label, and expand the audio labeling data set;
Step 3: automatic audio labeling: build a model for knowledge learning on the data set expanded in step 2, and train an automatic audio labeling classifier.
2. The automatic audio labeling method based on transfer learning of claim 1, wherein the data set is divided in step 1 by constructing m primary labels according to the label distribution of the original data set, each primary label corresponding to several secondary labels, the secondary labels being the original labels of the data set; the original multi-label data set is thereby divided into sub-datasets corresponding to the several primary labels.
3. The automatic audio labeling method based on transfer learning of claim 1, wherein the selection and training of the transfer learning model in step 2 mainly uses a network model that performs well in the image field and migrates it to the audio classification task for fine-tuning, so as to improve the accuracy of audio classification; given the improvement the model brings to the task, an audio classifier is constructed on this basis, further improving the audio classification effect.
4. The automatic audio labeling method based on transfer learning of claim 1, wherein the original labels of the data set are expanded in step 2 by selecting a sub-dataset and training and testing the audio classifier of step 2 on it; once the classification accuracy is sufficiently high, the classifier is considered able to predict labels of this type for the other data sets, which are labeled accordingly; the original labeled data set is thereby expanded into one in which each audio clip carries at least m labels, improving the diversity and balance of the labeled data set.
5. The automatic audio labeling method based on transfer learning of claim 1, wherein the automatic audio labeling of step 3 performs multi-label classification on the data set expanded in step 2 (each audio clip carries at least m labels); an audio label classifier is constructed with the spectrogram as input and the m labels as targets, learning the frequency-domain features of the audio; because the original signal contains rich time-domain information, the time-domain waveform is taken as the input of another classifier to learn the temporal features of the audio signal; finally, the outputs (probability distributions) of the two classifiers are fused at the decision level, as shown in formula (1):

$$P = \sum_{i=1}^{n} \mathrm{weight}_i \cdot P_i \qquad (1)$$

where n is the number of models, weight_i is the weight of model i, and P_i is the prediction probability of model i; the accuracy of the automatic audio labeling model is thereby greatly improved and the audio information is fully utilized.
CN202110712420.2A 2021-06-25 2021-06-25 Audio automatic labeling method based on transfer learning Active CN113506553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110712420.2A CN113506553B (en) 2021-06-25 2021-06-25 Audio automatic labeling method based on transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110712420.2A CN113506553B (en) 2021-06-25 2021-06-25 Audio automatic labeling method based on transfer learning

Publications (2)

Publication Number Publication Date
CN113506553A (en) 2021-10-15
CN113506553B (en) 2023-12-05

Family

ID=78010615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110712420.2A Active CN113506553B (en) 2021-06-25 2021-06-25 Audio automatic labeling method based on transfer learning

Country Status (1)

Country Link
CN (1) CN113506553B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170011306A1 (en) * 2015-07-06 2017-01-12 Microsoft Technology Licensing, Llc Transfer Learning Techniques for Disparate Label Sets
CN109918535A (en) * 2019-01-18 2019-06-21 华南理工大学 Music automatic marking method based on label depth analysis
CN111462774A (en) * 2020-03-19 2020-07-28 河海大学 Music emotion credible classification method based on deep learning
CN112199548A (en) * 2020-09-28 2021-01-08 华南理工大学 Music audio classification method based on convolution cyclic neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jordi Pons et al., "End-to-End Learning for Music Audio Tagging at Scale", arXiv:1711.02520v4, pp. 1-8 *
Yu Chao (于超), "Research on Transfer Learning Methods in Music Emotion Recognition", Modern Computer (Professional Edition), no. 6, pp. 3-6 *
Feng Chuyi (冯楚祎), "Research on Automatic Music Annotation Methods Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology, no. 09 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024021882A1 (en) * 2022-07-28 2024-02-01 腾讯科技(深圳)有限公司 Audio data processing method and apparatus, and computer device and storage medium
CN117476036A (en) * 2023-12-27 2024-01-30 广州声博士声学技术有限公司 Environmental noise identification method, system, equipment and medium
CN117476036B (en) * 2023-12-27 2024-04-09 广州声博士声学技术有限公司 Environmental noise identification method, system, equipment and medium

Also Published As

Publication number Publication date
CN113506553B (en) 2023-12-05

Similar Documents

Publication Publication Date Title
Nam et al. Deep learning for audio-based music classification and tagging: Teaching computers to distinguish rock from bach
Aucouturier et al. "The way it Sounds": timbre models for analysis and retrieval of music signals
CN106295717B (en) A kind of western musical instrument classification method based on rarefaction representation and machine learning
CN112435642B (en) Melody MIDI accompaniment generation method based on deep neural network
CN113506553B (en) Audio automatic labeling method based on transfer learning
Kiela et al. Learning neural audio embeddings for grounding semantics in auditory perception
Kolozali et al. Automatic ontology generation for musical instruments based on audio analysis
CN113813609B (en) Game music style classification method and device, readable medium and electronic equipment
Gan Music feature classification based on recurrent neural networks with channel attention mechanism
Krause et al. Classifying Leitmotifs in Recordings of Operas by Richard Wagner.
Mounika et al. Music genre classification using deep learning
Armentano et al. Genre classification of symbolic pieces of music
Li et al. Music genre classification based on fusing audio and lyric information
Kai Automatic recommendation algorithm for video background music based on deep learning
Shaikh et al. Music genre classification using neural network
Bhat et al. Deep learning approach to joint identification of instrument pitch and raga for Indian classical music
Kolozali et al. A framework for automatic ontology generation based on semantic audio analysis
Fuhrmann et al. Quantifying the Relevance of Locally Extracted Information for Musical Instrument Recognition from Entire Pieces of Music.
Miao et al. Construction of multimodal music automatic annotation model based on neural network algorithm
Xue et al. Research on the Filtering and Classification Method of Interactive Music Education Resources Based on Neural Network
Fu et al. Improve symbolic music pre-training model using MusicTransformer structure
Tang et al. Construction of Music Classification and Detection Model Based on Big Data Analysis and Genetic Algorithm
MOLGORA Musical instrument recognition: a transfer learning approach
Joseph Fernandez Comparison of Deep Learning and Machine Learning in Music Genre Categorization
CN114817620A (en) Song comparison method and device, equipment, medium and product thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant