CN113506553A - Audio automatic labeling method based on transfer learning - Google Patents

Audio automatic labeling method based on transfer learning Download PDF

Info

Publication number
CN113506553A
Authority
CN
China
Prior art keywords
audio
data set
labeling
labels
automatic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110712420.2A
Other languages
Chinese (zh)
Other versions
CN113506553B (en)
Inventor
居辰 (Ju Chen)
韩立新 (Han Lixin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202110712420.2A priority Critical patent/CN113506553B/en
Publication of CN113506553A publication Critical patent/CN113506553A/en
Application granted granted Critical
Publication of CN113506553B publication Critical patent/CN113506553B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/036 Musical analysis of musical genre, i.e. analysing the style of musical pieces, usually for selection, filtering or classification

Abstract

The invention discloses an automatic audio labeling method based on transfer learning, which mainly comprises the following steps: data preprocessing, in which primary labels (such as music genre, instrument, mood, and singer information) are established according to the distribution of the audio labels, and the original data set is divided into data sets containing the primary labels as the input of the model; transfer learning, in which knowledge structures already learned in the field of image recognition are used to complete audio classification on the small audio data sets and to label the audio data not yet carrying a primary label; and automatic audio labeling, in which different audio representations are used as input to the label-reconstructed data set, and a classifier is constructed to learn the time-frequency domain features of the audio signal and label it automatically. The proposed method effectively alleviates the extreme sparsity of the original audio labeling data set, increases the diversity and balance of the audio labels, and improves the accuracy of automatic audio labeling.

Description

Audio automatic labeling method based on transfer learning
Technical Field
The invention relates to the fields of music information research and transfer learning, and in particular to an automatic audio labeling method based on transfer learning.
Background
In recent years, with the expansion of the Internet and the rapid development of digital multimedia technology, the digitization of music has become a trend. Online music resources are growing explosively, and users expect ever higher quality from music services. The quality of retrieval, recommendation, and page navigation on digital music websites depends largely on the quality of the music tags: music retrieval typically relies on the tags of songs as the basis for category-based retrieval, and music recommendation typically relies on a user's listening history to recommend similar songs. However, newly released and rarely played songs on digital music websites carry little label information, so unpopular long-tail music is seldom recommended or accessed and has little chance of receiving user or community tags, forming a negative feedback loop. Although expert annotation can address this long-tail problem, it is costly and cannot scale to large music libraries. Automatically labeling music from its audio therefore has very high research value.
Music tags are high-level descriptive words that express musical characteristics and are an important component of digital music services. Common tag categories include genre (classical, jazz, rock, country, etc.), instrument (guitar, strings, piano, drums, etc.), mood (happy, relaxed, angry, tense, sad, etc.), and singer information (gender, number of singers, etc.). Automatic music labeling predicts tags from audio and comprises two important subtasks: acquiring descriptive audio features that effectively represent the intrinsic attributes of the music, and learning the mapping from those features to high-level semantic tags. Traditional automatic labeling methods compute sound-related features (such as fundamental frequency, formants, and Mel-frequency cepstral coefficients) directly from the time or frequency domain through signal processing, and then use them as input for model training in a machine learning stage. However, hand-crafted feature design is laborious, requires substantial expertise, and struggles to fully describe the many facets of music. With the great success of deep learning across pattern recognition, newer labeling methods learn the mapping from audio features to text tags using network structures such as convolutional and recurrent neural networks, but the problems of monotonous labels and low accuracy remain.
The publicly available automatic audio labeling data sets contain tags for genre, mood, instrument, singer information, and so on, and are labeled mainly by hand, so most of them suffer from missing labels and data sparsity. To address this, a method is needed that reasonably expands the tag data and, to some extent, enriches the attributes of the music. Transfer learning carries labeled data or a learned knowledge structure over from a related field to complete or improve a target task, and is mainly divided into instance-based, feature-based, and shared-parameter-based transfer; for example, knowing how to ride a bicycle makes learning to ride a motorcycle easier. The original data set can be divided into several sub-datasets according to the primary labels, each corresponding to an automatic audio labeling task under a different primary label. VGG and ResNet perform well in image recognition; migrating them to the audio classification task improves the automatic labeling effect to a certain extent and allows missing labels in the data set to be predicted. Compared with other automatic audio labeling methods, this compensates for the missing data in the original data set, makes the data more balanced, and improves the diversity and balance of the labels.
Audio carries rich information in both the time and frequency domains: the time domain shows how amplitude (volume) changes over time, while the frequency domain reflects changes in sound frequency and relates to timbre. Different representations of the audio (the time-domain signal and the Mel spectrogram) are therefore used as input, automatic labeling classifiers are constructed to learn the audio's time-domain and frequency-domain features, and decision-level fusion is finally performed. Compared with automatic labeling research that uses the audio waveform or the spectrogram alone, this both improves the utilization of the audio information and improves the accuracy of automatic audio labeling.
Disclosure of Invention
Purpose of the invention: aiming at the missing labels and data sparsity in existing research on automatic music labeling, an automatic audio labeling method based on transfer learning is provided that improves the diversity and balance of the audio labels; meanwhile, different representations of the audio are used as input to construct an automatic audio labeling classifier that learns richer time-frequency domain features, thereby improving the utilization of the audio information and the accuracy of automatic audio labeling, and providing better data support for music retrieval and recommendation systems.
Technical scheme: an automatic audio labeling method based on transfer learning comprises the following specific steps:
step 1: data pre-processing
Read the original audio labeling data set, clean it by deleting problematic data files (for example, empty files), and sort the label data from high to low by frequency of occurrence. Resample the MP3 audio files at a sampling frequency of 11025 Hz, and map the audio signal from the time domain to the frequency domain through a short-time Fourier transform to obtain a spectrogram, which captures how the frequency content changes over time.
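As an illustration only, this preprocessing could look like the following Python sketch; the `preprocess` function and the window parameters `n_fft` and `hop_length` are assumptions, since the patent specifies only the 11025 Hz sampling rate and the short-time Fourier transform.

```python
# Hypothetical preprocessing sketch; window sizes are assumed, not from the patent.
import numpy as np
import librosa

def preprocess(path, sr=11025, n_fft=1024, hop_length=512):
    # librosa decodes the MP3 and resamples it to the requested rate
    y, _ = librosa.load(path, sr=sr, mono=True)
    # complex STFT -> magnitude spectrogram (frequency bins x time frames)
    spectrogram = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    return y, spectrogram
```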
According to the distribution of the labels in the original data set, construct m primary labels (such as genre, instrument, mood, and singer information) and divide the data set into m sub-datasets. Each sub-dataset corresponds to n secondary labels (for example, genre divides into classical, jazz, rock, and country), where the secondary labels are the labels originally present in the data set.
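A minimal sketch of this label reconstruction, assuming a simple in-memory data layout; the concrete label names below are only the examples the patent lists, and the `records` format is hypothetical.

```python
# Hypothetical primary -> secondary label map; names follow the patent's examples.
PRIMARY_TO_SECONDARY = {
    "genre":      ["classical", "jazz", "rock", "country"],
    "instrument": ["guitar", "strings", "piano", "drums"],
    "mood":       ["happy", "relaxed", "angry", "tense", "sad"],
    "singer":     ["male", "female"],
}

def split_by_primary(records):
    """records: iterable of (audio_path, set_of_secondary_labels).
    Returns one sub-dataset per primary label, keeping the clips that
    carry at least one of its secondary labels."""
    subsets = {primary: [] for primary in PRIMARY_TO_SECONDARY}
    for path, labels in records:
        for primary, secondaries in PRIMARY_TO_SECONDARY.items():
            hits = labels & set(secondaries)
            if hits:
                subsets[primary].append((path, sorted(hits)))
    return subsets
```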
Step 2: transfer learning
Through step 1, the automatic audio labeling problem is converted into several automatic labeling problems on audio subsets. Select a sub-dataset m1 with its corresponding n1 secondary labels, take the Mel spectrogram as the model input, build an end-to-end multi-class model, and perform knowledge learning with ROC-AUC as the evaluation metric. Network models such as VGG and ResNet perform well in the field of image recognition; they are migrated to the audio classification task and fine-tuned, converting them into models suitable for automatic audio labeling.
Select a sub-dataset mi (i > 1) with its corresponding ni secondary labels, transfer the image recognition network model to the data set mi, take the Mel spectrogram as the model input, and train and fine-tune the model to obtain training parameters on the mi data set; the fine-tuned model can then predict the corresponding ni secondary labels for the other sub-datasets mj (j ≠ i). In this way a more balanced labeled data set is obtained, in which each audio file carries at least m labels.
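A hedged sketch of this migration step using torchvision; the choice of ResNet-18 and the exact freezing scheme are assumptions, since the patent names VGG and ResNet but fixes neither the depth nor the layers to retrain.

```python
import torch.nn as nn
from torchvision import models

def build_subset_model(n_secondary_labels):
    # start from ImageNet-pretrained weights, as learned in image recognition
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    for param in model.parameters():
        param.requires_grad = False  # freeze the pretrained backbone
    # replace the final fully connected layer with a head sized to this
    # sub-dataset's n_i secondary labels; only this new layer is trained
    model.fc = nn.Linear(model.fc.in_features, n_secondary_labels)
    return model

# A 1-channel Mel spectrogram can be repeated to 3 channels, e.g. with
# tensor.repeat(1, 3, 1, 1), to match the network's expected input format.
```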
Step 3: audio automatic labeling
Perform multi-label classification on the expanded labeled audio data set. Considering that audio signals have different characteristics in the time and frequency domains, take the time-domain waveform and the frequency-domain spectrogram as inputs respectively and construct automatic labeling classifiers for knowledge learning: an LSTM learns the temporal features of the audio, a ResNet learns the frequency-domain features of the spectrogram, and the outputs of the two models are then fused at the decision level to produce the final automatic labeling classifier, as shown in formula (1):

$$P = \sum_{i=1}^{n} \mathrm{weight}_i \cdot P_i \qquad (1)$$

where n is the number of models, weight_i is the weight of model i, and P_i is the prediction probability of model i.
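Formula (1) expressed as a short sketch; the equal default weights are an assumption, since the patent does not say how weight_i is chosen.

```python
import numpy as np

def decision_level_fusion(probabilities, weights=None):
    """probabilities: list of n arrays of shape (num_tags,), the P_i terms.
    weights: list of n scalars, the weight_i terms (defaults to equal weights)."""
    if weights is None:
        weights = [1.0 / len(probabilities)] * len(probabilities)
    return sum(w * np.asarray(p) for w, p in zip(weights, probabilities))
```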
The beneficial effects of the invention are as follows:
1) The audio data is converted to the frequency domain through a short-time Fourier transform, yielding a frequency-domain representation of the audio.
2) Primary labels are constructed from the label distribution of the original data set, each corresponding to several secondary labels, and the data set is divided into several sub-datasets according to the primary labels.
3) Network models that perform well in image recognition are migrated to the automatic audio labeling task, fine-tuned, and optimized, improving the accuracy of audio classification under a single primary label.
4) The model from 3) is applied to classify each audio subset and to label the data missing a primary label, expanding the audio labeling data set and improving the diversity and balance of the labels.
5) Multi-label classification is performed on the expanded data set: since the audio signal carries useful information in both the time and frequency domains, the time-domain signal and the spectrogram are taken as inputs respectively, automatic labeling classifiers are constructed, their outputs are fused at the decision level, and the model is trained and tested against the ground-truth labeled data set, improving both the utilization of the audio information and the accuracy of automatic audio labeling.
Drawings
FIG. 1 is a flowchart of an audio automatic labeling method based on transfer learning according to the present invention.
FIG. 2 is a schematic diagram of an automatic audio annotation process according to the present invention.
Detailed Description
The present invention is further illustrated below with reference to the drawings and a specific embodiment. It should be understood that the embodiment is intended only to illustrate the invention and not to limit its scope; after reading the present specification, equivalent modifications made by those skilled in the art fall within the scope defined by the appended claims.
Referring to FIG. 1, the algorithm flow of an embodiment of the present invention includes the following steps:
Step 1, audio preprocessing: perform data cleaning to remove invalid data; resample the MP3 audio data at a sampling rate of 11025 Hz, and convert the audio signal into a spectrogram using a short-time Fourier transform to obtain its frequency-domain representation; merge synonymous labels in the data set and sort the labels from high to low by frequency of occurrence.
Step2, dividing the data set: and checking the distribution condition of the labels of the data set, selecting the label with the frequency of 50 before as a secondary label, constructing a primary label (the information of the genre, the emotion, the musical instrument and the singer), and respectively corresponding to one part of the 50 secondary labels, so as to divide the data set into 4 sub-data sets (a genre data set, an emotion data set, a musical instrument data set and a singer information data set).
Step3, training a transfer learning model: the Resnet with better effect in the image field is migrated to the audio classification task, model fine-tuning is respectively carried out on the obtained 4 sub-data sets (network weights except for the full connection layer are frozen, image features are extracted, and full connection layer parameters are trained based on different sub-data sets), so that the accuracy of audio classification is improved.
Step 4, expanding the label data set: building on step 3, apply the models fine-tuned on the different sub-datasets to predict and add labels for the data sets that do not contain the corresponding primary label. For example, the model fine-tuned on the genre data set is applied to predict labels for the data that carries no genre label. Repeat this operation until every audio clip carries at least 4 labels, thereby expanding the labeled data set and improving its diversity and balance.
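A sketch of this expansion loop under assumed data structures; `predict_secondary` stands in for running the fine-tuned model on a clip and is hypothetical.

```python
def expand_labels(dataset, fine_tuned_models, predict_secondary):
    """dataset: dict audio_id -> {primary_label: [secondary labels]}.
    fine_tuned_models: dict primary_label -> model fine-tuned on that subset.
    predict_secondary: callable (model, audio_id) -> [secondary labels]."""
    for audio_id, tags in dataset.items():
        for primary, model in fine_tuned_models.items():
            if primary not in tags:  # this clip is missing the primary label
                tags[primary] = predict_secondary(model, audio_id)
    return dataset  # every clip now carries at least one label per primary
```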
Step 5, automatic audio labeling: construct the automatic audio labeling model, using the time-domain waveform and the spectrogram as the representations of the audio signal in the time and frequency domains respectively. Feed the audio signal into an LSTM network to learn the temporal features of the audio and output a probability distribution over the 50 labels; feed the spectrogram into a ResNet network to extract deep features and output a probability distribution over the 50 labels. Then fuse the outputs of the LSTM and the ResNet at the decision level, as in formula (2):

$$P = \sum_{i=1}^{2} \mathrm{weight}_i \cdot P_i \qquad (2)$$

where P_1 and P_2 are the prediction probabilities of the LSTM and the ResNet respectively.
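One possible two-branch layout for this step, written as a PyTorch sketch; the hidden sizes, frame length, and equal fusion weights are assumptions, and only the LSTM/ResNet pairing, the 50-label heads, and the weighted fusion come from the text.

```python
import torch
import torch.nn as nn
from torchvision import models

class TimeBranch(nn.Module):
    """LSTM over framed waveform chunks, learning temporal features."""
    def __init__(self, frame_len=512, hidden=128, n_tags=50):
        super().__init__()
        self.lstm = nn.LSTM(frame_len, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_tags)

    def forward(self, frames):                  # frames: (B, T, frame_len)
        _, (h, _) = self.lstm(frames)
        return torch.sigmoid(self.head(h[-1]))  # (B, n_tags) probabilities

class FreqBranch(nn.Module):
    """ResNet over the spectrogram, learning frequency-domain features."""
    def __init__(self, n_tags=50):
        super().__init__()
        self.resnet = models.resnet18(num_classes=n_tags)

    def forward(self, spec):                    # spec: (B, 3, H, W)
        return torch.sigmoid(self.resnet(spec))

def fuse(p_time, p_freq, w_time=0.5, w_freq=0.5):
    return w_time * p_time + w_freq * p_freq    # formula (2)
```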
the automatic labeling model is trained by taking the expanded label data set as a target to generate an audio automatic labeling model, the model not only improves the utilization rate of audio information, but also improves the accuracy rate of audio automatic labeling, and the model can be used for adding a plurality of labels to unknown labeled songs.

Claims (5)

1. An automatic audio labeling method based on transfer learning, mainly comprising the following steps:
Step 1: data preprocessing: clean the data set, and convert the audio information to the frequency domain through a short-time Fourier transform to obtain a spectrogram as the frequency-domain representation of the audio signal; screen primary labels according to the label distribution of the data set, and divide the data set into several primary-label data sets (such as a genre data set, a mood data set, and an instrument data set);
Step 2: transfer learning: migrate image recognition network models to the m sub-datasets obtained in step 1 for training and fine-tuning respectively, construct an automatic audio labeling classifier under each primary label, and expand the audio labeling data set;
Step 3: automatic audio labeling: build a model for knowledge learning on the data set expanded in step 2, and train an automatic audio labeling classifier.
2. The automatic audio labeling method based on transfer learning of claim 1, wherein the data set is divided in step 1 by constructing m primary labels according to the label distribution of the original data set, each primary label corresponding to several secondary labels, the secondary labels being the original labels of the data set; the original multi-label data set is thereby divided into sub-datasets corresponding to the several primary labels.
3. The automatic audio labeling method based on transfer learning of claim 1, wherein the selection and training of the transfer learning model in step 2 mainly uses a network model that performs well in the image field and migrates it to the audio classification task for fine-tuning, so as to improve the accuracy of audio classification; given the improvement the model brings to the task, an audio classifier is constructed on this basis, further improving the audio classification effect.
4. The automatic audio labeling method based on transfer learning of claim 1, wherein the original labels of the data set are expanded in step 2 by selecting a sub-dataset and training and testing the audio classifier of step 2 on it; once the classification accuracy is sufficiently high, the classifier is considered able to predict labels of this type for the other data sets, which are labeled accordingly; the original labeled data set is thereby expanded into one in which each audio clip carries at least m labels, improving the diversity and balance of the labeled data set.
5. The automatic audio labeling method based on transfer learning of claim 1, wherein the automatic audio labeling of step 3 performs multi-label classification on the data set expanded in step 2 (each audio clip carries at least m labels); an audio label classifier is constructed with the spectrogram as input and the m labels as targets, learning the frequency-domain features of the audio; because the original signal contains rich time-domain information, the time-domain waveform is taken as the input of another classifier to learn the temporal features of the audio signal; finally, the outputs (probability distributions) of the two classifiers are fused at the decision level, as shown in formula (1):

$$P = \sum_{i=1}^{n} \mathrm{weight}_i \cdot P_i \qquad (1)$$

where n is the number of models, weight_i is the weight of model i, and P_i is the prediction probability of model i; the accuracy of the automatic audio labeling model is thereby greatly improved and the audio information is fully utilized.
CN202110712420.2A 2021-06-25 2021-06-25 Audio automatic labeling method based on transfer learning Active CN113506553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110712420.2A CN113506553B (en) 2021-06-25 2021-06-25 Audio automatic labeling method based on transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110712420.2A CN113506553B (en) 2021-06-25 2021-06-25 Audio automatic labeling method based on transfer learning

Publications (2)

Publication Number Publication Date
CN113506553A (en) 2021-10-15
CN113506553B (en) 2023-12-05

Family

ID=78010615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110712420.2A Active CN113506553B (en) 2021-06-25 2021-06-25 Audio automatic labeling method based on transfer learning

Country Status (1)

Country Link
CN (1) CN113506553B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170011306A1 (en) * 2015-07-06 2017-01-12 Microsoft Technology Licensing, Llc Transfer Learning Techniques for Disparate Label Sets
CN109918535A (en) * 2019-01-18 2019-06-21 华南理工大学 Music automatic marking method based on label depth analysis
CN111462774A (en) * 2020-03-19 2020-07-28 河海大学 Music emotion credible classification method based on deep learning
CN112199548A (en) * 2020-09-28 2021-01-08 华南理工大学 Music audio classification method based on convolution cyclic neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jordi Pons et al., "End-to-End Learning for Music Audio Tagging at Scale", arXiv:1711.02520v4, pp. 1-8 *
Yu Chao (于超), "Research on Transfer Learning Methods in Music Emotion Recognition", Modern Computer (Professional Edition), no. 6, pp. 3-6 *
Feng Chuyi (冯楚祎), "Research on Automatic Music Annotation Methods Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology, no. 09 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024021882A1 (en) * 2022-07-28 2024-02-01 腾讯科技(深圳)有限公司 Audio data processing method and apparatus, and computer device and storage medium
CN117476036A (en) * 2023-12-27 2024-01-30 广州声博士声学技术有限公司 Environmental noise identification method, system, equipment and medium
CN117476036B (en) * 2023-12-27 2024-04-09 广州声博士声学技术有限公司 Environmental noise identification method, system, equipment and medium

Also Published As

Publication number Publication date
CN113506553B (en) 2023-12-05

Similar Documents

Publication Publication Date Title
Nam et al. Deep learning for audio-based music classification and tagging: Teaching computers to distinguish rock from bach
Aucouturier et al. "The way it Sounds": timbre models for analysis and retrieval of music signals
CN106295717B (en) A kind of western musical instrument classification method based on rarefaction representation and machine learning
CN112435642B (en) Melody MIDI accompaniment generation method based on deep neural network
CN113506553B (en) Audio automatic labeling method based on transfer learning
Kiela et al. Learning neural audio embeddings for grounding semantics in auditory perception
Kolozali et al. Automatic ontology generation for musical instruments based on audio analysis
CN113813609B (en) Game music style classification method and device, readable medium and electronic equipment
Gan Music feature classification based on recurrent neural networks with channel attention mechanism
Krause et al. Classifying Leitmotifs in Recordings of Operas by Richard Wagner.
Mounika et al. Music genre classification using deep learning
Armentano et al. Genre classification of symbolic pieces of music
Li et al. Music genre classification based on fusing audio and lyric information
Kai Automatic recommendation algorithm for video background music based on deep learning
Shaikh et al. Music genre classification using neural network
Bhat et al. Deep learning approach to joint identification of instrument pitch and raga for Indian classical music
Kolozali et al. A framework for automatic ontology generation based on semantic audio analysis
Fuhrmann et al. Quantifying the Relevance of Locally Extracted Information for Musical Instrument Recognition from Entire Pieces of Music.
Miao et al. Construction of multimodal music automatic annotation model based on neural network algorithm
Xue et al. Research on the Filtering and Classification Method of Interactive Music Education Resources Based on Neural Network
Fu et al. Improve symbolic music pre-training model using MusicTransformer structure
Tang et al. Construction of Music Classification and Detection Model Based on Big Data Analysis and Genetic Algorithm
MOLGORA Musical instrument recognition: a transfer learning approach
Joseph Fernandez Comparison of Deep Learning and Machine Learning in Music Genre Categorization
CN114817620A (en) Song comparison method and device, equipment, medium and product thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant