CN111326164A - Semi-supervised music melody extraction method - Google Patents

Semi-supervised music melody extraction method

Info

Publication number
CN111326164A
CN111326164A
Authority
CN
China
Prior art keywords
pitch
learning machine
melody
extreme learning
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010071808.4A
Other languages
Chinese (zh)
Other versions
CN111326164B (en)
Inventor
张维维
毕胜
房少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University
Priority to CN202010071808.4A
Publication of CN111326164A
Application granted
Publication of CN111326164B
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - using predictive techniques
    • G10L19/26 - Pre-filtering or post-filtering
    • G10L19/265 - Pre-filtering, e.g. high frequency emphasis prior to encoding
    • G10L19/02 - using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 - using orthogonal transformation
    • G10L19/032 - Quantisation or dequantisation of spectral components
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - characterised by the analysis technique
    • G10L25/30 - using neural networks
    • G10L25/90 - Pitch determination of speech signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The invention discloses a semi-supervised music main melody extraction method, which comprises the following steps: normalizing, resampling and filtering an input audio signal to obtain an audio signal that simulates the auditory characteristics of the human ear; performing a constant-Q transform on the audio signal to obtain a variable-resolution spectrum whose frequencies are logarithmically distributed, aggregating the magnitude spectra of adjacent frames to obtain a feature vector, constructing the input vector set of an extreme learning machine from the feature vectors, and obtaining the output vector set from the training set; training the parameters of the extreme learning machine and using the extreme learning machine network to obtain a coarse estimate of the melody pitch; and searching for a spectral peak within a 2/3-semitone range around the coarse melody pitch estimate of each frame, outputting the frequency corresponding to that peak as the melody pitch of the frame, thereby fine-tuning the melody pitch.

Description

Semi-supervised music melody extraction method
Technical Field
The invention belongs to the field of audio signal processing, and particularly relates to a semi-supervised music melody extraction method.
Background
The features adopted by existing main melody extraction methods fall into two categories. The first is low-level hand-crafted features, which are designed by researchers according to signal characteristics; their performance degrades significantly and they generalize poorly when processing complex music with multiple sound sources. The second is high-level self-learned features, which an algorithm learns from a data set; their performance depends heavily on the capacity and diversity of the training set. Recently, deep learning has provided a new solution for music melody extraction, but training suitable network parameters takes a long time, and the capacity and diversity of the training set remain the main factors limiting performance. Moreover, annotated data sets for main melody extraction are still scarce, which in turn limits the performance of such methods.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a semi-supervised music main melody extraction method, which specifically comprises the following steps:
normalizing, resampling and filtering an input audio signal to obtain an audio signal that simulates the auditory characteristics of the human ear;
performing a constant-Q transform on the audio signal to obtain a variable-resolution spectrum whose frequencies are logarithmically distributed, aggregating the magnitude spectra of adjacent frames to obtain a feature vector, constructing the input vector set of an extreme learning machine from the feature vectors, and generating the output vector set from the training set annotations;
training the parameters of the extreme learning machine and using the extreme learning machine network to obtain a coarse estimate of the melody pitch: if the first element of the output vector is the maximum, the frame is judged to be a melody-free frame; if some element other than the first is the maximum, the corresponding pitch is taken as the coarse melody pitch estimate of the frame;
and searching for a spectral peak within a 2/3-semitone range around the coarse melody pitch estimate of each frame and outputting the frequency corresponding to that peak as the melody pitch of the frame, thereby fine-tuning the melody pitch.
The input vector set and the output vector set of the extreme learning machine are constructed as follows:
the input vector set of the extreme learning machine is constructed from the feature vectors; the annotation corresponding to each feature vector is read, and for a melody-free frame a one-hot output vector whose first element is 1 and whose other elements are 0 is generated, while for a melody frame a one-hot output vector whose element corresponding to the quantized pitch is 1 and whose other elements are 0 is generated; for the unlabeled data set, every element of the output vector is 0, yielding the output vector set corresponding to the input vector set.
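The patent text does not give a concrete encoding, so the following is only a minimal illustrative sketch of how such one-hot output vectors could be built; the pitch range, the quantization grid, and the names make_target, n_pitch_bins, fmin_hz and bins_per_semitone are assumptions, not part of the invention.

```python
import numpy as np

def make_target(label_hz, n_pitch_bins=600, fmin_hz=55.0, bins_per_semitone=10):
    """Build a one-hot target vector for one frame.

    label_hz : annotated melody pitch in Hz, 0 for a melody-free frame,
               or None for an unlabeled frame (semi-supervised case).
    Element 0 marks "no melody"; elements 1..n_pitch_bins mark quantized pitches.
    """
    t = np.zeros(n_pitch_bins + 1)
    if label_hz is None:          # unlabeled frame: all-zero target
        return t
    if label_hz <= 0:             # melody-free frame
        t[0] = 1.0
        return t
    # quantize the pitch on a logarithmic grid above fmin_hz
    bin_idx = int(round(12 * bins_per_semitone * np.log2(label_hz / fmin_hz)))
    bin_idx = min(max(bin_idx, 0), n_pitch_bins - 1)
    t[1 + bin_idx] = 1.0
    return t
```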
With this technical scheme, the invention provides a semi-supervised music main melody extraction method that can learn from unlabeled data, efficiently learns high-level melody feature representations, generalizes well, and also performs well on unseen data sets.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments described in the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flowchart of the semi-supervised music main melody extraction method of the present invention;
FIG. 2 is a flowchart of testing the extreme learning machine in the present invention;
FIG. 3 is a time domain waveform diagram in accordance with an embodiment of the present invention;
FIG. 4 is a CQT spectrum in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a prediction matrix of a semi-supervised extreme learning machine in an embodiment of the present invention;
FIG. 6 is a diagram illustrating the estimated value and the true value of the audio segment according to the present invention.
Detailed Description
In order to make the technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention.
As shown in FIG. 1, the semi-supervised music main melody extraction method specifically comprises the following steps:
S1: preprocess the input audio signal, where the preprocessing comprises normalization, resampling and filtering, to obtain an audio signal that simulates the auditory characteristics of the human ear. The detailed process comprises the following steps:
S11, normalize the audio signal so that its mean is 0 and its variance is 1;
S12, since the amplitudes of the higher harmonics in a music signal are relatively small and have little influence on the signal processing result, resample the input audio signal to 16 kHz, i.e., discard frequency components above 8 kHz;
S13, filter the audio signal with an equal-loudness filter to enhance the audio components to which the human auditory system is relatively sensitive and attenuate those to which it is insensitive, obtaining an audio signal that simulates the auditory characteristics of the human ear.
S2: carrying out constant Q spectrum transformation on an audio signal on the basis of preprocessing to obtain variable resolution spectrum signals with frequencies distributed according to logarithms, wherein the variable resolution spectrum signals comprise amplitude spectrums and phase spectrums; reading the mark value corresponding to each feature vector, generating a one-bit effective output vector with the first element of 1 and other elements of 0 in the non-rotation section, generating a one-bit effective output vector with the pitch corresponding to each quantized element of 1 and other elements of 0 in the rotation section, and for the unmarked data set, outputting each element of the vector of 0, thereby obtaining an output vector set corresponding to the input vector set. The method comprises the following specific steps:
s21, in order to obtain the segmented spectrum characteristics of the audio signal, Constant Q Transformation (CQT) is carried out on the audio signal to obtain a variable resolution frequency spectrum with the frequency meeting the characteristics of logarithmic distribution, and the frequency resolution of a lower frequency segment and the time resolution of a higher frequency segment are improved;
s22, solving a CQT spectrum to obtain a modulus value, obtaining a CQT amplitude spectrum, abandoning a phase spectrum, aggregating CQT amplitude spectra of adjacent frames, taking the current frame as a center, respectively connecting 3 CQT amplitude spectra in series in front and at the back to obtain a long vector formed by connecting the CQT amplitude spectra of the front and at the back 7 frames in series, and taking the long vector as a feature vector of the current frame;
and S23, quantizing melody pitches in the marked data set, generating a one-bit effective output vector with 1 first element and 0 other values in a no-rotation section, generating a one-bit effective output vector with 1 corresponding to each quantized pitch and 0 other elements in a rotation section, and setting 0 elements in the output vector corresponding to the no-marked data set.
S3: carrying out parameter training on the extreme learning machine, and carrying out melody pitch rough estimation by utilizing an extreme learning machine network; if the first element of the output vector is the maximum value, the frame is judged as a rotation-law-free frame, and if the other elements except the first element are the maximum values, the corresponding pitch is used as a coarse pitch estimation result of the frame melody;
the training process of the extreme learning machine comprises the following steps:
s31, random initialization is carried out on ELM input weight, and input weight parameters are uniformly distributed in a range of [ -1,1 ];
s32, performing random initialization on ELM input bias, wherein input bias parameters are uniformly distributed in a range of [ -1,1 ];
and S33, calculating an ELM hidden layer output matrix.
S4: search for a spectral peak within a 2/3-semitone range around the coarse melody pitch estimate of each frame and output the frequency corresponding to that peak as the melody pitch of the frame, thereby fine-tuning the melody pitch.
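As one possible reading of S4 (a 2/3-semitone window centred on the coarse estimate, i.e. ±1/3 semitone), the sketch below searches a frame's CQT magnitude spectrum for the largest bin inside that window and returns its frequency; the patent does not state how the window is positioned, so this centring and the helper name refine_pitch are assumptions.

```python
import numpy as np

def refine_pitch(cqt_mag_frame, coarse_hz, fmin=55.0, bins_per_octave=60):
    """Pick the strongest CQT bin within +/- 1/3 semitone of the coarse estimate."""
    n_bins = len(cqt_mag_frame)
    freqs = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)     # bin centre frequencies
    lo, hi = coarse_hz * 2 ** (-1 / 36), coarse_hz * 2 ** (1 / 36)  # +/- 1/3 semitone
    idx = np.where((freqs >= lo) & (freqs <= hi))[0]
    if idx.size == 0:
        return coarse_hz                      # fall back to the coarse estimate
    return freqs[idx[np.argmax(cqt_mag_frame[idx])]]
```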
Example:
A piece of music audio with a duration of 8.2 seconds is taken, whose time-domain waveform is shown in FIG. 3. CQT spectral analysis is performed on the audio, and its CQT magnitude spectrum is shown in FIG. 4. Melody pitch prediction is performed on the audio signal using the trained semi-supervised extreme learning machine network parameters, and the resulting prediction matrix is shown in FIG. 5. The maximum of each row of the prediction matrix is taken to obtain the coarse melody pitch estimate, and a spectral peak search is then carried out within the 2/3-semitone range around each frame's coarse estimate to obtain the final melody pitch estimate. The estimated and true values for this piece of audio are shown in FIG. 6. This embodiment shows that the method outputs accurate melody pitch estimates.
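Tying the hypothetical helpers sketched above together, a test-time pipeline for a clip like the one in this embodiment could look roughly as follows. All names (preprocess, cqt_frame_features, coarse_melody_pitch, refine_pitch) are the illustrative assumptions introduced earlier, not the patent's own API, and W, b, beta are assumed to come from a prior train_elm call on labeled data.

```python
import numpy as np
import librosa
import soundfile as sf                     # assumed audio I/O; any loader would do

x, sr = sf.read("clip.wav")                # hypothetical test clip
if x.ndim > 1:
    x = x.mean(axis=1)                     # mix down to mono
x = preprocess(x, sr)                      # S1: normalize, resample, filter
X = cqt_frame_features(x)                  # S2: 7-frame CQT magnitude features
bins = coarse_melody_pitch(X, W, b, beta)  # S3: trained ELM, row-wise argmax

# per-frame CQT magnitudes again, for the S4 peak search
C = np.abs(librosa.cqt(x, sr=16000, hop_length=160, fmin=55.0,
                       n_bins=360, bins_per_octave=60)).T

fmin, bps = 55.0, 10                       # must match the label quantization grid
pitch_hz = np.zeros(len(bins))
for i, k in enumerate(bins):
    if k > 0:                              # k == 0 means melody-free frame
        coarse_hz = fmin * 2 ** ((k - 1) / (12 * bps))
        pitch_hz[i] = refine_pitch(C[i], coarse_hz, fmin=55.0, bins_per_octave=60)
```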
The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any modification or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention, according to the technical solution and inventive concept of the present invention, shall fall within the scope of protection of the present invention.

Claims (2)

1. A semi-supervised music main melody extraction method, characterized by comprising the following steps:
normalizing, resampling and filtering an input audio signal to obtain an audio signal that simulates the auditory characteristics of the human ear;
performing a constant-Q transform on the audio signal to obtain a variable-resolution spectrum whose frequencies are logarithmically distributed, aggregating the magnitude spectra of adjacent frames to obtain a feature vector, constructing the input vector set of an extreme learning machine from the feature vectors, and generating the output vector set from the training set annotations;
training the parameters of the extreme learning machine and using the extreme learning machine network to obtain a coarse estimate of the melody pitch, wherein if the first element of the output vector is the maximum the frame is judged to be a melody-free frame, and if some element other than the first is the maximum the corresponding pitch is taken as the coarse melody pitch estimate of the frame;
and searching for a spectral peak within a 2/3-semitone range around the coarse melody pitch estimate of each frame and outputting the frequency corresponding to that peak as the melody pitch of the frame, thereby fine-tuning the melody pitch.
2. The method of claim 1, characterized in that the input vector set and the output vector set of the extreme learning machine are constructed as follows:
the input vector set of the extreme learning machine is constructed from the feature vectors; the annotation corresponding to each feature vector is read, and for a melody-free frame a one-hot output vector whose first element is 1 and whose other elements are 0 is generated, while for a melody frame a one-hot output vector whose element corresponding to the quantized pitch is 1 and whose other elements are 0 is generated; for the unlabeled data set, every element of the output vector is 0, yielding the output vector set corresponding to the input vector set.
CN202010071808.4A 2020-01-21 2020-01-21 Semi-supervised music main melody extraction method Active CN111326164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010071808.4A CN111326164B (en) 2020-01-21 2020-01-21 Semi-supervised music main melody extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010071808.4A CN111326164B (en) 2020-01-21 2020-01-21 Semi-supervised music main melody extraction method

Publications (2)

Publication Number Publication Date
CN111326164A true CN111326164A (en) 2020-06-23
CN111326164B CN111326164B (en) 2023-03-21

Family

ID=71171018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010071808.4A Active CN111326164B (en) 2020-01-21 2020-01-21 Semi-supervised music main melody extraction method

Country Status (1)

Country Link
CN (1) CN111326164B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008281776A (en) * 2007-05-10 2008-11-20 Mitsubishi Electric Corp Melody extraction device and melody extraction method
CN105957538A (en) * 2016-05-09 2016-09-21 大连民族大学 Polyphony melody extraction method based on significance
CN107133643A (en) * 2017-04-29 2017-09-05 天津大学 Note signal sorting technique based on multiple features fusion and feature selecting
CN108647596A (en) * 2018-04-27 2018-10-12 大连民族大学 Music Melody extraction method
CN110223712A (en) * 2019-06-05 2019-09-10 西安交通大学 A kind of music emotion recognition method based on two-way convolution loop sparse network
CN110600055A (en) * 2019-08-15 2019-12-20 杭州电子科技大学 Singing voice separation method using melody extraction and voice synthesis technology

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
张维维 et al.: "Music main melody extraction combining an improved Euclidean algorithm and dynamic programming" *
赵亮: "An electronic music classification model based on an improved extreme learning machine" *
陈晓鸥; 杨德顺: "Research progress in music emotion recognition" *
黄清宝; 蒋成龙; 林小峰; 徐辰华; 唐鹏; 张梦桥: "Extreme learning machine network optimization based on the harmony search algorithm" *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113781988A (en) * 2021-07-30 2021-12-10 北京达佳互联信息技术有限公司 Subtitle display method, subtitle display device, electronic equipment and computer-readable storage medium

Also Published As

Publication number Publication date
CN111326164B (en) 2023-03-21

Similar Documents

Publication Publication Date Title
CN110136731B (en) Cavity causal convolution generation confrontation network end-to-end bone conduction voice blind enhancement method
CN110019931B (en) Audio classification method and device, intelligent equipment and storage medium
CN101599271B (en) Recognition method of digital music emotion
Wang et al. Specaugment++: A hidden space data augmentation method for acoustic scene classification
CN109378010A (en) Neural network model training method, voice denoising method and device
CN111369982A (en) Training method of audio classification model, audio classification method, device and equipment
CN109100814B (en) A kind of audio magnetotelluric method signal antinoise method based on noise classification
CN105788592A (en) Audio classification method and apparatus thereof
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN115545086B (en) Migratable feature automatic selection acoustic diagnosis method and system
Li et al. Sams-net: A sliced attention-based neural network for music source separation
CN112885375A (en) Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network
CN112086100A (en) Quantization error entropy based urban noise identification method of multilayer random neural network
CN111326164B (en) Semi-supervised music theme extraction method
Li et al. A si-sdr loss function based monaural source separation
Shifas et al. A non-causal FFTNet architecture for speech enhancement
CN117409761B (en) Method, device, equipment and storage medium for synthesizing voice based on frequency modulation
US20230136220A1 (en) Quantifying Signal Purity by means of Machine Learning
CN114302301A (en) Frequency response correction method and related product
CN104318931A (en) Emotional activity obtaining method and apparatus of audio file, and classification method and apparatus of audio file
Moliner et al. Realistic gramophone noise synthesis using a diffusion model
Küçükbay et al. Hand-crafted versus learned representations for audio event detection
Jiang et al. Music signal recognition based on the mathematical and physical equation inversion method
Yao et al. [Retracted] Research on Segmentation Experience of Music Signal Improved Based on Maximization of Negative Entropy
Ye et al. Music style transfer with vocals based on CycleGAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant