CN111326164A - Semi-supervised music melody extraction method - Google Patents

Semi-supervised music melody extraction method

Info

Publication number
CN111326164A
CN111326164A
Authority
CN
China
Prior art keywords
pitch
learning machine
melody
extreme learning
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010071808.4A
Other languages
Chinese (zh)
Other versions
CN111326164B (en)
Inventor
张维维
毕胜
房少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University
Priority to CN202010071808.4A
Publication of CN111326164A
Application granted
Publication of CN111326164B
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - using predictive techniques
    • G10L19/26 - Pre-filtering or post-filtering
    • G10L19/265 - Pre-filtering, e.g. high frequency emphasis prior to encoding
    • G10L19/02 - using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 - using orthogonal transformation
    • G10L19/032 - Quantisation or dequantisation of spectral components
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - characterised by the analysis technique
    • G10L25/30 - using neural networks
    • G10L25/90 - Pitch determination of speech signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The invention discloses a semi-supervised music main melody extraction method, which comprises the following steps: normalizing, resampling and filtering an input audio signal to obtain an audio signal that simulates the auditory characteristics of the human ear; performing a constant-Q transform on the audio signal to obtain a variable-resolution spectrum whose frequencies are logarithmically distributed, aggregating the magnitude spectra of adjacent frames to obtain a feature vector, constructing the input vector set of an extreme learning machine from the feature vectors, and obtaining the output vector set from the training set; training the parameters of the extreme learning machine and using the extreme learning machine network to obtain a coarse estimate of the melody pitch; and searching for a spectral peak within a 2/3-semitone range around the coarse melody pitch estimate of each frame, outputting the frequency corresponding to that peak as the melody pitch of the frame, thereby fine-tuning the melody pitch.

Description

Semi-supervised music melody extraction method
Technical Field
The invention belongs to the field of audio signal processing, and particularly relates to a semi-supervised music melody extraction method.
Background
The features adopted by existing main melody extraction methods fall into two categories. The first is low-level hand-crafted features, which are designed by researchers according to signal characteristics; their performance degrades significantly and they generalize poorly when processing complex music with multiple sound sources. The second is high-level self-learned features, which an algorithm learns from a data set; their performance depends heavily on the capacity and diversity of the training set. Recently, deep learning has provided a new solution for music melody extraction, but training suitable network parameters takes a long time, and the capacity and diversity of the training set remain the main factors limiting performance. Moreover, annotated data sets for main melody extraction are still scarce, which in turn limits the performance of such methods.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a semi-supervised music main melody extraction method, which specifically comprises the following steps:
normalizing, resampling and filtering an input audio signal to obtain an audio signal that simulates the auditory characteristics of the human ear;
performing a constant-Q transform on the audio signal to obtain a variable-resolution spectrum whose frequencies are logarithmically distributed, aggregating the magnitude spectra of adjacent frames to obtain a feature vector, constructing the input vector set of an extreme learning machine from the feature vectors, and generating the output vector set from the training set annotations;
training the parameters of the extreme learning machine and using the extreme learning machine network to obtain a coarse estimate of the melody pitch: if the first element of the output vector is the maximum, the frame is judged to be a melody-free frame; if some element other than the first is the maximum, the corresponding pitch is taken as the coarse melody pitch estimate of the frame;
and searching for a spectral peak within a 2/3-semitone range around the coarse melody pitch estimate of each frame and outputting the frequency corresponding to that peak as the melody pitch of the frame, thereby fine-tuning the melody pitch.
The input vector set and the output vector set of the extreme learning machine are constructed as follows:
the input vector set of the extreme learning machine is constructed from the feature vectors; the annotation corresponding to each feature vector is read, and for a melody-free frame a one-hot output vector whose first element is 1 and whose other elements are 0 is generated, while for a melody frame a one-hot output vector whose element corresponding to the quantized pitch is 1 and whose other elements are 0 is generated; for the unlabeled data set, every element of the output vector is 0, yielding the output vector set corresponding to the input vector set.
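The patent text does not give a concrete encoding, so the following is only a minimal illustrative sketch of how such one-hot output vectors could be built; the pitch range, the quantization grid, and the names make_target, n_pitch_bins, fmin_hz and bins_per_semitone are assumptions, not part of the invention.

```python
import numpy as np

def make_target(label_hz, n_pitch_bins=600, fmin_hz=55.0, bins_per_semitone=10):
    """Build a one-hot target vector for one frame.

    label_hz : annotated melody pitch in Hz, 0 for a melody-free frame,
               or None for an unlabeled frame (semi-supervised case).
    Element 0 marks "no melody"; elements 1..n_pitch_bins mark quantized pitches.
    """
    t = np.zeros(n_pitch_bins + 1)
    if label_hz is None:          # unlabeled frame: all-zero target
        return t
    if label_hz <= 0:             # melody-free frame
        t[0] = 1.0
        return t
    # quantize the pitch on a logarithmic grid above fmin_hz
    bin_idx = int(round(12 * bins_per_semitone * np.log2(label_hz / fmin_hz)))
    bin_idx = min(max(bin_idx, 0), n_pitch_bins - 1)
    t[1 + bin_idx] = 1.0
    return t
```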
With this technical scheme, the invention provides a semi-supervised music main melody extraction method that can learn from unlabeled data, efficiently learns high-level melody feature representations, generalizes well, and also performs well on unseen data sets.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments described in the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flowchart of the semi-supervised music main melody extraction method of the present invention;
FIG. 2 is a flowchart of testing the extreme learning machine in the present invention;
FIG. 3 is a time domain waveform diagram in accordance with an embodiment of the present invention;
FIG. 4 is a CQT spectrum in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a prediction matrix of a semi-supervised extreme learning machine in an embodiment of the present invention;
FIG. 6 is a diagram illustrating the estimated value and the true value of the audio segment according to the present invention.
Detailed Description
In order to make the technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention.
As shown in FIG. 1, the semi-supervised music main melody extraction method specifically comprises the following steps:
S1: preprocess the input audio signal, where the preprocessing comprises normalization, resampling and filtering, to obtain an audio signal that simulates the auditory characteristics of the human ear. The detailed process comprises the following steps:
S11, normalize the audio signal so that its mean is 0 and its variance is 1;
S12, since the amplitudes of the higher harmonics in a music signal are relatively small and have little influence on the signal processing result, resample the input audio signal to 16 kHz, i.e., discard frequency components above 8 kHz;
S13, filter the audio signal with an equal-loudness filter to enhance the audio components to which the human auditory system is relatively sensitive and attenuate those to which it is insensitive, obtaining an audio signal that simulates the auditory characteristics of the human ear.
S2: carrying out constant Q spectrum transformation on an audio signal on the basis of preprocessing to obtain variable resolution spectrum signals with frequencies distributed according to logarithms, wherein the variable resolution spectrum signals comprise amplitude spectrums and phase spectrums; reading the mark value corresponding to each feature vector, generating a one-bit effective output vector with the first element of 1 and other elements of 0 in the non-rotation section, generating a one-bit effective output vector with the pitch corresponding to each quantized element of 1 and other elements of 0 in the rotation section, and for the unmarked data set, outputting each element of the vector of 0, thereby obtaining an output vector set corresponding to the input vector set. The method comprises the following specific steps:
s21, in order to obtain the segmented spectrum characteristics of the audio signal, Constant Q Transformation (CQT) is carried out on the audio signal to obtain a variable resolution frequency spectrum with the frequency meeting the characteristics of logarithmic distribution, and the frequency resolution of a lower frequency segment and the time resolution of a higher frequency segment are improved;
s22, solving a CQT spectrum to obtain a modulus value, obtaining a CQT amplitude spectrum, abandoning a phase spectrum, aggregating CQT amplitude spectra of adjacent frames, taking the current frame as a center, respectively connecting 3 CQT amplitude spectra in series in front and at the back to obtain a long vector formed by connecting the CQT amplitude spectra of the front and at the back 7 frames in series, and taking the long vector as a feature vector of the current frame;
and S23, quantizing melody pitches in the marked data set, generating a one-bit effective output vector with 1 first element and 0 other values in a no-rotation section, generating a one-bit effective output vector with 1 corresponding to each quantized pitch and 0 other elements in a rotation section, and setting 0 elements in the output vector corresponding to the no-marked data set.
S3: carrying out parameter training on the extreme learning machine, and carrying out melody pitch rough estimation by utilizing an extreme learning machine network; if the first element of the output vector is the maximum value, the frame is judged as a rotation-law-free frame, and if the other elements except the first element are the maximum values, the corresponding pitch is used as a coarse pitch estimation result of the frame melody;
the training process of the extreme learning machine comprises the following steps:
s31, random initialization is carried out on ELM input weight, and input weight parameters are uniformly distributed in a range of [ -1,1 ];
s32, performing random initialization on ELM input bias, wherein input bias parameters are uniformly distributed in a range of [ -1,1 ];
and S33, calculating an ELM hidden layer output matrix.
S4: search for a spectral peak within a 2/3-semitone range around the coarse melody pitch estimate of each frame and output the frequency corresponding to that peak as the melody pitch of the frame, thereby fine-tuning the melody pitch.
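As one possible reading of S4 (a 2/3-semitone window centred on the coarse estimate, i.e. ±1/3 semitone), the sketch below searches a frame's CQT magnitude spectrum for the largest bin inside that window and returns its frequency; the patent does not state how the window is positioned, so this centring and the helper name refine_pitch are assumptions.

```python
import numpy as np

def refine_pitch(cqt_mag_frame, coarse_hz, fmin=55.0, bins_per_octave=60):
    """Pick the strongest CQT bin within +/- 1/3 semitone of the coarse estimate."""
    n_bins = len(cqt_mag_frame)
    freqs = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)     # bin centre frequencies
    lo, hi = coarse_hz * 2 ** (-1 / 36), coarse_hz * 2 ** (1 / 36)  # +/- 1/3 semitone
    idx = np.where((freqs >= lo) & (freqs <= hi))[0]
    if idx.size == 0:
        return coarse_hz                      # fall back to the coarse estimate
    return freqs[idx[np.argmax(cqt_mag_frame[idx])]]
```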
Example:
A piece of music audio with a duration of 8.2 seconds is taken, whose time-domain waveform is shown in FIG. 3. CQT spectral analysis is performed on the audio, and its CQT magnitude spectrum is shown in FIG. 4. Melody pitch prediction is performed on the audio signal using the trained semi-supervised extreme learning machine network parameters, and the resulting prediction matrix is shown in FIG. 5. The maximum of each row of the prediction matrix is taken to obtain the coarse melody pitch estimate, and a spectral peak search is then carried out within the 2/3-semitone range around each frame's coarse estimate to obtain the final melody pitch estimate. The estimated and true values for this piece of audio are shown in FIG. 6. This embodiment shows that the method outputs accurate melody pitch estimates.
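Tying the hypothetical helpers sketched above together, a test-time pipeline for a clip like the one in this embodiment could look roughly as follows. All names (preprocess, cqt_frame_features, coarse_melody_pitch, refine_pitch) are the illustrative assumptions introduced earlier, not the patent's own API, and W, b, beta are assumed to come from a prior train_elm call on labeled data.

```python
import numpy as np
import librosa
import soundfile as sf                     # assumed audio I/O; any loader would do

x, sr = sf.read("clip.wav")                # hypothetical test clip
if x.ndim > 1:
    x = x.mean(axis=1)                     # mix down to mono
x = preprocess(x, sr)                      # S1: normalize, resample, filter
X = cqt_frame_features(x)                  # S2: 7-frame CQT magnitude features
bins = coarse_melody_pitch(X, W, b, beta)  # S3: trained ELM, row-wise argmax

# per-frame CQT magnitudes again, for the S4 peak search
C = np.abs(librosa.cqt(x, sr=16000, hop_length=160, fmin=55.0,
                       n_bins=360, bins_per_octave=60)).T

fmin, bps = 55.0, 10                       # must match the label quantization grid
pitch_hz = np.zeros(len(bins))
for i, k in enumerate(bins):
    if k > 0:                              # k == 0 means melody-free frame
        coarse_hz = fmin * 2 ** ((k - 1) / (12 * bps))
        pitch_hz[i] = refine_pitch(C[i], coarse_hz, fmin=55.0, bins_per_octave=60)
```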
The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any modification or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention, according to the technical solution and inventive concept of the present invention, shall fall within the scope of protection of the present invention.

Claims (2)

1. A semi-supervised music main melody extraction method, characterized by comprising the following steps:
normalizing, resampling and filtering an input audio signal to obtain an audio signal that simulates the auditory characteristics of the human ear;
performing a constant-Q transform on the audio signal to obtain a variable-resolution spectrum whose frequencies are logarithmically distributed, aggregating the magnitude spectra of adjacent frames to obtain a feature vector, constructing the input vector set of an extreme learning machine from the feature vectors, and generating the output vector set from the training set annotations;
training the parameters of the extreme learning machine and using the extreme learning machine network to obtain a coarse estimate of the melody pitch, wherein if the first element of the output vector is the maximum the frame is judged to be a melody-free frame, and if some element other than the first is the maximum the corresponding pitch is taken as the coarse melody pitch estimate of the frame;
and searching for a spectral peak within a 2/3-semitone range around the coarse melody pitch estimate of each frame and outputting the frequency corresponding to that peak as the melody pitch of the frame, thereby fine-tuning the melody pitch.
2. The method of claim 1, characterized in that the input vector set and the output vector set of the extreme learning machine are constructed as follows:
the input vector set of the extreme learning machine is constructed from the feature vectors; the annotation corresponding to each feature vector is read, and for a melody-free frame a one-hot output vector whose first element is 1 and whose other elements are 0 is generated, while for a melody frame a one-hot output vector whose element corresponding to the quantized pitch is 1 and whose other elements are 0 is generated; for the unlabeled data set, every element of the output vector is 0, yielding the output vector set corresponding to the input vector set.
CN202010071808.4A 2020-01-21 2020-01-21 Semi-supervised music main melody extraction method Active CN111326164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010071808.4A CN111326164B (en) 2020-01-21 2020-01-21 Semi-supervised music main melody extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010071808.4A CN111326164B (en) 2020-01-21 2020-01-21 Semi-supervised music main melody extraction method

Publications (2)

Publication Number Publication Date
CN111326164A true CN111326164A (en) 2020-06-23
CN111326164B CN111326164B (en) 2023-03-21

Family

ID=71171018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010071808.4A Active CN111326164B (en) 2020-01-21 2020-01-21 Semi-supervised music main melody extraction method

Country Status (1)

Country Link
CN (1) CN111326164B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008281776A (en) * 2007-05-10 2008-11-20 Mitsubishi Electric Corp Melody extraction device and melody extraction method
CN105957538A (en) * 2016-05-09 2016-09-21 大连民族大学 Polyphony melody extraction method based on significance
CN107133643A (en) * 2017-04-29 2017-09-05 天津大学 Note signal sorting technique based on multiple features fusion and feature selecting
CN108647596A (en) * 2018-04-27 2018-10-12 大连民族大学 Music Melody extraction method
CN110223712A (en) * 2019-06-05 2019-09-10 西安交通大学 A kind of music emotion recognition method based on two-way convolution loop sparse network
CN110600055A (en) * 2019-08-15 2019-12-20 杭州电子科技大学 Singing voice separation method using melody extraction and voice synthesis technology

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
张维维 et al.: "Music main melody extraction combining an improved Euclidean algorithm and dynamic programming" *
赵亮: "An electronic music classification model based on an improved extreme learning machine" *
陈晓鸥; 杨德顺: "Research progress in music emotion recognition" *
黄清宝; 蒋成龙; 林小峰; 徐辰华; 唐鹏; 张梦桥: "Extreme learning machine network optimization based on the harmony search algorithm" *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113781988A (en) * 2021-07-30 2021-12-10 北京达佳互联信息技术有限公司 Subtitle display method, subtitle display device, electronic equipment and computer-readable storage medium

Also Published As

Publication number Publication date
CN111326164B (en) 2023-03-21

Similar Documents

Publication Publication Date Title
CN110136731B (en) Cavity causal convolution generation confrontation network end-to-end bone conduction voice blind enhancement method
CN110019931B (en) Audio classification method and device, intelligent equipment and storage medium
CN101599271B (en) Recognition method of digital music emotion
Wang et al. Specaugment++: A hidden space data augmentation method for acoustic scene classification
CN109378010A (en) Neural network model training method, voice denoising method and device
CN111369982A (en) Training method of audio classification model, audio classification method, device and equipment
CN109100814B (en) A kind of audio magnetotelluric method signal antinoise method based on noise classification
CN105788592A (en) Audio classification method and apparatus thereof
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN115545086B (en) Migratable feature automatic selection acoustic diagnosis method and system
Li et al. Sams-net: A sliced attention-based neural network for music source separation
CN112885375A (en) Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network
CN112086100A (en) Quantization error entropy based urban noise identification method of multilayer random neural network
CN111326164B (en) Semi-supervised music theme extraction method
Li et al. A si-sdr loss function based monaural source separation
Shifas et al. A non-causal FFTNet architecture for speech enhancement
CN117409761B (en) Method, device, equipment and storage medium for synthesizing voice based on frequency modulation
US20230136220A1 (en) Quantifying Signal Purity by means of Machine Learning
CN114302301A (en) Frequency response correction method and related product
CN104318931A (en) Emotional activity obtaining method and apparatus of audio file, and classification method and apparatus of audio file
Moliner et al. Realistic gramophone noise synthesis using a diffusion model
Küçükbay et al. Hand-crafted versus learned representations for audio event detection
Jiang et al. Music signal recognition based on the mathematical and physical equation inversion method
Yao et al. [Retracted] Research on Segmentation Experience of Music Signal Improved Based on Maximization of Negative Entropy
Ye et al. Music style transfer with vocals based on CycleGAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant