CN111326164A - Semi-supervised music melody extraction method - Google Patents
Semi-supervised music melody extraction method
- Publication number
- CN111326164A (application number CN202010071808.4A)
- Authority
- CN
- China
- Prior art keywords
- pitch
- learning machine
- melody
- extreme learning
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/26—Pre-filtering or post-filtering
- G10L19/265—Pre-filtering, e.g. high frequency emphasis prior to encoding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/0212—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders using orthogonal transformation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/032—Quantisation or dequantisation of spectral components
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/90—Pitch determination of speech signals
Abstract
The invention discloses a semi-supervised music main melody extraction method, which comprises the following steps: preprocessing an input audio signal by normalization, resampling, and filtering to obtain an audio signal that simulates the auditory characteristics of the human ear; performing a constant-Q transform on the audio signal to obtain a variable-resolution spectrum with logarithmically spaced frequencies; aggregating the magnitude spectra of adjacent frames into a feature vector, constructing the input vector set of an extreme learning machine from the feature vectors, and obtaining the output vector set from the training set; training the parameters of the extreme learning machine and using the trained network to obtain a coarse melody pitch estimate; and searching for a spectral peak within a 2/3-semitone range around the coarse melody pitch estimate of each frame and outputting the frequency at the spectral peak as the melody pitch of that frame, thereby fine-tuning the melody pitch.
Description
Technical Field
The invention belongs to the field of audio signal processing, and particularly relates to a semi-supervised music melody extraction method.
Background
The features adopted by existing main melody extraction methods fall into two types. The first is low-level hand-crafted features, designed by researchers according to signal characteristics; their performance drops significantly, with poor generalization, on complex music containing multiple sound sources. The second is high-level self-learned features, obtained by an algorithm learning from a data set; their performance depends heavily on the capacity and diversity of the training set. Recently, deep learning has provided a new solution for music melody extraction, but training suitable network parameters takes a long time, and the capacity and diversity of the training set remain important factors limiting performance. Moreover, annotated data sets for main melody extraction are still scarce, which in turn limits the performance of such methods.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a semi-supervised music main melody extraction method, which specifically comprises the following steps:
preprocessing an input audio signal by normalization, resampling, and filtering to obtain an audio signal that simulates the auditory characteristics of the human ear;
performing a constant-Q transform on the audio signal to obtain a variable-resolution spectrum with logarithmically spaced frequencies, aggregating the magnitude spectra of adjacent frames into a feature vector, constructing the input vector set of an extreme learning machine from the feature vectors, and generating the output vector set from the training set labels;
training the parameters of the extreme learning machine and using the network to obtain a coarse melody pitch estimate: if the first element of the output vector is the maximum, the frame is judged melody-free; if any other element is the maximum, the pitch corresponding to that element is taken as the coarse melody pitch estimate of the frame;
and searching for a spectral peak within a 2/3-semitone range around the coarse melody pitch estimate of each frame and outputting the frequency at the spectral peak as the melody pitch of that frame, thereby fine-tuning the melody pitch.
The input vector set and the output vector set of the extreme learning machine are constructed as follows:
the input vector set of the extreme learning machine is built from the feature vectors; the label value corresponding to each feature vector is read, and a one-hot output vector is generated: for a melody-free section, the first element is 1 and all other elements are 0; for a melody section, the element corresponding to the quantized pitch is 1 and all other elements are 0; for the unlabeled data set, every element of the output vector is 0. This yields the output vector set corresponding to the input vector set.
With the above technical scheme, the invention provides a semi-supervised music main melody extraction method that can learn from unlabeled data, efficiently learns high-level melody feature representations, generalizes well, and performs well on unseen data sets.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments described in the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of the semi-supervised music main melody extraction method according to the present invention;
FIG. 2 is a flowchart of testing the extreme learning machine according to the present invention;
FIG. 3 is a time domain waveform diagram in accordance with an embodiment of the present invention;
FIG. 4 is a CQT spectrum in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a prediction matrix of a semi-supervised extreme learning machine in an embodiment of the present invention;
FIG. 6 is a diagram illustrating the estimated value and the true value of the audio segment according to the present invention.
Detailed Description
In order to make the technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings:
As shown in FIG. 1, the semi-supervised music main melody extraction method specifically comprises the following steps:
s1: and preprocessing the input audio signal, wherein the preprocessing process comprises normalization, resampling and filtering to obtain the audio signal simulating the auditory characteristics of human ears. The detailed process comprises the following steps:
s11, normalizing the audio signal to enable the mean value to be 0 and the variance to be 1;
s12, according to the characteristics that the amplitude of a higher harmonic signal in the music signal is relatively small and the influence on a signal processing result is weak, resampling the input audio signal, sampling to 16kHz, namely discarding frequency components above 8 kHz;
s13, filtering the audio signal by adopting an equal loudness filter, enhancing audio components which are relatively sensitive to the human auditory system, and attenuating components which are not sensitive to the human auditory system to obtain the audio signal simulating the auditory characteristics of human ears.
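The preprocessing of S11–S13 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a 44.1 kHz input, and substitutes a simple band-pass for the equal-loudness filter, whose exact coefficients the description does not specify.

```python
import numpy as np
from scipy.signal import butter, lfilter, resample_poly

def preprocess(x, sr_in=44100, sr_out=16000):
    """Normalize, resample to 16 kHz, and apply a rough loudness filter."""
    # S11: zero-mean, unit-variance normalization
    x = (x - np.mean(x)) / (np.std(x) + 1e-12)
    # S12: resample to 16 kHz, discarding content above 8 kHz
    x = resample_poly(x, sr_out, sr_in)
    # S13: stand-in for the equal-loudness filter -- a band-pass that
    # emphasizes the region the ear is most sensitive to (illustrative only)
    b, a = butter(2, [100 / (sr_out / 2), 5000 / (sr_out / 2)], btype="band")
    return lfilter(b, a, x)
```

One second of 44.1 kHz audio yields 16000 samples after this pipeline.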
S2: carrying out constant Q spectrum transformation on an audio signal on the basis of preprocessing to obtain variable resolution spectrum signals with frequencies distributed according to logarithms, wherein the variable resolution spectrum signals comprise amplitude spectrums and phase spectrums; reading the mark value corresponding to each feature vector, generating a one-bit effective output vector with the first element of 1 and other elements of 0 in the non-rotation section, generating a one-bit effective output vector with the pitch corresponding to each quantized element of 1 and other elements of 0 in the rotation section, and for the unmarked data set, outputting each element of the vector of 0, thereby obtaining an output vector set corresponding to the input vector set. The method comprises the following specific steps:
S21. To obtain the local spectral characteristics of the audio signal, apply the constant-Q transform (CQT) to obtain a variable-resolution spectrum whose frequencies follow a logarithmic distribution, improving the frequency resolution of the lower frequency bands and the time resolution of the higher frequency bands;
S22. Take the modulus of the CQT spectrum to obtain the CQT magnitude spectrum and discard the phase spectrum; aggregate the CQT magnitude spectra of adjacent frames by concatenating, centered on the current frame, the 3 preceding and 3 following frames, yielding a long vector formed from 7 consecutive CQT magnitude spectra, which serves as the feature vector of the current frame;
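The 7-frame aggregation of S22 can be sketched as follows. This is a minimal illustration on an arbitrary magnitude spectrogram; the CQT itself (e.g. via `librosa.cqt`) is assumed to have been computed already, and edge frames are padded by repetition, a detail the description leaves open.

```python
import numpy as np

def aggregate_frames(mag, context=3):
    """Concatenate each frame with `context` frames on either side (7 total).

    mag: magnitude spectrogram of shape (n_bins, n_frames).
    Returns feature matrix of shape (n_frames, (2*context+1) * n_bins).
    """
    n_bins, n_frames = mag.shape
    # replicate edge frames so the first/last frames also get 7 neighbors
    padded = np.pad(mag, ((0, 0), (context, context)), mode="edge")
    feats = np.stack(
        [padded[:, t:t + 2 * context + 1].T.ravel() for t in range(n_frames)]
    )
    return feats
```

For an 84-bin spectrogram, each frame's feature vector has 7 × 84 = 588 elements.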
S23. Quantize the melody pitches in the labeled data set and generate one-hot output vectors: for a melody-free section, the first element is 1 and all other elements are 0; for a melody section, the element corresponding to the quantized pitch is 1 and all other elements are 0; every element of an output vector corresponding to the unlabeled data set is set to 0.
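The label construction of S23 can be sketched as follows. The number of pitch bins, the lowest pitch, and one-bin-per-semitone quantization are illustrative assumptions not fixed by the description.

```python
import numpy as np

N_PITCH = 60   # assumed number of quantized pitch bins
F_MIN = 55.0   # assumed lowest pitch in Hz (A1), one bin per semitone

def label_vector(f0):
    """One-hot target: element 0 = melody-free, elements 1..N_PITCH = pitch bins.
    Unlabeled frames (f0 is None) get an all-zero vector."""
    v = np.zeros(N_PITCH + 1)
    if f0 is None:      # unlabeled frame: all zeros
        return v
    if f0 <= 0:         # melody-free frame
        v[0] = 1.0
        return v
    bin_idx = int(round(12 * np.log2(f0 / F_MIN)))  # semitone quantization
    v[1 + int(np.clip(bin_idx, 0, N_PITCH - 1))] = 1.0
    return v
```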
S3: carrying out parameter training on the extreme learning machine, and carrying out melody pitch rough estimation by utilizing an extreme learning machine network; if the first element of the output vector is the maximum value, the frame is judged as a rotation-law-free frame, and if the other elements except the first element are the maximum values, the corresponding pitch is used as a coarse pitch estimation result of the frame melody;
The training process of the extreme learning machine comprises the following steps:
S31. Randomly initialize the ELM input weights, uniformly distributed in [-1, 1];
S32. Randomly initialize the ELM input biases, uniformly distributed in [-1, 1];
S33. Compute the ELM hidden-layer output matrix.
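Steps S31–S33 can be sketched as follows. The description stops at the hidden-layer output matrix; the Moore–Penrose pseudo-inverse shown here is the standard least-squares closing step of ELM training and is an assumption about this method, not stated in the text.

```python
import numpy as np

def train_elm(X, T, n_hidden=500, seed=0):
    """Train an ELM: random input weights/biases, least-squares output weights.

    X: inputs of shape (n_samples, n_features); T: targets (n_samples, n_out).
    """
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, (X.shape[1], n_hidden))  # S31: input weights in [-1, 1]
    b = rng.uniform(-1, 1, n_hidden)                # S32: input biases in [-1, 1]
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))          # S33: hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T                    # Moore-Penrose output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

With more hidden units than training samples, the least-squares solution interpolates the training targets almost exactly.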
S4: Search for a spectral peak within a 2/3-semitone range around the coarse melody pitch estimate of each frame and output the frequency at the spectral peak as the melody pitch of that frame, thereby fine-tuning the melody pitch.
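The refinement of S4 can be sketched on a per-frame CQT magnitude vector. The CQT resolution of 3 bins per semitone, which makes 2/3 semitone correspond to 2 bins, is an illustrative assumption.

```python
import numpy as np

BINS_PER_SEMITONE = 3  # assumed CQT resolution (36 bins per octave)

def refine_pitch(mag_frame, coarse_bin, window=2):
    """Return the strongest spectral bin within +-2/3 semitone (2 bins here)
    of the coarse pitch estimate for one frame."""
    lo = max(coarse_bin - window, 0)
    hi = min(coarse_bin + window + 1, len(mag_frame))
    return lo + int(np.argmax(mag_frame[lo:hi]))
```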
Example:
A piece of music audio with a duration of 8.2 seconds is collected; its time-domain waveform is shown in FIG. 3. CQT spectral analysis is performed on the audio piece, and its CQT magnitude spectrum is shown in FIG. 4. Melody pitch prediction is performed on the audio signal using the trained semi-supervised extreme learning machine network parameters, and the resulting prediction matrix is shown in FIG. 5. Taking the maximum of each row of the prediction matrix yields the coarse melody pitch estimate; a spectral peak search within a 2/3-semitone range around each frame's coarse estimate then yields the final melody pitch estimate. The estimated and true values for this audio segment are shown in FIG. 6. This embodiment shows that the method outputs accurate melody pitch estimates.
The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification of the technical solution and inventive concept of the present invention that a person skilled in the art can readily conceive within the technical scope disclosed herein shall fall within the scope of protection of the present invention.
Claims (2)
1. A semi-supervised music main melody extraction method, characterized by comprising the following steps:
preprocessing an input audio signal by normalization, resampling, and filtering to obtain an audio signal that simulates the auditory characteristics of the human ear;
performing a constant-Q transform on the audio signal to obtain a variable-resolution spectrum with logarithmically spaced frequencies, aggregating the magnitude spectra of adjacent frames into a feature vector, constructing the input vector set of an extreme learning machine from the feature vectors, and generating the output vector set from the training set labels;
training the parameters of the extreme learning machine and using the network to obtain a coarse melody pitch estimate: if the first element of the output vector is the maximum, the frame is judged melody-free; if any other element is the maximum, the pitch corresponding to that element is taken as the coarse melody pitch estimate of the frame;
and searching for a spectral peak within a 2/3-semitone range around the coarse melody pitch estimate of each frame and outputting the frequency at the spectral peak as the melody pitch of that frame, thereby fine-tuning the melody pitch.
2. The method of claim 1, characterized in that the input vector set and the output vector set of the extreme learning machine are constructed as follows:
the input vector set of the extreme learning machine is built from the feature vectors; the label value corresponding to each feature vector is read, and a one-hot output vector is generated: for a melody-free section, the first element is 1 and all other elements are 0; for a melody section, the element corresponding to the quantized pitch is 1 and all other elements are 0; for the unlabeled data set, every element of the output vector is 0, yielding the output vector set corresponding to the input vector set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010071808.4A CN111326164B (en) | 2020-01-21 | 2020-01-21 | Semi-supervised music melody extraction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010071808.4A CN111326164B (en) | 2020-01-21 | 2020-01-21 | Semi-supervised music melody extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111326164A true CN111326164A (en) | 2020-06-23 |
CN111326164B CN111326164B (en) | 2023-03-21 |
Family
ID=71171018
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010071808.4A Active CN111326164B (en) | 2020-01-21 | 2020-01-21 | Semi-supervised music melody extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111326164B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113781988A (en) * | 2021-07-30 | 2021-12-10 | 北京达佳互联信息技术有限公司 | Subtitle display method, subtitle display device, electronic equipment and computer-readable storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008281776A (en) * | 2007-05-10 | 2008-11-20 | Mitsubishi Electric Corp | Melody extraction device and melody extraction method |
CN105957538A (en) * | 2016-05-09 | 2016-09-21 | 大连民族大学 | Polyphony melody extraction method based on significance |
CN107133643A (en) * | 2017-04-29 | 2017-09-05 | 天津大学 | Note signal sorting technique based on multiple features fusion and feature selecting |
CN108647596A (en) * | 2018-04-27 | 2018-10-12 | 大连民族大学 | Music Melody extraction method |
CN110223712A (en) * | 2019-06-05 | 2019-09-10 | 西安交通大学 | A kind of music emotion recognition method based on two-way convolution loop sparse network |
CN110600055A (en) * | 2019-08-15 | 2019-12-20 | 杭州电子科技大学 | Singing voice separation method using melody extraction and voice synthesis technology |
Non-Patent Citations (4)
Title |
---|
Zhang Weiwei et al.: "Main melody extraction combining an improved Euclidean algorithm and dynamic programming" *
Zhao Liang: "An electronic music classification model with an improved extreme learning machine" *
Chen Xiao'ou; Yang Deshun: "Research progress in music emotion recognition" *
Huang Qingbao; Jiang Chenglong; Lin Xiaofeng; Xu Chenhua; Tang Peng; Zhang Mengqiao: "Extreme learning machine network optimization based on the harmony search algorithm" *
Also Published As
Publication number | Publication date |
---|---|
CN111326164B (en) | 2023-03-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110136731B (en) | Cavity causal convolution generation confrontation network end-to-end bone conduction voice blind enhancement method | |
CN110019931B (en) | Audio classification method and device, intelligent equipment and storage medium | |
CN101599271B (en) | Recognition method of digital music emotion | |
Wang et al. | Specaugment++: A hidden space data augmentation method for acoustic scene classification | |
CN109378010A (en) | Neural network model training method, voice denoising method and device | |
CN111369982A (en) | Training method of audio classification model, audio classification method, device and equipment | |
CN109100814B (en) | A kind of audio magnetotelluric method signal antinoise method based on noise classification | |
CN105788592A (en) | Audio classification method and apparatus thereof | |
CN111899757B (en) | Single-channel voice separation method and system for target speaker extraction | |
CN115545086B (en) | Migratable feature automatic selection acoustic diagnosis method and system | |
Li et al. | Sams-net: A sliced attention-based neural network for music source separation | |
CN112885375A (en) | Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network | |
CN112086100A (en) | Quantization error entropy based urban noise identification method of multilayer random neural network | |
CN111326164B (en) | Semi-supervised music melody extraction method | |
Li et al. | A si-sdr loss function based monaural source separation | |
Shifas et al. | A non-causal FFTNet architecture for speech enhancement | |
CN117409761B (en) | Method, device, equipment and storage medium for synthesizing voice based on frequency modulation | |
US20230136220A1 (en) | Quantifying Signal Purity by means of Machine Learning | |
CN114302301A (en) | Frequency response correction method and related product | |
CN104318931A (en) | Emotional activity obtaining method and apparatus of audio file, and classification method and apparatus of audio file | |
Moliner et al. | Realistic gramophone noise synthesis using a diffusion model | |
Küçükbay et al. | Hand-crafted versus learned representations for audio event detection | |
Jiang et al. | Music signal recognition based on the mathematical and physical equation inversion method | |
Yao et al. | [Retracted] Research on Segmentation Experience of Music Signal Improved Based on Maximization of Negative Entropy | |
Ye et al. | Music style transfer with vocals based on CycleGAN |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |