CN111785296B

CN111785296B - Music segmentation boundary identification method based on repeated melody

Info

Publication number: CN111785296B
Application number: CN202010459989.8A
Authority: CN
Inventors: 张克俊; 朱凯丽; 殷叶航; 叶雨晴; 伍文棋; 王昊阳
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2020-05-26
Filing date: 2020-05-26
Publication date: 2022-06-10
Anticipated expiration: 2040-05-26
Also published as: CN111785296A

Abstract

The invention relates to a music segmentation boundary identification method based on repeated melody, belonging to the technical field of audio signal processing. The method comprises the following steps: 1) extracting chroma features from the audio, zero-padding the head and tail, aggregating every N adjacent frames into a new frame vector, and forming a new frame feature vector sequence from all the frame vectors; 2) calculating the Euclidean distance between each frame vector and the other frame vectors in the frame feature sequence to obtain a self-similarity matrix S; 3) based on the self-similarity matrix S, obtaining the set N_i of nearest-neighbor frames of the i-th frame vector, and from these obtaining a recursion graph R of the self-similarity matrix S; 4) carrying out time delay processing on the recursion graph R to obtain a time delay matrix L; 5) carrying out line segment normalization and denoising on L, and then carrying out reverse time delay processing to obtain a recursion graph R'; 6) detecting all line segments, clustering the line segments, and processing the clusters sequentially, starting from the cluster with the most line segments, to obtain a music segmentation boundary point set B. The method improves the ability to recognize repeated melodies in music and allows music to be segmented at a shorter time scale.

Description

Music segmentation boundary identification method based on repeated melody

Technical Field

The invention relates to the technical field of audio signal processing, in particular to a music segmentation boundary identification method based on repeated melody.

Background

Information is often organized in a structure or hierarchy to facilitate dissemination or comprehension. Humans are usually very good at perceiving such structures, sometimes even unconsciously, which lets us analyze and fully grasp the meaning of given information. In the big data era, however, there is an increasing need for computers to support information processing. Automatically acquiring the structure of information has therefore become a key task of today's content processing systems. Music is a typical example of such multimedia content.

Important applications of research on music segmentation boundary identification algorithms include navigation within a work in a player, automatic generation of excerpts and mashups, identification of versions of the same work, and large-scale musicological research. The popularity and development of networking and digital entertainment products have made music one of the most important kinds of digital media content.

At present, music plays an important role both as soundtracks in film works and as a stand-alone entertainment product. For music as a stand-alone entertainment product, segmentation is an important basic step in music analysis, and in analysis scenarios that involve a great number of works, the sheer volume highlights the importance of automatic music segmentation. For soundtracks, in practical applications excerpts are used more often than the complete score, so automatic music segmentation can greatly improve the efficiency of excerpt extraction. Research on music segmentation boundary identification algorithms therefore has wide market application prospects.

Foote used the self-similarity matrix in the first study of music segmentation algorithms in 2000 to find repeated melodies in music. In 2006, Bruderer et al noted that human perception of musical structure is strongly related to certain cues, such as timbre changes, repetition and pauses. The 2010 study by Paulus et al indicates that there are three principles for inferring the structure of music: novelty, homogeneity and repetition. The music segmentation algorithm proposed by Serra et al in 2014 considers these principles together and introduces a calculation method based on the recursion graph (recurrence plot), which greatly improves segmentation accuracy, thereby improving the efficiency of automatic music segmentation and promoting the development of automatic music segmentation algorithms.

However, current algorithms applied to music segmentation have many defects. Unsupervised methods have a coarse segmentation granularity, have difficulty capturing the short segments of some music, make little use of music theory knowledge, and depend excessively on mathematical methods. Deep learning methods cannot fully take the repetition property into account during segmentation, and they suffer from dependence on data, high model training cost and difficulty in incorporating music knowledge.

Disclosure of Invention

The invention aims to provide a music segmentation boundary identification method based on repeated melody, so as to improve the ability to identify repeated melodies in music and to segment the music at a shorter time scale.

In order to achieve the above object, the method for identifying the music segmentation boundary based on the repeated melody provided by the invention comprises the following steps:

1) extracting chroma features from the audio to obtain a feature vector sequence of M frames in total; zero-padding the head and the tail of the feature vector sequence, aggregating every N adjacent frames into a new frame vector, and forming a new frame feature vector sequence from all the frame vectors;

2) calculating Euclidean distance between each frame vector and other frame vectors in the frame feature sequence to obtain a self-similarity matrix S;

3) based on the self-similarity matrix S, obtaining the set N_i of nearest-neighbor frames of the i-th frame vector, i = 1, 2, …, M, and from these obtaining a recursion graph R of the self-similarity matrix S;

4) carrying out time delay processing on the recursive graph R to obtain a time delay matrix L;

5) carrying out line segment normalization and denoising on the time delay matrix L, and then carrying out reverse time delay processing to obtain a normalized and denoised recursive graph R';

6) detecting all line segments based on the recursion graph R', clustering the line segments, and processing the clusters sequentially, starting from the cluster with the most line segments, to obtain a music segmentation boundary point set B.

In the above technical solution, to capture the repeated segments of music, Pitch Class Profile (PCP) features of the music, also called chroma features, are extracted frame by frame; these features fold the frequencies in a given range into 12 pitch classes to highlight the melody of the music.
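As a non-limiting illustration of how frequencies may be folded into 12 pitch classes, the following minimal sketch maps spectral peaks to a 12-bin chroma vector. The equal-tempered tuning with A4 = 440 Hz and the function names `pitch_class` and `chroma_from_spectrum` are assumptions made for illustration; the feature extractor actually used is not specified here.

```python
import math

def pitch_class(freq_hz, ref_a4=440.0):
    """Map a frequency to a pitch class index (0 = C, ..., 9 = A, 11 = B).

    Assumes equal temperament with A4 = ref_a4 (an illustrative choice).
    """
    # MIDI note number of the nearest equal-tempered pitch (A4 is note 69).
    midi = round(69 + 12 * math.log2(freq_hz / ref_a4))
    return midi % 12

def chroma_from_spectrum(freqs_hz, magnitudes):
    """Fold spectral magnitudes of one frame into a 12-bin chroma vector."""
    chroma = [0.0] * 12
    for f, m in zip(freqs_hz, magnitudes):
        if f > 0:  # skip the DC bin, whose logarithm is undefined
            chroma[pitch_class(f)] += m
    return chroma
```

Octave-related frequencies such as 220 Hz and 440 Hz fall into the same bin, which is why the feature emphasizes melody rather than register.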

Optionally, in one embodiment, in step 3), the k elements of the set N_i are the k frame vectors most similar to the i-th frame vector among all the frame vectors, and the value of k is 0.01 of the total number of frame vectors. For each point R_{i,j} in the recursion graph R, if i belongs to N_j and j belongs to N_i, R_{i,j} is set to 1; otherwise R_{i,j} is set to 0, thus obtaining the recursion graph R of the self-similarity matrix S.

Optionally, in one embodiment, in step 4), let L_{i,j} = R_{i,(i+j) mod (M-1)}, i = 1, 2, …, M, j = 1, 2, …, M, to obtain the time delay matrix L of the recursion graph R; that is, the main diagonal direction in the recursion graph R is converted into the horizontal direction.

Optionally, in one embodiment, step 5) comprises:

5-1) traversing the time delay matrix L, and defining each element whose value is 1 as a point; whenever a point is found, determining all points connected with it by breadth-first search, two points being considered connected if their step distance is less than 3;

5-2) among the connected points, counting the number of points at each ordinate; if the ordinate with the most points has more than 5 points, keeping the points at that ordinate and setting the other points to 0; otherwise, setting all of the points to 0;

5-3) letting R'_{i,(i+j) mod (M-1)} = L_{i,j}, i = 1, 2, …, M, j = 1, 2, …, M, to obtain the normalized and denoised recursion graph R'.

Optionally, in an embodiment, in step 6), the clustering of line segments includes:

traversing the recursion graph R' with the stride set to 3, finding all line segments in the graph, and representing each line segment in the standard form {x₁, x₂, y₁, y₂}, where x₁ and x₂ are the abscissas of the start and end points, and y₁ and y₂ are the ordinates of the start and end points;

taking a line segment, traversing the other line segments, and clustering all line segments that correspond to the same segment of melody as that line segment; the basis for judging that two line segments correspond to the same segment of melody is that the common length of their x₁ to x₂ ranges accounts for more than 80% of each.
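The 80% overlap criterion and the clustering it drives can be sketched as follows. This is an illustrative sketch only: segments are assumed to be tuples `(x1, x2, y1, y2)`, the length of a range is taken as `x2 - x1`, and each cluster is grown greedily around a seed segment; the names `same_melody` and `cluster_segments` are hypothetical.

```python
def same_melody(a, b, ratio=0.8):
    """True if the common part of the two x-ranges exceeds `ratio` of each."""
    common = min(a[1], b[1]) - max(a[0], b[0])
    if common <= 0:
        return False  # the ranges do not overlap
    return common > ratio * (a[1] - a[0]) and common > ratio * (b[1] - b[0])

def cluster_segments(segments, ratio=0.8):
    """Group segments greedily; clusters are returned largest first."""
    clusters = []
    unassigned = list(segments)
    while unassigned:
        seed = unassigned.pop(0)          # take a line segment
        cluster, rest = [seed], []
        for seg in unassigned:            # traverse the other line segments
            (cluster if same_melody(seed, seg, ratio) else rest).append(seg)
        unassigned = rest
        clusters.append(cluster)
    clusters.sort(key=len, reverse=True)  # cluster with the most segments first
    return clusters
```

For instance, (1, 9, 10, 18) and (2, 9, 20, 27) share the range 2 to 9, whose length 7 exceeds 80% of both 8 and 7, so they fall into the same cluster.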

Optionally, in an embodiment, in step 6), after the line segments are clustered, the cluster with the most line segments is taken, and all x₁ and all x₂ in it are averaged to obtain x̄₁ and x̄₂; then, for each line segment in the cluster, y₁ and y₂ are corrected according to x₁ and x̄₁ and according to x₂ and x̄₂, respectively, to obtain y'₁ and y'₂; x̄₁, x̄₂ and all y'₁ and y'₂ are each treated as a time point x and processed as follows: check whether the music segmentation boundary point set B contains a boundary point less than n frames away from the time point x, and if not, add the time point x to B.
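The averaging, correction and boundary insertion just described may be sketched as below. The correction rule used here, shifting each y by the offset between the cluster mean and the segment's own x, is an assumption inferred from the worked example in the embodiment (y₁ values 10, 20, 31 become 11, 20, 31 when the mean of x₁ is 2); the rounding of the means and the names `boundary_candidates` and `add_boundaries` are likewise illustrative.

```python
def boundary_candidates(cluster):
    """Return the mean x1, mean x2 and the corrected y1, y2 of each segment."""
    n = len(cluster)
    mean_x1 = round(sum(s[0] for s in cluster) / n)
    mean_x2 = round(sum(s[1] for s in cluster) / n)
    candidates = [mean_x1, mean_x2]
    for x1, x2, y1, y2 in cluster:
        # Assumed correction: shift y by the same offset that moves x to the mean.
        candidates.append(y1 + (mean_x1 - x1))
        candidates.append(y2 + (mean_x2 - x2))
    return candidates

def add_boundaries(candidates, boundaries, min_gap):
    """Add each time point unless a boundary closer than min_gap frames exists."""
    for x in candidates:
        if all(abs(x - b) >= min_gap for b in boundaries):
            boundaries.append(x)
    return boundaries
```

A usage example: for the cluster [(1, 9, 10, 18), (2, 9, 20, 27), (2, 9, 31, 38)], `boundary_candidates` yields 2, 9, 11, 18, 20, 27, 31, 38, and with an empty B and `min_gap=20` only the points 2 and 27 are added.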

Compared with the prior art, the invention has the advantages that:

The matrix denoising method draws on music theory knowledge and practical experience, fully considers the main causes of noise in music segmentation, and can reduce noise-induced errors more thoroughly and efficiently. The segmentation point acquisition method based on line segment clustering gives priority to melody segments that repeat many times, and taking the average value as the segmentation point further reduces errors and improves generalization performance.

Drawings

FIG. 1 is a flowchart illustrating a method for recognizing boundaries of music segments based on repeated melodies according to an embodiment of the present invention;

FIG. 2 is a diagram of a recurrence plot R in an embodiment of the present invention;

FIG. 3 is a diagram of a delay matrix L according to an embodiment of the present invention;

FIG. 4 is a diagram of the recursion graph R' after normalization and denoising in an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the following embodiments and accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments without any inventive step, are within the scope of protection of the invention.

Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs. The use of the word "comprising" or "comprises", and the like, in the context of this application, is intended to imply that the elements or steps preceding the word comprise those elements or steps listed below, but not the exclusion of other elements or steps.

Examples

In the music segmentation boundary identification method based on the repeated melody, a music segmentation algorithm based on a self-similarity matrix is constructed, and automatic identification of music structure segmentation points is realized. The method can replace manual labeling, is used for generating the music structure sequence, and can be further applied to music analysis, automatic fragment generation and the like. Referring to fig. 1, the specific process is as follows:

S100, extracting chroma features from the audio to obtain a feature vector sequence of M frames in total; zero-padding the head and the tail of the feature vector sequence, aggregating every N adjacent frames into a new frame vector, and forming a new frame feature vector sequence from all the frame vectors;

The sample music has a feature sequence of 1344 12-dimensional vectors. Zero-padding the head and tail gives a sequence of length 1350, and aggregating every 7 adjacent frames into a new frame vector gives a sequence of 1344 vectors of dimension 12 × 7.
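The padding and aggregation of S100 can be illustrated with the following minimal sketch. It assumes that the N − 1 padding frames are split evenly between head and tail (3 and 3 for N = 7, which matches the lengths 1344 and 1350 above) and that aggregation means concatenating the N adjacent frames; the function name `stack_frames` is hypothetical.

```python
def stack_frames(frames, n):
    """Zero-pad head and tail, then concatenate every n adjacent frames."""
    dim = len(frames[0])
    head = (n - 1) // 2            # assumed split of the n - 1 padding frames
    tail = (n - 1) - head
    zero = [0.0] * dim
    padded = [zero] * head + [list(f) for f in frames] + [zero] * tail
    stacked = []
    for i in range(len(frames)):   # one output vector per original frame
        vec = []
        for frame in padded[i:i + n]:
            vec.extend(frame)
        stacked.append(vec)
    return stacked
```

With 1344 frames of dimension 12 and n = 7, the padded sequence has 1350 frames and the output has 1344 vectors of dimension 84.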

S200, calculating the Euclidean distance between each frame vector in the frame feature sequence and the other frame vectors to obtain a 1344 × 1344 self-similarity matrix S.
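A minimal sketch of S200 in plain Python is given below; the name `self_similarity` is illustrative, and a practical implementation would likely use a vectorized numerical library instead of nested loops. Note that with Euclidean distance, smaller entries of S mean more similar frames.

```python
import math

def self_similarity(vectors):
    """Pairwise Euclidean distances between frame vectors (symmetric matrix)."""
    m = len(vectors)
    s = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            d = math.dist(vectors[i], vectors[j])
            s[i][j] = d
            s[j][i] = d            # distance is symmetric
    return s
```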

S300, based on the self-similarity matrix S, obtaining the set N_i of nearest-neighbor frames of the i-th frame, i = 1, 2, …, M, and from these obtaining the recursion graph R of the self-similarity matrix S, see FIG. 2;

The k elements of the set N_i are the k frames most similar to the i-th frame among all frames. For each point R_{i,j} in the recursion graph, if i belongs to N_j and j belongs to N_i, R_{i,j} is set to 1; otherwise R_{i,j} is set to 0, yielding a 1344 × 1344 recursion graph R. The value of k is 0.01 of the total number of frames, which gives 13 in this embodiment.
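The mutual nearest-neighbor rule of S300 can be sketched as follows. The sketch assumes that a frame is not counted among its own neighbors and that ties in distance are broken by frame index; neither detail is stated in the text, and the name `recursion_graph` is hypothetical.

```python
def recursion_graph(s, k):
    """Binary matrix with R[i][j] = 1 iff i is in N_j and j is in N_i."""
    m = len(s)
    neighbours = []
    for i in range(m):
        # Smaller distance means more similar; the frame itself is excluded.
        order = sorted((j for j in range(m) if j != i), key=lambda j: s[i][j])
        neighbours.append(set(order[:k]))
    r = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in neighbours[i]:
            if i in neighbours[j]:     # the relation must hold in both directions
                r[i][j] = 1
    return r
```

For the sample music, k = round(0.01 × 1344) = 13. The mutual condition makes R symmetric and discards one-sided matches, for example a frame whose nearest neighbor is itself closer to some other frame.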

S400, carrying out time delay processing on the recursion graph R to obtain a time delay matrix L, which is shown in FIG. 3;

Let L_{i,j} = R_{i,(i+j) mod (M-1)} to obtain the time delay matrix L of the recursion graph R; this converts the main diagonal direction in the recursion graph R into the horizontal direction, which improves calculation efficiency.
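The time delay transform and its inverse can be illustrated as below. This sketch departs from the formula as written in one respect, which is an assumption: it uses 0-based indices and the modulus M (the matrix size) instead of M − 1, so that each row is a pure cyclic shift and the inverse transform recovers R exactly. The first index is treated as the horizontal (time) coordinate and the second as the lag.

```python
def time_lag(r):
    """L[i][j] = R[i][(i + j) % M]: diagonals of R become lines of constant j."""
    m = len(r)
    return [[r[i][(i + j) % m] for j in range(m)] for i in range(m)]

def inverse_time_lag(lag_matrix):
    """Undo time_lag: R[i][(i + j) % M] = L[i][j]."""
    m = len(lag_matrix)
    r = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            r[i][(i + j) % m] = lag_matrix[i][j]
    return r
```

A repetition that appears in R as points (0, 2), (1, 3) on a diagonal appears in L as points (0, 2), (1, 2), all sharing the same lag, which is what makes the later per-ordinate counting simple.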

S500, performing line segment normalization and denoising on the time delay matrix L, and then performing reverse time delay processing to obtain a normalized and denoised recursion graph R', see FIG. 4.

First, the time delay matrix L is traversed, and each element whose value is 1 is defined as a point. Whenever a point is found, all points connected with it are determined by breadth-first search; two points are considered connected if their step distance is less than 3. Among the connected points, the number of points at each ordinate is counted; if the ordinate with the most points has more than 5 points, the points at that ordinate are kept and the other points are set to 0. Otherwise, all of the points are set to 0. For example, for the series of points {(1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (2,2), (3,2), (4,2)}, the ordinate with the most points is 1, with 6 points, so those 6 points are retained and the points with ordinate 2 are erased. Then, let R'_{i,(i+j) mod (M-1)} = L_{i,j} to obtain the normalized and denoised recursion graph R'.
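The denoising rule can be sketched on a set of (abscissa, ordinate) points as follows. The sketch interprets "step distance less than 3" as a Chebyshev distance below 3, which is an assumption, and uses a simple quadratic-time neighbor scan for clarity; the name `denoise_points` and its parameters are illustrative.

```python
from collections import Counter, deque

def denoise_points(points, step=3, min_len=5):
    """Keep, per connected group, only the dominant ordinate if it has > min_len points."""
    remaining = set(points)
    kept = set()
    while remaining:
        seed = remaining.pop()
        component = [seed]
        queue = deque([seed])
        while queue:                       # breadth-first search
            x, y = queue.popleft()
            near = [p for p in remaining
                    if max(abs(p[0] - x), abs(p[1] - y)) < step]
            for p in near:
                remaining.remove(p)
                component.append(p)
                queue.append(p)
        counts = Counter(y for _, y in component)
        best_y, best_n = max(counts.items(), key=lambda kv: kv[1])
        if best_n > min_len:               # otherwise the whole group is erased
            kept.update(p for p in component if p[1] == best_y)
    return kept
```

Applied to the example above, the nine points form one connected group, ordinate 1 holds 6 points, and only those 6 points survive.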

S600, based on the recursion graph R', all line segments are detected and clustered, and the cluster with the most line segments is processed in sequence to obtain a music segmentation boundary point set B.

First, all line segments in the recursion graph R' are found and expressed in a standard form: the recursion graph R' is traversed with the stride set to 3, and each line segment found is denoted {x₁, x₂, y₁, y₂}, where x₁ and x₂ are the abscissas of the start and end points and y₁ and y₂ are the ordinates of the start and end points. For example, the segment {1,9,10,18} indicates that frames 10 to 18 are similar to frames 1 to 9. Then, line segments whose x₁ to x₂ ranges share a common part accounting for more than 80% of each are grouped into the same cluster, such as {1,9,10,18}, {2,9,20,27} and {2,9,31,38}. After clustering, x₁ is averaged and the corresponding y₁ values are corrected accordingly; here the average of x₁ is 2, so the three corresponding y₁ values are taken as 11, 20 and 31. For each resulting point, it is checked whether the boundary point set B already contains a point within 20 frames of it (a value related to the required segmentation duration), and if not, the point is added to B. The segmentation result for the sample music is thus obtained.
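One possible way to detect the line segments in R' is sketched below. It rests on several assumptions that the text does not spell out: segments are searched along diagonals parallel to the main diagonal, only the upper triangle is scanned because R' is symmetric, the stride of 3 is read as a tolerance that bridges gaps of fewer than 3 frames, and runs with fewer than `min_points` points are ignored. Indices are 0-based and the name `find_segments` is hypothetical.

```python
def find_segments(r, stride=3, min_points=2):
    """Return diagonal segments of a binary matrix as (x1, x2, y1, y2) tuples."""
    m = len(r)
    segments = []
    for lag in range(1, m):                      # one diagonal per lag
        xs = [x for x in range(m - lag) if r[x][x + lag] == 1]
        run = []
        for x in xs + [None]:                    # None flushes the final run
            if run and (x is None or x - run[-1] >= stride):
                if len(run) >= min_points:
                    segments.append((run[0], run[-1],
                                     run[0] + lag, run[-1] + lag))
                run = []
            if x is not None:
                run.append(x)
    return segments
```

A segment (1, 6, 6, 11) returned by this sketch would mean that frames 6 to 11 repeat frames 1 to 6.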

Claims

1. A music segmentation boundary identification method based on repeated melody is characterized by comprising the following steps:

6) detecting all line segments based on the recursion graph R', clustering the line segments, and sequentially processing from the cluster with the most line segments to obtain a music segmentation boundary point set B;

in step 6), the clustering of line segments includes:

traversing the recursion graph R', finding all line segments in the graph, and representing each line segment in the standard form {x₁, x₂, y₁, y₂}, where x₁ and x₂ are the abscissas of the start and end points, and y₁ and y₂ are the ordinates of the start and end points;

taking a line segment, traversing the other line segments, and clustering all line segments that correspond to the same segment of melody as that line segment; the basis for judging that line segments correspond to the same segment of melody is that the common length of x₁ and x₂ accounts for more than 80% of each;

in step 6), after the line segments are clustered, the cluster with the most line segments is taken, and all x₁ and x₂ are averaged to obtain x̄₁ and x̄₂; then, for each line segment in the cluster, according to x₁ and x̄₁, and x₂ and x̄₂,

2. The method as claimed in claim 1, wherein in step 3), the k elements of the set N_i are the k frame vectors most similar to the i-th frame vector among all the frame vectors, and the value of k is 0.01 of the total number of frame vectors.

3. The method as claimed in claim 1, wherein in step 3), for each point R_{i,j} in the recursion graph R, if i belongs to N_j and j belongs to N_i, R_{i,j} is set to 1; otherwise R_{i,j} is set to 0, thus obtaining the recursion graph R of the self-similarity matrix S.

4. The method as claimed in claim 1, wherein in step 4), let L_{i,j} = R_{i,(i+j) mod (M-1)}, i = 1, 2, …, M, j = 1, 2, …, M, to obtain the time delay matrix L of the recursion graph R; that is, the main diagonal direction in the recursion graph R is converted into the horizontal direction.

5. The method of claim 4, wherein step 5) comprises:

5-1) traversing the time delay matrix L, and defining each element whose value is 1 as a point; whenever a point is found, determining all points connected with it by breadth-first search, two points being considered connected if their step distance is less than 3;

6. The method of claim 1, wherein the stride is set to 3 when traversing the recursion graph R'.