CN110189768B - Chinese folk song geographical classification method based on conditional random field - Google Patents


Info

Publication number
CN110189768B
CN110189768B (granted publication of application CN201910395241.3A)
Authority
CN
China
Prior art keywords
sequence
conditional random
random field
audio
music
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910395241.3A
Other languages
Chinese (zh)
Other versions
CN110189768A (en)
Inventor
杨新宇
罗晶
丁建行
魏洁
董怡卓
张亦弛
夏小景
崔宇涵
吉姝蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201910395241.3A priority Critical patent/CN110189768B/en
Publication of CN110189768A publication Critical patent/CN110189768A/en
Application granted granted Critical
Publication of CN110189768B publication Critical patent/CN110189768B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/036 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal, of musical genre, i.e. analysing the style of musical pieces, usually for selection, filtering or classification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/041 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal, based on mfcc [mel-frequency spectral coefficients]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/061 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal, for extraction of musical phrases, isolation of musically relevant segments, e.g. musical thumbnail generation, or for temporal structure analysis of a musical piece, e.g. determination of the movement sequence of a musical work

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for classifying Chinese folk songs by region based on a conditional random field. Taking the temporal nature of music into account, the invention proposes modeling the frame-level features of folk songs with a conditional random field, computing the label sequence of a folk song in combination with a restricted Boltzmann machine, and learning the parameters with a quasi-Newton algorithm and k-step contrastive divergence, finally realizing the regional classification of the music. Compared with traditional methods, this remedies the missing temporal relations within the feature sequence, and using a restricted Boltzmann machine to compute the conditional random field label sequence overcomes the accuracy bottleneck of existing approaches to computing the label sequence. In addition, the restricted Boltzmann machine learns high-level music features from the audio frame features, which enlarges the differences between frame features and reduces the difficulty of manual audio feature design. The method effectively improves the classification accuracy of folk songs and the results of regional-style classification.

Description

Chinese folk song geographical classification method based on conditional random field
Technical Field
The invention belongs to the field of machine learning and data mining, and particularly relates to a Chinese folk song region classification method based on a conditional random field.
Background
With the rapid development of multimedia technology, music has shifted from traditional media such as records and magnetic tape to digital music, and the resulting mass of digital music data needs to be managed more efficiently. Against this background, music information retrieval and music classification have great academic significance and broad application scenarios, and have drawn wide attention from academia and industry. Advances in music classification technology help to manage different categories of music intelligently and let users retrieve music quickly within the categories they are interested in. As music data grows massively, further improving the accuracy and efficiency of music classification becomes essential.
With the spread and development of Chinese culture, Chinese folk songs with distinct regional styles are being encountered and enjoyed by more and more people, so research on classifying Chinese folk songs by regional style is particularly important. However, folk songs are typically improvised and transmitted by oral singing, lack strict compositional rules, and the stylistic boundaries between regions are blurred; frame features used directly and independently cannot represent the regional-style categories well, so related research results remain scarce.
Disclosure of Invention
The invention aims to solve the missing temporal relations within the feature sequence in traditional regional-style classification algorithms for folk songs, and provides a conditional-random-field-based method for classifying Chinese folk songs by region.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a method for classifying Chinese folk songs by region based on a conditional random field: first the audio features of the music are extracted; then an audio-feature temporal model is established based on a conditional random field, while its label sequence is computed with a restricted Boltzmann machine; the parameters are learned with a quasi-Newton algorithm and k-step contrastive divergence; and finally the regional classification of Chinese folk songs is carried out. The method specifically comprises the following steps:
1) audio feature extraction of music: comprising the selection of audio features, the framing of music pieces and the extraction of music audio features, specifically as follows,
1-1) selection of audio features: analyzing the audio signal in different transform domains, time-domain, frequency-domain and cepstral-domain features are selected as the audio features characterizing the timbre and melody of the music;
1-2) framing of music pieces: considering the short-time stationarity of music audio, the music audio m is sampled into consecutive short-time segments m = {m_1, m_2, ..., m_i, ..., m_N}, where m_i is called a "frame" and N denotes the sequence length;
1-3) extraction of music audio features: the features of 1-1) are extracted from the music audio piece m frame by frame, giving v = {v_1, v_2, ..., v_i, ..., v_N}, where v_i ∈ R^d is the feature vector of the i-th frame, containing d-dimensional data;
2) establishing the audio-feature temporal model: the folk songs of different regional styles are modeled with a conditional random field, taking into account the temporal relations among the per-frame audio features;
3) computing the audio's predicted label sequence: the computation of the predicted label sequence under the conditional random field model comprises the learning of high-level music features and the determination of the label sequence, specifically as follows,
3-1) learning of high-level music features: the extracted feature sequence v is used as the input of a restricted Boltzmann machine, network learning is performed with k-step contrastive divergence, and the abstract features computed by the hidden layer are taken as the high-level music features, expressed as x = {x_1, x_2, ..., x_i, ..., x_N}, where x_i ∈ R^d is the high-level music feature vector of the i-th frame, containing d-dimensional data;
3-2) determination of the label sequence: the high-level music feature vector x is taken as the observed values of the conditional random field's observation sequence, the feature functions are computed with the restricted Boltzmann machine, and the conditional random field label sequence y = {y_1, y_2, ..., y_i, ..., y_N} is obtained, where y_i is the region-category label corresponding to the i-th frame's high-level feature x_i;
4) realization of music region classification: the region category of a song is identified according to the obtained models corresponding to the different regional styles.
In a further improvement of the present invention, the audio features selected in step 1-1) comprise the short-time average zero-crossing rate (ZCR), spectral centroid (SC), spectral flux (SF), spectral roll-off point (SRP), Chroma features, linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC): 7 feature types totaling 86 dimensions.
A further development of the invention is that step 2) specifically operates as follows: the high-level feature sequence of each folk song in a given region-category sample set of the training set is taken as an observation sequence of a conditional random field; a model is established for each region category of folk songs in the parameterized form of the conditional random field; and the conditional random field model parameters representing each region category are computed by a parameter-solving method. The specific steps are:
(i) the audio frame feature sequence of any folk song in a region-category sample set of the training set is taken as the observation sequence of the conditional random field, and the folk song sample set of each region category is fitted with the probabilistic form of the conditional random field model, namely
P(y^(t) | x^(t)) = (1/Z(x^(t))) exp( Σ_k w_k f_k(y^(t), x^(t)) )        (1)

where Z(x^(t)) is a normalization factor and f_k(x^(t), y^(t)) = Σ_i f_k(y_{i-1}, y_i, x^(t), i) is obtained by summing the feature functions over all time steps. The feature functions comprise state feature functions s_l and transition feature functions t_k; through s_l and t_k, the region-category label of each frame in a folk song's audio frame sequence and the transition paths between frames are computed. w_k is the weight corresponding to feature function f_k, comprising the state weights μ_l corresponding to s_l and the transition weights λ_k corresponding to t_k.
(ii) The conditional random field model parameters w_k representing each region category of folk songs are computed with the BFGS algorithm, a quasi-Newton method. Following the maximum likelihood estimation rule, parameter solving takes the logarithm of the probability density function of the data set under the conditional random field to obtain the log-likelihood function L(w), then iteratively updates L(w) as the optimization objective so as to maximize the log-likelihood of all songs, yielding the optimal model parameters w*.
A further improvement of the invention is that in step 3-1) a restricted Boltzmann machine is used to model the audio features of the folk songs: the original input features are mapped through a nonlinear spatial transformation, and the high-level feature sequence learned by the hidden layer is used as the observation sequence, which helps enlarge the differences between the original audio feature frames. The parameters of the restricted Boltzmann machine are estimated by the maximum log-likelihood rule: first the network parameters are pre-trained with the k-step contrastive divergence algorithm; then, with the parameters computed by the k-step contrastive divergence algorithm as initial values, the network parameters are fine-tuned with the back-propagation algorithm; finally the optimal network parameters are obtained.
A further improvement of the invention is that the specific computation of step 3-2) is as follows: first, a Softmax layer is added after the hidden layer of the restricted Boltzmann machine, each unit of which represents one regional-style category of folk songs, giving the machine a discrimination mechanism; then the label state of the high-level feature at each time step is decided and the corresponding dimension of the state feature function is set to 1; finally the label sequence of the song is obtained. The specific steps are:
(i) a restricted Boltzmann machine discrimination model is established over all audio frame features in the data set, i.e. a Softmax layer is added after the hidden layer of the restricted Boltzmann machine, each unit of which represents one regional-style category of folk songs, so that the restricted Boltzmann machine has a discrimination mechanism;
(ii) when computing the region-category label state at any time step of the label sequence, the label of the abstract feature learned at that time step is decided by formula (2), i.e. the corresponding dimension of the state feature function s_l is set to 1:

y_i = argmax_j P(y_i = j | x_i; W_j)        (2)

where argmax_j denotes the value of j at which the probability P(·) is maximal, and W_j denotes the weights on the connections to the j-th node of the Softmax layer. After the region-category labels are obtained, the transition feature functions t_k at any two adjacent time steps are determined;
(iii) finally, the label sequence of the t-th song in the data set is obtained.
A further improvement of the invention is that the specific computation of step 4) is as follows: first, the high-level features obtained from a song are taken as the observation sequence of a conditional random field and fed into each trained conditional random field in turn; then the posterior probability of the predicted label sequence generated by each conditional random field is computed with the forward-backward algorithm; finally, the category represented by the conditional random field with the largest posterior probability value is selected as the predicted regional-style category of the test song. The specific steps are:
(i) the high-level features obtained from each test song are taken as an observation sequence of the conditional random fields and fed into each trained conditional random field in turn, and the posterior probability of the label sequence generated by each conditional random field given the observation sequence is computed with the forward-backward algorithm; here the label sequence refers to the predicted label sequence of the observation sequence under that conditional random field, computed by the restricted Boltzmann machine;
(ii) the category represented by the conditional random field with the largest posterior probability value is selected as the predicted regional-style category of the test song, i.e. it satisfies formula (3):

j* = argmax_j P(y^(t) | x^(t); w_j)        (3)

where argmax_j denotes the value of j at which the probability P(·) is maximal.
The invention has the following beneficial technical effects:
the invention provides a Chinese folk song regional classification method based on a conditional random field, which comprises the steps of firstly extracting audio features of music, then providing a method for modeling frame features of folk songs by adopting the conditional random field, combining a limited Boltzmann computer to calculate a labeling sequence, learning parameters by using a quasi-Newton algorithm and a k-time contrast divergence method, and finally realizing music regional classification. Compared with the traditional folk song region classification method, the method adopts the conditional random field to establish the time sequence relation among the frame feature sequences, and simultaneously adopts the limited Boltzmann machine to solve the problem of bottleneck of the accuracy of the conventional research and calculation of the labeling sequences, so that the region style classification performance of the folk song is further improved. Theoretical analysis and experimental analysis prove that the accuracy and precision of the method are improved in the Chinese folk song geographical classification problem.
Drawings
FIG. 1 is a diagram of the overall model for regional classification based on conditional random fields according to the present invention.
Detailed Description
The present invention is described in further detail below with reference to the attached drawings.
Referring to fig. 1, the conditional-random-field-based method for classifying Chinese folk songs by region provided by the invention first extracts the audio features of the music, then establishes an audio-feature temporal model based on a conditional random field while computing its label sequence with a restricted Boltzmann machine, learns the parameters with a quasi-Newton algorithm and k-step contrastive divergence, and finally realizes the regional classification of Chinese folk songs. The method specifically comprises the following steps:
1) audio feature extraction of music: the method comprises the steps of selecting audio features, framing music segments and extracting music audio features, and specifically comprises the following steps:
Step 1, audio feature selection: analyzing the audio signal in different transform domains, time-domain, frequency-domain and cepstral-domain features are selected as the audio features characterizing the timbre and melody of the music; the specific features are selected according to Table 1;
Table 1: the audio features selected in the present invention to characterize the timbre and melody of music.
[Table 1 is an image in the original document; it lists the seven selected audio features (ZCR, SC, SF, SRP, Chroma, LPCC, MFCC) and their dimensions, totaling 86 dimensions.]
All seven selected audio features contribute to the music representation. ZCR characterizes timbre well and is used for endpoint detection, pitch detection and tone segmentation of the audio signal; SC reflects the distribution of the audio signal's frequencies; SF reflects the amount of change between the energy spectra of two adjacent frames; SRP reflects the energy spectrum and is a measure of the spectral envelope; the Chroma features account for the presence of harmony in music, reduce the interference of noise and non-tonal sounds, and strengthen melody discrimination while reducing misjudgment; LPCC models human vocal production, and MFCC is an acoustic feature computed according to the human auditory mechanism; both reflect well the timbral differences between folk songs of different regional styles.
Step2 framing of music piece: in consideration of the short-time stationarity of the music audio, the music audio m is sampled into continuous short-time segments m ═ { m ═ m }1,m2,…,mi,…,mNIn which m isiReferred to as "frame", N denotes the sequence length;
Step 3, music audio feature extraction: the relevant features are extracted from the music audio piece m frame by frame, giving v = {v_1, v_2, ..., v_i, ..., v_N}, where v_i ∈ R^d is the feature vector of the i-th frame, containing d-dimensional data.
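The framing (Step 2) and per-frame feature extraction (Step 3) above can be sketched as follows. This Python/NumPy fragment is a minimal illustration: the frame length, hop size, sampling rate, and the two features shown (ZCR and spectral centroid) are illustrative choices, not the patent's full 86-dimensional feature set.

```python
import numpy as np

def frame_signal(m, frame_len=1024, hop=512):
    """Split the audio signal m into overlapping short-time frames m_1..m_N."""
    n_frames = 1 + (len(m) - frame_len) // hop
    return np.stack([m[i * hop : i * hop + frame_len] for i in range(n_frames)])

def zero_crossing_rate(frame):
    """ZCR: fraction of adjacent sample pairs whose signs differ."""
    return np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:]))

def spectral_centroid(frame, sr=22050):
    """SC: magnitude-weighted mean frequency of the windowed frame's spectrum."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.sum(freqs * mag) / (np.sum(mag) + 1e-12)

# Toy signal: a 440 Hz tone, one second at 22.05 kHz
sr = 22050
t = np.arange(sr) / sr
m = np.sin(2 * np.pi * 440 * t)
frames = frame_signal(m)
v = np.array([[zero_crossing_rate(f), spectral_centroid(f, sr)] for f in frames])
print(v.shape)  # (N, d): one feature vector v_i per frame
```

For a pure 440 Hz tone the per-frame spectral centroid lands near 440 Hz and the ZCR near 2·440/22050, which makes the sketch easy to sanity-check.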
2) Establishing the audio-feature temporal model: the folk songs of different regional styles are modeled with a conditional random field, taking into account the temporal relations among the per-frame audio features. The key steps are as follows:
(i) the audio frame feature sequence of any folk song in a region-category sample set of the training set is taken as the observation sequence of the conditional random field, and the folk song sample set of each region category is fitted with the probabilistic form of the conditional random field model, namely
P(y^(t) | x^(t)) = (1/Z(x^(t))) exp( Σ_k w_k f_k(y^(t), x^(t)) )        (1)

where Z(x^(t)) is a normalization factor and f_k(x^(t), y^(t)) = Σ_i f_k(y_{i-1}, y_i, x^(t), i) is obtained by summing the feature functions over all time steps. The feature functions comprise state feature functions s_l and transition feature functions t_k; through s_l and t_k, the region-category label of each frame in a folk song's audio frame sequence and the transition paths between frames are computed. w_k is the weight corresponding to feature function f_k, comprising the state weights μ_l corresponding to s_l and the transition weights λ_k corresponding to t_k.
(ii) The conditional random field model parameters w_k representing each region category of folk songs are computed with the BFGS algorithm, a quasi-Newton method. Following the maximum likelihood estimation rule, parameter solving takes the logarithm of the probability density function of the data set under the conditional random field to obtain the log-likelihood function L(w), then iteratively updates L(w) as the optimization objective so as to maximize the log-likelihood of all songs, yielding the optimal model parameters w*.
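A minimal NumPy sketch of the linear-chain form of equation (1): here the feature-function sums Σ_k w_k f_k are collapsed into a per-frame emission score matrix (the state terms) and a label-transition score matrix (the transition terms), a simplifying assumption for illustration. `crf_log_prob` is then the per-song term of the log-likelihood L(w) that the quasi-Newton step maximizes.

```python
import numpy as np

def crf_log_Z(emit, trans):
    """log Z(x) for a linear-chain CRF via the forward recursion.
    emit: (N, K) state scores per frame; trans: (K, K) transition scores."""
    alpha = emit[0]                       # log-potentials for the first frame
    for t in range(1, len(emit)):
        # sum over the previous label y_{t-1} for each current label y_t
        alpha = emit[t] + np.log(np.exp(alpha[:, None] + trans).sum(axis=0))
    return np.log(np.exp(alpha).sum())

def crf_log_prob(y, emit, trans):
    """log P(y | x) = score(y, x) - log Z(x), cf. equation (1)."""
    score = emit[0, y[0]] + sum(emit[t, y[t]] + trans[y[t - 1], y[t]]
                                for t in range(1, len(y)))
    return score - crf_log_Z(emit, trans)

rng = np.random.default_rng(0)
N, K = 6, 3                               # 6 frames, 3 label states
emit = rng.normal(size=(N, K))
trans = rng.normal(size=(K, K))
y = [0, 1, 1, 2, 0, 1]
print(crf_log_prob(y, emit, trans))       # a log-probability, always <= 0
```

Summing `crf_log_prob` over all training songs of a region category gives L(w) for that category's model; exponentiating over all K^N label sequences sums to 1, which is a useful correctness check.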
3) Computing the audio's predicted label sequence: the computation of the predicted label sequence under the conditional random field model comprises the learning of high-level music features and the determination of the label sequence, specifically as follows:
Step 1, learning of high-level music features: with reference to FIG. 1, the dashed box represents the restricted Boltzmann machine structure, comprising a visible layer V and hidden layers H_N. The extracted feature sequence v is used as the input of the restricted Boltzmann machine, network learning is performed with k-step contrastive divergence, and the abstract features computed by the hidden layer are taken as the high-level music features, expressed as x = {x_1, x_2, ..., x_i, ..., x_N}, where x_i ∈ R^d is the high-level music feature vector of the i-th frame, containing d-dimensional data;
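The k-step contrastive-divergence learning of Step 1 can be sketched with a minimal Bernoulli restricted Boltzmann machine in NumPy. The dimensions, learning rate, and CD-1 setting are illustrative assumptions, and the subsequent back-propagation fine-tuning is omitted; the hidden-layer activations stand in for the high-level feature sequence x.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_update(v0, W, b, c, k=1, lr=0.05, rng=None):
    """One k-step contrastive-divergence update of an RBM with weights W,
    visible bias b and hidden bias c.  Also returns the hidden activations,
    which serve as the 'high-level features' x for the frame features v0."""
    rng = rng or np.random.default_rng(0)
    h_prob0 = sigmoid(v0 @ W + c)                 # positive phase
    v = v0
    for _ in range(k):                            # Gibbs chain of length k
        h = (rng.random(h_prob0.shape) < sigmoid(v @ W + c)).astype(float)
        v = sigmoid(h @ W.T + b)                  # mean-field reconstruction
    h_probk = sigmoid(v @ W + c)                  # negative phase
    W = W + lr * (v0.T @ h_prob0 - v.T @ h_probk)
    b = b + lr * (v0 - v).mean(axis=0)
    c = c + lr * (h_prob0 - h_probk).mean(axis=0)
    return W, b, c, h_prob0

rng = np.random.default_rng(1)
N, d, h_dim = 8, 5, 4                             # 8 frames, 5-dim input, 4 hidden units
v_seq = (rng.random((N, d)) > 0.5).astype(float)  # toy binary frame features
W = 0.01 * rng.normal(size=(d, h_dim))
b, c = np.zeros(d), np.zeros(h_dim)
for _ in range(100):
    W, b, c, x_seq = cd_k_update(v_seq, W, b, c, k=1, rng=rng)
print(x_seq.shape)                                # (N, h_dim): high-level feature sequence
```

CD-k approximates the log-likelihood gradient by truncating the Gibbs chain after k steps, which is why it is a cheap pre-training rule rather than an exact maximum-likelihood update.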
Step 2, determination of the label sequence: the high-level music feature vector x is taken as the observed values of the conditional random field's observation sequence, the feature functions are computed with the restricted Boltzmann machine, and the conditional random field label sequence y = {y_1, y_2, ..., y_i, ..., y_N} is obtained, where y_i is the region-category label corresponding to the i-th frame's high-level feature x_i. The key steps are as follows:
(i) a restricted Boltzmann machine discrimination model is established over all audio frame features in the data set, i.e. a Softmax layer is added after the hidden layer of the restricted Boltzmann machine, each unit of which represents one regional-style category of folk songs, so that the restricted Boltzmann machine has a discrimination mechanism;
(ii) when computing the region-category label state at any time step of the label sequence, the label of the abstract feature learned at that time step is decided by formula (2), i.e. the corresponding dimension of the state feature function s_l is set to 1:

y_i = argmax_j P(y_i = j | x_i; W_j)        (2)

where argmax_j denotes the value of j at which the probability P(·) is maximal, and W_j denotes the weights on the connections to the j-th node of the Softmax layer. After the region-category labels are obtained, the transition feature functions t_k at any two adjacent time steps are determined;
(iii) finally, the label sequence of the t-th song in the data set is obtained.
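The per-frame labeling rule of formula (2), choosing for each frame the Softmax unit with maximal probability, can be sketched as follows; the plain linear Softmax layer and the random toy inputs are assumptions for illustration, not the patent's trained network.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # stabilized exponentials
    return e / e.sum(axis=-1, keepdims=True)

def label_sequence(x_seq, W_soft):
    """y_i = argmax_j P(y_i = j | x_i): per frame, pick the Softmax unit
    (region class j) with the highest probability, cf. formula (2)."""
    probs = softmax(x_seq @ W_soft)    # (N, J) class probabilities per frame
    return probs.argmax(axis=1)        # label sequence y = {y_1, ..., y_N}

rng = np.random.default_rng(2)
x_seq = rng.normal(size=(6, 4))        # toy high-level features for 6 frames
W_soft = rng.normal(size=(4, 3))       # weights to J = 3 region-class units
y = label_sequence(x_seq, W_soft)
print(y)                               # one class index per frame
```

Each argmax picks the dimension of the state feature function that is set to 1 for that frame, which is exactly how the label sequence feeds back into the CRF's feature functions.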
4) Realization of music region classification: the region category of a song is identified according to the obtained models corresponding to the different regional styles. The specific computation is as follows: first, the high-level features obtained from a song are taken as the observation sequence of a conditional random field and fed into each trained conditional random field in turn; then the posterior probability of the predicted label sequence generated by each conditional random field is computed with the forward-backward algorithm; finally, the category represented by the conditional random field with the largest posterior probability value is selected as the predicted regional-style category of the test song. The specific steps are:
(i) the high-level features obtained from each test song are taken as an observation sequence of the conditional random fields and fed into each trained conditional random field in turn, and the posterior probability of the label sequence generated by each conditional random field given the observation sequence is computed with the forward-backward algorithm; here the label sequence refers to the predicted label sequence of the observation sequence under that conditional random field, computed by the restricted Boltzmann machine;
(ii) the category represented by the conditional random field with the largest posterior probability value is selected as the predicted regional-style category of the test song, i.e. it satisfies formula (3):

j* = argmax_j P(y^(t) | x^(t); w_j)        (3)

where argmax_j denotes the value of j at which the probability P(·) is maximal.
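The decision rule of formula (3) can be sketched as follows, assuming each trained region-category CRF is summarized by toy per-frame emission and transition score matrices for the test song. As a hedged simplification, the best-path (Viterbi) score stands in for the predicted label sequence's unnormalized score, and subtracting log Z via the forward recursion gives its log posterior; the patent's full forward-backward computation is not reproduced.

```python
import numpy as np

def viterbi_score(emit, trans):
    """Unnormalized score of the best label sequence under one CRF model."""
    alpha = emit[0]
    for t in range(1, len(emit)):
        alpha = emit[t] + (alpha[:, None] + trans).max(axis=0)
    return alpha.max()

def log_Z(emit, trans):
    """log of the normalization factor via the forward recursion."""
    alpha = emit[0]
    for t in range(1, len(emit)):
        alpha = emit[t] + np.log(np.exp(alpha[:, None] + trans).sum(axis=0))
    return np.log(np.exp(alpha).sum())

def classify(models):
    """models: per region class j, a pair (emit_j, trans_j) of score matrices
    produced by that class's CRF for the test song.  Returns argmax_j of the
    log posterior of the predicted label sequence, cf. formula (3)."""
    log_posts = [viterbi_score(e, tr) - log_Z(e, tr) for e, tr in models]
    return int(np.argmax(log_posts))

rng = np.random.default_rng(3)
N, K, J = 10, 4, 3                     # 10 frames, 4 labels, 3 region models
models = [(rng.normal(size=(N, K)), rng.normal(size=(K, K))) for _ in range(J)]
print(classify(models))                # index of the predicted region class
```

Since the Viterbi score can never exceed the log-sum-exp of all paths, each log posterior is at most 0, mirroring the fact that formula (3) compares genuine probabilities.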
Referring to table 2, the confusion-matrix results for classifying Shanxi, Jiangsu and Hunan folk songs show that the conditional-random-field-based method for classifying Chinese folk songs provided by the invention obtains better classification results.
Table 2: accuracy of the proposed classification method, evaluated with a classification confusion matrix.
[Table 2 is an image in the original document: the classification confusion matrix for Shanxi, Jiangsu and Hunan folk songs.]
Referring to table 3, the regional classification accuracy of the invention's conditional-random-field-based method for Chinese folk songs is higher than that of existing folk song classification methods.
Table 3: accuracy comparison of classification algorithms for folk songs of different regional styles.
[Table 3 is an image in the original document: accuracy comparison between the proposed method and existing folk song classification algorithms.]

Claims (4)

1. A method for classifying Chinese folk songs by region based on a conditional random field, characterized in that the audio features of the music are extracted first, then an audio-feature temporal model is established based on a conditional random field while the label sequence is computed with a restricted Boltzmann machine, the parameters are learned with a quasi-Newton algorithm and k-step contrastive divergence, and finally the regional classification of Chinese folk songs is realized, specifically comprising the following steps:
1) audio feature extraction of music: comprising the selection of audio features, the framing of music pieces and the extraction of music audio features, specifically as follows,
1-1) selection of audio features: analyzing the audio signal in different transform domains, time-domain, frequency-domain and cepstral-domain features are selected as the audio features characterizing the timbre and melody of the music; the selected audio features comprise the short-time average zero-crossing rate, spectral centroid, spectral flux, spectral roll-off point, Chroma features, linear prediction cepstral coefficients and Mel-frequency cepstral coefficients: 7 feature types totaling 86 dimensions;
1-2) Framing of music segments: considering the short-time stationarity of music audio, the music audio m is sampled into consecutive short-time segments m = {m_1, m_2, ..., m_i, ..., m_N}, where m_i is called a "frame" and N denotes the sequence length;
1-3) Extraction of music audio features: the features selected in 1-1) are extracted from the music audio segment m in units of frames as v = {v_1, v_2, ..., v_i, ..., v_N}, where v_i ∈ R^d is the feature vector of the i-th frame containing d-dimensional data;
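The framing and frame-wise feature extraction of steps 1-2) and 1-3) can be sketched as below. This is a minimal illustration, not the patent's implementation: the signal, frame length, hop size, and the two features shown (short-time zero-crossing rate and spectral centroid, two of the seven feature types named in 1-1)) are assumptions chosen for demonstration.

```python
import numpy as np

def frame_signal(m, frame_len=1024, hop=512):
    """Split the audio signal m into consecutive short-time frames m_1..m_N."""
    n_frames = 1 + (len(m) - frame_len) // hop
    return np.stack([m[i * hop: i * hop + frame_len] for i in range(n_frames)])

def zero_crossing_rate(frames):
    """Fraction of adjacent-sample sign changes within each frame."""
    signs = np.sign(frames)
    signs[signs == 0] = 1                       # treat exact zeros as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def spectral_centroid(frames, sr):
    """Magnitude-weighted mean frequency (Hz) of each frame's spectrum."""
    win = np.hanning(frames.shape[1])           # window to suppress leakage
    mag = np.abs(np.fft.rfft(frames * win, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    return (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-12)

# Per-frame feature sequence v = {v_1, ..., v_N}, here with d = 2 dimensions.
sr = 8000
t = np.arange(sr) / sr
m = np.sin(2 * np.pi * 440 * t)                 # one second of a 440 Hz tone
frames = frame_signal(m)
v = np.column_stack([zero_crossing_rate(frames), spectral_centroid(frames, sr)])
```

For a pure 440 Hz tone the per-frame zero-crossing rate is close to 2·440/8000 = 0.11 and the spectral centroid is close to 440 Hz, so the two columns of v behave as expected.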
2) Establishing the audio feature time-sequence model: considering the temporal relation among the audio features of the frames, folk songs of different regional styles are modeled with conditional random fields; the specific operation is as follows: the high-level feature sequence of each folk song in a regional class of the training set is taken as the observation sequence of a conditional random field, a model is established for each regional class of folk songs in the parameterized form of the conditional random field, and the conditional random field model parameters representing each regional class are computed by a parameter-solving method; the specific steps are as follows:
(i) The audio frame feature sequence of any folk song in a regional-class sample set of the training set is taken as the observation sequence of a conditional random field, and the folk song sample set of each regional class is fitted with the probability form of the conditional random field model, i.e.

P(y^(t) | x^(t)) = (1 / Z(x^(t))) exp( Σ_k w_k f_k(x^(t), y^(t)) )

where Z(x^(t)) is a normalization factor and f_k(x^(t), y^(t)) = Σ_{i=1}^{N} f_k(y_{i-1}^(t), y_i^(t), x^(t), i) is the k-th feature function summed over all positions of the sequence; the feature functions comprise state feature functions s_l and transition feature functions t_k, through which the regional-class label of each frame in the song's audio frame sequence and the transition paths between frames are computed; w_k is the weight corresponding to feature function f_k, comprising the state weights μ_l of the state feature functions s_l and the transition weights λ_k of the transition feature functions t_k;
(ii) The conditional random field model parameters w_k representing each regional class of folk songs are computed with the BFGS algorithm, a quasi-Newton method; according to the maximum likelihood estimation rule, parameter solving takes the logarithm of the probability density function of the data set satisfying the conditional random field to obtain the log-likelihood function L(w), which is taken as the objective function and iteratively updated so as to maximize the log-likelihood over all songs, thereby obtaining the optimal model parameters w*;
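The parameterized form of step (i) — unary state scores plus label-transition scores, normalized by Z(x) — can be illustrated with a toy linear-chain model. The sketch below computes log Z(x) with the forward recursion and verifies by brute-force enumeration that P(y|x) is a proper distribution; the random toy scores stand in for the learned weighted feature functions, and the BFGS training itself is not shown.

```python
import numpy as np
from itertools import product

def logsumexp_cols(M):
    """Numerically stable log(sum(exp(M), axis=0))."""
    m = M.max(axis=0)
    return m + np.log(np.exp(M - m).sum(axis=0))

def log_Z_forward(U, T):
    """log of the normalization factor Z(x), via the forward recursion."""
    alpha = U[0].astype(float)
    for i in range(1, len(U)):
        alpha = U[i] + logsumexp_cols(alpha[:, None] + T)
    return float(np.logaddexp.reduce(alpha))

def log_prob(y, U, T):
    """log P(y|x) = (sum of weighted feature scores along y) - log Z(x)."""
    score = U[0, y[0]] + sum(U[i, y[i]] + T[y[i - 1], y[i]]
                             for i in range(1, len(y)))
    return score - log_Z_forward(U, T)

rng = np.random.default_rng(0)
N, K = 5, 3                      # sequence length, number of region labels
U = rng.normal(size=(N, K))      # unary scores: weighted state features w·s
T = rng.normal(size=(K, K))      # pairwise scores: weighted transition features w·t

# Sanity check: P(y|x) sums to 1 over all K^N possible label sequences.
total = sum(np.exp(log_prob(y, U, T)) for y in product(range(K), repeat=N))
```

Training would adjust U and T (i.e. the weights w_k) to maximize log P of the observed label sequences, which is what the BFGS step above performs on the full model.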
3) Computing the predicted audio label sequence: the computation of the predicted label sequence based on the conditional random field model comprises learning the high-level music features and determining the label sequence, specifically as follows,
3-1) Learning of high-level music features: the extracted feature sequence v is taken as the input of a restricted Boltzmann machine, network learning is performed with k-step contrastive divergence, and the abstract features computed at the hidden layer are taken as the high-level music features x = {x_1, x_2, ..., x_i, ..., x_N}, where x_i ∈ R^d is the high-level music feature vector of the i-th frame containing d-dimensional data;
3-2) Determination of the label sequence: the high-level music feature vector x is taken as the observed value of the conditional random field observation sequence, the feature functions are computed with the restricted Boltzmann machine, and the conditional random field label sequence y = {y_1, y_2, ..., y_i, ..., y_N} is obtained, where y_i denotes the regional-class label corresponding to the i-th high-level feature x_i;
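Step 3-1) can be sketched as a minimal binary restricted Boltzmann machine trained with k-step contrastive divergence, whose hidden-layer activations serve as the high-level feature sequence x. The toy data, layer sizes, learning rate and epoch count are illustrative assumptions, not the patent's configuration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_rbm_cdk(V, n_hidden, k=1, lr=0.1, epochs=300, seed=0):
    """Train a binary RBM with k-step contrastive divergence (CD-k)."""
    rng = np.random.default_rng(seed)
    n_vis = V.shape[1]
    W = 0.01 * rng.normal(size=(n_vis, n_hidden))
    b, c = np.zeros(n_vis), np.zeros(n_hidden)
    for _ in range(epochs):
        ph = sigmoid(V @ W + c)                   # positive-phase hidden probs
        vk = V.copy()
        for _ in range(k):                        # k steps of Gibbs sampling
            hk = (rng.random((len(V), n_hidden))
                  < sigmoid(vk @ W + c)).astype(float)
            vk = sigmoid(hk @ W.T + b)            # mean-field visible reconstruction
        phk = sigmoid(vk @ W + c)                 # negative-phase hidden probs
        W += lr * (V.T @ ph - vk.T @ phk) / len(V)
        b += lr * (V - vk).mean(axis=0)
        c += lr * (ph - phk).mean(axis=0)
    return W, b, c

# Toy frame-feature sequence v: 4 "frames" of 4-dimensional binary features.
V = np.array([[1, 1, 0, 0], [1, 1, 0, 0],
              [0, 0, 1, 1], [0, 0, 1, 1]], dtype=float)
W, b, c = train_rbm_cdk(V, n_hidden=2, k=1)
x = sigmoid(V @ W + c)    # hidden activations = high-level features x_i
```

Each row of x is the abstract hidden-layer representation of one frame, playing the role of the high-level feature vectors x_i fed to the conditional random field.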
4) Realizing music regional classification: the regional class of the song is identified according to the obtained models corresponding to the different regional styles.
2. The conditional-random-field-based regional classification method for Chinese folk songs according to claim 1, wherein in step 3-1) a restricted Boltzmann machine is used to model the audio features of the folk songs, the originally input features are mapped through a nonlinear spatial transformation, and the high-level feature sequence learned by the hidden layer is taken as the observation sequence, thereby increasing the differences between the original audio feature frames; the parameters of the restricted Boltzmann machine are estimated according to the maximum log-likelihood estimation rule: first, the network parameters are pre-trained with the k-step contrastive divergence algorithm; then, the parameters computed by the k-step contrastive divergence algorithm are taken as the initial values of the network, and the network parameters are fine-tuned with the back-propagation algorithm; finally, the optimal network parameters are obtained.
3. The method for classifying Chinese folk songs based on the conditional random field according to claim 2, wherein step 3-2) comprises the following process: first, a Softmax layer is added after the hidden layer of the restricted Boltzmann machine, each unit in the Softmax layer representing one regional style class of folk songs, so that the model has a discrimination mechanism; then the label state of the high-level feature at each position is determined and the corresponding dimension of the state feature function is set to 1, finally yielding the label sequence of the song; the specific steps are as follows:
(i) A restricted Boltzmann machine discrimination model is established for all audio frame features in the data set, i.e. a Softmax layer is added after the hidden layer of the restricted Boltzmann machine, each unit in the Softmax layer representing one regional style class of folk songs, so that the restricted Boltzmann machine has a discrimination mechanism;
(ii) When the regional-class label state at any position of the label sequence is computed, the abstract feature learned at that position is assigned its label by formula (2), i.e. the corresponding dimension of the state feature function s_l is set to 1:

y_i = argmax_j P(y_i = j | x_i; U_j)    (2)

where argmax_j denotes taking the value of j at which the probability P(·) is maximal, and U_j denotes the weights on the connections to the j-th node of the Softmax layer; after the regional-class labels are obtained, the transition feature function t_k between any two adjacent positions is determined;
(iii) Finally, the label sequence of the t-th song in the data set is obtained.
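The labeling rule of step (ii) — formula (2)'s argmax over the Softmax layer's class probabilities, followed by reading off adjacent-label transitions — can be sketched as below. The feature values and the weight matrix U are made-up illustrative numbers.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical high-level features x_i (rows) and Softmax-layer weights:
# column j of U connects the hidden layer to the j-th regional-class unit.
x = np.array([[2.0, 0.1], [1.8, 0.2], [0.1, 2.2]])   # 3 frames, 2 dims
U = np.array([[3.0, -1.0, 0.0], [-1.0, 3.0, 0.5]])   # 2 dims -> 3 classes

P = softmax(x @ U)                       # P(y_i = j | x_i) for each frame
y = P.argmax(axis=1)                     # formula (2): label = argmax_j P(.)
transitions = list(zip(y[:-1], y[1:]))   # adjacent labels fix t_k
```

Here the first two frames receive class 0 and the third receives class 1, so the transition pairs are (0, 0) and (0, 1), which is the information the transition feature functions t_k record.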
4. The method for classifying Chinese folk songs based on the conditional random field according to claim 3, wherein the specific calculation process of step 4) is as follows: first, the high-level features obtained from the song are taken as observation sequences of the conditional random fields and fed into each trained conditional random field; then the posterior probability of the predicted label sequence generated by each conditional random field is computed by the forward-backward algorithm; finally, the class represented by the conditional random field with the maximum posterior probability is selected as the predicted regional style class of the test song; the specific steps are as follows:
(i) The high-level features obtained from each test song are taken as an observation sequence of the conditional random fields and fed into each trained conditional random field, and the posterior probability of the label sequence generated by each conditional random field given the observation sequence is computed by the forward-backward algorithm; here the label sequence for a given observation sequence refers to the predicted label sequence of that observation sequence under the conditional random field, as computed by the restricted Boltzmann machine;
(ii) The class represented by the conditional random field with the maximum posterior probability is selected as the predicted regional style class of the test song, i.e. formula (3) is satisfied:

j* = argmax_j P(y^(j) | x)    (3)

where argmax_j denotes taking the value of j at which the probability P(·) is maximal.
CN201910395241.3A 2019-05-13 2019-05-13 Chinese folk song geographical classification method based on conditional random field Active CN110189768B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910395241.3A CN110189768B (en) 2019-05-13 2019-05-13 Chinese folk song geographical classification method based on conditional random field


Publications (2)

Publication Number Publication Date
CN110189768A CN110189768A (en) 2019-08-30
CN110189768B true CN110189768B (en) 2021-02-02

Family

ID=67716108



Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110853604A (en) * 2019-10-30 2020-02-28 西安交通大学 Automatic generation method of Chinese folk songs with specific region style based on variational self-encoder

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106056128A (en) * 2016-04-20 2016-10-26 北京航空航天大学 Remote sensing image classification marking method based on composite graph conditional random field
CN106328121A (en) * 2016-08-30 2017-01-11 南京理工大学 Chinese traditional musical instrument classification method based on depth confidence network
CN106952644A (en) * 2017-02-24 2017-07-14 华南理工大学 A kind of complex audio segmentation clustering method based on bottleneck characteristic




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant