CN102841932A

CN102841932A - Content-based voice frequency semantic feature similarity comparative method

Info

Publication number: CN102841932A
Application number: CN2012102772962A
Authority: CN
Inventors: 严勤; 张二芬
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2012-08-06
Filing date: 2012-08-06
Publication date: 2012-12-26

Abstract

The invention relates to a content-based voice frequency semantic feature similarity comparative method. The method includes extracting music with a frame size of 5 seconds and a frame shift of 0.5 seconds and then extracting characteristic parameters in the music; composing the characteristic parameters into to characteristic vectors; constructing a word bank including description of 174 keywords and then constructing a Hidden Markov Model by using each keyword as a model and the characteristic vectors as a training sample; outputting a probability polynomial for the Hidden Markov Model to obtain probability distribution based on the keywords; and comparing similarity based on the given keywords according to a KL formula. According to the method, 174 categories are provided for each song so that high level semantic description of each piece of music is elaborately conducted. Moreover, by adoption of the Hidden Markov Model to build a model of semantic keywords, substantive characteristics capable of representing the music and the high level semantic description are connected to remedy a semantic vacancy from a low level to a high level.

Description

A kind of content-based audio frequency semantic feature similarity comparative approach

Technical field

The present invention relates to a kind of Audio Processing and mode identification technology, relate in particular to a kind of similarity comparative approach based on hidden Markov model.

Background technology

The research of content-based audio frequency semantic feature similarity comparative approach; Be based on the music retrieval of content and an important branch in music recommend field; Specifically be meant through the audio frequency characteristics analysis; Different voice datas are composed with different semantics, made audio frequency keep similar acoustically with identical semanteme.Because music and people's sense of hearing perception is closely related; It has more passed on a kind of emotion; A kind of mood of very difficult quantification, this specific character of music have determined that extrinsic informations such as used title of the song, singer are to music analysis and inapplicable in the systematic searching technology of audio frequency.Therefore find some characteristic that can characterize music and how the high-layer semantic information of music is described all and be badly in need of very much.

How to extract the lower level characteristic (tone, melody and rhythm etc.) in the music, make unordered audio frequency become in proper order in order, the audio retrieval technology that is based on content realizes key in application.Present research all is based on a certain audio frequency characteristics; Such as the cepstrum coefficient MFCC that has extracted based on the Mei Er frequency; Or earlier with in human auditory's perception etc. characteristics such as loudness pre-emphasis, intensity, loudness carry out a series of engineering simulation; Thereby adopt all-pole modeling to carry out linear prediction analysis afterwards and obtain corresponding LPC coefficient, the research that also has uses the behavioral characteristics of MFCC or LPC to portray the time-varying characteristics of sound signal, i.e. the single order of primitive character and second order difference.For music content, it is not enough that the low layer acoustic feature is only arranged, and high-level semantic notion how to describe music also is a key issue.Along with the raising of living condition, People more and more is paid attention to the cultivation of spirit taste, in the different occasions people music that demand is different, the purposes of music has been proposed more and clearer and more definite and careful requirement, and these requirements are that traditional research can't realize.

Summary of the invention

Goal of the invention: the objective of the invention is to: a kind of content-based audio frequency semantic feature similarity comparative approach is provided; It can extract the characteristic parameter of music signal; And utilize the characteristic parameter that is extracted to set up HMM based on semantic keyword, can compare the similarity of the semantic feature of music then according to probability model.

Technical scheme: embodiments of the invention are realized through following technical scheme:

A kind of content-based audio frequency semantic feature similarity comparative approach comprises the steps:

1) extract frame length 5s, frame moves the music of 0.5s, then extracts the characteristic parameter in the music;

2) with above-mentioned characteristic parameter composition characteristic vector;

3) make up the lexicon that 174 keywords are described, then with each keyword as model, be training sample structure hidden Markov model with the eigenvector;

4), obtain probability distribution based on keyword to the hmm output probability polynomial expression;

5) relatively compare according to the KL formula based on the similarity of given keyword.

Described hidden Markov model building method comprises the steps:

1) according to formula

b_{j} (o_{t}) = Π_{s = 1}^{S} [Σ_{m - 1}^{Ms} c_{Jsm} N (o_{St}, μ_{Jsm}, Σ_{Jsm})] r_{s}

Obtain the probability b of state output observed reading;

Wherein N is a Gaussian probability-density function, and O is that the characteristic coefficient of music is an observation sequence, and μ, ∑, c are respectively average, variance and weight coefficient, and M is that the Gauss that each state comprises mixes first number;

2) iterations is set,, and adds up and obtain ∑ with the probability P (O/ λ) of all training audio frequency observation sequences of Viterbi algorithm computation HMM output ₁In, with the Baum-Welch algorithm model parameter is carried out revaluation again, obtain λ 1, again with the probability P (O/ λ 1) of all training audio frequency observation sequences of Viterbi algorithm computation HMM output, and add up and obtain ∑ ₂In;

3) with ∑ ₁And ∑ ₂The result compares, and whether judged result less than predetermined threshold value be, then need not to carry out revaluation and calculates, and λ 1 as result of calculation output, is then done new round computing with λ 1.

Said initial probability is taken as [1.0 1.0 1.0], and state transition probability is taken as

[\begin{matrix} 0.0 & 1.0 & 0.0 \\ 0.0 & 0.6 & 0.4 \\ 0.0 & 0.0 & 0.0 \end{matrix}]

The said method that obtains probability distribution is following:

According to formula

P (i | χ) = \frac{{(Π_{t = 1}^{T} p (x_{t} | i))}^{\frac{1}{T}}}{Σ_{v = 1}^{| v |} {(Π_{t = 1}^{T} p (x_{t} | v))}^{\frac{1}{T}}}

Calculate the probability that each word occurs in a first song, obtain the probability vector of all keywords in this first song then, i=1 wherein ..., | v |, what p (i) represented is that certain keyword will appear at the prior probability in certain first song, p (i)=1/|v|, x={x ₁..., x _T, T is the frame number by frame extraction characteristic of every first song.

Described similarity comparison step is following:

1) particular keywords of the given inquiry song of selection obtains inquiring about the semantic polynomial expression q of song;

2) through the KL distance between each semantic polynomial expression p of

computing semantic polynomial expression q and lane database, wherein v is the dictionary of selecting for use.

The characteristic parameter of said sound signal is a frequency spectrum parameter, and said frequency spectrum parameter comprises: rhythm, melody sharpness, homophony, tone, tone sharpness, tone center, accent intensity.

In extracting the characteristic parameter process, music file converted to the audio frequency of monophony wav form, the position speed of every section music is 256kbps, and sample size is 16, and SF is 16kHz.

Beneficial effect: the present invention has provided 174 classifications for each first song, and this has just carried out detailed high-level semantic to each song and has described.And adopt hidden Markov model to set up the model of semantic keyword, can represent the essential characteristic of music to dock then, remedy low layer to high-rise semantic vacancy with the high-level semantic description.

Description of drawings

Fig. 1 is the structural drawing of embodiment provided by the invention;

Fig. 2 is the characteristic parameter extraction process flow diagram of embodiment provided by the invention;

Fig. 3 is the HMM model training process flow diagram of embodiment provided by the invention.

Embodiment

Below in conjunction with Figure of description the present invention is further detailed:

Song that the present invention uses and semantic keyword be from Computer Audio Lab 500 databases, and the 500 first pop musics that this musical database is concentrated come from the artist of 500 nearest 50 years country variants.The country that these music are different in the whole world is all popular each other, and the people of different countries is familiar with these songs, and CAL500 has just had global representativeness like this, has solved the difference that the region of music is brought.The description that CAL500 carries out the semantic feature of music comprises many-side, for example comprises audio frequency understanding, performance emotion, the form of expression etc.This database has obtained widely using, and comprises the semantic tagger and the retrieval work of music for The Characteristics in the future, can be as a test set commonly used.

Embodiment provided by the invention is based on the audio frequency semantic feature similarity comparative approach of content; Its structure is as shown in Figure 1, comprising: characteristic parameter extraction module, high-layer semantic information describing module, HMM set up module, labeling module and similarity comparison module.

Signal transitive relation between each module is following:

Input signal frame gets into said characteristic parameter extraction module, in this module, the digital audio signal sequence of input is extracted characteristic.

In the high-layer semantic information describing module, the first song of each in the storehouse is at least by 3 audience's auditions and carry out the marking of keyword label.174 specific word relevant with music have been formed semantic label (semantic lables) in the final data storehouse, and promptly the first song of each in the storehouse all has this 174 labels, and each label all has the marking value, and score value is distributed between 0 to 1.

Setting up in the module of HMM, at first set up training sample and test sample book based on the characteristic parameter that extracts, the present invention is directed to 500 first songs, select 425 head as training sample set at random, 75 remaining first songs are as the test sample book collection.Each keyword root is according to separately mark value, selected value greater than 0 song as training sample.

In labeling module; The keyword model that HMM is trained out carries out probability calculation, so for the first new song of a first song, particularly one outside the storehouse; Probability vector with all training patterns that obtain this first song is called semantic polynomial expression with this probability vector.By this semantic polynomial expression that obtains, we just can obtain the degree of correlation of a first semantic keyword of singing.

At the similarity comparison module, we use the KL distance that the semantic polynomial expression of one first song of appointment and the semantic polynomial expression of the song in the storehouse are carried out

Describe in the face of the concrete processing procedure of each module down, as follows:

One, characteristic parameter extraction module

The principle of work of characteristic parameter extraction module is as shown in Figure 2, and its major function is to extract the characteristic parameter of input audio signal, mainly is frequency spectrum parameter; Said frequency spectrum parameter mainly comprises: rhythm (tempo), melody sharpness (pulseclarity), homophony (mode); Tone (key); Tone sharpness (keyclarity), tone center (tonalcentroid) transfers intensity (keystrength).

In extracting the characteristic parameter process, we convert music file the audio frequency of monophony wav form to, and the position speed of every section music is 256kbps, and sample size is 16, and SF is 16kHz, and audio format is PCM.With reference to the MIRtoolbox kit, that extraction time adopts is frame length 5s, and frame moves 0.5s; Extract the above characteristic parameter, obtain the rhythm (tempo) of 1 dimension, the melody sharpness (pulseclarity) of 1 dimension; The homophony (mode) of 1 dimension, the tone (key) of 1 dimension, the tone sharpness (keyclarity) of 1 dimension; The tone center (tonalcentroid) of 6 dimensions; The accent intensity (keystrength) of 24 dimensions, eigenvector when finally forming one 35 tie up long, this step is carried out under the matlab environment.Per song with this song of txt document storing by frame extract eigenvector.

Two, high-layer semantic information describing module

The high-layer semantic information purpose of description is from domestic consumer, to collect training data, and specific practice is the method for music being carried out the keyword mark through user's audition on one side music on one side, and semantic label is provided a definitions set basis clearly.These semantic speech comprise the mark of 18 kinds of expression emotions, like emotion-happy, and not-emotion-happy etc.; The mark of 36 kinds of expression schools, like genre-Pop, genre-Rock etc.; The mark of 29 kinds of music utensils, like instrument-bass, instrument-piano etc., or the like; This data set will reflect the degree of getting in touch between semantic speech and the song, and therefore for each first song, we have also provided the corresponding score value of label when providing these a series of keyword label.Each first song all representes that by a digital vectors score value is distributed between 0 to 1 like this, and this first song of 0 expression is uncorrelated with this keyword, and 1 expression is extremely relevant.

Three, HMM's sets up module

The principle of work of setting up module of HMM is as shown in Figure 3:

HMM (Hidden Markov Model; Hidden Markov model) is a kind of of Markov chain; Its state can not observe directly; But can observe each observation vector through the observation vector sequence all is to show as various states through some probability density distribution, and each observation vector is to be produced by a status switch with response probability Density Distribution.It is a kind of probability model that is used to describe the statistics of random processes characteristic with parametric representation, is a dual random process, is made up of two parts: Markov chain and general random process.Wherein Markov chain is used for the transfer of the state of describing, and describes with transition probability.The general random process is used for the relation between description state and observation sequence, describes with the observed value probability.

For the HMM model, its state conversion process can not be observed, thereby is referred to as " concealing " Markov model.

HMM is defined as follows:

(1) X represents the set of one group of state, wherein X={S ₁, S ₂... S _N, status number is N, and uses q _τRepresent t state constantly.Though state is hidden,, there is the meaning of some physics all relevant with state or state set for plurality of applications.The inner contact of state is exactly can be to other state from a state;

(2) O represents one group of set O={V that can observe symbol ₁, V ₂... V _M, M is the number of the different observed value that possibly export from each state;

(3) state transition probability distribution A={a _Ij, a here _Ij=P{q _I-1=S _j| q _τ=S _i, 1＜i, j≤N.In particular cases, each state can a step arrive other any state, at this moment to (i j) has a arbitrarily _Ij＞O.HMM for other has a _Ij=0 (for one or more pairs of i, j);

(4) the observation probability distribution B={b of state j _j(k) }, expression state j exports the probability of corresponding observed value, wherein b _j(k)=P{O _τ=V _k| q _τ=S _j, 1≤j≤N, 1≤k≤M;

(5) init state distribution π=π _i, π _i=P{q ₁=S _i, 1≤i≤N;

By last, HMM can be defined as a five-tuple λ:

λ＝(X，O，π，A，B)

Or be abbreviated as

λ＝(π，A，B)

Above three key elements of said HMM are actual can separated into two parts, one of which is the Markov chain, is described by, A, another part is a stochastic process, is described by B.

HMM training, i.e. parameter estimation problem, a given observation sequence O will confirm a λ model through certain method, makes P (O/ λ) maximum.

The embodiment of the invention is taken as [1.0 1.0 1.0] with initial probability, and state transition probability is taken as

[\begin{matrix} 0.0 & 1.0 & 0.0 \\ 0.0 & 0.6 & 0.4 \\ 0.0 & 0.0 & 0.0 \end{matrix}];

According to the mixed Gaussian function:

b_{j} (o_{t}) = Π_{s = 1}^{S} [Σ_{m - 1}^{Ms} c_{jsm} N (o_{st}, μ_{jsm}, Σ_{jsm})] r_{s};

Obtain parameter b, b is the probability of state output observed reading, and wherein N is a Gaussian probability-density function, and O is that the characteristic coefficient of music is an observation sequence, and μ, ∑, c are respectively average, variance and weight coefficient, and M is that the Gauss that each state comprises mixes first number;

After the initial model parameter, iterations is set, with the probability P (O/ λ) of all training audio frequency observation sequences of Viterbi algorithm computation HMM output; And add up and obtain with the Baum-Welch algorithm model parameter being carried out revaluation again in the ∑ 1, obtain λ 1; Again with the probability P (O/ λ 1) of all training audio frequency observation sequences of Viterbi algorithm computation HMM output, and add up and obtain in the ∑ 2, ∑ 1 is compared with ∑ 2 results; Whether judged result is then to need not to carry out revaluation and calculate less than predetermined threshold value; λ 1 as result of calculation output, is then done new round computing with λ 1.

Four, labeling module

The function of labeling module is the first song for appointment, and the best keyword that can provide this first song is described, and our method is to use bayes rule to remove to calculate the prior probability of each keyword in the lexicon.

According to Bayesian formula:

P (i | χ) = \frac{p (χ | i) p (i)}{p (χ)}

I=1 arbitrarily wherein ..., | v |, what p (i) represented is that certain keyword will appear at the prior probability in certain first song, we can define a unified standard, p (i)=1/|v |, x={x ₁..., x _T, T is the frame number by frame extraction characteristic of every first song, can be independent in short-term with regarding as between every first frame of singing and the frame, so, According to total probability formula, we can know,

p (χ) = Σ_{v = 1}^{| v |} p (χ | v) p (v),

Like this,

P (i | χ) = \frac{{(Π_{t = 1}^{T} p (x_{t} | i))}^{\frac{1}{T}}}{Σ_{v = 1}^{| v |} {(Π_{t = 1}^{T} p (x_{t} | v))}^{\frac{1}{T}}} - - - (1)

We use formula (1) can calculate the probability that each word occurs in a first song.For a first song, the probability vector with all keyword models that obtain this first song is called semantic polynomial expression with this probability vector, can arrange out the degree of correlation of these relevant semantic keywords, and the best keyword that can provide this first song is described.

Five, similarity comparison module

Similarity comparison module function is, the given song that will inquire about, and the particular keywords that selection will be inquired about, we at first obtain inquiring about the semantic polynomial expression of song, are called for short the inquiry polynomial expression.We calculate the KL distance between each the semantic polynomial expression p that inquires about polynomial expression q and our lane database then, and through the calculating of distance, we can compare based on the similarity degree between the song of some keyword.

KL distance calculation formula is following:

KL (q | | p) = Σ_{i = 1}^{| v |} q_{i} \log \frac{q_{i}}{p_{i}}

V is the dictionary that we select for use, comprises 174 independently semantic keywords.Use this method, we can also accomplish the recommendation based on particular keywords, can satisfy user's different demands.

Claims

1. a content-based audio frequency semantic feature similarity comparative approach is characterized in that: comprise the steps:

2. a kind of content-based audio frequency semantic feature similarity comparative approach according to claim 1, it is characterized in that: described hidden Markov model building method comprises the steps:

1) obtains the probability b that state is exported observed reading according to formula ;

3. a kind of content-based audio frequency semantic feature similarity comparative approach according to claim 2; It is characterized in that: said initial probability is taken as [1.0 1.0 1.0], and state transition probability is taken as

.

4. a kind of content-based audio frequency semantic feature similarity comparative approach according to claim 1 is characterized in that: the said method that obtains probability distribution is following:

According to formula

5. a kind of content-based audio frequency semantic feature similarity comparative approach according to claim 1, it is characterized in that: described similarity comparison step is following:

2) through the KL distance between each semantic polynomial expression p of computing semantic polynomial expression q and lane database, wherein v is the dictionary of selecting for use.

6. a kind of content-based audio frequency semantic feature similarity comparative approach according to claim 1; It is characterized in that: the characteristic parameter of said sound signal is a frequency spectrum parameter, and said frequency spectrum parameter comprises: rhythm, melody sharpness, homophony, tone, tone sharpness, tone center, accent intensity.

7. a kind of content-based audio frequency semantic feature similarity comparative approach according to claim 1; It is characterized in that: the audio frequency that in extracting the characteristic parameter process, music file is converted to monophony wav form; The position speed of every section music is 256kbps; Sample size is 16, and SF is 16kHz.