CN109166591B - Classification method based on audio characteristic signals - Google Patents

Classification method based on audio characteristic signals

Info

Publication number
CN109166591B
CN109166591B (granted publication of application CN201810994308.0A)
Authority
CN
China
Prior art keywords
audio
function
classification
feature
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810994308.0A
Other languages
Chinese (zh)
Other versions
CN109166591A (en)
Inventor
龙华 (Long Hua)
杨明亮 (Yang Mingliang)
邵玉斌 (Shao Yubin)
杜庆治 (Du Qingzhi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN201810994308.0A priority Critical patent/CN109166591B/en
Publication of CN109166591A publication Critical patent/CN109166591A/en
Application granted granted Critical
Publication of CN109166591B publication Critical patent/CN109166591B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - …characterised by the type of extracted parameters
    • G10L25/24 - …the extracted parameters being the cepstrum
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The invention relates to a classification method based on audio characteristic signals, belonging to the technical field of audio signal processing. The method classifies dimension-reduced audio characteristic signals using a Gaussian kernel function together with Bayesian prior knowledge. A classification algorithm based on audio characteristic signals can be used for audio broadcast monitoring, artificial-intelligence speech recognition, audio scene discrimination, and similar tasks. The invention performs audio classification on coefficient-domain features of the audio characteristic signal, which gives better universality and stability than prior approaches that classify based on audio content. By exploiting the strong nonlinear properties of the Gaussian kernel function and an efficient optimization algorithm, the method avoids the drawbacks of linear mappings: a narrow range of application scenes, low running speed, and poor classification performance. The underlying theory is simple, the algorithm is easy to implement in code, and it is practical in engineering projects.

Description

Classification method based on audio characteristic signals
Technical Field
The invention relates to a classification method based on audio characteristic signals, and belongs to the technical field of audio characteristic signal processing.
Background
Improving the efficiency and accuracy of recognition based on audio signals is important, and audio feature classification also occupies a significant position in the audio monitoring and control of wireless broadcasting, so research on classification algorithms for audio characteristic signals is particularly important. The main classification algorithms at present include Bayesian classifiers, decision tree algorithms, support vector machines, and others; most suffer from poor classification performance, algorithmic complexity, heavy computation, or difficulty of implementation. The algorithm presented here exploits the strong nonlinear properties of the Gaussian kernel function combined with Bayesian prior theory; it obtains satisfactory results on the problem of classifying dimension-reduced audio characteristic signals and has also shown excellent performance in practical engineering.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a classification method based on audio characteristic signals: first extract audio characteristic parameters from the audio signal and reduce their dimensionality, then feed the dimension-reduced feature parameters into a constructed classification model and judge the category of a test point by the similarity probability between input and output, thereby achieving audio classification, i.e. audio identification.
The technical scheme of the invention is as follows: a classification method based on audio characteristic signals. The method specifically comprises the following steps:
(1) audio signal acquisition: and acquiring an audio signal to obtain an audio sample.
(2) Audio signal preprocessing: and converting the analog signals in the collected audio samples into digital signals, and writing the digital signals into the WAV file. And filtering and framing the digital signals to be written into the WAV file.
(3) Characteristic parameter extraction: extract, in software, the high-dimensional characteristic parameters of the preprocessed audio signals: Linear Prediction Coefficients (LPC), Linear Prediction Cepstrum Coefficients (LPCC), and Mel Frequency Cepstrum Coefficients (MFCC).
(4) Reducing the dimension of the characteristic parameter: and sending the extracted audio characteristic parameters into a built dimension reduction model for dimension reduction treatment, and storing the dimension-reduced audio characteristic parameters into a table.
(5) Building a classification model: first, describe the similarity of one class to the other using an implicit function f that follows a Gaussian distribution; second, compress the output value of f into the range [0, 1] using a compression function. The compressed value is the similarity between the two classes, and classes are distinguished according to this similarity; the model thus built is the required classification model.
(6) Audio feature parameter classification: and (5) sending the audio characteristic quantity subjected to the dimensionality reduction in the step (4) into the classification model in the step (5) for audio characteristic parameter classification, and performing data visualization display on a classification result.
In the audio collection of step (1), an audio sample is collected with an audio collection device; when collecting the audio signal, the device sets the sampling frequency (which must satisfy the Nyquist sampling theorem), the number of sampling channels (set according to the collection object), and the quantization precision.
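As a minimal sketch of steps (1)-(2), the snippet below writes a digitized sample to a WAV container with Python's standard `wave` module. The 440 Hz tone stands in for real acquisition hardware, and the parameter values (8000 Hz, 16-bit, mono) are illustrative choices consistent with the text: 8000 Hz exceeds twice the 3400 Hz upper band edge, satisfying the Nyquist criterion.

```python
import io
import wave

import numpy as np

SR = 8000        # sampling rate: > 2 * 3400 Hz, satisfying the Nyquist criterion
BITS = 16        # quantization precision in bits
CHANNELS = 1     # number of sampling channels (mono)

# Stand-in for a real capture device: a synthetic 440 Hz tone, 1 second long.
t = np.arange(SR) / SR
pcm = (0.5 * np.sin(2 * np.pi * 440.0 * t) * 32767).astype(np.int16)

buf = io.BytesIO()                      # in-memory WAV file
with wave.open(buf, "wb") as w:
    w.setnchannels(CHANNELS)
    w.setsampwidth(BITS // 8)           # bytes per sample
    w.setframerate(SR)
    w.writeframes(pcm.tobytes())

buf.seek(0)
with wave.open(buf, "rb") as w:
    n_frames = w.getnframes()
    rate = w.getframerate()
```

In a real deployment the in-memory buffer would be replaced by a file on disk and the synthetic tone by the device's analog-to-digital output.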
In the above classification method based on audio characteristic signals, the signal preprocessing in step (2) includes the following steps:
(1) Filter the collected audio signal x(n) using a rectangular window function w(n) (upper cutoff frequency generally f_H = 3400 Hz, lower cutoff frequency f_L = 60–100 Hz) to obtain the signal y_a(n), where

  w(n) = 1 for 0 ≤ n ≤ N − 1, w(n) = 0 otherwise, and y_a(n) = x(n)·w(n).
(2) Because the audio signal is not stationary, it is not suitable for direct extraction of characteristic parameters. The filtered audio signal y_a(n) is therefore divided into a number of segments; one segment is called a frame, and each frame spans 10–30 ms. Adjacent frames partially overlap; the overlapped part is called the frame shift, and the frame shift is taken as 1/2 or 1/3 of the frame length.
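The framing rule above can be sketched in a few lines of numpy. The function name and default values (25 ms frames, shift ratio 1/2) are illustrative choices within the ranges the text specifies, not part of the patent.

```python
import numpy as np

def frame_signal(y, sr, frame_ms=25, shift_ratio=0.5):
    """Split a filtered signal y into overlapping frames.

    frame_ms:    frame length in ms (the text suggests 10-30 ms).
    shift_ratio: frame shift as a fraction of the frame length (1/2 or 1/3).
    Returns an array of shape (n_frames, frame_len).
    """
    frame_len = int(sr * frame_ms / 1000)
    shift = int(frame_len * shift_ratio)
    n_frames = 1 + (len(y) - frame_len) // shift
    # Index matrix: row i selects samples [i*shift, i*shift + frame_len)
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return y[idx]

sr = 8000
y = np.random.default_rng(0).standard_normal(sr)  # 1 s of filtered audio
frames = frame_signal(y, sr)                      # 25 ms frames, 12.5 ms shift
```

With a 25 ms frame at 8000 Hz the frame length is 200 samples and the shift 100 samples, so a 1 s signal yields 79 frames, each sharing its second half with the next frame's first half.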
In the above classification method based on audio characteristic signals, the feature parameter extraction of step (3) extracts Linear Prediction Coefficients (LPC), Linear Prediction Cepstrum Coefficients (LPCC), and Mel Frequency Cepstrum Coefficients (MFCC) from the framed audio signals, and stores them in 3 separate tables.
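As one concrete instance of step (3), LPC coefficients for a frame can be computed from its autocorrelation with the Levinson-Durbin recursion. This is a standard method, not the patent's specific implementation; the AR(1) demo signal and all names here are illustrative.

```python
import numpy as np

def lpc(frame, order=10):
    """Linear prediction coefficients via the Levinson-Durbin recursion.

    Returns (a, err): coefficients a[1..order] of the predictor
    x(n) ~ sum_k a_k x(n-k), and the final prediction error power.
    """
    n = len(frame)
    # Autocorrelation r[0..order]
    r = np.array([frame[: n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)          # a[0] unused; a[1..i] are current coefficients
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / e                  # reflection coefficient
        a_prev = a.copy()
        a[i] = k
        for j in range(1, i):
            a[j] = a_prev[j] - k * a_prev[i - j]
        e *= (1.0 - k * k)
    return a[1:], e

# Demo: an AR(1) process x(n) = 0.9 x(n-1) + noise; a first-order LPC fit
# should recover a coefficient close to 0.9.
rng = np.random.default_rng(0)
x = np.zeros(4096)
for i in range(1, 4096):
    x[i] = 0.9 * x[i - 1] + rng.standard_normal()
a, err = lpc(x, order=1)
```

LPCC and MFCC extraction would follow the same per-frame pattern, with a cepstral recursion on the LPC output and a mel filterbank plus DCT respectively.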
In the classification method based on audio characteristic signals, the feature parameter dimension reduction of step (4) obtains the optimal projection direction of the feature vector using the Fisher criterion algorithm, and judges the contribution of each feature component to recognition by adding or removing feature components; combining the two gives a better dimension-reduction result. The larger the Fisher ratio, the more important the dimension component. The Fisher linear discriminant criterion is

  r_Fisher = σ²_between / σ²_within

where r_Fisher is the Fisher ratio (Fisher criterion) of a feature component; σ²_between is the between-class variance of the feature component, i.e. the variance of the mean values of different speech feature components; and σ²_within is the within-class variance of the feature component, i.e. the mean of the variances of the same speech feature components.
For the ρ-th dimension,

  σ²_between(ρ) = (1/γ) Σ_{ε=1}^{γ} ( m_ρ^(ε) − m̄_ρ )²

  σ²_within(ρ) = (1/γ) Σ_{ε=1}^{γ} (1/κ_ε) Σ_{x∈ω_ε} ( x_ρ^(ε) − m_ρ^(ε) )²

where ρ denotes the dimension of the characteristic parameter; m̄_ρ is the mean of the ρ-th dimension component of the speech features over all classes; m_ρ^(ε) is the mean of the ρ-th dimension component over the ε-th class; ω_ε is the speech feature sequence of the ε-th class; γ and κ_ε are the number of classes and the number of samples in each class, respectively; and x_ρ^(ε) is the ρ-th dimension component of an ε-th-class speech feature sequence.
The between-class variance of a feature component reflects the degree of difference between different speech samples, while the within-class variance reflects how tightly samples of the same class cluster; together they characterize the separability of the feature component. The larger the Fisher ratio, the more suitable that dimension's characteristic parameter is as feature information for speech recognition, so the dimension components with the largest Fisher ratios are selected as the dimension-reduction result, achieving the purpose of dimensionality reduction.
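The per-dimension Fisher-ratio selection described above can be sketched as follows. The helper names and toy data are assumptions for illustration; the patent itself only specifies the ratio and the keep-largest rule.

```python
import numpy as np

def fisher_ratios(X, labels):
    """Per-dimension Fisher ratio: variance of the class means (between-class)
    divided by the mean of the per-class variances (within-class)."""
    classes = np.unique(labels)
    class_means = np.array([X[labels == c].mean(axis=0) for c in classes])
    between = class_means.var(axis=0)
    within = np.array([X[labels == c].var(axis=0) for c in classes]).mean(axis=0)
    return between / within

def reduce_dims(X, labels, keep):
    """Keep the `keep` dimensions with the largest Fisher ratio."""
    order = np.argsort(fisher_ratios(X, labels))[::-1]
    return X[:, order[:keep]], order[:keep]

# Toy data: dimension 0 separates the classes, dimension 1 is pure noise.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = np.repeat([0, 1], 100)
X[y == 1, 0] += 5.0                  # large between-class gap in dimension 0
Xr, kept = reduce_dims(X, y, keep=1)
```

On this toy set the discriminative dimension is kept and the noise dimension discarded, which is exactly the behavior the criterion is meant to produce.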
In the above classification method based on the audio characteristic signal, the building of the characteristic quantity classification model in the step (5) includes the following steps:
(1) First consider a binary classification problem: the two types of dimension-reduced audio characteristic signals are labelled y = 1 and y = 0 respectively, and x denotes the dimension-reduced audio characteristic signal. The model introduces an implicit function f(x) and a response function δ(f(x)), where f(x) follows a Gaussian distribution and the response function compresses the result of f(x) into the interval [0, 1]. The likelihood of the data can then be written through the response function as π(x) = p(y = 1 | x) = δ(f(x)) and p(y = 0 | f) = 1 − δ(f), with

  δ(z) = Φ(z) = ∫_{−∞}^{z} N(u; 0, 1) du.
(2) Since f is an implicit function following a Gaussian distribution, its covariance is taken as the Gaussian squared-exponential kernel

  k(x, x') = σ_f² exp( −‖x − x'‖² / (2l²) )

where σ_f² is the coefficient parameter of the squared-exponential kernel and l represents the distance influence-factor parameter between the two points x and x'; the kernel function has only the two hyperparameters θ = (σ_f, l). For a given test point x* and the training inputs x, the joint distribution of the implicit function values is

  [f; f*] ~ N( 0, [[K, K*ᵀ], [K*, K**]] )
wherein K is a covariance matrix expressed as

  K = [ k(x_i, x_j) ],  i, j = 1, …, n

  K* = [ k(x*, x₁)  k(x*, x₂) … k(x*, x_n) ],  K** = k(x*, x*)   (4)
The conditional distribution of the implicit function is f* | f ~ N( K* K⁻¹ f, K** − K* K⁻¹ K*ᵀ ), and the predictive conditional probability of the implicit function is

  p(f* | x, y, x*) = ∫ p(f* | f) p(f | x, y) df.
Here the conditional distribution of the implicit function has the same form as its predictive probability distribution, but the expressions are not identical, so the mean of the predictive conditional output of the implicit function is taken as

  f̄* = K* K⁻¹ f̂.

The corresponding covariance matrix K' will likewise differ; its meaning is explained in step (2) below. Compressing the implicit function into the interval [0, 1] yields the probability of class membership; defining δ* = δ(f*) = Φ(f*), we have
  δ̄* = ∫ δ(f*) p(f* | x, y, x*) df*   (5)
The compressed value is given by Rasmussen and Williams (2006) as

  δ̄* = Φ( f̄* / √(1 + Var(f*)) ).
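The kernel and the compression step above can be sketched directly. Function names are assumptions; the formulas are exactly the squared-exponential kernel and the cumulative-Gaussian compression just stated (the closed form Φ(μ/√(1+σ²)) is exact when the response function is the Gaussian CDF).

```python
from math import erf

import numpy as np

def se_kernel(A, B, sigma_f=1.0, ell=1.0):
    """Squared-exponential kernel k(x, x') = sigma_f^2 exp(-|x - x'|^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-d2 / (2.0 * ell**2))

def probit_compress(mean, var):
    """Compress a Gaussian predictive N(mean, var) over f* into a class
    probability: integral of Phi(f*) against the Gaussian, which equals
    Phi(mean / sqrt(1 + var))."""
    z = mean / np.sqrt(1.0 + var)
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

X = np.array([[0.0], [1.0], [3.0]])
K = se_kernel(X, X)                  # 3x3 covariance matrix of the implicit function
p_mid = probit_compress(0.0, 2.3)    # zero mean -> probability 0.5 regardless of var
p_pos = probit_compress(3.0, 0.1)    # confidently positive mean -> probability near 1
```

Note how the predictive variance tempers the probability: a large `var` pulls the compressed value toward 0.5, which is the behavior the text relies on for classification by probability size.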
(2) For the likelihood, a functional analysis is needed first. According to the Bayes formula, the posterior distribution of the implicit function is

  p(f | x, y) = p(y | f) p(f | x) / p(y | x).

To maximize the posterior probability of the implicit function, i.e. to solve the maximum likelihood, optimization algorithms such as the simplex method are used, through the update

  f^{new} = ( K⁻¹ + W )⁻¹ ( W f + ∇ log p(y | f) ).
Substituting this into (5) and iterating a certain number of times yields the optimal solution f̂ of f, which satisfies

  f̂ = K ∇ log p(y | f̂).
Since p(y | f) is not a Gaussian distribution, the posterior distribution p(f | x, y) of the implicit function has no analytic form. A Laplace approximation is therefore used: the posterior p(f | x, y) is approximated by a Gaussian q(f | x, y), obtained from a second-order Taylor expansion of log p(f | x, y) around the maximum of the posterior:

  q(f | x, y) = N( f | f̂, (K⁻¹ + W)⁻¹ )

thereby obtaining K' = K + W⁻¹ and the predictive variance

  Var(f*) = K** − K* (K + W⁻¹)⁻¹ K*ᵀ

where W is the Hessian matrix of −log p(y | f).
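The Newton iteration for the posterior mode f̂ can be sketched as below. As an assumption for a compact demo, the sketch uses a logistic likelihood (so W = π(1 − π) on the diagonal) rather than the patent's cumulative-Gaussian response; the update formula is the one stated above, rewritten with the identity (K⁻¹ + W)⁻¹ = K (I + W K)⁻¹ to avoid inverting K.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_mode(K, y, n_iter=50):
    """Posterior mode f_hat of p(f | x, y) for binary labels y in {0, 1}
    under a logistic likelihood, via the Newton update
    f_new = (K^-1 + W)^-1 (W f + grad log p(y | f))."""
    n = len(y)
    f = np.zeros(n)
    for _ in range(n_iter):
        pi = sigmoid(f)
        grad = y - pi                     # gradient of log p(y | f)
        W = pi * (1.0 - pi)               # diagonal of -Hessian of log p(y | f)
        # f_new = K (I + W K)^-1 (W f + grad), equivalent to the update above
        B = np.eye(n) + W[:, None] * K
        f = K @ np.linalg.solve(B, W * f + grad)
    return f

# Demo: two well-separated clusters on the line, SE kernel with sigma_f = l = 1.
X = np.array([-2.0, -1.8, 2.0, 2.2])[:, None]
y = np.array([0, 0, 1, 1], float)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0) + 1e-8 * np.eye(4)   # jitter for numerical stability
f_hat = laplace_mode(K, y)
```

At the mode, f̂ is negative on the class-0 points and positive on the class-1 points, so compressing it through the response function yields probabilities on the correct side of 0.5.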
(3) The implicit function assumed by the invention uses the squared-exponential kernel, so the hyperparameters σ_f and l are solved from

  p(y | x, θ) = ∫ p(y | f) p(f | x) df.   (9)

To obtain the optimal hyperparameters, i.e. to maximize this conditional probability as far as possible, a second-order Taylor expansion of log p(y | x, θ) is performed at the local maximum point, followed by probability normalization and finally the Laplace approximation, i.e. the log-likelihood function expansion (for a detailed derivation see Gaussian Processes for Machine Learning, Rasmussen and Williams, 2006):

  log q(y | x, θ) = −½ f̂ᵀ K⁻¹ f̂ + log p(y | f̂) − ½ log | I_n + W^{1/2} K W^{1/2} |.

Substituting the solved parameters into the log-likelihood function and optimizing with the simplex method yields the optimal hyperparameters; substituting these back into the classification model, the audio feature quantities are classified using the classification expressions (5) and (6).
In the above classification method based on audio characteristic signals, the feature parameter classification of step (6) sends the dimension-reduced audio characteristic signals into the established classification model for classification, and finally displays the classification result through data visualization.
Compared with existing methods based on audio features, the invention has the following advantages:
(1) The invention connects the input to the function output through an implicit function f that follows a Gaussian distribution, normalizes the function value of f into the range [0, 1] with a compression function, and classifies more intuitively by the size of the resulting probability.
(2) The maximum likelihood is solved through Bayesian prior probability, and the classification algorithm is improved by the further introduction of the kernel function, so that binary classification is easily extended to high-dimensional multi-class classification.
(3) The invention addresses the classification of dimension-reduced audio characteristic signals; it classifies the reduced data with a simple principle that is easy to program, and it is robust for practical audio-recognition artificial intelligence and broadcast audio monitoring.
Drawings
FIG. 1 is a flowchart of the overall classification process of the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
As shown in fig. 1, a classification method based on audio characteristic signals includes the following specific steps:
(1) audio signal acquisition: and collecting audio signals to obtain audio samples.
(2) Audio signal preprocessing: and converting the analog signals in the collected audio samples into digital signals, and writing the digital signals into the WAV file. And filtering and framing the digital signal to be written into the WAV file.
(3) Characteristic parameter extraction: extract, in software, the high-dimensional characteristic parameters of the preprocessed audio signals: Linear Prediction Coefficients (LPC), Linear Prediction Cepstrum Coefficients (LPCC), and Mel Frequency Cepstrum Coefficients (MFCC).
(4) Reducing the dimension of the characteristic parameter: and sending the extracted audio characteristic parameters into a built dimension reduction model for dimension reduction treatment, and storing the dimension-reduced audio characteristic parameters into a table.
(5) Building a classification model: first, describe the similarity of one class to the other using an implicit function f that follows a Gaussian distribution; second, compress the output value of f into the range [0, 1] using a compression function. The compressed value is the similarity between the two classes, and classes are distinguished according to this similarity; the model thus built is the required classification model.
(6) Audio feature parameter classification: and (5) sending the audio characteristic quantity subjected to the dimension reduction in the step (4) into the built classification model in the step (5) for classification, and carrying out data visualization display on a classification result.
The audio acquisition collects an audio sample through an audio acquisition device; when acquiring the audio signal, the device sets the sampling frequency (satisfying the Nyquist sampling theorem), the number of sampling channels, and the quantization precision.
The signal preprocessing comprises the following steps:
(1) Filter the collected audio signal x(n) using a rectangular window function w(n) (upper cutoff frequency generally f_H = 3400 Hz, lower cutoff frequency f_L = 60–100 Hz) to obtain the signal y_a(n), where

  w(n) = 1 for 0 ≤ n ≤ N − 1, w(n) = 0 otherwise, and y_a(n) = x(n)·w(n).
(2) The filtered audio signal y_a(n) is divided into a number of audio signal segments; one segment is called a frame, and each frame spans 10–30 ms. Adjacent frames partially overlap; the overlapping part is called the frame shift, which is taken as 1/2 or 1/3 of the frame length.
The characteristic parameter extraction extracts the Linear Prediction Coefficients (LPC), Linear Prediction Cepstrum Coefficients (LPCC), and Mel Frequency Cepstrum Coefficients (MFCC) of the framed audio signal, and places them into 3 separate tables.
The feature parameter dimension reduction obtains the optimal projection direction of the feature vectors with the Fisher criterion algorithm, judges the contribution of each feature component to recognition by adding or removing feature components, and combines the two to reduce the dimensionality of the audio characteristic signals, obtaining a better dimension-reduction result; the larger the Fisher ratio, the more important the dimension component. The Fisher linear discriminant criterion is as follows:
  r_Fisher = σ²_between / σ²_within

where r_Fisher is the Fisher ratio (Fisher criterion) of a feature component; σ²_between is the between-class variance of the feature component, i.e. the variance of the mean values of different speech feature components; and σ²_within is the within-class variance, i.e. the mean of the variances of the same speech feature components. For the ρ-th dimension,

  σ²_between(ρ) = (1/γ) Σ_{ε=1}^{γ} ( m_ρ^(ε) − m̄_ρ )²

  σ²_within(ρ) = (1/γ) Σ_{ε=1}^{γ} (1/κ_ε) Σ_{x∈ω_ε} ( x_ρ^(ε) − m_ρ^(ε) )²

where ρ denotes the dimension of the characteristic parameter; m̄_ρ is the mean of the ρ-th dimension component of the speech features over all classes; m_ρ^(ε) is the mean of the ρ-th dimension component over the ε-th class; ω_ε is the ε-th-class speech feature sequence; γ and κ_ε are the number of classes and the number of samples per class, respectively; and x_ρ^(ε) is the ρ-th dimension component of an ε-th-class speech feature sequence.
The larger the Fisher ratio, the more suitable that dimension's characteristic parameter is as feature information for speech recognition; the dimension components with the largest Fisher ratios are selected as the dimension-reduction result, achieving the purpose of dimensionality reduction.
The construction of the characteristic quantity classification model comprises the following steps:
(1) The two types of dimension-reduced audio characteristic signals are labelled y = 1 and y = 0 respectively, and x denotes the dimension-reduced audio characteristic signal. The model introduces an implicit function f(x) and a response function δ(f(x)), where f(x) follows a Gaussian distribution and the response function compresses the result of f(x) into the interval [0, 1]. The likelihood of the data can be written as π(x) = p(y = 1 | x) = δ(f(x)) and p(y = 0 | f) = 1 − δ(f), with the response function

  δ(z) = Φ(z) = ∫_{−∞}^{z} N(u; 0, 1) du.
Since f is an implicit function following a Gaussian distribution, its covariance is taken as the Gaussian squared-exponential kernel

  k(x, x') = σ_f² exp( −‖x − x'‖² / (2l²) )

where σ_f² is the coefficient parameter of the squared-exponential kernel and l represents the distance influence parameter between the two points x and x'; the kernel has only the two hyperparameters θ = (σ_f, l). For a given test point x* and the training inputs x, the joint distribution of the implicit function values is

  [f; f*] ~ N( 0, [[K, K*ᵀ], [K*, K**]] )
wherein K is a covariance matrix expressed as

  K = [ k(x_i, x_j) ],  i, j = 1, …, n

  K* = [ k(x*, x₁)  k(x*, x₂) … k(x*, x_n) ],  K** = k(x*, x*)   (8)
The conditional distribution of the implicit function is

  f* | f ~ N( K* K⁻¹ f, K** − K* K⁻¹ K*ᵀ )   (9)

and the predictive conditional probability of the implicit function is

  p(f* | x, y, x*) = ∫ p(f* | f) p(f | x, y) df.   (10)
The mean of the predictive conditional output of the implicit function is defined as

  f̄* = K* K⁻¹ f̂

and the covariance matrix is defined as K', whose explanation is included in step (2) below. The output value of the implicit function is compressed into the interval [0, 1] through the compression function, yielding the probability of class membership; defining δ* = δ(f*) = Φ(f*), we have

  δ̄* = ∫ δ(f*) p(f* | x, y, x*) df*   (11)

whose compressed value is given by Rasmussen and Williams (2006) as

  δ̄* = Φ( f̄* / √(1 + Var(f*)) ).   (12)
(2) Analyzing the likelihood function according to the Bayes formula, the posterior distribution of the implicit function is

  p(f | x, y) = p(y | f) p(f | x) / p(y | x)   (13)

with normalizing constant

  p(y | x) = ∫ p(y | f) p(f | x) df.   (14)

The maximum of the posterior probability is found with optimization algorithms such as the simplex method, through the update

  f^{new} = ( K⁻¹ + W )⁻¹ ( W f + ∇ log p(y | f) ).   (15)
Substituting this into (10) and iterating a certain number of times yields the optimal solution f̂ of f, which satisfies

  f̂ = K ∇ log p(y | f̂).   (16)
Because p(y | f) is not a Gaussian distribution, the posterior distribution p(f | x, y) of the implicit function has no analytic form; a Laplace approximation is applied, approximating the posterior p(f | x, y) with a Gaussian q(f | x, y) obtained from a second-order Taylor expansion of log p(f | x, y) at the maximum of the posterior distribution:

  q(f | x, y) = N( f | f̂, (K⁻¹ + W)⁻¹ )   (17)

thereby obtaining

  K' = K + W⁻¹,  Var(f*) = K** − K* (K + W⁻¹)⁻¹ K*ᵀ

where W is the Hessian matrix of −log p(y | f). The above constitutes the complete solution process for K'.
(3) The premise for the classification algorithm to proceed smoothly is the solution of the covariance matrix, so the relevant parameters in the implicit function become the key of the problem. The implicit function assumed by the invention uses the squared-exponential kernel, so the hyperparameters σ_f and l are solved from

  p(y | x, θ) = ∫ p(y | f) p(f | x) df   (18)
A second-order Taylor expansion of log p(y | x, θ) at its local maximum point, followed by probability normalization and finally the Laplace approximation, gives the log-likelihood function expansion

  log q(y | x, θ) = −½ f̂ᵀ K⁻¹ f̂ + log p(y | f̂) − ½ log | I_n + W^{1/2} K W^{1/2} |.   (19)
Substituting the relevant formulas (15) to (17) into the log-likelihood function, the optimal hyperparameters are solved by simplex optimization (the simplex function of the programming software can be called directly); back-substituting the parameters of the relevant formulas yields the classification model, and the data are classified using (11) and (12).
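The hyperparameter search described above (simplex optimization of the Laplace-approximate log marginal likelihood) can be sketched with `scipy.optimize.minimize(method="Nelder-Mead")`, SciPy's built-in simplex routine. Everything here is an illustrative assumption: a logistic likelihood instead of the cumulative-Gaussian response, a tiny 1-D data set, and log-space parameterization to keep σ_f and l positive.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_kernel(X, sigma_f, ell):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-d2 / (2.0 * ell**2))

def neg_log_marginal(params, X, y, n_iter=30):
    """Negative Laplace-approximate log marginal likelihood:
    -log q(y|x,theta) = 1/2 f^T K^-1 f - log p(y|f) + 1/2 log|I + W^1/2 K W^1/2|,
    evaluated at the posterior mode f found by Newton iteration."""
    sigma_f, ell = np.exp(params)            # optimize in log space for positivity
    n = len(y)
    K = se_kernel(X, sigma_f, ell) + 1e-8 * np.eye(n)
    f = np.zeros(n)
    for _ in range(n_iter):                  # Newton iteration for the mode
        pi = sigmoid(f)
        W = pi * (1.0 - pi)
        B = np.eye(n) + W[:, None] * K
        f = K @ np.linalg.solve(B, W * f + (y - pi))
    pi = np.clip(sigmoid(f), 1e-12, 1 - 1e-12)
    W = pi * (1.0 - pi)
    sW = np.sqrt(W)
    Bm = np.eye(n) + sW[:, None] * K * sW[None, :]
    log_lik = np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))
    fit = 0.5 * f @ np.linalg.solve(K, f)
    _, logdet = np.linalg.slogdet(Bm)
    return fit - log_lik + 0.5 * logdet

X = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])[:, None]
y = np.array([0, 0, 0, 1, 1, 1], float)
res = minimize(neg_log_marginal, x0=np.log([1.0, 1.0]), args=(X, y),
               method="Nelder-Mead")
sigma_f_opt, ell_opt = np.exp(res.x)
```

The recovered (σ_f, l) would then be fixed in the kernel, the mode and predictive variance recomputed, and new points classified through the compression function.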
The audio characteristic parameter classification sends the dimension-reduced audio characteristic signals into the established classification model for classification; finally the classification result is displayed through data visualization, together with the corresponding classification accuracy. The invention describes only the binary classification algorithm; for multi-class problems a corresponding vectorized extension is carried out.
The present invention is not limited to the above-described embodiments, and can be applied to other related fields within the scope of knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims (5)

1. A method of classification based on audio feature signals, characterized by: the method comprises the following specific steps:
(1) audio signal acquisition: collecting an audio signal to obtain an audio sample;
(2) audio signal preprocessing: converting analog signals in the collected audio samples into digital signals, writing the digital signals into a WAV file, and performing filtering, pre-emphasis and framing processing on the digital signals written into the WAV file;
(3) characteristic parameter extraction: extracting high-dimensional characteristic parameters including a linear prediction coefficient, a linear prediction cepstrum coefficient and a Mel frequency cepstrum coefficient from the preprocessed audio signal;
(4) reducing the dimension of the characteristic parameter: sending the extracted high-dimensional characteristic parameters into a built dimension reduction model for dimension reduction treatment and storage;
(5) building a classification model: firstly, describing the similarity of one class to the other by using an implicit function f obeying a Gaussian distribution; secondly, compressing the output value of f into the range [0, 1] by using a compression function, and distinguishing the classes according to the size of the compressed value; the model thus built is the required classification model;
the construction of the classification model comprises the following steps:
(1) the two types of audio characteristic signals after dimensionality-reduction processing are labelled y = 1 and y = 0 respectively, and x is defined as the dimension-reduced audio characteristic signal; an implicit function f(x) and a response function δ(f(x)) are introduced into the classification model, wherein the implicit function f(x) obeys a Gaussian distribution and the response function compresses the result of f(x) into the interval [0,1]; the likelihood function of the data is π(x) = p(y = 1|x) = δ(f(x)), p(y = 0|f) = 1 − δ(f), and the response function is the cumulative Gaussian:

δ(z) = Φ(z) = ∫_{−∞}^{z} N(t | 0, 1) dt
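The role of the response function can be illustrated with a short sketch. The probit and logistic forms below are the two standard choices for compressing a real-valued latent output into [0,1]; the claim gives the function only as an image, so the exact form is an assumption:

```python
import numpy as np
from math import erf, sqrt

def probit(z):
    """Cumulative standard normal Phi(z): maps any real z into (0, 1)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def logistic(z):
    """Logistic sigmoid: an alternative response function with the same role."""
    return 1.0 / (1.0 + np.exp(-z))

# Both squash large-magnitude latent values toward 0 or 1, so the
# compressed value can be read as a class-membership probability.
```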
the implicit function is assumed to use a Gaussian squared exponential kernel function, with expression:

k(x, x') = σ_f² exp(−‖x − x'‖² / (2l²))
wherein σ_f² is the variance coefficient of the squared exponential kernel, l is a parameter governing the influence of the distance between the two points x and x', and the two hyperparameters of the kernel function are θ = (σ_f, l); given a test point x* and the training inputs x, the joint distribution of the implicit function values is:

[f, f*]ᵀ ~ N(0, [[K, K*ᵀ], [K*, K**]])
wherein K is the covariance matrix, with expression:

K = [k(x_i, x_j)], i, j = 1, …, n (an n × n matrix)
K* = [k(x*, x1)  k(x*, x2)  …  k(x*, xn)],   K** = k(x*, x*)
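The covariance quantities K, K* and K** can be assembled as below; the squared exponential form is taken from the claim, while the toy inputs and the hyperparameter values are hypothetical:

```python
import numpy as np

def se_kernel(xa, xb, sigma_f=1.0, length=1.0):
    """Squared exponential kernel k(x, x') = sigma_f^2 * exp(-|x - x'|^2 / (2 l^2))."""
    d2 = np.sum((xa[:, None, :] - xb[None, :, :]) ** 2, axis=-1)
    return sigma_f ** 2 * np.exp(-d2 / (2.0 * length ** 2))

# n training inputs (here n = 3, one-dimensional) and a single test point x*
x = np.array([[0.0], [1.0], [2.0]])
x_star = np.array([[0.5]])

K = se_kernel(x, x)                      # n x n covariance matrix of the training inputs
K_star = se_kernel(x_star, x)            # [k(x*, x1) ... k(x*, xn)]
K_star_star = se_kernel(x_star, x_star)  # k(x*, x*)
```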
the conditional distribution of the implicit function is:

f*|f ~ N(K* K⁻¹ f, K** − K* K⁻¹ K*ᵀ)
predictive conditional probability of the implicit function:

p(f*|x, y, x*) = ∫ p(f*|f, x, x*) p(f|x, y) df
the mean value of the conditional probability output of the implicit function prediction is defined as f̄* = E[f*|x, y, x*];
The covariance matrix is defined as K'. The compression function compresses the output value of the implicit function into the interval [0,1], yielding the probability of class membership, and δ* = δ(f*) = φ(f*) is defined, namely:

δ* = ∫ δ(f*) p(f*|f) df*
its compressed value is

δ̄* = φ( f̄* / √(1 + Var(f*)) )
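When the response function is the cumulative Gaussian, the compression integral over the predictive distribution has a well-known closed form; the sketch below assumes that probit form (the patent's image formula is not recoverable verbatim):

```python
from math import erf, sqrt

def averaged_probit(mean, var):
    """Closed form of  integral Phi(f*) N(f* | mean, var) df*
    for a probit response: Phi(mean / sqrt(1 + var))."""
    z = mean / sqrt(1.0 + var)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))
```

A larger predictive variance pulls the compressed probability toward 0.5, which is the desired behaviour: uncertain latent predictions yield less confident class probabilities.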
(2) analyzing the likelihood function according to the Bayes formula yields the posterior distribution of the implicit function:

p(f|x, y) = p(y|f) p(f|x) / p(y|x)

p(y|x) = ∫ p(y|f) p(f|x) df
Using a simplex optimization algorithm one can obtain:

f̂ = argmax_f p(f|x, y)
the optimal solution of f is obtained by iterating with the predictive conditional-probability formula of the implicit function:

f^(new) = (K⁻¹ + W)⁻¹ (W f + ∇ log p(y|f))
The Gaussian distribution q(f|x, y) is used to approximate the posterior distribution p(f|x, y); performing a second-order Taylor expansion of log p(f|x, y) at the maximum of the posterior distribution gives the Gaussian distribution:

q(f|x, y) = N(f̂, (K⁻¹ + W)⁻¹)
thereby obtaining

K' = K + W⁻¹

Var(f*) = K** − K* (K + W⁻¹)⁻¹ K*ᵀ
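The mode-finding iteration and the W matrix can be sketched as follows; a logistic likelihood is used as a stand-in for the response function, and the toy data are hypothetical:

```python
import numpy as np

def laplace_mode(K, y, n_iter=25):
    """Newton iteration for the posterior mode f_hat of a GP binary classifier
    with logistic likelihood and labels y in {0, 1}.
    Each step evaluates f_new = (K^-1 + W)^-1 (W f + grad) without inverting K."""
    n = len(y)
    f = np.zeros(n)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-f))   # sigma(f), the response function
        W = np.diag(pi * (1.0 - pi))    # Hessian of -log p(y|f)
        grad = y - pi                   # gradient of log p(y|f)
        f = K @ np.linalg.solve(np.eye(n) + W @ K, W @ f + grad)
    return f, W

# hypothetical toy data: two well-separated 1-D classes
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)  # SE kernel, sigma_f = l = 1
f_hat, W_hat = laplace_mode(K, y)
```

At the converged mode, the squashed latent values should agree with the training labels on such well-separated data.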
wherein W is the Hessian matrix of −log p(y|f); the above formula is substituted into the log-likelihood function, and the optimal hyperparameters are solved using the simplex optimization algorithm;
(3) the implicit function assumed by the invention uses a squared exponential kernel function, and the conditional probability of the classification-model output result is:
p(y|x,θ)=∫p(y|f)p(f|x)df
a second-order Taylor expansion is performed at the local maximum point of log p(y|x, θ), probability normalization is applied, and finally the Laplace approximation gives:

log p(y|x, θ) ≈ log p(y|f̂) − ½ f̂ᵀ K⁻¹ f̂ − ½ log|I + KW|
the solved hyperparameters are substituted back to obtain the required classification model;
(6) audio feature parameter classification: the characteristic parameters of the audio signal after the dimension reduction in step (4) are sent into the classification model of step (5) for classification, and the classification results are visually displayed.
2. The audio feature signal-based classification method according to claim 1, characterized by: the audio signal is collected by an audio acquisition device, and the audio acquisition device needs to set the sampling frequency, the number of sampling channels and the quantization precision when collecting the audio signal.
3. The audio feature signal-based classification method according to claim 1, characterized by: the signal preprocessing comprises the following steps:
(1) filtering the collected audio signal x(n) with a rectangular window function w(n) to obtain the signal y_a(n), wherein

w(n) = 1 for 0 ≤ n ≤ N − 1, and w(n) = 0 otherwise;
(2) the filtered signal y_a(n) is pre-emphasized and divided into a number of audio frame signals, with partial overlap between adjacent frames.
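The pre-emphasis and overlapping-frame steps can be sketched as follows (the filter coefficient 0.97 and the frame sizes are common choices, not values fixed by the claim):

```python
import numpy as np

def preemphasis(signal, alpha=0.97):
    """Pre-emphasis y[n] = x[n] - alpha * x[n-1], boosting high frequencies."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, frame_len=256, hop=128):
    """Split the signal into frames of frame_len samples; with hop < frame_len
    adjacent frames partially overlap (here by 50%)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
```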
4. The audio feature signal-based classification method according to claim 1, characterized by: the characteristic parameter extraction is to extract the characteristic parameters of a linear prediction coefficient, a linear prediction cepstrum coefficient and a Mel frequency cepstrum coefficient of the audio signal after the framing, and respectively store the characteristic parameters into 3 tables.
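Of the three feature families named in claim 4, the linear prediction coefficients are the simplest to sketch; the autocorrelation method with the Levinson-Durbin recursion below is one standard way to obtain them (the order of 10 is a typical, not claimed, value):

```python
import numpy as np

def lpc(frame, order=10):
    """Linear prediction coefficients a[0..order] (with a[0] = 1) via the
    autocorrelation method and the Levinson-Durbin recursion."""
    r = np.array([frame[: len(frame) - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[1:i][::-1]  # prediction residual correlation
        k = -acc / err                      # reflection coefficient
        a[1:i + 1] += k * a[:i][::-1]       # coefficient update; a[i] becomes k
        err *= (1.0 - k * k)                # updated prediction error
    return a
```

For a first-order autoregressive signal x[n] ≈ 0.9 x[n−1], the recursion should recover a first coefficient close to −0.9.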
5. The audio feature signal-based classification method according to claim 1, characterized by: the specific steps of the feature parameter dimension reduction are as follows: the audio characteristic signal is subjected to dimensionality-reduction processing using the Fisher criterion, and the dimensional components with large Fisher ratios are selected as x, the dimensionality-reduction result, achieving the purpose of dimension reduction; the Fisher linear discrimination criterion is

r_Fisher = σ²_between / σ²_within

wherein r_Fisher is the Fisher ratio, or Fisher criterion, of the feature component; σ²_between represents the inter-class variance of the feature component, namely the variance of the means of the different speech feature classes; σ²_within represents the intra-class variance of the feature component, namely the mean of the variances within each class of the same speech feature component;
σ²_between(ρ) = (1/γ) Σ_{ε=1..γ} ( m_ε^ρ − m^ρ )²

σ²_within(ρ) = (1/γ) Σ_{ε=1..γ} (1/κ_ε) Σ_{x∈Ω_ε} ( x^ρ − m_ε^ρ )²
where ρ represents the dimension of the characteristic parameter;
m^ρ = (1/γ) Σ_{ε=1..γ} m_ε^ρ represents the mean of the ρ-th dimension component of the speech feature over all classes;
m_ε^ρ = (1/κ_ε) Σ_{x∈Ω_ε} x^ρ represents the mean of the ρ-th dimension component of the speech feature over the ε-th class; Ω_ε represents the speech feature sequence of the ε-th class; γ and κ_ε represent the number of classes and the number of samples in each class of the speech feature sequence, respectively;
x_ε^ρ represents the ρ-th dimension component of the ε-th class of the speech feature sequence.
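The per-dimension Fisher ratio of claim 5 can be computed as below; the tiny data set is hypothetical, with the first dimension discriminative and the second mostly noise:

```python
import numpy as np

def fisher_ratios(X, labels):
    """Per-dimension Fisher ratio: variance of the class means (between-class)
    divided by the mean of the per-class variances (within-class)."""
    classes = np.unique(labels)
    class_means = np.stack([X[labels == c].mean(axis=0) for c in classes])
    between = ((class_means - class_means.mean(axis=0)) ** 2).mean(axis=0)
    within = np.stack([X[labels == c].var(axis=0) for c in classes]).mean(axis=0)
    return between / within

# hypothetical features: dimension 0 separates the classes, dimension 1 does not
X = np.array([[0.0, 5.0], [0.1, 1.0], [-0.1, 3.0],
              [5.0, 4.0], [5.1, 0.0], [4.9, 2.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
ratios = fisher_ratios(X, labels)
keep = np.argsort(ratios)[::-1][:1]  # keep the dimension(s) with the largest ratio
```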
CN201810994308.0A 2018-08-29 2018-08-29 Classification method based on audio characteristic signals Active CN109166591B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810994308.0A CN109166591B (en) 2018-08-29 2018-08-29 Classification method based on audio characteristic signals


Publications (2)

Publication Number Publication Date
CN109166591A CN109166591A (en) 2019-01-08
CN109166591B (en) 2022-07-19

Family

ID=64893393

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810994308.0A Active CN109166591B (en) 2018-08-29 2018-08-29 Classification method based on audio characteristic signals

Country Status (1)

Country Link
CN (1) CN109166591B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065070B (en) * 2018-08-29 2022-07-19 昆明理工大学 Kernel function-based audio characteristic signal dimension reduction method
CN109949824B (en) * 2019-01-24 2021-08-03 江南大学 City sound event classification method based on N-DenseNet and high-dimensional mfcc characteristics
CN110931044A (en) * 2019-12-12 2020-03-27 上海立可芯半导体科技有限公司 Radio frequency searching method, channel classification method and electronic equipment
CN110956965A (en) * 2019-12-12 2020-04-03 电子科技大学 Personalized intelligent home safety control system and method based on voiceprint recognition
CN113223511B (en) * 2020-01-21 2024-04-16 珠海市煊扬科技有限公司 Audio processing device for speech recognition
CN117275519B (en) * 2023-11-22 2024-02-13 珠海高凌信息科技股份有限公司 Voice type identification correction method, system, device and medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7031530B2 (en) * 2001-11-27 2006-04-18 Lockheed Martin Corporation Compound classifier for pattern recognition applications
US9111547B2 (en) * 2012-08-22 2015-08-18 Kodak Alaris Inc. Audio signal semantic concept classification method
CN103151039A (en) * 2013-02-07 2013-06-12 中国科学院自动化研究所 Speaker age identification method based on SVM (Support Vector Machine)
CN103854645B (en) * 2014-03-05 2016-08-24 东南大学 A kind of based on speaker's punishment independent of speaker's speech-emotion recognition method
US10141009B2 (en) * 2016-06-28 2018-11-27 Pindrop Security, Inc. System and method for cluster-based audio event detection
CN107871498A (en) * 2017-10-10 2018-04-03 昆明理工大学 It is a kind of based on Fisher criterions to improve the composite character combinational algorithm of phonetic recognization rate
CN108109612A (en) * 2017-12-07 2018-06-01 苏州大学 Voice recognition classification method based on self-adaptive dimension reduction



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant