CN101667423A

CN101667423A - Compressed domain high robust voice/music dividing method based on probability density ratio

Info

Publication number: CN101667423A
Application number: CN200910196513A
Authority: CN
Inventors: 余小清; 李昌莲; 许雪琼; 万旺根
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2009-09-25
Filing date: 2009-09-25
Publication date: 2010-03-10

Abstract

The invention relates to a highly robust compressed-domain voice/music segmentation method based on probability density ratio, comprising the steps of: extracting new characteristic parameters based on the probability density ratio from low signal-to-noise-ratio compressed-domain voice/music mixed data; detecting the change points between compressed-domain voice and music based on the new characteristic parameters; and segmenting accordingly to obtain the voice sections and the music sections. Experimental results show that, compared with traditional segmentation methods, the method markedly improves accuracy, noise resistance and overall performance.

Description

Compressed domain high robust voice/music dividing method based on the probability density ratio

Technical field

The present invention relates to a highly robust compressed-domain voice/music segmentation method based on probability density ratio, and in particular to a voice/music change-point detection method based on the probability density ratio that works under different natural environmental noises at low signal-to-noise ratio.

Background technology

Techniques such as classified retrieval and scene classification of compressed-domain voice/music use signal processing and statistical methods to search for specific voice or music in large compressed voice/music databases. Voice/music segmentation is one of the key problems in realizing classified retrieval, particularly when processing data under natural environmental noise at low signal-to-noise ratio.

Most of the voice/music segmentation methods in common use are carried out in the uncompressed domain. Few address segmentation directly in the compressed domain, and studies under low signal-to-noise-ratio conditions are fewer still. However, most compressed-domain voice/music material is not recorded in a standard recording studio, and some comes from noisy real environments, so research on compressed-domain voice/music segmentation under natural environmental noise at low signal-to-noise ratio is particularly important. Compressed-domain voice/music data come from the binary code stream produced by encoding the original voice/music, but the key characteristics of the original voice/music cannot be read directly from that stream. The first issue in segmenting compressed-domain voice/music data is therefore the data source for feature extraction: how to process the compressed data and extract effective characteristic parameters at minimum computational cost. Theoretical analysis and experimental results show that partially decoding the compressed data yields data whose spectral characteristics are similar to those of the original voice/music; compressed-domain features extracted from these data reflect the marked difference between voice and music and can be used for further segmentation and classification. The present method adopts exactly this idea. From compressed-domain voice/music data encoded with MPEG-1 Audio Layer III, it extracts the new characteristic parameters compressed probability density ratio (CPR) and compressed probability density ratio crossing rate (CPRCR), then detects the change points between voice and music in the compressed-domain data, and finally obtains the segmentation result from those change points.

Summary of the invention

The objective of the invention is to address the defects of the prior art by providing a highly robust compressed-domain voice/music segmentation method based on probability density ratio. The method solves the problem of detecting voice/music change points in the compressed domain under different natural environmental noises at low signal-to-noise ratio, and can be further used for compressed-domain voice/music recognition, voice/music classified retrieval, voice/music scene classification and the like. To achieve this objective, the concept of the present invention is as follows:

The method of the present invention first of all has good noise immunity: it can segment compressed-domain voice/music data under different natural environmental noises at signal-to-noise ratios as low as 5 dB. This provides a sound basis for further processing of compressed-domain voice/music data, such as classification and retrieval, recognition, and scene detection.

The purpose of the method is to provide a way of segmenting compressed-domain voice/music data under different natural environmental noises at low signal-to-noise ratio. Voice/music characteristic parameters are extracted directly from the compressed-domain data, change-point detection divides the data into voice sections and music sections, and the segmentation result can then be used for compressed-domain voice/music classification, retrieval and the like.

The technical scheme adopted to solve this technical problem is: first extract characteristic parameters from the compressed-domain voice/music data under different natural environmental noises at low signal-to-noise ratio, then perform voice/music change-point detection on these data, and finally obtain the segmentation result from the change points.

According to the above inventive concept, the present invention adopts the following technical scheme:

A highly robust compressed-domain voice/music segmentation method based on probability density ratio, characterized in that: first, data reflecting the frequency-domain characteristics of the original voice/music are obtained from an MP3 (MPEG-1 Layer 3) file based on the MPEG-1 Audio Layer III compression technology; second, the new compressed-domain probability density ratio characteristic parameter (compressed probability density ratio, CPR) is extracted from these data; then, based on this parameter, the compressed-domain probability density ratio crossing rate characteristic parameter (compressed probability density ratio crossing rate, CPRCR), which reflects the different characteristics of voice and music, is obtained; finally, the change points between voice and music are detected in the compressed-domain voice/music data, and the segmented voice and music sections are obtained from these change points.

The method specifically comprises the following five steps:

1) Preprocessing of the compressed-domain voice/music data: obtaining the compressed-domain voice/music mixed data, reading the frame header and side information, reading the main data, Huffman decoding, and inverse quantization;

2) Generating the modified discrete cosine transform (MDCT) matrix: finding the MDCT coefficients in each subband, arranging the coefficients within each subband, and forming the matrix;

3) Extracting characteristic parameters from the compressed-domain voice/music data: computing the compressed-domain probability density ratio and the compressed-domain probability density ratio crossing rate characteristic parameters;

4) Detecting the change points between voice and music: detecting the voice/music cut points based on the characteristic parameters extracted in step 3);

5) Detecting the change points between voice and music under different natural environmental noises at low signal-to-noise ratio: outputting the cut points of the compressed-domain voice/music data under these conditions and obtaining the segmented voice and music sections.

Compared with the prior art, the present invention has the following evident substantive features and notable advantages. It extracts characteristic parameters that effectively reflect the marked difference between voice and music directly from the compressed-domain voice/music data, which is both simpler and faster than fully decompressing the data before extracting features. The compressed-domain probability density ratio crossing rate characteristic parameter locates the voice/music cut points effectively, and the method also segments well under different environmental noises such as car noise, train noise and crowd babble. Experimental results show that, compared with traditional segmentation methods, the segmentation method of the present invention improves markedly in accuracy, noise immunity and overall performance.

Description of drawings

Fig. 1 is the flow chart of the highly robust compressed-domain voice/music segmentation method based on probability density ratio according to the present invention.

Embodiment

A preferred embodiment of the highly robust compressed-domain voice/music segmentation method based on probability density ratio is described below with reference to the accompanying drawing. The method is divided into five steps:

The first step: preprocessing of the compressed-domain voice/music data

Processing of the compressed-domain voice/music data comprises reading the frame header information, reading the side information, reading the main data, Huffman decoding, and inverse quantization.

1) Obtaining the compressed-domain voice/music mixed data

A) Obtain a section of compressed-domain white noise from the audio noise library;

B) Obtain clean compressed-domain voice and music samples from the voice/music library;

C) Obtain compressed-domain voice/music mixed data with a signal-to-noise ratio of 5 dB;

2) Reading the frame header information

A) Read the synchronization information in the frame;

B) Synchronize the decoder with the data stream according to the synchronization information;

C) Determine the start position of the frame data and obtain its frame header information head;

3) Reading the side information

A) Determine the start position of the side information of the frame, i.e. the end of its frame header;

B) Obtain the side information data Side of the frame;

4) Reading the main data

A) Calculate the length Maindata of the main data from the side information;

B) Read the main data of the frame, whose length is Maindata;

C) Obtain the scale factors Scale from the main data;

5) Huffman decoding and inverse quantization

A) Determine the start position of the Huffman data within the main data according to the side information Side;

B) Decode the Huffman data to obtain the Huffman-decoded array is of dimension 32 × 18;

C) Apply inverse quantization to the data in the array is.
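The frame header reading in item 2) above can be illustrated with a minimal sketch. It assumes the standard 4-byte MPEG-1 Layer III header layout; the function name `parse_frame_header` and the returned field names are illustrative and do not appear in the patent. Free-format streams and reserved values are rejected rather than handled.

```python
# Minimal sketch: decode a 4-byte MPEG-1 Layer III frame header.
# Function and field names are hypothetical, chosen for illustration only.
BITRATES_KBPS = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]
SAMPLE_RATES_HZ = [44100, 48000, 32000]


def parse_frame_header(header: bytes) -> dict:
    """Return the fields needed to locate the side information and main data."""
    if len(header) != 4:
        raise ValueError("header must be exactly 4 bytes")
    word = int.from_bytes(header, "big")
    # Bits 31..21: synchronization word, all ones.
    if (word >> 21) & 0x7FF != 0x7FF:
        raise ValueError("sync word not found")
    # Bits 20..19 == 0b11 means MPEG-1; bits 18..17 == 0b01 means Layer III.
    if (word >> 19) & 0x3 != 0b11 or (word >> 17) & 0x3 != 0b01:
        raise ValueError("not an MPEG-1 Layer III frame")
    bitrate_index = (word >> 12) & 0xF
    rate_index = (word >> 10) & 0x3
    if bitrate_index in (0, 15) or rate_index == 3:
        raise ValueError("free-format or reserved value not handled in this sketch")
    has_crc = ((word >> 16) & 0x1) == 0   # protection bit 0 => 16-bit CRC follows
    padding = (word >> 9) & 0x1
    channel_mode = (word >> 6) & 0x3      # 3 means single channel (mono)
    bitrate = BITRATES_KBPS[bitrate_index] * 1000
    sample_rate = SAMPLE_RATES_HZ[rate_index]
    return {
        "bitrate": bitrate,
        "sample_rate": sample_rate,
        "has_crc": has_crc,
        # Total frame length in bytes, header included.
        "frame_bytes": 144 * bitrate // sample_rate + padding,
        # Side information is 17 bytes for mono and 32 bytes otherwise.
        "side_info_bytes": 17 if channel_mode == 3 else 32,
        # Side information starts right after the header (and CRC, if present).
        "side_info_offset": 4 + (2 if has_crc else 0),
    }
```

For a typical 128 kbit/s, 44.1 kHz stereo stream the header bytes are `FF FB 90 00`; the sketch then reports the frame length and where the side information begins, which is the information items 3) and 4) rely on.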

The second step: generating the modified discrete cosine transform (MDCT) matrix

The data of each granule consist of 32 subbands, each containing 18 coefficients. Arranged from low to high frequency, each granule therefore forms a 32 × 18 matrix. The procedure is as follows:

1) Finding the coefficients of each subband

A) Obtain the MDCT coefficients of each of the 32 subbands from the Huffman-decoded array is;

B) From the MDCT coefficients of each subband, obtain its 18 subband coefficients;

C) Rearrange the coefficients within each subband in order of frequency to obtain a new subband coefficient array S;

2) Forming the matrix

A) Taking the row vectors of the subband coefficient array S and combining them in subband order, obtain the array M of dimension 32 × 18;

B) Following the same principle, obtain the MDCT coefficient matrices $M_1$ and $M_2$ of the two granules in the frame.
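A minimal sketch of the matrix formation described above follows. It assumes the 576 dequantized MDCT coefficients of one granule are already ordered from low to high frequency (the reordering needed for short blocks is not handled); the function name `granule_to_matrix` is hypothetical.

```python
# Sketch: arrange one granule's 576 coefficients as a 32 x 18 matrix.
# Assumes the input is already in ascending frequency order.
SUBBANDS = 32
COEFFS_PER_SUBBAND = 18


def granule_to_matrix(coeffs):
    """Return a list of 32 rows, one per subband, each holding 18 coefficients."""
    if len(coeffs) != SUBBANDS * COEFFS_PER_SUBBAND:
        raise ValueError("a granule must contain 576 coefficients")
    return [
        list(coeffs[sb * COEFFS_PER_SUBBAND:(sb + 1) * COEFFS_PER_SUBBAND])
        for sb in range(SUBBANDS)
    ]
```

Calling the function once per granule gives the two matrices of a frame, corresponding to $M_1$ and $M_2$ in the text.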

The third step: extracting characteristic parameters from the compressed-domain voice/music data

The compressed-domain features extracted are the probability density ratio parameter CPR and the probability density ratio crossing rate parameter CPRCR.

1) Computing the compressed-domain probability density ratio (CPR) characteristic parameter

A) Based on the Bayesian criterion in statistics, set two hypotheses $H_0$ and $H_1$:

$H_0$: $Z = N$ (pure noise source)

$H_1$: $Z = N + S$ (voice/music plus noise audio)

where $H_1$ is the mixed compressed voice + music + noise input and $H_0$ is the pure noise model.

B) Constructing the noise model

Let $H_0$ be the compressed-domain white noise model. Following the methods of claims 3 and 4, form the MDCT matrix of the white noise. The white noise constructed here must correspond to a high signal-to-noise-ratio environment relative to the compressed-domain voice/music data.

C) Computing the probability density ratio Bayesian criterion model

$$\Lambda = \prod_{K=1}^{L} \frac{\lambda_N(K)}{\lambda_N(K) + \lambda_Z(K)} \exp\left\{ \frac{\lambda_Z(K)\,|Z_K|^2}{\left(\lambda_N(K) + \lambda_Z(K)\right) \cdot \lambda_N(K)} \right\}$$

where $L$ is the number of MDCT coefficients in each frame of compressed audio and $K$ is the coefficient index; $Z_K$ is the $K$-th MDCT coefficient of each frame of mixed compressed voice/music data; $\lambda_Z(K)$ and $\lambda_N(K)$ are the variances of the audio and of the noise respectively. $\lambda_N(K)$ can be estimated from the noise model, and $\lambda_Z(K)$ can be obtained from the input signal model by the following formula:

$$\frac{\lambda_Z(K)}{\lambda_N(K) + \lambda_Z(K)} = \epsilon_K^{n} = \alpha \cdot \frac{|Z_K^{n-1}|^2}{\lambda_N(K, n-1)} + (1 - \alpha)\, P[\lambda(K) - 1];$$

where

$$P(X) = \begin{cases} X, & X \ge 0 \\ 0, & \text{otherwise} \end{cases}$$

and $\alpha$ is a weight coefficient; the present invention takes $\alpha = 0.98$.

D) Computing the probability density ratio CPR from the probability density ratio Bayesian criterion model

$$\mathrm{CPR}_i = \log \Lambda = \frac{1}{L} \sum_{K=1}^{L} \log \Lambda_K$$
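As an illustration of items C) and D), the following sketch evaluates the per-coefficient log ratio and averages it over a frame. It assumes the variances $\lambda_N(K)$ and $\lambda_Z(K)$ have already been estimated and are strictly positive; the function name `frame_cpr` and its argument names are hypothetical, and the variance estimation itself is not shown.

```python
import math


def frame_cpr(z, lam_n, lam_z):
    """Average log probability density ratio of one frame.

    z      -- MDCT coefficients Z_K of the frame
    lam_n  -- noise variances lambda_N(K), assumed given and positive
    lam_z  -- audio variances lambda_Z(K), assumed given and positive
    """
    count = len(z)
    if count == 0 or len(lam_n) != count or len(lam_z) != count:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for zk, ln, lz in zip(z, lam_n, lam_z):
        # log of: lam_n / (lam_n + lam_z) * exp(lam_z * |Z|^2 / ((lam_n + lam_z) * lam_n))
        total += math.log(ln / (ln + lz)) + lz * abs(zk) ** 2 / ((ln + lz) * ln)
    return total / count
```

Working in the log domain avoids multiplying many exponentials together, which would overflow quickly for frames with large coefficients.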

2) Computing the compressed-domain probability density ratio crossing rate (CPRCR) parameter

A) Computing the thresholds

Compute the compressed-domain probability density ratio thresholds for every half second. To capture the distinguishing detail of voice and music, two thresholds $T_1$ and $T_2$ are chosen, where $T_1$ is the mean CPR and $T_2$ is three times the mean CPR, that is:

$$T_1 = \frac{1}{N} \sum_{i} \mathrm{CPR}[i]$$

$$T_2 = 3 \cdot \frac{1}{N} \sum_{i} \mathrm{CPR}[i]$$

where $\mathrm{CPR}[i]$ is the probability density ratio of frame $i$, the sums run over the frames of the current half second, and $N$ is the number of frames in half a second.

B) Computing the crossing rates CPRcr1 and CPRcr2

$$\mathrm{CPRcr1} = \frac{1}{2} \sum_{m=0}^{N-1} \left| \mathrm{sgn}[\mathrm{CPR}_n(m) - T_1] - \mathrm{sgn}[\mathrm{CPR}_n(m-1) - T_1] \right|$$

$$\mathrm{CPRcr2} = \frac{1}{2} \sum_{m=0}^{N-1} \left| \mathrm{sgn}[\mathrm{CPR}_n(m) - T_2] - \mathrm{sgn}[\mathrm{CPR}_n(m-1) - T_2] \right|$$

The value for this segment is $\mathrm{CPRcr} = \mathrm{CPRcr1} + \mathrm{CPRcr2}$, where sgn is the sign function and $\mathrm{CPR}_n(m)$ is the $m$-th CPR value in the $n$-th half second.

C) Computing the final compressed-domain probability density ratio crossing rate CPRCR for every half second

$$\mathrm{CPRCR} = \frac{\mathrm{CPRcr}}{\mathrm{CPRcr}_{\max}}$$

where

$$\mathrm{CPRcr}_{\max} = \max_{i \in s} \left( \mathrm{CPRcr1}(i) + \mathrm{CPRcr2}(i) \right);$$

The above process normalizes CPRcr.
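The threshold, crossing-rate and normalization computations can be sketched as follows. This is a sketch under stated assumptions, not the patented implementation: each crossing count compares consecutive CPR values against a single threshold, the sum starts at the second value of the segment because the first has no predecessor inside it, and the maximum is taken over all segments passed in. All function names are hypothetical.

```python
def sgn(x):
    """Sign function: -1, 0 or 1."""
    return (x > 0) - (x < 0)


def crossing_count(values, threshold):
    """Half the summed sign changes of (value - threshold) between neighbours."""
    return 0.5 * sum(
        abs(sgn(values[m] - threshold) - sgn(values[m - 1] - threshold))
        for m in range(1, len(values))
    )


def segment_cprcr(cpr_segment):
    """Unnormalized CPRcr = CPRcr1 + CPRcr2 for one half-second segment."""
    if not cpr_segment:
        raise ValueError("segment must contain at least one CPR value")
    t1 = sum(cpr_segment) / len(cpr_segment)   # T1: mean CPR of the segment
    t2 = 3 * t1                                # T2: three times the mean
    return crossing_count(cpr_segment, t1) + crossing_count(cpr_segment, t2)


def normalized_cprcr(segments):
    """Divide each segment's CPRcr by the largest CPRcr among the segments."""
    raw = [segment_cprcr(seg) for seg in segments]
    peak = max(raw)
    return [r / peak if peak > 0 else 0.0 for r in raw]
```

With this normalization every CPRCR value lies between 0 and 1, which is what makes a fixed threshold such as 0.5 meaningful in the detection step.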

The fourth step: detecting the change points in the compressed-domain voice/music data

To keep the segmentation continuous and to prevent misjudgment, the present invention requires every segmented voice or music section to be longer than one second: M (M = 2) consecutive CPRCR values must lie above or below the threshold r before a point counts as a valid CPRCR cut point.

1) As described in the third step, compute the compressed-domain probability density ratio parameter of each frame of compressed voice/music data, then obtain the compressed-domain probability density ratio crossing rate CPRCR parameter for every half second from this characteristic parameter;

2) Correcting CPRCR

A point whose value lies above or below the threshold but which does not satisfy the condition that M consecutive values lie above or below r (r = 0.5) is called a singular point. Find all singular points and process them by replacing the current point with the mean of the data before and after it:

$$\mathrm{CPRCR}[i] = \frac{1}{2} \left( \mathrm{CPRCR}[i-1] + \mathrm{CPRCR}[i+1] \right);$$

Finding all singular points before segmentation ensures the validity of the segmentation and reduces the probability of misjudgment.

3) Threshold comparison

CPRCR is compared against the threshold $\xi = 0.5$.

4) Cut-point detection

Given the probability density ratio characteristics of voice and music, voice contains far more small probability density ratio sequences than music, so the CPRCR of voice is much smaller than that of music. Sections whose CPRCR is smaller than $\xi$ are therefore detected as voice, and sections whose CPRCR is larger than $\xi$ are detected as music;

5) Output the cut points of the compressed voice/music data.
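The correction and detection items above can be sketched as follows. The sketch interprets a singular point, for M = 2, as a value that lies on the opposite side of the threshold from both of its neighbours; that reading, the function names, and the decision to compute replacements from the uncorrected neighbours are assumptions made for illustration.

```python
def smooth_singular_points(cprcr, xi=0.5):
    """Replace isolated threshold crossings by the mean of their neighbours."""
    out = list(cprcr)
    for i in range(1, len(cprcr) - 1):
        below = cprcr[i] < xi
        # Singular: both neighbours fall on the other side of the threshold.
        if (cprcr[i - 1] < xi) != below and (cprcr[i + 1] < xi) != below:
            out[i] = 0.5 * (cprcr[i - 1] + cprcr[i + 1])
    return out


def detect_change_points(cprcr, xi=0.5, step_seconds=0.5):
    """Label each half-second value and return (labels, change times in seconds)."""
    smoothed = smooth_singular_points(cprcr, xi)
    labels = ["voice" if v < xi else "music" for v in smoothed]
    changes = [
        i * step_seconds
        for i in range(1, len(labels))
        if labels[i] != labels[i - 1]
    ]
    return labels, changes
```

For example, the sequence `[0.2, 0.1, 0.9, 0.2, 0.3, 0.8, 0.9, 0.7]` contains one isolated spike at the third value; after correction the first five half-seconds are labelled voice and the last three music, with a single change point at 2.5 s.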

The fifth step: detecting compressed-domain voice/music change points under different natural environmental noises at low signal-to-noise ratio

1) Obtaining the compressed-domain voice/music mixed data under different natural environmental noises

A) Obtain train sound and car sound from the audio library as natural environmental noise;

C) Based on the natural environmental noise, obtain compressed-domain voice/music mixed data with a signal-to-noise ratio of 5 dB;

2) Repeat the procedure from item 2) of the first step to the end of the fourth step, and output the cut points of the compressed-domain voice/music data under the corresponding natural environmental noise.
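The text does not say how the 5 dB mixtures are produced. One common way, shown here purely as an assumption, is to scale the noise in the time domain before encoding so that the ratio of mean signal power to mean noise power equals the target. The function names below are hypothetical.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a sample sequence."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(signal, noise, snr_db=5.0):
    """Add noise to signal, scaled so the mixture has the requested SNR in dB."""
    if not signal or len(signal) != len(noise):
        raise ValueError("signal and noise must be non-empty and of equal length")
    p_signal = mean_power(signal)
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise must not be silent")
    # Choose gain so that p_signal / (gain^2 * p_noise) == 10^(snr_db / 10).
    gain = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(signal, noise)]
```

The mixed samples would then be encoded to MP3 before being passed through the first to fourth steps.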

Experimental result

The experiments use a provincial television station news broadcast voice library, a Bandari album music library, and a natural environmental noise library taken from the sounddogs website (car noise, train noise, crowd babble and the like). The compressed-domain voice/music data are in MP3 format with a sampling frequency of 44.1 kHz, and the total duration is about 270 minutes (3 minutes × 92 mixed compressed-domain voice/music segments).

The traditional BIC segmentation detection method and the method of the present invention were each tested on the above compressed-domain voice/music data, and the accuracy was assessed by the judgment accuracy rate of the voice/music cut points. The judgment accuracy rate is defined as the percentage of correctly judged cut points among all cut points to be detected, computed as follows:

$$\mathrm{AccuracyRate}\,(\%) = \left( 1 - \frac{N_{S \to M}}{N} - \frac{N_{M \to S}}{N} \right) \times 100\,\%$$

In the formula, $N_{S \to M}$ is the number of points that are voice but were misjudged as music, $N_{M \to S}$ is the number of points that are music but were misjudged as voice, and $N$ is the total number of CPRCR points in the sample under test.
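The accuracy formula translates directly into code. The function name `accuracy_rate` is illustrative, and the counts in the usage note are made-up numbers for demonstration, not experimental data from this document.

```python
def accuracy_rate(n_voice_as_music, n_music_as_voice, n_total):
    """Judgment accuracy rate in percent, following the formula above."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_voice_as_music < 0 or n_music_as_voice < 0:
        raise ValueError("error counts must be non-negative")
    # Algebraically equal to (1 - N_SM/N - N_MS/N) * 100.
    return 100.0 * (n_total - n_voice_as_music - n_music_as_voice) / n_total
```

For instance, with 5 voice points misjudged as music and 5 music points misjudged as voice out of 100 points, the function returns 90.0.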

The judgment accuracy rate expresses the proportion of correct cut points among all points to be detected and so characterizes the correctness of the detection result.

The experimental statistics show the following. With the traditional BIC detection method, the cut-point detection accuracy for compressed-domain voice/music data at a signal-to-noise ratio of 5 dB is only 30.56% under white noise, and is lower still under natural noise: only 25.27% under train noise and only 22.15% under car noise. This falls far short of normal segmentation requirements, and the method cannot be considered to segment compressed-domain voice/music data effectively. With the method of the present invention, the cut-point detection accuracy at a signal-to-noise ratio of 5 dB reaches 82.25% under white noise, and good segmentation is also achieved under natural noise: 81.09% under train noise and 78.21% under car noise.

It follows that the highly robust compressed-domain voice/music segmentation method based on probability density ratio can effectively detect voice/music cut points in compressed-domain voice/music data under different natural environmental noises at low signal-to-noise ratio, and thereby solves the problem of voice/music change-point detection in the compressed domain under these conditions. The invention can be further used in applications such as compressed-domain voice/music recognition, voice/music classified retrieval, and audio scene analysis.

Claims

1. A highly robust compressed-domain voice/music segmentation method based on probability density ratio, characterized in that: first, new characteristic parameters based on the probability density ratio, which reflect the different characteristics of voice and music, are extracted from low signal-to-noise-ratio compressed-domain voice/music mixed data, namely the compressed-domain probability density ratio and the compressed-domain probability density ratio crossing rate; then change-point detection is performed on the compressed-domain voice and music based on these new characteristic parameters; finally, segmentation is performed accordingly to obtain the voice sections and music sections separated by the cut points.

2. The highly robust compressed-domain voice/music segmentation method based on probability density ratio according to claim 1, characterized in that the specific operating steps are as follows:

1) Preprocessing of the compressed voice/music data: obtaining the compressed-domain voice/music mixed data, reading the frame header and side information, reading the main data, Huffman decoding, and inverse quantization;

5) Detecting the change points between voice and music under different natural environmental noises at low signal-to-noise ratio: outputting the cut points of the compressed-domain voice/music data under these conditions and obtaining the segmented voice and music sections.

3. The highly robust compressed-domain voice/music segmentation method based on probability density ratio according to claim 2, characterized in that the specific steps of the preprocessing of the compressed voice/music data in said step 1) are:

(1) Obtaining the compressed-domain voice/music mixed data

(2) Reading the frame header information

A) Read the synchronization information in the frame;

(3) Reading the side information

B) Obtain the side information data Side of the frame;

(4) Reading the main data

B) Read the main data of the frame, whose length is Maindata;

C) Obtain the scale factors Scale from the main data;

(5) Huffman decoding and inverse quantization

C) Apply inverse quantization to the data in the array is.

4. The highly robust compressed-domain voice/music segmentation method based on probability density ratio according to claim 2, characterized in that the specific steps of generating the modified discrete cosine transform (MDCT) matrix in said step 2) are:

(1) Finding the coefficients of each subband

(2) Forming the matrix

5. The highly robust compressed-domain voice/music segmentation method based on probability density ratio according to claim 2, characterized in that the specific steps of extracting characteristic parameters from the compressed-domain voice/music data in said step 3) are:

(1) Computing the compressed-domain probability density ratio (CPR) characteristic parameter

A) Based on the Bayesian criterion in statistics, set two hypotheses $H_0$ and $H_1$:

$H_0$: $Z = N$ (pure noise source)

$H_1$: $Z = N + S$ (voice/music plus noise audio)

where $H_1$ is the MP3 voice + music + noise input and $H_0$ is the pure noise model;

B) Construct the noise model: let $H_0$ be the compressed-domain white noise model and, following the method of specific step (2) of step 2), form the MDCT matrix of the white noise; the white noise constructed here must correspond to a high signal-to-noise-ratio environment relative to the compressed-domain voice/music data;

C) Compute the probability density ratio Bayesian criterion model:

$$\Lambda = \prod_{K=1}^{L} \frac{\lambda_N(K)}{\lambda_N(K) + \lambda_Z(K)} \exp\left\{ \frac{\lambda_Z(K)\,|Z_K|^2}{\left(\lambda_N(K) + \lambda_Z(K)\right) \cdot \lambda_N(K)} \right\}$$

$$\frac{\lambda_Z(K)}{\lambda_N(K) + \lambda_Z(K)} = \epsilon_K^{n} = \alpha \cdot \frac{|Z_K^{n-1}|^2}{\lambda_N(K, n-1)} + (1 - \alpha)\, P[\lambda(K) - 1];$$

where

$$P(X) = \begin{cases} X, & X \ge 0 \\ 0, & \text{otherwise} \end{cases}$$

and $\alpha$ is a weight coefficient, taken as $\alpha = 0.98$.

D) Compute the compressed-domain probability density ratio CPR from the Bayesian criterion model

$$\mathrm{CPR}_i = \log \Lambda = \frac{1}{L} \sum_{K=1}^{L} \log \Lambda_K$$

(2) Computing the compressed-domain probability density ratio crossing rate (CPRCR) parameter

A) Compute the thresholds

$$T_1 = \frac{1}{N} \sum_{i} \mathrm{CPR}[i]$$

$$T_2 = 3 \cdot \frac{1}{N} \sum_{i} \mathrm{CPR}[i]$$

where $\mathrm{CPR}[i]$ is the probability density ratio of frame $i$, the sums run over the frames of the current half second, and $N$ is the number of frames in half a second;

B) Compute the crossing rates CPRcr1 and CPRcr2

$$\mathrm{CPRcr1} = \frac{1}{2} \sum_{m=0}^{N-1} \left| \mathrm{sgn}[\mathrm{CPR}_n(m) - T_1] - \mathrm{sgn}[\mathrm{CPR}_n(m-1) - T_1] \right|$$

$$\mathrm{CPRcr2} = \frac{1}{2} \sum_{m=0}^{N-1} \left| \mathrm{sgn}[\mathrm{CPR}_n(m) - T_2] - \mathrm{sgn}[\mathrm{CPR}_n(m-1) - T_2] \right|$$

The value for this segment is $\mathrm{CPRcr} = \mathrm{CPRcr1} + \mathrm{CPRcr2}$, where sgn is the sign function and $\mathrm{CPR}_n(m)$ is the $m$-th CPR value in the $n$-th half second;

C) Compute the final probability density ratio crossing rate CPRCR for every half second

$$\mathrm{CPRCR} = \frac{\mathrm{CPRcr}}{\mathrm{CPRcr}_{\max}}$$

where

$$\mathrm{CPRcr}_{\max} = \max_{i \in s} \left( \mathrm{CPRcr1}(i) + \mathrm{CPRcr2}(i) \right);$$

The above process normalizes CPRcr.

6. The highly robust compressed-domain voice/music segmentation method based on probability density ratio according to claim 2, characterized in that the specific steps of the voice/music change-point detection in said step 4) are:

(1) Compute the compressed-domain probability density ratio parameter of each frame of data by said step 3), then obtain the compressed-domain probability density ratio crossing rate CPRCR characteristic parameter for every half second from this characteristic parameter;

(2) Correcting CPRCR

A point whose value lies above or below the threshold but which does not satisfy the condition that M consecutive values lie above or below r, r = 0.5, is called a singular point; find all singular points and process them by replacing the current point with the mean of the data before and after it:

$$\mathrm{CPRCR}[i] = \frac{1}{2} \left( \mathrm{CPRCR}[i-1] + \mathrm{CPRCR}[i+1] \right);$$

Finding all singular points before segmentation ensures the validity of the segmentation and reduces the probability of misjudgment;

(3) Threshold comparison

CPRCR is compared against the threshold $\xi = 0.5$.

(4) Cut-point detection

Given the probability density ratio characteristics of voice and music, voice contains far more small probability density ratio sequences than music, so the CPRCR of voice is much smaller than that of music. Sections whose CPRCR is smaller than $\xi$ are therefore detected as voice change points, and sections whose CPRCR is larger than $\xi$ are detected as music change points;

(5) Output the cut points of the compressed voice/music data.

7. The highly robust compressed-domain voice/music segmentation method based on probability density ratio according to claim 2, characterized in that the specific steps of the voice/music change-point detection under different natural environmental noises at low signal-to-noise ratio in said step 5) are:

(1) Obtaining the compressed-domain voice/music mixed data under different natural environmental noises

(2) Process the compressed-domain voice/music mixed data according to the steps of claims 3 to 6, and output the cut points of the compressed-domain voice/music data under natural environmental noise at low signal-to-noise ratio, thereby obtaining the segmented voice and music sections.