CN101645265B - Method and device for identifying audio category in real time - Google Patents

Method and device for identifying audio category in real time

Info

Publication number
CN101645265B
CN101645265B (application CN2008101422448A)
Authority
CN
China
Prior art keywords
frame
audio signal
real cepstrum
tonality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2008101422448A
Other languages
Chinese (zh)
Other versions
CN101645265A (en)
Inventor
付中华
刘开文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Priority to CN2008101422448A priority Critical patent/CN101645265B/en
Publication of CN101645265A publication Critical patent/CN101645265A/en
Application granted granted Critical
Publication of CN101645265B publication Critical patent/CN101645265B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a method and a device for identifying an audio category in real time. The method comprises the following steps: a. calculating the short-time energy root mean square (RMS) of the audio signal in an analysis interval, and entering step b when the short-time energy RMS is smaller than a preset mute-detection threshold; b. performing real cepstral analysis on each frame of the audio signal; c. calculating short-time features of the audio signal from the real cepstral analysis result, and identifying the category of the audio signal by a threshold method based on those features. The technical scheme of the invention effectively achieves real-time identification of the audio category based on the real cepstrum.

Description

Method and device for real-time identification of audio categories
Technical field
The present invention relates to the field of communications, and in particular to a method and device for real-time identification of audio categories.
Background art
In audio encoding and decoding, music and speech signals are usually coded with different codec modes; the audio category therefore has to be identified before coding, to determine whether the signal is music or speech.
The difficulty in identifying the audio category lies in the variability of music and of the noise in speech. At present, music/speech discrimination relies mainly on short-time analysis and long-time analysis. The short-time features extracted from the audio signal exploit only a small amount of useful information and are insufficient to reflect the difference between the two signal classes. Long-time analysis, lacking a strong feature description, either performs the identification over a long time slice (for example by analysing a whole audio file) or derives statistical features from the audio dynamics. The former reflects the difference between music and speech well, but demands a high sampling rate and heavy computation and introduces long delay, which makes it unsuitable for real-time communication; the latter's features are not robust enough to guarantee reliable identification in a complex communication environment.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method and device for real-time identification of audio categories that effectively realize real-time identification of the audio category based on the real cepstrum.
The technical solution adopted by the present invention to solve this problem is as follows:
A real-time identification method for audio categories comprises the following steps:
a. calculating the short-time energy root mean square (RMS) of the audio signal in an analysis interval, and entering step b when the short-time energy RMS is smaller than a preset mute-detection threshold;
b. performing real cepstral analysis on each frame of the audio signal;
c. calculating each LPH (Longest Pitch Hold, pitch hold time) / PCT (Pitch Continue Time, pitch continuation time) value of the audio signal from the real cepstral analysis result;
d. judging whether the number of occurrences of LPH/PCT = 1 exceeds a preset equality-count threshold; if so, the audio signal is music.
In the above scheme, after judging whether the number of occurrences of LPH/PCT = 1 exceeds the preset equality-count threshold, the method further comprises:
e. when the number of occurrences of LPH/PCT = 1 does not exceed the preset equality-count threshold, determining the mean of LPH/PCT from the individual LPH/PCT values and judging whether it exceeds a preset mean threshold; if so, the audio signal is music; otherwise, proceeding to the next step;
f. calculating the APD (Average Pitch Density) of the audio signal from the real cepstral analysis result and judging whether it exceeds a preset density threshold; if so, the audio signal is music; otherwise, proceeding to the next step;
g. counting the tonal and non-tonal frames of the audio signal and calculating their average energies from the real cepstral analysis result;
h. determining the TNR (Tone/Non-tone Ratio, the ratio of the average energy of the tonal frames to that of the non-tonal frames) and judging whether it is smaller than a preset energy-ratio threshold; if so, the audio signal is music; otherwise, proceeding to the next step;
i. determining the RNT (Ratio of Non-Tone, the proportion of non-tonal frames) from the numbers of tonal and non-tonal frames and judging whether it is smaller than a preset proportion threshold; if so, the audio signal is music; otherwise, the audio signal is speech.
The APD is computed as:
APD = Σ_{i=1}^{N} (1/L) · Σ_{j=l₁}^{l₂} |RC_{x_i}(j)|
where N is the number of signal frames in the audio signal, RC_{x_i}(j) is the j-th point of the real cepstrum of the i-th frame of the audio signal, L = l₂ - l₁ + 1, and l₁ and l₂ are respectively the start and end points of the portion of the real cepstral analysis result that reflects spectral detail.
In the above scheme, step c is realized by the following sub-steps:
c1. based on the portion of the real cepstral analysis result that reflects spectral detail, determining the pitch-continuous signal groups in the audio signal, and, within each pitch-continuous group, the sub-groups in which the pitch remains unchanged;
c2. determining each PCT from the number of frames in the corresponding pitch-continuous group, and each LPH from the number of frames in the corresponding pitch-hold sub-group;
c3. dividing each LPH by the corresponding PCT.
In the above scheme, within the frames of a pitch-continuous group, the difference between the peaks of the spectral-detail portions of the real cepstra of any two adjacent frames is smaller than a preset peak-tracking error; within the frames of a pitch-hold sub-group, this peak difference is smaller than a preset peak-hold error; the peak-hold error is smaller than the peak-tracking error.
In the above scheme, step g is realized by the following sub-steps:
g1. labelling each frame of the audio signal as tonal or non-tonal based on the portion of the real cepstral analysis result that reflects spectral detail;
g2. counting the tonal and non-tonal frames and calculating their average energies.
In the above scheme, a frame is labelled non-tonal when the peak of the spectral-detail portion of its real cepstrum is smaller than a preset tonality threshold, and tonal otherwise.
In the above scheme, the method further comprises, before step a, pre-processing the audio signal by applying pre-emphasis, framing and windowing in turn.
A real-time identification device for audio categories comprises:
a mute-detection module, configured to calculate the short-time energy RMS of the audio signal in an analysis interval and judge from it whether the audio signal is in a mute state;
a real cepstral analysis module, configured to perform real cepstral analysis on each frame of the audio signal when the mute-detection module determines that the audio signal is not in a mute state;
an audio-category identification module, configured to calculate each LPH/PCT value of the audio signal from the real cepstral analysis result, and to judge whether the number of occurrences of LPH/PCT = 1 exceeds the preset equality-count threshold; if so, the audio signal is music.
In the above scheme, the device further comprises a pre-processing module, configured to apply pre-emphasis, framing and windowing to the audio signal in turn and pass the processed audio signal to the mute-detection module.
The audio-category identification module is further configured to: when the number of occurrences of LPH/PCT = 1 does not exceed the preset equality-count threshold, determine the mean of LPH/PCT from the individual LPH/PCT values and judge whether it exceeds the preset mean threshold, in which case the audio signal is music; otherwise calculate the average pitch density APD of the audio signal from the real cepstral analysis result and judge whether it exceeds the preset density threshold, in which case the audio signal is music; otherwise count the tonal and non-tonal frames of the audio signal and calculate their average energies from the real cepstral analysis result, determine the tone/non-tone energy ratio TNR from those average energies and judge whether it is smaller than the preset energy-ratio threshold, in which case the audio signal is music; otherwise determine the ratio of non-tonal frames RNT from the frame counts and judge whether it is smaller than the preset proportion threshold, in which case the audio signal is music; otherwise the audio signal is speech.
The main benefit of the present invention is that the real-time identification device realizes the real-time identification method provided by the invention: the method computes short-time features of the audio signal from the real cepstral analysis result of each frame and applies a threshold method, thereby effectively achieving real-time identification of the audio category based on the real cepstrum.
Description of drawings
Fig. 1 is a flowchart of the real-time identification method for audio categories of the present invention;
Fig. 2 is a structural diagram of the real-time identification device for audio categories of the present invention.
Embodiment
The invention is further described below with reference to the accompanying drawings.
With reference to Fig. 1, a real-time identification method for audio categories comprises the following steps:
S101: pre-process the audio signal by applying pre-emphasis, framing and windowing in turn;
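The pre-processing of S101 can be sketched as follows. This is a minimal sketch, not the patent's implementation: the pre-emphasis coefficient, frame length and hop are illustrative (the worked example later in the description uses 8 kHz audio, 32 ms frames and a 10 ms shift), and the Hamming window is an assumption, since the patent does not name a window function.

```python
import numpy as np

def preprocess(x, pre_emph=0.8, frame_len=256, hop=80):
    """Pre-emphasis, framing and windowing (illustrative parameter values)."""
    x = np.asarray(x, dtype=float)
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - a*x[n-1]
    y = np.append(x[0], x[1:] - pre_emph * x[:-1])
    # Split into overlapping frames of frame_len samples, hop samples apart
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop:i * hop + frame_len] for i in range(n_frames)])
    # Apply a Hamming window to each frame (assumed window choice)
    return frames * np.hamming(frame_len)
```

With 8 kHz input, `frame_len=256` and `hop=80` correspond to the 32 ms frames and 10 ms shift of the worked example.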
S102: calculate the short-time energy RMS of the audio signal in the analysis interval; when it is smaller than the preset mute-detection threshold, the audio signal is in a non-mute state and the flow proceeds to the next step; otherwise the audio signal is in a mute state and the flow ends;
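Step S102 amounts to a single RMS comparison. In this sketch the threshold value is a placeholder (the patent presets it but gives no number), and, following the patent text of S102, the flow continues to cepstral analysis when the RMS falls below the threshold.

```python
import numpy as np

def should_analyze(frames, silence_threshold=0.01):
    """Short-time energy RMS over the analysis interval (step S102)."""
    rms = np.sqrt(np.mean(np.asarray(frames, dtype=float) ** 2))
    # Per the patent text, cepstral analysis proceeds when the RMS is
    # below the preset mute-detection threshold; otherwise the flow ends.
    return rms < silence_threshold
```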
S103: perform real cepstral analysis on each frame of the audio signal in the analysis interval. In the real cepstrum of a frame, the part near quefrency 0 mainly reflects large-scale information such as the power-spectrum envelope, while the part away from 0 mainly reflects spectral detail; the real cepstrum thus separates the spectral envelope from the spectral detail;
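The real cepstral analysis of S103 is the inverse FFT of the log magnitude spectrum; this is the standard definition of the real cepstrum, sketched here with a 256-point FFT as in the patent's worked example. The small floor added before the logarithm is a numerical safeguard, not part of the patent.

```python
import numpy as np

def real_cepstrum(frame, nfft=256):
    """Real cepstrum of one frame: IFFT of the log magnitude spectrum.
    Low quefrencies reflect the spectral envelope; quefrencies away
    from 0 carry spectral detail such as pitch harmonics."""
    spectrum = np.abs(np.fft.fft(frame, nfft))
    return np.real(np.fft.ifft(np.log(spectrum + 1e-12)))
```

For a periodic frame, the spectral-detail range shows a peak at the quefrency of the pitch period, which is what the later steps track.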
S104: calculate each LPH/PCT value of the audio signal from the real cepstral analysis result, as follows:
First, based on the spectral-detail portion of the real cepstral analysis result, determine the pitch-continuous signal groups in the audio signal, and, within each pitch-continuous group, the sub-groups in which the pitch remains unchanged.
Within the frames of a pitch-continuous group, the difference between the peaks of the spectral-detail portions of the real cepstra of adjacent frames is smaller than a preset peak-tracking error σ; within the frames of a pitch-hold sub-group, this peak difference is smaller than a preset peak-hold error ε; ε is smaller than σ.
Then, determine each PCT from the number of frames in the corresponding pitch-continuous group, and each LPH from the number of frames in the corresponding pitch-hold sub-group.
Finally, divide each LPH by the corresponding PCT;
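Assuming each frame has been reduced to the position of its strongest cepstral peak in the spectral-detail range, the grouping of S104 can be sketched as below. σ = 4 and ε = 1 follow the worked example at the end of the description; the peak-position representation is an assumption, since the patent speaks of peak differences without fixing a representation.

```python
def lph_pct_ratios(peaks, sigma=4, eps=1):
    """Sketch of sub-steps c1-c3: split frames into pitch-continuous
    groups (adjacent peak drift < sigma), find the longest pitch-hold
    run in each (drift < eps), and return LPH/PCT per group."""
    ratios = []
    start = 0
    for end in range(1, len(peaks) + 1):
        # Close the current group at the end of input or on a pitch jump
        if end == len(peaks) or abs(peaks[end] - peaks[end - 1]) >= sigma:
            group = peaks[start:end]
            pct = len(group)  # Pitch Continue Time, in frames
            # Longest run where consecutive peaks differ by less than eps
            lph, run = 1, 1
            for a, b in zip(group, group[1:]):
                run = run + 1 if abs(b - a) < eps else 1
                lph = max(lph, run)
            ratios.append(lph / pct)
            start = end
    return ratios
```

A steady pitch yields ratios of 1, while a gliding pitch yields ratios well below 1, which is exactly what steps S105 and S106 test.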
S105: judge whether the number of occurrences of LPH/PCT = 1 exceeds the preset equality-count threshold C₁; if so, the audio signal is music; otherwise, proceed to the next step;
S106: determine the mean of LPH/PCT from the individual LPH/PCT values and judge whether it exceeds the preset mean threshold C₂; if so, the audio signal is music; otherwise, proceed to the next step. For music, the pitch dwells on a particular value for some time, so LPH is very likely to equal PCT, and even when the ratio is not exactly 1 it remains close to 1; for speech, the pitch rarely dwells on a particular value, so LPH seldom equals PCT and the two differ considerably;
S107: calculate the APD of the audio signal from the real cepstral analysis result and judge whether it exceeds the preset density threshold C₃. Because of instruments and polyphony, the average pitch density of music is higher than that of speech; if APD exceeds C₃, the audio signal is music; otherwise, proceed to the next step. The APD is computed as:
APD = Σ_{i=1}^{N} (1/L) · Σ_{j=l₁}^{l₂} |RC_{x_i}(j)|
where N is the number of signal frames in the audio signal, RC_{x_i}(j) is the j-th point of the real cepstrum of the i-th frame of the audio signal, L = l₂ - l₁ + 1, and l₁ and l₂ are respectively the start and end points of the spectral-detail portion of the real cepstral analysis result;
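A direct transcription of the APD formula, using the l₁ = 14, l₂ = 128 values from the worked example. Note that the formula as given sums the per-frame mean over all N frames rather than averaging over N; that is followed here.

```python
import numpy as np

def average_pitch_density(cepstra, l1=14, l2=128):
    """APD per the patent's formula: for each frame, the mean absolute
    cepstral value over the spectral-detail range [l1, l2], summed
    over all N frames."""
    L = l2 - l1 + 1
    return sum(np.sum(np.abs(rc[l1:l2 + 1])) / L for rc in cepstra)
```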
S108: count the tonal and non-tonal frames of the audio signal and calculate their average energies from the real cepstral analysis result, as follows:
First, label each frame of the audio signal as tonal or non-tonal based on the spectral-detail portion of the real cepstral analysis result.
A frame in which a pitch is present is tonal, and a frame without a pitch is non-tonal; therefore, when the peak of the spectral-detail portion of a frame's real cepstrum is smaller than a preset tonality threshold θ, the frame is labelled non-tonal, and otherwise tonal.
Then, count the tonal and non-tonal frames and calculate their average energies;
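The two sub-steps of S108 can be sketched as below; θ = 0.2 follows the worked example, and per-frame energy is taken as the mean squared sample value, which is an assumption since the patent does not define the energy measure.

```python
import numpy as np

def mark_tonality(frames, cepstra, theta=0.2, l1=14, l2=128):
    """Label each frame tonal or non-tonal by the peak of the
    spectral-detail range of its real cepstrum (g1), then count the
    frames and compute the average energy of each class (g2)."""
    tonal = np.array([np.max(rc[l1:l2 + 1]) >= theta for rc in cepstra])
    energy = np.mean(np.asarray(frames, dtype=float) ** 2, axis=1)
    counts = (int(tonal.sum()), int((~tonal).sum()))
    avg_e = (energy[tonal].mean() if tonal.any() else 0.0,
             energy[~tonal].mean() if (~tonal).any() else 0.0)
    return tonal, counts, avg_e
```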
S109: determine the TNR from the average energies of the tonal and non-tonal frames and judge whether it is smaller than the preset energy-ratio threshold C₄; if so, the audio signal is music; otherwise, proceed to the next step;
S110: determine the RNT from the numbers of tonal and non-tonal frames and judge whether it is smaller than the preset proportion threshold C₅; if so, the audio signal is music; otherwise, the audio signal is speech.
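The five decisions S105 to S110 form a cascade in which any test that fires declares music, and only a signal failing all five is labelled speech. The threshold values below follow the worked example at the end of the description; the feature values are assumed to have already been computed per steps S104 to S109.

```python
def classify(eq_count, ratios, apd, tnr, rnt,
             c1=0, c2=0.5, c3=0.6, c4=0.2, c5=1.0):
    """Threshold cascade of S105-S110 (thresholds from the worked example)."""
    if eq_count > c1:                                  # S105: count of LPH/PCT == 1
        return "music"
    mean_ratio = sum(ratios) / len(ratios) if ratios else 0.0
    if mean_ratio > c2:                                # S106: mean LPH/PCT
        return "music"
    if apd > c3:                                       # S107: average pitch density
        return "music"
    if tnr < c4:                                       # S109: tone/non-tone energy ratio
        return "music"
    if rnt < c5:                                       # S110: non-tonal frame ratio
        return "music"
    return "speech"
```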
With reference to Fig. 2, a real-time identification device for audio categories implementing the above method comprises:
a pre-processing module, configured to apply pre-emphasis, framing and windowing to the audio signal in turn and pass the processed audio signal to the mute-detection module;
a mute-detection module, configured to calculate the short-time energy RMS of the pre-processed audio signal in the analysis interval and judge from it whether the audio signal is in a mute state;
a real cepstral analysis module, configured to perform real cepstral analysis on each frame of the audio signal when the mute-detection module determines that the audio signal is not in a mute state;
an audio-category identification module, configured to calculate short-time features of the audio signal from the analysis result of the real cepstral analysis module and to identify the category of the audio signal by a threshold method based on those features.
For an audio signal with 8 kHz sampling, 16-bit quantization, a pre-emphasis factor of -0.80, a frame length of 32 ms and a frame shift of 10 ms (22 ms overlap between frames), with a fast Fourier transform length of 256, the start point l₁ of the spectral-detail portion of the real cepstrum is 14 and the end point l₂ is 128. With σ = 4, ε = 1, θ = 0.2 and N = 100 frames in the analysis interval, the thresholds are C₁ = 0, C₂ = 0.5, C₃ = 0.6, C₄ = 0.2 and C₅ = 1. When the method of the invention is applied under these settings, the five short-time features jointly judge the signal and effectively realize identification of the audio category.
The above are merely embodiments of the invention and do not limit it; those skilled in the art may make various changes and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of the claims of the invention.

Claims (9)

1. A real-time identification method for audio categories, characterized by comprising the following steps:
a. calculating the short-time energy root mean square of the audio signal in an analysis interval, and entering step b when the short-time energy root mean square is smaller than a preset mute-detection threshold;
b. performing real cepstral analysis on each frame of the audio signal;
c. calculating each pitch hold time LPH / pitch continuation time PCT value of the audio signal from the real cepstral analysis result;
d. judging whether the number of occurrences of LPH/PCT = 1 exceeds a preset equality-count threshold; if so, the audio signal is music.
2. The real-time identification method for audio categories of claim 1, characterized in that, after judging whether the number of occurrences of LPH/PCT = 1 exceeds the preset equality-count threshold, the method comprises:
e. when the number of occurrences of LPH/PCT = 1 does not exceed the preset equality-count threshold, determining the mean of LPH/PCT from the individual LPH/PCT values and judging whether it exceeds a preset mean threshold; if so, the audio signal is music; otherwise, proceeding to the next step;
f. calculating the average pitch density APD of the audio signal from the real cepstral analysis result and judging whether it exceeds a preset density threshold; if so, the audio signal is music; otherwise, proceeding to the next step;
g. counting the tonal and non-tonal frames of the audio signal and calculating their average energies from the real cepstral analysis result;
h. determining the tone/non-tone energy ratio TNR from the average energies of the tonal and non-tonal frames and judging whether it is smaller than a preset energy-ratio threshold; if so, the audio signal is music; otherwise, proceeding to the next step;
i. determining the ratio of non-tonal frames RNT from the numbers of tonal and non-tonal frames and judging whether it is smaller than a preset proportion threshold; if so, the audio signal is music; otherwise, the audio signal is speech;
wherein the APD is computed as:
APD = Σ_{i=1}^{N} (1/L) · Σ_{j=l₁}^{l₂} |RC_{x_i}(j)|
where N is the number of signal frames in the audio signal, RC_{x_i}(j) is the j-th point of the real cepstrum of the i-th frame of the audio signal, L = l₂ - l₁ + 1, and l₁ and l₂ are respectively the start and end points of the portion of the real cepstral analysis result that reflects spectral detail.
3. The real-time identification method for audio categories of claim 2, characterized in that step c is realized by the following sub-steps:
c1. based on the portion of the real cepstral analysis result that reflects spectral detail, determining the pitch-continuous signal groups in the audio signal, and, within each pitch-continuous group, the sub-groups in which the pitch remains unchanged;
c2. determining each PCT from the number of frames in the corresponding pitch-continuous group, and each LPH from the number of frames in the corresponding pitch-hold sub-group;
c3. dividing each LPH by the corresponding PCT.
4. The real-time identification method for audio categories of claim 3, characterized in that: within the frames of a pitch-continuous group, the difference between the peaks of the spectral-detail portions of the real cepstra of any two adjacent frames is smaller than a preset peak-tracking error; within the frames of a pitch-hold sub-group, this peak difference is smaller than a preset peak-hold error; and the peak-hold error is smaller than the peak-tracking error.
5. The real-time identification method for audio categories of claim 2, characterized in that step g is realized by the following sub-steps:
g1. labelling each frame of the audio signal as tonal or non-tonal based on the portion of the real cepstral analysis result that reflects spectral detail;
g2. counting the tonal and non-tonal frames and calculating their average energies.
6. The real-time identification method for audio categories of claim 5, characterized in that: a frame is labelled non-tonal when the peak of the spectral-detail portion of its real cepstrum is smaller than a preset tonality threshold, and tonal otherwise.
7. The real-time identification method for audio categories of claim 1, characterized in that: before step a, the method further comprises pre-processing the audio signal by applying pre-emphasis, framing and windowing in turn.
8. A real-time identification device for audio categories, characterized by comprising:
a mute-detection module, configured to calculate the short-time energy root mean square of the audio signal in an analysis interval and judge from it whether the audio signal is in a mute state;
a real cepstral analysis module, configured to perform real cepstral analysis on each frame of the audio signal when the mute-detection module determines that the audio signal is not in a mute state;
an audio-category identification module, configured to calculate each pitch hold time LPH / pitch continuation time PCT value of the audio signal from the real cepstral analysis result, and to judge whether the number of occurrences of LPH/PCT = 1 exceeds a preset equality-count threshold; if so, the audio signal is music.
9. The real-time identification device for audio categories of claim 8, characterized in that the device further comprises a pre-processing module, configured to apply pre-emphasis, framing and windowing to the audio signal in turn and pass the processed audio signal to the mute-detection module;
and in that the audio-category identification module is further configured to: when the number of occurrences of LPH/PCT = 1 does not exceed the preset equality-count threshold, determine the mean of LPH/PCT from the individual LPH/PCT values and judge whether it exceeds the preset mean threshold, in which case the audio signal is music; otherwise calculate the average pitch density APD of the audio signal from the real cepstral analysis result and judge whether it exceeds the preset density threshold, in which case the audio signal is music; otherwise count the tonal and non-tonal frames of the audio signal and calculate their average energies from the real cepstral analysis result, determine the tone/non-tone energy ratio TNR from those average energies and judge whether it is smaller than the preset energy-ratio threshold, in which case the audio signal is music; otherwise determine the ratio of non-tonal frames RNT from the numbers of tonal and non-tonal frames and judge whether it is smaller than the preset proportion threshold, in which case the audio signal is music; otherwise the audio signal is speech.
CN2008101422448A 2008-08-05 2008-08-05 Method and device for identifying audio category in real time Expired - Fee Related CN101645265B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008101422448A CN101645265B (en) 2008-08-05 2008-08-05 Method and device for identifying audio category in real time

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008101422448A CN101645265B (en) 2008-08-05 2008-08-05 Method and device for identifying audio category in real time

Publications (2)

Publication Number Publication Date
CN101645265A CN101645265A (en) 2010-02-10
CN101645265B true CN101645265B (en) 2011-07-13

Family

ID=41657118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008101422448A Expired - Fee Related CN101645265B (en) 2008-08-05 2008-08-05 Method and device for identifying audio category in real time

Country Status (1)

Country Link
CN (1) CN101645265B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102568470B (en) * 2012-01-11 2013-12-25 广州酷狗计算机科技有限公司 Acoustic fidelity identification method and system for audio files
US9564128B2 (en) * 2013-12-09 2017-02-07 Qualcomm Incorporated Controlling a speech recognition process of a computing device
CN104036788B (en) * 2014-05-29 2016-10-05 北京音之邦文化科技有限公司 The acoustic fidelity identification method of audio file and device
CN105336344B (en) * 2014-07-10 2019-08-20 华为技术有限公司 Noise detection method and device
CN104091601A (en) * 2014-07-10 2014-10-08 腾讯科技(深圳)有限公司 Method and device for detecting music quality
CN106920543B (en) * 2015-12-25 2019-09-06 展讯通信(上海)有限公司 Audio recognition method and device
CN112750459B (en) * 2020-08-10 2024-02-02 腾讯科技(深圳)有限公司 Audio scene recognition method, device, equipment and computer readable storage medium
CN112652324A (en) * 2020-12-28 2021-04-13 深圳万兴软件有限公司 Speech enhancement optimization method, speech enhancement optimization system and readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1354455A (en) * 2000-11-18 2002-06-19 深圳市中兴通讯股份有限公司 Sound activation detection method for identifying speech and music from noise environment
CN1920947A (en) * 2006-09-15 2007-02-28 清华大学 Voice/music detector for audio frequency coding with low bit ratio


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Costas Panagiotakis and George Tziritas. A Speech/Music Discriminator Based on RMS and Zero-Crossings. IEEE Transactions on Multimedia, vol. 7, no. 1, 2005. *
John Saunders. Real-time discrimination of broadcast speech/music. Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-96), vol. 2, 1996. *
Zhong-Hua Fu, et al. Noise robust features for speech/music discrimination in real-time telecommunication. Proceedings of the 2009 IEEE International Conference on Multimedia and Expo (ICME 2009), 2009. *

Also Published As

Publication number Publication date
CN101645265A (en) 2010-02-10

Similar Documents

Publication Publication Date Title
CN101645265B (en) Method and device for identifying audio category in real time
US8195449B2 (en) Low-complexity, non-intrusive speech quality assessment
CN102089803B (en) Method and discriminator for classifying different segments of a signal
CN109599093B (en) Intelligent quality inspection keyword detection method, device and equipment and readable storage medium
CN109147765B (en) Audio quality comprehensive evaluation method and system
AU712412B2 (en) Speech processing
US7346500B2 (en) Method of translating a voice signal to a series of discrete tones
CN101930735B (en) Speech emotion recognition equipment and speech emotion recognition method
US20030101050A1 (en) Real-time speech and music classifier
Hosseinzadeh et al. Combining vocal source and MFCC features for enhanced speaker recognition performance using GMMs
Dubey et al. Non-intrusive speech quality assessment using several combinations of auditory features
AU2009295251B2 (en) Method of analysing an audio signal
Dubey et al. Non-intrusive objective speech quality assessment using a combination of MFCC, PLP and LSF features
Lam et al. Objective speech quality measure for cellular phone
Zezario et al. A study on incorporating Whisper for robust speech assessment
CN102655000B (en) Method and device for classifying unvoiced sound and voiced sound
El-Maleh Classification-based Techniques for Digital Coding of Speech-plus-noise
CN111354352A (en) Automatic template cleaning method and system for audio retrieval
CN117409761B (en) Method, device, equipment and storage medium for synthesizing voice based on frequency modulation
Jabloun Large vocabulary speech recognition in noisy environments
Aye Speech recognition using Zero-crossing features
Trancoso et al. Harmonic postprocessing of speech synthesised by stochastic coders
Zhen et al. A new feature extraction based the reliability of speech in speaker recognition
Tribolet et al. An improved model for isolated word recognition
CN114255742A (en) Method, device, equipment and storage medium for voice endpoint detection

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110713

Termination date: 20160805