Summary of the invention
The invention provides a content-based automated audio analysis system and a corresponding method.
The content-based automated audio analysis system of the present invention comprises: an audio stream acquisition module for obtaining an audio stream from an external audio source according to a determined decoding rule; an audio stream segmentation module for segmenting the audio stream obtained by the audio stream acquisition module so that each segment after segmentation has a single acoustic characteristic; an audio stream type identification module for analyzing each segment with a single acoustic characteristic output by the audio stream segmentation module and determining its acoustic characteristic; a speech stream analysis module for recognizing those segments whose acoustic characteristic is determined by the audio stream type identification module to be speech and obtaining a pinyin graph; and a keyword retrieval module based on pinyin sequences for searching the pinyin graph obtained by the speech stream analysis module for a keyword of interest and obtaining the position of that keyword in the audio stream.
In the content-based automated audio analysis system shown in Figure 1, the audio stream acquisition module 100 obtains an audio stream from an external audio source according to a determined decoding rule. The external audio source may be an audio file 101, a video file 102, or an audio input device 103. For audio files and video files, the file is decoded according to the appropriate decoding rule so that an audio stream containing only the data portion is obtained; for an audio input device such as a microphone, an interface to the device is provided so that the audio analysis system likewise receives an audio stream containing only the data portion. The audio stream thus obtained is passed to the audio stream segmentation module for the next stage of processing.
The audio stream segmentation module 200 segments the audio stream obtained from the audio stream acquisition module; after segmentation, each segment has a single acoustic characteristic. Segmentation first locates the silence points in the audio stream by means of the energy unit 201 and the energy-variance unit 202: silence points are found by computing the energy and the energy variance of the audio signal. When the energy of the audio signal falls below a given threshold, the system judges that a silence point has been found; when the variance of the energy values falls below a given threshold, the system judges that the audio stream contains a silence point. Once a silence point has been detected, the variance of the energy values continues to be computed, and when this variance rises above the threshold the end of the silent section is reached, at which point the silence point in the audio stream is fully determined. The outputs of units 201 and 202 are combined to give unified breakpoint information. The audio stream with known silence points is then passed to the acoustic-characteristic change detection unit 203 to obtain finer audio change points: by computing the dissimilarity between adjacent audio fragments in the stream, the change points of the audio characteristics can be identified, a change point being declared when the dissimilarity exceeds a given threshold. Every segment output by unit 203 has a single acoustic characteristic, but owing to the way unit 203 processes the audio, the segmentation obtained at this stage may be too fine; the segmentation-point merging unit 204 is therefore applied. Unit 204 examines the change of acoustic characteristics between adjacent segments, and if the acoustic characteristics of adjacent segments are sufficiently similar, it judges that the two segments should be merged. The segmented audio stream is then sent to the audio stream type identification module, which determines the specific type of each segment.
The audio stream type identification module 300 analyzes each segment with a single acoustic characteristic output by the audio stream segmentation module and determines its acoustic characteristic. Two kinds of analysis are used: the time-domain feature classification unit 301 analyzes the audio according to time-domain features, and the frequency-domain feature classification unit 302 analyzes it according to frequency-domain features. Unit 301 judges the type of a single audio stream from one or more time-domain features among the zero-crossing rate, short-time energy, mean square deviation of short-time energy, silent-frame ratio and sub-band energy distribution; unit 302 judges the type from one or both of the frequency-domain features linear prediction cepstral coefficients and Mel cepstral coefficients. The two units process the audio stream in parallel, and their outputs are merged after a confidence-based decision, thereby determining the type of the acoustic characteristic. After passing through units 301 and 302, the acoustic characteristic of each single-characteristic segment is determined.
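The following is a minimal Python sketch of how the time-domain features named for unit 301 (short-time energy, its mean square deviation, zero-crossing rate and silent-frame ratio) could be computed for a segment; the function name and the silence threshold are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def time_domain_features(frames, silence_threshold=1e-4):
    """Per-segment statistics of the time-domain features mentioned for unit 301.
    frames: array of shape (n_frames, frame_length). Threshold is an assumption."""
    energy = np.sum(frames ** 2, axis=1)                                  # short-time energy per frame
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)   # zero-crossing rate per frame
    return {
        "mean_energy": float(np.mean(energy)),
        "energy_std": float(np.std(energy)),            # mean square deviation of short-time energy
        "mean_zcr": float(np.mean(zcr)),
        "silent_frame_ratio": float(np.mean(energy < silence_threshold)),
    }
```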
The speech stream analysis module 400 recognizes those segments whose acoustic characteristic is determined by the audio stream type identification module to be speech and obtains a pinyin graph. In the system of the present invention the speech stream analysis module analyzes only the speech portion; after the audio stream type identification module the non-speech portion is discarded, and only the speech stream is sent to the speech stream analysis module for content analysis and recognition. Speech analysis is based mainly on the feature vectors extracted frame by frame; the speech stream coming from the audio stream type identification module therefore first enters the feature vector sequence extraction unit 401, which produces the feature vector sequence representing that speech stream. In the model matching unit 402 this feature vector sequence is matched against the acoustic models of the pinyin units, and a matching distance is computed for each candidate pinyin sequence. In unit 403 the statistical dependencies between pinyin units are added and the matching distances of the pinyin sequences are recomputed, yielding a pinyin graph formed from many candidate pinyin sequences. The pinyin graph at this stage is still the coarse pinyin graph 404. After the coarse pinyin graph has been subjected to adaptive correction and smoothing, the refined pinyin graph 405 is obtained. The refined pinyin graph 405 is stored, and the analysis of the speech stream is complete.
The keyword retrieval module 500 based on pinyin sequences searches the pinyin graph obtained by the speech stream analysis module. The pinyin sequence unit 502 stores the pinyin sequence to be retrieved; this pinyin sequence and the pinyin graph output by unit 405 are sent together to the confidence computation unit 501, which computes a confidence score. The confidence computation uses the forward-backward algorithm to calculate the posterior probability of the pinyin string, and a previously set confidence threshold is used to decide whether the retrieved pinyin sequence occurs in the audio fragment. If the pinyin sequence is judged to occur in the audio fragment, unit 501 also obtains the position of the pinyin sequence in the audio stream.
Embodiment
Referring to Fig. 1, the audio stream acquisition module is the foundation of the whole audio analysis system; it is the stage in which the data are pre-processed. The decoder applies a different decoding process depending on the source of the audio stream. There are many possible sources, for example asf/wma/wmv/avi/wav/mpeg/mp3/aiff/pcm/raw/vox, and the sampling frequency and sampling resolution of the audio differ from source to source: the sampling frequency of telephone audio is typically 8000 Hz while that of a CD is typically 44100 Hz, and the sampling resolution may vary from 8 bits to 24 bits depending on the source. After the audio stream acquisition module, audio streams of all these different origins are converted into a single unified form with the same sampling frequency and sampling resolution, containing only the information of the digitized audio itself.
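As an illustration of the unified-format conversion described above, the sketch below decodes a file, down-mixes it to mono and resamples it to a single sampling frequency. The 16 kHz target, the soundfile/scipy libraries and the function name are assumptions, and the input formats actually supported depend on the decoding backend used.

```python
import numpy as np
import soundfile as sf
from scipy.signal import resample_poly

def acquire_audio(path, target_sr=16000):
    """Decode an audio file into a unified form: mono samples at one fixed rate."""
    data, sr = sf.read(path, dtype="float32")      # decode (formats depend on the backend)
    if data.ndim > 1:
        data = data.mean(axis=1)                   # down-mix multi-channel audio to mono
    if sr != target_sr:
        data = resample_poly(data, target_sr, sr)  # unify the sampling frequency
    return data, target_sr
```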
For an audio stream of unknown properties, its acoustic characteristics must first be analyzed, and this requires audio segmentation and audio type discrimination. Audio segmentation and audio type discrimination are the basis of audio retrieval and are of key importance in the front-end signal processing for speech recognition. In a content-based speech analysis system, the input to the speech analysis module should be audio fragments with a single acoustic characteristic. Speech streams collected in real environments, however, rarely satisfy this condition; instead, various characteristics are mixed together. Broadcast news is a typical example: its acoustic environment is complex and variable and the sentence boundaries are unknown, so feeding it directly into the speech analysis module without front-end processing greatly degrades the performance of that module. The speech stream must therefore be pre-processed by splitting it into segments.
As an important prosodic feature, pauses reflect the structural information of the material. Pauses normally occur within and between sentences and between speech and non-speech; they appear as silence, and the audio signal during a pause consists only of background noise. By detecting silence, the continuous audio stream can be split, achieving a preliminary segmentation.
Judging whether a silence point exists from the level of the energy is the simplest approach, and in the audio segmentation module we use the energy as one basis for finding silence points. However, because the acoustic environment in practice is not constant, the energy of silence may itself be higher or lower, so relying on the energy level alone is not sufficient for segmentation; the variance of the energy is therefore used as a further basis for segmentation.
The energy variance is defined as

$$\mathrm{Var} = \frac{1}{N}\sum_{i=1}^{N}\left(e_i - \bar{e}\right)^2, \qquad \bar{e} = \frac{1}{N}\sum_{i=1}^{N} e_i$$

where $e_i$ is the energy of the $i$-th frame and $N$ is the number of energy frames, which is related to the pause length: if the minimum pause is set to 300 ms and the energy frame rate is 100 frames per second, then $N$ is 30.

The energy $e$ is computed with the following formula:

$$e = \sum_{t=1}^{T} x(t)^2$$

where $x(t)$ is the $t$-th sample and $T$ is the total number of samples contained in each frame.

A threshold $T_{var}$ is set, and the energy variance is computed over a window of length $N$. The threshold $T_{var}$ is computed as

$$T_{var} = \alpha \log_{10} \sigma_{global}$$

where $\sigma_{global}$ is the variance over the whole audio stream and $\alpha$ is a scale factor with a value between 0.7 and 1.0.
If the variance obtained is greater than the set threshold, the section of audio contains no silence; the window is moved by a certain step and the energy variance is recomputed. If the energy variance obtained is less than the threshold, the signal within the window contains a silence point. To find the end of the silence, the window length is increased continuously until the computed energy variance exceeds the threshold; at that point one silence point has been found. The window length is then reset to N and the next search begins. In this way all silence points in the audio stream can be found. The silence points split the continuous audio stream into audio sections, and further processing can then be carried out with the audio section as the unit.
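The silence search described above can be sketched as follows. The code assumes that per-frame energies are already available; the window length N and the scale factor α follow the example values given above, while other details, such as the one-frame step size, are illustrative assumptions.

```python
import numpy as np

def find_silence_regions(frame_energy, N=30, alpha=0.85):
    """Energy-variance silence search over per-frame energies e_i.
    N: window length in frames (300 ms at 100 frames/s); alpha in [0.7, 1.0]."""
    t_var = alpha * np.log10(np.var(frame_energy))      # threshold T_var = alpha * log10(sigma_global)
    regions, start = [], 0
    while start + N <= len(frame_energy):
        end = start + N
        if np.var(frame_energy[start:end]) < t_var:
            # grow the window until the variance rises above the threshold again
            while end < len(frame_energy) and np.var(frame_energy[start:end]) < t_var:
                end += 1
            regions.append((start, end))                 # one silent region found
            start = end                                  # reset the window and continue
        else:
            start += 1                                   # slide the window by one frame
    return regions
```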
Detecting the change points of the acoustic characteristics is in essence computing the distance between two models. The audio signal is first modelled with acoustic feature vectors, and the distance between the acoustic feature vectors of two adjacent windows is then computed. Computing the distance between acoustic feature vectors directly is difficult, so an indirect approach is taken: the acoustic feature vectors obtained in this way are generally assumed to follow a Gaussian distribution, so the acoustic feature vectors within each window are first fitted with a Gaussian distribution and the distance between the two Gaussian distributions is then computed. Computing the distance between acoustic feature vectors is thus converted into computing the distance between statistical models, and many methods exist for measuring the distance between two statistical models.
Suppose there are two Gaussian distributions $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$. Several common methods for computing the distance between Gaussian distributions are as follows:

Kullback-Leibler-2 distance:

$$D_{KL2} = \frac{1}{2}\,\mathrm{tr}\!\left[(\Sigma_1-\Sigma_2)(\Sigma_2^{-1}-\Sigma_1^{-1})\right] + \frac{1}{2}(\mu_1-\mu_2)^{T}(\Sigma_1^{-1}+\Sigma_2^{-1})(\mu_1-\mu_2)$$

Mahalanobis distance:

$$D_{Mah} = (\mu_1-\mu_2)^{T}\left[\frac{\Sigma_1+\Sigma_2}{2}\right]^{-1}(\mu_1-\mu_2)$$

Bhattacharyya distance:

$$D_{Bha} = \frac{1}{8}(\mu_1-\mu_2)^{T}\left[\frac{\Sigma_1+\Sigma_2}{2}\right]^{-1}(\mu_1-\mu_2) + \frac{1}{2}\ln\frac{\left|\frac{\Sigma_1+\Sigma_2}{2}\right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}}$$
The computation of the model distance is not limited to the three methods above; here we use the Kullback-Leibler-2 distance to compute the distance between the two models.
Clearly, if the distance between the two distributions is large, the point is likely to be an acoustic-characteristic change point, whereas if the distance is very small the two parts lie in a similar acoustic environment and their acoustic characteristics should be the same.
On a continuous audio signal stream, a segment of signal is taken in each of two adjacent windows, the signal feature vectors in each window are fitted to a Gaussian distribution, and the distance between the two Gaussian distributions is computed. The two windows are then moved by a certain step and the distance is computed again, yielding a distance measure curve. Finally, the possible acoustic change points on this curve are picked out according to a threshold-setting rule.
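A sketch of this change-point measure is given below, assuming the Kullback-Leibler-2 distance selected above and full-covariance Gaussian fits to the two windows; the function name and data layout are illustrative.

```python
import numpy as np

def kl2_distance(win1, win2):
    """Symmetric Kullback-Leibler (KL-2) distance between Gaussians fitted to two
    adjacent windows of acoustic feature vectors (rows = frames, columns = dims)."""
    mu1, mu2 = win1.mean(axis=0), win2.mean(axis=0)
    s1, s2 = np.cov(win1, rowvar=False), np.cov(win2, rowvar=False)
    s1_inv, s2_inv = np.linalg.inv(s1), np.linalg.inv(s2)
    diff = (mu1 - mu2)[:, None]
    term_cov = np.trace((s1 - s2) @ (s2_inv - s1_inv))           # covariance term
    term_mean = float(diff.T @ (s1_inv + s2_inv) @ diff)         # mean term
    return 0.5 * (term_cov + term_mean)

# Sliding the two adjacent windows along the feature stream and thresholding the
# resulting distance curve yields the candidate acoustic change points.
```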
This module is rather sensitive to changes in the environment: its recall is very high, but it also detects many redundant points, so the false alarm rate is too high. It is precisely because of this property of the acoustic-characteristic change detection unit that the segmentation-point merging unit is necessary.
The segmentation-point merging unit decides, for a given candidate split point, whether the two adjacent audio sections can be merged.
Suppose $x_1, x_2, \ldots, x_N \sim N(\mu, \Sigma)$, that the audio fragment contains only one split point, and that the jump occurs at time $i$, dividing the whole audio section into two parts. For the two parts after splitting we have

$$x_1, x_2, \ldots, x_i \sim N(\mu_1, \Sigma_1) \quad\text{and}\quad x_{i+1}, x_{i+2}, \ldots, x_N \sim N(\mu_2, \Sigma_2)$$

where $\Sigma$, $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of all the audio data, of the first $i$ audio data and of the last $N-i$ audio data, respectively.
Deciding whether to merge can therefore be regarded as a model selection problem. In model one, all the audio data are described by a single Gaussian distribution; in model two, the data are divided into two parts at the split point and each part is described by its own Gaussian distribution. The difference between the BIC values of the two models can be expressed by the following formula:

$$\Delta BIC = \frac{N}{2}\ln|\Sigma| - \frac{N_1}{2}\ln|\Sigma_1| - \frac{N_2}{2}\ln|\Sigma_2| - \frac{\lambda}{2}\left(d + \frac{d(d+1)}{2}\right)\ln N$$

where $N$, $N_1$ and $N_2$ are the numbers of feature vectors describing the corresponding Gaussian distributions, $d$ is the dimension of the vector space, and $\lambda$ is a penalty factor whose value is generally 1.
If the BIC value is less than 0, the two audio sections are considered to belong to the same distribution and should be merged; otherwise they are not merged.
The penalty factor λ in the formula above may also take other values, so different λ values can be set for different situations in order to obtain better results.
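A sketch of the ΔBIC merge test described above, assuming full-covariance Gaussian models for the two adjacent audio sections; the function and variable names are illustrative.

```python
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """Delta-BIC for two adjacent sections (rows = feature vectors).
    A negative value indicates the two sections are better described by a
    single Gaussian and may be merged."""
    x = np.vstack([x1, x2])
    n1, n2, n = len(x1), len(x2), len(x1) + len(x2)
    d = x.shape[1]
    logdet = lambda data: np.linalg.slogdet(np.cov(data, rowvar=False))[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet(x)
            - 0.5 * n1 * logdet(x1)
            - 0.5 * n2 * logdet(x2)
            - penalty)

# merge the two sections if delta_bic(x1, x2) < 0, otherwise keep the split point
```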
The segmentation points remaining after merging reveal the characteristic information of the audio stream; the portions whose audio characteristic is speech are used for analysis by the speech stream analysis module.
After the speech stream is sent into the speech analysis module, a feature vector sequence is first extracted from it. Many kinds of speech features exist, such as LPC coefficients, Mel-frequency cepstral coefficients (MFCC) and perceptual linear predictive (PLP) parameters. We are not concerned here with which kind of parameter is adopted; the present invention is applicable to any feature parameter. Here the MFCC coefficients are taken as an example.
The MFCC coefficients adopted here are 14-dimensional; the 14 MFCC coefficients together with their first- and second-order differences, plus the logarithmic energy and its first- and second-order differences, form a 45-dimensional feature vector. The feature vectors extracted frame by frame are combined to form the feature vector sequence.
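A sketch of this 45-dimensional feature extraction, assuming the librosa library for the MFCC and delta computation; the frame length and frame shift values are illustrative assumptions.

```python
import numpy as np
import librosa

def extract_feature_sequence(y, sr, frame_len=0.025, frame_shift=0.01):
    """45-dimensional vectors per frame: 14 MFCCs + log energy,
    each with first- and second-order differences."""
    hop, n_fft = int(frame_shift * sr), int(frame_len * sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=14, n_fft=n_fft, hop_length=hop)
    log_e = np.log(librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop) ** 2 + 1e-10)
    base = np.vstack([mfcc, log_e])                               # 15 static dimensions
    feats = np.vstack([base,
                       librosa.feature.delta(base),               # first-order differences
                       librosa.feature.delta(base, order=2)])     # second-order differences
    return feats.T                                                # shape (frames, 45)
```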
The acoustic models that the model matching unit may adopt are likewise varied: monophone models, biphone models, triphone models or models spanning even more phones. For convenience of description, monophone models are adopted here. At the same time a phone-level language model is added in the model matching unit, i.e. the statistical dependencies between pinyin units are used during model matching, in order to obtain a better pinyin graph.
The model matching unit produces the coarse pinyin graph. Once the coarse pinyin graph has been obtained, adaptive methods and smoothing methods can be applied to revise it so that the pinyin graph better reflects the information carried by the audio stream. Common adaptive methods include maximum a posteriori (MAP) adaptation and maximum likelihood linear regression (MLLR); the MLLR adaptation method is adopted here. In this way the refined pinyin graph is obtained, and this pinyin graph can be used for keyword retrieval.
Keyword retrieval systems generally fall into the following two classes:
The first is the single-stage system, in which the search is carried out on a network formed by connecting the keyword models and the non-keyword models (also called garbage models) in parallel. Whenever the keyword changes, the system must analyze the speech stream again, so retrieval is relatively slow when the same audio stream is searched repeatedly; this approach is therefore unsuitable when the user needs to modify the query repeatedly.
The second is the two-stage system. The first stage is a pre-processing stage in which the speech analysis module converts the speech stream data into a pinyin graph, a word graph or text; each section of the audio stream needs to be processed only once, and answering a later user query only requires searching for matches in the pinyin graph, word graph or text.
The audio retrieval task in the present system is to return results for different queries over a fixed database, so the two-stage approach is adopted as the system scheme.
Having obtained the refined pinyin graph above, the pinyin sequence to be retrieved now only has to be processed together with the refined pinyin graph to obtain the retrieval result.
During retrieval, the forward-backward algorithm is used to compute the posterior probability of the pinyin sequence to be retrieved, thereby obtaining a confidence score, and a previously set confidence threshold is then used to decide whether the retrieved pinyin sequence occurs in the audio fragment. If the pinyin sequence is present, its position in the audio stream is obtained at the same time.
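The following is a simplified sketch of the forward-backward computation behind the confidence score. The arc-list representation of the pinyin graph, the topological node numbering and the function name are assumptions; for a multi-syllable keyword, the posteriors of consecutive arcs matching the pinyin sequence would be combined and compared against the preset confidence threshold.

```python
import numpy as np
from collections import defaultdict

def arc_log_posteriors(arcs, n_nodes):
    """Forward-backward over a pinyin lattice. Each arc is a tuple
    (src_node, dst_node, pinyin, log_score); node 0 is the start and
    node n_nodes-1 the end. Returns one log posterior per arc."""
    alpha = np.full(n_nodes, -np.inf); alpha[0] = 0.0
    beta = np.full(n_nodes, -np.inf); beta[-1] = 0.0
    in_arcs, out_arcs = defaultdict(list), defaultdict(list)
    for i, (s, d, _, _) in enumerate(arcs):
        in_arcs[d].append(i); out_arcs[s].append(i)
    for node in range(1, n_nodes):                          # forward pass
        inc = [alpha[arcs[i][0]] + arcs[i][3] for i in in_arcs[node]]
        if inc: alpha[node] = np.logaddexp.reduce(inc)
    for node in range(n_nodes - 2, -1, -1):                  # backward pass
        out = [arcs[i][3] + beta[arcs[i][1]] for i in out_arcs[node]]
        if out: beta[node] = np.logaddexp.reduce(out)
    total = alpha[-1]                                        # total log-likelihood of the lattice
    return [alpha[s] + w + beta[d] - total for s, d, _, w in arcs]
```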