CN107527617A - Monitoring method, apparatus and system based on voice recognition - Google Patents
- Publication number
- CN107527617A (application number CN201710944193.XA)
- Authority
- CN
- China
- Prior art keywords
- sound
- feature
- voice
- signal
- monitoring
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/18—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
Abstract
The invention provides a monitoring method, apparatus and system based on voice recognition. The method comprises the following steps: S1: collect several specific sounds in advance and perform sound model training to obtain trained sound models; S2: collect on-site sound and perform feature extraction corresponding to the several specific sounds on the collected sound; S3: match and classify the extracted features against the sound models to obtain a classification result for the on-site sound; S4: judge from the classification result whether an alarm is needed. The invention compensates for the shortcomings of traditional video surveillance: sound in cooperation with video enables better real-time monitoring of complex environments, improves the efficiency of crime prevention and crackdown, and ensures that the monitoring system responds to dangerous events proactively and promptly.
Description
Technical field
The present invention relates to signal processing, speech recognition and pattern recognition technology, and more particularly to a monitoring method, apparatus and system based on voice recognition.
Background technology
Traditional video surveillance is widely used in public places and is relatively effective: it has prevented some illegal and criminal activity. However, video monitoring has two shortcomings. First, through the negligence of monitoring personnel, dangerous events captured on the monitored picture can be missed. Second, because the video picture is two-dimensional, the picture is easily blocked by obstructions. Although the surveillance video of the scene can be collected after a case occurs and helps investigation and evidence collection, missing the optimal rescue window can lead to the deterioration of the case. Traditional video monitoring systems therefore struggle to discover violent incidents or terrorist attacks in a timely and effective manner.

Secondly, sound monitoring cannot classify sounds by amplitude or other features alone; classification must combine the actual conditions of the monitored scene with the different features of the sounds, so that sound monitoring can truly be applied in daily life.

Designing a novel intelligent monitoring system that breaks through the obstacles of traditional monitoring is therefore urgent. Adding three types of sound monitoring as an aid on the basis of video monitoring can greatly improve monitoring efficiency and reduce the occurrence of tragedies, which is of great significance to real life.
The content of the invention
The object of the present invention is to provide a monitoring method, apparatus and system based on voice recognition, to solve the problems that existing video monitoring is single in function and relatively low in monitoring efficiency.
To achieve the above object, the invention provides a monitoring method based on voice recognition, comprising the following steps:
S1: collect several specific sounds in advance and perform sound model training to obtain trained sound models;
S2: collect on-site sound and perform feature extraction corresponding to the several specific sounds on the collected sound;
S3: match and classify the extracted features against the sound models to obtain a classification result for the on-site sound;
S4: judge from the classification result whether an alarm is needed.
Preferably, the specific sounds include non-speech abnormal sounds, speech carrying emotion, and speech containing sensitive words. Correspondingly, the features extracted in step S2 are, respectively: non-speech sound features for abnormal sound monitoring; crowd speech emotion features for crowd mood monitoring; and the features needed for speech-to-text conversion for monitoring crowd speech containing sensitive vocabulary.
Preferably, when extracting non-speech sound features, an abnormal sound feature extraction method based on D-ESMD is used, which specifically comprises the following steps:
1. determine the number K of t-distributed random noises;
2. collect the on-site sound signal s, and add a t-distributed random noise to the sound signal s to obtain a noisy signal S_i, where i is the index of the noisy signal;
3. decompose the noisy signal S_i using ESMD with the symmetric-midpoint interpolation method to obtain a modal component;
4. calculate the permutation entropy H of the modal component, and determine a threshold through field testing;
5. if the permutation entropy H is greater than the threshold, the modal component is a useful signal modal component and the method proceeds to step 6; otherwise the modal component is noise;
6. take the residual as the input signal and repeat steps 3-5 until the modal component of order n obtained by the decomposition is noise, where n is a positive integer;
7. if i < K, set i = i + 1 and repeat steps 2-6 until i = K; obtain all modal components and take their ensemble mean as the final modal components of the decomposed signal;
8. calculate the energy ratio of each order of modal component relative to the original sound signal s, combine the ratios into a feature vector and normalize it, giving the feature vector of the original signal.
Preferably, when extracting crowd speech emotion features, a feature extraction method based on speech emotion recognition is used; specifically, the feature vector is expressed using the feature set used in the international speech emotion challenge.
Preferably, when extracting the features needed for speech-to-text conversion, a speech feature extraction method based on Gammatone filters is used, specifically comprising the following steps:
1. denote the collected on-site sound signal as x(n) and pre-emphasize it; with pre-emphasis coefficient α, the pre-emphasized signal is y(n) = x(n) - α·x(n-1), where n is the sample index of the on-site sound signal;
2. frame the pre-emphasized signal y(n), with a frame length of N sampling points, where N is a positive integer power of 2;
3. apply a Hamming window to the pre-emphasized signal y(n); the windowed signal is S(n) = y(n)·w(n), where w(n) is the Hamming window;
4. perform a fast Fourier transform on the windowed signal S(n) to obtain the frequency-domain signal X(k) = fft(S(n), N);
5. take the squared modulus of X(k) to obtain the energy spectrum, then filter it with the Gammatone filter bank, obtaining the signal H(k) = fft(h(n), N);
6. log-compress the output of each Gammatone filter;
7. apply a discrete cosine transform to the log-compressed signal to obtain the GFLCC (Gammatone Frequency Log Cepstrum Coefficient) features;
8. apply raised half-sine cepstral liftering to the features obtained by the discrete cosine transform to obtain the final features.
Preferably, the non-speech abnormal sounds include one or more of gunshots, explosions, impacts and screams in the monitored scene; the speech carrying emotion includes speech carrying one of the emotions happy, normal, calm, lively, angry and furious; the sensitive-word speech includes speech in which dangerous vocabulary such as 'help', 'murder' or 'attack' appears.
Preferably, when the classification result is a non-speech abnormal sound, step S4 judges that the corresponding on-site event is one or more of a shooting, a collision, an explosion or some other dangerous event, and issues an alarm;
when the classification result is speech carrying emotion, step S4 issues an alarm when angry or furious features appear in the corresponding crowd emotion;
when the classification result is sensitive-word speech, step S4 issues an alarm according to the recognized sensitive word.
Preferably, step S1 specifically includes: learning the feature values extracted from the several specific sounds using the fuzzy least squares support vector machine algorithm, and establishing the sound models and their classes. Step S3 then further includes matching and classifying the features of the sound signal collected on site against the corresponding sound models. In step S4, the output judged from the classification result is either a result requiring an alarm or a result not requiring an alarm.
The present invention also provides a monitoring apparatus based on voice recognition, comprising:
a sound pickup, for collecting sound signals;
a model training module, for collecting several specific sounds in advance and performing sound model training to obtain trained sound models;
a feature extraction module, for performing feature extraction corresponding to the several specific sounds on the sound signal collected on site;
a matching and classification module, for matching and classifying the features extracted by the feature extraction module against the sound models to obtain a classification result for the on-site sound;
an alarm module, for judging from the classification result whether an alarm is needed.
The present invention also provides a monitoring system based on voice recognition, comprising one or more monitoring apparatuses based on voice recognition as described above.
The invention has the following beneficial effects: it effectively compensates for the shortcomings of traditional video surveillance, since sound in cooperation with video enables better real-time monitoring of complex environments. The technical scheme, in cooperation with video monitoring, improves to a certain extent the efficiency of crime prevention and crackdown, and ensures that the monitoring system responds to dangerous events proactively and promptly.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the monitoring method based on voice recognition of a preferred embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the monitoring apparatus based on voice recognition of a preferred embodiment of the present invention;
Fig. 3 is an architecture diagram of the sound feature extraction module of a preferred embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the model building module of a preferred embodiment of the present invention;
Fig. 5 is an architecture diagram of the model building module of a preferred embodiment of the present invention;
Fig. 6 is an architecture diagram of the matching and classification module of a preferred embodiment of the present invention.
Embodiment
The technical schemes in the embodiments of the present invention are described and discussed clearly and completely below with reference to the accompanying drawings. Obviously, what is described here is only a part of the examples of the present invention, not all of them; based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present invention.
To facilitate understanding of the embodiments of the present invention, specific embodiments are further explained below with reference to the accompanying drawings, and none of the embodiments limits the present invention.
As shown in Fig. 1, the monitoring method based on voice recognition provided by this embodiment comprises the following steps:
S1: collect several specific sounds in advance and perform sound model training to obtain trained sound models;
S2: collect on-site sound and perform feature extraction corresponding to the several specific sounds on the collected sound;
S3: match and classify the extracted features against the sound models to obtain a classification result for the on-site sound;
S4: judge from the classification result whether an alarm is needed.
The specific sounds include non-speech abnormal sounds, speech carrying emotion, and speech containing sensitive words. Correspondingly, the features extracted in step S2 are, respectively: non-speech sound features for abnormal sound monitoring; crowd speech emotion features for crowd mood monitoring; and the features needed for speech-to-text conversion for monitoring crowd speech containing sensitive vocabulary.
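As a rough illustration only, the S1-S4 flow can be sketched as a monitoring loop. Everything below is hypothetical scaffolding: the feature extractor and classifier are stubs standing in for the D-ESMD, emotion and Gammatone front ends and the fuzzy least squares SVM detailed later in this description.

```python
# Hypothetical sketch of the S1-S4 monitoring loop described above.
# Names, features and model values are illustrative placeholders.

def extract_features(sound, kind):
    # kind is one of "abnormal", "emotion", "sensitive"; a real system
    # would dispatch to the three extractors of Fig. 3.
    return [float(len(sound)), float(sum(sound))]

def classify(features, models):
    # S3: match features against trained models; here a nearest-mean stub.
    return min(models, key=lambda name: abs(models[name][0] - features[0]))

def monitor_step(sound, models):
    # S2-S4 for one captured sound clip: three independent channels.
    results = {}
    for kind in ("abnormal", "emotion", "sensitive"):
        feats = extract_features(sound, kind)
        results[kind] = classify(feats, models)
    # S4: alarm if any of the three independent results is dangerous.
    alarm = any(label.startswith("danger") for label in results.values())
    return results, alarm

models = {"danger_gunshot": [8.0], "safe_footsteps": [3.0]}  # toy "S1" output
results, alarm = monitor_step([1, 1, 1, 1, 1, 1, 1, 1], models)
```

The point of the stub is the shape of the loop: the three channels classify independently, and the alarm decision of S4 only aggregates their results.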
The method of the invention is further described below with reference to Figs. 2-6:
With reference to Figs. 2-5, the sound monitoring method of this embodiment mainly includes the following flow: sound collection, sound feature extraction, model building, matching and classification of models against on-site sound features, and alarm. Monitoring covers three types of sound, namely: abnormal sound monitoring, crowd speech emotion monitoring, and monitoring by converting crowd speech to text.
Abnormal sound monitoring monitors sounds that should not occur in the monitored scene, such as gunshots, explosions, impacts and screams; the corresponding extracted features are non-speech sound features.

Crowd speech emotion monitoring monitors the emotion in crowd speech in the monitored scene. Emotions include those possessed by humans such as happy, normal, calm, lively, angry and furious; an alarm is issued for dangerous emotions such as anger and fury. The corresponding extracted features are crowd speech emotion features.

Crowd speech-to-text monitoring converts crowd speech in the monitored scene into text and then monitors the text. If dangerous vocabulary such as 'help', 'murder' or 'attack' appears, the corresponding extracted features are the sensitive-word features needed for speech-to-text conversion, and the monitoring system raises an alarm.
Therefore, the non-speech abnormal sounds include one or more of gunshots, explosions, impacts and screams in the monitored scene; the speech carrying emotion includes speech with one of the emotions happy, normal, calm, lively, angry and furious; the sensitive-word speech includes speech in which dangerous vocabulary such as 'help', 'murder' or 'attack' appears. In step S1, when collecting sounds as training sounds, the monitored sounds must be produced artificially. For example, when collecting abnormal sounds as training sounds, gunshots, explosions and the like must be produced artificially, and the sounds occurring under otherwise safe conditions must also be produced, recorded, feature-extracted and used for model training. For speech emotion monitoring, speech carrying each kind of emotion must be produced at the site, recorded, feature-extracted and used for model training. For speech-to-text monitoring, speech carrying alarm vocabulary (such as 'help' and 'attack') must be produced at the site, and speech without alarm vocabulary, entered according to the characteristics of the corresponding site, must also be recorded, feature-extracted and used for model training.
Fig. 3 is the architecture diagram of the sound feature extraction module of the present invention.
Based on the three monitoring types, when the present invention builds models, the extraction of sound features is divided into three classes:
for abnormal sound monitoring: abnormal sound feature extraction based on D-ESMD;
for crowd mood monitoring: feature extraction based on speech emotion recognition;
for speech-to-text monitoring: speech feature extraction based on Gammatone filters.
Fig. 4 is the schematic structural diagram of the model building module of the present invention:
First, according to the conditions of the monitored scene, sounds are manually selected as training sounds;
the sound pickup collects the training sounds and transmits the sound signals to the feature extraction module;
the feature extraction module performs feature extraction on the training sounds and transfers the feature values to the training module;
the training module trains on the feature values using the fuzzy least squares support vector machine algorithm and outputs the three types of trained models for the matching and classification module to call.
Specifically, in step S1, when extracting non-speech sound features, the abnormal sound feature extraction method based on D-ESMD is used, which specifically comprises the following steps:
1. determine the number K of t-distributed random noises;
2. collect the on-site sound signal s, and add a t-distributed random noise to the sound signal s to obtain a noisy signal S_i, where i is the index of the noisy signal, taking any of the values 1, 2, 3, ...;
3. decompose the noisy signal S_i using ESMD with the symmetric-midpoint interpolation method to obtain a modal component;
4. calculate the permutation entropy H of the modal component, and determine a threshold through field testing;
5. if the permutation entropy H is greater than the threshold, the modal component is a useful signal modal component and the method proceeds to step 6; otherwise the modal component is noise;
6. take the residual as the input signal and repeat steps 3-5 until the modal component of order n obtained by the decomposition is noise, where n is a positive integer;
7. if i < K, set i = i + 1 and repeat steps 2-6 until i = K; obtain all modal components and take their ensemble mean as the final modal components of the decomposed signal;
8. calculate the energy ratio of each order of modal component relative to the original sound signal s, combine the ratios into a feature vector and normalize it, giving the feature vector of the original signal.
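The eight steps above can be sketched as follows. ESMD itself is not reimplemented here: `decompose` is a stand-in (a crude moving-average split) used only so the ensemble structure is visible — K t-distributed noise additions, a permutation entropy test per mode, the ensemble mean, and the normalized energy-ratio feature vector. Thresholds and noise scale are illustrative.

```python
import numpy as np

def decompose(x):
    # Stand-in for the ESMD decomposition (illustration only): split the
    # signal into a residual "mode" and a smooth trend via moving average.
    trend = np.convolve(x, np.ones(5) / 5.0, mode="same")
    return np.stack([x - trend, trend])  # modes 1 and 2

def permutation_entropy(x, m=3):
    # Step 4: ordinal-pattern (permutation) entropy, natural logarithm.
    counts = {}
    for i in range(len(x) - m + 1):
        key = tuple(np.argsort(x[i:i + m]))
        counts[key] = counts.get(key, 0) + 1
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log(p)).sum())

def desmd_features(s, K=10, h_threshold=0.1, seed=0):
    # Steps 1-7: add K t-distributed noises, decompose each noisy copy,
    # and average the modal components over the ensemble.
    rng = np.random.default_rng(seed)
    modes = [decompose(s + 0.01 * rng.standard_t(df=5, size=len(s)))
             for _ in range(K)]
    mean_modes = np.mean(modes, axis=0)
    # Step 5: keep modes whose permutation entropy clears the threshold.
    useful = [m for m in mean_modes if permutation_entropy(m) > h_threshold]
    if not useful:
        useful = list(mean_modes)
    # Step 8: per-mode energy ratios, combined and normalized.
    ratios = np.array([np.sum(m ** 2) / np.sum(s ** 2) for m in useful])
    return ratios / np.linalg.norm(ratios)

t = np.linspace(0.0, 1.0, 200)
features = desmd_features(np.sin(2 * np.pi * 5 * t))
```

Swapping `decompose` for a real ESMD implementation leaves the outer ensemble loop unchanged, which is the point of the sketch.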
The principle of adding t-distributed noises to the original sound signal for noise reduction is as follows:

Let the collected sound signal be X(t), the real sound signal x(t), and the noise N(t), so that X(t) = x(t) + N(t);

decomposing X(t) yields modal components M(t) and a decomposition remainder r(t);

each modal component M(t) contains a real signal component m(t) and a noise component c(t), i.e. M(t) = m(t) + c(t);

adding the k-th of the random noises to the original signal gives X_k(t) = X(t) + N_k(t), whose decomposition yields modal components M_k(t) = m(t) + c_k(t);

accumulating the above over k and averaging gives (1/K)·Σ_{k=1..K} M_k(t) = m(t) + (1/K)·Σ_{k=1..K} c_k(t);

as K → ∞, the averaged noise term (1/K)·Σ_{k=1..K} c_k(t) → 0.

These expressions show that adding t-distributed noises and decomposing with ESMD reduces the influence of noise.
The ESMD with symmetric-midpoint interpolation includes:
finding all maximum points x_max and minimum points x_min of the noisy signal S_i;
connecting every pair of adjacent maximum and minimum points and taking their midpoints x_mean = (x_max + x_min)/2;
taking the symmetric midpoints x_m of adjacent midpoints and interpolating through x_m.
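Numerically, the midpoint construction might look as follows. The choice of linear interpolation through the midpoints is an assumption, since the patent does not fix an interpolation scheme here; the code only illustrates the extremum-pair midpoints and the mean curve they define.

```python
import numpy as np

def extrema_midpoints(x):
    # Find interior maxima/minima of the sampled signal and the midpoints
    # of each adjacent extremum pair, as in the sifting step above.
    idx = [i for i in range(1, len(x) - 1)
           if (x[i] - x[i - 1]) * (x[i + 1] - x[i]) < 0]  # slope sign change
    mids = [((idx[j] + idx[j + 1]) / 2.0, (x[idx[j]] + x[idx[j + 1]]) / 2.0)
            for j in range(len(idx) - 1)]
    return idx, mids

def midpoint_curve(x):
    # Linearly interpolate through the midpoints to get the "mean curve"
    # that the decomposition subtracts from the signal.
    _, mids = extrema_midpoints(x)
    t = np.arange(len(x), dtype=float)
    mt = np.array([m[0] for m in mids])
    mv = np.array([m[1] for m in mids])
    return np.interp(t, mt, mv)

x = np.sin(2 * np.pi * np.linspace(0.0, 3.0, 300))
curve = midpoint_curve(x)  # near zero for a symmetric sine
```

For a pure sine the max/min midpoints sit near zero, so the mean curve is close to zero and the first extracted mode is close to the signal itself.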
The calculation of the permutation entropy includes:
performing delay reconstruction on the modal component M to obtain the sequences Y(i) = [M(i), M(i+τ), ..., M(i+(m-1)·τ)], where τ is the time delay and m is the reconstruction dimension;
sorting the m elements of every reconstructed component Y(i) in ascending order, collecting the ordinal patterns of all reconstructed components, and computing the probabilities p_1, p_2, ..., p_i with which each ordinal pattern occurs; the permutation entropy is then H = -Σ_i p_i·ln p_i.
The mode energy is calculated as E_j = Σ_t M_j(t)², and the energy ratio of step 8 is E_j divided by the energy Σ_t s(t)² of the original signal.
In step S1, when extracting crowd speech emotion features, the feature extraction method based on speech emotion recognition is used. Specifically, the feature vector is expressed using the feature set used in the international speech emotion challenge:

With reference to Table 1 below, the feature set used in this embodiment includes 16 low-level descriptors (LLDs), and the statistics of 12 classes of functionals are applied to the 16 low-level descriptors to express the feature vector of crowd speech emotion.

Table 1: the feature set used in the international speech emotion challenge
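By way of illustration, a functional-based utterance vector of this kind can be built as below. The two LLD contours and five statistics here are stand-ins, not the 16 LLDs and 12 functionals of the challenge feature set in Table 1; the point is that applying fixed functionals to frame-level contours yields one fixed-length vector per utterance, regardless of duration.

```python
import numpy as np

def functionals(contour):
    # Apply a set of statistics ("functionals") to one frame-level
    # low-level descriptor contour. The challenge set uses 12 functionals
    # per LLD; five representative ones are shown here.
    c = np.asarray(contour, dtype=float)
    return [c.mean(), c.std(), c.min(), c.max(), float(np.ptp(c))]

def emotion_feature_vector(llds):
    # llds: dict mapping LLD name -> per-frame contour. Concatenating the
    # functionals over all LLDs gives one fixed-length utterance vector.
    vec = []
    for name in sorted(llds):  # fixed order so the vector layout is stable
        vec.extend(functionals(llds[name]))
    return np.array(vec)

frames = {"energy": [0.1, 0.4, 0.9, 0.3], "zcr": [0.2, 0.25, 0.22, 0.31]}
v = emotion_feature_vector(frames)  # length = 2 LLDs x 5 functionals = 10
```

With the full challenge set (16 LLDs, 12 functionals) the same construction yields a 384-dimensional vector per utterance.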
In step S1, when extracting the features needed for speech-to-text conversion, the speech feature extraction method based on Gammatone filters is used, specifically comprising the following steps:
1. denote the collected on-site sound signal as x(n) and pre-emphasize it; with pre-emphasis coefficient α, the pre-emphasized signal is y(n) = x(n) - α·x(n-1), where n is the sample index of the on-site sound signal, taking any of the values 1, 2, 3, ...;
2. frame the pre-emphasized signal y(n), with a frame length of N sampling points; here N is 256, and in other preferred embodiments N can be set to any positive integer power of 2;
3. apply a Hamming window to the pre-emphasized signal y(n); the windowed signal is S(n) = y(n)·w(n), where the Hamming window w(n) is w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1;
4. perform a fast Fourier transform on the windowed signal S(n) to obtain the frequency-domain signal X(k) = fft(S(n), N);
5. take the squared modulus of X(k) to obtain the energy spectrum, then filter it with the Gammatone filter bank, whose frequency response is obtained as H(k) = fft(h(n), N); the impulse response of a Gammatone filter is g(t) = t^(n-1)·exp(-2πBt)·cos(2πf_i·t), t ≥ 0, where f_i is the centre frequency and B = 1.019·(24.7 + 0.108·f_i);
6. log-compress the output of each Gammatone filter: s(p) = ln(Σ_k |X(k)|²·|H_p(k)|²), p = 1, ..., P, where P is the number of filters;
7. apply a discrete cosine transform to the log-compressed signal to obtain the GFLCC (Gammatone Frequency Log Cepstrum Coefficient) features, C(m) = Σ_{p=1..P} s(p)·cos(πm(p - 0.5)/P), m = 1, ..., M, where M is the dimension of the GFLCC features;
8. apply raised half-sine cepstral liftering to the features obtained by the discrete cosine transform, C'(i) = C(i)·ω(i), to obtain the final features.
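A compact sketch of steps 1-8 for a single frame follows. The filter count, centre-frequency spacing, α = 0.97 and the lifter length are illustrative assumptions, not values fixed by the patent; the Gammatone bank is realized by transforming the time-domain impulse response, as step 5 describes.

```python
import numpy as np

def gflcc(x, fs=8000, N=256, P=8, M=6, alpha=0.97, lifter=22):
    # Steps 1-3: pre-emphasis, one frame of N samples, Hamming window.
    y = np.append(x[0], x[1:] - alpha * x[:-1])[:N]
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    S = y * w
    # Steps 4-5: FFT energy spectrum and Gammatone responses H_p(k),
    # obtained as the FFT of the time-domain impulse response h_p(n).
    E = np.abs(np.fft.rfft(S, N)) ** 2
    t = np.arange(N) / fs
    centers = np.linspace(100.0, 0.9 * fs / 2, P)  # assumed spacing
    out = np.empty(P)
    for p, fc in enumerate(centers):
        B = 1.019 * (24.7 + 0.108 * fc)
        h = t ** 3 * np.exp(-2 * np.pi * B * t) * np.cos(2 * np.pi * fc * t)
        H = np.abs(np.fft.rfft(h, N)) ** 2
        # Step 6: log-compress each filter's accumulated energy.
        out[p] = np.log(np.dot(E, H) + 1e-12)
    # Step 7: DCT of the log filter energies -> GFLCC.
    m = np.arange(1, M + 1)[:, None]
    C = np.cos(np.pi * m * (np.arange(P) + 0.5) / P) @ out
    # Step 8: raised half-sine cepstral liftering.
    C *= 1.0 + (lifter / 2.0) * np.sin(np.pi * np.arange(1, M + 1) / lifter)
    return C

x = np.sin(2 * np.pi * 440 * np.arange(256) / 8000)  # one 440 Hz frame
feats = gflcc(x)
```

A production front end would process overlapping frames and stack the per-frame GFLCC vectors; only the per-frame computation is shown here.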
Referring to Fig. 5, when performing model training, both the sounds that require an alarm and the normal sounds of the actual monitored area must pass through the feature extraction module and then enter the training module to be trained, producing the trained models.

In actual monitoring, the sound of the monitored site is monitored in real time. To improve the classification of sound types, the fuzzy least squares support vector machine algorithm is employed; this algorithm attributes each sample uniquely to some class. The sounds likely to occur in the scene, or the sounds that need to be monitored, must therefore be produced artificially and collected during model training. For abnormal sound monitoring in the scene, the sounds likely to occur in the monitored site must be collected, such as footsteps, applause, car engines, gunshots, explosions, impacts and screams; among these, the sounds requiring an alarm are those that should not occur, such as gunshots, explosions, impacts and screams. For monitoring crowd emotion in the site, emotionally coloured sounds must be collected, such as speech carrying happy, sad, calm, lively, angry or furious emotion; an alarm is issued for the sounds carrying dangerous emotion. For monitoring the vocabulary in crowd speech in the site, the sounds of vocabulary likely to occur in the site must be collected, such as words about eating, shopping, studying and playing as well as 'help', 'murder' and 'attack'; among these, the sounds requiring an alarm are those of dangerous vocabulary such as 'help', 'murder' and 'attack'. Both the sounds that trigger an alarm and those that do not must be produced artificially. The feature extraction module performs feature extraction on the sounds and transfers the feature values to the training module; the training module trains on the feature values using the fuzzy least squares support vector machine algorithm and outputs the trained models.
Referring to Fig. 6, this embodiment monitors three types of sound, so the feature extraction of sound is of three types. Three types of sound model are established on the basis of the three types of sound feature, namely: the abnormal sound model, the speech emotion model, and the speech-text model. The three kinds of sound features are matched and classified against the three kinds of sound models; the classification algorithm is the fuzzy least squares support vector machine algorithm; three kinds of classification results are output after matching and classification.
When the classification result of step S3 is a non-speech abnormal sound, step S4 judges that the corresponding on-site event is one or more of a shooting, a collision, an explosion or some other dangerous event, and issues an alarm. When the classification result is speech carrying emotion, step S4 issues an alarm when angry or furious features appear in the corresponding crowd emotion. When the classification result is sensitive-word speech, step S4 issues an alarm according to the recognized sensitive word.
In concrete application, when the alarm module obtains the classification results, the three kinds of results are independent of each other. For example, if abnormal sound detection finds a dangerous sound, result one raises an alarm, while results two and three may detect no danger and raise none. Monitoring personnel can act according to the alarm type issued by the alarm device. If an abnormal sound alarm occurs, the event is relatively serious, such as a shooting or an explosion, and monitoring personnel can call the police and an ambulance. Speech emotion alarms are mostly crowd disputes, so monitoring personnel can send colleagues to mediate in advance, or selectively call the police and an ambulance. A speech-to-text alarm indicates events such as an assault or a cry for help, and monitoring personnel can call the police and an ambulance.
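The independence of the three alarm channels can be expressed as a small rule table. The channel names, labels and dangerous sets below are illustrative encodings, not taken verbatim from the patent.

```python
# Hypothetical encoding of the three independent alarm channels: each
# channel classifies on its own, and any dangerous result triggers an alarm.
DANGEROUS = {
    "abnormal": {"gunshot", "explosion", "impact", "scream"},
    "emotion": {"angry", "furious"},
    "keyword": {"help", "murder", "attack"},
}

def decide_alarms(results):
    # results: dict channel -> detected label (or None if nothing detected).
    alarms = {ch: (label in DANGEROUS[ch]) if label else False
              for ch, label in results.items()}
    return alarms, any(alarms.values())

alarms, alarm_now = decide_alarms(
    {"abnormal": "footsteps", "emotion": "angry", "keyword": None})
# Only the emotion channel fires; a safe result on another channel
# does not suppress it.
```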
After step S1 extracts the features, the method further comprises: learning the feature values extracted from the several specific sounds using the fuzzy least squares support vector machine algorithm, and establishing the sound models and their classes. Step S3 then further comprises matching and classifying the features of the sound signal collected on site against the corresponding sound models, likewise using the fuzzy least squares support vector machine to match the models against the features of the on-site sound. In step S4, the output judged from the classification result is either a result requiring an alarm or a result not requiring an alarm.
In the preferred embodiment, since the sound models of this embodiment are built in three classes, the alarms of this embodiment are set as follows:

The first class is the model of non-speech sound features, i.e. monitoring abnormal sounds in the scene. Models are built for the sounds likely to occur in the monitored site, such as footsteps, applause, car engines, gunshots, explosions, impacts and screams; the sounds requiring an alarm are those that should not occur, such as gunshots, explosions, impacts and screams.

The second class is the model of crowd speech emotion features, i.e. monitoring the emotion in crowd speech in the scene. Emotions include those possessed by humans such as happy, normal, calm, lively, angry and furious; an alarm is issued for dangerous emotions such as anger and fury.

The third class is the model of the features needed for crowd speech-to-text conversion, i.e. monitoring the vocabulary in crowd speech in the scene. Models are built from the sound features of vocabulary likely to occur in the monitored site, such as words about eating, shopping, studying and playing as well as 'help', 'murder' and 'attack'; the vocabulary requiring an alarm is that which should not occur, such as 'help', 'murder' and 'attack'.
Here the fuzzy least squares vector machine used has done further improvement on the basis of traditional SVMs, makes
Obtain each sample and be attributed to some classification;
A fuzzy membership s_i is introduced, and the optimisation problem becomes:

min (1/2)‖w‖² + (C/2) Σ_i s_i ξ_i²
s.t. y_i(w·x_i + b) = 1 − ξ_i,

where x_i is an m-dimensional input vector, y_i is the sample class, i is the sample index, w is the normal vector of the hyperplane w·x_i + b = 0, b is the hyperplane bias, C is the penalty parameter, and the slack factor ξ_i represents the distance from x_i to the hyperplane w·x_i + b = 0.
For the i-th class of samples against the j-th class of samples, the optimal decision surface function is:

D_ij(x) = w_ij^T x + b_ij,

the pairwise fuzzy membership function is defined as:

m_ij(x) = min(1, D_ij(x)),

the fuzzy membership function of the i-th class of samples is:

m_i(x) = min_{j≠i} m_ij(x),

and the sample data x is assigned to the class:

arg max_i m_i(x).
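As an illustrative sketch only, and not the patent's formulation, one common way to assign the fuzzy membership s_i weights each training sample by its distance from the class centre, so that outliers contribute less to the penalty term; the function name, the linear decay rule and the small offset `delta` are all assumptions:

```python
import math

def fuzzy_memberships(samples, delta=1e-3):
    # Distance-based fuzzy membership (a common choice in fuzzy SVMs):
    # s_i = 1 - d_i / (r + delta), where d_i is the distance of sample i
    # from the class centre and r is the class radius. Outliers receive
    # small s_i, so their slack is penalised less in the weighted term.
    n = len(samples)
    dim = len(samples[0])
    centre = [sum(x[d] for x in samples) / n for d in range(dim)]
    dists = [math.dist(x, centre) for x in samples]
    r = max(dists)
    return [1.0 - d / (r + delta) for d in dists]
```

With this weighting, a sound sample corrupted by heavy noise pulls the decision surface less strongly than a clean one.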
The present embodiment also provides a monitoring apparatus based on voice recognition. With reference to Figure 2, the apparatus includes:
A sound pickup for collecting sound signals;
A model training module for collecting several specific sounds in advance and performing sound-model training to obtain trained sound models; its hardware comprises a microprocessor, integrated circuits, programmable gate circuits and the like, and it can build models for the three classes of sound features;
A feature extraction module for performing, on the sound signals collected on site, feature extraction corresponding to the several specific sounds; its hardware comprises a microprocessor, integrated circuits, programmable gate circuits and the like, and it can extract the three classes of sound features on demand;
A matching classification module that matches and classifies the features extracted by the feature extraction module against the sound models to obtain classification results for the live sound; its hardware comprises a microprocessor, integrated circuits, programmable gate circuits and the like, and it can perform matching classification with a one-to-one correspondence between model types and features;
An alarm module that judges from the classification results whether an alarm is needed and, if so, raises an alarm prompt. The alarm prompts are, respectively, abnormal-sound alarms, speech-emotion alarms and dangerous-vocabulary alarms. For abnormal sounds, no alarm is raised if sounds consistent with the scene, such as footsteps, applause or car engine noise, are detected, while an alarm is raised if dangerous sounds such as gunshots, explosions, impacts or screams are detected. For speech emotion, no alarm is raised if normal, safe emotions such as happiness or liveliness are detected, while an alarm is raised if dangerous emotions such as annoyance or anger are detected. For speech-to-text, no alarm is raised if normal vocabulary such as eating, shopping, studying or playing is detected, while an alarm is raised if dangerous vocabulary such as cries for help, murder or assault is detected.
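The three alarm decisions above can be sketched as a simple table lookup; the English labels, category names and function name are assumptions made for illustration, not the patent's implementation:

```python
# Danger sets per alarm category, mirroring the three alarm types above.
DANGEROUS = {
    "abnormal_sound": {"gunshot", "explosion", "impact", "scream"},
    "speech_emotion": {"annoyance", "anger"},
    "sensitive_word": {"help", "murder", "assault"},
}

def should_alarm(category, label):
    # Alarm only when the classified label is in its category's danger set;
    # scene-consistent sounds, safe emotions and normal vocabulary pass.
    return label in DANGEROUS.get(category, set())
```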
In addition, the present embodiment also provides a monitoring system based on voice recognition, comprising one or more monitoring apparatuses based on voice recognition as described above, or a monitoring apparatus with multiple sound pickups that obtains multiple signals, whose other modules process the sound signals separately.
The foregoing is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement made by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be defined by the scope of the claims.
Claims (10)
1. A monitoring method based on voice recognition, characterised by comprising the following steps:
S1: collecting several specific sounds in advance and performing sound-model training to obtain trained sound models;
S2: collecting on-site sound and performing, on the collected sound, feature extraction corresponding to the several specific sounds;
S3: matching and classifying the extracted features against the sound models to obtain classification results for the live sound;
S4: judging from the classification results whether an alarm is needed.
2. The monitoring method based on voice recognition according to claim 1, characterised in that the specific sounds include non-speech abnormal sounds, speech with emotion and sensitive-word speech; correspondingly, when features are extracted in step S2, the extracted features are respectively: non-speech sound features for abnormal-sound monitoring; crowd speech-emotion features for crowd-mood monitoring; and the features required for speech-to-text, extracted for monitoring crowd speech carrying sensitive vocabulary.
3. The monitoring method based on voice recognition according to claim 2, characterised in that, when extracting non-speech sound features, the abnormal-sound feature extraction method based on D-ESMD is used, specifically comprising the following steps:
(1) setting the number K of t-distribution random noises;
(2) collecting the on-site sound signal s and adding a t-distribution random noise to the sound signal s to obtain a noisy signal S_i, where i is the index of the noisy signal;
(3) decomposing the noisy signal S_i by ESMD with the symmetric-point interpolation method to obtain a modal component;
(4) calculating the permutation entropy H of the modal component and comparing it with a field-tested threshold;
(5) if the permutation entropy H is greater than the threshold, the modal component is a useful-signal modal component and the method proceeds to step (6); otherwise the modal component is noise;
(6) taking the residual signal as the input signal and repeating (3)-(5) until the n modal components obtained by decomposition are all noise, where n is a positive integer;
(7) if i < K, setting i = i + 1 and repeating (2)-(6) until i = K, obtaining all modal components and computing their overall mean, the overall mean serving as the final modal components of the decomposed signal;
(8) calculating the energy ratio of each modal component relative to the original sound signal s, combining the ratios into a feature vector and normalising it to obtain the feature vector of the original signal.
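The permutation entropy H used in step (4) above can be sketched as follows; this is an illustrative stand-in rather than the patent's implementation, and the normalisation by log(order!) and the parameter names are assumptions:

```python
import math

def permutation_entropy(x, order=3, delay=1):
    # Count ordinal patterns of length `order` in the sequence, then take
    # the Shannon entropy of the pattern distribution, normalised by
    # log(order!) so the result lies in [0, 1]. Noise-like modal
    # components score near 1; structured components score lower.
    counts = {}
    n = len(x) - (order - 1) * delay
    for i in range(n):
        window = tuple(x[i + j * delay] for j in range(order))
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        counts[pattern] = counts.get(pattern, 0) + 1
    total = sum(counts.values())
    h = -sum(c / total * math.log(c / total) for c in counts.values())
    return h / math.log(math.factorial(order))
```

A monotone ramp produces a single ordinal pattern and hence zero entropy, while irregular sequences score higher, which is what the threshold test in step (4) exploits.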
4. The monitoring method based on voice recognition according to claim 2, characterised in that, when extracting crowd speech-emotion features, the feature extraction method based on speech-emotion recognition is used, specifically: the feature set used in the international speech-emotion challenge is used to express the feature vectors.
5. The monitoring method based on voice recognition according to claim 2, characterised in that, when extracting the features required for speech-to-text, the speech feature extraction method based on Gammatone is used, specifically comprising the following steps:
(1) the sound signal collected on site is x(n); pre-emphasis is applied with pre-emphasis coefficient α, giving the pre-emphasised sound signal y(n) = x(n) − α·x(n−1), where n is the sample index of the sound signal collected on site;
(2) the pre-emphasised sound signal y(n) is divided into frames, the frame length being N sampling points, where N is a positive-integer power of 2;
(3) a Hamming window is applied to the pre-emphasised sound signal y(n); the windowed sound signal S(n) is expressed as S(n) = y(n)·w(n), where w(n) is the Hamming window;
(4) a fast Fourier transform is applied to the windowed sound signal S(n) to obtain the frequency-domain signal X(k) = fft(S(n), N);
(5) the modulus of the frequency-domain signal X(k) is squared to obtain the energy spectrum, which is then filtered with a Gammatone filter bank, giving the signal H(k) = fft(h(n), N);
(6) the output of each Gammatone filter is log-compressed;
(7) a discrete cosine transform is applied to the log-compressed signals to obtain the GFLCC;
(8) the features obtained by the discrete cosine transform are lifted with a raised half-sine cepstral lifter to obtain the final features.
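Steps (1)-(5) up to the energy spectrum can be sketched as below; a plain DFT stands in for the FFT of step (4), and the tiny frame length N=8, the coefficient α=0.97 and the function name are illustrative assumptions:

```python
import cmath
import math

def frame_energy_spectrum(x, alpha=0.97, N=8):
    # (1) pre-emphasis: y(n) = x(n) - alpha * x(n-1)
    y = [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]
    # (2) take one frame of N samples (N a power of 2)
    frame = y[:N]
    # (3) Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n / (N-1))
    w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]
    s = [frame[n] * w[n] for n in range(N)]
    # (4) N-point DFT, a plain stand-in for fft(S(n), N)
    X = [sum(s[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
         for k in range(N)]
    # (5) energy spectrum |X(k)|^2, ready for the Gammatone filter bank
    return [abs(Xk) ** 2 for Xk in X]
```

For a real input frame the energy spectrum is symmetric, so in practice only the first N/2+1 bins would be fed to the filter bank.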
6. The monitoring method based on voice recognition according to claim 2, characterised in that the non-speech abnormal sounds include one or more of gunshots, explosions, impacts and screams in the monitored scene; the speech with emotion includes speech carrying one of the emotions happy, normal, calm, lively, annoyed and angry; and the sensitive-word speech includes speech containing one or more of the dangerous vocabulary items that may occur: cries for help, murder, assault.
7. The monitoring method based on voice recognition according to claim 6, characterised in that, when the classification result is a non-speech abnormal sound, step S4 judges the corresponding live event to be one or more of a shooting event, a collision event, an explosion event or some other dangerous event, and raises an alarm prompt;
when the classification result is speech with emotion, step S4 raises an alarm prompt when annoyed or angry features appear in the corresponding crowd emotion;
when the classification result is sensitive-word speech, step S4 raises an alarm prompt according to the recognised sensitive word.
8. The monitoring method based on voice recognition according to claim 1 or 2, characterised in that step S1 specifically includes: learning the feature values extracted from the several specific sounds using the fuzzy least-squares support vector machine algorithm, and establishing the sound models and classes;
step S3 then further comprises matching and classifying the features of the sound signals collected on site against the sound models in one-to-one correspondence;
and in step S4 the output is judged from the classification results as either a result that requires an alarm or a result that does not.
9. A monitoring apparatus based on voice recognition, characterised by comprising:
a sound pickup for collecting sound signals;
a model training module for collecting several specific sounds in advance and performing sound-model training to obtain trained sound models;
a feature extraction module for performing, on the sound signals collected on site, feature extraction corresponding to the several specific sounds;
a matching classification module for matching and classifying the features extracted by the feature extraction module against the sound models to obtain classification results for the live sound;
and an alarm module for judging from the classification results whether an alarm is needed.
10. A monitoring system based on voice recognition, characterised by comprising one or more monitoring apparatuses based on voice recognition according to claim 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710944193.XA CN107527617A (en) | 2017-09-30 | 2017-09-30 | Monitoring method, apparatus and system based on voice recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710944193.XA CN107527617A (en) | 2017-09-30 | 2017-09-30 | Monitoring method, apparatus and system based on voice recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107527617A true CN107527617A (en) | 2017-12-29 |
Family
ID=60684025
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710944193.XA Pending CN107527617A (en) | 2017-09-30 | 2017-09-30 | Monitoring method, apparatus and system based on voice recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107527617A (en) |
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108769369A (en) * | 2018-04-23 | 2018-11-06 | 维沃移动通信有限公司 | A kind of method for early warning and mobile terminal |
CN108922548A (en) * | 2018-08-20 | 2018-11-30 | 深圳园林股份有限公司 | A kind of bird based on deep learning, frog intelligent monitoring method |
CN108986430A (en) * | 2018-09-13 | 2018-12-11 | 苏州工业职业技术学院 | Net based on speech recognition about vehicle safe early warning method and system |
CN109065069A (en) * | 2018-10-10 | 2018-12-21 | 广州市百果园信息技术有限公司 | A kind of audio-frequency detection, device, equipment and storage medium |
CN109298642A (en) * | 2018-09-20 | 2019-02-01 | 三星电子(中国)研发中心 | The method and device being monitored using intelligent sound box |
CN109410535A (en) * | 2018-11-22 | 2019-03-01 | 维沃移动通信有限公司 | A kind for the treatment of method and apparatus of scene information |
CN109447789A (en) * | 2018-11-01 | 2019-03-08 | 北京得意音通技术有限责任公司 | Method for processing business, device, electronic equipment and storage medium |
CN109493579A (en) * | 2018-12-28 | 2019-03-19 | 赵俊瑞 | A kind of public emergency automatic alarm and monitoring system and method |
CN109616140A (en) * | 2018-12-12 | 2019-04-12 | 浩云科技股份有限公司 | A kind of abnormal sound analysis system |
CN109640112A (en) * | 2019-01-15 | 2019-04-16 | 广州虎牙信息科技有限公司 | Method for processing video frequency, device, equipment and storage medium |
CN109754819A (en) * | 2018-12-29 | 2019-05-14 | 努比亚技术有限公司 | A kind of data processing method, device and storage medium |
CN109948739A (en) * | 2019-04-22 | 2019-06-28 | 桂林电子科技大学 | Ambient sound event acquisition and Transmission system based on support vector machines |
CN110033785A (en) * | 2019-03-27 | 2019-07-19 | 深圳市中电数通智慧安全科技股份有限公司 | A kind of calling for help recognition methods, device, readable storage medium storing program for executing and terminal device |
CN110310646A (en) * | 2019-05-22 | 2019-10-08 | 深圳壹账通智能科技有限公司 | Intelligent alarm method, apparatus, equipment and storage medium |
CN110415152A (en) * | 2019-07-29 | 2019-11-05 | 哈尔滨工业大学 | A kind of safety monitoring system |
CN110531345A (en) * | 2019-08-08 | 2019-12-03 | 中国人民武装警察部队士官学校 | To cheating interference method, system and the terminal device of shot positioning device |
CN110867959A (en) * | 2019-11-13 | 2020-03-06 | 上海迈内能源科技有限公司 | Intelligent monitoring system and monitoring method for electric power equipment based on voice recognition |
CN111223486A (en) * | 2019-12-30 | 2020-06-02 | 上海联影医疗科技有限公司 | Alarm device and method |
CN111599379A (en) * | 2020-05-09 | 2020-08-28 | 北京南师信息技术有限公司 | Conflict early warning method, device, equipment, readable storage medium and triage system |
CN111627430A (en) * | 2020-06-19 | 2020-09-04 | 北京世纪之星应用技术研究中心 | Multi-frequency domain fuzzy recognition alarm method and device for solid sound detection |
CN111951560A (en) * | 2020-08-30 | 2020-11-17 | 北京嘀嘀无限科技发展有限公司 | Service anomaly detection method, method for training service anomaly detection model and method for training acoustic model |
CN112101089A (en) * | 2020-07-27 | 2020-12-18 | 北京建筑大学 | Signal noise reduction method and device, electronic equipment and storage medium |
CN112349296A (en) * | 2020-11-10 | 2021-02-09 | 胡添杰 | Subway platform safety monitoring method based on voice recognition |
CN112493605A (en) * | 2020-11-18 | 2021-03-16 | 西安理工大学 | Intelligent fire fighting helmet for planning path |
CN112534500A (en) * | 2018-07-26 | 2021-03-19 | Med-El电气医疗器械有限公司 | Neural network audio scene classifier for hearing implants |
CN113660142A (en) * | 2021-08-19 | 2021-11-16 | 京东科技信息技术有限公司 | Fault monitoring method and device and related equipment |
CN113761267A (en) * | 2021-08-23 | 2021-12-07 | 珠海格力电器股份有限公司 | Prompt message generation method and device |
CN115132188A (en) * | 2022-09-02 | 2022-09-30 | 珠海翔翼航空技术有限公司 | Early warning method and device based on voice recognition, terminal equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007017853A1 (en) * | 2005-08-08 | 2007-02-15 | Nice Systems Ltd. | Apparatus and methods for the detection of emotions in audio interactions |
CN103730109A (en) * | 2014-01-14 | 2014-04-16 | 重庆大学 | Method for extracting characteristics of abnormal noise in public places |
CN103971702A (en) * | 2013-08-01 | 2014-08-06 | 哈尔滨理工大学 | Sound monitoring method, device and system |
CN106228979A (en) * | 2016-08-16 | 2016-12-14 | 重庆大学 | A kind of abnormal sound in public places feature extraction and recognition methods |
CN106328120A (en) * | 2016-08-17 | 2017-01-11 | 重庆大学 | Public place abnormal sound characteristic extraction method |
CN107086036A (en) * | 2017-04-19 | 2017-08-22 | 杭州派尼澳电子科技有限公司 | A kind of freeway tunnel method for safety monitoring |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007017853A1 (en) * | 2005-08-08 | 2007-02-15 | Nice Systems Ltd. | Apparatus and methods for the detection of emotions in audio interactions |
CN103971702A (en) * | 2013-08-01 | 2014-08-06 | 哈尔滨理工大学 | Sound monitoring method, device and system |
CN103730109A (en) * | 2014-01-14 | 2014-04-16 | 重庆大学 | Method for extracting characteristics of abnormal noise in public places |
CN106228979A (en) * | 2016-08-16 | 2016-12-14 | 重庆大学 | A kind of abnormal sound in public places feature extraction and recognition methods |
CN106328120A (en) * | 2016-08-17 | 2017-01-11 | 重庆大学 | Public place abnormal sound characteristic extraction method |
CN107086036A (en) * | 2017-04-19 | 2017-08-22 | 杭州派尼澳电子科技有限公司 | A kind of freeway tunnel method for safety monitoring |
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108769369A (en) * | 2018-04-23 | 2018-11-06 | 维沃移动通信有限公司 | A kind of method for early warning and mobile terminal |
CN112534500A (en) * | 2018-07-26 | 2021-03-19 | Med-El电气医疗器械有限公司 | Neural network audio scene classifier for hearing implants |
CN108922548A (en) * | 2018-08-20 | 2018-11-30 | 深圳园林股份有限公司 | A kind of bird based on deep learning, frog intelligent monitoring method |
CN108986430A (en) * | 2018-09-13 | 2018-12-11 | 苏州工业职业技术学院 | Net based on speech recognition about vehicle safe early warning method and system |
CN109298642A (en) * | 2018-09-20 | 2019-02-01 | 三星电子(中国)研发中心 | The method and device being monitored using intelligent sound box |
CN109298642B (en) * | 2018-09-20 | 2021-08-27 | 三星电子(中国)研发中心 | Method and device for monitoring by adopting intelligent sound box |
CN109065069A (en) * | 2018-10-10 | 2018-12-21 | 广州市百果园信息技术有限公司 | A kind of audio-frequency detection, device, equipment and storage medium |
US11948595B2 (en) | 2018-10-10 | 2024-04-02 | Bigo Technology Pte. Ltd. | Method for detecting audio, device, and storage medium |
CN109065069B (en) * | 2018-10-10 | 2020-09-04 | 广州市百果园信息技术有限公司 | Audio detection method, device, equipment and storage medium |
CN109447789A (en) * | 2018-11-01 | 2019-03-08 | 北京得意音通技术有限责任公司 | Method for processing business, device, electronic equipment and storage medium |
CN109410535A (en) * | 2018-11-22 | 2019-03-01 | 维沃移动通信有限公司 | A kind for the treatment of method and apparatus of scene information |
CN109616140A (en) * | 2018-12-12 | 2019-04-12 | 浩云科技股份有限公司 | A kind of abnormal sound analysis system |
CN109616140B (en) * | 2018-12-12 | 2022-08-30 | 浩云科技股份有限公司 | Abnormal sound analysis system |
CN109493579A (en) * | 2018-12-28 | 2019-03-19 | 赵俊瑞 | A kind of public emergency automatic alarm and monitoring system and method |
CN109754819A (en) * | 2018-12-29 | 2019-05-14 | 努比亚技术有限公司 | A kind of data processing method, device and storage medium |
CN109640112A (en) * | 2019-01-15 | 2019-04-16 | 广州虎牙信息科技有限公司 | Method for processing video frequency, device, equipment and storage medium |
CN110033785A (en) * | 2019-03-27 | 2019-07-19 | 深圳市中电数通智慧安全科技股份有限公司 | A kind of calling for help recognition methods, device, readable storage medium storing program for executing and terminal device |
CN109948739A (en) * | 2019-04-22 | 2019-06-28 | 桂林电子科技大学 | Ambient sound event acquisition and Transmission system based on support vector machines |
CN110310646A (en) * | 2019-05-22 | 2019-10-08 | 深圳壹账通智能科技有限公司 | Intelligent alarm method, apparatus, equipment and storage medium |
CN110415152A (en) * | 2019-07-29 | 2019-11-05 | 哈尔滨工业大学 | A kind of safety monitoring system |
CN110531345B (en) * | 2019-08-08 | 2021-08-13 | 中国人民武装警察部队士官学校 | Deception jamming method and system for gunshot positioning device and terminal equipment |
CN110531345A (en) * | 2019-08-08 | 2019-12-03 | 中国人民武装警察部队士官学校 | To cheating interference method, system and the terminal device of shot positioning device |
CN110867959A (en) * | 2019-11-13 | 2020-03-06 | 上海迈内能源科技有限公司 | Intelligent monitoring system and monitoring method for electric power equipment based on voice recognition |
CN111223486A (en) * | 2019-12-30 | 2020-06-02 | 上海联影医疗科技有限公司 | Alarm device and method |
CN111223486B (en) * | 2019-12-30 | 2023-02-24 | 上海联影医疗科技股份有限公司 | Alarm device and method |
CN111599379B (en) * | 2020-05-09 | 2023-09-29 | 北京南师信息技术有限公司 | Conflict early warning method, device, equipment, readable storage medium and triage system |
CN111599379A (en) * | 2020-05-09 | 2020-08-28 | 北京南师信息技术有限公司 | Conflict early warning method, device, equipment, readable storage medium and triage system |
CN111627430A (en) * | 2020-06-19 | 2020-09-04 | 北京世纪之星应用技术研究中心 | Multi-frequency domain fuzzy recognition alarm method and device for solid sound detection |
CN112101089B (en) * | 2020-07-27 | 2023-10-10 | 北京建筑大学 | Signal noise reduction method and device, electronic equipment and storage medium |
CN112101089A (en) * | 2020-07-27 | 2020-12-18 | 北京建筑大学 | Signal noise reduction method and device, electronic equipment and storage medium |
CN111951560B (en) * | 2020-08-30 | 2022-02-08 | 北京嘀嘀无限科技发展有限公司 | Service anomaly detection method, method for training service anomaly detection model and method for training acoustic model |
CN111951560A (en) * | 2020-08-30 | 2020-11-17 | 北京嘀嘀无限科技发展有限公司 | Service anomaly detection method, method for training service anomaly detection model and method for training acoustic model |
CN112349296A (en) * | 2020-11-10 | 2021-02-09 | 胡添杰 | Subway platform safety monitoring method based on voice recognition |
CN112493605A (en) * | 2020-11-18 | 2021-03-16 | 西安理工大学 | Intelligent fire fighting helmet for planning path |
CN113660142A (en) * | 2021-08-19 | 2021-11-16 | 京东科技信息技术有限公司 | Fault monitoring method and device and related equipment |
CN113761267A (en) * | 2021-08-23 | 2021-12-07 | 珠海格力电器股份有限公司 | Prompt message generation method and device |
CN115132188A (en) * | 2022-09-02 | 2022-09-30 | 珠海翔翼航空技术有限公司 | Early warning method and device based on voice recognition, terminal equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107527617A (en) | Monitoring method, apparatus and system based on voice recognition | |
CN109616140B (en) | Abnormal sound analysis system | |
CN109473120A (en) | A kind of abnormal sound signal recognition method based on convolutional neural networks | |
CN106874833A (en) | A kind of mode identification method of vibration event | |
CN102148032A (en) | Abnormal sound detection method and system for ATM (Automatic Teller Machine) | |
WO2009046359A2 (en) | Detection and classification of running vehicles based on acoustic signatures | |
CN112735473B (en) | Method and system for identifying unmanned aerial vehicle based on voice | |
CN102163427A (en) | Method for detecting audio exceptional event based on environmental model | |
CN105095624A (en) | Method for identifying optical fibre sensing vibration signal | |
CN113566948A (en) | Fault audio recognition and diagnosis method for robot coal pulverizer | |
CN111613240B (en) | Camouflage voice detection method based on attention mechanism and Bi-LSTM | |
CN116778964A (en) | Power transformation equipment fault monitoring system and method based on voiceprint recognition | |
CN112349296A (en) | Subway platform safety monitoring method based on voice recognition | |
CN114352486A (en) | Wind turbine generator blade audio fault detection method based on classification | |
Li et al. | Research on environmental sound classification algorithm based on multi-feature fusion | |
Sigmund et al. | Efficient feature set developed for acoustic gunshot detection in open space | |
CN109389994A (en) | Identification of sound source method and device for intelligent transportation system | |
CN109087666A (en) | The identification device and method that prison is fought | |
Spadini et al. | Sound event recognition in a smart city surveillance context | |
Estrebou et al. | Voice recognition based on probabilistic SOM | |
CN114155878B (en) | Artificial intelligence detection system, method and computer program | |
CN111048089A (en) | Method for improving voice awakening success rate of intelligent wearable device, electronic device and computer readable storage medium | |
CN109243486A (en) | A kind of winged acoustic detection method of cracking down upon evil forces based on machine learning | |
CN113247730B (en) | Elevator passenger screaming detection method and system based on multi-dimensional features | |
CN112960506B (en) | Elevator warning sound detection system based on audio frequency characteristics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20171229