CN106710599A - Particular sound source detection method and particular sound source detection system based on deep neural network - Google Patents

Particular sound source detection method and particular sound source detection system based on deep neural network

Info

Publication number
CN106710599A
CN106710599A (application CN201611099733.0A)
Authority
CN
China
Prior art keywords
sound source
neural network
deep neural
dnn
particular sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611099733.0A
Other languages
Chinese (zh)
Inventor
蔡钢林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Sahara Data Technology Co Ltd
Original Assignee
Shenzhen Sahara Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Sahara Data Technology Co Ltd filed Critical Shenzhen Sahara Data Technology Co Ltd
Priority to CN201611099733.0A priority Critical patent/CN106710599A/en
Publication of CN106710599A publication Critical patent/CN106710599A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
        • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L17/00 — Speaker identification or verification techniques
                    • G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
                    • G10L17/04 — Training, enrolment or model building
                    • G10L17/18 — Artificial neural networks; Connectionist approaches


Abstract

The invention relates to a particular sound source detection method and system based on a deep neural network (DNN). The method comprises the following steps: extracting acoustic features from a real-time sound signal to generate an acoustic feature vector; training a DNN model on preset sound signals using a deep neural network method; and using the trained DNN model to detect and judge the acoustic feature vector, so that different sound source signals are modelled with deep neural network techniques. Because a deep neural network has stronger modelling capability, and its modelling accuracy rises further when sufficient data are available, sound sources with very similar feature spaces can still be distinguished. By fusing the features of adjacent time frames, the method achieves high single-frame detection accuracy and strong real-time performance, with a decision delay of no more than 0.5 seconds, making it highly practical.

Description

A particular sound source detection method and system based on a deep neural network
Technical field
The present invention relates to the field of speech technology, and in particular to a particular sound source detection method and system based on a deep neural network.
Background technology
With the wide availability of intelligent hardware, related technologies such as intelligent robots and smart toys are maturing. Sound-signal processing technologies such as speech recognition and particular sound source detection are urgently needed technical components of intelligent terminals. Particular sound source detection refers to collecting the sound signal of the application environment in real time through the microphone of an intelligent terminal, detecting in time whether a sound source that the user is interested in or concerned about has appeared, and promptly feeding the result back to the user.
The acoustic signals emitted by different sound sources, for example the sounds made by people versus machines, or by people versus pets, differ in their spectral structure; even the speech signals of different people differ. Typical sound source detection and classification applications include voiceprint recognition, multimedia music classification, bird and animal identification based on acoustic signal processing, and baby-cry detection. Classification based on machine learning is the most mature technique for such applications: given a certain amount of training data, the acoustic features of different classes can be modelled in order to process the different sound signals. In addition, some more specialised applications can use simpler implementations. For example, the voice activity detection (Voice Activity Detection, VAD) technology used at the front end of speech recognition only needs to distinguish speech from silence; it can rely on energy-based detection, aided by features such as the zero-crossing rate, to achieve unsupervised classification.
Bayesian classification based on Gaussian mixture models (Gaussian Mixture Model, GMM) is a common technique for particular sound source detection and classification. The method divides the various sound signals that may be picked up in the real-time application environment into several classes and models each class with a GMM. For example, the relatively mature voiceprint recognition technique models the speech signals of different people and performs classification and voiceprint verification using the Bayesian decision criterion. Other mature machine learning techniques, such as support vector machines (Support Vector Machine, SVM), can also be used for the classification and detection of sound signals.
The voice activity detection technology commonly used in robust speech recognition can also be regarded as a kind of particular sound source detection: it classifies frames into speech or silence by detecting features such as energy and the zero-crossing rate in real time. Its basic idea is to assume that the environment contains only two scenarios, speech or the absence of any acoustic signal; a threshold can then be set, and a frame is judged to be speech when the energy of the acoustic signal collected by the microphone exceeds that threshold.
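The energy-threshold scheme described above can be sketched in a few lines. This is an illustrative sketch, not code from the patent; the frame length, threshold value and function names are assumptions.

```python
import math

def frame_energy(samples):
    """Mean squared amplitude of one frame of audio samples."""
    return sum(s * s for s in samples) / len(samples)

def vad_decide(samples, threshold=0.01):
    """Energy-based voice activity decision: True ('speech') when the
    frame energy exceeds the threshold, False ('silence') otherwise."""
    return frame_energy(samples) > threshold

# Synthetic frames: a loud tone-like frame and a near-silent one.
loud = [0.5 * math.sin(0.1 * n) for n in range(256)]
quiet = [0.001 * math.sin(0.1 * n) for n in range(256)]
```

In practice such a detector would also consult the zero-crossing rate, as the text notes, since energy alone cannot separate speech from other loud sources.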
Voice activity detection can only distinguish the two scenarios of speech versus silence, whereas practical applications are more complicated than this assumption: in a real environment, the sounds of various machines (televisions, loudspeakers, and so on) may be present simultaneously. The discriminative power of this technique is too poor for it to generalise to the more general sound source detection problem.
Existing machine learning methods have two main defects for the task of sound source detection. First, the models capture sound source classes with poor precision, so classification accuracy is low; this defect is especially apparent in applications where the feature spaces of the sound sources are close to one another. For example, conventional Bayesian methods cannot effectively distinguish a speech signal produced by a person from one emitted by a loudspeaker. Second, real-time performance is poor. In many applications a correct judgement must be made immediately after a specific acoustic signal appears, for example a child crying or an alarm. Existing statistical learning methods, however, need to analyse a very long segment of the sound signal before making a judgement, causing a very long decision delay.
The content of the invention
The main object of the present invention is to provide a particular sound source detection method and system based on a deep neural network, solving the problems of low detection precision and poor real-time performance.
The present invention proposes a particular sound source detection method based on a deep neural network, comprising the following steps:
extracting the acoustic features of a real-time sound signal to generate an acoustic feature vector;
performing detection and judgement on the acoustic feature vector using a DNN model, the DNN model having been trained on preset sound signals using a deep neural network method.
Preferably, the step of extracting the acoustic features of the sound signal includes:
obtaining the real-time sound signal and pre-processing it;
applying a Fourier transform to the pre-processed real-time sound signal, then Mel filtering, and then a discrete cosine transform, to obtain the acoustic features of the real-time sound signal;
fusing these acoustic features with the acoustic features of the several immediately preceding sound frames to generate the acoustic feature vector.
Preferably, the step of obtaining and pre-processing the real-time sound signal includes:
obtaining the real-time sound signal, converting it into a digital signal at a preset quantisation rate, and monitoring the pending buffer; when the buffer has been filled with a specified duration of data, writing the data to the history buffer and carrying out feature extraction;
combining the data with the specified duration of data from the previous buffer and applying a window function to the two combined segments.
Preferably, the process of building the DNN model includes:
preparing multiple preset audio data items, including the target particular sound source;
performing feature extraction on each audio item and building a calibrated sound vector containing the extracted sound features and a calibration value;
updating the DNN weights using the DNN toolkit Kaldi.
Preferably, the step of performing detection and judgement on the acoustic feature vector using the DNN model includes:
computing the DNN output probability;
smoothing the DNN output probability using the probability output of the previous frame;
comparing the smoothed probability with a set threshold to judge whether the particular sound source is present.
Preferably, the step of computing the DNN output probability includes:
initialising the input layer,
a_1 = X_C(n), where X_C(n) is the acoustic feature vector of the real-time sound signal and a_1 is the input-layer activation;
computing the hidden-layer activations and the inter-layer weighted sums, iterating as follows:
a_{l-1} = [1, a_{l-1}]
z_l = a_{l-1} × w_{l-1}
a_l = ReLU(z_l) = log(1 + exp(z_l))
where a_{l-1} denotes the activation output of layer l-1,
a_l denotes the activation output of layer l,
z_l denotes the weighted sum input to layer l,
w_{l-1} denotes the connection weights between layers l-1 and l,
w_l denotes the connection weights between layers l and l+1,
and ReLU denotes the rectified activation function, here realised by its smooth (softplus) form log(1 + exp(z)). All hidden layers are computed in turn, layer by layer. The output-layer activation value is then computed from the data of the last hidden layer and taken as the DNN output probability, as follows:
a_{L-1} = [1, a_{L-1}]
z_L = a_{L-1} × w_{L-1}
where a_{L-1} denotes the activation output of layer L-1,
a_L denotes the activation output of layer L (the output layer, obtained by mapping z_L into [0, 1]),
z_L denotes the weighted sum input to layer L,
and w_{L-1} denotes the connection weights between layers L-1 and L.
The output-layer activation is taken as the probability that the current time frame is judged to be the particular sound source, i.e.:
P(y(n) = 1 | X_C(n)) = a_L
Preferably, in the step of smoothing the DNN output probability with the probability output of the previous frame, the smoothed probability is computed as:
p̄(n) = α · p̄(n−1) + (1 − α) · P(y(n) = 1 | X_C(n))
where α is the smoothing factor.
Preferably, α takes a value in the range 0.75 to 0.85.
Preferably, the set threshold is 0.5.
The present invention also proposes a particular sound source detection system based on a deep neural network, including:
a feature extraction module for extracting the acoustic features of a real-time sound signal and generating an acoustic feature vector;
a DNN model building module for training on preset sound signals using a deep neural network method and building the DNN model;
a detection module for performing detection and judgement on the acoustic feature vector using the DNN model.
In the particular sound source detection method and system based on a deep neural network of the present invention, the method comprises: extracting the acoustic features of a real-time sound signal to generate an acoustic feature vector; training on preset sound signals using a deep neural network method to build a DNN model; and performing detection and judgement on the acoustic feature vector using the DNN model, so that different sound source signals are modelled using deep neural network techniques. Because a deep neural network has stronger modelling capability, and its modelling accuracy rises further when sufficient data are available, it can handle sound source detection problems in which the feature spaces are close to one another. The invention uses an adjacent-time-frame feature fusion technique, achieving high single-frame detection accuracy and strong real-time performance, with a decision delay of no more than 0.5 seconds, making it highly practical.
Brief description of the drawings
Fig. 1 is a flowchart of the first embodiment of the particular sound source detection method based on a deep neural network of the present invention;
Fig. 2 is a flowchart of acoustic feature extraction in the second embodiment of the method;
Fig. 3 is a schematic diagram of the Hanning window;
Fig. 4 is a structural diagram of the deep neural network;
Fig. 5 is a data-flow diagram of particular sound source detection in the fourth embodiment of the method.
The realisation of the objects, functional characteristics and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Specific embodiment
It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
Embodiment 1
As shown in Fig. 1, the present invention proposes a particular sound source detection method based on a deep neural network, comprising the following steps:
S10, extracting the acoustic features of a real-time sound signal to generate an acoustic feature vector;
S20, performing detection and judgement on the acoustic feature vector using a DNN model, the DNN model having been trained on preset sound signals using a deep neural network method.
Step S10 is mainly used to carry out feature extraction on the sound signal.
The feature extraction step abstracts the spectral structure of the acoustic signal into a group of feature vectors that reflect the sound source class. Different classification tasks may use different acoustic features. Usable features include Mel-frequency cepstral coefficients (Mel-scale Frequency Cepstral Coefficients, MFCC) and linear predictive coding (linear predictive coding, LPC). In fine-grained acoustic classification tasks, features can also be selected using techniques such as principal component analysis.
Step S20 covers two aspects: building the DNN model, and using the model to detect on the acoustic feature vector. The model training step statistically summarises the acoustic features of the different classes of acoustic signals and builds a mathematical model for each class. The model requires a certain amount of prior data; in general, the more abundant the data, the more accurate the model, and the better the classification and detection results.
A DNN (Deep Neural Network) model is a machine learning model that has risen in recent years and has achieved breakthrough progress in fields such as speech and images. Deep neural networks originate from neural networks; under the combined conditions of big data, high-performance computing, and advances in algorithmic theory, they have made great breakthroughs on that foundation. A deep neural network uses multi-level nonlinear processing to automatically mine the abstract feature structure in data, providing classes for the final supervised learning or predicted features. Deep neural networks excel at handling complex high-dimensional structural signals.
Detection and judgement are then carried out on the acoustic feature vector using the DNN model.
A voice wake-up application for robust speech recognition can use the method proposed by the present invention. When speech recognition is applied in a smart home environment, the user does not issue voice commands most of the time; if the intelligent terminal invoked the server-side recognition engine whenever it collected any sound signal, this would cause a large number of misrecognitions and false triggers, and would also occupy bandwidth for long periods. The present invention can serve as a voice wake-up function, i.e. the back-end recognition engine is invoked only when the system detects user speech. Using 10 hours of speech signals produced by users together with various environmental noises to train the DNN model, and applying the sound source judgement method of this invention, the frame-level false-alarm rate is below 12% and the frame-level miss rate is below 1%, effectively supporting the application of robust speech recognition systems in real environments.
Pet-call detection for intelligent robots. Using the method of this invention, a household robot can detect in real time whether the cry of a pet dog, cat, or other pet is present in the home environment and respond, improving the intelligence and entertainment value of robot interaction. Using 20 hours of cries from 10 classes of pet dogs to train the DNN model, the frame-level false-alarm rate of sound source detection is below 15% and the frame-level miss rate is below 5%.
Embodiment 2
Feature extraction
This embodiment automatically monitors the data buffer and, through framing, windowing and related processing, transforms the 1-dimensional time-domain signal into the feature space and extracts the acoustic features. The data processing flow is shown in Fig. 2.
The detailed steps are as follows:
(1) The real-time recording byte stream is obtained and converted into a digital signal at a 16-bit quantisation rate. The pending buffer is monitored; when the buffer has been filled with 16 milliseconds of data (256 data samples at a 16 kHz sampling rate), the data are written to the history buffer and feature extraction is carried out.
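The buffering of step (1) can be sketched as follows. This is a sketch under stated assumptions (Python lists stand in for the recording byte stream; the class and method names are invented for illustration), not the patent's implementation.

```python
BLOCK = 256  # 16 ms at a 16 kHz sampling rate, as in the text

class FrameBuffer:
    """Collects incoming samples into 256-sample blocks and pairs each new
    block with the previous one, yielding the 512-sample (32 ms) analysis
    window described in the next step."""

    def __init__(self):
        self.pending = []             # samples not yet forming a full block
        self.history = [0.0] * BLOCK  # the previous 16 ms block

    def push(self, samples):
        """Append new samples; return any completed 512-sample windows."""
        self.pending.extend(samples)
        windows = []
        while len(self.pending) >= BLOCK:
            block = self.pending[:BLOCK]
            self.pending = self.pending[BLOCK:]
            windows.append(self.history + block)  # previous block + current
            self.history = block
        return windows
```

The first window is padded with silence because no history exists yet; afterwards every 16 ms block is analysed twice, once as the tail of one window and once as the head of the next.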
(2) The 16 milliseconds of data from the previous buffer are combined with the current block, giving 32 milliseconds in total (512 data samples at a 16 kHz sampling rate), and windowing is applied. This embodiment uses a Hanning window function of length 512. Assuming the original time-domain signal is x(t), the windowed data are x_w(t) = x(t) · w(t), where w(t) = 0.5 − 0.5 · cos(2πt/511), t = 0, …, 511. The Hanning window function is shown in Fig. 3.
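The windowing of step (2) can be illustrated directly. The original window formula is an image in the source, so the standard Hanning definition used below is a reconstruction:

```python
import math

N = 512  # window length used in this embodiment

def hanning(n, length=N):
    """Standard Hanning window: 0.5 - 0.5*cos(2*pi*n/(length-1))."""
    return 0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))

def apply_window(frame):
    """Multiply a time-domain frame by the Hanning window, sample by sample."""
    return [x * hanning(n, len(frame)) for n, x in enumerate(frame)]
```

The tapered edges of the window reduce spectral leakage in the FFT of the next step.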
(3) A fast Fourier transform (Fast Fourier Transformation, FFT) is applied to convert the time-domain signal to the spectral domain, and the result is squared:
P(k) = |X(k)|², where X(k) is the FFT of the windowed frame x_w(t).
(4) The spectrum is passed through 23 Mel filters and transformed to the log domain. The 23 Mel filters are triangular filters spaced evenly on the Mel scale, Mel(f) = 2595 · log10(1 + f/700), where f is the frequency of the corresponding band and m takes values 1 to 23. After Mel filtering and taking the logarithm, the 23-dimensional data are:
E(m) = log( Σ_k P(k) · H_m(k) ), m = 1, …, 23, where H_m(k) is the response of the m-th Mel filter.
(5) A discrete cosine transform is applied to remove the interdependence between the outputs of the different Mel filters:
C(n) = Σ_{m=1}^{23} E(m) · cos( πn(m − 0.5)/23 ).
This embodiment uses only the first 12 discrete cosine transform coefficients as features, i.e. n takes values 1 to 12.
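Steps (3) to (5) together form a standard MFCC front end. The sketch below follows the textbook construction (triangular filters spaced evenly on the mel scale, then a DCT over the log filter energies); the original formulas are images in the source, so the exact filter and transform definitions here are assumptions consistent with the text (23 filters, first 12 DCT coefficients).

```python
import numpy as np

SR, NFFT, NMEL, NCEP = 16000, 512, 23, 12

def mel(f):
    """Standard mel-scale mapping (an assumption; the original formula is lost)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank():
    """NMEL triangular filters, evenly spaced on the mel scale, over the
    NFFT//2 + 1 positive-frequency FFT bins."""
    pts = np.linspace(mel(0.0), mel(SR / 2.0), NMEL + 2)
    hz = 700.0 * (10.0 ** (pts / 2595.0) - 1.0)
    bins = np.floor((NFFT + 1) * hz / SR).astype(int)
    fb = np.zeros((NMEL, NFFT // 2 + 1))
    for m in range(1, NMEL + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fb[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[m - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mfcc(frame):
    """Windowed frame -> power spectrum -> log mel energies -> first 12 DCT
    coefficients, matching steps (3)-(5)."""
    spec = np.abs(np.fft.rfft(frame, NFFT)) ** 2
    logmel = np.log(mel_filterbank() @ spec + 1e-10)  # epsilon guards log(0)
    m = np.arange(1, NMEL + 1)
    return np.array([np.sum(logmel * np.cos(np.pi * n * (m - 0.5) / NMEL))
                     for n in range(1, NCEP + 1)])
```

Keeping only the first 12 coefficients discards fine spectral detail while retaining the envelope, which is what distinguishes sound source classes.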
(6) Feature fusion. Acoustic signals have significant temporal structure, i.e. adjacent time frames are strongly correlated. To improve the accuracy of particular sound source detection, this embodiment also takes into account the acoustic features of the preceding four time frames, fusing them into a new feature vector; the embodiment therefore uses a 60-dimensional feature in total.
This embodiment uses MFCC features to describe the difference between the special sound source of interest to the user and other sound sources, and adopts a feature-fusion strategy that expands the feature dimensionality and can improve the precision of model training.
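The fusion of step (6), the current 12-dimensional MFCC vector concatenated with the four preceding ones to give 60 dimensions, can be sketched as follows (names are illustrative; zero vectors stand in for frames before the recording starts):

```python
from collections import deque
import numpy as np

NCEP, CONTEXT = 12, 5  # current frame plus the four preceding frames -> 60 dims

class FeatureFuser:
    """Concatenates each new 12-dim MFCC vector with the previous four,
    producing the 60-dim fused vector used as DNN input."""

    def __init__(self):
        # History of the four most recent frames, oldest first.
        self.hist = deque([np.zeros(NCEP)] * (CONTEXT - 1), maxlen=CONTEXT - 1)

    def fuse(self, feat):
        fused = np.concatenate(list(self.hist) + [feat])
        self.hist.append(feat)  # oldest frame drops out automatically
        return fused
```

The deque with `maxlen=4` keeps the sliding context window constant-size without any manual shifting.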
Embodiment 3
Building the DNN training model
A deep neural network generally consists of an input layer, multiple nonlinear hidden layers, and an output layer, with the hidden layers stacked to form a deep structure, as shown in Fig. 4.
Adjacent layers of the DNN are fully connected. We define the connection weights as w, with w_l denoting the connection weights between layer l and layer l+1; a_l denotes the activation output of layer l, and z_l denotes the weighted sum input to layer l. The output of the DNN is a vector whose i-th component represents the probability that the input x belongs to the i-th class.
The training process includes data preparation and calibration, and DNN weight updating, as follows:
(1) Data preparation. Data preparation ensures the accuracy of the DNN model and includes the particular sound source of interest to the user as well as signals from common sound sources other than it.
(2) Data calibration. Feature extraction is carried out on each audio signal of the previous step, as described in the preceding sections. A new vector (X_C(n), y(n)) is built, where y(n) is the calibration value: its value is 1 if the frame belongs to the particular sound source to be detected, and −1 for other sound source signals.
(3) DNN weight update. The weight update is the key step of model training. This embodiment uses the conventional back-propagation algorithm to update the DNN weights, implemented with the conventional DNN toolkit Kaldi. The DNN is set to 4 hidden layers; the first layer has the same number of nodes as the feature dimensionality, namely 60, the remaining hidden layers have 512 nodes each, and the output layer has 1 node.
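On one reading of the layer sizes above (a 60-node input layer feeding four 512-node hidden layers and a single output node; the text is ambiguous about whether the 60-node layer is the input or the first hidden layer), the weight shapes can be sketched as follows. The extra leading row in each matrix holds the bias, matching the augmented activations a = [1, a] used in the forward pass. The patent itself trains with back-propagation in Kaldi; this is only an illustrative shape check.

```python
import numpy as np

LAYERS = [60, 512, 512, 512, 512, 1]  # input, four hidden layers, output

def init_weights(rng):
    """One (n_in + 1) x n_out matrix per layer transition; row 0 is the bias."""
    return [rng.standard_normal((n_in + 1, n_out)) * 0.01
            for n_in, n_out in zip(LAYERS[:-1], LAYERS[1:])]
```

The small 0.01 scale is a common initialisation choice, not something specified in the patent.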
Embodiment 4
Sound source detection judgement
Using the feature vectors extracted as in Embodiment 2 and the DNN model built in Embodiment 3, a probability is first determined for each time frame, the probability output is then smoothed, and finally the sound source detection judgement is made by comparison against a set threshold. The processing flow of this module is shown in Fig. 5.
The data framing, buffering, and feature extraction are identical to Embodiment 2. The DNN model is the one built in Embodiment 3, i.e. the connection weights w_l and activation outputs a_l of each layer are the model parameters. The DNN probability computation, probability smoothing, and sound source decision modules are described in detail below:
(1) DNN probability calculations
(a) The input layer, i.e. the activation output of the first DNN layer, is initialised with the 60-dimensional acoustic feature data: a_1 = X_C(n).
(b) The hidden-layer activations and inter-layer weighted sums are computed iteratively:
a_{l-1} = [1, a_{l-1}]
z_l = a_{l-1} × w_{l-1}
a_l = ReLU(z_l) = log(1 + exp(z_l))
where ReLU denotes the rectified activation function, here realised by its smooth (softplus) form. All hidden layers are computed in turn, layer by layer.
(c) The output-layer activation value is computed from the data of the last hidden layer and taken as the DNN output probability:
a_{L-1} = [1, a_{L-1}]
z_L = a_{L-1} × w_{L-1}
The output-layer activation is taken as the probability that the current time frame is judged to be the particular sound source, i.e.:
P(y(n) = 1 | X_C(n)) = a_L (7)
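The forward pass of steps (a) to (c) can be sketched as below. The hidden activation follows the text's log(1+exp(z)) (softplus) form; the output nonlinearity is not given in the source, so the sigmoid here is an assumption chosen so that the output lies in [0, 1].

```python
import numpy as np

def softplus(z):
    """Hidden-layer activation from the text: log(1 + exp(z))."""
    return np.log1p(np.exp(z))

def sigmoid(z):
    """Assumed output nonlinearity mapping z_L into [0, 1]."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """a_1 = x; then a_l = softplus([1, a_{l-1}] @ w_{l-1}) for each hidden
    layer; finally P = sigmoid([1, a_{L-1}] @ w_{L-1})."""
    a = np.asarray(x, dtype=float)
    for w in weights[:-1]:
        a = softplus(np.concatenate(([1.0], a)) @ w)  # bias via leading 1
    z_out = np.concatenate(([1.0], a)) @ weights[-1]
    return float(sigmoid(z_out[0]))

# Tiny demo: two hidden layers of width 3 with all-zero weights, so every
# pre-activation is 0 and the output is sigmoid(0).
demo_w = [np.zeros((4, 3)), np.zeros((4, 3)), np.zeros((4, 1))]
p = forward(np.zeros(3), demo_w)  # 0.5
```

Note how the leading 1 concatenated onto each activation implements the augmented vector a = [1, a] of the formulas, folding the bias into the weight matrix.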
(2) Probability smoothing. Before thresholding, the probability is smoothed using the probability output of the previous frame, to avoid erroneous judgements caused by single-frame probability errors. The smoothed probability is computed as:
p̄(n) = α · p̄(n−1) + (1 − α) · P(y(n) = 1 | X_C(n))
The smoothed probability is used as the final decision probability.
α is the smoothing factor; this embodiment sets its value in the range 0.75 to 0.85.
(3) Particular sound source detection decision. The decision compares the smoothed probability against a set threshold: if p̄(n) exceeds the threshold θ, the frame is judged to be a particular-sound-source start frame; otherwise no particular sound source is judged to have been detected in that time frame. The threshold θ usually lies between 0 and 1. If the threshold is too small, there are too many false alarms, i.e. many time frames that are not the particular sound source are mistakenly detected as it; if the threshold is too high, there are too many misses, i.e. many particular-sound-source time frames go undetected. This embodiment uses the compromise threshold θ = 0.5 to balance the false-alarm and miss rates.
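Steps (2) and (3) combine into a small stateful detector. The recursion below is the usual exponential smoothing consistent with the description (the original formula is an image in the source); α = 0.8 and θ = 0.5 follow the ranges given in the text.

```python
class SourceDetector:
    """Smooths per-frame DNN probabilities and thresholds the result."""

    def __init__(self, alpha=0.8, theta=0.5):
        self.alpha = alpha    # smoothing factor, 0.75-0.85 per the text
        self.theta = theta    # decision threshold
        self.p_smooth = 0.0   # smoothed probability state

    def step(self, p):
        """Consume one frame's raw probability p; return True when the
        smoothed probability exceeds the threshold."""
        self.p_smooth = self.alpha * self.p_smooth + (1.0 - self.alpha) * p
        return self.p_smooth > self.theta
```

Because the state carries 80% of its previous value, a single spuriously high frame cannot trigger a detection by itself; several consecutive high-probability frames are needed, which is exactly the false-alarm suppression the text describes.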
Embodiment 5
The present invention also proposes a particular sound source detection system based on a deep neural network, including:
a feature extraction module 10 for extracting the acoustic features of a real-time sound signal and generating an acoustic feature vector;
a DNN model building module 20 for training on preset sound signals using a deep neural network method and building the DNN model;
a detection module 30 for performing detection and judgement on the acoustic feature vector using the DNN model.
The feature extraction module 10 can be built using the method of Embodiment 2, the DNN model building module 20 using the method of Embodiment 3, and the detection module 30 using the method of Embodiment 4.
For child-companion robot applications, the system of this embodiment can detect the sound of a child crying in time and feed it back to the guardian; for intelligent security camera applications, the system can detect whether an alarm sound or the like is present, improving the intelligence level of the security system.
The disclosure can embody in equipment, system, method and/or computer program product.The computer program product Computer readable storage medium (or medium) is may include, thereon with computer-readable program instructions, for causing processor to enter The aspect of the row disclosure.
Computer readable storage medium can be the physical device that can keep being used for instruction executing device with store instruction.Meter Calculation machine readable storage medium may be, for example, but be not limited to electronic storage device, magnetic storage device, optical storage, electromagnetism and deposit Storage device, semiconductor storage, or foregoing every any appropriate combination.The more specific example of computer readable storage medium Non-exhaustive list include the following:Portable computer diskette, hard disk, random access memory (RAM), read-only storage (ROM), Erasable Programmable Read Only Memory EPROM (EPROM or flash memory), static RAM (SRAM), portable Formula compact disk read-only storage (CD-ROM), digital versatile disc (DVD), memory stick, floppy disc, such as above-noted have The device of the mechanical codings such as the bulge-structure in the card punch or groove of instruction, and foregoing every any appropriate combination. It is as used herein, computer readable storage medium is not construed to temporary signal in itself, for example radio wave or other The electromagnetic wave of Free propagation, the electromagnetic wave of waveguide or other transmission mediums is propagated across (for example, through the light arteries and veins of Connectorized fiber optic cabling Punching), or through the electric signal of wire transfer.
Computer-readable program instructions described herein can be downloaded to corresponding meter from computer readable storage medium Calculation/processing unit, or download to outer computer via networks such as such as internet, LAN, wide area network and/or wireless networks Or external memory.The network may include copper transmission yarn, optical transmission fibers, be wirelessly transferred, router, fire wall, Exchanger, gateway computer and/or Edge Server.Adapter or network interface in each calculating/processing unit connect The computer-readable program instructions for carrying out automatic network are received, and forwards the computer-readable program instructions, for storing mutually accrued In computer readable storage medium in calculation/processing unit.
The computer-readable program instructions of the operation for carrying out the disclosure can be assembler directive, instruction set architecture (ISA) instruction, machine instruction, machine-dependent instructions, microcode, firmware instructions, condition setup data, or programmed with one or more Any source code or object code that any combinations of language are write, the programming language is including taking target as the programming language being oriented to Speech, such as Smalltalk, C++ etc.;And conventional program programming language, such as " C " programming language or similar programming language. Computer-readable program instructions can completely on the computer of user, partly on the computer of user, as stand alone software Encapsulation, partly on the computer of user and partly on remote computer or completely on remote computer or server Perform.In the latter's scene, remote computer can be by the computer of any kind of network connection to user, including LAN (LAN) or wide area network (WAN), or can proceed to outer computer connection (for example, using ISP pass through because Special net).In some embodiments, including such as PLD, field programmable gate array (FPGA) or programmable The electronic circuit of logic array (PLA) can perform computer-readable by using the status information of computer-readable program instructions Programmed instruction with individualize electronic circuit, to perform the aspect of the disclosure.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
While the foregoing is directed to embodiments of the present disclosure, other and further embodiments of the disclosure may be devised without departing from its basic scope, and the scope of the disclosure is determined by the claims that follow.

Claims (10)

1. A particular sound source detection method based on a deep neural network, comprising the following steps:
extracting acoustic features of a real-time audio signal and generating an acoustic feature vector;
performing detection and decision on the acoustic feature vector using a DNN trained model, the DNN trained model being established by training preset sound signals using a deep neural network method.
2. The particular sound source detection method based on a deep neural network according to claim 1, wherein the step of extracting the acoustic features of the audio signal comprises:
acquiring the real-time audio signal and pre-processing the real-time audio signal;
applying a Fourier transform to the pre-processed real-time audio signal, then passing it through a Mel filter bank, and then applying a discrete cosine transform to obtain the acoustic features of the real-time audio signal;
taking the acoustic features of a plurality of adjacent frames preceding the real-time audio signal and merging them with the acoustic features of the current frame to generate the acoustic feature vector.
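A minimal sketch of the feature pipeline in claim 2 (Fourier transform, Mel filter bank, discrete cosine transform, then concatenation with preceding frames). The sample rate (16 kHz), number of Mel filters (26), number of cepstral coefficients (13), and 5-frame context window are illustrative assumptions; the claim specifies none of these values.

```python
import numpy as np

def mel_filterbank(num_filters, fft_size, sample_rate):
    """Triangular Mel filterbank matrix, shape (num_filters, fft_size//2 + 1)."""
    hz_to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    mel_to_hz = lambda mel: 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             num_filters + 2)
    bins = np.floor((fft_size + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, fft_size // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):            # rising slope
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling slope
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def frame_features(frame, fbank, num_ceps=13):
    """FFT -> Mel filtering -> log -> DCT, per claim 2."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    mel_energies = np.log(fbank @ power + 1e-10)
    # Type-II DCT written out directly to stay dependency-free
    n = len(mel_energies)
    k = np.arange(num_ceps)[:, None]
    i = np.arange(n)[None, :]
    dct = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return dct @ mel_energies

def context_vector(feature_frames, context=5):
    """Concatenate the current frame with the preceding `context` frames."""
    return np.concatenate(feature_frames[-(context + 1):])
```

With a 512-point FFT and a 5-frame context, each feature vector is 6 x 13 = 78-dimensional; this matches the claim's idea of merging adjacent frames, though the actual dimensions used in the patent are unstated.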
3. The particular sound source detection method based on a deep neural network according to claim 2, wherein the step of acquiring the real-time audio signal and pre-processing the real-time audio signal comprises:
acquiring the real-time audio signal, converting it into a digital signal at a preset quantization rate, and monitoring a pending buffer; when the buffer has been filled with a specified duration of data, writing the data into a history buffer and performing feature extraction;
combining the data with the specified duration of data from the previous buffer, and applying a windowing process to the two specified durations of data together.
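The double-buffering scheme of claim 3 can be sketched as follows, under assumptions the claim leaves open: the frame length in samples and the window type (a Hamming window is used here purely for illustration).

```python
import numpy as np

class AudioBuffer:
    """Double-buffered framing per claim 3: once the current buffer fills,
    it is combined with the previous buffer and windowed as one frame."""

    def __init__(self, frame_len=400):
        self.frame_len = frame_len                 # assumed buffer size in samples
        self.history = np.zeros(frame_len)         # previous buffer's contents
        self.window = np.hamming(2 * frame_len)    # assumed window type

    def push(self, samples):
        """Return the windowed double-length frame once a full buffer
        has arrived; return None while still waiting for data."""
        samples = np.asarray(samples, dtype=float)
        if len(samples) < self.frame_len:
            return None
        current = samples[:self.frame_len]
        frame = np.concatenate([self.history, current]) * self.window
        self.history = current   # current buffer becomes the next call's history
        return frame
```

Each emitted frame therefore spans two consecutive buffers, which gives 50% overlap between successive analysis frames, a common choice that is consistent with, but not mandated by, the claim.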
4. The particular sound source detection method based on a deep neural network according to claim 1, wherein the process of establishing the DNN trained model comprises:
presetting a plurality of audio data, the plurality of audio data including the target particular sound source;
performing feature extraction on each piece of audio data and establishing a sound calibration vector, the sound calibration vector comprising the extracted sound features and a calibration value;
updating the DNN weights using the DNN toolkit Kaldi.
5. The particular sound source detection method based on a deep neural network according to claim 1, wherein the step of performing detection and decision on the acoustic feature vector using the DNN trained model comprises:
calculating a DNN output probability;
smoothing the DNN output probability using the probability output of the previous frame;
comparing the smoothed probability with a set threshold to determine the particular sound source.
6. The particular sound source detection method based on a deep neural network according to claim 5, wherein the step of calculating the DNN output probability comprises:
initializing the input layer as
a^1 = X_C(n), where X_C(n) is the acoustic feature vector of the real-time audio signal and a^1 is the input-layer activation;
calculating the hidden-layer activation outputs and the weighted sums between adjacent layers, iterating with the following formulas:
a^(l-1) = [1, a^(l-1)]
z^l = a^(l-1) × w^(l-1)
a^l = ReLU(z^l) = log(1 + exp(z^l))
where a^(l-1) denotes the activation output of layer l-1,
a^l denotes the activation output of layer l,
z^l denotes the weighted-sum input to layer l,
w^(l-1) denotes the connection weights between layer l-1 and layer l,
w^l denotes the connection weights between layer l and layer l+1, and
ReLU denotes the piecewise-linear activation function (written above in its smooth softplus form); all hidden layers are computed in turn in this way; the output-layer activation value is then calculated and taken as the DNN output probability; the output layer is computed from the last hidden layer as follows:
a^(L-1) = [1, a^(L-1)]
z^L = a^(L-1) × w^(L-1)
a^L = softMax(z^L) = 1 / (1 + exp(-Σ_i z^L(i)))
where a^(L-1) denotes the activation output of layer L-1,
a^L denotes the activation output of layer L,
z^L denotes the weighted-sum input to layer L, and
w^(L-1) denotes the connection weights between layer L-1 and layer L;
the output-layer activation output is taken as the probability that the time frame is judged to be the particular sound source, i.e.:
P(y(n) = 1 | X_C(n)) = a^L
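The forward pass of claim 6 can be sketched as below; the folding of the bias into a leading 1 follows the claim's formulas, while the layer sizes are illustrative. Note that the claim writes the hidden activation ReLU(z) = log(1 + exp(z)), which is the softplus function, and its "softMax" output is a logistic function of the summed output units; the sketch follows the formulas as written rather than the conventional definitions.

```python
import numpy as np

def softplus(z):
    # Hidden activation exactly as written in claim 6: log(1 + exp(z))
    return np.log1p(np.exp(z))

def dnn_forward(x, weights):
    """Forward pass per claim 6. `weights` is a list of matrices; the bias
    of each layer is folded in by prepending 1 to the activation vector."""
    a = np.asarray(x, dtype=float)
    for w in weights[:-1]:
        a = np.concatenate(([1.0], a))   # a^(l-1) = [1, a^(l-1)]
        a = softplus(a @ w)              # a^l = log(1 + exp(a^(l-1) w^(l-1)))
    a = np.concatenate(([1.0], a))
    z_out = a @ weights[-1]              # z^L = a^(L-1) w^(L-1)
    # Output per the claim's formula: a logistic function of the summed
    # output units, giving the scalar probability P(y(n) = 1 | X_C(n)).
    return 1.0 / (1.0 + np.exp(-np.sum(z_out)))
```

Because each weight matrix absorbs the bias, a layer mapping d inputs to h units has shape (d + 1, h); the two-unit output layer here is only an example.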
7. The particular sound source detection method based on a deep neural network according to claim 6, wherein in the step of smoothing the DNN output probability using the probability output of the previous frame, the smoothed probability is calculated as follows:
P^(y(n) = 1 | X_C(n)) = α · P(y(n-1) = 1 | X_C(n-1)) + (1 - α) · P(y(n) = 1 | X_C(n))
where α is the smoothing factor.
8. The particular sound source detection method based on a deep neural network according to claim 7, wherein:
the value of α ranges from 0.75 to 0.85.
9. The particular sound source detection method based on a deep neural network according to claim 5, wherein the set threshold is 0.5.
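Claims 7 to 9 together define the post-processing: first-order smoothing with the previous frame's raw probability, then thresholding. A minimal sketch, with α = 0.8 (inside the 0.75-0.85 range of claim 8) and threshold 0.5 (claim 9); the treatment of the first frame, which has no predecessor, is an assumption not stated in the claims.

```python
def smooth_probs(probs, alpha=0.8):
    """P_hat(n) = alpha * P(n-1) + (1 - alpha) * P(n), per claim 7."""
    smoothed = []
    prev = probs[0]  # assumed initialization for the first frame
    for p in probs:
        smoothed.append(alpha * prev + (1.0 - alpha) * p)
        prev = p     # the claim smooths with the previous *raw* probability
    return smoothed

def detect(probs, alpha=0.8, threshold=0.5):
    """Flag a frame as the particular sound source when the smoothed
    probability exceeds the set threshold (0.5 per claim 9)."""
    return [p > threshold for p in smooth_probs(probs, alpha)]
```

For example, a single high-probability frame followed by silence decays through the smoother rather than switching the decision off instantly, which is the point of the smoothing step.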
10. A particular sound source detection system based on a deep neural network, comprising:
a feature extraction module, configured to extract acoustic features of a real-time audio signal and generate an acoustic feature vector;
a DNN trained model establishing module, configured to train preset sound signals using a deep neural network method and establish a DNN trained model; and
a detection module, configured to perform detection and decision on the acoustic feature vector using the DNN trained model.
CN201611099733.0A 2016-12-02 2016-12-02 Particular sound source detection method and particular sound source detection system based on deep neural network Pending CN106710599A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611099733.0A CN106710599A (en) 2016-12-02 2016-12-02 Particular sound source detection method and particular sound source detection system based on deep neural network


Publications (1)

Publication Number Publication Date
CN106710599A true CN106710599A (en) 2017-05-24

Family

ID=58934577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611099733.0A Pending CN106710599A (en) 2016-12-02 2016-12-02 Particular sound source detection method and particular sound source detection system based on deep neural network

Country Status (1)

Country Link
CN (1) CN106710599A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN105513597A (en) * 2015-12-30 2016-04-20 百度在线网络技术(北京)有限公司 Voiceprint authentication processing method and apparatus
CN105632486A (en) * 2015-12-23 2016-06-01 北京奇虎科技有限公司 Voice wake-up method and device of intelligent hardware
CN105845128A (en) * 2016-04-06 2016-08-10 中国科学技术大学 Voice identification efficiency optimization method based on dynamic pruning beam prediction
CN105976812A (en) * 2016-04-28 2016-09-28 腾讯科技(深圳)有限公司 Voice identification method and equipment thereof


Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109102812B (en) * 2017-06-21 2021-08-31 北京搜狗科技发展有限公司 Voiceprint recognition method and system and electronic equipment
CN109102812A (en) * 2017-06-21 2018-12-28 北京搜狗科技发展有限公司 A kind of method for recognizing sound-groove, system and electronic equipment
CN107393526B (en) * 2017-07-19 2024-01-02 腾讯科技(深圳)有限公司 Voice silence detection method, device, computer equipment and storage medium
CN107393526A (en) * 2017-07-19 2017-11-24 腾讯科技(深圳)有限公司 Speech silence detection method, device, computer equipment and storage medium
CN107527620A (en) * 2017-07-25 2017-12-29 平安科技(深圳)有限公司 Electronic installation, the method for authentication and computer-readable recording medium
CN107527620B (en) * 2017-07-25 2019-03-26 平安科技(深圳)有限公司 Electronic device, the method for authentication and computer readable storage medium
CN109473119A (en) * 2017-09-07 2019-03-15 中国科学院声学研究所 A kind of acoustic target event-monitoring method
CN108053841A (en) * 2017-10-23 2018-05-18 平安科技(深圳)有限公司 The method and application server of disease forecasting are carried out using voice
WO2019080502A1 (en) * 2017-10-23 2019-05-02 平安科技(深圳)有限公司 Voice-based disease prediction method, application server, and computer readable storage medium
CN109074822A (en) * 2017-10-24 2018-12-21 深圳和而泰智能控制股份有限公司 Specific sound recognition methods, equipment and storage medium
CN111566732A (en) * 2018-01-15 2020-08-21 三菱电机株式会社 Sound signal separating device and sound signal separating method
CN111566732B (en) * 2018-01-15 2023-04-04 三菱电机株式会社 Sound signal separating device and sound signal separating method
CN108513227B (en) * 2018-04-09 2021-02-19 华南理工大学 Modern electronic organ manufacturing method based on loudspeaker array design
CN108615536A (en) * 2018-04-09 2018-10-02 华南理工大学 Time-frequency combination feature musical instrument assessment of acoustics system and method based on microphone array
CN108513227A (en) * 2018-04-09 2018-09-07 华南理工大学 A kind of hyundai electronics qin production method based on loudspeaker array design
CN109102143A (en) * 2018-06-19 2018-12-28 硕橙(厦门)科技有限公司 A kind of yield monitoring method and device
CN108831487B (en) * 2018-06-28 2020-08-18 深圳大学 Voiceprint recognition method, electronic device and computer-readable storage medium
CN108831487A (en) * 2018-06-28 2018-11-16 深圳大学 Method for recognizing sound-groove, electronic device and computer readable storage medium
CN108922548A (en) * 2018-08-20 2018-11-30 深圳园林股份有限公司 A kind of bird based on deep learning, frog intelligent monitoring method
CN109357749A (en) * 2018-09-04 2019-02-19 南京理工大学 A kind of power equipment audio signal analysis method based on DNN algorithm
CN109298642A (en) * 2018-09-20 2019-02-01 三星电子(中国)研发中心 The method and device being monitored using intelligent sound box
CN109298642B (en) * 2018-09-20 2021-08-27 三星电子(中国)研发中心 Method and device for monitoring by adopting intelligent sound box
CN109065075A (en) * 2018-09-26 2018-12-21 广州势必可赢网络科技有限公司 A kind of method of speech processing, device, system and computer readable storage medium
CN109885162A (en) * 2019-01-31 2019-06-14 维沃移动通信有限公司 Method for oscillating and mobile terminal
CN111833901A (en) * 2019-04-23 2020-10-27 北京京东尚科信息技术有限公司 Audio processing method, audio processing apparatus, audio processing system, and medium
CN111833901B (en) * 2019-04-23 2024-04-05 北京京东尚科信息技术有限公司 Audio processing method, audio processing device, system and medium
CN110580915A (en) * 2019-09-17 2019-12-17 中北大学 Sound source target identification system based on wearable equipment
CN110444225A (en) * 2019-09-17 2019-11-12 中北大学 Acoustic target recognition methods based on Fusion Features network
CN112382302A (en) * 2020-12-02 2021-02-19 漳州立达信光电子科技有限公司 Baby cry identification method and terminal equipment
CN112700782A (en) * 2020-12-25 2021-04-23 维沃移动通信有限公司 Voice processing method and electronic equipment
CN113762085A (en) * 2021-08-11 2021-12-07 江苏省人民医院(南京医科大学第一附属医院) Artificial intelligence-based infant incubator system and method
CN113762085B (en) * 2021-08-11 2022-04-19 江苏省人民医院(南京医科大学第一附属医院) Artificial intelligence-based infant incubator system and method

Similar Documents

Publication Publication Date Title
CN106710599A (en) Particular sound source detection method and particular sound source detection system based on deep neural network
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
Lakomkin et al. On the robustness of speech emotion recognition for human-robot interaction with deep neural networks
CN110310647B (en) Voice identity feature extractor, classifier training method and related equipment
Tong et al. A comparative study of robustness of deep learning approaches for VAD
US20120316879A1 (en) System for detecting speech interval and recognizing continous speech in a noisy environment through real-time recognition of call commands
CN108364662B (en) Voice emotion recognition method and system based on paired identification tasks
CN109346087B (en) Noise robust speaker verification method and apparatus against bottleneck characteristics of a network
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
CN110459207A (en) Wake up the segmentation of voice key phrase
KR20210070213A (en) Voice user interface
CN110852215A (en) Multi-mode emotion recognition method and system and storage medium
CN108320732A (en) The method and apparatus for generating target speaker's speech recognition computation model
US11217265B2 (en) Condition-invariant feature extraction network
Salekin et al. Distant emotion recognition
CN114999525A (en) Light-weight environment voice recognition method based on neural network
CN111998936B (en) Equipment abnormal sound detection method and system based on transfer learning
CN113077812A (en) Speech signal generation model training method, echo cancellation method, device and equipment
Zhang et al. Acoustic traffic event detection in long tunnels using fast binary spectral features
Nicolson et al. Sum-product networks for robust automatic speaker identification
CN116386664A (en) Voice counterfeiting detection method, device, system and storage medium
Segarceanu et al. Environmental acoustics modelling techniques for forest monitoring
CN112037772B (en) Response obligation detection method, system and device based on multiple modes
CN113744734A (en) Voice wake-up method and device, electronic equipment and storage medium
CN116259312A (en) Method for automatically editing task by aiming at voice and neural network model training method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170524

RJ01 Rejection of invention patent application after publication