CN106710599A - Particular sound source detection method and particular sound source detection system based on deep neural network - Google Patents

Particular sound source detection method and particular sound source detection system based on deep neural network

Info

Publication number
CN106710599A
CN106710599A (application CN201611099733.0A)
Authority
CN
China
Prior art keywords
sound source
neural network
deep neural
dnn
particular sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611099733.0A
Other languages
Chinese (zh)
Inventor
蔡钢林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Sahara Data Technology Co Ltd
Original Assignee
Shenzhen Sahara Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Sahara Data Technology Co Ltd filed Critical Shenzhen Sahara Data Technology Co Ltd
Priority to CN201611099733.0A priority Critical patent/CN106710599A/en
Publication of CN106710599A publication Critical patent/CN106710599A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
        • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L17/00 — Speaker identification or verification techniques
                    • G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
                    • G10L17/04 — Training, enrolment or model building
                    • G10L17/18 — Artificial neural networks; Connectionist approaches


Abstract

The invention relates to a particular sound source detection method and system based on a deep neural network (DNN). The method comprises the following steps: extracting acoustic features from a real-time sound signal to generate an acoustic feature vector; training a DNN model on preset sound signals using a deep neural network method; and using the trained DNN model to detect and judge the acoustic feature vector, so that different sound source signals are modelled with deep neural network techniques. Because a deep neural network has stronger modelling capability, and its modelling accuracy rises further when sufficient data are available, sound sources with very similar feature spaces can still be distinguished. By fusing the features of adjacent time frames, the method achieves high single-frame detection accuracy and strong real-time performance, with a decision delay of no more than 0.5 seconds, making it highly practical.

Description

A particular sound source detection method and system based on a deep neural network
Technical field
The present invention relates to the field of speech technology, and in particular to a particular sound source detection method and system based on a deep neural network.
Background technology
With the wide availability of intelligent hardware, related technologies such as intelligent robots and smart toys are maturing. Sound-signal processing technologies such as speech recognition and particular sound source detection are urgently needed technical components of intelligent terminals. Particular sound source detection refers to collecting the sound signal of the application environment in real time through the microphone of an intelligent terminal, detecting in time whether a sound source that the user is interested in or concerned about has appeared, and promptly feeding the result back to the user.
The acoustic signals emitted by different sound sources, for example the sounds made by people versus machines, or by people versus pets, differ in their spectral structure; even the speech signals of different people differ. Typical sound source detection and classification applications include voiceprint recognition, multimedia music classification, bird and animal identification based on acoustic signal processing, and baby-cry detection. Classification based on machine learning is the most mature technique for such applications: given a certain amount of training data, the acoustic features of different classes can be modelled in order to process the different sound signals. In addition, some more specialised applications can use simpler implementations. For example, the voice activity detection (Voice Activity Detection, VAD) technology used at the front end of speech recognition only needs to distinguish speech from silence; it can rely on energy-based detection, aided by features such as the zero-crossing rate, to achieve unsupervised classification.
Bayesian classification based on Gaussian mixture models (Gaussian Mixture Model, GMM) is a common technique for particular sound source detection and classification. The method divides the various sound signals that may be picked up in the real-time application environment into several classes and models each class with a GMM. For example, the relatively mature voiceprint recognition technique models the speech signals of different people and performs classification and voiceprint verification using the Bayesian decision criterion. Other mature machine learning techniques, such as support vector machines (Support Vector Machine, SVM), can also be used for the classification and detection of sound signals.
The voice activity detection technology commonly used in robust speech recognition can also be regarded as a kind of particular sound source detection: it classifies frames into speech or silence by detecting features such as energy and the zero-crossing rate in real time. Its basic idea is to assume that the environment contains only two scenarios, speech or the absence of any acoustic signal; a threshold can then be set, and a frame is judged to be speech when the energy of the acoustic signal collected by the microphone exceeds that threshold.
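The energy-threshold scheme described above can be sketched in a few lines. This is an illustrative sketch, not code from the patent; the frame length, threshold value and function names are assumptions.

```python
import math

def frame_energy(samples):
    """Mean squared amplitude of one frame of audio samples."""
    return sum(s * s for s in samples) / len(samples)

def vad_decide(samples, threshold=0.01):
    """Energy-based voice activity decision: True ('speech') when the
    frame energy exceeds the threshold, False ('silence') otherwise."""
    return frame_energy(samples) > threshold

# Synthetic frames: a loud tone-like frame and a near-silent one.
loud = [0.5 * math.sin(0.1 * n) for n in range(256)]
quiet = [0.001 * math.sin(0.1 * n) for n in range(256)]
```

In practice such a detector would also consult the zero-crossing rate, as the text notes, since energy alone cannot separate speech from other loud sources.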
Voice activity detection can only distinguish the two scenarios of speech versus silence, whereas practical applications are more complicated than this assumption: in a real environment, the sounds of various machines (televisions, loudspeakers, and so on) may be present simultaneously. The discriminative power of this technique is too poor for it to generalise to the more general sound source detection problem.
Existing machine learning methods have two main defects for the task of sound source detection. First, the models capture sound source classes with poor precision, so classification accuracy is low; this defect is especially apparent in applications where the feature spaces of the sound sources are close to one another. For example, conventional Bayesian methods cannot effectively distinguish a speech signal produced by a person from one emitted by a loudspeaker. Second, real-time performance is poor. In many applications a correct judgement must be made immediately after a specific acoustic signal appears, for example a child crying or an alarm. Existing statistical learning methods, however, need to analyse a very long segment of the sound signal before making a judgement, causing a very long decision delay.
The content of the invention
The main object of the present invention is to provide a particular sound source detection method and system based on a deep neural network, solving the problems of low detection precision and poor real-time performance.
The present invention proposes a particular sound source detection method based on a deep neural network, comprising the following steps:
extracting the acoustic features of a real-time sound signal to generate an acoustic feature vector;
performing detection and judgement on the acoustic feature vector using a DNN model, the DNN model having been trained on preset sound signals using a deep neural network method.
Preferably, the step of extracting the acoustic features of the sound signal includes:
obtaining the real-time sound signal and pre-processing it;
applying a Fourier transform to the pre-processed real-time sound signal, then Mel filtering, and then a discrete cosine transform, to obtain the acoustic features of the real-time sound signal;
fusing these acoustic features with the acoustic features of the several immediately preceding sound frames to generate the acoustic feature vector.
Preferably, the step of obtaining and pre-processing the real-time sound signal includes:
obtaining the real-time sound signal, converting it into a digital signal at a preset quantisation rate, and monitoring the pending buffer; when the buffer has been filled with a specified duration of data, writing the data to the history buffer and carrying out feature extraction;
combining the data with the specified duration of data from the previous buffer and applying a window function to the two combined segments.
Preferably, the process of building the DNN model includes:
preparing multiple preset audio data items, including the target particular sound source;
performing feature extraction on each audio item and building a calibrated sound vector containing the extracted sound features and a calibration value;
updating the DNN weights using the DNN toolkit Kaldi.
Preferably, the step of performing detection and judgement on the acoustic feature vector using the DNN model includes:
computing the DNN output probability;
smoothing the DNN output probability using the probability output of the previous frame;
comparing the smoothed probability with a set threshold to judge whether the particular sound source is present.
Preferably, the step of computing the DNN output probability includes:
initialising the input layer,
a_1 = X_C(n), where X_C(n) is the acoustic feature vector of the real-time sound signal and a_1 is the input-layer activation;
computing the hidden-layer activations and the inter-layer weighted sums, iterating as follows:
a_{l-1} = [1, a_{l-1}]
z_l = a_{l-1} × w_{l-1}
a_l = ReLU(z_l) = log(1 + exp(z_l))
where a_{l-1} denotes the activation output of layer l-1,
a_l denotes the activation output of layer l,
z_l denotes the weighted sum input to layer l,
w_{l-1} denotes the connection weights between layers l-1 and l,
w_l denotes the connection weights between layers l and l+1,
and ReLU denotes the rectified activation function, here realised by its smooth (softplus) form log(1 + exp(z)). All hidden layers are computed in turn, layer by layer. The output-layer activation value is then computed from the data of the last hidden layer and taken as the DNN output probability, as follows:
a_{L-1} = [1, a_{L-1}]
z_L = a_{L-1} × w_{L-1}
where a_{L-1} denotes the activation output of layer L-1,
a_L denotes the activation output of layer L (the output layer, obtained by mapping z_L into [0, 1]),
z_L denotes the weighted sum input to layer L,
and w_{L-1} denotes the connection weights between layers L-1 and L.
The output-layer activation is taken as the probability that the current time frame is judged to be the particular sound source, i.e.:
P(y(n) = 1 | X_C(n)) = a_L
Preferably, in the step of smoothing the DNN output probability with the probability output of the previous frame, the smoothed probability is computed as:
p̄(n) = α · p̄(n−1) + (1 − α) · P(y(n) = 1 | X_C(n))
where α is the smoothing factor.
Preferably, α takes a value in the range 0.75 to 0.85.
Preferably, the set threshold is 0.5.
The present invention also proposes a particular sound source detection system based on a deep neural network, including:
a feature extraction module for extracting the acoustic features of a real-time sound signal and generating an acoustic feature vector;
a DNN model building module for training on preset sound signals using a deep neural network method and building the DNN model;
a detection module for performing detection and judgement on the acoustic feature vector using the DNN model.
In the particular sound source detection method and system based on a deep neural network of the present invention, the method comprises: extracting the acoustic features of a real-time sound signal to generate an acoustic feature vector; training on preset sound signals using a deep neural network method to build a DNN model; and performing detection and judgement on the acoustic feature vector using the DNN model, so that different sound source signals are modelled using deep neural network techniques. Because a deep neural network has stronger modelling capability, and its modelling accuracy rises further when sufficient data are available, it can handle sound source detection problems in which the feature spaces are close to one another. The invention uses an adjacent-time-frame feature fusion technique, achieving high single-frame detection accuracy and strong real-time performance, with a decision delay of no more than 0.5 seconds, making it highly practical.
Brief description of the drawings
Fig. 1 is a flowchart of the first embodiment of the particular sound source detection method based on a deep neural network of the present invention;
Fig. 2 is a flowchart of acoustic feature extraction in the second embodiment of the method;
Fig. 3 is a schematic diagram of the Hanning window;
Fig. 4 is a structural diagram of the deep neural network;
Fig. 5 is a data-flow diagram of particular sound source detection in the fourth embodiment of the method.
The realisation of the objects, functional characteristics and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Specific embodiment
It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
Embodiment 1
As shown in Fig. 1, the present invention proposes a particular sound source detection method based on a deep neural network, comprising the following steps:
S10, extracting the acoustic features of a real-time sound signal to generate an acoustic feature vector;
S20, performing detection and judgement on the acoustic feature vector using a DNN model, the DNN model having been trained on preset sound signals using a deep neural network method.
Step S10 is mainly used to carry out feature extraction on the sound signal.
The feature extraction step abstracts the spectral structure of the acoustic signal into a group of feature vectors that reflect the sound source class. Different classification tasks may use different acoustic features. Usable features include Mel-frequency cepstral coefficients (Mel-scale Frequency Cepstral Coefficients, MFCC) and linear predictive coding (linear predictive coding, LPC). In fine-grained acoustic classification tasks, features can also be selected using techniques such as principal component analysis.
Step S20 covers two aspects: building the DNN model, and using the model to detect on the acoustic feature vector. The model training step statistically summarises the acoustic features of the different classes of acoustic signals and builds a mathematical model for each class. The model requires a certain amount of prior data; in general, the more abundant the data, the more accurate the model, and the better the classification and detection results.
A DNN (Deep Neural Network) model is a machine learning model that has risen in recent years and has achieved breakthrough progress in fields such as speech and images. Deep neural networks originate from neural networks; under the combined conditions of big data, high-performance computing, and advances in algorithmic theory, they have made great breakthroughs on that foundation. A deep neural network uses multi-level nonlinear processing to automatically mine the abstract feature structure in data, providing classes for the final supervised learning or predicted features. Deep neural networks excel at handling complex high-dimensional structural signals.
Detection and judgement are then carried out on the acoustic feature vector using the DNN model.
A voice wake-up application for robust speech recognition can use the method proposed by the present invention. When speech recognition is applied in a smart home environment, the user does not issue voice commands most of the time; if the intelligent terminal invoked the server-side recognition engine whenever it collected any sound signal, this would cause a large number of misrecognitions and false triggers, and would also occupy bandwidth for long periods. The present invention can serve as a voice wake-up function, i.e. the back-end recognition engine is invoked only when the system detects user speech. Using 10 hours of speech signals produced by users together with various environmental noises to train the DNN model, and applying the sound source judgement method of this invention, the frame-level false-alarm rate is below 12% and the frame-level miss rate is below 1%, effectively supporting the application of robust speech recognition systems in real environments.
Pet-call detection for intelligent robots. Using the method of this invention, a household robot can detect in real time whether the cry of a pet dog, cat, or other pet is present in the home environment and respond, improving the intelligence and entertainment value of robot interaction. Using 20 hours of cries from 10 classes of pet dogs to train the DNN model, the frame-level false-alarm rate of sound source detection is below 15% and the frame-level miss rate is below 5%.
Embodiment 2
Feature extraction
This embodiment automatically monitors the data buffer and, through framing, windowing and related processing, transforms the 1-dimensional time-domain signal into the feature space and extracts the acoustic features. The data processing flow is shown in Fig. 2.
The detailed steps are as follows:
(1) The real-time recording byte stream is obtained and converted into a digital signal at a 16-bit quantisation rate. The pending buffer is monitored; when the buffer has been filled with 16 milliseconds of data (256 data samples at a 16 kHz sampling rate), the data are written to the history buffer and feature extraction is carried out.
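The buffering of step (1) can be sketched as follows. This is a sketch under stated assumptions (Python lists stand in for the recording byte stream; the class and method names are invented for illustration), not the patent's implementation.

```python
BLOCK = 256  # 16 ms at a 16 kHz sampling rate, as in the text

class FrameBuffer:
    """Collects incoming samples into 256-sample blocks and pairs each new
    block with the previous one, yielding the 512-sample (32 ms) analysis
    window described in the next step."""

    def __init__(self):
        self.pending = []             # samples not yet forming a full block
        self.history = [0.0] * BLOCK  # the previous 16 ms block

    def push(self, samples):
        """Append new samples; return any completed 512-sample windows."""
        self.pending.extend(samples)
        windows = []
        while len(self.pending) >= BLOCK:
            block = self.pending[:BLOCK]
            self.pending = self.pending[BLOCK:]
            windows.append(self.history + block)  # previous block + current
            self.history = block
        return windows
```

The first window is padded with silence because no history exists yet; afterwards every 16 ms block is analysed twice, once as the tail of one window and once as the head of the next.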
(2) The 16 milliseconds of data from the previous buffer are combined with the current block, giving 32 milliseconds in total (512 data samples at a 16 kHz sampling rate), and windowing is applied. This embodiment uses a Hanning window function of length 512. Assuming the original time-domain signal is x(t), the windowed data are x_w(t) = x(t) · w(t), where w(t) = 0.5 − 0.5 · cos(2πt/511), t = 0, …, 511. The Hanning window function is shown in Fig. 3.
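The windowing of step (2) can be illustrated directly. The original window formula is an image in the source, so the standard Hanning definition used below is a reconstruction:

```python
import math

N = 512  # window length used in this embodiment

def hanning(n, length=N):
    """Standard Hanning window: 0.5 - 0.5*cos(2*pi*n/(length-1))."""
    return 0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))

def apply_window(frame):
    """Multiply a time-domain frame by the Hanning window, sample by sample."""
    return [x * hanning(n, len(frame)) for n, x in enumerate(frame)]
```

The tapered edges of the window reduce spectral leakage in the FFT of the next step.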
(3) A fast Fourier transform (Fast Fourier Transformation, FFT) is applied to convert the time-domain signal to the spectral domain, and the result is squared:
P(k) = |X(k)|², where X(k) is the FFT of the windowed frame x_w(t).
(4) The spectrum is passed through 23 Mel filters and transformed to the log domain. The 23 Mel filters are triangular filters spaced evenly on the Mel scale, Mel(f) = 2595 · log10(1 + f/700), where f is the frequency of the corresponding band and m takes values 1 to 23. After Mel filtering and taking the logarithm, the 23-dimensional data are:
E(m) = log( Σ_k P(k) · H_m(k) ), m = 1, …, 23, where H_m(k) is the response of the m-th Mel filter.
(5) A discrete cosine transform is applied to remove the interdependence between the outputs of the different Mel filters:
C(n) = Σ_{m=1}^{23} E(m) · cos( πn(m − 0.5)/23 ).
This embodiment uses only the first 12 discrete cosine transform coefficients as features, i.e. n takes values 1 to 12.
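Steps (3) to (5) together form a standard MFCC front end. The sketch below follows the textbook construction (triangular filters spaced evenly on the mel scale, then a DCT over the log filter energies); the original formulas are images in the source, so the exact filter and transform definitions here are assumptions consistent with the text (23 filters, first 12 DCT coefficients).

```python
import numpy as np

SR, NFFT, NMEL, NCEP = 16000, 512, 23, 12

def mel(f):
    """Standard mel-scale mapping (an assumption; the original formula is lost)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank():
    """NMEL triangular filters, evenly spaced on the mel scale, over the
    NFFT//2 + 1 positive-frequency FFT bins."""
    pts = np.linspace(mel(0.0), mel(SR / 2.0), NMEL + 2)
    hz = 700.0 * (10.0 ** (pts / 2595.0) - 1.0)
    bins = np.floor((NFFT + 1) * hz / SR).astype(int)
    fb = np.zeros((NMEL, NFFT // 2 + 1))
    for m in range(1, NMEL + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fb[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[m - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mfcc(frame):
    """Windowed frame -> power spectrum -> log mel energies -> first 12 DCT
    coefficients, matching steps (3)-(5)."""
    spec = np.abs(np.fft.rfft(frame, NFFT)) ** 2
    logmel = np.log(mel_filterbank() @ spec + 1e-10)  # epsilon guards log(0)
    m = np.arange(1, NMEL + 1)
    return np.array([np.sum(logmel * np.cos(np.pi * n * (m - 0.5) / NMEL))
                     for n in range(1, NCEP + 1)])
```

Keeping only the first 12 coefficients discards fine spectral detail while retaining the envelope, which is what distinguishes sound source classes.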
(6) Feature fusion. Acoustic signals have significant temporal structure, i.e. adjacent time frames are strongly correlated. To improve the accuracy of particular sound source detection, this embodiment also takes into account the acoustic features of the preceding four time frames, fusing them into a new feature vector; the embodiment therefore uses a 60-dimensional feature in total.
This embodiment uses MFCC features to describe the difference between the special sound source of interest to the user and other sound sources, and adopts a feature-fusion strategy that expands the feature dimensionality and can improve the precision of model training.
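The fusion of step (6), the current 12-dimensional MFCC vector concatenated with the four preceding ones to give 60 dimensions, can be sketched as follows (names are illustrative; zero vectors stand in for frames before the recording starts):

```python
from collections import deque
import numpy as np

NCEP, CONTEXT = 12, 5  # current frame plus the four preceding frames -> 60 dims

class FeatureFuser:
    """Concatenates each new 12-dim MFCC vector with the previous four,
    producing the 60-dim fused vector used as DNN input."""

    def __init__(self):
        # History of the four most recent frames, oldest first.
        self.hist = deque([np.zeros(NCEP)] * (CONTEXT - 1), maxlen=CONTEXT - 1)

    def fuse(self, feat):
        fused = np.concatenate(list(self.hist) + [feat])
        self.hist.append(feat)  # oldest frame drops out automatically
        return fused
```

The deque with `maxlen=4` keeps the sliding context window constant-size without any manual shifting.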
Embodiment 3
Building the DNN training model
A deep neural network generally consists of an input layer, multiple nonlinear hidden layers, and an output layer, with the hidden layers stacked to form a deep structure, as shown in Fig. 4.
Adjacent layers of the DNN are fully connected. We define the connection weights as w, with w_l denoting the connection weights between layer l and layer l+1; a_l denotes the activation output of layer l, and z_l denotes the weighted sum input to layer l. The output of the DNN is a vector whose i-th component represents the probability that the input x belongs to the i-th class.
The training process includes data preparation and calibration, and DNN weight updating, as follows:
(1) Data preparation. Data preparation ensures the accuracy of the DNN model and includes the particular sound source of interest to the user as well as signals from common sound sources other than it.
(2) Data calibration. Feature extraction is carried out on each audio signal of the previous step, as described in the preceding sections. A new vector (X_C(n), y(n)) is built, where y(n) is the calibration value: its value is 1 if the frame belongs to the particular sound source to be detected, and −1 for other sound source signals.
(3) DNN weight update. The weight update is the key step of model training. This embodiment uses the conventional back-propagation algorithm to update the DNN weights, implemented with the conventional DNN toolkit Kaldi. The DNN is set to 4 hidden layers; the first layer has the same number of nodes as the feature dimensionality, namely 60, the remaining hidden layers have 512 nodes each, and the output layer has 1 node.
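On one reading of the layer sizes above (a 60-node input layer feeding four 512-node hidden layers and a single output node; the text is ambiguous about whether the 60-node layer is the input or the first hidden layer), the weight shapes can be sketched as follows. The extra leading row in each matrix holds the bias, matching the augmented activations a = [1, a] used in the forward pass. The patent itself trains with back-propagation in Kaldi; this is only an illustrative shape check.

```python
import numpy as np

LAYERS = [60, 512, 512, 512, 512, 1]  # input, four hidden layers, output

def init_weights(rng):
    """One (n_in + 1) x n_out matrix per layer transition; row 0 is the bias."""
    return [rng.standard_normal((n_in + 1, n_out)) * 0.01
            for n_in, n_out in zip(LAYERS[:-1], LAYERS[1:])]
```

The small 0.01 scale is a common initialisation choice, not something specified in the patent.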
Embodiment 4
Sound source detection judgement
Using the feature vectors extracted as in Embodiment 2 and the DNN model built in Embodiment 3, a probability is first determined for each time frame, the probability output is then smoothed, and finally the sound source detection judgement is made by comparison against a set threshold. The processing flow of this module is shown in Fig. 5.
The data framing, buffering, and feature extraction are identical to Embodiment 2. The DNN model is the one built in Embodiment 3, i.e. the connection weights w_l and activation outputs a_l of each layer are the model parameters. The DNN probability computation, probability smoothing, and sound source decision modules are described in detail below:
(1) DNN probability calculations
(a) The input layer, i.e. the activation output of the first DNN layer, is initialised with the 60-dimensional acoustic feature data: a_1 = X_C(n).
(b) The hidden-layer activations and inter-layer weighted sums are computed iteratively:
a_{l-1} = [1, a_{l-1}]
z_l = a_{l-1} × w_{l-1}
a_l = ReLU(z_l) = log(1 + exp(z_l))
where ReLU denotes the rectified activation function, here realised by its smooth (softplus) form. All hidden layers are computed in turn, layer by layer.
(c) The output-layer activation value is computed from the data of the last hidden layer and taken as the DNN output probability:
a_{L-1} = [1, a_{L-1}]
z_L = a_{L-1} × w_{L-1}
The output-layer activation is taken as the probability that the current time frame is judged to be the particular sound source, i.e.:
P(y(n) = 1 | X_C(n)) = a_L (7)
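The forward pass of steps (a) to (c) can be sketched as below. The hidden activation follows the text's log(1+exp(z)) (softplus) form; the output nonlinearity is not given in the source, so the sigmoid here is an assumption chosen so that the output lies in [0, 1].

```python
import numpy as np

def softplus(z):
    """Hidden-layer activation from the text: log(1 + exp(z))."""
    return np.log1p(np.exp(z))

def sigmoid(z):
    """Assumed output nonlinearity mapping z_L into [0, 1]."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """a_1 = x; then a_l = softplus([1, a_{l-1}] @ w_{l-1}) for each hidden
    layer; finally P = sigmoid([1, a_{L-1}] @ w_{L-1})."""
    a = np.asarray(x, dtype=float)
    for w in weights[:-1]:
        a = softplus(np.concatenate(([1.0], a)) @ w)  # bias via leading 1
    z_out = np.concatenate(([1.0], a)) @ weights[-1]
    return float(sigmoid(z_out[0]))

# Tiny demo: two hidden layers of width 3 with all-zero weights, so every
# pre-activation is 0 and the output is sigmoid(0).
demo_w = [np.zeros((4, 3)), np.zeros((4, 3)), np.zeros((4, 1))]
p = forward(np.zeros(3), demo_w)  # 0.5
```

Note how the leading 1 concatenated onto each activation implements the augmented vector a = [1, a] of the formulas, folding the bias into the weight matrix.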
(2) Probability smoothing. Before thresholding, the probability is smoothed using the probability output of the previous frame, to avoid erroneous judgements caused by single-frame probability errors. The smoothed probability is computed as:
p̄(n) = α · p̄(n−1) + (1 − α) · P(y(n) = 1 | X_C(n))
The smoothed probability is used as the final decision probability.
α is the smoothing factor; this embodiment sets its value in the range 0.75 to 0.85.
(3) Particular sound source detection decision. The decision compares the smoothed probability against a set threshold: if p̄(n) exceeds the threshold θ, the frame is judged to be a particular-sound-source start frame; otherwise no particular sound source is judged to have been detected in that time frame. The threshold θ usually lies between 0 and 1. If the threshold is too small, there are too many false alarms, i.e. many time frames that are not the particular sound source are mistakenly detected as it; if the threshold is too high, there are too many misses, i.e. many particular-sound-source time frames go undetected. This embodiment uses the compromise threshold θ = 0.5 to balance the false-alarm and miss rates.
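Steps (2) and (3) combine into a small stateful detector. The recursion below is the usual exponential smoothing consistent with the description (the original formula is an image in the source); α = 0.8 and θ = 0.5 follow the ranges given in the text.

```python
class SourceDetector:
    """Smooths per-frame DNN probabilities and thresholds the result."""

    def __init__(self, alpha=0.8, theta=0.5):
        self.alpha = alpha    # smoothing factor, 0.75-0.85 per the text
        self.theta = theta    # decision threshold
        self.p_smooth = 0.0   # smoothed probability state

    def step(self, p):
        """Consume one frame's raw probability p; return True when the
        smoothed probability exceeds the threshold."""
        self.p_smooth = self.alpha * self.p_smooth + (1.0 - self.alpha) * p
        return self.p_smooth > self.theta
```

Because the state carries 80% of its previous value, a single spuriously high frame cannot trigger a detection by itself; several consecutive high-probability frames are needed, which is exactly the false-alarm suppression the text describes.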
Embodiment 5
The present invention also proposes a particular sound source detection system based on a deep neural network, including:
a feature extraction module 10 for extracting the acoustic features of a real-time sound signal and generating an acoustic feature vector;
a DNN model building module 20 for training on preset sound signals using a deep neural network method and building the DNN model;
a detection module 30 for performing detection and judgement on the acoustic feature vector using the DNN model.
The feature extraction module 10 can be built using the method of Embodiment 2, the DNN model building module 20 using the method of Embodiment 3, and the detection module 30 using the method of Embodiment 4.
For child-companion robot applications, the system of this embodiment can detect the sound of a child crying in time and feed it back to the guardian; for intelligent security camera applications, the system can detect whether an alarm sound or the like is present, improving the intelligence level of the security system.
The disclosure can embody in equipment, system, method and/or computer program product.The computer program product Computer readable storage medium (or medium) is may include, thereon with computer-readable program instructions, for causing processor to enter The aspect of the row disclosure.
Computer readable storage medium can be the physical device that can keep being used for instruction executing device with store instruction.Meter Calculation machine readable storage medium may be, for example, but be not limited to electronic storage device, magnetic storage device, optical storage, electromagnetism and deposit Storage device, semiconductor storage, or foregoing every any appropriate combination.The more specific example of computer readable storage medium Non-exhaustive list include the following:Portable computer diskette, hard disk, random access memory (RAM), read-only storage (ROM), Erasable Programmable Read Only Memory EPROM (EPROM or flash memory), static RAM (SRAM), portable Formula compact disk read-only storage (CD-ROM), digital versatile disc (DVD), memory stick, floppy disc, such as above-noted have The device of the mechanical codings such as the bulge-structure in the card punch or groove of instruction, and foregoing every any appropriate combination. It is as used herein, computer readable storage medium is not construed to temporary signal in itself, for example radio wave or other The electromagnetic wave of Free propagation, the electromagnetic wave of waveguide or other transmission mediums is propagated across (for example, through the light arteries and veins of Connectorized fiber optic cabling Punching), or through the electric signal of wire transfer.
Computer-readable program instructions described herein can be downloaded to corresponding meter from computer readable storage medium Calculation/processing unit, or download to outer computer via networks such as such as internet, LAN, wide area network and/or wireless networks Or external memory.The network may include copper transmission yarn, optical transmission fibers, be wirelessly transferred, router, fire wall, Exchanger, gateway computer and/or Edge Server.Adapter or network interface in each calculating/processing unit connect The computer-readable program instructions for carrying out automatic network are received, and forwards the computer-readable program instructions, for storing mutually accrued In computer readable storage medium in calculation/processing unit.
The computer-readable program instructions of the operation for carrying out the disclosure can be assembler directive, instruction set architecture (ISA) instruction, machine instruction, machine-dependent instructions, microcode, firmware instructions, condition setup data, or programmed with one or more Any source code or object code that any combinations of language are write, the programming language is including taking target as the programming language being oriented to Speech, such as Smalltalk, C++ etc.;And conventional program programming language, such as " C " programming language or similar programming language. Computer-readable program instructions can completely on the computer of user, partly on the computer of user, as stand alone software Encapsulation, partly on the computer of user and partly on remote computer or completely on remote computer or server Perform.In the latter's scene, remote computer can be by the computer of any kind of network connection to user, including LAN (LAN) or wide area network (WAN), or can proceed to outer computer connection (for example, using ISP pass through because Special net).In some embodiments, including such as PLD, field programmable gate array (FPGA) or programmable The electronic circuit of logic array (PLA) can perform computer-readable by using the status information of computer-readable program instructions Programmed instruction with individualize electronic circuit, to perform the aspect of the disclosure.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
While the foregoing is directed to embodiments of the present disclosure, other and further embodiments of the disclosure may be devised without departing from its basic scope, and the scope of the disclosure is determined by the claims that follow.

Claims (10)

1. A particular sound source detection method based on a deep neural network, comprising the following steps:
extracting acoustic features of a real-time audio signal and generating an acoustic feature vector;
performing detection and decision on the acoustic feature vector using a DNN trained model, the DNN trained model being established by training preset sound signals using a deep neural network method.
2. The particular sound source detection method based on a deep neural network according to claim 1, wherein the step of extracting the acoustic features of the audio signal comprises:
acquiring the real-time audio signal and pre-processing the real-time audio signal;
applying a Fourier transform to the pre-processed real-time audio signal, then passing it through a Mel filter bank, and then applying a discrete cosine transform to obtain the acoustic features of the real-time audio signal;
taking the acoustic features of a plurality of adjacent frames preceding the real-time audio signal and merging them with the acoustic features of the current frame to generate the acoustic feature vector.
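A minimal sketch of the feature pipeline in claim 2 (Fourier transform, Mel filter bank, discrete cosine transform, then concatenation with preceding frames). The sample rate (16 kHz), number of Mel filters (26), number of cepstral coefficients (13), and 5-frame context window are illustrative assumptions; the claim specifies none of these values.

```python
import numpy as np

def mel_filterbank(num_filters, fft_size, sample_rate):
    """Triangular Mel filterbank matrix, shape (num_filters, fft_size//2 + 1)."""
    hz_to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    mel_to_hz = lambda mel: 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             num_filters + 2)
    bins = np.floor((fft_size + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, fft_size // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):            # rising slope
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling slope
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def frame_features(frame, fbank, num_ceps=13):
    """FFT -> Mel filtering -> log -> DCT, per claim 2."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    mel_energies = np.log(fbank @ power + 1e-10)
    # Type-II DCT written out directly to stay dependency-free
    n = len(mel_energies)
    k = np.arange(num_ceps)[:, None]
    i = np.arange(n)[None, :]
    dct = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return dct @ mel_energies

def context_vector(feature_frames, context=5):
    """Concatenate the current frame with the preceding `context` frames."""
    return np.concatenate(feature_frames[-(context + 1):])
```

With a 512-point FFT and a 5-frame context, each feature vector is 6 x 13 = 78-dimensional; this matches the claim's idea of merging adjacent frames, though the actual dimensions used in the patent are unstated.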
3. The particular sound source detection method based on a deep neural network according to claim 2, wherein the step of acquiring the real-time audio signal and pre-processing the real-time audio signal comprises:
acquiring the real-time audio signal, converting it into a digital signal at a preset quantization rate, and monitoring a pending buffer; when the buffer has been filled with a specified duration of data, writing the data into a history buffer and performing feature extraction;
combining the data with the specified duration of data from the previous buffer, and applying a windowing process to the two specified durations of data together.
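The double-buffering scheme of claim 3 can be sketched as follows, under assumptions the claim leaves open: the frame length in samples and the window type (a Hamming window is used here purely for illustration).

```python
import numpy as np

class AudioBuffer:
    """Double-buffered framing per claim 3: once the current buffer fills,
    it is combined with the previous buffer and windowed as one frame."""

    def __init__(self, frame_len=400):
        self.frame_len = frame_len                 # assumed buffer size in samples
        self.history = np.zeros(frame_len)         # previous buffer's contents
        self.window = np.hamming(2 * frame_len)    # assumed window type

    def push(self, samples):
        """Return the windowed double-length frame once a full buffer
        has arrived; return None while still waiting for data."""
        samples = np.asarray(samples, dtype=float)
        if len(samples) < self.frame_len:
            return None
        current = samples[:self.frame_len]
        frame = np.concatenate([self.history, current]) * self.window
        self.history = current   # current buffer becomes the next call's history
        return frame
```

Each emitted frame therefore spans two consecutive buffers, which gives 50% overlap between successive analysis frames, a common choice that is consistent with, but not mandated by, the claim.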
4. The particular sound source detection method based on a deep neural network according to claim 1, wherein the process of establishing the DNN trained model comprises:
presetting a plurality of audio data, the plurality of audio data including the target particular sound source;
performing feature extraction on each piece of audio data and establishing a sound calibration vector, the sound calibration vector comprising the extracted sound features and a calibration value;
updating the DNN weights using the DNN toolkit Kaldi.
5. The particular sound source detection method based on a deep neural network according to claim 1, wherein the step of performing detection and decision on the acoustic feature vector using the DNN trained model comprises:
calculating a DNN output probability;
smoothing the DNN output probability using the probability output of the previous frame;
comparing the smoothed probability with a set threshold to determine the particular sound source.
6. The particular sound source detection method based on a deep neural network according to claim 5, wherein the step of calculating the DNN output probability comprises:
initializing the input layer as
a^1 = X_C(n), where X_C(n) is the acoustic feature vector of the real-time audio signal and a^1 is the input-layer activation;
calculating the hidden-layer activation outputs and the weighted sums between adjacent layers, iterating with the following formulas:
a^(l-1) = [1, a^(l-1)]
z^l = a^(l-1) × w^(l-1)
a^l = ReLU(z^l) = log(1 + exp(z^l))
where a^(l-1) denotes the activation output of layer l-1,
a^l denotes the activation output of layer l,
z^l denotes the weighted-sum input to layer l,
w^(l-1) denotes the connection weights between layer l-1 and layer l,
w^l denotes the connection weights between layer l and layer l+1, and
ReLU denotes the piecewise-linear activation function (written above in its smooth softplus form); all hidden layers are computed in turn in this way; the output-layer activation value is then calculated and taken as the DNN output probability; the output layer is computed from the last hidden layer as follows:
a^(L-1) = [1, a^(L-1)]
z^L = a^(L-1) × w^(L-1)
a^L = softMax(z^L) = 1 / (1 + exp(-Σ_i z^L(i)))
where a^(L-1) denotes the activation output of layer L-1,
a^L denotes the activation output of layer L,
z^L denotes the weighted-sum input to layer L, and
w^(L-1) denotes the connection weights between layer L-1 and layer L;
the output-layer activation output is taken as the probability that the time frame is judged to be the particular sound source, i.e.:
P(y(n) = 1 | X_C(n)) = a^L
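The forward pass of claim 6 can be sketched as below; the folding of the bias into a leading 1 follows the claim's formulas, while the layer sizes are illustrative. Note that the claim writes the hidden activation ReLU(z) = log(1 + exp(z)), which is the softplus function, and its "softMax" output is a logistic function of the summed output units; the sketch follows the formulas as written rather than the conventional definitions.

```python
import numpy as np

def softplus(z):
    # Hidden activation exactly as written in claim 6: log(1 + exp(z))
    return np.log1p(np.exp(z))

def dnn_forward(x, weights):
    """Forward pass per claim 6. `weights` is a list of matrices; the bias
    of each layer is folded in by prepending 1 to the activation vector."""
    a = np.asarray(x, dtype=float)
    for w in weights[:-1]:
        a = np.concatenate(([1.0], a))   # a^(l-1) = [1, a^(l-1)]
        a = softplus(a @ w)              # a^l = log(1 + exp(a^(l-1) w^(l-1)))
    a = np.concatenate(([1.0], a))
    z_out = a @ weights[-1]              # z^L = a^(L-1) w^(L-1)
    # Output per the claim's formula: a logistic function of the summed
    # output units, giving the scalar probability P(y(n) = 1 | X_C(n)).
    return 1.0 / (1.0 + np.exp(-np.sum(z_out)))
```

Because each weight matrix absorbs the bias, a layer mapping d inputs to h units has shape (d + 1, h); the two-unit output layer here is only an example.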
7. The particular sound source detection method based on a deep neural network according to claim 6, wherein in the step of smoothing the DNN output probability using the probability output of the previous frame, the smoothed probability is calculated as follows:
P^(y(n) = 1 | X_C(n)) = α · P(y(n-1) = 1 | X_C(n-1)) + (1 - α) · P(y(n) = 1 | X_C(n))
where α is the smoothing factor.
8. The particular sound source detection method based on a deep neural network according to claim 7, wherein:
the value of α ranges from 0.75 to 0.85.
9. The particular sound source detection method based on a deep neural network according to claim 5, wherein the set threshold is 0.5.
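Claims 7 to 9 together define the post-processing: first-order smoothing with the previous frame's raw probability, then thresholding. A minimal sketch, with α = 0.8 (inside the 0.75-0.85 range of claim 8) and threshold 0.5 (claim 9); the treatment of the first frame, which has no predecessor, is an assumption not stated in the claims.

```python
def smooth_probs(probs, alpha=0.8):
    """P_hat(n) = alpha * P(n-1) + (1 - alpha) * P(n), per claim 7."""
    smoothed = []
    prev = probs[0]  # assumed initialization for the first frame
    for p in probs:
        smoothed.append(alpha * prev + (1.0 - alpha) * p)
        prev = p     # the claim smooths with the previous *raw* probability
    return smoothed

def detect(probs, alpha=0.8, threshold=0.5):
    """Flag a frame as the particular sound source when the smoothed
    probability exceeds the set threshold (0.5 per claim 9)."""
    return [p > threshold for p in smooth_probs(probs, alpha)]
```

For example, a single high-probability frame followed by silence decays through the smoother rather than switching the decision off instantly, which is the point of the smoothing step.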
10. A particular sound source detection system based on a deep neural network, comprising:
a feature extraction module, configured to extract acoustic features of a real-time audio signal and generate an acoustic feature vector;
a DNN trained model establishing module, configured to train preset sound signals using a deep neural network method and establish a DNN trained model; and
a detection module, configured to perform detection and decision on the acoustic feature vector using the DNN trained model.
CN201611099733.0A 2016-12-02 2016-12-02 Particular sound source detection method and particular sound source detection system based on deep neural network Pending CN106710599A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611099733.0A CN106710599A (en) 2016-12-02 2016-12-02 Particular sound source detection method and particular sound source detection system based on deep neural network


Publications (1)

Publication Number Publication Date
CN106710599A true CN106710599A (en) 2017-05-24

Family

ID=58934577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611099733.0A Pending CN106710599A (en) 2016-12-02 2016-12-02 Particular sound source detection method and particular sound source detection system based on deep neural network

Country Status (1)

Country Link
CN (1) CN106710599A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN105513597A (en) * 2015-12-30 2016-04-20 百度在线网络技术(北京)有限公司 Voiceprint authentication processing method and apparatus
CN105632486A (en) * 2015-12-23 2016-06-01 北京奇虎科技有限公司 Voice wake-up method and device of intelligent hardware
CN105845128A (en) * 2016-04-06 2016-08-10 中国科学技术大学 Voice identification efficiency optimization method based on dynamic pruning beam prediction
CN105976812A (en) * 2016-04-28 2016-09-28 腾讯科技(深圳)有限公司 Voice identification method and equipment thereof


Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109102812B (en) * 2017-06-21 2021-08-31 北京搜狗科技发展有限公司 Voiceprint recognition method and system and electronic equipment
CN109102812A (en) * 2017-06-21 2018-12-28 北京搜狗科技发展有限公司 A kind of method for recognizing sound-groove, system and electronic equipment
CN107393526B (en) * 2017-07-19 2024-01-02 腾讯科技(深圳)有限公司 Voice silence detection method, device, computer equipment and storage medium
CN107393526A (en) * 2017-07-19 2017-11-24 腾讯科技(深圳)有限公司 Speech silence detection method, device, computer equipment and storage medium
CN107527620A (en) * 2017-07-25 2017-12-29 平安科技(深圳)有限公司 Electronic installation, the method for authentication and computer-readable recording medium
CN107527620B (en) * 2017-07-25 2019-03-26 平安科技(深圳)有限公司 Electronic device, the method for authentication and computer readable storage medium
CN109473119A (en) * 2017-09-07 2019-03-15 中国科学院声学研究所 A kind of acoustic target event-monitoring method
CN108053841A (en) * 2017-10-23 2018-05-18 平安科技(深圳)有限公司 The method and application server of disease forecasting are carried out using voice
WO2019080502A1 (en) * 2017-10-23 2019-05-02 平安科技(深圳)有限公司 Voice-based disease prediction method, application server, and computer readable storage medium
CN109074822A (en) * 2017-10-24 2018-12-21 深圳和而泰智能控制股份有限公司 Specific sound recognition methods, equipment and storage medium
CN111566732A (en) * 2018-01-15 2020-08-21 三菱电机株式会社 Sound signal separating device and sound signal separating method
CN111566732B (en) * 2018-01-15 2023-04-04 三菱电机株式会社 Sound signal separating device and sound signal separating method
CN108513227B (en) * 2018-04-09 2021-02-19 华南理工大学 Modern electronic organ manufacturing method based on loudspeaker array design
CN108615536A (en) * 2018-04-09 2018-10-02 华南理工大学 Time-frequency combination feature musical instrument assessment of acoustics system and method based on microphone array
CN108513227A (en) * 2018-04-09 2018-09-07 华南理工大学 A kind of hyundai electronics qin production method based on loudspeaker array design
CN109102143A (en) * 2018-06-19 2018-12-28 硕橙(厦门)科技有限公司 A kind of yield monitoring method and device
CN108831487B (en) * 2018-06-28 2020-08-18 深圳大学 Voiceprint recognition method, electronic device and computer-readable storage medium
CN108831487A (en) * 2018-06-28 2018-11-16 深圳大学 Method for recognizing sound-groove, electronic device and computer readable storage medium
CN108922548A (en) * 2018-08-20 2018-11-30 深圳园林股份有限公司 A kind of bird based on deep learning, frog intelligent monitoring method
CN109357749A (en) * 2018-09-04 2019-02-19 南京理工大学 A kind of power equipment audio signal analysis method based on DNN algorithm
CN109298642A (en) * 2018-09-20 2019-02-01 三星电子(中国)研发中心 The method and device being monitored using intelligent sound box
CN109298642B (en) * 2018-09-20 2021-08-27 三星电子(中国)研发中心 Method and device for monitoring by adopting intelligent sound box
CN109065075A (en) * 2018-09-26 2018-12-21 广州势必可赢网络科技有限公司 A kind of method of speech processing, device, system and computer readable storage medium
CN109885162A (en) * 2019-01-31 2019-06-14 维沃移动通信有限公司 Method for oscillating and mobile terminal
CN111833901A (en) * 2019-04-23 2020-10-27 北京京东尚科信息技术有限公司 Audio processing method, audio processing apparatus, audio processing system, and medium
CN111833901B (en) * 2019-04-23 2024-04-05 北京京东尚科信息技术有限公司 Audio processing method, audio processing device, system and medium
CN110580915A (en) * 2019-09-17 2019-12-17 中北大学 Sound source target identification system based on wearable equipment
CN110444225A (en) * 2019-09-17 2019-11-12 中北大学 Acoustic target recognition methods based on Fusion Features network
CN112382302A (en) * 2020-12-02 2021-02-19 漳州立达信光电子科技有限公司 Baby cry identification method and terminal equipment
CN112700782A (en) * 2020-12-25 2021-04-23 维沃移动通信有限公司 Voice processing method and electronic equipment
CN113762085A (en) * 2021-08-11 2021-12-07 江苏省人民医院(南京医科大学第一附属医院) Artificial intelligence-based infant incubator system and method
CN113762085B (en) * 2021-08-11 2022-04-19 江苏省人民医院(南京医科大学第一附属医院) Artificial intelligence-based infant incubator system and method

Similar Documents

Publication Publication Date Title
CN106710599A (en) Particular sound source detection method and particular sound source detection system based on deep neural network
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
Lakomkin et al. On the robustness of speech emotion recognition for human-robot interaction with deep neural networks
CN110310647B (en) Voice identity feature extractor, classifier training method and related equipment
Tong et al. A comparative study of robustness of deep learning approaches for VAD
US20120316879A1 (en) System for detecting speech interval and recognizing continous speech in a noisy environment through real-time recognition of call commands
CN108364662B (en) Voice emotion recognition method and system based on paired identification tasks
CN109346087B (en) Noise robust speaker verification method and apparatus against bottleneck characteristics of a network
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
CN110459207A (en) Wake up the segmentation of voice key phrase
KR20210070213A (en) Voice user interface
CN110852215A (en) Multi-mode emotion recognition method and system and storage medium
CN108320732A (en) The method and apparatus for generating target speaker's speech recognition computation model
US11217265B2 (en) Condition-invariant feature extraction network
Salekin et al. Distant emotion recognition
CN114999525A (en) Light-weight environment voice recognition method based on neural network
CN111998936B (en) Equipment abnormal sound detection method and system based on transfer learning
CN113077812A (en) Speech signal generation model training method, echo cancellation method, device and equipment
Zhang et al. Acoustic traffic event detection in long tunnels using fast binary spectral features
Nicolson et al. Sum-product networks for robust automatic speaker identification
CN116386664A (en) Voice counterfeiting detection method, device, system and storage medium
Segarceanu et al. Environmental acoustics modelling techniques for forest monitoring
CN112037772B (en) Response obligation detection method, system and device based on multiple modes
CN113744734A (en) Voice wake-up method and device, electronic equipment and storage medium
CN116259312A (en) Method for automatically editing task by aiming at voice and neural network model training method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170524

RJ01 Rejection of invention patent application after publication