CN105321525B - A kind of system and method reducing VOIP communication resource expense - Google Patents
- Publication number: CN105321525B
- Application number: CN201510638936.1A
- Authority
- CN
- China
- Prior art keywords
- layer
- frame
- voice
- output
- dimensional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
Abstract
The invention discloses a system for reducing VOIP communication resource overhead, comprising an input layer, a convolutional layer, a sub-sampling layer and an output layer, each layer consisting of feature maps and each one-dimensional feature map containing neurons. A method of reducing VOIP communication resource overhead using the system comprises: (1) training the convolutional neural network system; (2) initializing the convolutional neural network system; (3) inputting the speech to be tested into the VAD system; (4) sequentially extracting the MFCC speech feature parameters and their first-order difference parameters for each frame; (5) composing each frame's parameters into a one-dimensional feature map and feeding it into the convolutional neural network system; (6) the convolutional neural network system sequentially outputs a result [x, y] for each frame of the speech to be tested, and the VAD system makes and records the judgment. Advantages: introducing a convolutional neural network system into the VAD system for detection reduces the false-judgment rate of the VAD system, saves computation time and bandwidth, and reduces the voice resource overhead of VOIP while guaranteeing communication quality.
Description
Technical field
The invention belongs to the field of communications and computing, and relates to a system and method for reducing VOIP communication resource overhead.
Background technique
VoIP (Voice over Internet Protocol), also known as voice over IP or IP telephony, digitizes the analog voice signal, converts it into IP data packets, and delivers it in real time over an IP network in the form of data packets.
With the rapid development and continuous improvement of IP network technology, IP-based networks are gradually replacing the traditional voice-switching network. When transmitting voice data over an IP network under a fixed network bandwidth, improving channel utilization can greatly improve voice quality.
In the prior art, channel utilization is generally improved with Voice Activity Detection (VAD) technology. Based on the alternation of speech and silence in everyday conversation, VAD suppresses transmission when silence is detected, occupying little or no channel bandwidth, and applies compression coding and transmission only when bursts of active speech are detected. In fact, people speak for only about 27%-40% of the time during a phone call; using VAD to remove the silent periods and retain only the information-bearing portions can raise the effective utilization of channel bandwidth by 50% or more. VAD has therefore become a powerful tool for reducing bandwidth and improving channel utilization in voice traffic transmission.

In IP telephony applications, VAD is used when transmitting voice data to remove the non-speech segments of a call, avoiding the coding and transmission of silence packets and saving computation time and bandwidth.
The method used by current VAD technology is as follows. First, the current audio frame to be detected is received, and time-domain and frequency-domain parameters are obtained from it. A first distance is then computed between the time-domain parameter and its long-term sliding average over historical background noise frames, and a second distance between the frequency-domain parameter and its long-term sliding average over historical background noise frames. "Historical background noise frames" are the background noise frames preceding the current frame, for example several consecutive background noise frames immediately before it; if the current frame is the very first frame, a preset frame, or the first frame itself, may serve as the historical background noise frame. Finally, a set of decision polynomials based on the first distance and the second distance determines whether the current frame to be detected is a foreground speech frame or a background noise frame. At least one coefficient in the decision polynomial set is variable and is determined by the VAD operating mode and/or the characteristics of the input signal.
The above technique mainly computes values such as the pitch, energy and tone information of the speech signal and compares them with standard thresholds: a value above the threshold is judged as speech, below it as noise. Such algorithms have the advantage of simplicity, but they are only suitable for relatively simple noise sources. With complex noise sources, the speech information described above mixes easily with the noise, and the thresholds struggle to adapt to changes in the environment.
Summary of the invention
Aiming at the weak adaptability of existing VAD detection techniques to environmental change and their low decision accuracy in complex environments, the present invention proposes a system and method for reducing VOIP communication resource overhead.
A system for reducing VOIP communication resource overhead, also called a convolutional neural network system, comprises 4 layers: an input layer, a convolutional layer, a sub-sampling layer and an output layer in sequence. Each layer consists of feature maps, and each one-dimensional feature map contains neurons.
The input layer receives the MFCC speech feature parameters of each frame of the speech to be tested together with their first-order difference parameters; these parameters form a one-dimensional feature map.

The feature map of the convolutional layer has multiple dimensions; each one-dimensional feature map is obtained by convolving the input layer's feature map with one convolution kernel. Depending on the kernel size, several input-layer neurons map to one convolutional-layer neuron, and the neuron at the same position in each dimension's feature map connects to the corresponding neurons of the input layer.

The sub-sampling layer performs a dimension-reduction operation on each one-dimensional feature map of the convolutional layer: the neurons of each map are evenly grouped, and the largest neuron in each group is selected as a neuron of the sub-sampling layer. The sub-sampling layer has the same number of feature-map dimensions as the convolutional layer, and the number of neurons per map is reduced to 1/3 to 2/3 of the convolutional layer's count.

The output layer is a one-dimensional feature map; each of its neurons is fully connected to all neurons of the sub-sampling layer, and it outputs the actual result for the speech to be tested.
A method of reducing VOIP communication resource overhead comprises the following specific steps.

Step 1: train the convolutional neural network system, and obtain and save the weight matrices w and m. The specific sub-steps are as follows.
Step 101: choose speech samples and label each frame as a speech frame or a non-speech frame. The speech samples are labeled by hand in sequence; speech frames are labeled [1, 0] and non-speech frames [0, 1], which serve as the ideal outputs of the output layer for speech-frame and non-speech-frame decisions respectively.
Step 102: for each labeled frame, sequentially extract the MFCC speech feature parameters and their first-order difference parameters. When extracting MFCC (Mel-scale Frequency Cepstral Coefficients) features, the signal is divided into frames of 20 ms.
The MFCC speech feature parameters are calculated as:

M(n) = Σ_{k=1}^{N} m_k · cos( πn(k − 0.5) / N ),  n = 1, …, L

where M(n) is the n-th order MFCC parameter; m_k (k = 1, …, N) is the input of the discrete cosine transform; N is the number of Mel filters; and L is the maximum order of the MFCC coefficients.
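The discrete cosine transform in this step can be sketched as follows (a minimal numpy illustration under the standard DCT-II form this formula follows; the function name and the constant test input are illustrative, not from the patent):

```python
import numpy as np

def mfcc_from_mel(log_mel, L):
    """M(n) = sum_k m_k * cos(pi*n*(k-0.5)/N) for n = 1..L,
    where m_k are the N Mel filter log-energies fed to the DCT."""
    N = len(log_mel)
    k = np.arange(1, N + 1)
    return np.array([np.sum(log_mel * np.cos(np.pi * n * (k - 0.5) / N))
                     for n in range(1, L + 1)])

# With N = 24 filters and L = 13 orders: a constant input has no AC
# content, so every DCT coefficient of order >= 1 vanishes.
M = mfcc_from_mel(np.ones(24), L=13)
```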
The first-order difference MFCC parameter, denoted ΔMFCC, is calculated as:

D(t) = [ Σ_{n=1}^{R} n · ( M(t+n) − M(t−n) ) ] / [ 2 Σ_{n=1}^{R} n² ],  t = 1, …, L'

where D(t) is the t-th order parameter of the first-order difference; L' is the maximum order of the ΔMFCC coefficients; M(t+n) and M(t−n) are the (t+n)-th and (t−n)-th order MFCC parameters; and n = 1, …, R, where R is the time difference of the first-order cepstral coefficients.
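The first-order difference can be sketched numerically (a hedged numpy illustration of the regression formula; the repeat-padding at the sequence edges is an assumption, as the patent does not specify edge handling):

```python
import numpy as np

def delta(M, R=2):
    """D(t) = sum_{n=1..R} n*(M[t+n] - M[t-n]) / (2*sum n^2),
    computed along the frame/order axis; edges padded by repetition."""
    denom = 2.0 * sum(n * n for n in range(1, R + 1))
    T = len(M)
    D = np.zeros(T)
    for t in range(T):
        for n in range(1, R + 1):
            D[t] += n * (M[min(t + n, T - 1)] - M[max(t - n, 0)])
    return D / denom

# On a linear ramp of slope 1, interior points recover the slope exactly.
D = delta(np.arange(10, dtype=float), R=2)
```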
Step 103: sequentially compose the MFCC parameters of each frame and their first-order differences into a one-dimensional feature map. The one-dimensional feature map V is:

V = [MFCC | ΔMFCC] = [M(1), M(2), …, M(L), D(1), D(2), …, D(L')]
Step 104: feed the one-dimensional feature map of step 103 into the convolutional neural network system to obtain the decision result. The specific sub-steps are as follows.

Step 1041: input the one-dimensional feature map of each frame of the sample speech to the input layer of the convolutional neural network system.
Step 1042: each one-dimensional feature map of the convolutional layer is obtained by convolving the input layer's one-dimensional feature map with a convolution kernel. Each feature map Q_j of the convolutional layer is computed as:

Q_j = σ(O * w_j),  j = 1, …, J

where Q_j is the j-th feature map of the convolutional layer, J is the number of feature-map dimensions of the convolutional layer, O is the one-dimensional feature map of the input layer, * denotes convolution, and w_j is the weight matrix connecting the input layer to the j-th feature map of the convolutional layer; all the w_j together constitute the matrix w. σ(x) is the activation function, chosen as the sigmoid function:

σ(x) = 1 / (1 + e^(−x))

After convolution, each neuron on each feature map of the convolutional layer extracts the corresponding local feature of the input layer's feature map.
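Step 1042 can be sketched as follows (a minimal numpy illustration using the 1*26 input and 12 kernels of size 1*6 given later in the embodiment; the zero-initialized weights are purely for the shape check):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_layer(O, kernels):
    """Q_j = sigmoid(O * w_j): valid 1-D convolution of the input map O
    with each kernel w_j, followed by the sigmoid activation."""
    K = kernels.shape[1]
    n_out = len(O) - K + 1                  # 26 - 6 + 1 = 21 neurons per map
    Q = np.empty((kernels.shape[0], n_out))
    for j, w in enumerate(kernels):
        for i in range(n_out):
            Q[j, i] = np.dot(O[i:i + K], w)
    return sigmoid(Q)

Q = conv_layer(np.zeros(26), np.zeros((12, 6)))   # 12 maps of 21 neurons
```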
Step 1043: the sub-sampling layer performs a dimension-reduction operation on each feature map Q_j of the convolutional layer. The neurons of each map are evenly grouped, and the largest neuron of each group is selected as a neuron of the sub-sampling layer:

p_{j,m} = max{ q_{j,i} : (m−1)·s < i ≤ (m−1)·s + G }

where p_{j,m} is the m-th neuron of the j-th feature map P_j of the sub-sampling layer, q_{j,i} is the i-th neuron of the convolutional layer's j-th feature map Q_j, s is the window step, and G is the window size of the dimension-reduction operation performed from the convolutional layer to the sub-sampling layer. The step s takes the same value as the window size G.
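The grouping-and-maximum operation of step 1043 can be sketched as (a minimal numpy illustration with the G = s = 3 window used later in the embodiment; the ramp input is invented for the check):

```python
import numpy as np

def subsample(Q, G=3):
    """p_{j,m} = max of the m-th group of G consecutive neurons on each
    convolutional-layer map (window step s equal to window size G)."""
    J, n = Q.shape
    m = n // G
    return Q[:, :m * G].reshape(J, m, G).max(axis=2)

Q = np.arange(12 * 21, dtype=float).reshape(12, 21)
P = subsample(Q, G=3)     # 12*21 -> 12*7
```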
Step 1044: each neuron of the output layer connects to all neurons of the sub-sampling layer. The output is computed as:

y = σ( m · P̂ )

where y is the output vector; P̂ is the one-dimensional vector formed by concatenating all feature maps of the sub-sampling layer, P̂ = [P_1, P_2, …, P_J], with P_j the j-th feature map of the sub-sampling layer and J the number of its feature-map dimensions; m is the weight matrix connecting the concatenated feature vector to the output layer; and σ(x) is the sigmoid activation function σ(x) = 1 / (1 + e^(−x)).

Step 1045: after the full connection, the output layer outputs and records the result.
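The full connection of step 1044 can be sketched as (a minimal numpy illustration; the 12*7 sub-sampling output and 2-neuron output layer follow the embodiment, and zero weights are used only for the shape check):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def output_layer(P, m):
    """y = sigmoid(m . P_hat), where P_hat flattens the sub-sampling
    maps (12*7) into one 84-dimensional vector."""
    P_hat = P.reshape(-1)
    return sigmoid(m @ P_hat)

y = output_layer(np.zeros((12, 7)), np.zeros((2, 84)))
```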
Step 105: for each frame, sequentially compare the actual decision result of step 104 with the ideal output labeled in step 101, and calculate the decision error term. For the n-th frame of the input speech, let the actual output of the convolutional neural network system be y(n) and the ideal output be ŷ(n). The decision error term d(n) is:

d(n) = e(n) · { y(n) · [1 − y(n)] }

where e(n) = ŷ(n) − y(n), and n is the frame index of the sample speech.
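The error term of step 105 can be illustrated with one worked frame (the numeric output values are invented for illustration):

```python
import numpy as np

def error_term(y, y_ideal):
    """d(n) = e(n) * y(n) * (1 - y(n)), elementwise, with
    e(n) = ideal output minus actual output."""
    e = y_ideal - y
    return e * y * (1.0 - y)

# A speech frame (ideal output [1, 0]) whose actual output was [0.8, 0.3]:
d = error_term(np.array([0.8, 0.3]), np.array([1.0, 0.0]))
```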
Step 106: for the decision error term of each frame, the convolutional neural network system back-propagates the error, updating the weight matrices w and m to complete one training pass.

Step 1061: for the n-th frame of the input speech, the error term d(n) is transferred back to the sub-sampling layer, giving the sub-sampling layer's output error term sd(n):

sd(n) = mᵀ(n) · d(n)

where mᵀ(n) is the transpose of m(n), the weight matrix connecting the sub-sampling layer's output feature maps to the output layer when the n-th frame is input to the convolutional neural network system.
Step 1062: split the output error term sd(n) into J matrices. sd(n) = [sd_1, sd_2, …, sd_M] is split into J matrices, one per feature map of the sub-sampling layer, where I is the number of neurons on each feature map of the sub-sampling layer and M is the total number of sub-sampling-layer neurons, M = J × I.
Step 1063: the J matrices are transferred to the convolutional layer and up-sampled (dimension increase) to obtain the convolutional layer's error term c(n), where G, the window size of the dimension-reduction operation performed from the convolutional layer to the sub-sampling layer, is the expansion factor.
Step 1064: for the n-th input frame, calculate the error terms of the weight matrices w(n) and m(n) separately.

The error term of the weight matrix w(n) connecting the input layer and the convolutional layer is:

Δw(n) = O″(n) * c(n)

where O″(n) is the input feature map O of the n-th frame at the input layer, rotated by 180 degrees.

The error term of the weight matrix m(n) connecting the sub-sampling layer and the output layer is:

Δm(n) = d(n) · P̂ᵀ(n)

where P̂(n) is the one-dimensional feature vector formed by the sub-sampling layer's output for the n-th input frame, and d(n) is the error vector between the actual and ideal outputs.
Step 1065: update the weight matrices w(n) and m(n) for the n-th input frame, obtaining the weight matrices w(n+1) and m(n+1) used for the (n+1)-th input frame:

w(n+1) = w(n) − α·Δw(n)
m(n+1) = m(n) − α·Δm(n)

where w(n) and m(n) are, respectively, the weight matrices between the input layer and the convolutional layer and between the sub-sampling layer and the output layer for the n-th input frame, and α is the learning rate, preferably α = 1.
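The update of step 1065 is plain gradient descent; a short numpy sketch (the matrix shapes follow the embodiment, and the error terms here are dummy constants for illustration):

```python
import numpy as np

def update(w, m, dw, dm, alpha=1.0):
    """w(n+1) = w(n) - alpha*dw(n); m(n+1) = m(n) - alpha*dm(n)."""
    return w - alpha * dw, m - alpha * dm

w2, m2 = update(np.ones((12, 6)), np.ones((2, 84)),
                dw=0.1 * np.ones((12, 6)), dm=0.5 * np.ones((2, 84)))
```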
Step 107: keep feeding in sample speech and repeat steps 101-106 until the output error δ of the convolutional neural network system falls below a preset value:

δ = (1/S) · Σ_{n=1}^{S} ‖ y(n) − ŷ(n) ‖²

where S is the total number of frames input per training pass, y is the vector formed by the decision results of the S frames, and ŷ is the vector formed by their ideal decision results.
Step 108: when the output error δ is below the preset value, record the weight matrices w and m and save them to a file.

Step 2: read the weight matrices w and m from the file and initialize the convolutional neural network system.
Step 3: input the speech to be tested into the VAD system.

Step 4: sequentially extract the MFCC speech feature parameters and their first-order difference parameters for each frame of the speech to be tested.

Step 5: sequentially compose the MFCC parameters of each frame and their first-order differences into a one-dimensional feature map and feed it into the convolutional neural network system.

Step 6: the convolutional neural network system sequentially outputs a result [x, y] for each frame of the speech to be tested; the VAD system judges the result [x, y] and records the judgment. If x > y, the sample frame is a speech frame and the VAD system outputs the decision 1; otherwise it is a non-speech frame and the VAD system outputs the decision 0.
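The judgment rule of step 6 reduces to a comparison of the two output components (a one-line sketch; the sample outputs are invented):

```python
def vad_decide(result):
    """Return 1 (speech frame) when x > y in the output [x, y], else 0."""
    x, y = result
    return 1 if x > y else 0

decisions = [vad_decide(r) for r in [(0.9, 0.1), (0.2, 0.7), (0.55, 0.45)]]
```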
The present invention has the following advantages:

(1) In the system for reducing VOIP communication resource overhead, once the convolutional neural network system is trained it needs no retraining; VAD detection can be performed after simply initializing the network with the weights saved in the file, so the computational complexity is low.

(2) The method for reducing VOIP communication resource overhead introduces a convolutional neural network system into the VAD system for detection, making full use of the network's learning capability. It reduces the false-judgment rate of the VAD system and saves computation time and bandwidth, thereby reducing the voice resource overhead of VOIP while guaranteeing communication quality.
Description of the drawings
Fig. 1 is a structural diagram of the system for reducing VOIP communication resource overhead according to the present invention.
Fig. 2 is a flow chart of the method for reducing VOIP communication resource overhead according to the present invention.
Fig. 3 is a flow chart of training the convolutional neural network system of the present invention.
Fig. 4 is a flow chart of the convolutional neural network system of the present invention.
Fig. 5 is a flow chart of the back-propagation of the convolutional neural network system of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below in conjunction with the drawings and embodiments.
In a VOIP call system, VAD detection is performed before speech coding to remove non-speech segments, avoiding the coding and transmission of silence packets and saving computation time and bandwidth. In the speech-coding stage, segments judged by the VAD as non-speech frames can be left uncoded or coded at a lower rate, reducing the bandwidth resources used for communication.

The system and method for reducing VOIP communication resource overhead apply the convolutional neural network system, with its simple structure, few training parameters and strong adaptability, to the VAD system. By training the convolutional neural network system and using it as a two-dimensional decision device, the false-judgment and missed-detection rates of the VAD system are reduced, thereby reducing the VOIP communication resource overhead while guaranteeing communication quality.
The system for reducing VOIP communication resource overhead, also called a convolutional neural network system, is shown in Fig. 1. It comprises 4 layers: an input layer, a convolutional layer, a sub-sampling layer and an output layer in sequence. Each layer consists of several feature maps, and each one-dimensional feature map contains several neurons.

The input layer is a one-dimensional feature map with 26 neurons, denoted 1*26. The convolutional layer has 12 feature maps of 21 neurons each, denoted 12*21. The sub-sampling layer has the same number of feature-map dimensions as the convolutional layer, with one third as many neurons per map, denoted 12*7. The output layer is a one-dimensional feature map with 2 neurons, denoted 1*2.
The input layer receives the one-dimensional feature map formed by the MFCC speech feature parameters of each frame of the speech to be tested and their first-order differences.

The convolutional layer's feature map has 12 dimensions; each is obtained by convolving the input layer's feature map with one convolution kernel. This embodiment selects 12 kernels of size 1*6. After convolution, the first 6 neurons of the input layer correspond to the first neuron of the convolutional layer's first feature map, and so on. The neuron at the same position on each feature map of the convolutional layer connects to the corresponding 6 neurons of the input layer's feature map, with the same connection pattern throughout.

The sub-sampling layer performs a dimension-reduction operation on each feature map of the convolutional layer: the neurons of each map are evenly grouped, and the largest neuron of each group is selected as a neuron of the sub-sampling layer. The sub-sampling layer has the same number of feature-map dimensions as the convolutional layer, and the number of neurons per map is reduced to 1/3 to 2/3 of the convolutional layer's count.

The two neurons of the output layer are fully connected to all neurons of the sub-sampling layer and output the actual result for the speech to be tested.
The method of reducing VOIP communication resource overhead by performing VAD detection with the convolutional neural network system, shown in Fig. 2, comprises the following steps.

Step 1: train the convolutional neural network system, and obtain and save the weight matrices w and m, as shown in Fig. 3.

Step 101: choose speech samples and label each frame as a speech frame or a non-speech frame. The speech samples are drawn from the speech portion of the Clemson University CUAVE database and the noisy-speech corpus NOIZEUS. The samples are labeled by hand in sequence, with one frame every 20 ms; speech frames are labeled [1, 0] and non-speech frames [0, 1], serving as the ideal outputs of the output layer for speech-frame and non-speech-frame decisions respectively.
Step 102: for each labeled frame, sequentially extract the MFCC speech feature parameters and their first-order difference parameters. When extracting MFCC (Mel-scale Frequency Cepstral Coefficients) features, the signal is divided into frames of 20 ms.

The MFCC speech feature parameters are calculated as:

M(n) = Σ_{k=1}^{N} m_k · cos( πn(k − 0.5) / N ),  n = 1, …, L

where M(n) is the n-th order MFCC parameter; m_k (k = 1, …, N) is the input of the discrete cosine transform; N is the number of Mel filters, preferably 24; and L is the maximum order of the MFCC coefficients, preferably 13.

The first-order difference MFCC parameter, denoted ΔMFCC, is calculated as:

D(t) = [ Σ_{n=1}^{R} n · ( M(t+n) − M(t−n) ) ] / [ 2 Σ_{n=1}^{R} n² ],  t = 1, …, L'

where D(t) is the t-th order parameter of the first-order difference; L' is the maximum order of the ΔMFCC coefficients; M(t+n) and M(t−n) are the (t+n)-th and (t−n)-th order MFCC parameters; and n = 1, …, R, where R is the time difference of the first-order cepstral coefficients and may take the value 1 or 2, preferably 2.
Step 103: sequentially compose the MFCC parameters of each frame and their first-order differences into a one-dimensional feature map:

V = [MFCC | ΔMFCC] = [M(1), M(2), …, M(L), D(1), D(2), …, D(L')]

where V is the one-dimensional feature map, M(n) is the n-th order MFCC parameter, and D(t) is the t-th order ΔMFCC parameter.
Step 104: feed the one-dimensional feature map of step 103 into the convolutional neural network system to obtain the decision result, as shown in Fig. 4.

Step 1041: input the one-dimensional feature map of each frame of the sample speech to the input layer of the convolutional neural network system. All the MFCC parameters of a frame and their first-order differences form a 1*26 feature map, which is fed as a one-dimensional vector into the convolutional neural network system.
Step 1042: each feature map of the convolutional layer is obtained by convolving the input layer's feature map with a convolution kernel. The kernel size is set as required; this embodiment selects 12 kernels of size 1*6. After convolution, the number of neurons per convolutional-layer map is reduced to about 80% of the input layer's count. The weight matrix connecting the input layer to the j-th feature map of the convolutional layer is denoted w_j; all the w_j together constitute the matrix w.

Each feature map Q_j of the convolutional layer is computed as:

Q_j = σ(O * w_j),  j = 1, …, J

where Q_j is the j-th feature map of the convolutional layer and J is the number of convolutional-layer feature-map dimensions, 12 in this embodiment; O is the input layer's feature map; * denotes convolution; and σ(x) is the sigmoid activation function σ(x) = 1 / (1 + e^(−x)).

The 12 feature maps of the convolutional layer are obtained by convolving the input layer's feature map with the 12 1*6 kernels. Because the neurons on each convolutional-layer map share the same connection pattern with the input layer's neurons, each neuron on each feature map extracts the corresponding local feature of the input layer's feature map.
Step 1043: the sub-sampling layer performs a dimensionality-reduction operation on each one-dimensional feature map Q_j of the convolutional layer;
The neurons on each one-dimensional feature map of the convolutional layer are divided into equal groups, and the maximum neuron in each group is chosen as a neuron of the sub-sampling layer. The calculation formula is as follows:

p_{j,m} = max( q_{j,(m−1)·s+1}, ..., q_{j,(m−1)·s+G} )

where p_{j,m} is the m-th neuron of the j-th feature map P_j of the sub-sampling layer; q_{j,m} is the m-th neuron of the j-th feature map Q_j of the convolutional layer; s is the window stride; and G is the window size of the dimensionality-reduction operation from the convolutional layer to the sub-sampling layer, i.e. the number of neurons per group. The stride s takes the same value as the window size G; this embodiment selects G = s = 3.
After dimensionality reduction, the sub-sampling layer contains 12*7 neurons; the number of neurons per one-dimensional feature map of the sub-sampling layer is 1/3 to 2/3 of the number per feature map of the convolutional layer, reducing the amount of computation.
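The max-pooling dimensionality reduction above can be sketched as follows (an illustrative NumPy sketch, not the patent's code). With G = s = 3 the 12*21 convolutional maps become 12*7, as in the embodiment.

```python
import numpy as np

def subsample(q, G=3, s=3):
    """Max-pool each row of q in windows of size G with stride s."""
    J, L = q.shape
    n_out = (L - G) // s + 1
    p = np.empty((J, n_out))
    for m in range(n_out):
        p[:, m] = q[:, m * s:m * s + G].max(axis=1)
    return p

q = np.arange(12 * 21, dtype=float).reshape(12, 21)  # placeholder conv maps
P = subsample(q)
```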
Step 1044: each neuron of the output layer is connected to all neurons of the sub-sampling layer;
The output of the sub-sampling layer and the input of the output layer are fully connected; the calculation formula is:

y = σ(M · P̂)

where y is the output vector; P̂ denotes the feature map obtained by rearranging each feature map of the sub-sampling layer into a one-dimensional vector; P_j is the j-th feature map of the sub-sampling layer and J the number of its feature maps; M is the weight matrix connecting the one-dimensional feature vector to the output layer; and σ(x) is the activation function, chosen as the sigmoid function, defined as:

σ(x) = 1 / (1 + e^(−x))
Step 1045: after the full connection, the output of the output layer is expressed as [x, y], and the result is recorded.
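The fully connected output step can be sketched as below. This is an illustrative NumPy sketch; the shapes (12*7 sub-sampling layer flattened to 84, a 2*84 weight matrix M) follow this embodiment, and the zero placeholder weights are an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def output_layer(p, m):
    """Flatten the sub-sampling maps into one vector P-hat and apply
    the fully connected output layer: y = sigmoid(m @ p_hat)."""
    p_hat = p.reshape(-1)              # one-dimensional vector, length 84
    return sigmoid(m @ p_hat)          # the pair [x, y]

p = np.full((12, 7), 0.5)              # placeholder sub-sampling output
m = np.zeros((2, 84))                  # placeholder weight matrix M
y = output_layer(p, m)
```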
Step 105: for each frame of speech, the actual decision result of step 104 is compared in order with the ideal output labelled in step 101, and the decision error term is calculated;
For the n-th speech frame, the actual output decision result of the convolutional neural network system is y(n) and the ideal output is denoted ŷ(n). The decision error term d(n) is calculated as:

d(n) = e(n) · { y(n) · [1 − y(n)] }

where e(n) = y(n) − ŷ(n) and n is the frame index of the sample speech.
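The decision error term can be sketched as follows. This is an illustrative NumPy sketch; the sign convention e(n) = y(n) − ŷ(n) is an assumption consistent with the subtractive weight updates in step 1065.

```python
import numpy as np

def decision_error(y, y_ideal):
    """d(n) = e(n) * y(n) * (1 - y(n)), the sigmoid-output error term,
    with e(n) = y(n) - y_ideal(n) (sign convention assumed)."""
    e = y - y_ideal
    return e * y * (1.0 - y)

y = np.array([0.9, 0.2])          # actual output of the output layer
ideal = np.array([1.0, 0.0])      # a labelled speech frame: [1, 0]
d = decision_error(y, ideal)
```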
Step 106: for the decision error term of each frame of speech, the convolutional neural network system back-propagates the error term and updates the weight matrices w and m, completing one training pass.
As shown in Figure 5, the specific steps are as follows:
Step 1061: for the n-th frame of the input speech, the error term d(n) is transferred in reverse to the sub-sampling layer, yielding the output error term sd(n) of the sub-sampling layer:

sd(n) = m^T(n) · d(n)

where m^T(n) is the transpose of m(n), and m(n) is the weight matrix connecting the sub-sampling-layer feature maps of the n-th speech frame to the output layer.
Step 1062: the output error term sd(n) is split into a J-dimensional matrix ŝd(n);
sd(n) is a one-dimensional matrix; since the sub-sampling layer has J dimensions, sd(n) must be split into a J-dimensional matrix. Assuming sd(n) = [sd_1, sd_2, ..., sd_M], splitting sd(n) yields the J-dimensional matrix ŝd(n); the splitting process is as follows:

s_{j,i} = sd_{(j−1)·I+i}, j = 1, ..., J, i = 1, ..., I

where I is the number of neurons per one-dimensional feature map of the sub-sampling layer; M is the total number of neurons of the sub-sampling layer, M = J × I; and s_{j,i} is an element of the J-dimensional matrix ŝd(n).
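The split can be sketched as a simple reshape (an illustrative NumPy sketch with the embodiment's sizes J = 12, I = 7):

```python
import numpy as np

# Split the one-dimensional error vector sd(n) of length M = J * I
# back into J rows of I entries, one row per sub-sampling feature map.
J, I = 12, 7
sd = np.arange(J * I, dtype=float)      # placeholder error term sd(n)
sd_split = sd.reshape(J, I)             # s_{j,i} = sd[(j-1)*I + i]
```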
Step 1063: the J-dimensional matrix ŝd(n) is transferred to the convolutional layer and expanded in dimension, yielding the error term c(n) of the convolutional layer. The process is as follows:
where G is the window size of the dimensionality-reduction operation from the convolutional layer to the sub-sampling layer.
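The dimension expansion can be sketched as below. This is an illustrative NumPy sketch: each pooled error is repeated G times over its window, a common simplification. Strictly, for max pooling only the position of the maximum receives the error; the extracted text does not give the exact rule, so the uniform expansion here is an assumption.

```python
import numpy as np

def upsample_error(sd_split, G=3):
    """Expand each sub-sampling-layer error over its pooling window of
    size G, producing the convolutional-layer error term c(n)."""
    return np.repeat(sd_split, G, axis=1)

sd_split = np.arange(12 * 7, dtype=float).reshape(12, 7)
c = upsample_error(sd_split)            # shape (12, 21)
```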
Step 1064: for the n-th frame of the input speech, the error term of the weight matrix w(n) and the error term of the weight matrix m(n) are calculated separately.
The error term of the weight matrix w(n) connecting the input layer and the convolutional layer is calculated as:

Δw(n) = o''(n) * c(n)

where o''(n) is the input feature map O of the n-th input speech frame rotated by 180 degrees.
The error term of the weight matrix m(n) connecting the sub-sampling layer and the output layer is calculated as:

Δm(n) = d(n) · P̂^T(n)

where P̂(n) is the one-dimensional feature vector constituted by the sub-sampling-layer output of the n-th input speech frame, and d(n) is the error vector between the actual output and the ideal output.
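The two error terms can be sketched as below. This is an illustrative NumPy sketch: Δw follows the stated formula Δw(n) = o''(n) * c(n) as a 'valid' convolution of the 180-degree-rotated input with the convolutional-layer error, and Δm as the outer product of d(n) with the flattened sub-sampling output; the outer-product form is an assumption consistent with the fully connected output y = σ(M · P̂).

```python
import numpy as np

def weight_error_terms(o, c, d, p_hat):
    """Return (dw, dm): the error terms of w(n) and m(n)."""
    J, L = c.shape
    K = o.size - L + 1                 # kernel length (6 here)
    o_rot = o[::-1]                    # rotate the input map 180 degrees
    dw = np.empty((J, K))
    for j in range(J):
        for k in range(K):
            dw[j, k] = np.dot(o_rot[k:k + L], c[j])
    dm = np.outer(d, p_hat)            # shape (2, 84)
    return dw, dm

o = np.linspace(-1.0, 1.0, 26)         # placeholder input feature map
c = np.full((12, 21), 0.01)            # placeholder conv-layer error c(n)
d = np.array([-0.009, 0.032])          # placeholder output error d(n)
p_hat = np.full(84, 0.5)               # flattened sub-sampling output
dw, dm = weight_error_terms(o, c, d, p_hat)
```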
Step 1065: the weight matrices w(n) and m(n) corresponding to the n-th input speech frame are updated, yielding the weight matrices w(n+1) and m(n+1) corresponding to the (n+1)-th input speech frame:

w(n+1) = w(n) − α·Δw(n)
m(n+1) = m(n) − α·Δm(n)

where w(n) and m(n) respectively denote, for the n-th input speech frame, the weight matrix between the input layer and the convolutional layer and that between the sub-sampling layer and the output layer; α is the learning rate, preferably α = 1.
Step 107: sample speech is input continually and steps 101 to 106 are repeated until the output error δ of the convolutional neural network system is less than a preset value;
where S is the total number of speech frames input per training pass, y is the vector constituted by the decision results of the S frames, and ŷ is the vector constituted by the ideal decision results of the S frames. The preset value is 5%.
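The stopping criterion can be sketched as below. The extracted text omits the exact formula for δ, so the mean absolute deviation used here is only one plausible reading and is an assumption; training stops once δ drops below the 5% preset.

```python
import numpy as np

def output_error(y_actual, y_ideal):
    """One plausible reading of delta: the mean absolute deviation
    between the S actual outputs and the S ideal outputs (assumed)."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_ideal = np.asarray(y_ideal, dtype=float)
    return float(np.mean(np.abs(y_actual - y_ideal)))

delta = output_error([[0.98, 0.02], [0.10, 0.95]],   # actual outputs
                     [[1.0, 0.0], [0.0, 1.0]])       # ideal outputs
converged = delta < 0.05
```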
Step 108: when the output error δ is less than the preset value, the weight matrices w and m are recorded and stored in a file.
Step 2: the weight matrices w and m are read from the file and the convolutional neural network system is initialized.
Step 3: the speech to be measured is input to the VAD system;
That is, the speech on which VAD decisions are to be made is input to the VAD system.
Step 4: the speech feature parameters MFCC and their first-order difference feature parameters are extracted in order from each frame of the speech to be measured;
Step 5: the feature parameters of each frame of the speech to be measured are composed in order into a one-dimensional feature map and brought into the convolutional neural network system.
The one-dimensional feature map is formed from MFCC and ΔMFCC as follows:

V = [MFCC | ΔMFCC] = [M(1), M(2), ..., M(n), ..., M(L), D(1), D(2), ..., D(t), ..., D(L')]
Step 6: the convolutional neural network system outputs in order the result [x, y] for each frame of the speech to be measured; the VAD system makes a decision on the result [x, y] and records the decision result;
The speech is divided into 20 ms frames. For one frame of speech data, the output of the output layer of the convolutional neural network system is [x, y]. When the VAD system makes its decision: if x > y, the frame is a speech frame and the VAD system outputs decision result 1; otherwise it is a non-speech frame and the VAD system outputs decision result 0.
For example, if a frame of the speech to be measured yields the output-layer result [0.98, 0.02], then since 0.98 > 0.02 the VAD system decides that the frame is a speech frame and outputs decision result 1.
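The decision rule, including the worked example above, can be sketched in a few lines (illustrative sketch; the function name is ours):

```python
def vad_decide(result):
    """VAD decision on the output pair [x, y]:
    1 (speech frame) if x > y, otherwise 0 (non-speech frame)."""
    x, y = result
    return 1 if x > y else 0

speech = vad_decide([0.98, 0.02])      # the worked example: 0.98 > 0.02
silence = vad_decide([0.30, 0.70])
```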
The present invention realizes a VAD detection technique that uses speech feature parameters, such as MFCC, MFCC with its first-order difference parameters, or MFCC with its first- and second-order difference parameters, as the input of a convolutional neural network, with the convolutional neural network serving as a binary decision device.
In VOIP, this technique can reduce the speech resource overhead while guaranteeing communication quality. The speech feature extraction part may be replaced by LPCC, LPCC with its first-order difference parameters, LPCC with its first- and second-order difference parameters, or a combination of MFCC and LPCC. The number of feature parameters may differ from the reference values given. The numbers of neurons in the convolutional and sub-sampling layers of the convolutional neural network may differ, the convolution kernel size may differ, and the convolutional and sub-sampling layers may each comprise multiple layers.
Claims (5)
1. A system for reducing VOIP communication resource overhead, characterized in that the system is a convolutional neural network system composed of an input layer, a convolutional layer, a sub-sampling layer and an output layer; each layer is composed of feature maps, and each one-dimensional feature map contains neurons; VOIP refers to Voice over IP;
the input layer receives the speech feature parameters MFCC of each frame of the speech to be measured and the first-order difference feature parameters of the MFCC, which constitute a one-dimensional feature map; MFCC refers to mel-frequency cepstral coefficients;
each one-dimensional feature map of the convolutional layer is obtained by convolving the feature map of the input layer with a convolution kernel, and neurons at the same position in each feature map are connected to the corresponding neurons of the input layer;
the sub-sampling layer performs a dimensionality-reduction operation on each one-dimensional feature map of the convolutional layer: the neurons on each one-dimensional feature map of the convolutional layer are divided into equal groups, the maximum neuron of each group is chosen as a neuron of the sub-sampling layer, and the number of neurons of the sub-sampling layer is thereby reduced to 1/3 to 2/3 of the number of neurons of the convolutional layer;
the output layer is a one-dimensional feature map; each neuron on the feature map is connected to all neurons of the sub-sampling layer, and the output layer outputs the actual result for the speech to be measured;
the working principle of the system is as follows:
the convolutional neural network system is trained to obtain and save the weight matrix w connecting the neurons of the input layer and the convolutional layer and the weight matrix m connecting the neurons of the sub-sampling layer and the output layer; the weight matrices w and m are read and the convolutional neural network system is initialized; the speech to be measured is input to the VAD system; the speech feature parameters MFCC and their first-order difference feature parameters are extracted in order from each frame of the speech to be measured; the feature parameters MFCC and their first-order difference feature parameters of each frame are composed in order into a one-dimensional feature map and brought into the convolutional neural network system; the convolutional neural network system outputs in order the result for each frame of the speech to be measured, the VAD system makes a decision on the result, and the decision result is recorded; the output of the output layer is [x, y]; when the VAD system makes its decision, if x is greater than y the frame is a speech frame and the VAD system outputs decision result 1, otherwise it is a non-speech frame and the VAD system outputs decision result 0.
2. A method of reducing VOIP communication resource overhead applying the system for reducing VOIP communication resource overhead according to claim 1, characterized in that the specific steps are as follows:
Step 1: train the convolutional neural network system to obtain and save the weight matrix w connecting the neurons of the input layer and the convolutional layer and the weight matrix m connecting the neurons of the sub-sampling layer and the output layer;
Step 2: read the weight matrices w and m and initialize the convolutional neural network system;
Step 3: input the speech to be measured into the VAD system; VAD refers to voice activity detection;
Step 4: extract in order the speech feature parameters MFCC and their first-order difference feature parameters from each frame of the speech to be measured;
Step 5: compose in order the feature parameters MFCC and their first-order difference feature parameters of each frame of the speech to be measured into a one-dimensional feature map and bring it into the convolutional neural network system;
Step 6: the convolutional neural network system outputs in order the result for each frame of the speech to be measured; the VAD system makes a decision on the result and records the decision result;
the output of the output layer is [x, y]; when the VAD system makes its decision, if x is greater than y the frame is a speech frame and the VAD system outputs decision result 1, otherwise it is a non-speech frame and the VAD system outputs decision result 0.
3. The method of reducing VOIP communication resource overhead according to claim 2, characterized in that Step 1 specifically comprises the following steps:
Step 101: choose speech samples and label each frame of the speech samples as a speech frame or a non-speech frame;
a speech frame is labelled [1, 0] and a non-speech frame [0, 1], serving respectively as the ideal outputs of the output layer of the convolutional neural network system for speech frames and non-speech frames;
Step 102: for each labelled frame of speech, extract in order the speech feature parameters MFCC and their first-order difference feature parameters;
the n-th order parameter M(n) of the speech feature parameters MFCC is calculated as:

M(n) = Σ_{k=1}^{N} m_k · cos( π·n·(k − 0.5) / N ), n = 1, ..., L

where m_k is the input of the discrete cosine transform, k = 1, ..., N; N is the number of Mel filters; and L is the maximum order of the MFCC coefficients;
the t-th order parameter D(t) of the first-order difference parameters ΔMFCC is calculated as:

D(t) = Σ_{n=1}^{R} n·[M(t+n) − M(t−n)] / ( 2·Σ_{n=1}^{R} n² )

where L' is the maximum order of the ΔMFCC coefficients; M(t+n) and M(t−n) are the (t+n)-th and (t−n)-th order parameters of the MFCC; n = 1, ..., R, and R denotes the time difference of the first-order cepstral coefficients;
Step 103: compose in order the speech feature parameters MFCC of each frame and their first-order difference feature parameters into a one-dimensional feature map;
the one-dimensional feature map V is calculated as:

V = [MFCC | ΔMFCC] = [M(1), M(2), ..., M(n), ..., M(L), D(1), D(2), ..., D(t), ..., D(L')]

Step 104: bring the one-dimensional feature map of step 103 into the convolutional neural network system to obtain the decision result;
Step 105: for each frame of speech, compare in order the actual decision result of step 104 with the ideal output labelled in step 101 and calculate the decision error term;
for the n-th frame of the input speech, the actual output result of the convolutional neural network system is y(n) and the ideal output is denoted ŷ(n); the decision error term d(n) is calculated as:

d(n) = e(n) · { y(n) · [1 − y(n)] }

where e(n) = y(n) − ŷ(n) and n is the frame index of the sample speech;
Step 106: for the decision error term of each frame of speech, the convolutional neural network system back-propagates the error term and updates the weight matrices w and m, completing one training pass;
Step 107: input sample speech continually and repeat steps 101 to 106 until the output error δ of the convolutional neural network system is less than a preset value;
where S is the total number of speech frames input per training pass, y is the vector constituted by the actual output results of the S frames after the convolutional neural network system, and ŷ is the vector constituted by the ideal results of the S input frames;
Step 108: when the output error δ is less than the preset value, record the weight matrices w and m and store them in a file.
4. The method of reducing VOIP communication resource overhead according to claim 3, characterized in that step 104 specifically comprises:
Step 1041: input the one-dimensional feature map of each frame of the sample speech to the input layer of the convolutional neural network system;
Step 1042: obtain each one-dimensional feature map of the convolutional layer by convolving the one-dimensional feature map of the input layer with a convolution kernel;
each one-dimensional feature map Q_j of the convolutional layer is calculated as:

Q_j = σ(O * w_j), j = 1, ..., J

where Q_j is the j-th feature map of the convolutional layer; J is the number of feature-map dimensions of the convolutional layer and of the sub-sampling layer; O is the one-dimensional feature map of the input layer; * denotes the convolution operation; w_j denotes the weight matrix connecting the input layer to the j-th feature map of the convolutional layer, and all weight matrices w_j constitute the matrix w;
σ(x) is defined as:

σ(x) = 1 / (1 + e^(−x))

after convolution, each neuron on each one-dimensional feature map of the convolutional layer extracts the corresponding local feature of the input-layer feature map;
Step 1043: the sub-sampling layer performs a dimensionality-reduction operation on each one-dimensional feature map Q_j of the convolutional layer;
the neurons on each one-dimensional feature map of the convolutional layer are divided into equal groups, and the maximum neuron of each group is chosen as a neuron of the sub-sampling layer; the calculation formula is:

p_{j,m} = max( q_{j,(m−1)·s+1}, ..., q_{j,(m−1)·s+G} )

where p_{j,m} is the m-th neuron of the j-th feature map P_j of the sub-sampling layer; q_{j,m} is the m-th neuron of the j-th feature map Q_j of the convolutional layer; s is the window stride; G is the window size of the dimensionality-reduction operation from the convolutional layer to the sub-sampling layer; and the stride s takes the same value as the window size G;
Step 1044: each neuron of the output layer is connected to all neurons of the sub-sampling layer;
the output vector y is calculated as:

y = σ(M · P̂)

where P̂ denotes the feature map obtained by rearranging each feature map of the sub-sampling layer into a one-dimensional vector; P_j is the j-th feature map of the sub-sampling layer; M is the weight matrix connecting the one-dimensional feature vector to the output layer; and the output one-dimensional vector y is expressed as [x, y];
Step 1045: after the full connection, the output layer outputs the result [x, y] and the result is recorded.
5. The method of reducing VOIP communication resource overhead according to claim 3, characterized in that step 106 specifically comprises:
Step 1061: for the n-th frame of the input speech, transfer the error term d(n) in reverse to the sub-sampling layer to obtain the output error term sd(n) of the sub-sampling layer:

sd(n) = m^T(n) · d(n)

where m^T(n) is the transpose of m(n), and m(n) is the weight matrix connecting the output feature maps of the sub-sampling layer to the output layer when the n-th input speech frame is input to the convolutional neural network system;
Step 1062: split the output error term sd(n) into a J-dimensional matrix ŝd(n);
sd(n) = [sd_1, sd_2, ..., sd_M] is split to obtain the J-dimensional matrix ŝd(n); the process is as follows:

s_{j,i} = sd_{(j−1)·I+i}, j = 1, ..., J, i = 1, ..., I

where I is the number of neurons per one-dimensional feature map of the sub-sampling layer and M is the total number of neurons of the sub-sampling layer, M = J × I;
Step 1063: transfer the J-dimensional matrix ŝd(n) to the convolutional layer and expand it in dimension to obtain the error term c(n) of the convolutional layer;
where G is the window size of the dimensionality-reduction operation from the convolutional layer to the sub-sampling layer;
Step 1064: for the n-th input speech frame, calculate separately the error term of the weight matrix w(n) and the error term of the weight matrix m(n);
the error term of the weight matrix w(n) connecting the input layer and the convolutional layer is calculated as:

Δw(n) = o''(n) * c(n)

where o''(n) is the input feature map O of the n-th input speech frame rotated by 180 degrees;
the error term of the weight matrix m(n) connecting the sub-sampling layer and the output layer is calculated as:

Δm(n) = d(n) · P̂^T(n)

where P̂(n) is the one-dimensional feature vector constituted by the sub-sampling-layer output of the n-th input speech frame, and d(n) is the error vector between the actual output and the ideal output;
Step 1065: update the values of the weight matrices w(n) and m(n) corresponding to the n-th input speech frame, obtaining the weight matrices w(n+1) and m(n+1) corresponding to the (n+1)-th input speech frame:

w(n+1) = w(n) − α·Δw(n)
m(n+1) = m(n) − α·Δm(n)

where w(n) and m(n) respectively denote, for the n-th input speech frame, the weight matrix between the input layer and the convolutional layer and that between the sub-sampling layer and the output layer, and α is the learning rate.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510638936.1A CN105321525B (en) | 2015-09-30 | 2015-09-30 | A kind of system and method reducing VOIP communication resource expense |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105321525A CN105321525A (en) | 2016-02-10 |
CN105321525B true CN105321525B (en) | 2019-02-22 |
Family
ID=55248773
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510638936.1A Active CN105321525B (en) | 2015-09-30 | 2015-09-30 | A kind of system and method reducing VOIP communication resource expense |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105321525B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11080591B2 (en) | 2016-09-06 | 2021-08-03 | Deepmind Technologies Limited | Processing sequences using convolutional neural networks |
JP6577159B1 (en) * | 2016-09-06 | 2019-09-18 | ディープマインド テクノロジーズ リミテッド | Generating audio using neural networks |
EP3767547A1 (en) | 2016-09-06 | 2021-01-20 | Deepmind Technologies Limited | Processing sequences using convolutional neural networks |
CN110023963B (en) | 2016-10-26 | 2023-05-30 | 渊慧科技有限公司 | Processing text sequences using neural networks |
CN106683680B (en) * | 2017-03-10 | 2022-03-25 | 百度在线网络技术(北京)有限公司 | Speaker recognition method and device, computer equipment and computer readable medium |
CN106790300A (en) * | 2017-03-21 | 2017-05-31 | 青岛海信宽带多媒体技术有限公司 | A kind of method and apparatus conversed |
CN108962231B (en) * | 2018-07-04 | 2021-05-28 | 武汉斗鱼网络科技有限公司 | Voice classification method, device, server and storage medium |
CN108847251B (en) * | 2018-07-04 | 2022-12-02 | 武汉斗鱼网络科技有限公司 | Voice duplicate removal method, device, server and storage medium |
CN110017991B (en) * | 2019-05-13 | 2020-03-31 | 山东大学 | Rolling bearing fault classification method and system based on spectral kurtosis and neural network |
CN110223713A (en) * | 2019-06-11 | 2019-09-10 | 苏州思必驰信息科技有限公司 | Sound event detection model training method and sound event detection method |
CN110706694B (en) * | 2019-09-26 | 2022-04-08 | 成都数之联科技股份有限公司 | Voice endpoint detection method and system based on deep learning |
CN111210815B (en) * | 2019-11-28 | 2023-01-06 | 赵铭 | Deep neural network construction method for voice command word recognition, and recognition method and device |
CN111179975B (en) * | 2020-04-14 | 2020-08-04 | 深圳壹账通智能科技有限公司 | Voice endpoint detection method for emotion recognition, electronic device and storage medium |
CN114420111B (en) * | 2022-03-31 | 2022-06-17 | 成都启英泰伦科技有限公司 | One-dimensional hypothesis-based speech vector distance calculation method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101278337A (en) * | 2005-07-22 | 2008-10-01 | 索福特迈克斯有限公司 | Robust separation of speech signals in a noisy environment |
US7660713B2 (en) * | 2003-10-23 | 2010-02-09 | Microsoft Corporation | Systems and methods that detect a desired signal via a linear discriminative classifier that utilizes an estimated posterior signal-to-noise ratio (SNR) |
CN104021373A (en) * | 2014-05-27 | 2014-09-03 | 江苏大学 | Semi-supervised speech feature variable factor decomposition method |
CN104157290A (en) * | 2014-08-19 | 2014-11-19 | 大连理工大学 | Speaker recognition method based on depth learning |
Non-Patent Citations (3)
Title |
---|
"A fast neural net training algorithm and its application to voiced-unvoiced-silence classification of speech";Thea Ghiselli-Crippa,Amro EI-Jaroudi;《Proc.Int.Conf.ASSP,1991》;19911231;第1卷;全文 * |
"Denoising Deep Neural Networks Based Voice Activity Detection";Xiao-Lei Zhang,Ji Wu;《IEEE International Conference on Acoustics》;20130304;第32卷(第3期);全文 * |
"基于SDCNN的人脸识别研究";王璟尧;《中国优秀硕士学位论文全文数据库》;20150515(第05期);全文 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||