CN105321525A - System and method for reducing VOIP (voice over internet protocol) communication resource overhead - Google Patents

System and method for reducing VOIP (voice over internet protocol) communication resource overhead

Info

Publication number
CN105321525A
CN105321525A
Authority
CN
China
Prior art keywords
layer
frame
voice
input
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510638936.1A
Other languages
Chinese (zh)
Other versions
CN105321525B (en)
Inventor
王珂
孙瑞浩
张鹤立
李曦
纪红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201510638936.1A priority Critical patent/CN105321525B/en
Publication of CN105321525A publication Critical patent/CN105321525A/en
Application granted granted Critical
Publication of CN105321525B publication Critical patent/CN105321525B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 Detection of presence or absence of voice signals
    • G10L 25/84 Detection of presence or absence of voice signals for discriminating voice from noise

Abstract

The invention discloses a system for reducing VOIP (voice over internet protocol) communication resource overhead, comprising an input layer, a convolutional layer, a subsampling layer and an output layer; each layer is composed of feature maps, and each feature map contains neurons. A method of using the system to reduce VOIP communication resource overhead comprises the following steps: 1. train the convolutional neural network; 2. initialize the convolutional neural network; 3. input the speech to be detected into a VAD (voice activity detection) system; 4. extract the MFCC speech feature parameters and their first-order difference parameters from each frame in order; 5. assemble the parameters of each frame into a one-dimensional feature map and feed it into the convolutional neural network system; 6. the convolutional neural network system outputs a result [x, y] for each frame of the speech to be detected in order, and the VAD system makes a decision and records the results. By using a convolutional neural network inside the VAD system for detection, the system and method lower the misjudgment rate of the VAD system, save computation time and bandwidth, and reduce VOIP voice resource overhead while guaranteeing communication quality.

Description

System and method for reducing VOIP communication resource overhead
Technical field
The invention belongs to the field of communications and computing, and relates to a system and method for reducing VOIP communication resource overhead.
Background art
VoIP (Voice over Internet Protocol), also known as Voice over IP or IP telephony, digitizes the analog voice signal, converts it into IP packets, and transmits it in real time over an IP network in the form of data packets.
With the rapid development and continuous improvement of IP network technology, IP-based networks are gradually replacing traditional circuit-switched voice networks. When voice data is transmitted over an IP network under a fixed network bandwidth, improving channel utilization can improve voice quality to a great extent.
In the prior art, channel utilization is usually improved with voice activity detection (VAD) technology. Based on the alternation of speech and silence in everyday conversation, VAD suppresses transmission when silence is detected, so that little or no channel bandwidth is occupied, and performs compression coding and transmission only when a burst of active speech is detected. In fact, people speak for only 27%-40% of the time during a phone call. Removing the silent periods with VAD and keeping only the information-bearing parts can raise the effective utilization of channel bandwidth by more than 50%, so VAD has become a powerful tool in voice traffic transmission for reducing bandwidth and improving channel utilization.
In IP telephony applications, VAD is applied to the transmitted voice data: non-speech segments are removed during an audio session, avoiding the coding and transmission of silence packets and saving computation time and bandwidth.
The method used by current VAD technology is as follows. First, the current audio frame to be detected is received, and time-domain and frequency-domain parameters are extracted from it. A first distance is then computed between the time-domain parameters and their long-term sliding average over historical background noise frames, and a second distance between the frequency-domain parameters and their long-term sliding average over historical background noise frames. "Historical background noise frames" are background noise frames preceding the current frame, for example several consecutive background noise frames before it; if the current frame is the very first frame, a preset frame can be used as the historical background noise frame, or the first frame itself can serve as one. Finally, a set of decision polynomials based on the first and second distances determines whether the current audio frame is a foreground speech frame or a background noise frame; at least one coefficient in the decision polynomials is a variable determined by the voice activity detection mode and/or features of the input signal.
The above technique mainly computes values such as pitch, energy and tone information of the speech signal and compares them with standard thresholds: if a value exceeds the threshold the frame is judged to be speech, otherwise noise. Such algorithms are fairly simple, but they are only suitable for relatively simple noise sources. With complex noise sources, the above speech features mix easily with the noise, and the thresholds are difficult to adapt to changing conditions.
Summary of the invention
Addressing the shortcomings of existing VAD detection techniques, namely poor adaptability to environmental change and low decision accuracy in complex environments, the present invention proposes a system and method for reducing VOIP communication resource overhead.
A system for reducing VOIP communication resource overhead, also called the convolutional neural network system, comprises four layers in sequence: an input layer, a convolutional layer, a subsampling layer and an output layer. Each layer is composed of feature maps, and each one-dimensional feature map contains neurons.
The input layer takes the MFCC speech feature parameters of each frame of the speech under test together with their first-order difference parameters; these parameters form a one-dimensional feature map.
The convolutional layer has multiple feature maps, each obtained by convolving the input-layer feature map with a convolution kernel. Depending on the kernel size, several input-layer neurons map onto one convolutional-layer neuron, and the neurons at the same position in every feature map all connect to the corresponding neurons of the input layer.
The subsampling layer performs a dimensionality-reduction operation on each feature map of the convolutional layer: the neurons on each convolutional feature map are evenly grouped, and the maximum neuron of each group is taken as a neuron of the subsampling layer. The subsampling layer has the same number of feature maps as the convolutional layer, and the number of neurons per subsampling feature map is reduced to 1/3 to 2/3 of the number on the corresponding convolutional feature map.
The output layer is a one-dimensional feature map; each of its neurons is fully connected to all neurons of the subsampling layer, and it outputs the actual result for the speech under test.
A method for reducing VOIP communication resource overhead comprises the following steps:
Step 1: train the convolutional neural network system, and obtain and save the weight matrix w and the weight matrix m.
The specific steps are as follows:
Step 101: select speech samples and label each frame of the samples as a speech frame or a non-speech frame.
The speech samples are labelled by hand in order: speech frames are labelled [1, 0] and non-speech frames [0, 1], serving as the ideal outputs of the convolutional neural network system's output layer for speech frames and non-speech frames respectively.
Step 102: for each labelled frame, extract the MFCC speech feature parameters and their first-order difference parameters in order.
MFCC (Mel-scale Frequency Cepstral Coefficients) are extracted with a frame length of 20 ms.
The MFCC speech feature parameters are computed as:

$$M(n) = \sqrt{\frac{2}{N}} \sum_{k=1}^{N} m_k \cos\!\left(\frac{n\pi}{N}\,(k-0.5)\right), \quad n = 1, 2, 3, \ldots, L$$

where M(n) is the n-th order MFCC parameter; m_k, k = 1, ..., N, is the input of the discrete cosine transform; N is the number of Mel filters; and L is the maximum order of the MFCC coefficients.
Denote the first-order difference MFCC parameters by ΔMFCC; they are computed as:

$$D(t) = \frac{\sum_{n=1}^{R} n\,\bigl(M(t+n) - M(t-n)\bigr)}{\sqrt{2\sum_{n=1}^{R} n^{2}}}, \quad t = 1, 2, 3, \ldots, L'$$

where D(t) is the t-th order first-order difference MFCC parameter; L' is the maximum order of the ΔMFCC coefficients; M(t+n) and M(t-n) are the (t+n)-th and (t-n)-th order MFCC parameters; and n = 1, ..., R, where R is the time difference of the first-order cepstral coefficients.
Step 103: assemble the MFCC parameters of each frame and their first-order difference parameters into a one-dimensional feature map, in order.
The one-dimensional feature map V is computed as:

V = [MFCC | ΔMFCC] = [M(1), M(2), ..., M(L), D(1), D(2), ..., D(L')]
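For concreteness, the following is a minimal numpy sketch of steps 102-103 under stated assumptions: the Mel filter-bank log-energies m_k are assumed to be already computed, the edge handling of the difference formula is assumed to be zero padding (the patent does not specify it), and all function names are illustrative.

```python
import numpy as np

def mfcc(m, L=13):
    """M(n) = sqrt(2/N) * sum_k m_k * cos(n*pi/N * (k - 0.5)), n = 1..L,
    where m holds the N Mel filter-bank log-energies (the DCT input)."""
    N = len(m)
    n = np.arange(1, L + 1)[:, None]      # orders 1..L as a column
    k = np.arange(1, N + 1)[None, :]      # filter indices 1..N as a row
    return np.sqrt(2.0 / N) * (np.cos(n * np.pi / N * (k - 0.5)) @ m)

def delta_mfcc(M, R=2):
    """D(t) = sum_n n*(M(t+n) - M(t-n)) / sqrt(2*sum_n n^2), t = 1..L;
    out-of-range orders are treated as zero (an assumption)."""
    L = len(M)
    Mp = np.concatenate([np.zeros(R), M, np.zeros(R)])  # pad so M(t±n) exists
    denom = np.sqrt(2.0 * sum(n * n for n in range(1, R + 1)))
    return np.array([sum(n * (Mp[R + t + n] - Mp[R + t - n])
                         for n in range(1, R + 1)) / denom
                     for t in range(L)])

def feature_map(m):
    """Step 103: V = [MFCC | delta-MFCC], a 1 x 26 map for L = L' = 13."""
    M = mfcc(m)
    return np.concatenate([M, delta_mfcc(M)])
```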
Step 104: feed the one-dimensional feature map of step 103 into the convolutional neural network system and obtain the decision result.
The specific steps are:
Step 1041: input the one-dimensional feature map of each frame of the sample speech into the input layer of the convolutional neural network system.
Step 1042: obtain each feature map of the convolutional layer by convolving the one-dimensional feature map of the input layer with a convolution kernel.
Each convolutional feature map Q_j is computed as:

Q_j = σ(O * w_j), j = 1, ..., J

where Q_j is the j-th feature map of the convolutional layer; J is the number of convolutional feature maps; O is the one-dimensional feature map of the input layer; * denotes convolution; and w_j is the weight matrix connecting the input layer to the neurons of the j-th convolutional feature map, all the w_j together forming the matrix w. σ(x) is the activation function, chosen as the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
After the convolution, each neuron on a convolutional feature map extracts the corresponding local feature of the input-layer feature map.
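A sketch of step 1042 in numpy, assuming "valid" one-dimensional convolution with no padding, which matches the neuron counts given in the embodiment (26 inputs, 1×6 kernels, 21 outputs per map). The sliding-window (correlation) form is used for simplicity; this convention is an assumption, noting that the patent applies a 180° rotation during back-propagation.

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_layer(O, w):
    """Step 1042: Q_j = sigma(O * w_j). O is the 1 x 26 input map, w is J x K
    (twelve 1 x 6 kernels in the embodiment); returns the J x 21 maps Q."""
    K = w.shape[1]
    windows = np.stack([O[i:i + K] for i in range(len(O) - K + 1)])  # 21 x K
    return sigmoid(windows @ w.T).T                                  # J x 21
```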
Step 1043: the subsampling layer performs a dimensionality-reduction operation on each convolutional feature map Q_j.
The neurons on each convolutional feature map are evenly grouped, and the maximum neuron of each group is taken as a neuron of the subsampling layer:

$$p_{j,m} = \max_{n=1,\ldots,G} q_{j,(m-1)\times s+n}$$

where p_{j,m} is the m-th neuron of the j-th subsampling feature map P_j; q_{j,m} is the m-th neuron of the j-th convolutional feature map Q_j; s is the window stride; and G is the window size of the dimensionality-reduction operation from the convolutional layer to the subsampling layer. The stride s is set equal to the window size G.
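Step 1043 as a numpy sketch: non-overlapping max pooling, using G = s as the text specifies; the truncation of any trailing partial window is an assumption.

```python
def subsample(Q, G=3):
    """Step 1043: p_{j,m} = max over each window of G neurons (stride s = G)."""
    J, T = Q.shape
    M = T // G                               # 21 // 3 = 7 neurons per map
    return Q[:, :M * G].reshape(J, M, G).max(axis=2)
```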
Step 1044: each neuron of the output layer is connected to all neurons of the subsampling layer.
The output is computed as:

$$y = \sigma(\hat{P} \cdot m)$$

where y is the output vector; $\hat{P}$ is the one-dimensional vector generated from the feature maps P_j, j = 1, ..., J, of the subsampling layer; J is the number of subsampling feature maps; m is the weight matrix connecting this vector to the output-layer neurons; and σ(x) is the sigmoid activation function defined above.
Step 1045: after the full connection, the output layer produces its output and the result is recorded.
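Steps 1044-1045 sketched in numpy: the pooled maps are flattened into the vector P̂ and passed through the fully connected sigmoid layer. A small forward helper composes the three stages for later sketches; its name and the row-major flattening order are assumptions.

```python
def output_layer(P, m):
    """Step 1044: y = sigma(P_hat . m); m is 2 x (J*I), P_hat has J*I entries."""
    return sigmoid(m @ P.reshape(-1))

def forward(V, w, m, G=3):
    """One full pass: input map -> convolution -> subsampling -> output [x, y]."""
    Q = conv_layer(V, w)
    P = subsample(Q, G)
    return output_layer(P, m), (V, Q, P)
```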
Step 105: for each frame, compare the actual decision result of step 104 with the ideal output labelled in step 101, in order, and compute the decision error term.
For the n-th frame of the input speech, the actual output of the convolutional neural network system is y(n), and the ideal output is denoted $\hat{y}(n)$. The decision error term d(n) is computed as:

d(n) = e(n)·{y(n)·[1 - y(n)]}

where e(n) = $\hat{y}(n)$ - y(n), and n is the frame index of the sample speech.
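Step 105 in code. The sign of e(n) is written here so that the later update w ← w − αΔw descends the squared error; that sign convention is an assumption, since the garbled source does not fix it.

```python
def decision_error(y, y_ideal):
    """Step 105: d(n) = e(n) * y(n) * (1 - y(n)), the squared-error gradient
    taken through the output sigmoid. e(n) = y(n) - y_ideal(n) here so that
    subtracting alpha * gradient in step 1065 is a descent step."""
    return (y - y_ideal) * y * (1.0 - y)
```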
Step 106: for the decision error term of each frame, the convolutional neural network system back-propagates the error term and updates the weight matrices w and m, completing one training pass.
Step 1061: for the n-th frame of the input speech, propagate the error term d(n) back to the subsampling layer to obtain the output error term sd(n) of the subsampling layer:

sd(n) = m^T(n)·d(n)

where m^T(n) is the transpose of m(n), and m(n) is the weight matrix connecting the output feature maps of the subsampling layer to the output-layer neurons when the n-th frame of the input speech is input to the convolutional neural network system.
Step 1062: split the output error term sd(n) into the J × I matrix $\widetilde{sd}(n)$.
Writing sd(n) = [sd_1, sd_2, ..., sd_M], the split arranges it row by row into the J × I matrix $\widetilde{sd}(n)$ with elements

$$s_{j,i} = sd_{(j-1)\times I + i}, \quad j = 1, \ldots, J, \; i = 1, \ldots, I$$

where I is the number of neurons per subsampling feature map and M is the total number of subsampling-layer neurons, M = J × I.
Step 1063: transfer the J × I matrix $\widetilde{sd}(n)$ to the convolutional layer and expand its dimensions to obtain the error term c(n) of the convolutional layer.
The expansion reverses the dimensionality reduction of step 1043: each matrix element is expanded by the window size G of the dimensionality-reduction operation from the convolutional layer to the subsampling layer.
Step 1064: for the n-th frame of the input speech, compute the error terms of the weight matrices w(n) and m(n).
The error term of the weight matrix w(n) connecting the input layer to the convolutional layer is:

Δw(n) = o''(n) * c(n)

where o''(n) is the input feature map O of the input layer for the n-th input frame rotated by 180°.
The error term of the weight matrix m(n) connecting the subsampling layer to the output layer is:

$$\Delta m(n) = d(n) \cdot \hat{P}(n)$$

where $\hat{P}(n)$ is the one-dimensional vector formed from the subsampling-layer output for the n-th input frame, and d(n) is the error term between the actual output and the ideal output.
Step 1065: update the weight matrices w(n) and m(n) corresponding to the n-th input frame to obtain the weight matrices w(n+1) and m(n+1) for the (n+1)-th input frame:

w(n+1) = w(n) - αΔw(n)
m(n+1) = m(n) - αΔm(n)

where w(n) and m(n) are, respectively, the weight matrix between the input layer and the convolutional layer and the weight matrix between the subsampling layer and the output layer for the n-th input frame, and α is the learning rate, preferably α = 1.
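Steps 1061-1065 as a single numpy sketch. Two simplifications are assumptions: the pooling error is spread uniformly over each window of size G rather than routed only to the max position, and the sigmoid derivative of the convolutional layer is included so the kernel gradient is complete. The learning rate defaults to α = 1 as preferred.

```python
def backprop_step(V, Q, P, y, y_ideal, w, m, G=3, alpha=1.0):
    """One weight update for one frame (steps 1061-1065)."""
    d = decision_error(y, y_ideal)            # step 105: output error term
    P_hat = P.reshape(-1)
    sd = m.T @ d                              # step 1061: back to subsampling layer
    sd = sd.reshape(P.shape)                  # step 1062: split into J x I
    c = np.repeat(sd, G, axis=1)              # step 1063: expand by window G
    c = c * Q * (1.0 - Q)                     # sigmoid derivative (an assumption)
    K = w.shape[1]
    windows = np.stack([V[i:i + K] for i in range(len(V) - K + 1)])
    dw = c @ windows                          # step 1064: Delta-w, shape J x K
    dm = np.outer(d, P_hat)                   # step 1064: Delta-m, shape 2 x (J*I)
    return w - alpha * dw, m - alpha * dm     # step 1065: gradient-descent update
```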
Step 107: keep inputting sample speech and repeat steps 101-106 until the output error δ of the convolutional neural network system is below a preset value:

$$\delta = \frac{1}{2} \sum_{i=1}^{S} (\hat{y}_i - y_i)^2$$

where S is the total number of speech frames input per training pass, y is the vector formed by the decision results of the S frames, and $\hat{y}$ is the vector formed by their ideal decision results.
Step 108: when the output error δ is below the preset value, record the weight matrices w and m and save them to a file.
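Steps 107-108 sketched as a loop over the labelled sample frames, using the per-frame helpers above. `np.save` and the file names stand in for the patent's "save to a file", and the 5% threshold from the embodiment is used; both are illustrative.

```python
def train(frames, labels, w, m, tol=0.05):
    """Repeat steps 101-106 until the output error delta < tol, then save w, m."""
    while True:
        outputs = []
        for V, y_ideal in zip(frames, labels):
            y, (V_, Q, P) = forward(V, w, m)
            w, m = backprop_step(V_, Q, P, y, y_ideal, w, m)
            outputs.append(y)
        delta = 0.5 * np.sum((np.array(labels) - np.array(outputs)) ** 2)
        if delta < tol:                       # step 107: stop criterion
            break
    np.save("w.npy", w)                       # step 108: persist the weights
    np.save("m.npy", m)
    return w, m
```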
Step 2: read the weight matrices w and m from the file and initialize the convolutional neural network system.
Step 3: input the speech to be detected into the VAD system.
Step 4: extract the MFCC speech feature parameters and their first-order difference parameters from each frame of the speech to be detected, in order.
Step 5: assemble the MFCC parameters of each frame of the speech to be detected and their first-order difference parameters into one-dimensional feature maps and feed them into the convolutional neural network system, in order.
Step 6: the convolutional neural network system outputs the result [x, y] for each frame of the speech to be detected in order; the VAD system makes a decision on [x, y] and records the decision.
The VAD system judges the output [x, y] of the output layer: if x > y, the frame is a speech frame and the VAD system outputs decision 1; otherwise it is a non-speech frame and the VAD system outputs decision 0.
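Steps 2-6 of the detection phase, sketched with the helpers above: reload the saved weights, run each frame's feature map through the network, and record the hard decision per frame.

```python
def vad(frames):
    """Steps 2-6: initialize from file, then decide 1 (speech) if x > y else 0."""
    w, m = np.load("w.npy"), np.load("m.npy")   # step 2: initialize the network
    decisions = []
    for V in frames:                             # steps 3-5: per-frame feature maps
        (x, y), _ = forward(V, w, m)             # step 6: output [x, y]
        decisions.append(1 if x > y else 0)
    return decisions
```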
The advantages of the invention are:
(1) In the system for reducing VOIP communication resource overhead, once the convolutional neural network system is trained it needs no retraining; VAD detection can be performed after simply initializing the network with the weights imported from the file, so the computational complexity is low.
(2) In the method for reducing VOIP communication resource overhead, the convolutional neural network system is incorporated into the VAD system for detection, making full use of its learning ability; the misjudgment rate of the VAD system is reduced and computation time and bandwidth are saved, so the voice resource overhead of VOIP can be reduced while guaranteeing communication quality.
Brief description of the drawings
Fig. 1 is a schematic diagram of the architecture of the system for reducing VOIP communication resource overhead according to the invention.
Fig. 2 is a flowchart of the method for reducing VOIP communication resource overhead according to the invention.
Fig. 3 is a flowchart of training the convolutional neural network system of the invention.
Fig. 4 is a flowchart of the convolutional neural network system of the invention.
Fig. 5 is a flowchart of the back-propagation of the convolutional neural network system of the invention.
Detailed description
The present invention is described in further detail below with reference to the drawings and embodiments.
In a VOIP call system, VAD detection is performed before speech coding: non-speech segments are removed, avoiding the coding and transmission of silence packets and saving computation time and bandwidth. In the speech coding stage, segments judged by VAD to be non-speech frames can be coded sparsely or not at all, reducing the communication bandwidth resources.
The system and method for reducing VOIP communication resource overhead apply the convolutional neural network system, with its simple architecture, few training parameters and strong adaptability, to the VAD system. By training the convolutional neural network system and using it as a two-dimensional decision device, the false-alarm and missed-detection rates of the VAD system are reduced, so VOIP communication resource overhead is reduced while communication quality is guaranteed.
A system for reducing VOIP communication resource overhead, also called the convolutional neural network system, comprises four layers as shown in Fig. 1: an input layer, a convolutional layer, a subsampling layer and an output layer. Each layer is composed of several feature maps, and each feature map contains several neurons.
The input layer is a one-dimensional feature map with 26 neurons, denoted 1×26. The convolutional layer has 12 feature maps with 21 neurons each, denoted 12×21. The subsampling layer has the same number of feature maps as the convolutional layer, and the number of neurons per feature map is one third of that of the convolutional layer, denoted 12×7. The output layer is a one-dimensional feature map with 2 neurons, denoted 1×2.
The input layer takes the one-dimensional feature map formed from the MFCC speech feature parameters of each frame of the speech under test and their first-order difference parameters.
The convolutional layer has 12 feature maps, each obtained by convolving the input-layer feature map with a convolution kernel. This embodiment uses twelve 1×6 convolution kernels. After the convolution, the first neuron of the first convolutional feature map corresponds to the first 6 neurons of the input layer, and so on. On each convolutional feature map, the neurons at the same position connect to the corresponding 6 neurons of the input-layer feature map, all with the same connection pattern.
The subsampling layer performs a dimensionality-reduction operation on each feature map of the convolutional layer: the neurons on each convolutional feature map are evenly grouped, and the maximum neuron of each group is taken as a neuron of the subsampling layer. The subsampling layer has the same number of feature maps as the convolutional layer, and the number of neurons per subsampling feature map is reduced to 1/3 to 2/3 of the number on the corresponding convolutional feature map.
The two neurons of the output layer are fully connected to all neurons of the subsampling layer and output the actual result for the speech under test.
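A quick shape check of the embodiment's 1×26 → 12×21 → 12×7 → 1×2 pipeline using the sketches above, with random weights standing in for trained ones.

```python
rng = np.random.default_rng(0)
w = rng.standard_normal((12, 6))   # twelve 1x6 kernels
m = rng.standard_normal((2, 84))   # full connection: 12*7 = 84 inputs -> 2 outputs
V = rng.standard_normal(26)        # 13 MFCC + 13 delta-MFCC for one frame
Q = conv_layer(V, w)
P = subsample(Q)
assert Q.shape == (12, 21) and P.shape == (12, 7)
print(output_layer(P, m))          # the [x, y] decision pair
```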
The method of applying the convolutional neural network system to VAD detection to reduce VOIP communication resource overhead is shown in Fig. 2; the steps are:
Step 1: train the convolutional neural network system, and obtain and save the weight matrix w and the weight matrix m.
As shown in Fig. 3, the specific steps are as follows:
Step 101: select speech samples and label each frame of the samples as a speech frame or a non-speech frame.
The speech samples are selected from the speech portion of the Clemson University CUAVE database and the noisy speech corpus NOIZEUS.
The speech samples are labelled by hand in order, one frame every 20 ms: speech frames are labelled [1, 0] and non-speech frames [0, 1], serving as the ideal outputs of the convolutional neural network system's output layer for speech frames and non-speech frames respectively.
Step 102: for each labelled frame, extract the MFCC speech feature parameters and their first-order difference parameters in order.
MFCC (Mel-scale Frequency Cepstral Coefficients) are extracted with a frame length of 20 ms.
The MFCC speech feature parameters are computed as:

$$M(n) = \sqrt{\frac{2}{N}} \sum_{k=1}^{N} m_k \cos\!\left(\frac{n\pi}{N}\,(k-0.5)\right), \quad n = 1, 2, 3, \ldots, L$$

where M(n) is the n-th order MFCC parameter; m_k, k = 1, ..., N, is the input of the discrete cosine transform; N is the number of Mel filters, preferably 24; and L is the maximum order of the MFCC coefficients, preferably 13.
Denote the first-order difference MFCC parameters by ΔMFCC; they are computed as:

$$D(t) = \frac{\sum_{n=1}^{R} n\,\bigl(M(t+n) - M(t-n)\bigr)}{\sqrt{2\sum_{n=1}^{R} n^{2}}}, \quad t = 1, 2, 3, \ldots, L'$$

where D(t) is the t-th order first-order difference MFCC parameter; L' is the maximum order of the ΔMFCC coefficients; M(t+n) and M(t-n) are the (t+n)-th and (t-n)-th order MFCC parameters; and n = 1, ..., R, where R is the time difference of the first-order cepstral coefficients and can take the value 1 or 2, preferably 2.
Step 103: assemble the MFCC parameters of each frame and their first-order difference parameters into a one-dimensional feature map, in order.
The one-dimensional feature map is composed from MFCC and ΔMFCC as follows:

V = [MFCC | ΔMFCC] = [M(1), M(2), ..., M(L), D(1), D(2), ..., D(L')]

where V is the resulting one-dimensional feature map, M(n) is the n-th order MFCC parameter, and D(t) is the t-th order ΔMFCC parameter.
Step 104: feed the one-dimensional feature map of step 103 into the convolutional neural network system and obtain the decision result.
As shown in Fig. 4, the specific steps are:
Step 1041: input the one-dimensional feature map of each frame of the sample speech into the input layer of the convolutional neural network system.
The MFCC speech feature parameters of a frame of the sample speech and their first-order difference parameters form a 1×26 feature map, which is input to the convolutional neural network system as a one-dimensional vector.
Step 1042: obtain each feature map of the convolutional layer by convolving the input-layer feature map with a convolution kernel.
The kernel size is set as required; this embodiment uses twelve 1×6 kernels.
After the convolution, the number of neurons per convolutional feature map is reduced to about 80% of the number of input-layer neurons. The weight matrix connecting the input layer to the neurons of the j-th convolutional feature map is denoted w_j; all the w_j together form the matrix w.
Each convolutional feature map Q_j is computed as:

Q_j = σ(O * w_j), j = 1, ..., J

where Q_j is the j-th feature map of the convolutional layer; J is the number of convolutional feature maps, 12 in this embodiment; O is the feature map of the input layer; * denotes convolution; and σ(x) is the activation function, chosen as the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The 12 convolutional feature maps are obtained by convolving the input-layer feature map with the twelve 1×6 kernels. Because the neurons on each convolutional feature map share the same connection pattern with the input-layer feature map, each neuron on a convolutional feature map extracts the corresponding local feature of the input-layer feature map.
Step 1043: the subsampling layer performs a dimensionality-reduction operation on each convolutional feature map Q_j.
The neurons on each convolutional feature map are evenly grouped, and the maximum neuron of each group is taken as a neuron of the subsampling layer:

$$p_{j,m} = \max_{n=1,\ldots,G} q_{j,(m-1)\times s+n}$$

where p_{j,m} is the m-th neuron of the j-th subsampling feature map P_j; q_{j,m} is the m-th neuron of the j-th convolutional feature map Q_j; s is the window stride; and G is the window size of the dimensionality-reduction operation from the convolutional layer to the subsampling layer, i.e. the number of neurons per group in the even grouping. The stride s equals the window size G; this embodiment uses G = s = 3.
After the dimensionality reduction, the subsampling layer has 12×7 neurons, and the number of neurons per subsampling feature map is 1/3 to 2/3 of that per convolutional feature map, reducing the amount of computation.
Step 1044: each neuron of the output layer is connected to all neurons of the subsampling layer.
A full connection is used between the output of the subsampling layer and the input of the output layer:

$$y = \sigma(\hat{P} \cdot m)$$

where y is the output vector; $\hat{P}$ is the one-dimensional vector generated from the feature maps P_j, j = 1, ..., J, of the subsampling layer; J is the number of subsampling feature maps; m is the weight matrix connecting this vector to the output-layer neurons; and σ(x) is the sigmoid activation function defined above.
Step 1045: after the full connection, the output of the output layer is expressed as [x, y], and the result is recorded.
Step 105: for each frame, compare the actual decision result of step 104 with the ideal output labelled in step 101, in order, and compute the decision error term.
For the n-th speech frame, the actual decision output of the convolutional neural network system is y(n), and the ideal output is denoted $\hat{y}(n)$; the decision error term d(n) is computed as:

d(n) = e(n)·{y(n)·[1 - y(n)]}

where e(n) = $\hat{y}(n)$ - y(n), and n is the frame index of the sample speech.
Step 106: for the decision error term of each frame, the convolutional neural network system back-propagates the error term and updates the weight matrices w and m, completing one training pass.
As shown in Fig. 5, the specific steps are as follows:
Step 1061: for the n-th frame of the input speech, propagate the error term d(n) back to the subsampling layer to obtain the output error term sd(n) of the subsampling layer:

sd(n) = m^T(n)·d(n)

where m^T(n) is the transpose of m(n), and m(n) is the weight matrix connecting the subsampling feature maps of the n-th speech frame to the output-layer neurons.
Step 1062: split the output error term sd(n) into the J × I matrix $\widetilde{sd}(n)$.
sd(n) is a one-dimensional matrix; because the subsampling layer has J feature maps, it must be split into a J × I matrix.
Writing sd(n) = [sd_1, sd_2, ..., sd_M], the split arranges it row by row into the J × I matrix $\widetilde{sd}(n)$ with elements

$$s_{j,i} = sd_{(j-1)\times I + i}, \quad j = 1, \ldots, J, \; i = 1, \ldots, I$$

where I is the number of neurons per subsampling feature map, and M is the total number of subsampling-layer neurons, M = J × I.
Step 1063: transfer the J × I matrix $\widetilde{sd}(n)$ to the convolutional layer and expand its dimensions to obtain the error term c(n) of the convolutional layer.
The expansion reverses the dimensionality reduction of step 1043: each matrix element is expanded by the window size G of the dimensionality-reduction operation from the convolutional layer to the subsampling layer.
Step 1064: for the n-th frame of the input speech, compute the error terms of the weight matrices w(n) and m(n).
The error term of the weight matrix w(n) connecting the input layer to the convolutional layer is:

Δw(n) = o''(n) * c(n)

where o''(n) is the input feature map O of the input layer for the n-th input frame rotated by 180°.
The error term of the weight matrix m(n) connecting the subsampling layer to the output layer is:

$$\Delta m(n) = d(n) \cdot \hat{P}(n)$$

where $\hat{P}(n)$ is the one-dimensional vector formed from the subsampling-layer output for the n-th input frame, and d(n) is the error term between the actual output and the ideal output.
Step 1065: update the weight matrices w(n) and m(n) corresponding to the n-th input frame to obtain the weight matrices w(n+1) and m(n+1) for the (n+1)-th input frame:

w(n+1) = w(n) - αΔw(n)
m(n+1) = m(n) - αΔm(n)

where w(n) and m(n) are, respectively, the weight matrix between the input layer and the convolutional layer and the weight matrix between the subsampling layer and the output layer for the n-th input frame, and α is the learning rate, preferably α = 1.
Step 107: keep inputting sample speech and repeat steps 101-106 until the output error δ of the convolutional neural network system is below a preset value:

$$\delta = \frac{1}{2} \sum_{i=1}^{S} (\hat{y}_i - y_i)^2$$

where S is the total number of speech frames input per training pass, y is the vector formed by the decision results of the S frames, and $\hat{y}$ is the vector formed by their ideal decision results. The preset value is 5%.
Step 108: when the output error δ is below the preset value, record the weight matrices w and m and save them to a file.
Step 2: read the weight matrices w and m from the file and initialize the convolutional neural network system.
Step 3: input the speech to be detected into the VAD system.
The speech on which the VAD decision is to be made is input into the VAD system.
Step 4: extract the MFCC speech feature parameters and their first-order difference parameters from each frame of the speech to be detected, in order.
Step 5: assemble the feature parameters of each frame of the speech to be detected into one-dimensional feature maps and feed them into the convolutional neural network system, in order.
The one-dimensional feature map is composed from MFCC and ΔMFCC as follows:

V = [MFCC | ΔMFCC] = [M(1), M(2), ..., M(L), D(1), D(2), ..., D(L')]
Step 6: the convolutional neural network system outputs the result [x, y] for each frame of the speech to be detected in order; the VAD system makes a decision on [x, y] and records the decision.
The speech is divided into 20 ms frames. For each frame of speech data, the output of the output layer of the convolutional neural network system is [x, y]. In the VAD decision, if x > y, the frame is a speech frame and the VAD system outputs 1; otherwise it is a non-speech frame and the VAD system outputs decision 0.
For example, if the output-layer result for a frame of the speech under test is [0.98, 0.02], then since 0.98 > 0.02 the VAD decision is that this frame is a speech frame, and the VAD system outputs decision 1.
The invention implements a VAD detection technique that uses speech feature parameters, such as MFCC, MFCC with its first-order difference parameters, or MFCC with its first- and second-order difference parameters, as the input of a convolutional neural network, and uses the convolutional neural network as a two-dimensional decision device.
With this technique, the voice resource overhead of VOIP can be reduced while communication quality is guaranteed. The feature extraction stage can instead use LPCC, LPCC with its first-order difference parameters, LPCC with its first- and second-order difference parameters, or a combination of MFCC and LPCC. The number of feature parameters may differ from the reference values given. The numbers of neurons in the convolutional and subsampling layers may differ, the convolution kernel size may differ, and the convolutional and subsampling layers may each comprise multiple layers.

Claims (5)

1. A system for reducing VOIP communication resource overhead, characterized in that the system is a convolutional neural network system composed of an input layer, a convolutional layer, a subsampling layer and an output layer; each layer is composed of feature maps, and each one-dimensional feature map contains neurons; VOIP refers to voice over IP;
the input layer takes the MFCC speech feature parameters of each frame of the speech under test and the first-order difference parameters of the MFCC, these parameters forming a one-dimensional feature map; MFCC refers to Mel-frequency cepstral coefficients;
each feature map of the convolutional layer is obtained by convolving the feature map of the input layer with a convolution kernel, and the neurons at the same position in every feature map all connect to the corresponding neurons of the input layer;
the subsampling layer performs a dimensionality-reduction operation on each feature map of the convolutional layer: the neurons on each convolutional feature map are evenly grouped, the maximum neuron of each group is taken as a neuron of the subsampling layer, and the number of neurons per subsampling feature map is reduced to 1/3 to 2/3 of the number per convolutional feature map;
the output layer is a one-dimensional feature map; each of its neurons is fully connected to all neurons of the subsampling layer, and it outputs the actual result for the speech under test.
2. A method of reducing VOIP communication resource overhead applying the system for reducing VOIP communication resource overhead of claim 1, characterized in that the steps are:
step 1: train the convolutional neural network system, and obtain and save the weight matrix w connecting the input layer to the convolutional-layer neurons and the weight matrix m connecting the subsampling layer to the output-layer neurons;
step 2: read the weight matrices w and m and initialize the convolutional neural network system;
step 3: input the speech to be detected into a VAD system, where VAD refers to voice activity detection;
step 4: extract the MFCC speech feature parameters and their first-order difference parameters from each frame of the speech to be detected, in order;
step 5: assemble the MFCC parameters of each frame of the speech to be detected and their first-order difference parameters into one-dimensional feature maps and feed them into the convolutional neural network system, in order;
step 6: the convolutional neural network system outputs the result for each frame of the speech to be detected in order, and the VAD system makes a decision on the result and records it;
the output of the output layer is [x, y]; in the VAD decision, if x is greater than y, the frame is a speech frame and the VAD system outputs decision 1; otherwise it is a non-speech frame and the VAD system outputs decision 0.
3. The method of reducing VOIP communication resource overhead of claim 2, characterized in that step 1 comprises the following steps:
step 101: select speech samples and label each frame of the samples as a speech frame or a non-speech frame;
speech frames are labelled [1, 0] and non-speech frames [0, 1], serving as the ideal outputs of the convolutional neural network system's output layer for speech frames and non-speech frames respectively;
step 102: for each labelled frame, extract the MFCC speech feature parameters and their first-order difference parameters in order;
the n-th order MFCC parameter M(n) is computed as:

$$M(n) = \sqrt{\frac{2}{N}} \sum_{k=1}^{N} m_k \cos\!\left(\frac{n\pi}{N}\,(k-0.5)\right), \quad n = 1, 2, 3, \ldots, L$$

where m_k, k = 1, ..., N, is the input of the discrete cosine transform; N is the number of Mel filters; and L is the maximum order of the MFCC coefficients;
the t-th order parameter D(t) of the first-order difference parameters ΔMFCC is computed as:

$$D(t) = \frac{\sum_{n=1}^{R} n\,\bigl(M(t+n) - M(t-n)\bigr)}{\sqrt{2\sum_{n=1}^{R} n^{2}}}, \quad t = 1, 2, 3, \ldots, L'$$

where L' is the maximum order of the ΔMFCC coefficients; M(t+n) and M(t-n) are the (t+n)-th and (t-n)-th order MFCC parameters; and n = 1, ..., R, where R is the time difference of the first-order cepstral coefficients;
step 103: assemble the MFCC parameters of each frame and their first-order difference parameters into a one-dimensional feature map, in order;
the one-dimensional feature map V is computed as:

V = [MFCC | ΔMFCC] = [M(1), M(2), ..., M(L), D(1), D(2), ..., D(L')]

step 104: feed the one-dimensional feature map of step 103 into the convolutional neural network system and obtain the decision result;
step 105: for each frame, compare the actual decision result of step 104 with the ideal output labelled in step 101, in order, and compute the decision error term;
for the n-th frame of the input speech, the actual output of the convolutional neural network system is y(n), and the ideal output is denoted $\hat{y}(n)$; the decision error term d(n) is computed as:

d(n) = e(n)·{y(n)·[1 - y(n)]}

where e(n) = $\hat{y}(n)$ - y(n), and n is the frame index of the sample speech;
step 106: for the decision error term of each frame, the convolutional neural network system back-propagates the error term and updates the weight matrices w and m, completing one training pass;
step 107: keep inputting sample speech and repeat steps 101-106 until the output error δ of the convolutional neural network system is below a preset value:

$$\delta = \frac{1}{2} \sum_{i=1}^{S} (\hat{y}_i - y_i)^2$$

where S is the total number of speech frames input per training pass; y is the vector formed by the actual outputs of the S frames after passing through the convolutional neural network system; and $\hat{y}$ is the vector formed by the ideal results of the S input frames;
step 108: when the output error δ is below the preset value, record the weight matrices w and m and save them to a file.
4. The method of reducing VOIP communication resource overhead of claim 3, characterized in that step 104 comprises:
step 1041: input the one-dimensional feature map of each frame of the sample speech into the input layer of the convolutional neural network system;
step 1042: obtain each feature map of the convolutional layer by convolving the one-dimensional feature map of the input layer with a convolution kernel;
each convolutional feature map Q_j is computed as:

Q_j = σ(O * w_j), j = 1, ..., J

where Q_j is the j-th feature map of the convolutional layer; J is the number of convolutional feature maps, which also equals the number of subsampling feature maps; O is the one-dimensional feature map of the input layer; * denotes convolution; and w_j is the weight matrix connecting the input layer to the neurons of the j-th convolutional feature map, all the w_j together forming the matrix w;
σ(x) is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

after the convolution, each neuron on a convolutional feature map extracts the corresponding local feature of the input-layer feature map;
step 1043: the subsampling layer performs a dimensionality-reduction operation on each convolutional feature map Q_j;
the neurons on each convolutional feature map are evenly grouped, and the maximum neuron of each group is taken as a neuron of the subsampling layer:

$$p_{j,m} = \max_{n=1,\ldots,G} q_{j,(m-1)\times s+n}$$

where p_{j,m} is the m-th neuron of the j-th subsampling feature map P_j; q_{j,m} is the m-th neuron of the j-th convolutional feature map Q_j; s is the window stride; and G is the window size of the dimensionality-reduction operation from the convolutional layer to the subsampling layer, the stride s being equal to the window size G;
step 1044: each neuron of the output layer is connected to all neurons of the subsampling layer;
the output vector y is computed as:

$$y = \sigma(\hat{P} \cdot m)$$

where $\hat{P}$ is the one-dimensional vector generated from the feature maps P_j, j = 1, ..., J, of the subsampling layer; m is the weight matrix connecting this vector to the output-layer neurons; and the output one-dimensional vector y is written as [x, y];
step 1045: after the full connection, the output layer outputs the result [x, y] and the result is recorded.
5. The method of reducing VOIP communication resource overhead of claim 3, characterized in that step 106 comprises:
step 1061: for the n-th frame of the input speech, propagate the error term d(n) back to the subsampling layer to obtain the output error term sd(n) of the subsampling layer:

sd(n) = m^T(n)·d(n)

where m^T(n) is the transpose of m(n), and m(n) is the weight matrix connecting the output feature maps of the subsampling layer to the output-layer neurons when the n-th frame of the input speech is input to the convolutional neural network system;
step 1062: split the output error term sd(n) into the J × I matrix $\widetilde{sd}(n)$;
writing sd(n) = [sd_1, sd_2, ..., sd_M], the split arranges it row by row into the J × I matrix $\widetilde{sd}(n)$ with elements

$$s_{j,i} = sd_{(j-1)\times I + i}, \quad j = 1, \ldots, J, \; i = 1, \ldots, I$$

where I is the number of neurons per subsampling feature map and M is the total number of subsampling-layer neurons, M = J × I;
step 1063: transfer the J × I matrix $\widetilde{sd}(n)$ to the convolutional layer and expand its dimensions to obtain the error term c(n) of the convolutional layer, each matrix element being expanded by the window size G of the dimensionality-reduction operation from the convolutional layer to the subsampling layer;
step 1064: for the n-th frame of the input speech, compute the error terms of the weight matrices w(n) and m(n);
the error term of the weight matrix w(n) connecting the input layer to the convolutional layer is:

Δw(n) = o''(n) * c(n)

where o''(n) is the input feature map O of the input layer for the n-th input frame rotated by 180°;
the error term of the weight matrix m(n) connecting the subsampling layer to the output layer is:

$$\Delta m(n) = d(n) \cdot \hat{P}(n)$$

where $\hat{P}(n)$ is the one-dimensional vector formed from the subsampling-layer output for the n-th input frame, and d(n) is the error term between the actual output and the ideal output;
step 1065: update the weight matrices w(n) and m(n) corresponding to the n-th input frame to obtain the weight matrices w(n+1) and m(n+1) for the (n+1)-th input frame:

w(n+1) = w(n) - αΔw(n)
m(n+1) = m(n) - αΔm(n)

where w(n) and m(n) are, respectively, the weight matrix between the input layer and the convolutional layer and the weight matrix between the subsampling layer and the output layer for the n-th input frame, and α is the learning rate.
CN201510638936.1A 2015-09-30 2015-09-30 System and method for reducing VOIP communication resource overhead Active CN105321525B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510638936.1A CN105321525B (en) 2015-09-30 2015-09-30 System and method for reducing VOIP communication resource overhead

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510638936.1A CN105321525B (en) 2015-09-30 2015-09-30 System and method for reducing VOIP communication resource overhead

Publications (2)

Publication Number Publication Date
CN105321525A true CN105321525A (en) 2016-02-10
CN105321525B CN105321525B (en) 2019-02-22

Family

ID=55248773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510638936.1A Active CN105321525B (en) 2015-09-30 2015-09-30 A kind of system and method reducing VOIP communication resource expense

Country Status (1)

Country Link
CN (1) CN105321525B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106683680A (en) * 2017-03-10 2017-05-17 百度在线网络技术(北京)有限公司 Speaker recognition method and device and computer equipment and computer readable media
CN106790300A (en) * 2017-03-21 2017-05-31 青岛海信宽带多媒体技术有限公司 A kind of method and apparatus conversed
CN108847251A (en) * 2018-07-04 2018-11-20 武汉斗鱼网络科技有限公司 A kind of voice De-weight method, device, server and storage medium
CN108962231A (en) * 2018-07-04 2018-12-07 武汉斗鱼网络科技有限公司 A kind of method of speech classification, device, server and storage medium
CN109891434A (en) * 2016-09-06 2019-06-14 渊慧科技有限公司 Audio is generated using neural network
CN110017991A (en) * 2019-05-13 2019-07-16 山东大学 Rolling bearing fault classification method and system based on spectrum kurtosis and neural network
CN110223713A (en) * 2019-06-11 2019-09-10 苏州思必驰信息科技有限公司 Sound event detection model training method and sound event detection method
CN110706694A (en) * 2019-09-26 2020-01-17 成都数之联科技有限公司 Voice endpoint detection method and system based on deep learning
US10586531B2 (en) 2016-09-06 2020-03-10 Deepmind Technologies Limited Speech recognition using convolutional neural networks
CN111179975A (en) * 2020-04-14 2020-05-19 深圳壹账通智能科技有限公司 Voice endpoint detection method for emotion recognition, electronic device and storage medium
CN111210815A (en) * 2019-11-28 2020-05-29 赵铭 Deep neural network construction method for voice command word recognition, and recognition method and device
US10733390B2 (en) 2016-10-26 2020-08-04 Deepmind Technologies Limited Processing text sequences using neural networks
US11080591B2 (en) 2016-09-06 2021-08-03 Deepmind Technologies Limited Processing sequences using convolutional neural networks
CN114420111A (en) * 2022-03-31 2022-04-29 成都启英泰伦科技有限公司 One-dimensional hypothesis-based speech vector distance calculation method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101278337A (en) * 2005-07-22 2008-10-01 索福特迈克斯有限公司 Robust separation of speech signals in a noisy environment
US7660713B2 (en) * 2003-10-23 2010-02-09 Microsoft Corporation Systems and methods that detect a desired signal via a linear discriminative classifier that utilizes an estimated posterior signal-to-noise ratio (SNR)
CN104021373A (en) * 2014-05-27 2014-09-03 Jiangsu University Semi-supervised speech feature variable factor decomposition method
CN104157290A (en) * 2014-08-19 2014-11-19 Dalian University of Technology Speaker recognition method based on deep learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7660713B2 (en) * 2003-10-23 2010-02-09 Microsoft Corporation Systems and methods that detect a desired signal via a linear discriminative classifier that utilizes an estimated posterior signal-to-noise ratio (SNR)
CN101278337A (en) * 2005-07-22 2008-10-01 Softmax, Inc. Robust separation of speech signals in a noisy environment
CN104021373A (en) * 2014-05-27 2014-09-03 Jiangsu University Semi-supervised speech feature variable factor decomposition method
CN104157290A (en) * 2014-08-19 2014-11-19 Dalian University of Technology Speaker recognition method based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
THEA GHISELLI-CRIPPA, AMRO EL-JAROUDI: "A fast neural net training algorithm and its application to voiced-unvoiced-silence classification of speech", Proc. Int. Conf. ASSP, 1991 *
XIAO-LEI ZHANG, JI WU: "Denoising deep neural networks based voice activity detection", IEEE International Conference on Acoustics *
WANG Jingyao: "Research on face recognition based on SDCNN", China Master's Theses Full-text Database *
XIE Jianbin et al. (eds.): "Visual Machine Learning", Tsinghua University Press, 30 June 2015 *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10803884B2 (en) 2016-09-06 2020-10-13 Deepmind Technologies Limited Generating audio using neural networks
US11948066B2 (en) 2016-09-06 2024-04-02 Deepmind Technologies Limited Processing sequences using convolutional neural networks
US11869530B2 (en) 2016-09-06 2024-01-09 Deepmind Technologies Limited Generating audio using neural networks
US11386914B2 (en) 2016-09-06 2022-07-12 Deepmind Technologies Limited Generating audio using neural networks
CN109891434A (en) * 2016-09-06 2019-06-14 DeepMind Technologies Limited Generating audio using neural networks
US11080591B2 (en) 2016-09-06 2021-08-03 Deepmind Technologies Limited Processing sequences using convolutional neural networks
US11069345B2 (en) 2016-09-06 2021-07-20 Deepmind Technologies Limited Speech recognition using convolutional neural networks
CN109891434B (en) * 2016-09-06 2020-10-30 DeepMind Technologies Limited Generating audio using neural networks
US10586531B2 (en) 2016-09-06 2020-03-10 Deepmind Technologies Limited Speech recognition using convolutional neural networks
US10733390B2 (en) 2016-10-26 2020-08-04 Deepmind Technologies Limited Processing text sequences using neural networks
US11321542B2 (en) 2016-10-26 2022-05-03 Deepmind Technologies Limited Processing text sequences using neural networks
CN106683680A (en) * 2017-03-10 2017-05-17 Baidu Online Network Technology (Beijing) Co., Ltd. Speaker recognition method and device, computer equipment and computer-readable medium
US10957339B2 (en) 2017-03-10 2021-03-23 Baidu Online Network Technology (Beijing) Co., Ltd. Speaker recognition method and apparatus, computer device and computer-readable medium
CN106790300A (en) * 2017-03-21 2017-05-31 Qingdao Hisense Broadband Multimedia Technology Co., Ltd. Method and apparatus for making a call
CN108847251A (en) * 2018-07-04 2018-11-20 Wuhan Douyu Network Technology Co., Ltd. Voice duplicate removal method, device, server and storage medium
CN108847251B (en) * 2018-07-04 2022-12-02 Wuhan Douyu Network Technology Co., Ltd. Voice duplicate removal method, device, server and storage medium
CN108962231B (en) * 2018-07-04 2021-05-28 Wuhan Douyu Network Technology Co., Ltd. Voice classification method, device, server and storage medium
CN108962231A (en) * 2018-07-04 2018-12-07 Wuhan Douyu Network Technology Co., Ltd. Voice classification method, device, server and storage medium
CN110017991A (en) * 2019-05-13 2019-07-16 Shandong University Rolling bearing fault classification method and system based on spectral kurtosis and neural network
CN110017991B (en) * 2019-05-13 2020-03-31 Shandong University Rolling bearing fault classification method and system based on spectral kurtosis and neural network
CN110223713A (en) * 2019-06-11 2019-09-10 Suzhou AISpeech Information Technology Co., Ltd. Sound event detection model training method and sound event detection method
CN110706694A (en) * 2019-09-26 2020-01-17 Chengdu Shuzhilian Technology Co., Ltd. Voice endpoint detection method and system based on deep learning
CN111210815B (en) * 2019-11-28 2023-01-06 Zhao Ming Deep neural network construction method for voice command word recognition, and recognition method and device
CN111210815A (en) * 2019-11-28 2020-05-29 Zhao Ming Deep neural network construction method for voice command word recognition, and recognition method and device
CN111179975A (en) * 2020-04-14 2020-05-19 Shenzhen OneConnect Smart Technology Co., Ltd. Voice endpoint detection method for emotion recognition, electronic device and storage medium
CN114420111A (en) * 2022-03-31 2022-04-29 Chengdu Chipintelli Technology Co., Ltd. One-dimensional hypothesis-based speech vector distance calculation method
CN114420111B (en) * 2022-03-31 2022-06-17 Chengdu Chipintelli Technology Co., Ltd. One-dimensional hypothesis-based speech vector distance calculation method

Also Published As

Publication number Publication date
CN105321525B (en) 2019-02-22

Similar Documents

Publication Publication Date Title
CN105321525A (en) System and method for reducing VOIP (voice over internet protocol) communication resource overhead
CN110136731B (en) End-to-end blind enhancement method for bone-conducted speech using a dilated causal convolution generative adversarial network
CN107393526B (en) Voice silence detection method, device, computer equipment and storage medium
CN111243620B (en) Voice separation model training method and device, storage medium and computer equipment
Zhang et al. Robust sound event recognition using convolutional neural networks
US6691090B1 (en) Speech recognition system including dimensionality reduction of baseband frequency signals
CN113488058B (en) Voiceprint recognition method based on short voice
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN113612808B (en) Audio processing method, related device, storage medium, and program product
CN115602152B (en) Voice enhancement method based on multi-stage attention network
Zhang et al. An efficient perceptual hashing based on improved spectral entropy for speech authentication
CN111883181A (en) Audio detection method and device, storage medium and electronic device
CN114338623B (en) Audio processing method, device, equipment and medium
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN109036470A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN114708855A (en) Voice awakening method and system based on binary residual error neural network
CN112420079B (en) Voice endpoint detection method and device, storage medium and electronic equipment
Raj et al. Multilayered convolutional neural network-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients
CN108010533A (en) Automatic identification method and device for the bit rate of voice data
CN114898766A (en) Distributed optical fiber voice enhancement method based on GAN and tunnel rescue system
CN111785262B (en) Speaker age and gender classification method based on residual network and fused features
CN113327633A (en) Method and device for detecting noisy speech endpoint based on deep neural network model
CN111048072A (en) Voiceprint recognition method applied to power enterprises
CN113506583B (en) Disguised voice detection method using a residual network
CN117393000B (en) Synthetic voice detection method based on neural network and feature fusion

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant