CN105321525A - System and method for reducing VOIP (voice over internet protocol) communication resource overhead - Google Patents

System and method for reducing VOIP (voice over internet protocol) communication resource overhead

Info

Publication number
CN105321525A
CN105321525A
Authority
CN
China
Prior art keywords
layer
frame
voice
input
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510638936.1A
Other languages
Chinese (zh)
Other versions
CN105321525B (en)
Inventor
王珂
孙瑞浩
张鹤立
李曦
纪红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201510638936.1A priority Critical patent/CN105321525B/en
Publication of CN105321525A publication Critical patent/CN105321525A/en
Application granted granted Critical
Publication of CN105321525B publication Critical patent/CN105321525B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 Detection of presence or absence of voice signals
    • G10L 25/84 Detection of presence or absence of voice signals for discriminating voice from noise

Abstract

The invention discloses a system for reducing VOIP (voice over internet protocol) communication resource overhead, comprising an input layer, a convolutional layer, a subsampling layer and an output layer; each layer is composed of feature maps, and each feature map contains neurons. A method of using the system to reduce VOIP communication resource overhead comprises the following steps: 1. train the convolutional neural network; 2. initialize the convolutional neural network; 3. input the speech to be detected into a VAD (voice activity detection) system; 4. extract the MFCC speech feature parameters and their first-order difference parameters from each frame in order; 5. assemble the parameters of each frame into a one-dimensional feature map and feed it into the convolutional neural network system; 6. the convolutional neural network system outputs a result [x, y] for each frame of the speech to be detected in order, and the VAD system makes a decision and records the results. By using a convolutional neural network inside the VAD system for detection, the system and method lower the misjudgment rate of the VAD system, save computation time and bandwidth, and reduce VOIP voice resource overhead while guaranteeing communication quality.

Description

System and method for reducing VOIP communication resource overhead
Technical field
The invention belongs to the field of communications and computing, and relates to a system and method for reducing VOIP communication resource overhead.
Background art
VoIP (Voice over Internet Protocol), also known as Voice over IP or IP telephony, digitizes the analog voice signal, converts it into IP packets, and transmits it in real time over an IP network in the form of data packets.
With the rapid development and continuous improvement of IP network technology, IP-based networks are gradually replacing traditional circuit-switched voice networks. When voice data is transmitted over an IP network under a fixed network bandwidth, improving channel utilization can improve voice quality to a great extent.
In the prior art, channel utilization is usually improved with voice activity detection (VAD) technology. Based on the alternation of speech and silence in everyday conversation, VAD suppresses transmission when silence is detected, so that little or no channel bandwidth is occupied, and performs compression coding and transmission only when a burst of active speech is detected. In fact, people speak for only 27%-40% of the time during a phone call. Removing the silent periods with VAD and keeping only the information-bearing parts can raise the effective utilization of channel bandwidth by more than 50%, so VAD has become a powerful tool in voice traffic transmission for reducing bandwidth and improving channel utilization.
In IP telephony applications, VAD is applied to the transmitted voice data: non-speech segments are removed during an audio session, avoiding the coding and transmission of silence packets and saving computation time and bandwidth.
The method used by current VAD technology is as follows. First, the current audio frame to be detected is received, and time-domain and frequency-domain parameters are extracted from it. A first distance is then computed between the time-domain parameters and their long-term sliding average over historical background noise frames, and a second distance between the frequency-domain parameters and their long-term sliding average over historical background noise frames. "Historical background noise frames" are background noise frames preceding the current frame, for example several consecutive background noise frames before it; if the current frame is the very first frame, a preset frame can be used as the historical background noise frame, or the first frame itself can serve as one. Finally, a set of decision polynomials based on the first and second distances determines whether the current audio frame is a foreground speech frame or a background noise frame; at least one coefficient in the decision polynomials is a variable determined by the voice activity detection mode and/or features of the input signal.
The above technique mainly computes values such as pitch, energy and tone information of the speech signal and compares them with standard thresholds: if a value exceeds the threshold the frame is judged to be speech, otherwise noise. Such algorithms are fairly simple, but they are only suitable for relatively simple noise sources. With complex noise sources, the above speech features mix easily with the noise, and the thresholds are difficult to adapt to changing conditions.
Summary of the invention
Addressing the shortcomings of existing VAD detection techniques, namely poor adaptability to environmental change and low decision accuracy in complex environments, the present invention proposes a system and method for reducing VOIP communication resource overhead.
A system for reducing VOIP communication resource overhead, also called the convolutional neural network system, comprises four layers in sequence: an input layer, a convolutional layer, a subsampling layer and an output layer. Each layer is composed of feature maps, and each one-dimensional feature map contains neurons.
The input layer takes the MFCC speech feature parameters of each frame of the speech under test together with their first-order difference parameters; these parameters form a one-dimensional feature map.
The convolutional layer has multiple feature maps, each obtained by convolving the input-layer feature map with a convolution kernel. Depending on the kernel size, several input-layer neurons map onto one convolutional-layer neuron, and the neurons at the same position in every feature map all connect to the corresponding neurons of the input layer.
The subsampling layer performs a dimensionality-reduction operation on each feature map of the convolutional layer: the neurons on each convolutional feature map are evenly grouped, and the maximum neuron of each group is taken as a neuron of the subsampling layer. The subsampling layer has the same number of feature maps as the convolutional layer, and the number of neurons per subsampling feature map is reduced to 1/3 to 2/3 of the number on the corresponding convolutional feature map.
The output layer is a one-dimensional feature map; each of its neurons is fully connected to all neurons of the subsampling layer, and it outputs the actual result for the speech under test.
A method for reducing VOIP communication resource overhead comprises the following steps:
Step 1: train the convolutional neural network system, and obtain and save the weight matrix w and the weight matrix m.
The specific steps are as follows:
Step 101: select speech samples and label each frame of the samples as a speech frame or a non-speech frame.
The speech samples are labelled by hand in order: speech frames are labelled [1, 0] and non-speech frames [0, 1], serving as the ideal outputs of the convolutional neural network system's output layer for speech frames and non-speech frames respectively.
Step 102: for each labelled frame, extract the MFCC speech feature parameters and their first-order difference parameters in order.
MFCC (Mel-scale Frequency Cepstral Coefficients) are extracted with a frame length of 20 ms.
The MFCC speech feature parameters are computed as:

$$M(n) = \sqrt{\frac{2}{N}} \sum_{k=1}^{N} m_k \cos\!\left(\frac{n\pi}{N}\,(k-0.5)\right), \quad n = 1, 2, 3, \ldots, L$$

where M(n) is the n-th order MFCC parameter; m_k, k = 1, ..., N, is the input of the discrete cosine transform; N is the number of Mel filters; and L is the maximum order of the MFCC coefficients.
Denote the first-order difference MFCC parameters by ΔMFCC; they are computed as:

$$D(t) = \frac{\sum_{n=1}^{R} n\,\bigl(M(t+n) - M(t-n)\bigr)}{\sqrt{2\sum_{n=1}^{R} n^{2}}}, \quad t = 1, 2, 3, \ldots, L'$$

where D(t) is the t-th order first-order difference MFCC parameter; L' is the maximum order of the ΔMFCC coefficients; M(t+n) and M(t-n) are the (t+n)-th and (t-n)-th order MFCC parameters; and n = 1, ..., R, where R is the time difference of the first-order cepstral coefficients.
Step 103: assemble the MFCC parameters of each frame and their first-order difference parameters into a one-dimensional feature map, in order.
The one-dimensional feature map V is computed as:

V = [MFCC | ΔMFCC] = [M(1), M(2), ..., M(L), D(1), D(2), ..., D(L')]
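For concreteness, the following is a minimal numpy sketch of steps 102-103 under stated assumptions: the Mel filter-bank log-energies m_k are assumed to be already computed, the edge handling of the difference formula is assumed to be zero padding (the patent does not specify it), and all function names are illustrative.

```python
import numpy as np

def mfcc(m, L=13):
    """M(n) = sqrt(2/N) * sum_k m_k * cos(n*pi/N * (k - 0.5)), n = 1..L,
    where m holds the N Mel filter-bank log-energies (the DCT input)."""
    N = len(m)
    n = np.arange(1, L + 1)[:, None]      # orders 1..L as a column
    k = np.arange(1, N + 1)[None, :]      # filter indices 1..N as a row
    return np.sqrt(2.0 / N) * (np.cos(n * np.pi / N * (k - 0.5)) @ m)

def delta_mfcc(M, R=2):
    """D(t) = sum_n n*(M(t+n) - M(t-n)) / sqrt(2*sum_n n^2), t = 1..L;
    out-of-range orders are treated as zero (an assumption)."""
    L = len(M)
    Mp = np.concatenate([np.zeros(R), M, np.zeros(R)])  # pad so M(t±n) exists
    denom = np.sqrt(2.0 * sum(n * n for n in range(1, R + 1)))
    return np.array([sum(n * (Mp[R + t + n] - Mp[R + t - n])
                         for n in range(1, R + 1)) / denom
                     for t in range(L)])

def feature_map(m):
    """Step 103: V = [MFCC | delta-MFCC], a 1 x 26 map for L = L' = 13."""
    M = mfcc(m)
    return np.concatenate([M, delta_mfcc(M)])
```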
Step 104: feed the one-dimensional feature map of step 103 into the convolutional neural network system and obtain the decision result.
The specific steps are:
Step 1041: input the one-dimensional feature map of each frame of the sample speech into the input layer of the convolutional neural network system.
Step 1042: obtain each feature map of the convolutional layer by convolving the one-dimensional feature map of the input layer with a convolution kernel.
Each convolutional feature map Q_j is computed as:

Q_j = σ(O * w_j), j = 1, ..., J

where Q_j is the j-th feature map of the convolutional layer; J is the number of convolutional feature maps; O is the one-dimensional feature map of the input layer; * denotes convolution; and w_j is the weight matrix connecting the input layer to the neurons of the j-th convolutional feature map, all the w_j together forming the matrix w. σ(x) is the activation function, chosen as the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
After the convolution, each neuron on a convolutional feature map extracts the corresponding local feature of the input-layer feature map.
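A sketch of step 1042 in numpy, assuming "valid" one-dimensional convolution with no padding, which matches the neuron counts given in the embodiment (26 inputs, 1×6 kernels, 21 outputs per map). The sliding-window (correlation) form is used for simplicity; this convention is an assumption, noting that the patent applies a 180° rotation during back-propagation.

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_layer(O, w):
    """Step 1042: Q_j = sigma(O * w_j). O is the 1 x 26 input map, w is J x K
    (twelve 1 x 6 kernels in the embodiment); returns the J x 21 maps Q."""
    K = w.shape[1]
    windows = np.stack([O[i:i + K] for i in range(len(O) - K + 1)])  # 21 x K
    return sigmoid(windows @ w.T).T                                  # J x 21
```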
Step 1043: the subsampling layer performs a dimensionality-reduction operation on each convolutional feature map Q_j.
The neurons on each convolutional feature map are evenly grouped, and the maximum neuron of each group is taken as a neuron of the subsampling layer:

$$p_{j,m} = \max_{n=1,\ldots,G} q_{j,(m-1)\times s+n}$$

where p_{j,m} is the m-th neuron of the j-th subsampling feature map P_j; q_{j,m} is the m-th neuron of the j-th convolutional feature map Q_j; s is the window stride; and G is the window size of the dimensionality-reduction operation from the convolutional layer to the subsampling layer. The stride s is set equal to the window size G.
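Step 1043 as a numpy sketch: non-overlapping max pooling, using G = s as the text specifies; the truncation of any trailing partial window is an assumption.

```python
def subsample(Q, G=3):
    """Step 1043: p_{j,m} = max over each window of G neurons (stride s = G)."""
    J, T = Q.shape
    M = T // G                               # 21 // 3 = 7 neurons per map
    return Q[:, :M * G].reshape(J, M, G).max(axis=2)
```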
Step 1044: each neuron of the output layer is connected to all neurons of the subsampling layer.
The output is computed as:

$$y = \sigma(\hat{P} \cdot m)$$

where y is the output vector; $\hat{P}$ is the one-dimensional vector generated from the feature maps P_j, j = 1, ..., J, of the subsampling layer; J is the number of subsampling feature maps; m is the weight matrix connecting this vector to the output-layer neurons; and σ(x) is the sigmoid activation function defined above.
Step 1045: after the full connection, the output layer produces its output and the result is recorded.
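Steps 1044-1045 sketched in numpy: the pooled maps are flattened into the vector P̂ and passed through the fully connected sigmoid layer. A small forward helper composes the three stages for later sketches; its name and the row-major flattening order are assumptions.

```python
def output_layer(P, m):
    """Step 1044: y = sigma(P_hat . m); m is 2 x (J*I), P_hat has J*I entries."""
    return sigmoid(m @ P.reshape(-1))

def forward(V, w, m, G=3):
    """One full pass: input map -> convolution -> subsampling -> output [x, y]."""
    Q = conv_layer(V, w)
    P = subsample(Q, G)
    return output_layer(P, m), (V, Q, P)
```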
Step 105: for each frame, compare the actual decision result of step 104 with the ideal output labelled in step 101, in order, and compute the decision error term.
For the n-th frame of the input speech, the actual output of the convolutional neural network system is y(n), and the ideal output is denoted $\hat{y}(n)$. The decision error term d(n) is computed as:

d(n) = e(n)·{y(n)·[1 - y(n)]}

where e(n) = $\hat{y}(n)$ - y(n), and n is the frame index of the sample speech.
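Step 105 in code. The sign of e(n) is written here so that the later update w ← w − αΔw descends the squared error; that sign convention is an assumption, since the garbled source does not fix it.

```python
def decision_error(y, y_ideal):
    """Step 105: d(n) = e(n) * y(n) * (1 - y(n)), the squared-error gradient
    taken through the output sigmoid. e(n) = y(n) - y_ideal(n) here so that
    subtracting alpha * gradient in step 1065 is a descent step."""
    return (y - y_ideal) * y * (1.0 - y)
```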
Step 106: for the decision error term of each frame, the convolutional neural network system back-propagates the error term and updates the weight matrices w and m, completing one training pass.
Step 1061: for the n-th frame of the input speech, propagate the error term d(n) back to the subsampling layer to obtain the output error term sd(n) of the subsampling layer:

sd(n) = m^T(n)·d(n)

where m^T(n) is the transpose of m(n), and m(n) is the weight matrix connecting the output feature maps of the subsampling layer to the output-layer neurons when the n-th frame of the input speech is input to the convolutional neural network system.
Step 1062: split the output error term sd(n) into the J × I matrix $\widetilde{sd}(n)$.
Writing sd(n) = [sd_1, sd_2, ..., sd_M], the split arranges it row by row into the J × I matrix $\widetilde{sd}(n)$ with elements

$$s_{j,i} = sd_{(j-1)\times I + i}, \quad j = 1, \ldots, J, \; i = 1, \ldots, I$$

where I is the number of neurons per subsampling feature map and M is the total number of subsampling-layer neurons, M = J × I.
Step 1063: transfer the J × I matrix $\widetilde{sd}(n)$ to the convolutional layer and expand its dimensions to obtain the error term c(n) of the convolutional layer.
The expansion reverses the dimensionality reduction of step 1043: each matrix element is expanded by the window size G of the dimensionality-reduction operation from the convolutional layer to the subsampling layer.
Step 1064: for the n-th frame of the input speech, compute the error terms of the weight matrices w(n) and m(n).
The error term of the weight matrix w(n) connecting the input layer to the convolutional layer is:

Δw(n) = o''(n) * c(n)

where o''(n) is the input feature map O of the input layer for the n-th input frame rotated by 180°.
The error term of the weight matrix m(n) connecting the subsampling layer to the output layer is:

$$\Delta m(n) = d(n) \cdot \hat{P}(n)$$

where $\hat{P}(n)$ is the one-dimensional vector formed from the subsampling-layer output for the n-th input frame, and d(n) is the error term between the actual output and the ideal output.
Step 1065: update the weight matrices w(n) and m(n) corresponding to the n-th input frame to obtain the weight matrices w(n+1) and m(n+1) for the (n+1)-th input frame:

w(n+1) = w(n) - αΔw(n)
m(n+1) = m(n) - αΔm(n)

where w(n) and m(n) are, respectively, the weight matrix between the input layer and the convolutional layer and the weight matrix between the subsampling layer and the output layer for the n-th input frame, and α is the learning rate, preferably α = 1.
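Steps 1061-1065 as a single numpy sketch. Two simplifications are assumptions: the pooling error is spread uniformly over each window of size G rather than routed only to the max position, and the sigmoid derivative of the convolutional layer is included so the kernel gradient is complete. The learning rate defaults to α = 1 as preferred.

```python
def backprop_step(V, Q, P, y, y_ideal, w, m, G=3, alpha=1.0):
    """One weight update for one frame (steps 1061-1065)."""
    d = decision_error(y, y_ideal)            # step 105: output error term
    P_hat = P.reshape(-1)
    sd = m.T @ d                              # step 1061: back to subsampling layer
    sd = sd.reshape(P.shape)                  # step 1062: split into J x I
    c = np.repeat(sd, G, axis=1)              # step 1063: expand by window G
    c = c * Q * (1.0 - Q)                     # sigmoid derivative (an assumption)
    K = w.shape[1]
    windows = np.stack([V[i:i + K] for i in range(len(V) - K + 1)])
    dw = c @ windows                          # step 1064: Delta-w, shape J x K
    dm = np.outer(d, P_hat)                   # step 1064: Delta-m, shape 2 x (J*I)
    return w - alpha * dw, m - alpha * dm     # step 1065: gradient-descent update
```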
Step 107: keep inputting sample speech and repeat steps 101-106 until the output error δ of the convolutional neural network system is below a preset value:

$$\delta = \frac{1}{2} \sum_{i=1}^{S} (\hat{y}_i - y_i)^2$$

where S is the total number of speech frames input per training pass, y is the vector formed by the decision results of the S frames, and $\hat{y}$ is the vector formed by their ideal decision results.
Step 108: when the output error δ is below the preset value, record the weight matrices w and m and save them to a file.
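Steps 107-108 sketched as a loop over the labelled sample frames, using the per-frame helpers above. `np.save` and the file names stand in for the patent's "save to a file", and the 5% threshold from the embodiment is used; both are illustrative.

```python
def train(frames, labels, w, m, tol=0.05):
    """Repeat steps 101-106 until the output error delta < tol, then save w, m."""
    while True:
        outputs = []
        for V, y_ideal in zip(frames, labels):
            y, (V_, Q, P) = forward(V, w, m)
            w, m = backprop_step(V_, Q, P, y, y_ideal, w, m)
            outputs.append(y)
        delta = 0.5 * np.sum((np.array(labels) - np.array(outputs)) ** 2)
        if delta < tol:                       # step 107: stop criterion
            break
    np.save("w.npy", w)                       # step 108: persist the weights
    np.save("m.npy", m)
    return w, m
```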
Step 2: read the weight matrices w and m from the file and initialize the convolutional neural network system.
Step 3: input the speech to be detected into the VAD system.
Step 4: extract the MFCC speech feature parameters and their first-order difference parameters from each frame of the speech to be detected, in order.
Step 5: assemble the MFCC parameters of each frame of the speech to be detected and their first-order difference parameters into one-dimensional feature maps and feed them into the convolutional neural network system, in order.
Step 6: the convolutional neural network system outputs the result [x, y] for each frame of the speech to be detected in order; the VAD system makes a decision on [x, y] and records the decision.
The VAD system judges the output [x, y] of the output layer: if x > y, the frame is a speech frame and the VAD system outputs decision 1; otherwise it is a non-speech frame and the VAD system outputs decision 0.
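Steps 2-6 of the detection phase, sketched with the helpers above: reload the saved weights, run each frame's feature map through the network, and record the hard decision per frame.

```python
def vad(frames):
    """Steps 2-6: initialize from file, then decide 1 (speech) if x > y else 0."""
    w, m = np.load("w.npy"), np.load("m.npy")   # step 2: initialize the network
    decisions = []
    for V in frames:                             # steps 3-5: per-frame feature maps
        (x, y), _ = forward(V, w, m)             # step 6: output [x, y]
        decisions.append(1 if x > y else 0)
    return decisions
```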
The advantages of the invention are:
(1) In the system for reducing VOIP communication resource overhead, once the convolutional neural network system is trained it needs no retraining; VAD detection can be performed after simply initializing the network with the weights imported from the file, so the computational complexity is low.
(2) In the method for reducing VOIP communication resource overhead, the convolutional neural network system is incorporated into the VAD system for detection, making full use of its learning ability; the misjudgment rate of the VAD system is reduced and computation time and bandwidth are saved, so the voice resource overhead of VOIP can be reduced while guaranteeing communication quality.
Brief description of the drawings
Fig. 1 is a schematic diagram of the architecture of the system for reducing VOIP communication resource overhead according to the invention.
Fig. 2 is a flowchart of the method for reducing VOIP communication resource overhead according to the invention.
Fig. 3 is a flowchart of training the convolutional neural network system of the invention.
Fig. 4 is a flowchart of the convolutional neural network system of the invention.
Fig. 5 is a flowchart of the back-propagation of the convolutional neural network system of the invention.
Detailed description
The present invention is described in further detail below with reference to the drawings and embodiments.
In a VOIP call system, VAD detection is performed before speech coding: non-speech segments are removed, avoiding the coding and transmission of silence packets and saving computation time and bandwidth. In the speech coding stage, segments judged by VAD to be non-speech frames can be coded sparsely or not at all, reducing the communication bandwidth resources.
The system and method for reducing VOIP communication resource overhead apply the convolutional neural network system, with its simple architecture, few training parameters and strong adaptability, to the VAD system. By training the convolutional neural network system and using it as a two-dimensional decision device, the false-alarm and missed-detection rates of the VAD system are reduced, so VOIP communication resource overhead is reduced while communication quality is guaranteed.
A system for reducing VOIP communication resource overhead, also called the convolutional neural network system, comprises four layers as shown in Fig. 1: an input layer, a convolutional layer, a subsampling layer and an output layer. Each layer is composed of several feature maps, and each feature map contains several neurons.
The input layer is a one-dimensional feature map with 26 neurons, denoted 1×26. The convolutional layer has 12 feature maps with 21 neurons each, denoted 12×21. The subsampling layer has the same number of feature maps as the convolutional layer, and the number of neurons per feature map is one third of that of the convolutional layer, denoted 12×7. The output layer is a one-dimensional feature map with 2 neurons, denoted 1×2.
The input layer takes the one-dimensional feature map formed from the MFCC speech feature parameters of each frame of the speech under test and their first-order difference parameters.
The convolutional layer has 12 feature maps, each obtained by convolving the input-layer feature map with a convolution kernel. This embodiment uses twelve 1×6 convolution kernels. After the convolution, the first neuron of the first convolutional feature map corresponds to the first 6 neurons of the input layer, and so on. On each convolutional feature map, the neurons at the same position connect to the corresponding 6 neurons of the input-layer feature map, all with the same connection pattern.
The subsampling layer performs a dimensionality-reduction operation on each feature map of the convolutional layer: the neurons on each convolutional feature map are evenly grouped, and the maximum neuron of each group is taken as a neuron of the subsampling layer. The subsampling layer has the same number of feature maps as the convolutional layer, and the number of neurons per subsampling feature map is reduced to 1/3 to 2/3 of the number on the corresponding convolutional feature map.
The two neurons of the output layer are fully connected to all neurons of the subsampling layer and output the actual result for the speech under test.
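A quick shape check of the embodiment's 1×26 → 12×21 → 12×7 → 1×2 pipeline using the sketches above, with random weights standing in for trained ones.

```python
rng = np.random.default_rng(0)
w = rng.standard_normal((12, 6))   # twelve 1x6 kernels
m = rng.standard_normal((2, 84))   # full connection: 12*7 = 84 inputs -> 2 outputs
V = rng.standard_normal(26)        # 13 MFCC + 13 delta-MFCC for one frame
Q = conv_layer(V, w)
P = subsample(Q)
assert Q.shape == (12, 21) and P.shape == (12, 7)
print(output_layer(P, m))          # the [x, y] decision pair
```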
The method of applying the convolutional neural network system to VAD detection to reduce VOIP communication resource overhead is shown in Fig. 2; the steps are:
Step 1: train the convolutional neural network system, and obtain and save the weight matrix w and the weight matrix m.
As shown in Fig. 3, the specific steps are as follows:
Step 101: select speech samples and label each frame of the samples as a speech frame or a non-speech frame.
The speech samples are selected from the speech portion of the Clemson University CUAVE database and the noisy speech corpus NOIZEUS.
The speech samples are labelled by hand in order, one frame every 20 ms: speech frames are labelled [1, 0] and non-speech frames [0, 1], serving as the ideal outputs of the convolutional neural network system's output layer for speech frames and non-speech frames respectively.
Step 102: for each labelled frame, extract the MFCC speech feature parameters and their first-order difference parameters in order.
MFCC (Mel-scale Frequency Cepstral Coefficients) are extracted with a frame length of 20 ms.
The MFCC speech feature parameters are computed as:

$$M(n) = \sqrt{\frac{2}{N}} \sum_{k=1}^{N} m_k \cos\!\left(\frac{n\pi}{N}\,(k-0.5)\right), \quad n = 1, 2, 3, \ldots, L$$

where M(n) is the n-th order MFCC parameter; m_k, k = 1, ..., N, is the input of the discrete cosine transform; N is the number of Mel filters, preferably 24; and L is the maximum order of the MFCC coefficients, preferably 13.
Denote the first-order difference MFCC parameters by ΔMFCC; they are computed as:

$$D(t) = \frac{\sum_{n=1}^{R} n\,\bigl(M(t+n) - M(t-n)\bigr)}{\sqrt{2\sum_{n=1}^{R} n^{2}}}, \quad t = 1, 2, 3, \ldots, L'$$

where D(t) is the t-th order first-order difference MFCC parameter; L' is the maximum order of the ΔMFCC coefficients; M(t+n) and M(t-n) are the (t+n)-th and (t-n)-th order MFCC parameters; and n = 1, ..., R, where R is the time difference of the first-order cepstral coefficients and can take the value 1 or 2, preferably 2.
Step 103: assemble the MFCC parameters of each frame and their first-order difference parameters into a one-dimensional feature map, in order.
The one-dimensional feature map is composed from MFCC and ΔMFCC as follows:

V = [MFCC | ΔMFCC] = [M(1), M(2), ..., M(L), D(1), D(2), ..., D(L')]

where V is the resulting one-dimensional feature map, M(n) is the n-th order MFCC parameter, and D(t) is the t-th order ΔMFCC parameter.
Step 104: feed the one-dimensional feature map of step 103 into the convolutional neural network system and obtain the decision result.
As shown in Fig. 4, the specific steps are:
Step 1041: input the one-dimensional feature map of each frame of the sample speech into the input layer of the convolutional neural network system.
The MFCC speech feature parameters of a frame of the sample speech and their first-order difference parameters form a 1×26 feature map, which is input to the convolutional neural network system as a one-dimensional vector.
Step 1042: obtain each feature map of the convolutional layer by convolving the input-layer feature map with a convolution kernel.
The kernel size is set as required; this embodiment uses twelve 1×6 kernels.
After the convolution, the number of neurons per convolutional feature map is reduced to about 80% of the number of input-layer neurons. The weight matrix connecting the input layer to the neurons of the j-th convolutional feature map is denoted w_j; all the w_j together form the matrix w.
Each convolutional feature map Q_j is computed as:

Q_j = σ(O * w_j), j = 1, ..., J

where Q_j is the j-th feature map of the convolutional layer; J is the number of convolutional feature maps, 12 in this embodiment; O is the feature map of the input layer; * denotes convolution; and σ(x) is the activation function, chosen as the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The 12 convolutional feature maps are obtained by convolving the input-layer feature map with the twelve 1×6 kernels. Because the neurons on each convolutional feature map share the same connection pattern with the input-layer feature map, each neuron on a convolutional feature map extracts the corresponding local feature of the input-layer feature map.
Step 1043: the subsampling layer performs a dimensionality-reduction operation on each convolutional feature map Q_j.
The neurons on each convolutional feature map are evenly grouped, and the maximum neuron of each group is taken as a neuron of the subsampling layer:

$$p_{j,m} = \max_{n=1,\ldots,G} q_{j,(m-1)\times s+n}$$

where p_{j,m} is the m-th neuron of the j-th subsampling feature map P_j; q_{j,m} is the m-th neuron of the j-th convolutional feature map Q_j; s is the window stride; and G is the window size of the dimensionality-reduction operation from the convolutional layer to the subsampling layer, i.e. the number of neurons per group in the even grouping. The stride s equals the window size G; this embodiment uses G = s = 3.
After the dimensionality reduction, the subsampling layer has 12×7 neurons, and the number of neurons per subsampling feature map is 1/3 to 2/3 of that per convolutional feature map, reducing the amount of computation.
Step 1044: each neuron of the output layer is connected to all neurons of the subsampling layer.
A full connection is used between the output of the subsampling layer and the input of the output layer:

$$y = \sigma(\hat{P} \cdot m)$$

where y is the output vector; $\hat{P}$ is the one-dimensional vector generated from the feature maps P_j, j = 1, ..., J, of the subsampling layer; J is the number of subsampling feature maps; m is the weight matrix connecting this vector to the output-layer neurons; and σ(x) is the sigmoid activation function defined above.
Step 1045: after the full connection, the output of the output layer is expressed as [x, y], and the result is recorded.
Step 105: for each frame, compare the actual decision result of step 104 with the ideal output labelled in step 101, in order, and compute the decision error term.
For the n-th speech frame, the actual decision output of the convolutional neural network system is y(n), and the ideal output is denoted $\hat{y}(n)$; the decision error term d(n) is computed as:

d(n) = e(n)·{y(n)·[1 - y(n)]}

where e(n) = $\hat{y}(n)$ - y(n), and n is the frame index of the sample speech.
Step 106: for the decision error term of each frame, the convolutional neural network system back-propagates the error term and updates the weight matrices w and m, completing one training pass.
As shown in Fig. 5, the specific steps are as follows:
Step 1061: for the n-th frame of the input speech, propagate the error term d(n) back to the subsampling layer to obtain the output error term sd(n) of the subsampling layer:

sd(n) = m^T(n)·d(n)

where m^T(n) is the transpose of m(n), and m(n) is the weight matrix connecting the subsampling feature maps of the n-th speech frame to the output-layer neurons.
Step 1062: split the output error term sd(n) into the J × I matrix $\widetilde{sd}(n)$.
sd(n) is a one-dimensional matrix; because the subsampling layer has J feature maps, it must be split into a J × I matrix.
Writing sd(n) = [sd_1, sd_2, ..., sd_M], the split arranges it row by row into the J × I matrix $\widetilde{sd}(n)$ with elements

$$s_{j,i} = sd_{(j-1)\times I + i}, \quad j = 1, \ldots, J, \; i = 1, \ldots, I$$

where I is the number of neurons per subsampling feature map, and M is the total number of subsampling-layer neurons, M = J × I.
Step 1063: transfer the J × I matrix $\widetilde{sd}(n)$ to the convolutional layer and expand its dimensions to obtain the error term c(n) of the convolutional layer.
The expansion reverses the dimensionality reduction of step 1043: each matrix element is expanded by the window size G of the dimensionality-reduction operation from the convolutional layer to the subsampling layer.
Step 1064: for the n-th frame of the input speech, compute the error terms of the weight matrices w(n) and m(n).
The error term of the weight matrix w(n) connecting the input layer to the convolutional layer is:

Δw(n) = o''(n) * c(n)

where o''(n) is the input feature map O of the input layer for the n-th input frame rotated by 180°.
The error term of the weight matrix m(n) connecting the subsampling layer to the output layer is:

$$\Delta m(n) = d(n) \cdot \hat{P}(n)$$

where $\hat{P}(n)$ is the one-dimensional vector formed from the subsampling-layer output for the n-th input frame, and d(n) is the error term between the actual output and the ideal output.
Step 1065: update the weight matrices w(n) and m(n) corresponding to the n-th input frame to obtain the weight matrices w(n+1) and m(n+1) for the (n+1)-th input frame:

w(n+1) = w(n) - αΔw(n)
m(n+1) = m(n) - αΔm(n)

where w(n) and m(n) are, respectively, the weight matrix between the input layer and the convolutional layer and the weight matrix between the subsampling layer and the output layer for the n-th input frame, and α is the learning rate, preferably α = 1.
Step 107: keep inputting sample speech and repeat steps 101-106 until the output error δ of the convolutional neural network system is below a preset value:

$$\delta = \frac{1}{2} \sum_{i=1}^{S} (\hat{y}_i - y_i)^2$$

where S is the total number of speech frames input per training pass, y is the vector formed by the decision results of the S frames, and $\hat{y}$ is the vector formed by their ideal decision results. The preset value is 5%.
Step 108: when the output error δ is below the preset value, record the weight matrices w and m and save them to a file.
Step 2: read the weight matrices w and m from the file and initialize the convolutional neural network system.
Step 3: input the speech to be detected into the VAD system.
The speech on which the VAD decision is to be made is input into the VAD system.
Step 4: extract the MFCC speech feature parameters and their first-order difference parameters from each frame of the speech to be detected, in order.
Step 5: assemble the feature parameters of each frame of the speech to be detected into one-dimensional feature maps and feed them into the convolutional neural network system, in order.
The one-dimensional feature map is composed from MFCC and ΔMFCC as follows:

V = [MFCC | ΔMFCC] = [M(1), M(2), ..., M(L), D(1), D(2), ..., D(L')]
Step 6: the convolutional neural network system outputs the result [x, y] for each frame of the speech to be detected in order; the VAD system makes a decision on [x, y] and records the decision.
The speech is divided into 20 ms frames. For each frame of speech data, the output of the output layer of the convolutional neural network system is [x, y]. In the VAD decision, if x > y, the frame is a speech frame and the VAD system outputs 1; otherwise it is a non-speech frame and the VAD system outputs decision 0.
For example, if the output-layer result for a frame of the speech under test is [0.98, 0.02], then since 0.98 > 0.02 the VAD decision is that this frame is a speech frame, and the VAD system outputs decision 1.
The invention implements a VAD detection technique that uses speech feature parameters, such as MFCC, MFCC with its first-order difference parameters, or MFCC with its first- and second-order difference parameters, as the input of a convolutional neural network, and uses the convolutional neural network as a two-dimensional decision device.
With this technique, the voice resource overhead of VOIP can be reduced while communication quality is guaranteed. The feature extraction stage can instead use LPCC, LPCC with its first-order difference parameters, LPCC with its first- and second-order difference parameters, or a combination of MFCC and LPCC. The number of feature parameters may differ from the reference values given. The numbers of neurons in the convolutional and subsampling layers may differ, the convolution kernel size may differ, and the convolutional and subsampling layers may each comprise multiple layers.

Claims (5)

1. A system for reducing VOIP communication resource overhead, characterized in that the system is a convolutional neural network system composed of an input layer, a convolutional layer, a subsampling layer and an output layer; each layer is composed of feature maps, and each one-dimensional feature map contains neurons; VOIP refers to voice over IP;
the input layer takes the MFCC speech feature parameters of each frame of the speech under test and the first-order difference parameters of the MFCC, these parameters forming a one-dimensional feature map; MFCC refers to Mel-frequency cepstral coefficients;
each feature map of the convolutional layer is obtained by convolving the feature map of the input layer with a convolution kernel, and the neurons at the same position in every feature map all connect to the corresponding neurons of the input layer;
the subsampling layer performs a dimensionality-reduction operation on each feature map of the convolutional layer: the neurons on each convolutional feature map are evenly grouped, the maximum neuron of each group is taken as a neuron of the subsampling layer, and the number of neurons per subsampling feature map is reduced to 1/3 to 2/3 of the number per convolutional feature map;
the output layer is a one-dimensional feature map; each of its neurons is fully connected to all neurons of the subsampling layer, and it outputs the actual result for the speech under test.
2. A method of reducing VOIP communication resource overhead applying the system for reducing VOIP communication resource overhead of claim 1, characterized in that the steps are:
step 1: train the convolutional neural network system, and obtain and save the weight matrix w connecting the input layer to the convolutional-layer neurons and the weight matrix m connecting the subsampling layer to the output-layer neurons;
step 2: read the weight matrices w and m and initialize the convolutional neural network system;
step 3: input the speech to be detected into a VAD system, where VAD refers to voice activity detection;
step 4: extract the MFCC speech feature parameters and their first-order difference parameters from each frame of the speech to be detected, in order;
step 5: assemble the MFCC parameters of each frame of the speech to be detected and their first-order difference parameters into one-dimensional feature maps and feed them into the convolutional neural network system, in order;
step 6: the convolutional neural network system outputs the result for each frame of the speech to be detected in order, and the VAD system makes a decision on the result and records it;
the output of the output layer is [x, y]; in the VAD decision, if x is greater than y, the frame is a speech frame and the VAD system outputs decision 1; otherwise it is a non-speech frame and the VAD system outputs decision 0.
3. The method of reducing VOIP communication resource overhead of claim 2, characterized in that step 1 comprises the following steps:
step 101: select speech samples and label each frame of the samples as a speech frame or a non-speech frame;
speech frames are labelled [1, 0] and non-speech frames [0, 1], serving as the ideal outputs of the convolutional neural network system's output layer for speech frames and non-speech frames respectively;
step 102: for each labelled frame, extract the MFCC speech feature parameters and their first-order difference parameters in order;
the n-th order MFCC parameter M(n) is computed as:

$$M(n) = \sqrt{\frac{2}{N}} \sum_{k=1}^{N} m_k \cos\!\left(\frac{n\pi}{N}\,(k-0.5)\right), \quad n = 1, 2, 3, \ldots, L$$

where m_k, k = 1, ..., N, is the input of the discrete cosine transform; N is the number of Mel filters; and L is the maximum order of the MFCC coefficients;
the t-th order parameter D(t) of the first-order difference parameters ΔMFCC is computed as:

$$D(t) = \frac{\sum_{n=1}^{R} n\,\bigl(M(t+n) - M(t-n)\bigr)}{\sqrt{2\sum_{n=1}^{R} n^{2}}}, \quad t = 1, 2, 3, \ldots, L'$$

where L' is the maximum order of the ΔMFCC coefficients; M(t+n) and M(t-n) are the (t+n)-th and (t-n)-th order MFCC parameters; and n = 1, ..., R, where R is the time difference of the first-order cepstral coefficients;
step 103: assemble the MFCC parameters of each frame and their first-order difference parameters into a one-dimensional feature map, in order;
the one-dimensional feature map V is computed as:

V = [MFCC | ΔMFCC] = [M(1), M(2), ..., M(L), D(1), D(2), ..., D(L')]

step 104: feed the one-dimensional feature map of step 103 into the convolutional neural network system and obtain the decision result;
step 105: for each frame, compare the actual decision result of step 104 with the ideal output labelled in step 101, in order, and compute the decision error term;
for the n-th frame of the input speech, the actual output of the convolutional neural network system is y(n), and the ideal output is denoted $\hat{y}(n)$; the decision error term d(n) is computed as:

d(n) = e(n)·{y(n)·[1 - y(n)]}

where e(n) = $\hat{y}(n)$ - y(n), and n is the frame index of the sample speech;
step 106: for the decision error term of each frame, the convolutional neural network system back-propagates the error term and updates the weight matrices w and m, completing one training pass;
step 107: keep inputting sample speech and repeat steps 101-106 until the output error δ of the convolutional neural network system is below a preset value:

$$\delta = \frac{1}{2} \sum_{i=1}^{S} (\hat{y}_i - y_i)^2$$

where S is the total number of speech frames input per training pass; y is the vector formed by the actual outputs of the S frames after passing through the convolutional neural network system; and $\hat{y}$ is the vector formed by the ideal results of the S input frames;
step 108: when the output error δ is below the preset value, record the weight matrices w and m and save them to a file.
4. The method of reducing VOIP communication resource overhead of claim 3, characterized in that step 104 comprises:
step 1041: input the one-dimensional feature map of each frame of the sample speech into the input layer of the convolutional neural network system;
step 1042: obtain each feature map of the convolutional layer by convolving the one-dimensional feature map of the input layer with a convolution kernel;
each convolutional feature map Q_j is computed as:

Q_j = σ(O * w_j), j = 1, ..., J

where Q_j is the j-th feature map of the convolutional layer; J is the number of convolutional feature maps, which also equals the number of subsampling feature maps; O is the one-dimensional feature map of the input layer; * denotes convolution; and w_j is the weight matrix connecting the input layer to the neurons of the j-th convolutional feature map, all the w_j together forming the matrix w;
σ(x) is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

after the convolution, each neuron on a convolutional feature map extracts the corresponding local feature of the input-layer feature map;
step 1043: the subsampling layer performs a dimensionality-reduction operation on each convolutional feature map Q_j;
the neurons on each convolutional feature map are evenly grouped, and the maximum neuron of each group is taken as a neuron of the subsampling layer:

$$p_{j,m} = \max_{n=1,\ldots,G} q_{j,(m-1)\times s+n}$$

where p_{j,m} is the m-th neuron of the j-th subsampling feature map P_j; q_{j,m} is the m-th neuron of the j-th convolutional feature map Q_j; s is the window stride; and G is the window size of the dimensionality-reduction operation from the convolutional layer to the subsampling layer, the stride s being equal to the window size G;
step 1044: each neuron of the output layer is connected to all neurons of the subsampling layer;
the output vector y is computed as:

$$y = \sigma(\hat{P} \cdot m)$$

where $\hat{P}$ is the one-dimensional vector generated from the feature maps P_j, j = 1, ..., J, of the subsampling layer; m is the weight matrix connecting this vector to the output-layer neurons; and the output one-dimensional vector y is written as [x, y];
step 1045: after the full connection, the output layer outputs the result [x, y] and the result is recorded.
5. The method of reducing VOIP communication resource overhead of claim 3, characterized in that step 106 comprises:
step 1061: for the n-th frame of the input speech, propagate the error term d(n) back to the subsampling layer to obtain the output error term sd(n) of the subsampling layer:

sd(n) = m^T(n)·d(n)

where m^T(n) is the transpose of m(n), and m(n) is the weight matrix connecting the output feature maps of the subsampling layer to the output-layer neurons when the n-th frame of the input speech is input to the convolutional neural network system;
step 1062: split the output error term sd(n) into the J × I matrix $\widetilde{sd}(n)$;
writing sd(n) = [sd_1, sd_2, ..., sd_M], the split arranges it row by row into the J × I matrix $\widetilde{sd}(n)$ with elements

$$s_{j,i} = sd_{(j-1)\times I + i}, \quad j = 1, \ldots, J, \; i = 1, \ldots, I$$

where I is the number of neurons per subsampling feature map and M is the total number of subsampling-layer neurons, M = J × I;
step 1063: transfer the J × I matrix $\widetilde{sd}(n)$ to the convolutional layer and expand its dimensions to obtain the error term c(n) of the convolutional layer, each matrix element being expanded by the window size G of the dimensionality-reduction operation from the convolutional layer to the subsampling layer;
step 1064: for the n-th frame of the input speech, compute the error terms of the weight matrices w(n) and m(n);
the error term of the weight matrix w(n) connecting the input layer to the convolutional layer is:

Δw(n) = o''(n) * c(n)

where o''(n) is the input feature map O of the input layer for the n-th input frame rotated by 180°;
the error term of the weight matrix m(n) connecting the subsampling layer to the output layer is:

$$\Delta m(n) = d(n) \cdot \hat{P}(n)$$

where $\hat{P}(n)$ is the one-dimensional vector formed from the subsampling-layer output for the n-th input frame, and d(n) is the error term between the actual output and the ideal output;
step 1065: update the weight matrices w(n) and m(n) corresponding to the n-th input frame to obtain the weight matrices w(n+1) and m(n+1) for the (n+1)-th input frame:

w(n+1) = w(n) - αΔw(n)
m(n+1) = m(n) - αΔm(n)

where w(n) and m(n) are, respectively, the weight matrix between the input layer and the convolutional layer and the weight matrix between the subsampling layer and the output layer for the n-th input frame, and α is the learning rate.
CN201510638936.1A 2015-09-30 2015-09-30 System and method for reducing VOIP communication resource overhead Active CN105321525B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510638936.1A CN105321525B (en) 2015-09-30 2015-09-30 System and method for reducing VOIP communication resource overhead

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510638936.1A CN105321525B (en) 2015-09-30 2015-09-30 System and method for reducing VOIP communication resource overhead

Publications (2)

Publication Number Publication Date
CN105321525A true CN105321525A (en) 2016-02-10
CN105321525B CN105321525B (en) 2019-02-22

Family

ID=55248773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510638936.1A Active CN105321525B (en) 2015-09-30 2015-09-30 A kind of system and method reducing VOIP communication resource expense

Country Status (1)

Country Link
CN (1) CN105321525B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106683680A (en) * 2017-03-10 2017-05-17 百度在线网络技术(北京)有限公司 Speaker recognition method and device and computer equipment and computer readable media
CN106790300A (en) * 2017-03-21 2017-05-31 青岛海信宽带多媒体技术有限公司 A kind of method and apparatus conversed
CN108847251A (en) * 2018-07-04 2018-11-20 武汉斗鱼网络科技有限公司 A kind of voice De-weight method, device, server and storage medium
CN108962231A (en) * 2018-07-04 2018-12-07 武汉斗鱼网络科技有限公司 A kind of method of speech classification, device, server and storage medium
CN109891434A (en) * 2016-09-06 2019-06-14 渊慧科技有限公司 Audio is generated using neural network
CN110017991A (en) * 2019-05-13 2019-07-16 山东大学 Rolling bearing fault classification method and system based on spectrum kurtosis and neural network
CN110223713A (en) * 2019-06-11 2019-09-10 苏州思必驰信息科技有限公司 Sound event detection model training method and sound event detection method
CN110706694A (en) * 2019-09-26 2020-01-17 成都数之联科技有限公司 Voice endpoint detection method and system based on deep learning
US10586531B2 (en) 2016-09-06 2020-03-10 Deepmind Technologies Limited Speech recognition using convolutional neural networks
CN111179975A (en) * 2020-04-14 2020-05-19 深圳壹账通智能科技有限公司 Voice endpoint detection method for emotion recognition, electronic device and storage medium
CN111210815A (en) * 2019-11-28 2020-05-29 赵铭 Deep neural network construction method for voice command word recognition, and recognition method and device
US10733390B2 (en) 2016-10-26 2020-08-04 Deepmind Technologies Limited Processing text sequences using neural networks
US11080591B2 (en) 2016-09-06 2021-08-03 Deepmind Technologies Limited Processing sequences using convolutional neural networks
CN114420111A (en) * 2022-03-31 2022-04-29 成都启英泰伦科技有限公司 One-dimensional hypothesis-based speech vector distance calculation method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101278337A (en) * 2005-07-22 2008-10-01 索福特迈克斯有限公司 Robust separation of speech signals in a noisy environment
US7660713B2 (en) * 2003-10-23 2010-02-09 Microsoft Corporation Systems and methods that detect a desired signal via a linear discriminative classifier that utilizes an estimated posterior signal-to-noise ratio (SNR)
CN104021373A (en) * 2014-05-27 2014-09-03 Jiangsu University Semi-supervised speech feature variable factor decomposition method
CN104157290A (en) * 2014-08-19 2014-11-19 Dalian University of Technology Speaker recognition method based on deep learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7660713B2 (en) * 2003-10-23 2010-02-09 Microsoft Corporation Systems and methods that detect a desired signal via a linear discriminative classifier that utilizes an estimated posterior signal-to-noise ratio (SNR)
CN101278337A (en) * 2005-07-22 2008-10-01 Softmax, Inc. Robust separation of speech signals in a noisy environment
CN104021373A (en) * 2014-05-27 2014-09-03 Jiangsu University Semi-supervised speech feature variable factor decomposition method
CN104157290A (en) * 2014-08-19 2014-11-19 Dalian University of Technology Speaker recognition method based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
THEA GHISELLI-CRIPPA, AMRO EL-JAROUDI: "A fast neural net training algorithm and its application to voiced-unvoiced-silence classification of speech", Proc. Int. Conf. ASSP, 1991 *
XIAO-LEI ZHANG, JI WU: "Denoising deep neural networks based voice activity detection", IEEE International Conference on Acoustics *
WANG Jingyao: "Research on face recognition based on SDCNN", China Master's Theses Full-text Database *
XIE Jianbin et al. (eds.): "Visual Machine Learning", Tsinghua University Press, 30 June 2015 *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10803884B2 (en) 2016-09-06 2020-10-13 Deepmind Technologies Limited Generating audio using neural networks
US11948066B2 (en) 2016-09-06 2024-04-02 Deepmind Technologies Limited Processing sequences using convolutional neural networks
US11869530B2 (en) 2016-09-06 2024-01-09 Deepmind Technologies Limited Generating audio using neural networks
US11386914B2 (en) 2016-09-06 2022-07-12 Deepmind Technologies Limited Generating audio using neural networks
CN109891434A (en) * 2016-09-06 2019-06-14 DeepMind Technologies Limited Generating audio using neural networks
US11080591B2 (en) 2016-09-06 2021-08-03 Deepmind Technologies Limited Processing sequences using convolutional neural networks
US11069345B2 (en) 2016-09-06 2021-07-20 Deepmind Technologies Limited Speech recognition using convolutional neural networks
CN109891434B (en) * 2016-09-06 2020-10-30 DeepMind Technologies Limited Generating audio using neural networks
US10586531B2 (en) 2016-09-06 2020-03-10 Deepmind Technologies Limited Speech recognition using convolutional neural networks
US10733390B2 (en) 2016-10-26 2020-08-04 Deepmind Technologies Limited Processing text sequences using neural networks
US11321542B2 (en) 2016-10-26 2022-05-03 Deepmind Technologies Limited Processing text sequences using neural networks
CN106683680A (en) * 2017-03-10 2017-05-17 Baidu Online Network Technology (Beijing) Co., Ltd. Speaker recognition method and device, computer equipment and computer-readable medium
US10957339B2 (en) 2017-03-10 2021-03-23 Baidu Online Network Technology (Beijing) Co., Ltd. Speaker recognition method and apparatus, computer device and computer-readable medium
CN106790300A (en) * 2017-03-21 2017-05-31 Qingdao Hisense Broadband Multimedia Technology Co., Ltd. Method and apparatus for making a call
CN108847251A (en) * 2018-07-04 2018-11-20 Wuhan Douyu Network Technology Co., Ltd. Voice duplicate removal method, device, server and storage medium
CN108847251B (en) * 2018-07-04 2022-12-02 Wuhan Douyu Network Technology Co., Ltd. Voice duplicate removal method, device, server and storage medium
CN108962231B (en) * 2018-07-04 2021-05-28 Wuhan Douyu Network Technology Co., Ltd. Voice classification method, device, server and storage medium
CN108962231A (en) * 2018-07-04 2018-12-07 Wuhan Douyu Network Technology Co., Ltd. Voice classification method, device, server and storage medium
CN110017991A (en) * 2019-05-13 2019-07-16 Shandong University Rolling bearing fault classification method and system based on spectral kurtosis and neural network
CN110017991B (en) * 2019-05-13 2020-03-31 Shandong University Rolling bearing fault classification method and system based on spectral kurtosis and neural network
CN110223713A (en) * 2019-06-11 2019-09-10 Suzhou AISpeech Information Technology Co., Ltd. Sound event detection model training method and sound event detection method
CN110706694A (en) * 2019-09-26 2020-01-17 Chengdu Shuzhilian Technology Co., Ltd. Voice endpoint detection method and system based on deep learning
CN111210815B (en) * 2019-11-28 2023-01-06 Zhao Ming Deep neural network construction method for voice command word recognition, and recognition method and device
CN111210815A (en) * 2019-11-28 2020-05-29 Zhao Ming Deep neural network construction method for voice command word recognition, and recognition method and device
CN111179975A (en) * 2020-04-14 2020-05-19 Shenzhen OneConnect Smart Technology Co., Ltd. Voice endpoint detection method for emotion recognition, electronic device and storage medium
CN114420111A (en) * 2022-03-31 2022-04-29 Chengdu Chipintelli Technology Co., Ltd. One-dimensional hypothesis-based speech vector distance calculation method
CN114420111B (en) * 2022-03-31 2022-06-17 Chengdu Chipintelli Technology Co., Ltd. One-dimensional hypothesis-based speech vector distance calculation method

Also Published As

Publication number Publication date
CN105321525B (en) 2019-02-22

Similar Documents

Publication Publication Date Title
CN105321525A (en) System and method for reducing VOIP (voice over internet protocol) communication resource overhead
CN110136731B (en) End-to-end blind enhancement method for bone-conducted speech using a dilated causal convolution generative adversarial network
CN107393526B (en) Voice silence detection method, device, computer equipment and storage medium
CN111243620B (en) Voice separation model training method and device, storage medium and computer equipment
Zhang et al. Robust sound event recognition using convolutional neural networks
US6691090B1 (en) Speech recognition system including dimensionality reduction of baseband frequency signals
CN113488058B (en) Voiceprint recognition method based on short voice
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN113612808B (en) Audio processing method, related device, storage medium, and program product
CN115602152B (en) Voice enhancement method based on multi-stage attention network
Zhang et al. An efficient perceptual hashing based on improved spectral entropy for speech authentication
CN111883181A (en) Audio detection method and device, storage medium and electronic device
CN114338623B (en) Audio processing method, device, equipment and medium
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN109036470A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN114708855A (en) Voice awakening method and system based on binary residual error neural network
CN112420079B (en) Voice endpoint detection method and device, storage medium and electronic equipment
Raj et al. Multilayered convolutional neural network-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients
CN108010533A (en) Automatic identification method and device for the bit rate of voice data
CN114898766A (en) Distributed optical fiber voice enhancement method based on GAN and tunnel rescue system
CN111785262B (en) Speaker age and gender classification method based on residual network and fused features
CN113327633A (en) Method and device for detecting noisy speech endpoint based on deep neural network model
CN111048072A (en) Voiceprint recognition method applied to power enterprises
CN113506583B (en) Disguised voice detection method using a residual network
CN117393000B (en) Synthetic voice detection method based on neural network and feature fusion

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant