CN105321525B - A kind of system and method reducing VOIP communication resource expense - Google Patents
- Publication number: CN105321525B
- Application number: CN201510638936.1A
- Authority
- CN
- China
- Prior art keywords
- layer
- frame
- voice
- output
- dimensional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
Abstract
The invention discloses a system for reducing VOIP communication resource overhead, comprising an input layer, a convolutional layer, a sub-sampling layer and an output layer, each layer consisting of feature maps and each one-dimensional feature map containing neurons. A method of reducing VOIP communication resource overhead using the system comprises: (1) training the convolutional neural network system; (2) initializing the convolutional neural network system; (3) inputting the speech to be tested into the VAD system; (4) sequentially extracting the MFCC speech feature parameters and their first-order difference parameters for each frame; (5) composing each frame's parameters into a one-dimensional feature map and feeding it into the convolutional neural network system; (6) the convolutional neural network system sequentially outputs a result [x, y] for each frame of the speech to be tested, and the VAD system makes and records the judgment. Advantages: introducing a convolutional neural network system into the VAD system for detection reduces the false-judgment rate of the VAD system, saves computation time and bandwidth, and reduces the voice resource overhead of VOIP while guaranteeing communication quality.
Description
Technical field
The invention belongs to the field of communications and computing, and relates to a system and method for reducing VOIP communication resource overhead.
Background technique
VoIP (Voice over Internet Protocol), also known as voice over IP or IP telephony, digitizes the analog voice signal, converts it into IP data packets, and delivers it in real time over an IP network in the form of data packets.
With the rapid development and continuous improvement of IP network technology, IP-based networks are gradually replacing the traditional voice-switching network. When transmitting voice data over an IP network under a fixed network bandwidth, improving channel utilization can greatly improve voice quality.
In the prior art, channel utilization is generally improved with Voice Activity Detection (VAD) technology. Based on the alternation of speech and silence in everyday conversation, VAD suppresses transmission when silence is detected, occupying little or no channel bandwidth, and applies compression coding and transmission only when bursts of active speech are detected. In fact, people speak for only about 27%-40% of the time during a phone call; using VAD to remove the silent periods and retain only the information-bearing portions can raise the effective utilization of channel bandwidth by 50% or more. VAD has therefore become a powerful tool for reducing bandwidth and improving channel utilization in voice traffic transmission.

In IP telephony applications, VAD is used when transmitting voice data to remove the non-speech segments of a call, avoiding the coding and transmission of silence packets and saving computation time and bandwidth.
The method used by current VAD technology is as follows. First, the current audio frame to be detected is received, and time-domain and frequency-domain parameters are obtained from it. A first distance is then computed between the time-domain parameter and its long-term sliding average over historical background noise frames, and a second distance between the frequency-domain parameter and its long-term sliding average over historical background noise frames. "Historical background noise frames" are the background noise frames preceding the current frame, for example several consecutive background noise frames immediately before it; if the current frame is the very first frame, a preset frame, or the first frame itself, may serve as the historical background noise frame. Finally, a set of decision polynomials based on the first distance and the second distance determines whether the current frame to be detected is a foreground speech frame or a background noise frame. At least one coefficient in the decision polynomial set is variable and is determined by the VAD operating mode and/or the characteristics of the input signal.
The above technique mainly computes values such as the pitch, energy and tone information of the speech signal and compares them with standard thresholds: a value above the threshold is judged as speech, below it as noise. Such algorithms have the advantage of simplicity, but they are only suitable for relatively simple noise sources. With complex noise sources, the speech information described above mixes easily with the noise, and the thresholds struggle to adapt to changes in the environment.
Summary of the invention
Aiming at the weak adaptability of existing VAD detection techniques to environmental change and their low decision accuracy in complex environments, the present invention proposes a system and method for reducing VOIP communication resource overhead.
A system for reducing VOIP communication resource overhead, also called a convolutional neural network system, comprises 4 layers: an input layer, a convolutional layer, a sub-sampling layer and an output layer in sequence. Each layer consists of feature maps, and each one-dimensional feature map contains neurons.
The input layer receives the MFCC speech feature parameters of each frame of the speech to be tested together with their first-order difference parameters; these parameters form a one-dimensional feature map.

The feature map of the convolutional layer has multiple dimensions; each one-dimensional feature map is obtained by convolving the input layer's feature map with one convolution kernel. Depending on the kernel size, several input-layer neurons map to one convolutional-layer neuron, and the neuron at the same position in each dimension's feature map connects to the corresponding neurons of the input layer.

The sub-sampling layer performs a dimension-reduction operation on each one-dimensional feature map of the convolutional layer: the neurons of each map are evenly grouped, and the largest neuron in each group is selected as a neuron of the sub-sampling layer. The sub-sampling layer has the same number of feature-map dimensions as the convolutional layer, and the number of neurons per map is reduced to 1/3 to 2/3 of the convolutional layer's count.

The output layer is a one-dimensional feature map; each of its neurons is fully connected to all neurons of the sub-sampling layer, and it outputs the actual result for the speech to be tested.
A method of reducing VOIP communication resource overhead comprises the following specific steps.

Step 1: train the convolutional neural network system, and obtain and save the weight matrices w and m. The specific sub-steps are as follows.
Step 101: choose speech samples and label each frame as a speech frame or a non-speech frame. The speech samples are labeled by hand in sequence; speech frames are labeled [1, 0] and non-speech frames [0, 1], which serve as the ideal outputs of the output layer for speech-frame and non-speech-frame decisions respectively.
Step 102: for each labeled frame, sequentially extract the MFCC speech feature parameters and their first-order difference parameters. When extracting MFCC (Mel-scale Frequency Cepstral Coefficients) features, the signal is divided into frames of 20 ms.
The MFCC speech feature parameters are calculated as:

M(n) = Σ_{k=1}^{N} m_k · cos( πn(k − 0.5) / N ),  n = 1, …, L

where M(n) is the n-th order MFCC parameter; m_k (k = 1, …, N) is the input of the discrete cosine transform; N is the number of Mel filters; and L is the maximum order of the MFCC coefficients.
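The discrete cosine transform in this step can be sketched as follows (a minimal numpy illustration under the standard DCT-II form this formula follows; the function name and the constant test input are illustrative, not from the patent):

```python
import numpy as np

def mfcc_from_mel(log_mel, L):
    """M(n) = sum_k m_k * cos(pi*n*(k-0.5)/N) for n = 1..L,
    where m_k are the N Mel filter log-energies fed to the DCT."""
    N = len(log_mel)
    k = np.arange(1, N + 1)
    return np.array([np.sum(log_mel * np.cos(np.pi * n * (k - 0.5) / N))
                     for n in range(1, L + 1)])

# With N = 24 filters and L = 13 orders: a constant input has no AC
# content, so every DCT coefficient of order >= 1 vanishes.
M = mfcc_from_mel(np.ones(24), L=13)
```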
The first-order difference MFCC parameter, denoted ΔMFCC, is calculated as:

D(t) = [ Σ_{n=1}^{R} n · ( M(t+n) − M(t−n) ) ] / [ 2 Σ_{n=1}^{R} n² ],  t = 1, …, L'

where D(t) is the t-th order parameter of the first-order difference; L' is the maximum order of the ΔMFCC coefficients; M(t+n) and M(t−n) are the (t+n)-th and (t−n)-th order MFCC parameters; and n = 1, …, R, where R is the time difference of the first-order cepstral coefficients.
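The first-order difference can be sketched numerically (a hedged numpy illustration of the regression formula; the repeat-padding at the sequence edges is an assumption, as the patent does not specify edge handling):

```python
import numpy as np

def delta(M, R=2):
    """D(t) = sum_{n=1..R} n*(M[t+n] - M[t-n]) / (2*sum n^2),
    computed along the frame/order axis; edges padded by repetition."""
    denom = 2.0 * sum(n * n for n in range(1, R + 1))
    T = len(M)
    D = np.zeros(T)
    for t in range(T):
        for n in range(1, R + 1):
            D[t] += n * (M[min(t + n, T - 1)] - M[max(t - n, 0)])
    return D / denom

# On a linear ramp of slope 1, interior points recover the slope exactly.
D = delta(np.arange(10, dtype=float), R=2)
```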
Step 103: sequentially compose the MFCC parameters of each frame and their first-order differences into a one-dimensional feature map. The one-dimensional feature map V is:

V = [MFCC | ΔMFCC] = [M(1), M(2), …, M(L), D(1), D(2), …, D(L')]
Step 104: feed the one-dimensional feature map of step 103 into the convolutional neural network system to obtain the decision result. The specific sub-steps are as follows.

Step 1041: input the one-dimensional feature map of each frame of the sample speech to the input layer of the convolutional neural network system.
Step 1042: each one-dimensional feature map of the convolutional layer is obtained by convolving the input layer's one-dimensional feature map with a convolution kernel. Each feature map Q_j of the convolutional layer is computed as:

Q_j = σ(O * w_j),  j = 1, …, J

where Q_j is the j-th feature map of the convolutional layer, J is the number of feature-map dimensions of the convolutional layer, O is the one-dimensional feature map of the input layer, * denotes convolution, and w_j is the weight matrix connecting the input layer to the j-th feature map of the convolutional layer; all the w_j together constitute the matrix w. σ(x) is the activation function, chosen as the sigmoid function:

σ(x) = 1 / (1 + e^(−x))

After convolution, each neuron on each feature map of the convolutional layer extracts the corresponding local feature of the input layer's feature map.
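Step 1042 can be sketched as follows (a minimal numpy illustration using the 1*26 input and 12 kernels of size 1*6 given later in the embodiment; the zero-initialized weights are purely for the shape check):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_layer(O, kernels):
    """Q_j = sigmoid(O * w_j): valid 1-D convolution of the input map O
    with each kernel w_j, followed by the sigmoid activation."""
    K = kernels.shape[1]
    n_out = len(O) - K + 1                  # 26 - 6 + 1 = 21 neurons per map
    Q = np.empty((kernels.shape[0], n_out))
    for j, w in enumerate(kernels):
        for i in range(n_out):
            Q[j, i] = np.dot(O[i:i + K], w)
    return sigmoid(Q)

Q = conv_layer(np.zeros(26), np.zeros((12, 6)))   # 12 maps of 21 neurons
```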
Step 1043: the sub-sampling layer performs a dimension-reduction operation on each feature map Q_j of the convolutional layer. The neurons of each map are evenly grouped, and the largest neuron of each group is selected as a neuron of the sub-sampling layer:

p_{j,m} = max{ q_{j,i} : (m−1)·s < i ≤ (m−1)·s + G }

where p_{j,m} is the m-th neuron of the j-th feature map P_j of the sub-sampling layer, q_{j,i} is the i-th neuron of the convolutional layer's j-th feature map Q_j, s is the window step, and G is the window size of the dimension-reduction operation performed from the convolutional layer to the sub-sampling layer. The step s takes the same value as the window size G.
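The grouping-and-maximum operation of step 1043 can be sketched as (a minimal numpy illustration with the G = s = 3 window used later in the embodiment; the ramp input is invented for the check):

```python
import numpy as np

def subsample(Q, G=3):
    """p_{j,m} = max of the m-th group of G consecutive neurons on each
    convolutional-layer map (window step s equal to window size G)."""
    J, n = Q.shape
    m = n // G
    return Q[:, :m * G].reshape(J, m, G).max(axis=2)

Q = np.arange(12 * 21, dtype=float).reshape(12, 21)
P = subsample(Q, G=3)     # 12*21 -> 12*7
```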
Step 1044: each neuron of the output layer connects to all neurons of the sub-sampling layer. The output is computed as:

y = σ( m · P̂ )

where y is the output vector; P̂ is the one-dimensional vector formed by concatenating all feature maps of the sub-sampling layer, P̂ = [P_1, P_2, …, P_J], with P_j the j-th feature map of the sub-sampling layer and J the number of its feature-map dimensions; m is the weight matrix connecting the concatenated feature vector to the output layer; and σ(x) is the sigmoid activation function σ(x) = 1 / (1 + e^(−x)).

Step 1045: after the full connection, the output layer outputs and records the result.
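The full connection of step 1044 can be sketched as (a minimal numpy illustration; the 12*7 sub-sampling output and 2-neuron output layer follow the embodiment, and zero weights are used only for the shape check):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def output_layer(P, m):
    """y = sigmoid(m . P_hat), where P_hat flattens the sub-sampling
    maps (12*7) into one 84-dimensional vector."""
    P_hat = P.reshape(-1)
    return sigmoid(m @ P_hat)

y = output_layer(np.zeros((12, 7)), np.zeros((2, 84)))
```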
Step 105: for each frame, sequentially compare the actual decision result of step 104 with the ideal output labeled in step 101, and calculate the decision error term. For the n-th frame of the input speech, let the actual output of the convolutional neural network system be y(n) and the ideal output be ŷ(n). The decision error term d(n) is:

d(n) = e(n) · { y(n) · [1 − y(n)] }

where e(n) = ŷ(n) − y(n), and n is the frame index of the sample speech.
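The error term of step 105 can be illustrated with one worked frame (the numeric output values are invented for illustration):

```python
import numpy as np

def error_term(y, y_ideal):
    """d(n) = e(n) * y(n) * (1 - y(n)), elementwise, with
    e(n) = ideal output minus actual output."""
    e = y_ideal - y
    return e * y * (1.0 - y)

# A speech frame (ideal output [1, 0]) whose actual output was [0.8, 0.3]:
d = error_term(np.array([0.8, 0.3]), np.array([1.0, 0.0]))
```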
Step 106: for the decision error term of each frame, the convolutional neural network system back-propagates the error, updating the weight matrices w and m to complete one training pass.

Step 1061: for the n-th frame of the input speech, the error term d(n) is transferred back to the sub-sampling layer, giving the sub-sampling layer's output error term sd(n):

sd(n) = mᵀ(n) · d(n)

where mᵀ(n) is the transpose of m(n), the weight matrix connecting the sub-sampling layer's output feature maps to the output layer when the n-th frame is input to the convolutional neural network system.
Step 1062: split the output error term sd(n) into J matrices. sd(n) = [sd_1, sd_2, …, sd_M] is split into J matrices, one per feature map of the sub-sampling layer, where I is the number of neurons on each feature map of the sub-sampling layer and M is the total number of sub-sampling-layer neurons, M = J × I.
Step 1063: the J matrices are transferred to the convolutional layer and up-sampled (dimension increase) to obtain the convolutional layer's error term c(n), where G, the window size of the dimension-reduction operation performed from the convolutional layer to the sub-sampling layer, is the expansion factor.
Step 1064: for the n-th input frame, calculate the error terms of the weight matrices w(n) and m(n) separately.

The error term of the weight matrix w(n) connecting the input layer and the convolutional layer is:

Δw(n) = O″(n) * c(n)

where O″(n) is the input feature map O of the n-th frame at the input layer, rotated by 180 degrees.

The error term of the weight matrix m(n) connecting the sub-sampling layer and the output layer is:

Δm(n) = d(n) · P̂ᵀ(n)

where P̂(n) is the one-dimensional feature vector formed by the sub-sampling layer's output for the n-th input frame, and d(n) is the error vector between the actual and ideal outputs.
Step 1065: update the weight matrices w(n) and m(n) for the n-th input frame, obtaining the weight matrices w(n+1) and m(n+1) used for the (n+1)-th input frame:

w(n+1) = w(n) − α·Δw(n)
m(n+1) = m(n) − α·Δm(n)

where w(n) and m(n) are, respectively, the weight matrices between the input layer and the convolutional layer and between the sub-sampling layer and the output layer for the n-th input frame, and α is the learning rate, preferably α = 1.
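The update of step 1065 is plain gradient descent; a short numpy sketch (the matrix shapes follow the embodiment, and the error terms here are dummy constants for illustration):

```python
import numpy as np

def update(w, m, dw, dm, alpha=1.0):
    """w(n+1) = w(n) - alpha*dw(n); m(n+1) = m(n) - alpha*dm(n)."""
    return w - alpha * dw, m - alpha * dm

w2, m2 = update(np.ones((12, 6)), np.ones((2, 84)),
                dw=0.1 * np.ones((12, 6)), dm=0.5 * np.ones((2, 84)))
```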
Step 107: keep feeding in sample speech and repeat steps 101-106 until the output error δ of the convolutional neural network system falls below a preset value:

δ = (1/S) · Σ_{n=1}^{S} ‖ y(n) − ŷ(n) ‖²

where S is the total number of frames input per training pass, y is the vector formed by the decision results of the S frames, and ŷ is the vector formed by their ideal decision results.
Step 108: when the output error δ is below the preset value, record the weight matrices w and m and save them to a file.

Step 2: read the weight matrices w and m from the file and initialize the convolutional neural network system.
Step 3: input the speech to be tested into the VAD system.

Step 4: sequentially extract the MFCC speech feature parameters and their first-order difference parameters for each frame of the speech to be tested.

Step 5: sequentially compose the MFCC parameters of each frame and their first-order differences into a one-dimensional feature map and feed it into the convolutional neural network system.

Step 6: the convolutional neural network system sequentially outputs a result [x, y] for each frame of the speech to be tested; the VAD system judges the result [x, y] and records the judgment. If x > y, the sample frame is a speech frame and the VAD system outputs the decision 1; otherwise it is a non-speech frame and the VAD system outputs the decision 0.
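The judgment rule of step 6 reduces to a comparison of the two output components (a one-line sketch; the sample outputs are invented):

```python
def vad_decide(result):
    """Return 1 (speech frame) when x > y in the output [x, y], else 0."""
    x, y = result
    return 1 if x > y else 0

decisions = [vad_decide(r) for r in [(0.9, 0.1), (0.2, 0.7), (0.55, 0.45)]]
```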
The present invention has the following advantages:

(1) In the system for reducing VOIP communication resource overhead, once the convolutional neural network system is trained it needs no retraining; VAD detection can be performed after simply initializing the network with the weights saved in the file, so the computational complexity is low.

(2) The method for reducing VOIP communication resource overhead introduces a convolutional neural network system into the VAD system for detection, making full use of the network's learning capability. It reduces the false-judgment rate of the VAD system and saves computation time and bandwidth, thereby reducing the voice resource overhead of VOIP while guaranteeing communication quality.
Description of the drawings
Fig. 1 is a structural diagram of the system for reducing VOIP communication resource overhead according to the present invention.
Fig. 2 is a flow chart of the method for reducing VOIP communication resource overhead according to the present invention.
Fig. 3 is a flow chart of training the convolutional neural network system of the present invention.
Fig. 4 is a flow chart of the convolutional neural network system of the present invention.
Fig. 5 is a flow chart of the back-propagation of the convolutional neural network system of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below in conjunction with the drawings and embodiments.
In a VOIP call system, VAD detection is performed before speech coding to remove non-speech segments, avoiding the coding and transmission of silence packets and saving computation time and bandwidth. In the speech-coding stage, segments judged by the VAD as non-speech frames can be left uncoded or coded at a lower rate, reducing the bandwidth resources used for communication.

The system and method for reducing VOIP communication resource overhead apply the convolutional neural network system, with its simple structure, few training parameters and strong adaptability, to the VAD system. By training the convolutional neural network system and using it as a two-dimensional decision device, the false-judgment and missed-detection rates of the VAD system are reduced, thereby reducing the VOIP communication resource overhead while guaranteeing communication quality.
The system for reducing VOIP communication resource overhead, also called a convolutional neural network system, is shown in Fig. 1. It comprises 4 layers: an input layer, a convolutional layer, a sub-sampling layer and an output layer in sequence. Each layer consists of several feature maps, and each one-dimensional feature map contains several neurons.

The input layer is a one-dimensional feature map with 26 neurons, denoted 1*26. The convolutional layer has 12 feature maps of 21 neurons each, denoted 12*21. The sub-sampling layer has the same number of feature-map dimensions as the convolutional layer, with one third as many neurons per map, denoted 12*7. The output layer is a one-dimensional feature map with 2 neurons, denoted 1*2.
The input layer receives the one-dimensional feature map formed by the MFCC speech feature parameters of each frame of the speech to be tested and their first-order differences.

The convolutional layer's feature map has 12 dimensions; each is obtained by convolving the input layer's feature map with one convolution kernel. This embodiment selects 12 kernels of size 1*6. After convolution, the first 6 neurons of the input layer correspond to the first neuron of the convolutional layer's first feature map, and so on. The neuron at the same position on each feature map of the convolutional layer connects to the corresponding 6 neurons of the input layer's feature map, with the same connection pattern throughout.

The sub-sampling layer performs a dimension-reduction operation on each feature map of the convolutional layer: the neurons of each map are evenly grouped, and the largest neuron of each group is selected as a neuron of the sub-sampling layer. The sub-sampling layer has the same number of feature-map dimensions as the convolutional layer, and the number of neurons per map is reduced to 1/3 to 2/3 of the convolutional layer's count.

The two neurons of the output layer are fully connected to all neurons of the sub-sampling layer and output the actual result for the speech to be tested.
The method of reducing VOIP communication resource overhead by performing VAD detection with the convolutional neural network system, shown in Fig. 2, comprises the following steps.

Step 1: train the convolutional neural network system, and obtain and save the weight matrices w and m, as shown in Fig. 3.

Step 101: choose speech samples and label each frame as a speech frame or a non-speech frame. The speech samples are drawn from the speech portion of the Clemson University CUAVE database and the noisy-speech corpus NOIZEUS. The samples are labeled by hand in sequence, with one frame every 20 ms; speech frames are labeled [1, 0] and non-speech frames [0, 1], serving as the ideal outputs of the output layer for speech-frame and non-speech-frame decisions respectively.
Step 102: for each labeled frame, sequentially extract the MFCC speech feature parameters and their first-order difference parameters. When extracting MFCC (Mel-scale Frequency Cepstral Coefficients) features, the signal is divided into frames of 20 ms.

The MFCC speech feature parameters are calculated as:

M(n) = Σ_{k=1}^{N} m_k · cos( πn(k − 0.5) / N ),  n = 1, …, L

where M(n) is the n-th order MFCC parameter; m_k (k = 1, …, N) is the input of the discrete cosine transform; N is the number of Mel filters, preferably 24; and L is the maximum order of the MFCC coefficients, preferably 13.

The first-order difference MFCC parameter, denoted ΔMFCC, is calculated as:

D(t) = [ Σ_{n=1}^{R} n · ( M(t+n) − M(t−n) ) ] / [ 2 Σ_{n=1}^{R} n² ],  t = 1, …, L'

where D(t) is the t-th order parameter of the first-order difference; L' is the maximum order of the ΔMFCC coefficients; M(t+n) and M(t−n) are the (t+n)-th and (t−n)-th order MFCC parameters; and n = 1, …, R, where R is the time difference of the first-order cepstral coefficients and may take the value 1 or 2, preferably 2.
Step 103: sequentially compose the MFCC parameters of each frame and their first-order differences into a one-dimensional feature map:

V = [MFCC | ΔMFCC] = [M(1), M(2), …, M(L), D(1), D(2), …, D(L')]

where V is the one-dimensional feature map, M(n) is the n-th order MFCC parameter, and D(t) is the t-th order ΔMFCC parameter.
Step 104: feed the one-dimensional feature map of step 103 into the convolutional neural network system to obtain the decision result, as shown in Fig. 4.

Step 1041: input the one-dimensional feature map of each frame of the sample speech to the input layer of the convolutional neural network system. All the MFCC parameters of a frame and their first-order differences form a 1*26 feature map, which is fed as a one-dimensional vector into the convolutional neural network system.
Step 1042: each feature map of the convolutional layer is obtained by convolving the input layer's feature map with a convolution kernel. The kernel size is set as required; this embodiment selects 12 kernels of size 1*6. After convolution, the number of neurons per convolutional-layer map is reduced to about 80% of the input layer's count. The weight matrix connecting the input layer to the j-th feature map of the convolutional layer is denoted w_j; all the w_j together constitute the matrix w.

Each feature map Q_j of the convolutional layer is computed as:

Q_j = σ(O * w_j),  j = 1, …, J

where Q_j is the j-th feature map of the convolutional layer and J is the number of convolutional-layer feature-map dimensions, 12 in this embodiment; O is the input layer's feature map; * denotes convolution; and σ(x) is the sigmoid activation function σ(x) = 1 / (1 + e^(−x)).

The 12 feature maps of the convolutional layer are obtained by convolving the input layer's feature map with the 12 1*6 kernels. Because the neurons on each convolutional-layer map share the same connection pattern with the input layer's neurons, each neuron on each feature map extracts the corresponding local feature of the input layer's feature map.
Step 1043: the sub-sampling layer performs a dimensionality-reduction operation on each one-dimensional feature map Q_j of the convolutional layer;
The neurons on each one-dimensional feature map of the convolutional layer are divided into equal groups, and the maximum neuron in each group is chosen as a neuron of the sub-sampling layer. The calculation formula is as follows:

p_{j,m} = max( q_{j,(m−1)·s+1}, ..., q_{j,(m−1)·s+G} )

where p_{j,m} is the m-th neuron of the j-th feature map P_j of the sub-sampling layer; q_{j,m} is the m-th neuron of the j-th feature map Q_j of the convolutional layer; s is the window stride; and G is the window size of the dimensionality-reduction operation from the convolutional layer to the sub-sampling layer, i.e. the number of neurons per group. The stride s takes the same value as the window size G; this embodiment selects G = s = 3.
After dimensionality reduction, the sub-sampling layer contains 12*7 neurons; the number of neurons per one-dimensional feature map of the sub-sampling layer is 1/3 to 2/3 of the number per feature map of the convolutional layer, reducing the amount of computation.
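The max-pooling dimensionality reduction above can be sketched as follows (an illustrative NumPy sketch, not the patent's code). With G = s = 3 the 12*21 convolutional maps become 12*7, as in the embodiment.

```python
import numpy as np

def subsample(q, G=3, s=3):
    """Max-pool each row of q in windows of size G with stride s."""
    J, L = q.shape
    n_out = (L - G) // s + 1
    p = np.empty((J, n_out))
    for m in range(n_out):
        p[:, m] = q[:, m * s:m * s + G].max(axis=1)
    return p

q = np.arange(12 * 21, dtype=float).reshape(12, 21)  # placeholder conv maps
P = subsample(q)
```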
Step 1044: each neuron of the output layer is connected to all neurons of the sub-sampling layer;
The output of the sub-sampling layer and the input of the output layer are fully connected; the calculation formula is:

y = σ(M · P̂)

where y is the output vector; P̂ denotes the feature map obtained by rearranging each feature map of the sub-sampling layer into a one-dimensional vector; P_j is the j-th feature map of the sub-sampling layer and J the number of its feature maps; M is the weight matrix connecting the one-dimensional feature vector to the output layer; and σ(x) is the activation function, chosen as the sigmoid function, defined as:

σ(x) = 1 / (1 + e^(−x))
Step 1045: after the full connection, the output of the output layer is expressed as [x, y], and the result is recorded.
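The fully connected output step can be sketched as below. This is an illustrative NumPy sketch; the shapes (12*7 sub-sampling layer flattened to 84, a 2*84 weight matrix M) follow this embodiment, and the zero placeholder weights are an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def output_layer(p, m):
    """Flatten the sub-sampling maps into one vector P-hat and apply
    the fully connected output layer: y = sigmoid(m @ p_hat)."""
    p_hat = p.reshape(-1)              # one-dimensional vector, length 84
    return sigmoid(m @ p_hat)          # the pair [x, y]

p = np.full((12, 7), 0.5)              # placeholder sub-sampling output
m = np.zeros((2, 84))                  # placeholder weight matrix M
y = output_layer(p, m)
```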
Step 105: for each frame of speech, the actual decision result of step 104 is compared in order with the ideal output labelled in step 101, and the decision error term is calculated;
For the n-th speech frame, the actual output decision result of the convolutional neural network system is y(n) and the ideal output is denoted ŷ(n). The decision error term d(n) is calculated as:

d(n) = e(n) · { y(n) · [1 − y(n)] }

where e(n) = y(n) − ŷ(n) and n is the frame index of the sample speech.
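The decision error term can be sketched as follows. This is an illustrative NumPy sketch; the sign convention e(n) = y(n) − ŷ(n) is an assumption consistent with the subtractive weight updates in step 1065.

```python
import numpy as np

def decision_error(y, y_ideal):
    """d(n) = e(n) * y(n) * (1 - y(n)), the sigmoid-output error term,
    with e(n) = y(n) - y_ideal(n) (sign convention assumed)."""
    e = y - y_ideal
    return e * y * (1.0 - y)

y = np.array([0.9, 0.2])          # actual output of the output layer
ideal = np.array([1.0, 0.0])      # a labelled speech frame: [1, 0]
d = decision_error(y, ideal)
```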
Step 106: for the decision error term of each frame of speech, the convolutional neural network system back-propagates the error term and updates the weight matrices w and m, completing one training pass.
As shown in Figure 5, the specific steps are as follows:
Step 1061: for the n-th frame of the input speech, the error term d(n) is transferred in reverse to the sub-sampling layer, yielding the output error term sd(n) of the sub-sampling layer:

sd(n) = m^T(n) · d(n)

where m^T(n) is the transpose of m(n), and m(n) is the weight matrix connecting the sub-sampling-layer feature maps of the n-th speech frame to the output layer.
Step 1062: the output error term sd(n) is split into a J-dimensional matrix ŝd(n);
sd(n) is a one-dimensional matrix; since the sub-sampling layer has J dimensions, sd(n) must be split into a J-dimensional matrix. Assuming sd(n) = [sd_1, sd_2, ..., sd_M], splitting sd(n) yields the J-dimensional matrix ŝd(n); the splitting process is as follows:

s_{j,i} = sd_{(j−1)·I+i}, j = 1, ..., J, i = 1, ..., I

where I is the number of neurons per one-dimensional feature map of the sub-sampling layer; M is the total number of neurons of the sub-sampling layer, M = J × I; and s_{j,i} is an element of the J-dimensional matrix ŝd(n).
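The split can be sketched as a simple reshape (an illustrative NumPy sketch with the embodiment's sizes J = 12, I = 7):

```python
import numpy as np

# Split the one-dimensional error vector sd(n) of length M = J * I
# back into J rows of I entries, one row per sub-sampling feature map.
J, I = 12, 7
sd = np.arange(J * I, dtype=float)      # placeholder error term sd(n)
sd_split = sd.reshape(J, I)             # s_{j,i} = sd[(j-1)*I + i]
```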
Step 1063: the J-dimensional matrix ŝd(n) is transferred to the convolutional layer and expanded in dimension, yielding the error term c(n) of the convolutional layer. The process is as follows:
where G is the window size of the dimensionality-reduction operation from the convolutional layer to the sub-sampling layer.
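The dimension expansion can be sketched as below. This is an illustrative NumPy sketch: each pooled error is repeated G times over its window, a common simplification. Strictly, for max pooling only the position of the maximum receives the error; the extracted text does not give the exact rule, so the uniform expansion here is an assumption.

```python
import numpy as np

def upsample_error(sd_split, G=3):
    """Expand each sub-sampling-layer error over its pooling window of
    size G, producing the convolutional-layer error term c(n)."""
    return np.repeat(sd_split, G, axis=1)

sd_split = np.arange(12 * 7, dtype=float).reshape(12, 7)
c = upsample_error(sd_split)            # shape (12, 21)
```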
Step 1064: for the n-th frame of the input speech, the error term of the weight matrix w(n) and the error term of the weight matrix m(n) are calculated separately.
The error term of the weight matrix w(n) connecting the input layer and the convolutional layer is calculated as:

Δw(n) = o''(n) * c(n)

where o''(n) is the input feature map O of the n-th input speech frame rotated by 180 degrees.
The error term of the weight matrix m(n) connecting the sub-sampling layer and the output layer is calculated as:

Δm(n) = d(n) · P̂^T(n)

where P̂(n) is the one-dimensional feature vector constituted by the sub-sampling-layer output of the n-th input speech frame, and d(n) is the error vector between the actual output and the ideal output.
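The two error terms can be sketched as below. This is an illustrative NumPy sketch: Δw follows the stated formula Δw(n) = o''(n) * c(n) as a 'valid' convolution of the 180-degree-rotated input with the convolutional-layer error, and Δm as the outer product of d(n) with the flattened sub-sampling output; the outer-product form is an assumption consistent with the fully connected output y = σ(M · P̂).

```python
import numpy as np

def weight_error_terms(o, c, d, p_hat):
    """Return (dw, dm): the error terms of w(n) and m(n)."""
    J, L = c.shape
    K = o.size - L + 1                 # kernel length (6 here)
    o_rot = o[::-1]                    # rotate the input map 180 degrees
    dw = np.empty((J, K))
    for j in range(J):
        for k in range(K):
            dw[j, k] = np.dot(o_rot[k:k + L], c[j])
    dm = np.outer(d, p_hat)            # shape (2, 84)
    return dw, dm

o = np.linspace(-1.0, 1.0, 26)         # placeholder input feature map
c = np.full((12, 21), 0.01)            # placeholder conv-layer error c(n)
d = np.array([-0.009, 0.032])          # placeholder output error d(n)
p_hat = np.full(84, 0.5)               # flattened sub-sampling output
dw, dm = weight_error_terms(o, c, d, p_hat)
```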
Step 1065: the weight matrices w(n) and m(n) corresponding to the n-th input speech frame are updated, yielding the weight matrices w(n+1) and m(n+1) corresponding to the (n+1)-th input speech frame:

w(n+1) = w(n) − α·Δw(n)
m(n+1) = m(n) − α·Δm(n)

where w(n) and m(n) respectively denote, for the n-th input speech frame, the weight matrix between the input layer and the convolutional layer and that between the sub-sampling layer and the output layer; α is the learning rate, preferably α = 1.
Step 107: sample speech is input continually and steps 101 to 106 are repeated until the output error δ of the convolutional neural network system is less than a preset value;
where S is the total number of speech frames input per training pass, y is the vector constituted by the decision results of the S frames, and ŷ is the vector constituted by the ideal decision results of the S frames. The preset value is 5%.
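The stopping criterion can be sketched as below. The extracted text omits the exact formula for δ, so the mean absolute deviation used here is only one plausible reading and is an assumption; training stops once δ drops below the 5% preset.

```python
import numpy as np

def output_error(y_actual, y_ideal):
    """One plausible reading of delta: the mean absolute deviation
    between the S actual outputs and the S ideal outputs (assumed)."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_ideal = np.asarray(y_ideal, dtype=float)
    return float(np.mean(np.abs(y_actual - y_ideal)))

delta = output_error([[0.98, 0.02], [0.10, 0.95]],   # actual outputs
                     [[1.0, 0.0], [0.0, 1.0]])       # ideal outputs
converged = delta < 0.05
```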
Step 108: when the output error δ is less than the preset value, the weight matrices w and m are recorded and stored in a file.
Step 2: the weight matrices w and m are read from the file and the convolutional neural network system is initialized.
Step 3: the speech to be measured is input to the VAD system;
That is, the speech on which VAD decisions are to be made is input to the VAD system.
Step 4: the speech feature parameters MFCC and their first-order difference feature parameters are extracted in order from each frame of the speech to be measured;
Step 5: the feature parameters of each frame of the speech to be measured are composed in order into a one-dimensional feature map and brought into the convolutional neural network system.
The one-dimensional feature map is formed from MFCC and ΔMFCC as follows:

V = [MFCC | ΔMFCC] = [M(1), M(2), ..., M(n), ..., M(L), D(1), D(2), ..., D(t), ..., D(L')]
Step 6: the convolutional neural network system outputs in order the result [x, y] for each frame of the speech to be measured; the VAD system makes a decision on the result [x, y] and records the decision result;
The speech is divided into 20 ms frames. For one frame of speech data, the output of the output layer of the convolutional neural network system is [x, y]. When the VAD system makes its decision: if x > y, the frame is a speech frame and the VAD system outputs decision result 1; otherwise it is a non-speech frame and the VAD system outputs decision result 0.
For example, if a frame of the speech to be measured yields the output-layer result [0.98, 0.02], then since 0.98 > 0.02 the VAD system decides that the frame is a speech frame and outputs decision result 1.
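The decision rule, including the worked example above, can be sketched in a few lines (illustrative sketch; the function name is ours):

```python
def vad_decide(result):
    """VAD decision on the output pair [x, y]:
    1 (speech frame) if x > y, otherwise 0 (non-speech frame)."""
    x, y = result
    return 1 if x > y else 0

speech = vad_decide([0.98, 0.02])      # the worked example: 0.98 > 0.02
silence = vad_decide([0.30, 0.70])
```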
The present invention realizes a VAD detection technique that uses speech feature parameters, such as MFCC, MFCC with its first-order difference parameters, or MFCC with its first- and second-order difference parameters, as the input of a convolutional neural network, with the convolutional neural network serving as a binary decision device.
In VOIP, this technique can reduce the speech resource overhead while guaranteeing communication quality. The speech feature extraction part may be replaced by LPCC, LPCC with its first-order difference parameters, LPCC with its first- and second-order difference parameters, or a combination of MFCC and LPCC. The number of feature parameters may differ from the reference values given. The numbers of neurons in the convolutional and sub-sampling layers of the convolutional neural network may differ, the convolution kernel size may differ, and the convolutional and sub-sampling layers may each comprise multiple layers.
Claims (5)
1. A system for reducing VOIP communication resource overhead, characterized in that the system is a convolutional neural network system composed of an input layer, a convolutional layer, a sub-sampling layer and an output layer; each layer is composed of feature maps, and each one-dimensional feature map contains neurons; VOIP refers to Voice over IP;
the input layer receives the speech feature parameters MFCC of each frame of the speech to be measured and the first-order difference feature parameters of the MFCC, which constitute a one-dimensional feature map; MFCC refers to mel-frequency cepstral coefficients;
each one-dimensional feature map of the convolutional layer is obtained by convolving the feature map of the input layer with a convolution kernel, and neurons at the same position in each feature map are connected to the corresponding neurons of the input layer;
the sub-sampling layer performs a dimensionality-reduction operation on each one-dimensional feature map of the convolutional layer: the neurons on each one-dimensional feature map of the convolutional layer are divided into equal groups, the maximum neuron of each group is chosen as a neuron of the sub-sampling layer, and the number of neurons of the sub-sampling layer is thereby reduced to 1/3 to 2/3 of the number of neurons of the convolutional layer;
the output layer is a one-dimensional feature map; each neuron on the feature map is connected to all neurons of the sub-sampling layer, and the output layer outputs the actual result for the speech to be measured;
the working principle of the system is as follows:
the convolutional neural network system is trained to obtain and save the weight matrix w connecting the neurons of the input layer and the convolutional layer and the weight matrix m connecting the neurons of the sub-sampling layer and the output layer; the weight matrices w and m are read and the convolutional neural network system is initialized; the speech to be measured is input to the VAD system; the speech feature parameters MFCC and their first-order difference feature parameters are extracted in order from each frame of the speech to be measured; the feature parameters MFCC and their first-order difference feature parameters of each frame are composed in order into a one-dimensional feature map and brought into the convolutional neural network system; the convolutional neural network system outputs in order the result for each frame of the speech to be measured, the VAD system makes a decision on the result, and the decision result is recorded; the output of the output layer is [x, y]; when the VAD system makes its decision, if x is greater than y the frame is a speech frame and the VAD system outputs decision result 1, otherwise it is a non-speech frame and the VAD system outputs decision result 0.
2. A method of reducing VOIP communication resource overhead applying the system for reducing VOIP communication resource overhead according to claim 1, characterized in that the specific steps are as follows:
Step 1: train the convolutional neural network system to obtain and save the weight matrix w connecting the neurons of the input layer and the convolutional layer and the weight matrix m connecting the neurons of the sub-sampling layer and the output layer;
Step 2: read the weight matrices w and m and initialize the convolutional neural network system;
Step 3: input the speech to be measured into the VAD system; VAD refers to voice activity detection;
Step 4: extract in order the speech feature parameters MFCC and their first-order difference feature parameters from each frame of the speech to be measured;
Step 5: compose in order the feature parameters MFCC and their first-order difference feature parameters of each frame of the speech to be measured into a one-dimensional feature map and bring it into the convolutional neural network system;
Step 6: the convolutional neural network system outputs in order the result for each frame of the speech to be measured; the VAD system makes a decision on the result and records the decision result;
the output of the output layer is [x, y]; when the VAD system makes its decision, if x is greater than y the frame is a speech frame and the VAD system outputs decision result 1, otherwise it is a non-speech frame and the VAD system outputs decision result 0.
3. The method of reducing VOIP communication resource overhead according to claim 2, characterized in that Step 1 specifically comprises the following steps:
Step 101: choose speech samples and label each frame of the speech samples as a speech frame or a non-speech frame;
a speech frame is labelled [1, 0] and a non-speech frame [0, 1], serving respectively as the ideal outputs of the output layer of the convolutional neural network system for speech frames and non-speech frames;
Step 102: for each labelled frame of speech, extract in order the speech feature parameters MFCC and their first-order difference feature parameters;
the n-th order parameter M(n) of the speech feature parameters MFCC is calculated as:

M(n) = Σ_{k=1}^{N} m_k · cos( π·n·(k − 0.5) / N ), n = 1, ..., L

where m_k is the input of the discrete cosine transform, k = 1, ..., N; N is the number of Mel filters; and L is the maximum order of the MFCC coefficients;
the t-th order parameter D(t) of the first-order difference parameters ΔMFCC is calculated as:

D(t) = Σ_{n=1}^{R} n·[M(t+n) − M(t−n)] / ( 2·Σ_{n=1}^{R} n² )

where L' is the maximum order of the ΔMFCC coefficients; M(t+n) and M(t−n) are the (t+n)-th and (t−n)-th order parameters of the MFCC; n = 1, ..., R, and R denotes the time difference of the first-order cepstral coefficients;
Step 103: compose in order the speech feature parameters MFCC of each frame and their first-order difference feature parameters into a one-dimensional feature map;
the one-dimensional feature map V is calculated as:

V = [MFCC | ΔMFCC] = [M(1), M(2), ..., M(n), ..., M(L), D(1), D(2), ..., D(t), ..., D(L')]

Step 104: bring the one-dimensional feature map of step 103 into the convolutional neural network system to obtain the decision result;
Step 105: for each frame of speech, compare in order the actual decision result of step 104 with the ideal output labelled in step 101 and calculate the decision error term;
for the n-th frame of the input speech, the actual output result of the convolutional neural network system is y(n) and the ideal output is denoted ŷ(n); the decision error term d(n) is calculated as:

d(n) = e(n) · { y(n) · [1 − y(n)] }

where e(n) = y(n) − ŷ(n) and n is the frame index of the sample speech;
Step 106: for the decision error term of each frame of speech, the convolutional neural network system back-propagates the error term and updates the weight matrices w and m, completing one training pass;
Step 107: input sample speech continually and repeat steps 101 to 106 until the output error δ of the convolutional neural network system is less than a preset value;
where S is the total number of speech frames input per training pass, y is the vector constituted by the actual output results of the S frames after the convolutional neural network system, and ŷ is the vector constituted by the ideal results of the S input frames;
Step 108: when the output error δ is less than the preset value, record the weight matrices w and m and store them in a file.
4. The method of reducing VOIP communication resource overhead according to claim 3, characterized in that step 104 specifically comprises:
Step 1041: input the one-dimensional feature map of each frame of the sample speech to the input layer of the convolutional neural network system;
Step 1042: obtain each one-dimensional feature map of the convolutional layer by convolving the one-dimensional feature map of the input layer with a convolution kernel;
each one-dimensional feature map Q_j of the convolutional layer is calculated as:

Q_j = σ(O * w_j), j = 1, ..., J

where Q_j is the j-th feature map of the convolutional layer; J is the number of feature-map dimensions of the convolutional layer and of the sub-sampling layer; O is the one-dimensional feature map of the input layer; * denotes the convolution operation; w_j denotes the weight matrix connecting the input layer to the j-th feature map of the convolutional layer, and all weight matrices w_j constitute the matrix w;
σ(x) is defined as:

σ(x) = 1 / (1 + e^(−x))

after convolution, each neuron on each one-dimensional feature map of the convolutional layer extracts the corresponding local feature of the input-layer feature map;
Step 1043: the sub-sampling layer performs a dimensionality-reduction operation on each one-dimensional feature map Q_j of the convolutional layer;
the neurons on each one-dimensional feature map of the convolutional layer are divided into equal groups, and the maximum neuron of each group is chosen as a neuron of the sub-sampling layer; the calculation formula is:

p_{j,m} = max( q_{j,(m−1)·s+1}, ..., q_{j,(m−1)·s+G} )

where p_{j,m} is the m-th neuron of the j-th feature map P_j of the sub-sampling layer; q_{j,m} is the m-th neuron of the j-th feature map Q_j of the convolutional layer; s is the window stride; G is the window size of the dimensionality-reduction operation from the convolutional layer to the sub-sampling layer; and the stride s takes the same value as the window size G;
Step 1044: each neuron of the output layer is connected to all neurons of the sub-sampling layer;
the output vector y is calculated as:

y = σ(M · P̂)

where P̂ denotes the feature map obtained by rearranging each feature map of the sub-sampling layer into a one-dimensional vector; P_j is the j-th feature map of the sub-sampling layer; M is the weight matrix connecting the one-dimensional feature vector to the output layer; and the output one-dimensional vector y is expressed as [x, y];
Step 1045: after the full connection, the output layer outputs the result [x, y] and the result is recorded.
5. The method of reducing VOIP communication resource overhead according to claim 3, characterized in that step 106 specifically comprises:
Step 1061: for the n-th frame of the input speech, transfer the error term d(n) in reverse to the sub-sampling layer to obtain the output error term sd(n) of the sub-sampling layer:

sd(n) = m^T(n) · d(n)

where m^T(n) is the transpose of m(n), and m(n) is the weight matrix connecting the output feature maps of the sub-sampling layer to the output layer when the n-th input speech frame is input to the convolutional neural network system;
Step 1062: split the output error term sd(n) into a J-dimensional matrix ŝd(n);
sd(n) = [sd_1, sd_2, ..., sd_M] is split to obtain the J-dimensional matrix ŝd(n); the process is as follows:

s_{j,i} = sd_{(j−1)·I+i}, j = 1, ..., J, i = 1, ..., I

where I is the number of neurons per one-dimensional feature map of the sub-sampling layer and M is the total number of neurons of the sub-sampling layer, M = J × I;
Step 1063: transfer the J-dimensional matrix ŝd(n) to the convolutional layer and expand it in dimension to obtain the error term c(n) of the convolutional layer;
where G is the window size of the dimensionality-reduction operation from the convolutional layer to the sub-sampling layer;
Step 1064: for the n-th input speech frame, calculate separately the error term of the weight matrix w(n) and the error term of the weight matrix m(n);
the error term of the weight matrix w(n) connecting the input layer and the convolutional layer is calculated as:

Δw(n) = o''(n) * c(n)

where o''(n) is the input feature map O of the n-th input speech frame rotated by 180 degrees;
the error term of the weight matrix m(n) connecting the sub-sampling layer and the output layer is calculated as:

Δm(n) = d(n) · P̂^T(n)

where P̂(n) is the one-dimensional feature vector constituted by the sub-sampling-layer output of the n-th input speech frame, and d(n) is the error vector between the actual output and the ideal output;
Step 1065: update the values of the weight matrices w(n) and m(n) corresponding to the n-th input speech frame, obtaining the weight matrices w(n+1) and m(n+1) corresponding to the (n+1)-th input speech frame:

w(n+1) = w(n) − α·Δw(n)
m(n+1) = m(n) − α·Δm(n)

where w(n) and m(n) respectively denote, for the n-th input speech frame, the weight matrix between the input layer and the convolutional layer and that between the sub-sampling layer and the output layer, and α is the learning rate.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510638936.1A CN105321525B (en) | 2015-09-30 | 2015-09-30 | A kind of system and method reducing VOIP communication resource expense |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105321525A CN105321525A (en) | 2016-02-10 |
CN105321525B true CN105321525B (en) | 2019-02-22 |
Family
ID=55248773
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510638936.1A Active CN105321525B (en) | 2015-09-30 | 2015-09-30 | A kind of system and method reducing VOIP communication resource expense |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105321525B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11080591B2 (en) | 2016-09-06 | 2021-08-03 | Deepmind Technologies Limited | Processing sequences using convolutional neural networks |
JP6577159B1 (en) * | 2016-09-06 | 2019-09-18 | ディープマインド テクノロジーズ リミテッド | Generating audio using neural networks |
EP3767547A1 (en) | 2016-09-06 | 2021-01-20 | Deepmind Technologies Limited | Processing sequences using convolutional neural networks |
CN110023963B (en) | 2016-10-26 | 2023-05-30 | 渊慧科技有限公司 | Processing text sequences using neural networks |
CN106683680B (en) * | 2017-03-10 | 2022-03-25 | 百度在线网络技术(北京)有限公司 | Speaker recognition method and device, computer equipment and computer readable medium |
CN106790300A (en) * | 2017-03-21 | 2017-05-31 | 青岛海信宽带多媒体技术有限公司 | A kind of method and apparatus conversed |
CN108962231B (en) * | 2018-07-04 | 2021-05-28 | 武汉斗鱼网络科技有限公司 | Voice classification method, device, server and storage medium |
CN108847251B (en) * | 2018-07-04 | 2022-12-02 | 武汉斗鱼网络科技有限公司 | Voice duplicate removal method, device, server and storage medium |
CN110017991B (en) * | 2019-05-13 | 2020-03-31 | 山东大学 | Rolling bearing fault classification method and system based on spectral kurtosis and neural network |
CN110223713A (en) * | 2019-06-11 | 2019-09-10 | 苏州思必驰信息科技有限公司 | Sound event detection model training method and sound event detection method |
CN110706694B (en) * | 2019-09-26 | 2022-04-08 | 成都数之联科技股份有限公司 | Voice endpoint detection method and system based on deep learning |
CN111210815B (en) * | 2019-11-28 | 2023-01-06 | 赵铭 | Deep neural network construction method for voice command word recognition, and recognition method and device |
CN111179975B (en) * | 2020-04-14 | 2020-08-04 | 深圳壹账通智能科技有限公司 | Voice endpoint detection method for emotion recognition, electronic device and storage medium |
CN114420111B (en) * | 2022-03-31 | 2022-06-17 | 成都启英泰伦科技有限公司 | One-dimensional hypothesis-based speech vector distance calculation method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101278337A (en) * | 2005-07-22 | 2008-10-01 | 索福特迈克斯有限公司 | Robust separation of speech signals in a noisy environment |
US7660713B2 (en) * | 2003-10-23 | 2010-02-09 | Microsoft Corporation | Systems and methods that detect a desired signal via a linear discriminative classifier that utilizes an estimated posterior signal-to-noise ratio (SNR) |
CN104021373A (en) * | 2014-05-27 | 2014-09-03 | 江苏大学 | Semi-supervised speech feature variable factor decomposition method |
CN104157290A (en) * | 2014-08-19 | 2014-11-19 | 大连理工大学 | Speaker recognition method based on depth learning |
Non-Patent Citations (3)
Title |
---|
"A fast neural net training algorithm and its application to voiced-unvoiced-silence classification of speech";Thea Ghiselli-Crippa,Amro EI-Jaroudi;《Proc.Int.Conf.ASSP,1991》;19911231;第1卷;全文 * |
"Denoising Deep Neural Networks Based Voice Activity Detection";Xiao-Lei Zhang,Ji Wu;《IEEE International Conference on Acoustics》;20130304;第32卷(第3期);全文 * |
"基于SDCNN的人脸识别研究";王璟尧;《中国优秀硕士学位论文全文数据库》;20150515(第05期);全文 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||