CN105321525B - System and method for reducing VoIP communication resource overhead - Google Patents

System and method for reducing VoIP communication resource overhead

Info

Publication number
CN105321525B
CN105321525B (application CN201510638936.1A)
Authority
CN
China
Prior art keywords
layer
frame
voice
output
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510638936.1A
Other languages
Chinese (zh)
Other versions
CN105321525A (en)
Inventor
王珂
孙瑞浩
张鹤立
李曦
纪红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201510638936.1A
Publication of CN105321525A
Application granted
Publication of CN105321525B
Legal status: Active

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L15/063 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L25/84 — Detection of presence or absence of voice signals, for discriminating voice from noise

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention discloses a system for reducing VoIP communication resource overhead, comprising an input layer, a convolutional layer, a sub-sampling layer and an output layer; each layer consists of feature maps, and each one-dimensional feature map contains neurons. The method of using this system to reduce VoIP communication resource overhead is, specifically: one, train the convolutional neural network system; two, initialize the convolutional neural network system; three, input the voice under test into the VAD system; four, extract the MFCC speech feature parameters and their first-order difference parameters from each frame in sequence; five, assemble each frame's parameters into a one-dimensional feature map and feed it into the convolutional neural network system; six, the convolutional neural network system outputs a result [x, y] for each frame in sequence, and the VAD system makes and records the verdict. Advantages: introducing a convolutional neural network into the VAD system for detection lowers the VAD system's error rate and saves computation time and bandwidth, so the voice resource overhead of VoIP can be reduced while communication quality is guaranteed.

Description

System and method for reducing VoIP communication resource overhead
Technical field
The invention belongs to the field of communications and computing, and relates to a system and method for reducing VoIP communication resource overhead.
Background art
VoIP (Voice over Internet Protocol), also known as IP telephony or voice over IP, digitizes the analog voice signal, converts it into IP data packets, and delivers it in real time over an IP network in the form of data packets.
With the rapid development and continuous improvement of IP network technology, IP-based networks are gradually replacing traditional circuit-switched voice networks. When voice data is transmitted over an IP network with fixed bandwidth, improving channel utilization can greatly improve voice quality.
In the prior art, channel utilization is usually improved with voice activity detection (Voice Activity Detection, VAD) technology. Based on the alternation of speech and silence in everyday conversation, VAD suppresses detected silence so that it occupies little or no channel bandwidth, and applies compression coding and transmission only when a burst of active speech is detected. In fact, people speak only 27%–40% of the time during a phone call; using VAD to remove the silent periods and retain only the information-bearing parts can raise the effective utilization of channel bandwidth by 50% or more. VAD has therefore become a powerful tool for reducing bandwidth and improving channel utilization in voice traffic transmission.
When VAD is used for voice transmission in IP telephony, non-speech segments are removed during the audio session, avoiding the coding and transmission of silence packets and saving computation time and bandwidth.
The method used by current VAD technology is as follows: first receive the current audio frame to be detected, then extract time-domain and frequency-domain parameters from the frame; next compute a first distance between the time-domain parameter and its long-term sliding average over historical background noise frames, and a second distance between the frequency-domain parameter and its long-term sliding average over historical background noise frames. "Historical background noise frames" are the background noise frames preceding the current frame, for example several consecutive background noise frames before it; if the current frame is the very first frame, a preset frame, or the first frame itself, can serve as the historical background noise frame. Finally, a set of judgment polynomials based on the first and second distances decides whether the current frame to be detected is a foreground speech frame or a background noise frame; at least one coefficient of the judgment polynomials is a variable, determined by the VAD operating mode and/or the characteristics of the input signal.
The above technique mainly computes numerical values of the voice signal such as pitch, energy and tone information and compares them against standard thresholds: if a value exceeds its threshold the frame is judged to be speech, otherwise noise. The advantage of this kind of algorithm is its simplicity, but it is only suitable for relatively simple noise sources; with complex noise sources the aforementioned voice information mixes easily with the noise, and fixed thresholds have difficulty adapting to changes in the environment.
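For reference, a minimal sketch of such a threshold-based detector follows; the function and parameter names are illustrative, and a real prior-art detector combines several time- and frequency-domain distances rather than frame energy alone:

```python
import numpy as np

def energy_vad(frame, noise_energy, threshold=2.0):
    """Toy threshold-based VAD: judge a frame as speech when its energy
    exceeds a multiple of the running background-noise estimate."""
    energy = float(np.mean(frame.astype(np.float64) ** 2))
    is_speech = energy > threshold * noise_energy
    if not is_speech:
        # Track the background noise level on noise frames only.
        noise_energy = 0.95 * noise_energy + 0.05 * energy
    return is_speech, noise_energy
```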
Summary of the invention
Aiming at the shortcomings of existing VAD detection techniques — poor adaptability to environmental change and low decision accuracy in complex environments — the present invention proposes a system and method for reducing VoIP communication resource overhead.
A system for reducing VoIP communication resource overhead, also called a convolutional neural network system, comprises 4 layers: an input layer, a convolutional layer, a sub-sampling layer and an output layer. Each layer consists of feature maps, and each one-dimensional feature map contains neurons.
The input layer receives the MFCC speech feature parameters of each frame of the voice under test together with their first-order difference parameters; these parameters form a one-dimensional feature map.
The convolutional layer has multiple feature maps; each one-dimensional feature map is obtained by convolving the input layer's feature map with one convolution kernel. Depending on the kernel size, several input-layer neurons map to one convolutional-layer neuron, and the neuron at a given position of every feature map is connected to the corresponding neurons of the input layer.
The sub-sampling layer performs a dimensionality-reduction operation on each one-dimensional feature map of the convolutional layer: the neurons of each convolutional feature map are grouped evenly, and the largest neuron in each group is selected as a neuron of the sub-sampling layer. The sub-sampling layer has the same number of feature maps as the convolutional layer, and the number of neurons on the sub-sampling layer is reduced to between 1/3 and 2/3 of the number on the convolutional layer.
The output layer is a one-dimensional feature map; each of its neurons is fully connected to all neurons of the sub-sampling layer, and it outputs the actual result for the voice under test.
A method for reducing VoIP communication resource overhead, with the following specific steps:
Step 1: train the convolutional neural network system, and obtain and save the weight matrix w and the weight matrix m.
The specific steps are as follows:
Step 101: choose speech samples and label each frame of the samples as either a speech frame or a non-speech frame.
The speech samples are labeled manually in sequence: a speech frame is labeled [1,0] and a non-speech frame [0,1]; these serve as the ideal outputs of the convolutional neural network's output layer when judging speech frames and non-speech frames respectively.
Step 102: for each labeled frame, extract in sequence the MFCC speech feature parameters and their first-order difference parameters.
When MFCC (Mel-scale Frequency Cepstral Coefficients) feature parameters are extracted, the signal is divided into 20 ms frames.
The MFCC speech feature parameters are computed as:

M(n) = √(2/N) · Σ_{k=1}^{N} m_k · cos(πn(k − 0.5)/N), n = 1, ..., L

where M(n) is the n-th order MFCC parameter; m_k is the input of the discrete cosine transform, k = 1, ..., N; N is the number of Mel filters; and L is the maximum order of the MFCC coefficients.
The first-order difference MFCC parameter, denoted ΔMFCC, is computed as:

D(t) = Σ_{n=1}^{R} n · [M(t+n) − M(t−n)] / (2 · Σ_{n=1}^{R} n²), t = 1, ..., L'

where D(t) is the t-th order parameter of the first-order difference and L' is the maximum order of the ΔMFCC coefficients; M(t+n) and M(t−n) are the (t+n)-th and (t−n)-th order MFCC parameters; and n = 1, ..., R, where R is the time difference of the first-order cepstral coefficients.
Step 103: assemble the MFCC parameters of each frame and their first-order difference parameters, in sequence, into a one-dimensional feature map.
The one-dimensional feature map V is:

V = [MFCC | ΔMFCC] = [M(1), M(2), ..., M(n), ..., M(L), D(1), D(2), ..., D(t), ..., D(L')]
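A minimal numpy sketch of steps 102–103, assuming the log Mel filter-bank energies m_k are already available for each 20 ms frame (framing and Mel filtering are omitted), and reading the difference index as running over neighbouring frames, which is the conventional ΔMFCC computation:

```python
import numpy as np

def mfcc_from_filterbank(m, L=13):
    """DCT of the log Mel filter-bank energies m (length N),
    keeping orders 1..L, per the M(n) formula above."""
    N = len(m)
    n = np.arange(1, L + 1)[:, None]       # orders 1..L
    k = np.arange(1, N + 1)[None, :]       # filters 1..N
    basis = np.cos(np.pi * n * (k - 0.5) / N)
    return np.sqrt(2.0 / N) * (basis @ m)  # shape (L,)

def delta_mfcc(mfcc_frames, t, R=2):
    """First-order difference for frame t over the regression
    window t-R..t+R, per the D(t) formula above."""
    num = sum(n * (mfcc_frames[t + n] - mfcc_frames[t - n])
              for n in range(1, R + 1))
    return num / (2.0 * sum(n * n for n in range(1, R + 1)))

# One-dimensional feature map for frame t (step 103):
# V = np.concatenate([mfcc_frames[t], delta_mfcc(mfcc_frames, t)])
```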
Step 104: feed the one-dimensional feature map of step 103 into the convolutional neural network system and obtain a verdict.
The specific steps are:
Step 1041: input the one-dimensional feature map of each frame of the sample voice to the input layer of the convolutional neural network system.
Step 1042: obtain each one-dimensional feature map of the convolutional layer by convolving the input layer's one-dimensional feature map with a convolution kernel.
Each one-dimensional feature map Q_j of the convolutional layer is computed as:

Q_j = σ(O ∗ w_j), j = 1, ..., J

where Q_j is the j-th feature map of the convolutional layer; J is the number of convolutional feature maps; O is the input layer's one-dimensional feature map; ∗ denotes the convolution operation; w_j is the weight matrix of the neurons connecting the input layer to the j-th convolutional feature map, and all the w_j together form the matrix w; σ(x) is the activation function, chosen as the sigmoid function:

σ(x) = 1 / (1 + e^(−x))

After convolution, each neuron on a convolutional feature map has extracted the corresponding local feature of the input layer's feature map.
Step 1043: the sub-sampling layer performs a dimensionality-reduction operation on each feature map Q_j of the convolutional layer.
The neurons on each convolutional feature map are grouped evenly, and the largest neuron of each group is taken as a neuron of the sub-sampling layer:

p_{j,m} = max{ q_{j,(m−1)s+1}, ..., q_{j,(m−1)s+G} }

where p_{j,m} is the m-th neuron of the j-th sub-sampling feature map P_j; q_{j,i} is the i-th neuron of the j-th convolutional feature map Q_j; s is the window stride; and G is the window size of the dimensionality-reduction operation from the convolutional layer to the sub-sampling layer. The stride s is taken equal to the window size G.
Step 1044: each neuron of the output layer is connected to all neurons of the sub-sampling layer.
The output is computed as:

y = σ(m · P̃), P̃ = [P₁, P₂, ..., P_J]

where y is the output vector; P̃ denotes the feature map obtained by concatenating all the sub-sampling feature maps into a single one-dimensional vector; P_j is the j-th sub-sampling feature map and J is the number of sub-sampling feature maps; m is the weight matrix of the neurons connecting this one-dimensional feature map to the output layer; and σ(x) is the sigmoid activation function σ(x) = 1 / (1 + e^(−x)).
Step 1045: after the full connection, the output layer outputs and records the result.
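A compact numpy sketch of the forward pass in steps 1041–1045; the shapes follow the embodiment below (J = 12 maps, 1×6 kernels, pooling window G = 3), and the names forward, w and m are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(v, w, m, G=3):
    """Forward pass of the 4-layer network.
    v : (26,)    one-dimensional input feature map
    w : (J, K)   one convolution kernel per convolutional map (K = 6)
    m : (2, J*I) full-connection weights to the 2 output neurons"""
    J, K = w.shape
    # Convolutional layer: Qj = sigmoid(O * wj), valid 1-D convolution
    Q = np.array([sigmoid(np.convolve(v, wj, mode="valid")) for wj in w])
    # Sub-sampling layer: non-overlapping max pooling, stride s = window G
    I = Q.shape[1] // G
    P = Q[:, :I * G].reshape(J, I, G).max(axis=2)
    # Output layer: concatenate the maps and fully connect
    y = sigmoid(m @ P.reshape(-1))
    return y, Q, P
```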
Step 105: for each frame, compare the actual verdict of step 104 with the ideal output labeled in step 101, and compute the decision error term.
For the n-th frame of the input voice, the actual output of the convolutional neural network system is y(n) and the ideal output is denoted ŷ(n). The decision error term d(n) is:

d(n) = e(n) · {y(n)[1 − y(n)]}

where e(n) = ŷ(n) − y(n), and n = 1, ..., N, with N the number of frames of the sample voice.
Step 106: for the decision error term of each frame, the convolutional neural network system back-propagates the error, updating weight matrix w and weight matrix m; this completes one training pass.
Step 1061: for the n-th frame of the input voice, the error term d(n) is propagated back to the sub-sampling layer, giving the sub-sampling layer's output error term sd(n):

sd(n) = mᵀ(n) · d(n)

where mᵀ(n) is the transpose of m(n), and m(n) is the weight matrix connecting the sub-sampling layer's output feature map to the output layer when the n-th frame is input to the convolutional neural network system.
Step 1062: split the output error term sd(n) into J matrices s̃_j.
With sd(n) = [sd₁, sd₂, ..., sd_M], the split into the J matrices s̃_j is:

s̃_j = [sd_{(j−1)·I+1}, sd_{(j−1)·I+2}, ..., sd_{j·I}], j = 1, ..., J

where I is the number of neurons on each sub-sampling feature map and M is the total number of sub-sampling neurons, M = J × I.
Step 1063: the J matrices s̃_j are transferred to the convolutional layer and their dimension is increased, giving the convolutional layer's error term c(n):

c_j(n) = s̃_j(n) ⊗ 1_{1×G}, j = 1, ..., J

that is, each element of s̃_j is repeated G times, where G is the window size of the dimensionality-reduction operation from the convolutional layer to the sub-sampling layer.
Step 1064: for the n-th frame of the input voice, compute the error terms of weight matrix w(n) and weight matrix m(n) separately.
The error term of the weight matrix w(n) connecting the input layer and the convolutional layer is:

Δw(n) = o″(n) ∗ c(n)

where o″(n) is the input feature map O of the n-th frame at the input layer, rotated by 180 degrees.
The error term of the weight matrix m(n) connecting the sub-sampling layer and the output layer is:

Δm(n) = d(n) · P̃ᵀ(n)

where P̃(n) is the one-dimensional vector feature map formed by the sub-sampling layer's output for the n-th frame, and d(n) is the error vector between the actual and ideal outputs.
Step 1065: update the weight matrices w(n) and m(n) corresponding to the n-th input frame, obtaining the weight matrices w(n+1) and m(n+1) for the (n+1)-th input frame:

w(n+1) = w(n) − α·Δw(n)
m(n+1) = m(n) − α·Δm(n)

where w(n) and m(n) are the input-to-convolutional-layer and sub-sampling-to-output-layer weight matrices corresponding to the n-th input frame, and α is the learning rate, preferably α = 1.
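A minimal numpy sketch of one update (steps 105–106), continuing the forward() sketch above; the pooling back-propagation simply spreads each error element over its G-wide window as in step 1063, and the orientation convention of the Δw correlation is one reasonable reading of step 1064:

```python
import numpy as np

def train_step(v, y_ideal, w, m, alpha=1.0, G=3):
    """One forward/backward pass for a single frame, per steps 104-106.
    forward() is the sketch given earlier; shapes follow it."""
    y, Q, P = forward(v, w, m, G)
    J, I = P.shape
    # Decision error term: d = e * y * (1 - y), with e = y_ideal - y
    e = y_ideal - y
    d = e * y * (1.0 - y)
    # Sub-sampling layer error sd = m^T d, split into J maps of length I
    sd = (m.T @ d).reshape(J, I)
    # Increase dimension back to the convolutional layer (repeat G times)
    c = np.repeat(sd, G, axis=1)
    # Weight error terms: Delta-w per map, Delta-m as an outer product
    dw = np.array([np.correlate(v, cj, mode="valid") for cj in c])
    dm = np.outer(d, P.reshape(-1))
    # Gradient-descent update with learning rate alpha (patent prefers 1)
    return w - alpha * dw, m - alpha * dm
```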
Step 107: keep inputting sample voice, repeating steps 101–106, until the output error δ of the convolutional neural network system is smaller than a preset value:

δ = (1/S) · Σ_{n=1}^{S} ‖ŷ(n) − y(n)‖²

where S is the total number of voice frames input per training round, y is the vector formed by the verdicts on the S frames, and ŷ is the vector formed by their ideal verdicts.
Step 108: when the output error δ is smaller than the preset value, record weight matrix w and weight matrix m and save them to a file.
Step 2: read weight matrix w and weight matrix m from the file and initialize the convolutional neural network system.
Step 3: input the voice under test into the VAD system.
Step 4: extract, frame by frame, the MFCC speech feature parameters of the voice under test and their first-order difference parameters.
Step 5: assemble, frame by frame, the MFCC parameters of the voice under test and their first-order difference parameters into one-dimensional feature maps and feed them into the convolutional neural network system.
Step 6: the convolutional neural network system outputs, frame by frame, the result [x, y] for the voice under test; the VAD system judges the result [x, y] and records the verdict.
The VAD system judges the output layer's result [x, y]: if x > y, the sample frame is a speech frame and the VAD system outputs the verdict 1; otherwise it is a non-speech frame and the VAD system outputs the verdict 0.
The present invention has the following advantages:
(1) In this system for reducing VoIP communication resource overhead, once the convolutional neural network has been trained there is no need to retrain it; VAD detection can be performed after simply initializing the network with the weights saved in the file, so the computational complexity is low.
(2) This method for reducing VoIP communication resource overhead introduces a convolutional neural network into the VAD system for detection, making full use of the network's learning ability; it lowers the VAD system's error rate and saves computation time and bandwidth, so the voice resource overhead of VoIP can be reduced while communication quality is guaranteed.
Detailed description of the invention
Fig. 1 is a schematic structural diagram of the system for reducing VoIP communication resource overhead of the present invention.
Fig. 2 is a flowchart of the method for reducing VoIP communication resource overhead of the present invention.
Fig. 3 is a flowchart of training the convolutional neural network system of the present invention.
Fig. 4 is a flowchart of the convolutional neural network system of the present invention.
Fig. 5 is a flowchart of the back-propagation of the convolutional neural network system of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and an embodiment.
In a VoIP call system, VAD detection is performed before voice coding: non-speech segments are removed, the coding and transmission of silence packets are avoided, and computation time and bandwidth are saved. In the voice coding stage, segments that VAD judges to be non-speech frames can be left uncoded or coded sparsely, reducing communication bandwidth resources.
This system and method for reducing VoIP communication resource overhead applies a convolutional neural network — with its simple structure, few training parameters and strong adaptability — to the VAD system. By training the convolutional neural network and using it as a two-dimensional decision device, the false-alarm and missed-detection rates of the VAD system are lowered, thereby reducing VoIP communication resource overhead while communication quality is guaranteed.
A system for reducing VoIP communication resource overhead, also called a convolutional neural network system, is shown in Fig. 1. It comprises 4 layers — an input layer, a convolutional layer, a sub-sampling layer and an output layer — each consisting of several feature maps, with several neurons on each one-dimensional feature map.
The input layer is a one-dimensional feature map with 26 neurons, denoted 1×26. The convolutional layer has 12 feature maps of 21 neurons each, denoted 12×21. The sub-sampling layer has the same number of feature maps as the convolutional layer, with one third as many neurons per map as the convolutional layer, denoted 12×7. The output layer is a one-dimensional feature map with 2 neurons, denoted 1×2.
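These sizes are mutually consistent: sliding a 1×6 kernel over 26 inputs leaves 26 − 6 + 1 = 21 outputs per convolutional map, and pooling with window 3 leaves 21 / 3 = 7. A one-line sanity check (illustrative only):

```python
n_in, kernel, pool = 26, 6, 3
conv_out = n_in - kernel + 1    # 21 neurons per convolutional map
pool_out = conv_out // pool     # 7 neurons per sub-sampling map
assert (conv_out, pool_out) == (21, 7)
```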
The input layer receives the one-dimensional feature map formed by the MFCC speech feature parameters of each frame of the voice under test and their first-order difference parameters.
The convolutional layer has 12 feature maps, each obtained by convolving the input layer's feature map with one convolution kernel. This embodiment uses 12 kernels of size 1×6. After convolution, the first 6 neurons of the input layer correspond to the first neuron of the convolutional layer's first feature map, and so on. The neuron at a given position on each convolutional feature map is connected to the 6 corresponding neurons of the input layer's feature map, with the same connection pattern throughout.
The sub-sampling layer performs a dimensionality-reduction operation on each feature map of the convolutional layer: the neurons of each convolutional map are grouped evenly, and the largest neuron of each group is chosen as a neuron of the sub-sampling layer. The sub-sampling layer has the same number of feature maps as the convolutional layer, and the number of neurons per map is reduced to between 1/3 and 2/3 of the number on the corresponding convolutional map.
The two neurons of the output layer are fully connected to all neurons of the sub-sampling layer and output the actual result for the voice under test.
The method of performing VAD detection with the convolutional neural network system to reduce VoIP communication resource overhead is shown in Fig. 2; the specific steps are as follows:
Step 1: train the convolutional neural network system, and obtain and save weight matrix w and weight matrix m.
As shown in Fig. 3, the specific steps are as follows:
Step 101: choose speech samples and label each frame of the samples as either a speech frame or a non-speech frame.
The speech samples come from the speech portion of the Clemson University CUAVE database and from the noisy speech corpus NOIZEUS.
The speech samples are labeled manually in sequence, one frame per 20 ms: a speech frame is labeled [1,0] and a non-speech frame [0,1], serving as the ideal outputs of the convolutional neural network's output layer when judging speech frames and non-speech frames respectively.
Step 102: for each labeled frame, extract in sequence the MFCC speech feature parameters and their first-order difference parameters.
When MFCC (Mel-scale Frequency Cepstral Coefficients) feature parameters are extracted, the signal is divided into 20 ms frames.
The MFCC speech feature parameters are computed as:

M(n) = √(2/N) · Σ_{k=1}^{N} m_k · cos(πn(k − 0.5)/N), n = 1, ..., L

where M(n) is the n-th order MFCC parameter; m_k is the input of the discrete cosine transform, k = 1, ..., N; N is the number of Mel filters, with preferred value 24; and L is the maximum order of the MFCC coefficients, with preferred value 13.
The first-order difference MFCC parameter, denoted ΔMFCC, is computed as:

D(t) = Σ_{n=1}^{R} n · [M(t+n) − M(t−n)] / (2 · Σ_{n=1}^{R} n²), t = 1, ..., L'

where D(t) is the t-th order parameter of the first-order difference and L' is the maximum order of the ΔMFCC coefficients; M(t+n) and M(t−n) are the (t+n)-th and (t−n)-th order MFCC parameters; n = 1, ..., R, where R is the time difference of the first-order cepstral coefficients and can take the value 1 or 2, with preferred value 2.
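With these preferred values the feature dimensions line up with the network's 1×26 input layer; a quick arithmetic check (values from this embodiment):

```python
L_mfcc, L_delta = 13, 13        # MFCC orders and Delta-MFCC orders (L = L' = 13)
assert L_mfcc + L_delta == 26   # matches the 1x26 input feature map
```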
Step 103: assemble the MFCC parameters of each frame and their first-order difference parameters, in sequence, into a one-dimensional feature map.
The one-dimensional feature map is formed from MFCC and ΔMFCC as:

V = [MFCC | ΔMFCC] = [M(1), M(2), ..., M(n), ..., M(L), D(1), D(2), ..., D(t), ..., D(L')]

where V is the one-dimensional feature map, M(n) is the n-th order MFCC parameter and D(t) is the t-th order ΔMFCC parameter.
Step 104: feed the one-dimensional feature map of step 103 into the convolutional neural network system and obtain a verdict.
As shown in Fig. 4, the specific steps are:
Step 1041: input the one-dimensional feature map of each frame of the sample voice to the input layer of the convolutional neural network system.
All the MFCC speech feature parameters of a sample frame and their first-order difference parameters form a 1×26 feature map, which is fed to the convolutional neural network system as a one-dimensional vector.
Step 1042: obtain each feature map of the convolutional layer by convolving the input layer's feature map with a convolution kernel.
The kernel size is set as required; this embodiment uses 12 kernels of size 1×6.
After convolution, the number of neurons per convolutional feature map drops to roughly 80% of the number of input-layer neurons. The weight matrix of the neurons connecting the input layer to the j-th convolutional feature map is denoted w_j; all the w_j together form the matrix w.
Each feature map Q_j of the convolutional layer is computed as:

Q_j = σ(O ∗ w_j), j = 1, ..., J

where Q_j is the j-th feature map of the convolutional layer; J is the number of convolutional feature maps, 12 in this embodiment; O is the input layer's feature map; ∗ denotes the convolution operation; and σ(x) is the sigmoid activation function:

σ(x) = 1 / (1 + e^(−x))

The 12 feature maps of the convolutional layer are obtained by convolving the input layer's feature map with the 12 1×6 kernels. Because the neurons on each convolutional feature map share the same connection pattern with the input layer's feature map, each neuron on a convolutional map extracts the corresponding local feature of the input layer's feature map.
Step 1043: the sub-sampling layer performs a dimensionality-reduction operation on each feature map Q_j of the convolutional layer.
The neurons on each convolutional feature map are grouped evenly, and the largest neuron of each group is taken as a neuron of the sub-sampling layer:

p_{j,m} = max{ q_{j,(m−1)s+1}, ..., q_{j,(m−1)s+G} }

where p_{j,m} is the m-th neuron of the j-th sub-sampling feature map P_j; q_{j,i} is the i-th neuron of the j-th convolutional feature map Q_j; s is the window stride; and G is the window size of the dimensionality-reduction operation, i.e. the number of neurons per group. The stride s equals the window size G; this embodiment takes G = s = 3.
After dimensionality reduction the sub-sampling layer has 12×7 neurons; the number of neurons per sub-sampling feature map is between 1/3 and 2/3 of the number per convolutional feature map, which reduces the amount of computation.
Step 1044: each neuron of the output layer is connected to all neurons of the sub-sampling layer.
The sub-sampling layer's output and the output layer's input are fully connected, computed as:

y = σ(m · P̃), P̃ = [P₁, P₂, ..., P_J]

where y is the output vector; P̃ denotes the feature map obtained by concatenating all the sub-sampling feature maps into a single one-dimensional vector; P_j is the j-th sub-sampling feature map and J is the number of sub-sampling feature maps; m is the weight matrix of the neurons connecting this one-dimensional feature map to the output layer; and σ(x) is the sigmoid activation function σ(x) = 1 / (1 + e^(−x)).
Step 1045: after the full connection, the output layer outputs the result, expressed as [x, y], and records it.
Step 105: for each frame, compare the actual verdict of step 104 with the ideal output labeled in step 101, and compute the decision error term.
For the n-th speech frame, the actual output verdict of the convolutional neural network system is y(n) and the ideal output is denoted ŷ(n). The decision error term d(n) is:

d(n) = e(n) · {y(n)[1 − y(n)]}

where e(n) = ŷ(n) − y(n), and n = 1, ..., N, with N the number of frames of the sample voice.
Step 106: for the decision error term of each frame, the convolutional neural network system back-propagates the error, updating weight matrix w and weight matrix m; this completes one training pass.
As shown in Fig. 5, the specific steps are as follows:
Step 1061: for the n-th frame of the input voice, the error term d(n) is propagated back to the sub-sampling layer, giving the sub-sampling layer's output error term sd(n):

sd(n) = mᵀ(n) · d(n)

where mᵀ(n) is the transpose of m(n), and m(n) is the weight matrix of the neurons connecting the sub-sampling feature maps of the n-th speech frame to the output layer.
Step 1062: split the output error term sd(n) into J matrices s̃_j.
sd(n) is a one-dimensional matrix; since the sub-sampling layer has J feature maps, it must be split into J matrices.
With sd(n) = [sd₁, sd₂, ..., sd_M], the split is:

s̃_j = [sd_{(j−1)·I+1}, sd_{(j−1)·I+2}, ..., sd_{j·I}], j = 1, ..., J

where I is the number of neurons on each sub-sampling feature map, M is the total number of sub-sampling neurons, M = J × I, and each sd_i is an element of one of the J matrices s̃_j.
Step 1063: the J matrices s̃_j are transferred to the convolutional layer and their dimension is increased, giving the convolutional layer's error term c(n):

c_j(n) = s̃_j(n) ⊗ 1_{1×G}, j = 1, ..., J

that is, each element of s̃_j is repeated G times, where G is the window size of the dimensionality-reduction operation from the convolutional layer to the sub-sampling layer.
Step 1064: for the n-th frame of the input voice, compute the error terms of weight matrix w(n) and weight matrix m(n) separately.
The error term of the weight matrix w(n) connecting the input layer and the convolutional layer is:

Δw(n) = o″(n) ∗ c(n)

where o″(n) is the input feature map O of the n-th frame at the input layer, rotated by 180 degrees.
The error term of the weight matrix m(n) connecting the sub-sampling layer and the output layer is:

Δm(n) = d(n) · P̃ᵀ(n)

where P̃(n) is the one-dimensional vector feature map formed by the sub-sampling layer's output for the n-th frame, and d(n) is the error vector between the actual and ideal outputs.
Step 1065: update the weight matrices w(n) and m(n) corresponding to the n-th input frame, obtaining the weight matrices w(n+1) and m(n+1) for the (n+1)-th input frame:

w(n+1) = w(n) − α·Δw(n)
m(n+1) = m(n) − α·Δm(n)

where w(n) and m(n) are the input-to-convolutional-layer and sub-sampling-to-output-layer weight matrices corresponding to the n-th input frame, and α is the learning rate, preferably α = 1.
Step 107: keep inputting sample voice, repeating steps 101–106, until the output error δ of the convolutional neural network system is smaller than a preset value:

δ = (1/S) · Σ_{n=1}^{S} ‖ŷ(n) − y(n)‖²

where S is the total number of voice frames input per training round, y is the vector formed by the verdicts on the S frames, and ŷ is the vector formed by their ideal verdicts. The preset value is 5%.
Step 108: when the output error δ is smaller than the preset value, record weight matrix w and weight matrix m and save them to a file.
Step 2: read weight matrix w and weight matrix m from the file and initialize the convolutional neural network system.
Step 3: input the voice under test into the VAD system.
The voice on which the VAD judgment is to be made is input into the VAD system.
Step 4: extract, frame by frame, the MFCC speech feature parameters of the voice under test and their first-order difference parameters.
Step 5: assemble, frame by frame, the feature parameters of the voice under test into one-dimensional feature maps and feed them into the convolutional neural network system.
The one-dimensional feature map is formed from MFCC and ΔMFCC as:

V = [MFCC | ΔMFCC] = [M(1), M(2), ..., M(n), ..., M(L), D(1), D(2), ..., D(t), ..., D(L')]
Step 6: the convolutional neural network system outputs, frame by frame, the result [x, y] for the voice under test; the VAD system judges the result [x, y] and records the verdict.
The voice is divided into 20 ms frames. For one frame of voice data, the output of the network's output layer is [x, y]; when the VAD system judges it, x > y means the frame is a speech frame and the VAD system outputs 1; otherwise it is a non-speech frame and the VAD system outputs the verdict 0.
For example, if a frame of the voice under test produces the output-layer result [0.98, 0.02], then since 0.98 > 0.02 the VAD verdict is that the frame is a speech frame, and the VAD system outputs the verdict 1.
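A minimal sketch of this per-frame decision, reusing the forward() function from the earlier sketch (w and m are the weights restored from file in step 2):

```python
def vad_decide(frame_features, w, m):
    """Return the VAD verdict for one 20 ms frame:
    1 = speech frame, 0 = non-speech frame."""
    out, _, _ = forward(frame_features, w, m)
    x, y = out                  # output layer result [x, y]
    return 1 if x > y else 0

# e.g. an output of [0.98, 0.02]: 0.98 > 0.02, so the verdict is 1 (speech)
```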
The present invention realizes a VAD detection technique that uses speech feature parameters — e.g. MFCC, MFCC with its first-order difference, or MFCC with first- and second-order differences — as the input of a convolutional neural network, and uses the convolutional neural network as a two-dimensional decision device.
Applying this technique in VoIP reduces voice resource overhead while communication quality is guaranteed. The feature-extraction part can be replaced by LPCC, LPCC with its first-order difference, LPCC with first- and second-order differences, or a combination of MFCC and LPCC. The number of feature parameters may differ from the reference values given. The numbers of neurons in the convolutional and sub-sampling layers may differ, the kernel size may differ, and the convolutional and sub-sampling layers may each comprise multiple layers.

Claims (5)

1. A system for reducing VoIP communication resource overhead, characterized in that the system is a convolutional neural network system composed of an input layer, a convolutional layer, a sub-sampling layer and an output layer; each layer consists of feature maps, and each one-dimensional feature map contains neurons; VoIP refers to Voice over IP;
the input layer receives the MFCC speech feature parameters of each frame of the voice under test and the first-order difference parameters of the MFCC, these parameters forming a one-dimensional feature map; MFCC refers to Mel-scale frequency cepstral coefficients;
each one-dimensional feature map of the convolutional layer is obtained by convolving the input layer's feature map with one convolution kernel, and the neuron at a given position of every feature map is connected to the corresponding neurons of the input layer;
the sub-sampling layer performs a dimensionality-reduction operation on each one-dimensional feature map of the convolutional layer: the neurons on each convolutional feature map are grouped evenly, and the largest neuron of each group is chosen as a neuron of the sub-sampling layer; the number of neurons on the sub-sampling layer is reduced to between 1/3 and 2/3 of the number on the convolutional layer;
the output layer is a one-dimensional feature map; each of its neurons is fully connected to all neurons of the sub-sampling layer, and it outputs the actual result for the voice under test;
the working principle of the system is as follows:
train the convolutional neural network system, obtaining and saving the weight matrix w of the neurons connecting the input layer to the convolutional layer and the weight matrix m of the neurons connecting the sub-sampling layer to the output layer; read weight matrix w and weight matrix m and initialize the convolutional neural network system; input the voice under test into the VAD system; extract, frame by frame, the MFCC speech feature parameters of the voice under test and their first-order difference parameters; assemble them, frame by frame, into one-dimensional feature maps and feed them into the convolutional neural network system; the convolutional neural network system outputs the result for each frame in sequence, and the VAD system judges the result and records the verdict; the output layer's result is [x, y], and the VAD system judges it: if x is greater than y, the sample frame is a speech frame and the VAD system outputs the verdict 1; otherwise it is a non-speech frame and the VAD system outputs the verdict 0.
2. A method of reducing VoIP communication resource overhead using the system for reducing VoIP communication resource overhead of claim 1, characterized by the following specific steps:
step 1: train the convolutional neural network system, obtaining and saving the weight matrix w of the neurons connecting the input layer to the convolutional layer and the weight matrix m of the neurons connecting the sub-sampling layer to the output layer;
step 2: read weight matrix w and weight matrix m and initialize the convolutional neural network system;
step 3: input the voice under test into the VAD system, where VAD refers to voice activity detection;
step 4: extract, frame by frame, the MFCC speech feature parameters of the voice under test and their first-order difference parameters;
step 5: assemble, frame by frame, the MFCC parameters of the voice under test and their first-order difference parameters into one-dimensional feature maps and feed them into the convolutional neural network system;
step 6: the convolutional neural network system outputs the result for each frame of the voice under test in sequence; the VAD system judges the result and records the verdict;
the output layer's result is [x, y]; the VAD system judges it: if x is greater than y, the sample frame is a speech frame and the VAD system outputs the verdict 1; otherwise it is a non-speech frame and the VAD system outputs the verdict 0.
3. The method of reducing VoIP communication resource overhead of claim 2, characterized in that step 1 is specifically as follows:
step 101: choose speech samples and label each frame of the samples as either a speech frame or a non-speech frame;
a speech frame is labeled [1,0] and a non-speech frame [0,1], serving as the ideal outputs of the convolutional neural network's output layer when judging speech frames and non-speech frames respectively;
step 102: for each labeled frame, extract in sequence the MFCC speech feature parameters and their first-order difference parameters;
the n-th order MFCC parameter M(n) is computed as:

M(n) = √(2/N) · Σ_{k=1}^{N} m_k · cos(πn(k − 0.5)/N), n = 1, ..., L

where m_k is the input of the discrete cosine transform, k = 1, ..., N; N is the number of Mel filters; L is the maximum order of the MFCC coefficients;
the t-th order parameter D(t) of the first-order difference MFCC parameter ΔMFCC is computed as:

D(t) = Σ_{n=1}^{R} n · [M(t+n) − M(t−n)] / (2 · Σ_{n=1}^{R} n²), t = 1, ..., L'

where L' is the maximum order of the ΔMFCC coefficients; M(t+n) and M(t−n) are the (t+n)-th and (t−n)-th order MFCC parameters; n = 1, ..., R, where R is the time difference of the first-order cepstral coefficients;
step 103: assemble the MFCC parameters of each frame and their first-order difference parameters, in sequence, into a one-dimensional feature map;
the one-dimensional feature map V is:

V = [MFCC | ΔMFCC] = [M(1), M(2), ..., M(n), ..., M(L), D(1), D(2), ..., D(t), ..., D(L')]

step 104: feed the one-dimensional feature map of step 103 into the convolutional neural network system and obtain a verdict;
step 105: for each frame, compare the actual verdict of step 104 with the ideal output labeled in step 101, and compute the decision error term;
for the n-th frame of the input voice, the actual output of the convolutional neural network system is y(n) and the ideal output is denoted ŷ(n); the decision error term d(n) is:

d(n) = e(n) · {y(n)[1 − y(n)]}

where e(n) = ŷ(n) − y(n), and n = 1, ..., N, with N the number of frames of the sample voice;
step 106: for the decision error term of each frame, the convolutional neural network system back-propagates the error, updating weight matrix w and weight matrix m, which completes one training pass;
step 107: keep inputting sample voice, repeating steps 101–106, until the output error δ of the convolutional neural network system is smaller than a preset value:

δ = (1/S) · Σ_{n=1}^{S} ‖ŷ(n) − y(n)‖²

where S is the total number of voice frames input per training round, y is the vector formed by the actual output results of the convolutional neural network system on the S frames, and ŷ is the vector formed by the ideal results of the S input frames;
step 108: when the output error δ is smaller than the preset value, record weight matrix w and weight matrix m and save them to a file.
4. The method of reducing VoIP communication resource overhead of claim 3, characterized in that step 104 is specifically as follows:
step 1041: input the one-dimensional feature map of each frame of the sample voice to the input layer of the convolutional neural network system;
step 1042: obtain each one-dimensional feature map of the convolutional layer by convolving the input layer's one-dimensional feature map with a convolution kernel;
each one-dimensional feature map Q_j of the convolutional layer is computed as:

Q_j = σ(O ∗ w_j), j = 1, ..., J

where Q_j is the j-th feature map of the convolutional layer; J is the number of convolutional feature maps and also the number of sub-sampling feature maps; O is the input layer's one-dimensional feature map; ∗ denotes the convolution operation; w_j is the weight matrix of the neurons connecting the input layer to the j-th convolutional feature map, and all the w_j together form the matrix w;
σ(x) is defined as:

σ(x) = 1 / (1 + e^(−x))

after convolution, each neuron on a convolutional feature map has extracted the corresponding local feature of the input layer's feature map;
step 1043: the sub-sampling layer performs a dimensionality-reduction operation on each feature map Q_j of the convolutional layer;
the neurons on each convolutional feature map are grouped evenly, and the largest neuron of each group is taken as a neuron of the sub-sampling layer:

p_{j,m} = max{ q_{j,(m−1)s+1}, ..., q_{j,(m−1)s+G} }

where p_{j,m} is the m-th neuron of the j-th sub-sampling feature map P_j; q_{j,i} is the i-th neuron of the j-th convolutional feature map Q_j; s is the window stride; G is the window size of the dimensionality-reduction operation from the convolutional layer to the sub-sampling layer; the stride s equals the window size G;
step 1044: each neuron of the output layer is connected to all neurons of the sub-sampling layer;
the output vector y is computed as:

y = σ(m · P̃), P̃ = [P₁, P₂, ..., P_J]

where P̃ denotes the feature map obtained by concatenating all the sub-sampling feature maps into a single one-dimensional vector; P_j is the j-th sub-sampling feature map; m is the weight matrix of the neurons connecting this one-dimensional feature map to the output layer; the output vector y is expressed as [x, y];
step 1045: after the full connection, the output layer outputs the result [x, y] and records it.
5. The method of reducing VoIP communication resource overhead of claim 3, characterized in that step 106 is specifically as follows:
step 1061: for the n-th frame of the input voice, the error term d(n) is propagated back to the sub-sampling layer, giving the sub-sampling layer's output error term sd(n):

sd(n) = mᵀ(n) · d(n)

where mᵀ(n) is the transpose of m(n), and m(n) is the weight matrix of the neurons connecting the sub-sampling layer's output feature map to the output layer when the n-th frame is input to the convolutional neural network system;
step 1062: split the output error term sd(n) into J matrices s̃_j;
with sd(n) = [sd₁, sd₂, ..., sd_M], the split is:

s̃_j = [sd_{(j−1)·I+1}, sd_{(j−1)·I+2}, ..., sd_{j·I}], j = 1, ..., J

where I is the number of neurons on each sub-sampling feature map and M is the total number of sub-sampling neurons, M = J × I;
step 1063: the J matrices s̃_j are transferred to the convolutional layer and their dimension is increased, giving the convolutional layer's error term c(n):

c_j(n) = s̃_j(n) ⊗ 1_{1×G}, j = 1, ..., J

where G is the window size of the dimensionality-reduction operation from the convolutional layer to the sub-sampling layer;
step 1064: for the n-th frame of the input voice, compute the error terms of weight matrix w(n) and weight matrix m(n) separately;
the error term of the weight matrix w(n) connecting the input layer and the convolutional layer is:

Δw(n) = o″(n) ∗ c(n)

where o″(n) is the input feature map O of the n-th frame at the input layer, rotated by 180 degrees;
the error term of the weight matrix m(n) connecting the sub-sampling layer and the output layer is:

Δm(n) = d(n) · P̃ᵀ(n)

where P̃(n) is the one-dimensional vector feature map formed by the sub-sampling layer's output for the n-th frame, and d(n) is the error vector between the actual and ideal outputs;
step 1065: update the weight matrices w(n) and m(n) corresponding to the n-th input frame, obtaining the weight matrices w(n+1) and m(n+1) for the (n+1)-th input frame:

w(n+1) = w(n) − α·Δw(n)
m(n+1) = m(n) − α·Δm(n)

where w(n) and m(n) are the input-to-convolutional-layer and sub-sampling-to-output-layer weight matrices corresponding to the n-th input frame, and α is the learning rate.
CN201510638936.1A 2015-09-30 2015-09-30 System and method for reducing VoIP communication resource overhead Active CN105321525B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510638936.1A CN105321525B (en) 2015-09-30 2015-09-30 System and method for reducing VoIP communication resource overhead

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510638936.1A CN105321525B (en) 2015-09-30 2015-09-30 System and method for reducing VoIP communication resource overhead

Publications (2)

Publication Number Publication Date
CN105321525A CN105321525A (en) 2016-02-10
CN105321525B (en) 2019-02-22

Family

ID=55248773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510638936.1A Active CN105321525B (en) 2015-09-30 2015-09-30 System and method for reducing VoIP communication resource overhead

Country Status (1)

Country Link
CN (1) CN105321525B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11080591B2 (en) 2016-09-06 2021-08-03 Deepmind Technologies Limited Processing sequences using convolutional neural networks
JP6577159B1 (en) * 2016-09-06 2019-09-18 ディープマインド テクノロジーズ リミテッド Generating audio using neural networks
EP3767547A1 (en) 2016-09-06 2021-01-20 Deepmind Technologies Limited Processing sequences using convolutional neural networks
CN110023963B (en) 2016-10-26 2023-05-30 渊慧科技有限公司 Processing text sequences using neural networks
CN106683680B (en) * 2017-03-10 2022-03-25 百度在线网络技术(北京)有限公司 Speaker recognition method and device, computer equipment and computer readable medium
CN106790300A (en) * 2017-03-21 2017-05-31 青岛海信宽带多媒体技术有限公司 Method and apparatus for conducting a call
CN108962231B (en) * 2018-07-04 2021-05-28 武汉斗鱼网络科技有限公司 Voice classification method, device, server and storage medium
CN108847251B (en) * 2018-07-04 2022-12-02 武汉斗鱼网络科技有限公司 Voice duplicate removal method, device, server and storage medium
CN110017991B (en) * 2019-05-13 2020-03-31 山东大学 Rolling bearing fault classification method and system based on spectral kurtosis and neural network
CN110223713A (en) * 2019-06-11 2019-09-10 苏州思必驰信息科技有限公司 Sound event detection model training method and sound event detection method
CN110706694B (en) * 2019-09-26 2022-04-08 成都数之联科技股份有限公司 Voice endpoint detection method and system based on deep learning
CN111210815B (en) * 2019-11-28 2023-01-06 赵铭 Deep neural network construction method for voice command word recognition, and recognition method and device
CN111179975B (en) * 2020-04-14 2020-08-04 深圳壹账通智能科技有限公司 Voice endpoint detection method for emotion recognition, electronic device and storage medium
CN114420111B (en) * 2022-03-31 2022-06-17 成都启英泰伦科技有限公司 One-dimensional hypothesis-based speech vector distance calculation method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7660713B2 (en) * 2003-10-23 2010-02-09 Microsoft Corporation Systems and methods that detect a desired signal via a linear discriminative classifier that utilizes an estimated posterior signal-to-noise ratio (SNR)
CN101278337A (en) * 2005-07-22 2008-10-01 索福特迈克斯有限公司 Robust separation of speech signals in a noisy environment
CN104021373A (en) * 2014-05-27 2014-09-03 江苏大学 Semi-supervised speech feature variable factor decomposition method
CN104157290A (en) * 2014-08-19 2014-11-19 大连理工大学 Speaker recognition method based on depth learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"A fast neural net training algorithm and its application to voiced-unvoiced-silence classification of speech";Thea Ghiselli-Crippa,Amro EI-Jaroudi;《Proc.Int.Conf.ASSP,1991》;19911231;第1卷;全文 *
"Denoising Deep Neural Networks Based Voice Activity Detection";Xiao-Lei Zhang,Ji Wu;《IEEE International Conference on Acoustics》;20130304;第32卷(第3期);全文 *
"基于SDCNN的人脸识别研究";王璟尧;《中国优秀硕士学位论文全文数据库》;20150515(第05期);全文 *

Also Published As

Publication number Publication date
CN105321525A (en) 2016-02-10

Similar Documents

Publication Publication Date Title
CN105321525B (en) System and method for reducing VoIP communication resource overhead
CN110992987B (en) Parallel feature extraction system and method for general specific voice in voice signal
CN1121681C (en) Speech processing
CN110379412 Speech processing method and apparatus, electronic device and computer-readable storage medium
CN110739003B (en) Voice enhancement method based on multi-head self-attention mechanism
US5812973A (en) Method and system for recognizing a boundary between contiguous sounds for use with a speech recognition system
CN103065629A (en) Speech recognition system of humanoid robot
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN110415701A (en) The recognition methods of lip reading and its device
AU684214B2 (en) System for recognizing spoken sounds from continuous speech and method of using same
CN108597505A (en) Audio recognition method, device and terminal device
CN113488058A (en) Voiceprint recognition method based on short voice
CN114338623B (en) Audio processing method, device, equipment and medium
CN113539232 Speech synthesis method based on a MOOC speech data set
CN112466297B (en) Speech recognition method based on time domain convolution coding and decoding network
Shi et al. End-to-End Monaural Speech Separation with Multi-Scale Dynamic Weighted Gated Dilated Convolutional Pyramid Network.
CN109036470A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
CN113450761A (en) Parallel speech synthesis method and device based on variational self-encoder
KR100779242B1 (en) Speaker recognition methods of a speech recognition and speaker recognition integrated system
Yang et al. RS-CAE-based AR-Wiener filtering and harmonic recovery for speech enhancement
Park et al. The Second DIHARD Challenge: System Description for USC-SAIL Team.
Gaafar et al. An improved method for speech/speaker recognition
CN110136741A (en) A kind of single-channel voice Enhancement Method based on multiple dimensioned context
CN113516987B (en) Speaker recognition method, speaker recognition device, storage medium and equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant