CN103258536B - A large-scale speaker identification method - Google Patents

A large-scale speaker identification method

Info

Publication number
CN103258536B
Authority
CN
China
Prior art keywords
speaker
audio feature
haar
integral map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310074743.9A
Other languages
Chinese (zh)
Other versions
CN103258536A (en)
Inventor
罗森林
谢尔曼
潘丽敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201310074743.9A priority Critical patent/CN103258536B/en
Publication of CN103258536A publication Critical patent/CN103258536A/en
Application granted granted Critical
Publication of CN103258536B publication Critical patent/CN103258536B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The present invention relates to a text-independent speaker identification method based on 2D-Haar audio features that is suited to large-scale speaker sets. The invention proposes the concept of the 2D-Haar audio feature and a method for computing it: base audio features are first assembled into audio feature maps; 2D-Haar audio features are then extracted from these maps, and the AdaBoost.MH algorithm is used both to screen the 2D-Haar features and to train the speaker classifiers; finally, the trained classifiers are used to perform speaker identification. Compared with the prior art, the present invention effectively suppresses the drop in identification accuracy that occurs in large-scale speaker identification and achieves higher identification accuracy and speed. It is suitable not only for desktop computers but also for mobile computing platforms such as mobile phones and tablets.

Description

A large-scale speaker identification method
Technical field
The present invention relates to a text-independent speaker identification method suited to large-scale speaker sets. It belongs to the technical field of biometric identification and, from an implementation standpoint, also to the fields of computer science and speech processing.
Background technology
Speaker identification (Speaker Identification) is an important branch of speaker recognition (Speaker Recognition, SR). It uses the characteristics of each speaker's voice signal to extract speaker information from a segment of speech and then decides which of several known people uttered that segment; it is therefore a "choose one of many" pattern recognition problem. With the rapid development of modern electronic technology in recent years, the demand for speaker identification (e.g., in forensic voice examination, tracking and locating suspects by voice, and speech retrieval) has grown increasingly strong, and the technology has attracted growing attention thanks to its uniqueness, convenience, economy and accuracy.
Depending on the type of spoken content, speaker identification can be divided into two broad classes: text-dependent (Text-dependent) and text-independent (Text-independent). A text-dependent system requires the user to pronounce specified content, so an accurate model can be built for each person, but the user must also speak the specified content at identification time. A text-independent system places no restriction on what the speaker says; its models are harder to build, but its range of application is wider. In some situations, people cannot (or do not wish to) force a speaker to read a specific passage aloud, and in such application scenarios text-independent speaker identification becomes especially important.
The basic techniques of text-independent speaker identification fall into three classes: speech acquisition, feature extraction, and classification, of which feature extraction and classification are the key problems.
For feature extraction, most mainstream approaches use Mel-frequency cepstral coefficients (MFCC) or linear predictive cepstral coefficients (Linear Predictive Coding Cepstrum, LPCC), both grounded in low-level acoustics, as the characteristic parameters.
Classification methods fall into three broad classes: template matching (dynamic time warping (DTW), vector quantization (VQ)), probabilistic models (hidden Markov models (HMM), Gaussian mixture models (GMM)), and discriminative classifiers (artificial neural networks (ANN), support vector machines (SVM)). Gaussian mixture models (GMM) and support vector machines (SVM) are currently the most widely used. Among these, the GMM-UBM model has found broad application; the SVM approach is closely related to GMM-UBM, since the feature supervectors used by mainstream SVM systems are generally produced by a GMM.
Based on the above methods, text-independent speaker identification has reached practical use in some settings. However, as the number of speakers to be identified keeps growing, the accuracy of these methods drops noticeably, and once the population reaches a certain scale it becomes difficult to meet practical requirements. This is a major problem that text-independent speaker identification still needs to solve.
Summary of the invention
The object of the present invention is to innovate at both the feature extraction and the classification level and to propose a large-scale speaker identification method that still achieves high accuracy when the number of speakers to be identified is large.
The design concept of the present invention is as follows: a 2D-Haar audio feature extraction method is proposed that introduces a certain amount of temporal-relationship information and expands the audio feature space to hundreds of thousands of dimensions, providing the identification algorithm with a much larger feature space; at the same time, the AdaBoost.MH algorithm is used to screen representative feature combinations from this space and to build the identification classifier for each target speaker. While further improving accuracy, the present invention does not increase training or identification overhead, and is therefore both fast and accurate.
Technical scheme of the present invention realizes as follows:
Step 1, obtain the voice signals of the speakers to be identified (i.e., the target speakers) and form the base speech library S.
The concrete method is: connect a microphone to a computer, acquire the voice signal of each target speaker, and store it on the computer as an audio file, one audio file per target speaker, forming the base speech library S = {s_1, s_2, s_3, ..., s_k}, where k is the total number of target speakers.
Step 2, compute audio feature integral maps for the voices in the base speech library S, forming the base feature library R. The detailed process is as follows:
Step 2.1, for the k-th target speaker, split its audio file s_k into frames (the frame length f_s and frame shift Δf_s are set by the user), extract the base audio features of each frame (e.g., MFCC, LPCC, sub-band energy), and combine the base features of all frames into a base feature file v_k containing c frames with a p-dimensional feature vector per frame.
The feature vector of each frame in v_k is: {[base feature 1 (p_1 dims)], [base feature 2 (p_2 dims)], ..., [base feature n (p_n dims)]}.
In the above, for an audio file s_k of duration t, the number of frames c is determined by t, the frame length f_s and the frame shift Δf_s, and the per-frame dimension is
p = \sum_{i=1}^{n} p_i .
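The framing and base-feature step can be illustrated with a short sketch. This is not part of the patent: it assumes the librosa library and shows only the MFCC portion of v_k; the LPCC and sub-band-energy columns would be appended in the same way.

```python
# Sketch of step 2.1: per-frame base feature extraction (assumption: librosa for MFCC only).
import librosa

def base_feature_file(wav_path, sr=16000, frame_len=0.030, frame_shift=0.020, n_mfcc=12):
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(frame_len * sr)        # frame length f_s
    hop = int(frame_shift * sr)        # frame shift delta f_s
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop)
    # v_k: one row per frame, p columns (here only the 12 MFCC dimensions)
    return mfcc.T                      # shape: (c frames, p dims)
```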
Step 2.2, for the base feature file v_k of the k-th target speaker, use a sliding window with window length a and step s to convert all c frames of audio feature vectors into an audio feature map sequence file G_k (see Fig. 2),
G_k = {g_1, g_2, g_3, ..., g_{u_k}}, where u_k, the number of feature maps, is determined by c, a and s.
Step 2.3, on the basis of step 2.2, compute for the k-th target speaker the feature integral map r_u of every feature map g_u in the feature map sequence file G_k, forming that speaker's feature integral map sequence file R_k = {r_1, r_2, r_3, ..., r_u}; the feature integral map sequence files of all k target speakers in the base speech library S are gathered to form the base feature library R = {R_1, R_2, ..., R_k}.
It is easy to see that the total number m of feature integral maps of all speakers in the base feature library is
m = \sum_{k} u_k .
The feature integral map has the same size as the original feature map, and the value at any point (x, y) on it is defined as the sum of all feature values at the corresponding point (x', y') of the original map and to its upper left, i.e.:
ii(x, y) = \sum_{x' \le x,\, y' \le y} i(x', y'),
where ii(x, y) is the value of point (x, y) on the integral map and i(x', y') is the feature value of point (x', y') on the original feature map.
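A minimal numpy sketch of the integral-map computation defined above (the row/column orientation of the feature map is an assumption):

```python
import numpy as np

def integral_map(feature_map):
    """ii(x, y): cumulative sum of all feature values at and to the upper left of (x, y)."""
    return np.cumsum(np.cumsum(np.asarray(feature_map, dtype=np.float64), axis=0), axis=1)

# Example: a feature-map sequence G_k -> integral-map sequence R_k
# R_k = [integral_map(g) for g in G_k]
```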
Step 3, on the basis of the base feature library R, generate the training feature file set B for each target speaker. The detailed process is as follows:
Step 3.1, label the feature files in the base feature library R. The concrete method is:
Use consecutive integer numbers as speaker labels to represent the different target speakers, which is convenient for computer processing. The labeled form is R' = {(R_1, 1), (R_2, 2), ..., (R_k, k)}, where Y = {1, 2, ..., k} is the target speaker label set and k is the number of target speakers;
Step 3.2, on the basis of step 3.1, build for each target speaker the labeled file set B used for speaker enrollment. The concrete method is:
Perform k sorting passes over the speaker-labeled feature library R'. In the k-th pass, first take the audio feature file R_k of the k-th target speaker as the positive sample and keep its speaker label k; then treat the audio feature files of the remaining speakers as negative samples and change their speaker labels to "other"; finally store these k audio feature files in a separate folder named B_k, i.e.:
B_1 = {(R_1, 1), (R_2, other), ..., (R_k, other)},
B_2 = {(R_1, other), (R_2, 2), ..., (R_k, other)},
......
B_k = {(R_1, other), (R_2, other), ..., (R_k, k)}
After the k sorting passes, the labeled file set B = {B_1, B_2, ..., B_k}, consisting of k labeled folders, is obtained.
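The k relabeling passes of step 3.2 can be sketched as follows (an illustrative helper, not part of the patent):

```python
def build_label_sets(R_labeled):
    """R_labeled: list of (R_k, label_k). Returns B = [B_1, ..., B_k], where in B_j only
    speaker j keeps its label and every other file is relabeled 'other' (one-vs-rest)."""
    B = []
    for j in range(len(R_labeled)):
        B_j = [(R_i, lab_i if i == j else "other")
               for i, (R_i, lab_i) in enumerate(R_labeled)]
        B.append(B_j)
    return B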
Step 4, on the basis of step 3, extract 2D-Haar audio features and perform speaker enrollment: traverse the k folders in the labeled file set B in turn and use the training feature files in each to train an independent "one-vs-rest" classifier for each target speaker, finally obtaining a classifier pool composed of k speaker classifiers.
For the k-th target speaker, the corresponding classifier W_k is trained as follows:
Step 4.1, perform 2D-Haar audio feature extraction on every integral map of all feature integral map sequence files R_k in the labeled folder B_k formed in step 3.2. The concrete method is:
From each integral map compute the corresponding H-dimensional 2D-Haar audio feature values (H is determined by the 2D-Haar feature types adopted and the size of the integral map), obtaining the data set S = {(x_1, l_1), ..., (x_m, l_m)} used to train the speaker classifier, where x_i denotes the full H-dimensional 2D-Haar audio feature vector of the i-th integral map and l_i ∈ Y (Y = {1, 2, ..., k}) denotes the speaker label of the i-th integral map.
Each of the H 2D-Haar audio feature values is obtained, within a rectangular region of arbitrary size and position on the original audio feature map, by subtracting the sum of feature values in one specific rectangular sub-region from the sum in another; the integral map allows this to be computed quickly.
Record the H-dimensional 2D-Haar feature vector of each integral map as one row, so that the vectors of all m integral maps in the labeled folder B_k form a feature matrix X with m rows and H columns.
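The constant-time rectangle sums underlying the 2D-Haar values can be sketched as follows; the two-rectangle pattern shown is illustrative (one of the several pattern types referred to above), and ii is an integral map as computed in step 2.3:

```python
def rect_sum(ii, x0, y0, x1, y1):
    """Sum of feature values in the rectangle [x0, x1) x [y0, y1) read from the integral map ii."""
    total = ii[x1 - 1, y1 - 1]
    if x0 > 0:
        total -= ii[x0 - 1, y1 - 1]
    if y0 > 0:
        total -= ii[x1 - 1, y0 - 1]
    if x0 > 0 and y0 > 0:
        total += ii[x0 - 1, y0 - 1]
    return total

def haar_two_rect(ii, x, y, w, h):
    """One illustrative 2D-Haar value: difference of two adjacent w x h rectangles at (x, y)."""
    return rect_sum(ii, x, y, x + w, y + h) - rect_sum(ii, x + w, y, x + 2 * w, y + h)
```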
Step 4.2, use the AdaBoost.MH method to perform feature screening and classifier training on the 2D-Haar audio feature matrix X obtained in step 4.1, yielding a speaker classifier. The basic principle of AdaBoost.MH is: over F rounds of iteration, select F principal feature dimensions from the H-dimensional 2D-Haar audio feature value set while training F weak classifiers, which are combined into one strong classifier.
The weak classifiers used in this iterative computation must satisfy the following conditions: (1) the input of a weak classifier is a one-dimensional feature value (a specific dimension of the feature vector, i.e., a specific column of the feature matrix X); (2) for the speaker label l_i to be identified, the output of the weak classifier is 1 or -1.
The concrete AdaBoost.MH training procedure is as follows:
Step 4.2.1, initialize the weight of each integral map as D_1(i, l_i) = 1/(mk), i = 1 ... m, l_i ∈ Y.
Step 4.2.2, take each column of the feature matrix X in turn (i.e., each of the H groups of same-dimension features of all integral maps) as the input of one weak classifier, run H rounds of computation, and compute r_{f,j} according to
r_{f,j} = \sum_{(i,l)} D_f(i, l_i)\, K_i[l_i]\, h_j(x_i, l_i), \quad j = 1 \ldots H
where h_j(x_i, l_i) denotes the weak classifier whose input is the j-th feature value extracted from the i-th integral map, D_f(i, l_i) denotes the weight of the i-th training integral map in the f-th iteration, and K_i[l_i] = +1 if l_i ∈ {1, ..., k} and −1 otherwise.
From the above H weak classifiers select the h_j(x, l_i) that maximizes r_f = max(r_{f,j}); take the feature f_j(x) corresponding to this classifier as the chosen feature dimension and, at the same time, add this weak classifier, denoted h_t(x, l), to the strong classifier. Here f_j(x) denotes the j-th dimension of the H-dimensional 2D-Haar feature vector (i.e., the j-th column of the feature matrix X), and h_j(x, l) denotes the weak classifier that takes the j-th feature value as input;
Step 4.2.3, compute the weight α_f of the weak classifier h_j(x, l) selected in step 4.2.2:
\alpha_f = \frac{1}{2} \ln\!\left(\frac{1 + r_f}{1 - r_f}\right);
Step 4.2.4, compute the weight D_{f+1} of each integral map for the next iteration:
D_{f+1}(i, l_i) = \frac{D_f(i, l_i) \exp\!\left(-\alpha_f K_i[l_i] h_f(x_i, l_i)\right)}{Z_f}, \quad i = 1 \ldots m,
where h_f(x_i, l_i) denotes the weak classifier selected in iteration f, taking the j-th feature value of the i-th integral map as input, and Z_f is the normalization factor
Z_f = \sum_{i,l} D_f(i, l_i) \exp\!\left(-\alpha_f K_i[l_i] h_f(x_i, l_i)\right), \quad i = 1 \ldots m.
Step 4.2.5, substitute the new weights obtained in step 4.2.4 back into step 4.2.2 and, following steps 4.2.2 to 4.2.4, choose a new feature dimension, obtaining a new weak classifier that is added to the strong classifier;
Step 4.2.6, iterate the procedure of steps 4.2.2 to 4.2.5 F times to obtain a strong classifier composed of F weak classifiers, i.e., the identification classifier of the k-th speaker, expressed as
W_k(x) = \arg\max_{l} S_l, \quad S_l = \sum_{t=1}^{F} \alpha_t h_t(x, l) \qquad (1)
Step 4.2.7, after the k rounds of training are finished, gather all k speaker classifiers into the speaker classifier pool W = {W_1(x), W_2(x), ..., W_k(x)}.
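For illustration only, the following sketch trains one "one-vs-rest" strong classifier with decision stumps over the columns of X. It uses the standard error-based AdaBoost weight update rather than the r_f correlation form given above, searches thresholds on a coarse quantile grid, and would be far too slow for H in the hundreds of thousands; it is meant to show the structure of steps 4.2.1-4.2.6, not the patented implementation.

```python
import numpy as np

def stump_predict(x_col, theta, polarity):
    """Decision stump: +1 where polarity * x < polarity * theta, else -1."""
    return np.where(polarity * x_col < polarity * theta, 1, -1)

def train_strong_classifier(X, y, F=400, n_thresholds=16):
    """X: (m, H) 2D-Haar feature matrix; y: (m,) array of +1 (target speaker) / -1 ('other').
    Returns a list of (feature index j, threshold, polarity, weight alpha)."""
    m, H = X.shape
    D = np.full(m, 1.0 / m)                          # step 4.2.1: initial map weights
    strong = []
    for _ in range(F):
        best = None                                  # (error, j, theta, polarity)
        for j in range(H):                           # step 4.2.2: try every feature column
            thresholds = np.quantile(X[:, j], np.linspace(0.05, 0.95, n_thresholds))
            for theta in thresholds:
                for polarity in (1, -1):
                    err = float(np.sum(D * (stump_predict(X[:, j], theta, polarity) != y)))
                    if best is None or err < best[0]:
                        best = (err, j, theta, polarity)
        err, j, theta, polarity = best
        err = min(max(err, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)      # step 4.2.3: weak-classifier weight
        pred = stump_predict(X[:, j], theta, polarity)
        D = D * np.exp(-alpha * y * pred)            # step 4.2.4: re-weight the integral maps
        D = D / D.sum()                              # Z_f normalization
        strong.append((j, theta, polarity, alpha))   # step 4.2.5: grow the strong classifier
    return strong

def strong_score(strong, x):
    """S(x) = sum_t alpha_t * h_t(x): weighted vote of the selected weak classifiers."""
    return sum(alpha * (1 if polarity * x[j] < polarity * theta else -1)
               for j, theta, polarity, alpha in strong)
```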
Step 5, use the speaker classifier pool obtained in step 4: extract 2D-Haar audio features from the voice file of the unknown speaker and perform speaker identification.
Step 5.1, extract audio feature integral maps from the voice file to be identified, obtaining the integral map sequence G' = {g'_1, g'_2, g'_3, ..., g'_{u'}}. The concrete method is the same as in step 2; in particular, in the feature map sequence conversion (corresponding to step 2.2) the window length a and step s take the same values as in step 2. Similarly, for a voice file to be identified that contains c' frames, the number of feature maps u' in the sequence is determined by c', a and s.
Step 5.2, on the basis of step 5.1 and following the 2D-Haar audio feature extraction method of step 4.1, extract 2D-Haar audio features for every feature map in the sequence, forming the 2D-Haar audio feature matrix X'.
Step 5.3, feed the 2D-Haar audio feature matrix X' obtained in step 5.2 into every classifier in the speaker classifier pool W, obtaining the classification result sequence R.
The classification result sequence R consists of u' elements, each of which is computed as follows:
Step 5.3.1, according to formula (1) in step 4.2.6, read a weak classifier h_t(x, l) of the speaker classifier and its corresponding 2D-Haar audio feature f_j(x);
Step 5.3.2, for each candidate label (i.e., k or "other"), compute the output h_t(f_j(x), l) of this weak classifier and add this output, multiplied by the weight α_t in the classifier, to the weighted score S_{l_i} of the candidate label l_i;
Step 5.3.3, after F rounds of the loop of steps 5.3.1-5.3.2, each candidate label l_i has a weighted score S_{l_i}. Select the largest score S_{l_i} and record the corresponding candidate label l_i as the classification result of this audio feature map, denoted (l_i, S_{l_i}), where l_i is the speaker label and S_{l_i} is the corresponding strong-classifier weighted sum.
Step 5.3.4, combine the classification results of all feature maps of the audio to be identified into the classification result sequence
R = {(l_i, S_{l_i}, u') : (l_1, S_{l_1}, 1), (l_1, S_{l_1}, 2), (l_2, S_{l_2}, 3), ..., (l_i, S_{l_i}, u')}.
Step 5.4, synthesize the classification result sequence obtained in step 5.3 to obtain the final speaker identification result.
The concrete method is: accumulate the strong-classifier scores in the result sequence by speaker label l_i, and output the speaker label with the largest weighted sum as the final identification result for this segment of speech.
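Steps 5.3-5.4 can be sketched as follows, reusing strong_score from the training sketch after step 4.2.7; the per-map attribution and the final weight summation follow the description above, but the function names are illustrative:

```python
def identify(feature_maps_X, classifier_pool):
    """classifier_pool: dict mapping speaker label -> trained strong classifier (see above).
    feature_maps_X: rows of X', one 2D-Haar feature vector per audio feature map."""
    totals = {}
    for x in feature_maps_X:
        # Step 5.3: score this map against every speaker's one-vs-rest classifier
        scores = {label: strong_score(strong, x) for label, strong in classifier_pool.items()}
        best_label = max(scores, key=scores.get)          # classification result for this map
        totals[best_label] = totals.get(best_label, 0.0) + scores[best_label]
    # Step 5.4: the label with the largest accumulated weight is the final result
    return max(totals, key=totals.get)
```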
Beneficial effects
Compared with characteristic-parameter extraction methods based on low-level acoustics such as Mel-frequency cepstral coefficients (MFCC) or linear predictive cepstral coefficients (LPCC), the 2D-Haar audio feature extraction method proposed by the present invention introduces a certain amount of temporal-relationship information and expands the audio feature space to hundreds of thousands of dimensions, providing the identification algorithm with a much larger feature space.
Compared with speaker classification methods such as GMM and SVM, the present invention uses the AdaBoost.MH algorithm together with decision-stump weak classifiers that each take a single feature as input to perform feature screening. This both improves the representativeness and discriminability of the feature vector and reduces the computational burden of the identification stage, so the method runs faster. Combining 2D-Haar audio features with the AdaBoost.MH algorithm enables accurate large-scale speaker identification and therefore has high practical value.
Brief description of the drawings
Fig. 1 is the overall block diagram of the present invention;
Fig. 2 is a schematic diagram of the audio feature map and feature map sequence extraction proposed by the present invention;
Fig. 3 is a schematic diagram of the speaker enrollment process of the present invention;
Fig. 4 is a schematic diagram of the speaker identification process of the present invention;
Fig. 5 shows the 5 classes of 2D-Haar audio features used for speaker training and identification in the embodiment;
Fig. 6 shows the performance comparison between the present invention and the GMM-UBM algorithm on the TIMIT speech corpus in the embodiment.
Embodiment
In order to better explain the objects and advantages of the present invention, the method of the invention is described in further detail below with reference to the drawings and embodiments.
All of the following tests were carried out on the same computer, configured as follows: Intel dual-core CPU (1.8 GHz), 1 GB RAM, Windows XP SP3 operating system.
Part 1
This part uses voice files from the TIMIT speech corpus to describe in detail the speaker enrollment/training and speaker identification procedures of the present invention for a target speaker population of 600 people.
The TIMIT corpus is a standard corpus produced jointly by MIT, SRI International and Texas Instruments; it contains utterances from 630 speakers (438 male and 192 female), 10 utterances per person.
All speech data of 600 people are selected at random from the speakers; from each person's 10 utterances, one file longer than 5 seconds is chosen as that speaker's enrollment/training voice file; in addition, one utterance of one person is selected at random as the voice file to be identified.
The concrete implementation steps are as follows:
Step 1, obtain the voice signals of the speakers to be identified (i.e., the target speakers) and form the base speech library S.
Because the TIMIT corpus already stores complete audio files, the voice files of the 600 target speakers directly form the base speech library S = {s_1, s_2, s_3, ..., s_k}, where k = 600 is the total number of target speakers.
Step 2, compute audio feature integral maps for the voices in the base speech library S, forming the base feature library R. The detailed process is as follows:
Step 2.1, for the k-th target speaker, split its audio file s_k into frames and extract the base audio features of each frame (in this embodiment MFCC, LPCC and PLPC are used), then combine the base features of all frames into a base feature file v_k containing c frames with a p-dimensional feature vector per frame.
In this embodiment, the feature vector of each frame in v_k is {[MFCC (12 dims)], [LPCC (12 dims)], [PLPC (8 dims)]}; the frame length is set to f_s = 30 ms and the frame shift to Δf_s = 20 ms, so
p = \sum_{i=1}^{n} p_i = 12 + 12 + 8 = 32 .
Step 2.2, for the base feature file v_k of the k-th target speaker, use a sliding window with window length a and step s to convert all c frames of audio feature vectors into an audio feature map sequence file G_k (see Fig. 2). In this embodiment a = 32 and s = 16.
G_k = {g_1, g_2, g_3, ..., g_{u_k}}, where u_k is determined by c, a and s.
Step 2.3, on the basis of step 2.2, compute for the k-th target speaker the feature integral map r_u of every feature map g_u in the feature map sequence file G_k, forming that speaker's feature integral map sequence file R_k = {r_1, r_2, r_3, ..., r_u}; the feature integral map sequence files of all 600 target speakers in the base speech library S are gathered to form the base feature library R = {R_1, R_2, ..., R_k}.
It is easy to see that the total number m of feature integral maps of all speakers in the base feature library is m = \sum_k u_k.
In this embodiment, the total duration of all 600 audio files is 3630.50 s, which yields m = 22690 feature integral maps in total.
The feature integral map has the same size as the original feature map, and the value at any point (x, y) on it is defined as the sum of all feature values at the corresponding point (x', y') of the original map and to its upper left, i.e.:
ii(x, y) = \sum_{x' \le x,\, y' \le y} i(x', y'),
where ii(x, y) is the value of point (x, y) on the integral map and i(x', y') is the feature value of point (x', y') on the original feature map.
Step 3, on the basis of the base feature library R, generate the training feature file set B for each target speaker. The detailed process is as follows:
Step 3.1, label the feature files in the base feature library R. The concrete method is:
Use consecutive integer numbers as speaker labels to represent the different target speakers, which is convenient for computer processing. The labeled form is R' = {(R_1, 1), (R_2, 2), ..., (R_600, 600)}, where Y = {1, 2, ..., 600} is the target speaker label set;
Step 3.2, on the basis of step 3.1, build for each target speaker the labeled file set B used for speaker enrollment. The concrete method is:
Perform 600 sorting passes over the speaker-labeled feature library R'. In the k-th pass, first take the audio feature file R_k of the k-th target speaker as the positive sample and keep its speaker label k; then treat the audio feature files of the remaining speakers as negative samples and change their speaker labels to "other"; finally store these 600 audio feature files in a separate folder named B_k, i.e.:
B_1 = {(R_1, 1), (R_2, other), ..., (R_600, other)},
B_2 = {(R_1, other), (R_2, 2), ..., (R_600, other)},
......
B_600 = {(R_1, other), (R_2, other), ..., (R_600, 600)}
After the 600 sorting passes, the labeled file set B = {B_1, B_2, ..., B_600}, consisting of 600 labeled folders, is obtained.
Step 4, on the basis of step 3, extract 2D-Haar audio features and perform speaker enrollment: traverse the 600 folders in the labeled file set B in turn and use the training feature files in each to train an independent "one-vs-rest" classifier for each target speaker.
For the k-th target speaker, the corresponding classifier W_k is trained as follows:
Step 4.1, perform 2D-Haar audio feature extraction on every integral map of all feature integral map sequence files R_k in the labeled folder B_k formed in step 3.2.
From each integral map compute the corresponding H-dimensional 2D-Haar audio feature values, obtaining the data set S = {(x_1, l_1), ..., (x_m, l_m)} used to train the speaker classifier, where x_i denotes the full H-dimensional 2D-Haar audio feature vector of the i-th integral map and l_i ∈ Y (Y = {1, 2, ..., k}) denotes the speaker label of the i-th integral map.
Fig. 5 illustrates the computation patterns of the 5 classes of 2D-Haar audio features used in this embodiment. Each 2D-Haar feature value is obtained, within a rectangular region of arbitrary size and position on the original audio feature map, by subtracting the sum of feature values in the white region from the sum in the black region according to one of the patterns in Fig. 5. These features have the following three properties:
(1) Fast computation. With the integral map, extracting a 2D-Haar audio feature of any size requires only a fixed number of reads and additions/subtractions: a 2-rectangle feature needs only 6 points read from the integral map, a 3-rectangle feature 8 points, and a 4-rectangle feature 9 points.
(2) Strong discriminability. The 2D-Haar audio feature space is of very high dimension: with the 5 pattern classes used in this embodiment, a 32 × 32 integral map yields more than 510,000 2D-Haar audio features in total; the exact counts are given in Table 2.
Table 2. Numbers of the 5 classes of 2D-Haar audio features for a 32 × 32 integral map
This dimension far exceeds that of the raw audio FFT energy spectrum and also far exceeds the dimension of the feature space after an SVM nonlinear mapping. In addition, because each audio feature map is composed of a number of consecutive audio frames, 2D-Haar audio features also capture a certain amount of temporal information.
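Fig. 5 is not reproduced here, so the exact pattern set is an assumption; if the five classes are taken to be the classic Haar patterns (two 2-rectangle, two 3-rectangle and one 4-rectangle types), a short enumeration reproduces the total quoted in the text:

```python
def haar_feature_count(n, patterns=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    """Count all positions and scales of the assumed 5 Haar patterns in an n x n feature map."""
    total = 0
    for w, h in patterns:                      # base size of one pattern (in cells)
        for width in range(w, n + 1, w):       # scaled pattern width
            for height in range(h, n + 1, h):  # scaled pattern height
                total += (n - width + 1) * (n - height + 1)
    return total

print(haar_feature_count(32))                  # -> 510112 for a 32 x 32 feature map
```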
In this embodiment, the concrete 2D-Haar audio feature extraction method is: first, from each integral map and following the method above, compute all 510,112 2D-Haar audio feature values, obtaining the 2D-Haar feature value set; then record the 510,112-dimensional 2D-Haar feature vector of each integral map as one row, so that the vectors of all m integral maps in the labeled folder B_k form a feature matrix X with m rows and 510,112 columns. As noted in step 2.3, in this embodiment m = 22690.
Step 4.2, use the AdaBoost.MH method to perform feature screening and classifier training on the 2D-Haar audio feature matrix X obtained in step 4.1, yielding a speaker classifier. The basic principle of AdaBoost.MH is: over F rounds of iteration, select F principal feature dimensions from the 510,112-dimensional 2D-Haar feature value set while training F weak classifiers, which are combined into one strong classifier.
In this embodiment, F = 400.
The weak classifier used in this iterative computation is defined as
h_j(x, y) = \begin{cases} 1, & p_{j,y}\, x_j < p_{j,y}\, \theta_{j,y} \\ -1, & p_{j,y}\, x_j \ge p_{j,y}\, \theta_{j,y} \end{cases} \qquad (2)
where x_j denotes the input of the weak classifier (the j-th feature value), θ_{j,y} is the threshold obtained by training, and p_{j,y} indicates the direction of the inequality.
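A direct, illustrative transcription of eq. (2) as a decision stump (scalar inputs are assumed):

```python
def weak_classifier(x_j, theta, polarity):
    """Eq. (2): output +1 when polarity * x_j < polarity * theta, else -1.
    theta and polarity are the threshold and inequality direction learned during training."""
    return 1 if polarity * x_j < polarity * theta else -1
```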
The concrete AdaBoost.MH training procedure is as follows:
Step 4.2.1, initialize the weight of each integral map as D_1(i, l_i) = 1/(mk), i = 1 ... m, l_i ∈ Y.
Step 4.2.2, take each column of the feature matrix X in turn (i.e., each of the 510,112 groups of same-dimension features of all integral maps) as the input of one weak classifier, run 510,112 rounds of computation, and compute r_{f,j} according to
r_{f,j} = \sum_{(i,l)} D_f(i, l_i)\, K_i[l_i]\, h_j(x_i, l_i), \quad j = 1 \ldots 510112
where h_j(x_i, l_i) denotes the weak classifier whose input is the j-th feature value extracted from the i-th integral map, D_f(i, l_i) denotes the weight of the i-th training integral map in the f-th iteration, and K_i[l_i] = +1 if l_i ∈ {1, ..., k} and −1 otherwise.
From the above 510,112 weak classifiers select the h_j(x, l_i) that maximizes r_f = max(r_{f,j}); take the feature f_j(x) corresponding to this classifier as the chosen feature dimension and, at the same time, add this weak classifier, denoted h_t(x, l), to the strong classifier. Here f_j(x) denotes the j-th dimension of the 510,112-dimensional 2D-Haar feature vector (i.e., the j-th column of the feature matrix X), and h_j(x, l) denotes the weak classifier that takes the j-th feature value as input;
Step 4.2.3, compute the weight α_f of the weak classifier h_j(x, l) selected in step 4.2.2:
\alpha_f = \frac{1}{2} \ln\!\left(\frac{1 + r_f}{1 - r_f}\right);
Step 4.2.4, compute the weight D_{f+1} of each integral map for the next iteration:
D_{f+1}(i, l_i) = \frac{D_f(i, l_i) \exp\!\left(-\alpha_f K_i[l_i] h_f(x_i, l_i)\right)}{Z_f}, \quad i = 1 \ldots m,
where h_f(x_i, l_i) denotes the weak classifier selected in iteration f, taking the j-th feature value of the i-th integral map as input, and Z_f is the normalization factor
Z_f = \sum_{i,l} D_f(i, l_i) \exp\!\left(-\alpha_f K_i[l_i] h_f(x_i, l_i)\right), \quad i = 1 \ldots m.
Step 4.2.5, substitute the new weights obtained in step 4.2.4 back into step 4.2.2 and, following steps 4.2.2 to 4.2.4, choose a new feature dimension, obtaining a new weak classifier that is added to the strong classifier;
Step 4.2.6, iterate the procedure of steps 4.2.2 to 4.2.5 400 times to obtain a strong classifier composed of 400 weak classifiers, i.e., the identification classifier of the k-th speaker, expressed as
W_k(x) = \arg\max_{l} S_l, \quad S_l = \sum_{t=1}^{F} \alpha_t h_t(x, l) \qquad (1)
Step 4.2.7, after all 600 rounds of training are finished, gather all 600 speaker classifiers into the speaker classifier pool W = {W_1(x), W_2(x), ..., W_600(x)}.
Step 5, use the speaker classifier pool obtained in step 4: extract 2D-Haar audio features from the voice file of the unknown speaker and perform speaker identification.
Step 5.1, extract audio feature integral maps from the voice file to be identified, obtaining the integral map sequence G' = {g'_1, g'_2, g'_3, ..., g'_{u'}}. The concrete method is the same as in step 2. In the extraction of the base feature file (corresponding to step 2.1), the frame length is f_s = 30 ms and the frame shift is Δf_s = 20 ms; in the feature map sequence conversion (corresponding to step 2.2), the window length is a = 32 and the step is s = 16. In this embodiment the total duration of the file to be identified is 6.30 s, and the per-frame dimension is again
p = \sum_{i=1}^{n} p_i = 12 + 12 + 8 = 32 .
Similarly, the total number of frames c' is determined by the length of the voice file to be identified, and the number of feature maps u' in the sequence is determined by c', a and s.
Step 5.2, on the basis of step 5.1 and following the 2D-Haar audio feature extraction method of step 4.1, extract 2D-Haar audio features for every feature map in the sequence, forming the 2D-Haar audio feature matrix X' with 39 rows (u' = 39) and 510,112 columns.
Step 5.3, feed the 2D-Haar audio feature matrix X' obtained in step 5.2 into every classifier in the speaker classifier pool W, obtaining the classification result sequence R.
The classification result sequence R consists of u' elements, each of which is computed as follows:
Step 5.3.1, according to formula (1) in step 4.2.6, read a weak classifier h_t(x, l) of the speaker classifier and its corresponding 2D-Haar audio feature f_j(x);
Step 5.3.2, for each candidate label (i.e., k or "other"), compute the output h_t(f_j(x), l) of this weak classifier and add this output, multiplied by the weight α_t in the classifier, to the weighted score S_{l_i} of the candidate label l_i;
Step 5.3.3, after 400 rounds of the loop of steps 5.3.1-5.3.2, each candidate label l_i has a weighted score S_{l_i}. Select the largest score S_{l_i} and record the corresponding candidate label l_i as the classification result of this audio feature map, denoted (l_i, S_{l_i}), where l_i is the speaker label and S_{l_i} is the corresponding strong-classifier weighted sum.
Step 5.3.4, combine the classification results of all feature maps of the audio to be identified into the classification result sequence
R = {(l_i, S_{l_i}, u') : (l_1, S_{l_1}, 1), (l_1, S_{l_1}, 2), (l_2, S_{l_2}, 3), ..., (l_i, S_{l_i}, u')}.
Step 5.4, synthesize the classification result sequence obtained in step 5.3 to obtain the final speaker identification result.
The concrete method is: accumulate the strong-classifier scores in the result sequence by speaker label l_i, and output the speaker label with the largest weighted sum as the final identification result for this segment of speech.
Part 2
This part tests the performance of the present invention. The test platform and the speaker enrollment/training and identification procedures are the same as in Part 1 and are not repeated here; the focus is on the method and results of the performance test.
The experimental data were generated as follows: (1) all speech data of 100, 200, 300, 400, 500 and 600 people were selected at random from the speakers; (2) from each person's utterances, 7 were chosen as training data and 3 as target test data; (3) for each target speaker, 50 utterances of other people were selected at random as impostor test data.
For comparison, the GMM-UBM method was used as the baseline. For each target speaker, 3 target tests and 50 impostor tests were carried out; the false acceptance rate (False Acceptance Rate, FAR) and false rejection rate (False Rejection Rate, FRR) of both methods were recorded, DET curves were drawn, and accuracy and identification time were measured, where
accuracy = 1 − equal error rate (EER).
When the speaker population grows from 100 to 600, the performance of the two methods is as shown in Fig. 6 and Table 3. It can be seen that as the population grows the identification accuracy of the baseline drops noticeably, whereas the decline of the proposed method is much slower: at a scale of 600 speakers its accuracy exceeds that of the baseline by 4.3%, and its average identification accuracy over the 6 population sizes reaches 91.3%.
Table 3. Accuracy (%) of the two methods at different speaker population sizes
To evaluate the time efficiency of the proposed algorithm, the average identification time t_p per second of speech was measured for different numbers F of 2D-Haar feature dimensions. As Table 4 shows, the proposed method identifies speakers quickly.
Table 4. Average identification time of the proposed method for different values of F
These experiments show that, while introducing temporal information, the 2D-Haar audio feature effectively expands the dimension of the feature space, making it possible to train better-performing classifiers; at the same time, using the AdaBoost.MH algorithm with single-feature decision-stump weak classifiers for feature screening both improves the representativeness and discriminability of the feature vector and reduces the computational burden of the identification stage, so identification is fast. Combining 2D-Haar audio features with the AdaBoost.MH algorithm enables accurate large-scale speaker identification.

Claims (7)

1. A large-scale speaker identification method, characterized in that the method comprises the following steps:
Step 1, obtain the voice signals of the speakers to be identified and form the base speech library S;
Step 2, compute audio feature integral maps for the voices in the base speech library S, forming the base feature library R, the computation of the audio feature integral maps specifically comprising:
Step 2.1, for the k-th speaker to be identified, split its audio file s_k into frames, the frame length f_s and frame shift Δf_s being set by the user, extract the base audio features of each frame, and combine the base features of all frames into a base feature file v_k containing c frames with a p-dimensional feature vector per frame,
the feature vector of each frame in v_k being: {[base feature 1 (p_1 dims)], [base feature 2 (p_2 dims)], ..., [base feature n (p_n dims)]},
Step 2.2, for the base feature file v_k of the k-th speaker to be identified, use a sliding window with window length a and step s to convert all c frames of audio feature vectors into an audio feature map sequence file G_k,
G_k = {g_1, g_2, g_3, ..., g_u}, where u, the number of feature maps, is determined by c, a and s,
Step 2.3, on the basis of step 2.2, compute for the k-th speaker to be identified the feature integral map r_u of every audio feature map g_u in the audio feature map sequence file G_k, forming that speaker's feature integral map sequence file R_k = {r_1, r_2, r_3, ..., r_u}; gather the feature integral map sequence files of all k speakers to be identified in the base speech library S into the base feature library R = {R_1, R_2, ..., R_k},
the feature integral map having the same size as the original audio feature map, the value at any point (x, y) on it being defined as the sum of all feature values at the corresponding point (x', y') of the original audio feature map and to its upper left, i.e.:
ii(x, y) = \sum_{x' \le x,\, y' \le y} i(x', y'),
where ii(x, y) denotes the value of point (x, y) on the integral map and i(x', y') denotes the feature value of point (x', y') on the original audio feature map;
Step 3, on the basis of the base feature library R, generate the training feature file set B for each speaker to be identified;
Step 4, on the basis of step 3, extract 2D-Haar audio features and perform speaker enrollment: traverse the k folders in the labeled file set B in turn and use the training feature files in each to train an independent "one-vs-rest" classifier for each speaker to be identified, finally obtaining a classifier pool composed of k speaker classifiers, the computation of the extracted 2D-Haar audio features being:
each 2D-Haar audio feature value is obtained, within a rectangular region of arbitrary size and position on the original audio feature map, by subtracting the sum of feature values in one specific rectangular sub-region from the sum in another, computed quickly by means of the integral map, its total dimension H being determined by the 2D-Haar feature types adopted and the size of the integral map;
the H-dimensional 2D-Haar feature vector of each integral map is recorded as one row, so that the vectors of all m audio feature integral maps in the labeled folder B_k form a feature matrix X with m rows and H columns;
Step 5, use the speaker classifier pool obtained in step 4: extract 2D-Haar audio features from the voice file of the unknown speaker and finally perform speaker identification.
2. The method according to claim 1, characterized in that the acquisition of the voice signals of the speakers to be identified does not require the speakers to pronounce according to any text content preset in a feature template.
3. The method according to claim 1, characterized in that the classifier pool composed of k speaker classifiers is obtained through k rounds of training; each round consists of F iterations that select F principal feature dimensions from the H-dimensional 2D-Haar audio feature value set while training F weak classifiers, which are combined into one strong classifier; the concrete method is:
Step 1, initialize the weight of each integral map as D_1(i, l_i) = 1/(mk), i = 1 ... m, l_i ∈ Y, where l_i denotes the speaker label of the i-th integral map, Y = {1, 2, ..., k} is the target speaker label set, k is the number of target speakers, and m is the number of audio feature integral maps;
Step 2, take each column of the feature matrix X, i.e., each of the H groups of same-dimension features of all integral maps, in turn as the input of one weak classifier, run H rounds of computation, and compute r_{f,j} according to
r_{f,j} = \sum_{(i,l)} D_f(i, l_i)\, K_i[l_i]\, h_j(x_i, l_i), \quad j = 1 \ldots H
where h_j(x_i, l_i) denotes the weak classifier whose input is the j-th feature value extracted from the i-th integral map, D_f(i, l_i) denotes the weight of the i-th training integral map in the f-th iteration, and K_i[l_i] = +1 if l_i ∈ {1, ..., k} and −1 otherwise,
from the above H weak classifiers select the h_j(x, l_i) that maximizes r_f = max(r_{f,j}); take the feature f_j(x) corresponding to this classifier as the chosen feature dimension and, at the same time, add this weak classifier, denoted h_t(x, l), to the strong classifier, where f_j(x) denotes the j-th dimension of the H-dimensional 2D-Haar feature vector and h_j(x, l) denotes the weak classifier that takes the j-th feature value as input;
Step 3, compute the weight α_f of the weak classifier h_j(x, l) selected in step 2:
\alpha_f = \frac{1}{2} \ln\!\left(\frac{1 + r_f}{1 - r_f}\right);
Step 4, compute the weight D_{f+1} of each integral map for the next iteration:
D_{f+1}(i, l_i) = \frac{D_f(i, l_i) \exp\!\left(-\alpha_f K_i[l_i] h_f(x_i, l_i)\right)}{Z_f}, \quad i = 1 \ldots m,
where h_f(x_i, l_i) denotes the weak classifier selected in iteration f, taking the j-th feature value of the i-th integral map as input, and Z_f is the normalization factor
Z_f = \sum_{i,l} D_f(i, l_i) \exp\!\left(-\alpha_f K_i[l_i] h_f(x_i, l_i)\right), \quad i = 1 \ldots m;
Step 5, substitute the new weights obtained in step 4 back into step 2 and, following steps 2 to 4, choose a new feature dimension, obtaining a new weak classifier that is added to the strong classifier;
Step 6, iterate the procedure of steps 2 to 5 F times to obtain a strong classifier composed of F weak classifiers, i.e., the identification classifier of the k-th speaker, expressed as
W_k(x) = \arg\max_{l} S_l, \quad S_l = \sum_{t=1}^{F} \alpha_t h_t(x, l) \qquad (1).
4. The method according to claim 3, characterized in that the weak classifiers used in the iterative computation must satisfy the following conditions: (1) the input of a weak classifier is a one-dimensional feature value, namely a specific dimension of the feature vector, i.e., a specific column of the feature matrix X; (2) for the speaker label l_i to be identified, the output of the weak classifier is 1 or -1.
5. The method according to claim 1, characterized in that the concrete steps of the speaker identification are:
Step 1, extract audio feature integral maps from the voice file to be identified, obtaining the integral map sequence G' = {g'_1, g'_2, g'_3, ..., g'_{u'}}, where u' denotes the number of audio feature integral maps in the sequence; for a voice file to be identified that contains c' frames, the number u' of feature integral maps in the sequence is determined by c', the window length a set when generating the audio feature maps, and the step s by which the sliding window is moved in the same process;
Step 2, on the basis of step 1, extract 2D-Haar audio features for every audio feature map in the audio feature map sequence, forming the 2D-Haar audio feature matrix X';
Step 3, feed the 2D-Haar audio feature matrix X' obtained in step 2 into every classifier of the speaker classifier pool W simultaneously, obtaining the classification result sequence R;
Step 4, synthesize the classification result sequence obtained in step 3 to obtain the final speaker identification result.
6. The method according to claim 5, characterized in that the classification result sequence R consists of u' elements, each of which is computed as follows:
Step 1, according to formula (1) in step 6 of claim 3, read a weak classifier h_t(x, l) of the speaker classifier and its corresponding 2D-Haar audio feature f_j(x);
Step 2, for each candidate label k or "other", compute the output h_t(f_j(x), l) of the weak classifier and add this output, multiplied by the weight α_t in the classifier, to the weighted score S_{l_i} of the candidate label l_i;
Step 3, after F rounds of the loop of steps 1-2, each candidate label l_i has a weighted score S_{l_i}; select the largest score S_{l_i} and record the corresponding candidate label l_i as the classification result of this audio feature map, denoted (l_i, S_{l_i}), where l_i is the speaker label and S_{l_i} is the corresponding strong-classifier weighted sum;
Step 4, combine the classification results of all feature maps of the audio to be identified into the classification result sequence
R = {(l_i, S_{l_i}, u') : (l_1, S_{l_1}, 1), (l_1, S_{l_1}, 2), (l_2, S_{l_2}, 3), ..., (l_i, S_{l_i}, u')}.
7. The method according to claim 5, characterized in that the computation of the "result synthesis" step is:
accumulate the strong-classifier scores in the result sequence by speaker label l_i, and output the speaker label with the largest weighted sum as the final identification result for this segment of speech.
CN201310074743.9A 2013-03-08 2013-03-08 A large-scale speaker identification method Expired - Fee Related CN103258536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310074743.9A CN103258536B (en) 2013-03-08 2013-03-08 A large-scale speaker identification method


Publications (2)

Publication Number Publication Date
CN103258536A CN103258536A (en) 2013-08-21
CN103258536B true CN103258536B (en) 2015-10-21

Family

ID=48962410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310074743.9A Expired - Fee Related CN103258536B (en) 2013-03-08 2013-03-08 A large-scale speaker identification method

Country Status (1)

Country Link
CN (1) CN103258536B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106448682A (en) * 2016-09-13 2017-02-22 Tcl集团股份有限公司 Open-set speaker recognition method and apparatus
CN107393527A (en) * 2017-07-17 2017-11-24 广东讯飞启明科技发展有限公司 The determination methods of speaker's number
CN108309303B (en) * 2017-12-26 2021-01-08 上海交通大学医学院附属第九人民医院 Wearable intelligent monitoring of gait that freezes and helps capable equipment
CN108962231B (en) * 2018-07-04 2021-05-28 武汉斗鱼网络科技有限公司 Voice classification method, device, server and storage medium
CN110134819B (en) * 2019-04-25 2021-04-23 广州智伴人工智能科技有限公司 Voice audio screening system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6539352B1 (en) * 1996-11-22 2003-03-25 Manish Sharma Subword-based speaker verification with multiple-classifier score fusion weight and threshold adaptation
CN101770774A (en) * 2009-12-31 2010-07-07 吉林大学 Embedded-based open set speaker recognition method and system thereof


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Hierarchical speaker identification based on Haar wavelets; Fan Xiaochun, Qiu Zhengquan; Computer Engineering and Applications; 2010-12-31; pp. 122-124 *
Turbo-Boost facial expression recognition algorithm based on Haar features; Xie Erman, Luo Senlin, Pan Limin; Journal of Computer-Aided Design & Computer Graphics; 2011-12-31; pp. 1442-1446 *
HOCOR and improved MCE in speaker recognition; Fan Xiaochun, Qiu Zhengquan; Science Technology and Engineering; 2008-12-31; full text *

Also Published As

Publication number Publication date
CN103258536A (en) 2013-08-21


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20151021

Termination date: 20160308