CN103258536A - Large-scale speaker identification method - Google Patents

Large-scale speaker identification method

Info

Publication number
CN103258536A
Authority
CN
China
Prior art keywords
speaker
audio features
2D-Haar
integral image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013100747439A
Other languages
Chinese (zh)
Other versions
CN103258536B (en)
Inventor
罗森林
谢尔曼
潘丽敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201310074743.9A priority Critical patent/CN103258536B/en
Publication of CN103258536A publication Critical patent/CN103258536A/en
Application granted granted Critical
Publication of CN103258536B publication Critical patent/CN103258536B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention relates to a text-independent speaker identification method based on 2D-Haar audio features and suitable for large-scale speaker sets. The invention provides the concept of the 2D-Haar audio feature and a method for computing it: basic audio features are first assembled into audio feature maps; 2D-Haar audio features are then extracted from these maps; the AdaBoost.MH algorithm is then used to screen the 2D-Haar audio features and train the speaker classifiers; finally, the trained speaker classifiers are used to identify speakers. Compared with the prior art, the method effectively suppresses the decay of identification accuracy in large-scale speaker identification scenarios and offers high identification accuracy and speed. The method can be applied not only on desktop computers but also on mobile computing platforms such as cell phones and tablets.

Description

A large-scale speaker identification method
Technical field
The present invention relates to a text-independent speaker identification method applicable to large-scale speaker sets, belonging to the technical field of biometric identification; from the standpoint of its technical realization it also belongs to the fields of computer science and speech processing.
Background technology
Speaker identification (Speaker Identification) is an important branch of speaker recognition (Speaker Recognition, SR). It uses the characteristics of each speaker's voice signal to extract speaker information from a segment of speech and then determine which of several people uttered that speech; it is thus a "choose one of many" pattern recognition problem. With the rapid development of modern electronic technology in recent years, the application demand for speaker identification has kept growing (for example in court forensics, suspect voice tracking and localization, and speech retrieval), and the technology has attracted increasing attention for its unique convenience, economy and accuracy.
According to the type of spoken content, speaker identification can be divided into two broad classes: text-dependent (Text-dependent) and text-independent (Text-independent). A text-dependent speaker recognition system requires the user to pronounce prescribed content, so that an accurate identification model can be built for each person one by one, and the prescribed content must also be spoken during identification. A text-independent recognition system does not prescribe the speaker's content; its models are relatively harder to build, but its range of application is wider. In some cases people cannot (or do not wish to) force a speaker to read a specific passage aloud, and in such application scenarios a text-independent speaker identification method becomes especially important.
The basic techniques of text-independent speaker identification can be divided into three classes: voice acquisition, feature extraction, and classification; the key issues are feature extraction and classification.
For feature extraction, current mainstream approaches adopt Mel-frequency cepstral coefficients (MFCC) or linear predictive cepstral coefficients (Linear Predictive Coding Cepstrum, LPCC), both based on low-level acoustic principles, as the characteristic parameters.
For classification, mainstream methods fall into three classes: template matching methods (dynamic time warping (DTW), vector quantization (VQ)), probabilistic methods (hidden Markov models (HMM), Gaussian mixture models (GMM)), and discriminative classifier algorithms (artificial neural networks (ANN), support vector machines (SVM)). The Gaussian mixture model (GMM) and support vector machine (SVM) methods are the most widely used at present. Among them, the GMM-UBM model has seen wide application; the SVM method is closely related to GMM-UBM, and the feature supervectors used by current mainstream SVM systems are generally produced by a GMM.
Based on the above methods, text-independent speaker identification has found practical application in some settings. However, as the number of speakers to be identified keeps increasing, the accuracy of these methods drops noticeably, and once the population reaches a certain scale it becomes difficult to meet the demands of practical applications. This is the major problem that text-independent speaker identification technology needs to solve.
Summary of the invention
The objective of the invention is to innovate at both the feature extraction and the classification level, and to propose a large-scale speaker identification method that still achieves high accuracy when the number of speakers to be identified is large.
The design concept of the present invention is: propose a 2D-Haar audio feature extraction method that introduces a certain amount of temporal-sequence information and extends the audio feature space to hundreds of thousands of dimensions, providing a much larger feature space for the identification algorithm; at the same time, use the AdaBoost.MH algorithm to screen representative feature combinations from this feature space and build the identification classifier for each target speaker. The present invention further improves accuracy without increasing training and identification time, and is thus both fast and accurate.
The technical scheme of the present invention is realized as follows:
Step 1: obtain the voice signals of the speakers to be identified (i.e. the target speakers) and form the basic speech library S.
The concrete method is: connect a microphone to a computer, acquire each target speaker's voice signal, and store it on the computer as an audio file, one audio file per target speaker, forming the basic speech library S = {s_1, s_2, s_3, …, s_k}, where k is the total number of target speakers.
Step 2: compute audio feature integral maps for the speech in the basic speech library S, forming the basic feature library R. The detailed process is as follows:
Step 2.1: for the k-th target speaker, divide the audio file s_k into frames (the frame length f_s and frame shift Δf_s are set by the user), extract the basic audio features of each frame (e.g. MFCC, LPCC, sub-band energy, etc.), and combine the basic audio features of all frames into a basic feature file v_k containing c frames with a p-dimensional feature vector per frame.
The feature vector of each frame in v_k is: {[basic feature 1 (p_1 dims)], [basic feature 2 (p_2 dims)], …, [basic feature n (p_n dims)]}.
In the above, for an audio file s_k of duration t, the frame count c follows from t, the frame length f_s and the frame shift Δf_s [formula image], and
$$p = \sum_{1}^{n} p_n .$$
Step 2.2: for the k-th target speaker's basic feature file v_k, use a sliding window with window length a and step s to convert all c frame feature vectors into an audio feature map sequence file G_k (see Fig. 2).
G_k = {g_1, g_2, g_3, …, g_{u_k}}, where the number of feature maps u_k is determined by c, a and s [formula image].
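As an illustration only (it is not part of the patent text), the following Python sketch shows how a per-frame basic-feature matrix might be cut into fixed-size audio feature maps with the sliding window of step 2.2; the array shapes and parameter names are assumptions.

```python
import numpy as np

def feature_maps(frame_features: np.ndarray, a: int = 32, s: int = 16):
    """Cut a (c, p) per-frame feature matrix into (a, p) audio feature maps.

    frame_features: c frames x p-dimensional basic features (e.g. MFCC + LPCC).
    a: window length in frames, s: window step in frames.
    Returns the audio feature map sequence G_k as a list of (a, p) arrays.
    """
    c = frame_features.shape[0]
    maps = []
    for start in range(0, c - a + 1, s):      # slide the window over the frames
        maps.append(frame_features[start:start + a, :])
    return maps
```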
Step 2.3: on the basis of step 2.2, for the k-th target speaker, compute the feature integral map r_u of every feature map g_u in the feature map sequence file G_k, forming this speaker's feature integral map sequence file R_k = {r_1, r_2, r_3, …, r_u}; putting together the feature integral map sequence files of all k target speakers in the basic speech library S forms the basic feature library R = {R_1, R_2, …, R_k}.
It follows that the total number m of feature integral maps of all speakers in the basic feature library is the sum of the numbers of feature maps u_k over all k speakers [formula image].
The feature integral map has the same size as the original feature map, and the value at any point (x, y) on it is defined as the sum of the feature values of the corresponding point (x', y') of the original map and of all points above and to its left. The definition is:
$$ii(x, y) = \sum_{x' \le x,\ y' \le y} i(x', y') ,$$
where ii(x, y) denotes the value of point (x, y) on the integral map and i(x', y') denotes the feature value of point (x', y') on the original feature map.
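A minimal Python sketch of the integral-map computation defined above; the use of NumPy cumulative sums is an implementation choice, not something the patent prescribes.

```python
import numpy as np

def integral_map(feature_map: np.ndarray) -> np.ndarray:
    """ii(x, y) = sum of feature values at and to the upper-left of (x, y)."""
    # cumulative sums along both axes give the integral map in one pass
    return feature_map.cumsum(axis=0).cumsum(axis=1)

# quick self-check against the definition ii(x, y) = sum_{x'<=x, y'<=y} i(x', y')
g = np.random.rand(32, 32)
ii = integral_map(g)
assert np.isclose(ii[10, 7], g[:11, :8].sum())
```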
Step 3: on the basis of the basic feature library R, generate each target speaker's training feature file set B. The detailed process is as follows:
Step 3.1: label the feature files in the basic feature library R. The concrete method is:
Use consecutive integer numbers as speaker labels to represent the different target speakers, for ease of computer processing. The final labeled form is R' = {(R_1, 1), (R_2, 2), …, (R_k, k)}, where Y = {1, 2, …, k} is the target speaker label set and k is the number of target speakers;
Step 3.2: on the basis of step 3.1, build for each target speaker the label file set B used for speaker registration. The concrete method is:
Perform k rounds of sorting over the speaker-labeled feature library R'. In each round, first take the k-th target speaker's audio feature file R_k as the positive sample, keeping its speaker label k; then take the remaining speakers' audio feature files as negative samples and change their speaker labels to "other"; finally store the above k audio feature files in a separate folder named B_k, that is:
B_1 = {(R_1, 1), (R_2, other), …, (R_k, other)},
B_2 = {(R_1, other), (R_2, 2), …, (R_k, other)},
……
B_k = {(R_1, other), (R_2, other), …, (R_k, k)}
After the k rounds of sorting, the final label file set B = {B_1, B_2, …, B_k}, consisting of k label folders, is formed.
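Purely for illustration, the one-vs-rest relabeling of step 3.2 could be expressed as follows; the in-memory list representation is an assumption (the patent stores the results as label folders).

```python
def build_label_sets(R, labels):
    """R: list of per-speaker feature sequences; labels: [1..k].
    Returns B, where B[k-1] keeps label k for speaker k and 'other' for the rest."""
    B = []
    for pos_label in labels:
        B_k = [(R_j, lab if lab == pos_label else "other")
               for R_j, lab in zip(R, labels)]
        B.append(B_k)
    return B

# usage (hypothetical data): B = build_label_sets([R1, R2, R3], labels=[1, 2, 3])
```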
Step 4: on the basis of step 3, extract the 2D-Haar audio features and perform speaker registration, i.e. traverse the k folders in the label file set B in turn and use the training feature files in each to train a separate "one-vs-rest" classifier for each target speaker, finally obtaining a classifier pool composed of k speaker classifiers.
For the k-th target speaker, the training process of its corresponding classifier W_k is as follows:
Step 4.1: perform 2D-Haar audio feature extraction on every integral map in all the feature integral map sequence files R_k in the label folder B_k formed in step 3.2. The concrete method is:
From each integral map, compute the corresponding H-dimensional 2D-Haar audio feature values (where H is determined by the 2D-Haar audio feature types adopted and the size of the integral map), obtaining the data set S = {(x_1, l_1), …, (x_m, l_m)} used to train the speaker classifier. Here x_i denotes the full H-dimensional 2D-Haar audio feature vector corresponding to the i-th integral map, and l_i ∈ Y (Y = {1, 2, …, k}) denotes the speaker label corresponding to the i-th integral map.
Each dimension of the H-dimensional 2D-Haar audio feature is computed on the original audio feature map, over a rectangular region of arbitrary size and position, by subtracting the sum of feature values in one specific rectangular sub-region from the sum of feature values in another; these values can be computed quickly from the integral map.
The H-dimensional 2D-Haar audio feature vector of each integral map is recorded as one row, so that the full H-dimensional 2D-Haar audio feature vectors of all m integral maps in the label folder B_k form a feature matrix X with m rows and H columns.
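The rectangle-difference computation described above can be sketched as follows (illustrative only; the layout shown is just the simplest two-rectangle pattern, and the helper names are assumptions):

```python
def rect_sum(ii, top, left, bottom, right):
    """Sum of original feature values in the inclusive rectangle, 4 lookups on the integral map ii."""
    s = ii[bottom, right]
    if top > 0:
        s -= ii[top - 1, right]
    if left > 0:
        s -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        s += ii[top - 1, left - 1]
    return s

def haar_two_rect_horizontal(ii, top, left, height, width):
    """One 2D-Haar value: sum over the left half minus sum over the right half
    of a (height x 2*width) region, computed from the integral map alone."""
    left_half = rect_sum(ii, top, left, top + height - 1, left + width - 1)
    right_half = rect_sum(ii, top, left + width, top + height - 1, left + 2 * width - 1)
    return left_half - right_half

# usage: ii = feature_map.cumsum(axis=0).cumsum(axis=1); v = haar_two_rect_horizontal(ii, 0, 0, 4, 3)
```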
Step 4.2: use the AdaBoost.MH method to perform feature screening and classifier training on the 2D-Haar audio feature matrix X obtained in step 4.1, obtaining the speaker classifier. The basic principle of the AdaBoost.MH method is: through F rounds of iteration, select F principal feature dimensions from the set of H-dimensional 2D-Haar audio feature values while training F weak classifiers, and combine them into a strong classifier.
The weak classifiers used in the above iterations must meet the following conditions: (1) the input of a weak classifier is a one-dimensional feature value (i.e. one specific dimension of the feature vector, or one column of the feature matrix X); (2) for a speaker label l_i to be identified, the output of the weak classifier is 1 or -1.
The concrete training process of AdaBoost.MH is:
Step 4.2.1: initialize the weight of every integral map, denoted D_1(i, l_i) = 1/(mk), i = 1…m, l_i ∈ Y.
Step 4.2.2: take each column of the feature matrix X in turn (i.e. the same feature dimension across all integral maps) as the input of a weak classifier, carry out H rounds of computation, and compute the value r_{f,j} according to:
$$r_{f,j} = \sum_{(i,l)} D_f(i, l_i)\, K_i[l_i]\, h_j(x_i, l_i), \quad j = 1 \dots H$$
where h_j(x_i, l_i) denotes the weak classifier that takes the j-th feature dimension extracted from the i-th integral map as input, D_f(i, l_i) denotes the weight of the i-th training integral map in the f-th iteration, and
$$K_i[l_i] = \begin{cases} +1, & l_i \in \{1, \dots, k\} \\ -1, & l_i \notin \{1, \dots, k\} \end{cases}$$
From the above H weak classifiers, select the h_j(x, l_i) that maximizes r_{f,j}, i.e. r_f = max_j(r_{f,j}); take the feature f_j(x) corresponding to this classifier as the selected feature dimension, and at the same time denote this weak classifier h_t(x, l) and add it to the strong classifier. Here f_j(x) denotes the j-th dimension of the H-dimensional 2D-Haar audio feature vector (i.e. the j-th column of the feature matrix X), and h_j(x, l) denotes the weak classifier that takes the j-th feature dimension as input;
Step 4.2.3: compute the weight α_f of the weak classifier h_j(x, l) selected in step 4.2.2:
$$\alpha_f = \frac{1}{2}\ln\!\left(\frac{1 + r_f}{1 - r_f}\right);$$
Step 4.2.4: compute the weight D_{f+1} of each integral map in the next iteration:
$$D_{f+1}(i, l_i) = \frac{D_f(i, l_i)\exp\!\left(-\alpha_f K_i[l_i]\, h_f(x_i, l_i)\right)}{Z_f}, \quad i = 1 \dots m,$$
where h_f(x_i, l_i) denotes the weak classifier of the f-th iteration that takes the selected j-th feature dimension of the i-th integral map as input, and Z_f is the normalization factor
$$Z_f = \sum_{i,l} D_f(i, l_i)\exp\!\left(-\alpha_f K_i[l_i]\, h_f(x_i, l_i)\right), \quad i = 1 \dots m.$$
Step 4.2.5: substitute the new weights obtained in step 4.2.4 back into step 4.2.2 and, following the method of steps 4.2.2 to 4.2.4, choose a new feature dimension and obtain a new weak classifier to add to the strong classifier;
Step 4.2.6: iterate the method of steps 4.2.2 to 4.2.5 F times to obtain the strong classifier composed of F weak classifiers, i.e. the identification classifier of the k-th speaker, expressed as:
$$W_k(x) = \arg\max_l S_l, \qquad S_l = \sum_{t=1}^{F} \alpha_t h_t(x, l) \qquad (1)$$
Step 4.2.7: after the k rounds of training are finished, gather all k speaker classifiers to form the speaker classifier pool W = {W_1(x), W_2(x), …, W_k(x)}.
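For illustration, a compact Python sketch of the boosting loop of steps 4.2.1 to 4.2.6 in the one-vs-rest setting, using decision stumps as the single-feature weak classifiers; the stump-threshold choice and the data layout are assumptions that the patent does not specify.

```python
import numpy as np

def train_strong_classifier(X, y, F=400):
    """X: (m, H) 2D-Haar feature matrix; y: (m,) array of +1 (target speaker) / -1 (other).
    Returns a list of (feature_index, threshold, polarity, alpha) weak classifiers."""
    m, H = X.shape
    D = np.full(m, 1.0 / m)                       # per-sample weights (step 4.2.1)
    strong = []
    for _ in range(F):
        best = None
        for j in range(H):                        # one weak classifier per feature column (step 4.2.2)
            thr = X[:, j].mean()                  # crude stump threshold -- an assumption
            for polarity in (1, -1):
                h = np.where(polarity * X[:, j] < polarity * thr, 1, -1)
                r = np.sum(D * y * h)
                if best is None or r > best[0]:
                    best = (r, j, thr, polarity, h)
        r_f, j, thr, polarity, h = best
        alpha = 0.5 * np.log((1 + r_f) / (1 - r_f))   # step 4.2.3
        D = D * np.exp(-alpha * y * h)                # step 4.2.4
        D /= D.sum()
        strong.append((j, thr, polarity, alpha))
    return strong

def strong_score(strong, x):
    """S = sum_t alpha_t * h_t(x) for one H-dimensional feature vector x (formula (1))."""
    return sum(a * (1 if p * x[j] < p * t else -1) for j, t, p, a in strong)
```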
Step 5: use the speaker classifier pool obtained from the training in step 4 to extract 2D-Haar audio features from an unknown speaker's voice file and perform speaker identification.
Step 5.1: perform audio feature integral map extraction on the voice file to be identified, obtaining the audio feature integral map sequence G' = {g'_1, g'_2, g'_3, …, g'_{u'}} to be identified. The concrete method is the same as that described in step 2. In the feature map sequence conversion (corresponding to step 2.2), the window length a and step s take the same values as in step 2; similarly, for a voice file to be identified containing c' frames, the number of feature maps u' in the feature map sequence is determined by c', a and s [formula image].
Step 5.2: on the basis of step 5.1, extract the 2D-Haar audio features of every feature map in the feature map sequence according to the 2D-Haar audio feature extraction method described in step 4.1, forming the 2D-Haar audio feature matrix X'.
Step 5.3: input the 2D-Haar audio feature matrix X' obtained in step 5.2 into each classifier of the speaker classifier pool W, obtaining the classification result sequence R.
The classification result sequence R consists of u' elements, each of which is computed as follows:
Step 5.3.1: according to formula (1) in step 4.2.6, read a weak classifier h_t(x, l) in the speaker classifier and its corresponding 2D-Haar audio feature f_j(x);
Step 5.3.2: for each candidate label (i.e. k or "other"), compute the output h_t(f_j(x), l) of this weak classifier, and add this output, weighted by the classifier weight α_t, to the weighted score S_{l_i} of the candidate label l_i;
Step 5.3.3: after carrying out F rounds of the loop according to steps 5.3.1–5.3.2, each candidate label l_i obtains a weighted score S_{l_i}. Select the largest weighted score S_{l_i} and record the corresponding candidate label l_i as the classification result of this audio feature map, denoted (l_i, S_{l_i}), where l_i is the speaker label and S_{l_i} is the corresponding strong classifier weighted sum.
Step 5.3.4: combine all the classification results of the audio to be identified into the classification result sequence R = {(l_i, S_{l_i}, u'): (l_1, S_{l_1}, 1), (l_1, S_{l_1}, 2), (l_2, S_{l_2}, 3), …, (l_i, S_{l_i}, u')}.
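An illustrative sketch of the per-map scoring of step 5.3, reusing the (assumed) weak-classifier representation from the training sketch above:

```python
def classify_sequence(classifier_pool, feature_matrix):
    """classifier_pool: {speaker_label: list of (j, thr, polarity, alpha)} strong classifiers.
    feature_matrix: (u', H) 2D-Haar features of the utterance, one row per feature map.
    Returns [(best_label, best_score), ...], one entry per feature map."""
    def strong_score(strong, x):
        return sum(a * (1 if p * x[j] < p * t else -1) for j, t, p, a in strong)

    results = []
    for x in feature_matrix:                         # score one feature map at a time
        scores = {lab: strong_score(clf, x) for lab, clf in classifier_pool.items()}
        best_label = max(scores, key=scores.get)     # label with the largest weighted sum
        results.append((best_label, scores[best_label]))
    return results
```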
Step 5.4: aggregate the classification result sequence obtained in step 5.3 to obtain the final speaker identification result.
The concrete method is: weight all the strong classifier scores S_{l_i} in the result sequence by speaker label l_i, and output the speaker label with the largest weighted sum as the final identification result for this segment of speech.
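And a corresponding sketch of the score aggregation of step 5.4 (illustrative only):

```python
from collections import defaultdict

def aggregate(results):
    """results: [(label, score), ...] from all feature maps of one utterance.
    Returns the speaker label whose accumulated strong-classifier score is largest."""
    totals = defaultdict(float)
    for label, score in results:
        totals[label] += score        # sum scores per candidate speaker label
    return max(totals, key=totals.get)
```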
Beneficial effect
Compared with characteristic parameter extraction methods based on low-level acoustic principles such as Mel-frequency cepstral coefficients (MFCC) or linear predictive cepstral coefficients (LPCC), the 2D-Haar audio feature extraction method proposed by the present invention introduces a certain amount of temporal-sequence information and extends the audio feature space to hundreds of thousands of dimensions, providing a much larger feature space for the identification algorithm.
Compared with speaker classification methods such as GMM and SVM, the present invention uses the AdaBoost.MH algorithm together with single-feature-input Decision Stump weak classifiers to perform feature screening, which both improves the representativeness and discriminability of the feature vectors and reduces the computational burden of the speaker identification stage, giving a higher running speed. Combining the 2D-Haar audio features with the AdaBoost.MH algorithm enables accurate identification of large-scale speaker sets and has high practical utility.
Description of drawings
Fig. 1 is the schematic block diagram of the present invention;
Fig. 2 is a schematic diagram of the extraction of the audio feature map and feature map sequence proposed by the present invention;
Fig. 3 is a schematic diagram of the speaker registration process of the present invention;
Fig. 4 is a schematic diagram of the speaker identification process of the present invention;
Fig. 5 shows the 5 classes of 2D-Haar audio features used in the speaker training and identification processes of the embodiment;
Fig. 6 shows, for the embodiment, the performance comparison between the present invention and the GMM-UBM algorithm when testing with the TIMIT speech corpus.
Embodiment
In order to better illustrate the objects and advantages of the present invention, the method of the invention is described in further detail below in conjunction with the drawings and examples.
All the tests below were completed on the same computer, configured as follows: Intel dual-core CPU (1.8 GHz), 1 GB RAM, Windows XP SP3 operating system.
Part 1
This part uses the voice files of the TIMIT audio corpus to describe in detail the concrete process of speaker registration/training and speaker identification of the present invention when the target speaker population is 600.
The TIMIT speech corpus is a standard corpus produced jointly by the Massachusetts Institute of Technology, SRI International and Texas Instruments; it contains material from 630 speakers (438 male and 192 female), with 10 utterances per person.
All speech data of 600 people were randomly selected from the speakers; from each person's 10 utterances, 1 file longer than 5 seconds was chosen as that speaker's registration/training voice file; in addition, 1 utterance of 1 randomly chosen person was used as the identification voice file.
The concrete implementation steps are as follows:
Step 1: obtain the voice signals of the speakers to be identified (i.e. the target speakers) and form the basic speech library S.
Since the TIMIT corpus already stores complete audio files, the voice files of the 600 target speakers are used directly to form the basic speech library S = {s_1, s_2, s_3, …, s_k}, where k = 600 is the total number of target speakers.
Step 2: compute audio feature integral maps for the speech in the basic speech library S, forming the basic feature library R. The detailed process is as follows:
Step 2.1: for the k-th target speaker, divide the audio file s_k into frames and extract the basic audio features of each frame (in this embodiment MFCC, LPCC and PLPC are used); combine the basic audio features of all frames into a basic feature file v_k containing c frames with a p-dimensional feature vector per frame.
In this embodiment, the feature vector of each frame in v_k is: {[MFCC (12 dims)], [LPCC (12 dims)], [PLPC (8 dims)]}; the frame length is set to f_s = 30 ms and the frame shift to Δf_s = 20 ms.
[formula image: the frame count c follows from the file duration, f_s and Δf_s], and
$$p = \sum_{1}^{n} p_n = 12 + 12 + 8 = 32 .$$
Step 2.2: for the k-th target speaker's basic feature file v_k, use a sliding window with window length a and step s to convert all c frame feature vectors into an audio feature map sequence file G_k (see Fig. 2). In this embodiment, a = 32 and s = 16.
G_k = {g_1, g_2, g_3, …, g_{u_k}}, where the number of feature maps u_k is determined by c, a and s [formula image].
Step 2.3: on the basis of step 2.2, for the k-th target speaker, compute the feature integral map r_u of every feature map g_u in the feature map sequence file G_k, forming this speaker's feature integral map sequence file R_k = {r_1, r_2, r_3, …, r_u}; putting together the feature integral map sequence files of all 600 target speakers in the basic speech library S forms the basic feature library R = {R_1, R_2, …, R_k}.
It follows that the total number m of feature integral maps of all speakers in the basic feature library is the sum of the numbers of feature maps over all speakers [formula image].
In this embodiment, the total duration of all 600 audio files is 3630.50 s, and the formula gives m = 22690 [formula image].
The feature integral map has the same size as the original feature map, and the value at any point (x, y) on it is defined as the sum of the feature values of the corresponding point (x', y') of the original map and of all points above and to its left. The definition is:
$$ii(x, y) = \sum_{x' \le x,\ y' \le y} i(x', y') ,$$
where ii(x, y) denotes the value of point (x, y) on the integral map and i(x', y') denotes the feature value of point (x', y') on the original feature map.
Step 3: on the basis of the basic feature library R, generate each target speaker's training feature file set B. The detailed process is as follows:
Step 3.1: label the feature files in the basic feature library R. The concrete method is:
Use consecutive integer numbers as speaker labels to represent the different target speakers, for ease of computer processing. The final labeled form is R' = {(R_1, 1), (R_2, 2), …, (R_600, 600)}, where Y = {1, 2, …, 600} is the target speaker label set;
Step 3.2: on the basis of step 3.1, build for each target speaker the label file set B used for speaker registration. The concrete method is:
Perform 600 rounds of sorting over the speaker-labeled feature library R'. In each round, first take the k-th target speaker's audio feature file R_k as the positive sample, keeping its speaker label k; then take the remaining speakers' audio feature files as negative samples and change their speaker labels to "other"; finally store the above 600 audio feature files in a separate folder named B_k, that is:
B_1 = {(R_1, 1), (R_2, other), …, (R_600, other)},
B_2 = {(R_1, other), (R_2, 2), …, (R_600, other)},
……
B_600 = {(R_1, other), (R_2, other), …, (R_600, 600)}
After the 600 rounds of sorting, the final label file set B = {B_1, B_2, …, B_600}, consisting of 600 label folders, is formed.
Step 4: on the basis of step 3, extract the 2D-Haar audio features and perform speaker registration, i.e. traverse the 600 folders in the label file set B in turn and use the training feature files in each to train a separate "one-vs-rest" classifier for each target speaker.
For the k-th target speaker, the training process of its corresponding classifier W_k is as follows:
Step 4.1: perform 2D-Haar audio feature extraction on every integral map in all the feature integral map sequence files R_k in the label folder B_k formed in step 3.2.
From each integral map, compute the corresponding H-dimensional 2D-Haar audio feature values, obtaining the data set S = {(x_1, l_1), …, (x_m, l_m)} used to train the speaker classifier. Here x_i denotes the full H-dimensional 2D-Haar audio feature vector corresponding to the i-th integral map, and l_i ∈ Y (Y = {1, 2, …, k}) denotes the speaker label corresponding to the i-th integral map.
Fig. 5 shows the computation patterns of the 5 classes of 2D-Haar audio features used in this embodiment. The value of each 2D-Haar audio feature dimension is computed on the original audio feature map, over a rectangular region of arbitrary size and position, by subtracting the sum of the feature values in the white region from the sum of the feature values in the black region according to one of the patterns in Fig. 5. This feature has the following three characteristics:
1. Fast computation. With the integral map, extracting a 2D-Haar audio feature of any size requires only a fixed number of data reads and additions/subtractions: a 2D-Haar audio feature containing 2 rectangles only needs 6 points read from the integral map, a 3-rectangle feature only needs 8 points, and a 4-rectangle feature only needs 9 points.
2. Strong discriminability. The dimensionality of the 2D-Haar audio feature space is very high: taking the 5 classes of patterns used in this embodiment as an example, on a single 32 × 32 integral map the 5 classes of patterns can produce more than 510,000 2D-Haar audio features in total; the concrete numbers are shown in Table 2.
Table 2: Numbers of the 5 classes of 2D-Haar audio features on a 32 × 32 integral map [table image]
This dimensionality far exceeds the raw information of the audio FFT energy spectrum, and also far exceeds the dimensionality of the feature space after an SVM nonlinear mapping. 3. Temporal information. Since an audio feature map is composed of a number of consecutive audio frames, the 2D-Haar audio features can also reflect a certain amount of temporal-sequence information.
In this embodiment, the concrete method of 2D-Haar audio feature extraction is: first, from the integral map and according to the method above, compute all 510112 dimensions of 2D-Haar audio feature values, obtaining the 2D-Haar audio feature value set; then record the 510112-dimensional 2D-Haar audio feature vector of each integral map as one row, so that the 2D-Haar audio feature vectors of all m integral maps in the label folder B_k form a feature matrix X with m rows and 510112 columns, with m as given in step 2.3; in this embodiment, m = 22690.
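The figure of 510112 can be reproduced by enumerating all positions and integer scales of the classic two-, three- and four-rectangle Haar patterns in a 32 × 32 map; the sketch below assumes these are the 5 classes of Fig. 5, which is not reproduced here.

```python
def rect_feature_count(n: int, w: int, h: int) -> int:
    """Number of placements of a Haar pattern with base cell size w x h
    (all integer scales and positions) inside an n x n feature map."""
    total = 0
    for a in range(1, n // w + 1):        # horizontal scale
        for b in range(1, n // h + 1):    # vertical scale
            total += (n - a * w + 1) * (n - b * h + 1)
    return total

patterns = {"2-rect horizontal": (2, 1), "2-rect vertical": (1, 2),
            "3-rect horizontal": (3, 1), "3-rect vertical": (1, 3),
            "4-rect checkerboard": (2, 2)}
counts = {name: rect_feature_count(32, w, h) for name, (w, h) in patterns.items()}
print(counts, sum(counts.values()))       # the total comes to 510112 for a 32 x 32 map
```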
Step 4.2: use the AdaBoost.MH method to perform feature screening and classifier training on the 2D-Haar audio feature matrix X obtained in step 4.1, obtaining the speaker classifier. The basic principle of the AdaBoost.MH method is: through F rounds of iteration, select F principal feature dimensions from the 510112-dimensional 2D-Haar audio feature value set while training F weak classifiers, and combine them into a strong classifier.
In this embodiment, the value of F is 400.
The weak classifier used in the above iterations is defined as:
$$h_j(x, y) = \begin{cases} 1, & p_{j,y}\, x_j < p_{j,y}\, \theta_{j,y} \\ -1, & p_{j,y}\, x_j \ge p_{j,y}\, \theta_{j,y} \end{cases} \qquad (2)$$
where x_j denotes the input of the weak classifier, θ_{j,y} denotes the threshold obtained after training, and p_{j,y} indicates the direction of the inequality.
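Formula (2) is an ordinary decision stump; a direct transcription in Python (illustrative only):

```python
def decision_stump(x_j: float, theta: float, polarity: int) -> int:
    """Weak classifier of formula (2): outputs 1 or -1 from a single feature value x_j,
    a learned threshold theta, and a polarity p in {+1, -1}."""
    return 1 if polarity * x_j < polarity * theta else -1
```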
The concrete training process of AdaBoost.MH is:
Step 4.2.1: initialize the weight of every integral map, denoted D_1(i, l_i) = 1/(mk), i = 1…m, l_i ∈ Y.
Step 4.2.2: take each column of the feature matrix X in turn (i.e. the same feature dimension across all integral maps, 510112 dimensions in total) as the input of a weak classifier, carry out 510112 rounds of computation, and compute the value r_{f,j} according to:
$$r_{f,j} = \sum_{(i,l)} D_f(i, l_i)\, K_i[l_i]\, h_j(x_i, l_i), \quad j = 1 \dots 510112$$
where h_j(x_i, l_i) denotes the weak classifier that takes the j-th feature dimension extracted from the i-th integral map as input, D_f(i, l_i) denotes the weight of the i-th training integral map in the f-th iteration, and
$$K_i[l_i] = \begin{cases} +1, & l_i \in \{1, \dots, k\} \\ -1, & l_i \notin \{1, \dots, k\} \end{cases}$$
From the above 510112 weak classifiers, select the h_j(x, l_i) that maximizes r_{f,j}, i.e. r_f = max_j(r_{f,j}); take the feature f_j(x) corresponding to this classifier as the selected feature dimension, and at the same time denote this weak classifier h_t(x, l) and add it to the strong classifier. Here f_j(x) denotes the j-th dimension of the 510112-dimensional 2D-Haar audio feature vector (i.e. the j-th column of the feature matrix X), and h_j(x, l) denotes the weak classifier that takes the j-th feature dimension as input;
Step 4.2.3: compute the weight α_f of the weak classifier h_j(x, l) selected in step 4.2.2:
$$\alpha_f = \frac{1}{2}\ln\!\left(\frac{1 + r_f}{1 - r_f}\right);$$
Step 4.2.4: compute the weight D_{f+1} of each integral map in the next iteration:
$$D_{f+1}(i, l_i) = \frac{D_f(i, l_i)\exp\!\left(-\alpha_f K_i[l_i]\, h_f(x_i, l_i)\right)}{Z_f}, \quad i = 1 \dots m,$$
where h_f(x_i, l_i) denotes the weak classifier of the f-th iteration that takes the selected j-th feature dimension of the i-th integral map as input, and Z_f is the normalization factor
$$Z_f = \sum_{i,l} D_f(i, l_i)\exp\!\left(-\alpha_f K_i[l_i]\, h_f(x_i, l_i)\right), \quad i = 1 \dots m.$$
Step 4.2.5: substitute the new weights obtained in step 4.2.4 back into step 4.2.2 and, following the method of steps 4.2.2 to 4.2.4, choose a new feature dimension and obtain a new weak classifier to add to the strong classifier;
Step 4.2.6: iterate the method of steps 4.2.2 to 4.2.5 400 times to obtain the strong classifier composed of 400 weak classifiers, i.e. the identification classifier of the k-th speaker, expressed as:
$$W_k(x) = \arg\max_l S_l, \qquad S_l = \sum_{t=1}^{F} \alpha_t h_t(x, l) \qquad (1)$$
Step 4.2.7: after the 600 rounds of training are finished, gather all 600 speaker classifiers to form the speaker classifier pool W = {W_1(x), W_2(x), …, W_600(x)}.
Step 5: use the speaker classifier pool obtained from the training in step 4 to extract 2D-Haar audio features from an unknown speaker's voice file and perform speaker identification.
Step 5.1: perform audio feature integral map extraction on the voice file to be identified, obtaining the audio feature integral map sequence G' = {g'_1, g'_2, g'_3, …, g'_{u'}} to be identified. The concrete method is identical to that described in step 2. In the extraction of the basic feature file v_k (corresponding to step 2.1), the frame length is set to f_s = 30 ms and the frame shift to Δf_s = 20 ms; in the feature map sequence conversion (corresponding to step 2.2), the window length is a = 32 and the step is s = 16. In this embodiment, the total duration of the voice file to be identified is 6.30 s, so [formula image]
$$p = \sum_{1}^{n} p_n = 12 + 12 + 8 = 32 .$$
Similarly, the total frame count c' of the speech to be identified is determined by the length of the voice file to be identified, and the number of feature maps u' in the feature map sequence is determined by c', a and s [formula image].
Step 5.2: on the basis of step 5.1, extract the 2D-Haar audio features of every feature map in the feature map sequence according to the 2D-Haar audio feature extraction method described in step 4.1, forming the 2D-Haar audio feature matrix X' with 39 rows and 510112 columns.
Step 5.3: input the 2D-Haar audio feature matrix X' obtained in step 5.2 into each classifier of the speaker classifier pool W, obtaining the classification result sequence R.
The classification result sequence R consists of u' elements, each of which is computed as follows:
Step 5.3.1: according to formula (1) in step 4.2.6, read a weak classifier h_t(x, l) in the speaker classifier and its corresponding 2D-Haar audio feature f_j(x);
Step 5.3.2: for each candidate label (i.e. k or "other"), compute the output h_t(f_j(x), l) of this weak classifier, and add this output, weighted by the classifier weight α_t, to the weighted score S_{l_i} of the candidate label l_i;
Step 5.3.3: after carrying out 400 rounds of the loop according to steps 5.3.1–5.3.2, each candidate label l_i obtains a weighted score S_{l_i}. Select the largest weighted score S_{l_i} and record the corresponding candidate label l_i as the classification result of this audio feature map, denoted (l_i, S_{l_i}), where l_i is the speaker label and S_{l_i} is the corresponding strong classifier weighted sum.
Step 5.3.4: combine all the classification results of the audio to be identified into the classification result sequence R = {(l_i, S_{l_i}, u'): (l_1, S_{l_1}, 1), (l_1, S_{l_1}, 2), (l_2, S_{l_2}, 3), …, (l_i, S_{l_i}, u')}.
Step 5.4: aggregate the classification result sequence obtained in step 5.3 to obtain the final speaker identification result.
The concrete method is: weight all the strong classifier scores S_{l_i} in the result sequence by speaker label l_i, and output the speaker label with the largest weighted sum as the final identification result for this segment of speech.
Part 2
This part tests the performance of the present invention. The test platform, the speaker registration/training procedure and the speaker identification procedure are the same as in Part 1 and are not repeated here; the emphasis is on the method and results of the performance test.
The experimental data were generated by the following steps: (1) all speech data of 100, 200, 300, 400, 500 and 600 people were randomly selected from the speakers; (2) from each person's utterances, 7 were chosen as training data and 3 as target test data; (3) for each target speaker, 50 other people's utterances were randomly selected as impostor test data.
For comparison, the GMM-UBM method was adopted as the baseline. Each target speaker underwent 3 target tests and 50 impostor tests; the false acceptance rate (False Acceptance Rate, FAR) and false rejection rate (False Rejection Rate, FRR) of the two methods were recorded, DET curves were drawn, and accuracy and identification time were tallied. Here FAR and FRR are defined in the usual way [formula image], and accuracy = 1 - equal error rate.
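A minimal sketch of how FAR, FRR and the accuracy figure could be computed from the test outcomes (illustrative; the threshold sweep needed to locate the equal error rate is omitted):

```python
def far_frr(impostor_accepted: int, impostor_total: int,
            target_rejected: int, target_total: int):
    """FAR = impostor trials wrongly accepted / all impostor trials;
    FRR = target trials wrongly rejected / all target trials."""
    far = impostor_accepted / impostor_total
    frr = target_rejected / target_total
    return far, frr

def accuracy_from_eer(eer: float) -> float:
    """Accuracy as used in the experiments: 1 - equal error rate,
    where the EER is the operating point at which FAR == FRR."""
    return 1.0 - eer
```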
As the speaker population was increased from 100 to 600, the performance of the two methods is shown in Fig. 6 and Table 3. It can be seen that as the speaker population keeps growing, the identification accuracy of the baseline method drops noticeably, while the downward trend of the proposed method is much gentler: at the 600-person scale its accuracy exceeds that of the baseline by 4.3%, and its average identification accuracy over the 6 speaker scales reaches 91.3%.
Table 3: Accuracy (%) of the two methods at different speaker scales [table image]
To evaluate the time efficiency of the proposed algorithm, the average identification time t per second of speech was measured for different 2D-Haar feature dimensionalities F. As shown in Table 4, the proposed method achieves a high identification speed.
Table 4: Average identification time of the proposed method for different values of F [table image]
The above experiments show that the 2D-Haar audio feature effectively expands the dimensionality of the feature space while introducing temporal-sequence information, making it possible to train better-performing classifiers; at the same time, using the AdaBoost.MH algorithm with single-feature-input Decision Stump weak classifiers for feature screening both improves the representativeness and discriminability of the feature vectors and reduces the computational burden of the identification stage, giving a higher identification speed. Combining the 2D-Haar audio features with the AdaBoost.MH algorithm enables accurate identification of large-scale speaker sets.

Claims (9)

1. A large-scale speaker identification method, characterized in that the method comprises the following steps:
Step 1: obtain the voice signals of the speakers to be identified (i.e. the target speakers) and form the basic speech library S.
Step 2: compute audio feature integral maps for the speech in the basic speech library S, forming the basic feature library R.
Step 3: on the basis of the basic feature library R, generate each target speaker's training feature file set B.
Step 4: on the basis of step 3, extract the 2D-Haar audio features and perform speaker registration, i.e. traverse the k folders in the label file set B in turn and use the training feature files in each to train a separate "one-vs-rest" classifier for each target speaker, finally obtaining a classifier pool composed of k speaker classifiers.
Step 5: use the speaker classifier pool obtained from the training in step 4 to extract 2D-Haar audio features from an unknown speaker's voice file and perform speaker identification.
2. The method according to claim 1, characterized in that obtaining the voice signal of the speaker to be identified does not require the speaker to pronounce according to text content preset in a feature template.
3. The method according to claim 1, characterized in that the step of computing the audio feature integral maps specifically comprises:
Step 1: for the k-th target speaker, divide the audio file s_k into frames (the frame length f_s and frame shift Δf_s are set by the user), extract the basic audio features of each frame (e.g. MFCC, LPCC, sub-band energy, etc.; which features are used is specified by the user), and combine the basic audio features of all frames into a basic feature file v_k containing c frames with a p-dimensional feature vector per frame.
The feature vector of each frame in v_k is: {[basic feature 1 (p_1 dims)], [basic feature 2 (p_2 dims)], …, [basic feature n (p_n dims)]}.
Step 2: for the k-th target speaker's basic feature file v_k, use a sliding window with window length a and step s to convert all c frame feature vectors into an audio feature map sequence file G_k,
G_k = {g_1, g_2, g_3, …, g_u}.
Step 3: on the basis of step 2, for the k-th target speaker, compute the feature integral map r_u of every feature map g_u in the feature map sequence file G_k, forming this speaker's feature integral map sequence file R_k = {r_1, r_2, r_3, …, r_u}; putting together the feature integral map sequence files of all k target speakers in the basic speech library S forms the basic feature library R = {R_1, R_2, …, R_k}.
The feature integral map has the same size as the original feature map, and the value at any point (x, y) on it is defined as the sum of the feature values of the corresponding point (x', y') of the original map and of all points above and to its left. The definition is:
$$ii(x, y) = \sum_{x' \le x,\ y' \le y} i(x', y') ,$$
where ii(x, y) denotes the value of point (x, y) on the integral map and i(x', y') denotes the feature value of point (x', y') on the original feature map.
4. The method according to claim 1, characterized in that the 2D-Haar audio features are computed as follows:
Each dimension of the 2D-Haar audio feature is computed on the original audio feature map, over a rectangular region of arbitrary size and position, by subtracting the sum of feature values in one specific rectangular sub-region from the sum of feature values in another; these values can be computed quickly from the integral map. The total dimensionality H is determined by the 2D-Haar audio feature types adopted and the size of the integral map.
The H-dimensional 2D-Haar audio feature vector of each integral map is recorded as one row, so that the full H-dimensional 2D-Haar audio feature vectors of all m integral maps in the label folder B_k form a feature matrix X with m rows and H columns.
5. The method according to claim 1, characterized in that the classifier pool composed of k speaker classifiers is obtained by k rounds of training; each round of training selects, through F iterations, F principal feature dimensions from the set of H-dimensional 2D-Haar audio feature values while training F weak classifiers, and combines them into a strong classifier. The concrete method is:
Step 1: initialize the weight of every integral map, denoted D_1(i, l_i) = 1/(mk), i = 1…m, l_i ∈ Y.
Step 2: take each column of the feature matrix X in turn (i.e. the same feature dimension across all integral maps) as the input of a weak classifier, carry out H rounds of computation, and compute the value r_{f,j} according to:
$$r_{f,j} = \sum_{(i,l)} D_f(i, l_i)\, K_i[l_i]\, h_j(x_i, l_i), \quad j = 1 \dots H$$
where h_j(x_i, l_i) denotes the weak classifier that takes the j-th feature dimension extracted from the i-th integral map as input, D_f(i, l_i) denotes the weight of the i-th training integral map in the f-th iteration, and
$$K_i[l_i] = \begin{cases} +1, & l_i \in \{1, \dots, k\} \\ -1, & l_i \notin \{1, \dots, k\} \end{cases}$$
From the above H weak classifiers, select the h_j(x, l_i) that maximizes r_{f,j}, i.e. r_f = max_j(r_{f,j}); take the feature f_j(x) corresponding to this classifier as the selected feature dimension, and at the same time denote this weak classifier h_t(x, l) and add it to the strong classifier. Here f_j(x) denotes the j-th dimension of the H-dimensional 2D-Haar audio feature vector (i.e. the j-th column of the feature matrix X), and h_j(x, l) denotes the weak classifier that takes the j-th feature dimension as input;
Step 3: compute the weight α_f of the weak classifier h_j(x, l) selected in step 2:
$$\alpha_f = \frac{1}{2}\ln\!\left(\frac{1 + r_f}{1 - r_f}\right);$$
Step 4: compute the weight D_{f+1} of each integral map in the next iteration:
$$D_{f+1}(i, l_i) = \frac{D_f(i, l_i)\exp\!\left(-\alpha_f K_i[l_i]\, h_f(x_i, l_i)\right)}{Z_f}, \quad i = 1 \dots m,$$
where h_f(x_i, l_i) denotes the weak classifier of the f-th iteration that takes the selected j-th feature dimension of the i-th integral map as input, and Z_f is the normalization factor
$$Z_f = \sum_{i,l} D_f(i, l_i)\exp\!\left(-\alpha_f K_i[l_i]\, h_f(x_i, l_i)\right), \quad i = 1 \dots m.$$
Step 5: substitute the new weights obtained in step 4 back into step 2 and, following the method of steps 2 to 4, choose a new feature dimension and obtain a new weak classifier to add to the strong classifier;
Step 6: iterate the method of steps 2 to 5 F times to obtain the strong classifier composed of F weak classifiers, i.e. the identification classifier of the k-th speaker, expressed as:
$$W_k(x) = \arg\max_l S_l, \qquad S_l = \sum_{t=1}^{F} \alpha_t h_t(x, l) \qquad (1)$$
6. The method according to claim 5, characterized in that the weak classifiers used in the iterations must meet the following conditions: (1) the input of a weak classifier is a one-dimensional feature value (i.e. one specific dimension of the feature vector, or one column of the feature matrix X); (2) for a speaker label l_i to be identified, the output of the weak classifier is 1 or -1.
7. The method according to claim 1, characterized in that the steps of speaker identification are:
Step 1: perform audio feature integral map extraction on the voice file to be identified, obtaining the audio feature integral map sequence G' = {g'_1, g'_2, g'_3, …, g'_{u'}} to be identified [formula image];
the concrete method and parameter values are the same as described in claim 3.
Step 2: on the basis of step 1, extract the 2D-Haar audio features of every feature map in the feature map sequence, forming the 2D-Haar audio feature matrix X'; the concrete method is the same as described in claim 4.
Step 3: input the 2D-Haar audio feature matrix X' obtained in step 2 into each classifier of the speaker classifier pool W, obtaining the classification result sequence R.
Step 4: aggregate the classification result sequence obtained in step 3 to obtain the final speaker identification result.
8. The method according to claim 7, characterized in that the classification result sequence R consists of u' elements, each of which is computed as follows:
Step 1: according to formula (1) in claim 5, read a weak classifier h_t(x, l) in the speaker classifier and its corresponding 2D-Haar audio feature f_j(x);
Step 2: for each candidate label (i.e. k or "other"), compute the output h_t(f_j(x), l) of this weak classifier, and add this output, weighted by the classifier weight α_t, to the weighted score S_{l_i} of the candidate label l_i;
Step 3: after carrying out F rounds of the loop according to steps 1–2, each candidate label l_i obtains a weighted score S_{l_i}; select the largest weighted score S_{l_i} and record the corresponding candidate label l_i as the classification result of this audio feature map, denoted (l_i, S_{l_i}), where l_i is the speaker label and S_{l_i} is the corresponding strong classifier weighted sum.
Step 4: combine all the classification results of the audio to be identified into the classification result sequence R = {(l_i, S_{l_i}, u'): (l_1, S_{l_1}, 1), (l_1, S_{l_1}, 2), (l_2, S_{l_2}, 3), …, (l_i, S_{l_i}, u')}.
9. The method according to claim 7, characterized in that the result aggregation in claim 7 is computed as follows:
weight all the strong classifier scores S_{l_i} in the result sequence by speaker label l_i, and output the speaker label with the largest weighted sum as the final identification result for this segment of speech.
CN201310074743.9A 2013-03-08 2013-03-08 Large-scale speaker identification method Expired - Fee Related CN103258536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310074743.9A CN103258536B (en) 2013-03-08 2013-03-08 Large-scale speaker identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310074743.9A CN103258536B (en) 2013-03-08 2013-03-08 A kind of extensive speaker's identification method

Publications (2)

Publication Number Publication Date
CN103258536A true CN103258536A (en) 2013-08-21
CN103258536B CN103258536B (en) 2015-10-21

Family

ID=48962410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310074743.9A Expired - Fee Related CN103258536B (en) 2013-03-08 2013-03-08 Large-scale speaker identification method

Country Status (1)

Country Link
CN (1) CN103258536B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106448682A (en) * 2016-09-13 2017-02-22 Tcl集团股份有限公司 Open-set speaker recognition method and apparatus
CN107393527A (en) * 2017-07-17 2017-11-24 广东讯飞启明科技发展有限公司 The determination methods of speaker's number
CN108309303A (en) * 2017-12-26 2018-07-24 上海交通大学医学院附属第九人民医院 A kind of wearable freezing of gait intellectual monitoring and walk-aid equipment
CN108962231A (en) * 2018-07-04 2018-12-07 武汉斗鱼网络科技有限公司 A kind of method of speech classification, device, server and storage medium
CN110134819A (en) * 2019-04-25 2019-08-16 广州智伴人工智能科技有限公司 A kind of speech audio screening system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030009333A1 (en) * 1996-11-22 2003-01-09 T-Netix, Inc. Voice print system and method
CN101770774A (en) * 2009-12-31 2010-07-07 吉林大学 Embedded-based open set speaker recognition method and system thereof

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030009333A1 (en) * 1996-11-22 2003-01-09 T-Netix, Inc. Voice print system and method
US6539352B1 (en) * 1996-11-22 2003-03-25 Manish Sharma Subword-based speaker verification with multiple-classifier score fusion weight and threshold adaptation
CN101770774A (en) * 2009-12-31 2010-07-07 吉林大学 Embedded-based open set speaker recognition method and system thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
范小春 邱政权: "基于HAAR小波的分级说话人辨识", 《计算机工程与应用》, 31 December 2010 (2010-12-31), pages 122 - 124 *
范小春 邱政权: "说话人识别中的HOCOR和改进的MCE", 《科学技术与工程》, 31 December 2008 (2008-12-31) *
谢尔曼 罗森林 潘丽敏: "基于Haar特征的Turbo-Boost表情识别算法", 《计算机辅助设计与图形学学报》, 31 December 2011 (2011-12-31), pages 1442 - 1446 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106448682A (en) * 2016-09-13 2017-02-22 Tcl集团股份有限公司 Open-set speaker recognition method and apparatus
CN107393527A (en) * 2017-07-17 2017-11-24 广东讯飞启明科技发展有限公司 The determination methods of speaker's number
CN108309303A (en) * 2017-12-26 2018-07-24 上海交通大学医学院附属第九人民医院 A kind of wearable freezing of gait intellectual monitoring and walk-aid equipment
CN108962231A (en) * 2018-07-04 2018-12-07 武汉斗鱼网络科技有限公司 A kind of method of speech classification, device, server and storage medium
CN108962231B (en) * 2018-07-04 2021-05-28 武汉斗鱼网络科技有限公司 Voice classification method, device, server and storage medium
CN110134819A (en) * 2019-04-25 2019-08-16 广州智伴人工智能科技有限公司 A kind of speech audio screening system
CN110134819B (en) * 2019-04-25 2021-04-23 广州智伴人工智能科技有限公司 Voice audio screening system

Also Published As

Publication number Publication date
CN103258536B (en) 2015-10-21

Similar Documents

Publication Publication Date Title
CN103198833B (en) A kind of high precision method for identifying speaker
An et al. Deep CNNs with self-attention for speaker identification
CN105261367B (en) A kind of method for distinguishing speek person
CN103854645B (en) A kind of based on speaker&#39;s punishment independent of speaker&#39;s speech-emotion recognition method
CN103177733B (en) Standard Chinese suffixation of a nonsyllabic &#34;r&#34; sound voice quality evaluating method and system
CN112562741B (en) Singing voice detection method based on dot product self-attention convolution neural network
CN103544963A (en) Voice emotion recognition method based on core semi-supervised discrimination and analysis
CN103258536B (en) A kind of extensive speaker&#39;s identification method
CN101136199A (en) Voice data processing method and equipment
CN105810191B (en) Merge the Chinese dialects identification method of prosodic information
CN110992988B (en) Speech emotion recognition method and device based on domain confrontation
CN104464738B (en) A kind of method for recognizing sound-groove towards Intelligent mobile equipment
Fan et al. Deep Hashing for Speaker Identification and Retrieval.
CN103136546A (en) Multi-dimension authentication method and authentication device of on-line signature
CN114220179A (en) On-line handwritten signature handwriting retrieval method and system based on faiss
CN106531170B (en) Spoken assessment identity identifying method based on speaker Recognition Technology
Michalevsky et al. Speaker identification using diffusion maps
Gu et al. A text-independent speaker verification system using support vector machines classifier.
Wu et al. Research on voiceprint recognition based on weighted clustering recognition SVM algorithm
Sharma et al. Speech emotion recognition using kernel sparse representation based classifier
Jin et al. Text-independent writer identification based on fusion of dynamic and static features
Kotropoulos et al. Ensemble discriminant sparse projections applied to music genre classification
Houcine et al. Novel approach in speaker identification using SVM and GMM
Lei et al. Mahalanobis Metric Scoring Learned from Weighted Pairwise Constraints in I-Vector Speaker Recognition System.
Raghavan et al. Speaker verification using support vector machines

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20151021

Termination date: 20160308

CF01 Termination of patent right due to non-payment of annual fee