CN103151039A - Speaker age identification method based on SVM (Support Vector Machine)

Info

Publication number: CN103151039A
Application number: CN2013100494454A
Authority: CN
Inventors: 熊刚; 孔庆杰; 朱菁; 王飞跃; 赵红霞; 朱凤华
Original assignee: Institute of Automation of Chinese Academy of Science; Cloud Computing Industry Technology Innovation and Incubation Center of CAS
Current assignee: Institute of Automation of Chinese Academy of Science; Cloud Computing Industry Technology Innovation and Incubation Center of CAS
Priority date: 2013-02-07
Filing date: 2013-02-07
Publication date: 2013-06-12

Abstract

The invention discloses a speaker age-group identification method based on an SVM (Support Vector Machine) classifier. The method comprises the following steps: a voice library storing voice signals of speakers of different age groups is established; the voice signals in the voice library are preprocessed; voice feature parameters are extracted from the preprocessed voice signals; SVM training is performed on the extracted voice feature parameters to obtain an SVM model; and, using the SVM model, a prediction is made for the voice feature parameters X of the voice to be identified. During prediction, the output of each SVM is passed through a logical decision, and the age class with the most votes is selected as the most probable age class, giving the final age identification result. The method fills, to a certain degree, a gap in the prior art concerning speaker age identification, allows speaker age to be judged more reliably, and has broad application prospects in settings such as man-machine interaction, criminal investigation, games and entertainment.

Description

A speaker age-group recognition method based on a support vector machine (SVM)

Technical field

The present invention relates to the field of pattern recognition, and in particular to a speaker age-group recognition method based on a support vector machine (Support Vector Machine, SVM).

Background technology

At present, research on speech recognition, speaker identification and related topics is relatively mature. Other studies built on this foundation, such as Chinese speech emotion recognition, speaker gender identification, and audio classification and recognition, have also produced corresponding solutions. Speaker age-group recognition, however, has received almost no research attention, even though it is useful in many settings. In an interactive system, a machine that recognizes the speaker's age group can reply with speech suited to that age group, making man-machine interaction feel friendlier. In criminal investigation, the age group of a suspect identified from an audio recording can narrow the search. The speaker age-group recognition method proposed by the present invention can therefore provide a basis for developing applications in these settings.

Generally, a person's life can be roughly divided into the following stages: childhood (0 to 11 years), adolescence (12 to 17 years), youth (18 to 34 years), middle age (35 to 50 years) and old age (over 50 years). As a person grows older and passes through these stages, his or her voice changes gradually, while the voices of people in the same age group share common characteristics. The present invention is built around this property: the speech of speakers in each age group carries characteristics of that age group.

Because SVM classification performs well in applications such as audio classification, speaker gender identification and image recognition, the present invention adopts an SVM model for classification and recognition. Among speech feature parameters, the Mel-frequency cepstral coefficients (MFCC) are acoustic features derived from the auditory properties of the human ear: the pitch the ear perceives is not simply linear in the frequency of the sound. Studies show that perceived frequency is approximately linear in actual frequency below 1 kHz, and approximately linear in the logarithm of frequency above 1 kHz. MFCC are cepstral parameters extracted in the Mel-scale frequency domain; they de-emphasize the high-frequency components of the speech spectrum and are robust to noise, so they are used here as the feature parameters for training and recognition with the SVM classifier.
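By way of illustration only, the near-linear behaviour below 1 kHz and the logarithmic behaviour above it can be seen in a small sketch. The description does not state which Mel-scale formula is used; the code assumes the widely used mapping mel = 2595 log10(1 + f/700), and the function names are hypothetical.

```python
import math


def hz_to_mel(freq_hz):
    """Map a frequency in Hz to the Mel scale (assumed 2595*log10(1 + f/700) form)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal steps in Hz give shrinking steps in mel as the frequency rises.
for f in (250, 500, 1000, 2000, 4000, 8000):
    print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
```

Under this assumed mapping, 1000 Hz falls at roughly 1000 mel, and each doubling of frequency above that point adds a progressively smaller number of mels relative to the added Hz.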

Summary of the invention

The object of the present invention is to combine an SVM classifier with the MFCC feature parameters of the voice signal to determine a speaker's age group, for use wherever this is needed. Specifically, speech feature parameters that can distinguish speakers' age groups are extracted, and an SVM is trained on them and used to identify the age group to which a speaker belongs.

To achieve the above object, the speaker age-group recognition method based on a support vector machine proposed by the present invention comprises the following steps:

Step 1: establish a voice library storing voice signals of a plurality of speakers of different age groups;

Step 2: preprocess the voice signals in the voice library;

Step 3: extract speech feature parameters from the preprocessed voice signals;

Step 4: perform support vector machine training on the extracted speech feature parameters to obtain a support vector machine model;

Step 5: using the support vector machine model trained in step 4, predict on the speech feature parameters X of the voice to be identified; during prediction, the output of each support vector machine is passed through a logical decision and the class with the most votes is selected as the most probable age-group class, giving the final age-group recognition result.

In summary, the present invention provides a method for recognizing a speaker's age group. Because there is at present essentially no research on speaker age-group recognition, the invention has broad application prospects, for example in man-machine interaction, criminal investigation, online chat, and games and entertainment. The invention uses a support vector machine classifier together with characteristic feature parameters of the voice signal to identify the age group to which a speaker belongs. The MFCC feature parameters extracted in the method match the characteristics of human hearing and, after training, can effectively distinguish speakers of different age groups. MFCC are also robust to noise and are very widely used in the field of speaker identification. The SVM classifier can reduce the dimensionality of the feature parameters and classifies well in recognition applications. The invention trains an SVM on the MFCC parameters of speech from different age groups and then predicts on the speech parameters under test, which allows the speaker's age group to be determined reasonably well. However, because a speaker's voice changes slowly over time, speech near the boundaries between age groups is harder to recognize; in addition, an individual speaker's voice may not match the typical characteristics of his or her age group, which also makes recognition harder. Taking these factors into account, the average recognition rate of the invention across all age groups is estimated to reach 70% or more.

Description of drawings

Fig. 1 is a flowchart of the speaker age-group recognition method based on a support vector machine according to the present invention;

Fig. 2 is a flowchart of SVM training according to an embodiment of the present invention;

Fig. 3 is a diagram of SVM decision and recognition according to an embodiment of the present invention.

Embodiment

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings.

Fig. 1 is a flowchart of the speaker age-group recognition method based on a support vector machine according to the present invention. As shown in Fig. 1, the method comprises the following steps:

Step 1: establish a voice library storing voice signals of a plurality of speakers of different age groups, the voice signals being in units of phrases;

In this step, a voice recorder or other recording device is first used to collect speech from speakers of different age groups. The recording format can be unified to a 16 kHz sampling rate, 16-bit samples, mono. In an embodiment of the invention, 20 speakers (10 male and 10 female) are recorded for each age group; the reading material consists of classic prose pieces such as "Moonlight over the Lotus Pond" and "The Sight of Father's Back", each read once. The recordings are then cut into speech segments in units of phrases.

Step 2: preprocess the voice signals in the voice library;

In this step, the preprocessing further comprises the following steps:

Step 21: sample and quantize the voice signal;

Step 22: to remove the effect of lip and nostril radiation and boost the high-frequency part of the signal, apply pre-emphasis to the quantized voice signal using the following filter:

$$H(z) = 1 - 0.9375\,z^{-1}$$

where H(z) is the z-domain transfer function of the pre-emphasis filter applied to the voice signal; the output of the filter is the pre-emphasized voice signal;
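By way of illustration only, the pre-emphasis of step 22 can be sketched in the time domain. The sketch reads H(z) = 1 - 0.9375 z^-1 as the first-order difference y(n) = x(n) - 0.9375 x(n-1), which is the standard interpretation of such a filter; the function name and the choice to pass the first sample through unchanged (x(-1) taken as 0) are assumptions.

```python
def pre_emphasis(samples, alpha=0.9375):
    """Apply y[n] = x[n] - alpha * x[n-1]; x[-1] is taken as 0 (assumption)."""
    if not samples:
        return []
    out = [float(samples[0])]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out


# A constant (low-frequency) input is strongly attenuated,
# while a sample-to-sample alternating (high-frequency) input is boosted.
print(pre_emphasis([1, 1, 1, 1]))    # [1.0, 0.0625, 0.0625, 0.0625]
print(pre_emphasis([1, -1, 1, -1]))  # [1.0, -1.9375, 1.9375, -1.9375]
```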

Step 23: because the voice signals are in units of phrases and there are pauses between the words within a phrase, an endpoint detection method based on energy and zero-crossing rate is used to remove the silent segments from each voice signal.

The endpoint detection method uses a two-stage decision and comprises the following steps:

Step 231: divide the voice signal into short-time frames with a frame length of 20 ms, i.e. 320 samples at the 16 kHz sampling rate, to obtain a plurality of speech frames;

Step 232: compute the short-time energy and short-time zero-crossing rate of each speech frame;

Step 233: set a relatively high decision threshold E1 according to the average energy of all speech frames, and compare the short-time energy of each speech frame with E1 to obtain preliminary speech endpoints; the actual speech endpoints lie outside the time interval bounded by the intersections of the threshold E1 with the short-time energy envelope;

Step 234: set a somewhat lower decision threshold E2 according to the average energy of the background noise and, starting from the preliminary result of step 233, determine the speech endpoints;

Step 235: set a threshold Z1 according to the average zero-crossing rate of the background noise and, starting from the endpoints found in step 234, locate the unvoiced consonants at the front of the speech and the trailing sounds at its end, finally obtaining the endpoints of the speech segments and the silent segments.
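By way of illustration only, steps 231 to 235 can be sketched as a per-frame speech mask. The description does not give threshold values or say how the background noise is estimated, so the sketch assumes that the first few frames contain only background noise, that frames do not overlap, and that the multipliers k1, k2 and kz are free tuning parameters; all names are hypothetical.

```python
FRAME_LEN = 320  # 20 ms at a 16 kHz sampling rate


def split_frames(samples, frame_len=FRAME_LEN):
    """Step 231: cut the signal into non-overlapping frames (overlap is not specified)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def short_time_energy(frame):
    return sum(s * s for s in frame)


def zero_crossing_rate(frame):
    """Number of sign changes between consecutive samples in the frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def _grow(mask, keep):
    """Extend every True run of mask into adjacent frames for which keep is True."""
    out = list(mask)
    for i in range(1, len(out)):            # grow towards the end
        if out[i - 1] and keep[i]:
            out[i] = True
    for i in range(len(out) - 2, -1, -1):   # grow towards the start
        if out[i + 1] and keep[i]:
            out[i] = True
    return out


def detect_speech_frames(samples, noise_frames=5, k1=0.5, k2=4.0, kz=2.0):
    """Return one boolean per frame: True for speech, False for silence."""
    frames = split_frames(samples)
    energy = [short_time_energy(f) for f in frames]   # step 232
    zcr = [zero_crossing_rate(f) for f in frames]
    # Assumption: the first noise_frames frames are background noise only.
    noise_e = sum(energy[:noise_frames]) / noise_frames
    noise_z = sum(zcr[:noise_frames]) / noise_frames
    e1 = k1 * sum(energy) / len(energy)               # step 233: higher threshold
    e2 = min(e1, k2 * noise_e)                        # step 234: lower threshold
    z1 = kz * noise_z                                 # step 235: zero-crossing threshold
    mask = [e > e1 for e in energy]                   # preliminary decision
    mask = _grow(mask, [e > e2 for e in energy])      # widen using E2
    mask = _grow(mask, [z > z1 for z in zcr])         # add unvoiced onsets and tails
    return mask
```

Frames marked False would be dropped before feature extraction. The second and third passes only ever widen a region found by the first pass, which mirrors the statement that the true endpoints lie outside the interval bounded by E1.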

Step 3: extract speech feature parameters from the preprocessed voice signals;

In this step, the speech feature parameters are MFCC; in an embodiment of the invention the MFCC are, for example, 12-dimensional. The feature extraction process may comprise the following steps:

Step 31: divide the frequency range of the voice signal into a bank of triangular Mel filters;

Step 32: take the weighted sum of all signal amplitudes within the bandwidth of each triangular Mel filter as the output of that filter;

Step 33: take the logarithm of the outputs of all filters;

Step 34: apply a discrete cosine transform to the result of step 33 to obtain the MFCC.
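By way of illustration only, steps 31 to 34 can be sketched for a single frame. Several details are not given in the description and are assumed here: the signal amplitudes are taken from the magnitude of a discrete Fourier transform of the frame, the Mel mapping is 2595 log10(1 + f/700), 24 filters are used, no window function is applied, and the 12 coefficients are DCT terms 1 to 12 (the zeroth term is skipped). All names are hypothetical.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (O(N^2), adequate for a 320-sample frame)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Step 31: triangular filters whose centres are equally spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    edges = [hz * n_fft / sample_rate for hz in edges_hz]  # fractional bin positions
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        weights = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                weights.append((k - lo) / (mid - lo))   # rising edge
            elif mid < k < hi:
                weights.append((hi - k) / (hi - mid))   # falling edge
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def mfcc(frame, sample_rate=16000, num_filters=24, num_coeffs=12):
    spectrum = magnitude_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Step 32: weighted sum of amplitudes inside each filter's band.
    outputs = [sum(w * s for w, s in zip(weights, spectrum)) for weights in bank]
    # Step 33: logarithm (floored to avoid log(0)).
    log_out = [math.log(max(o, 1e-10)) for o in outputs]
    # Step 34: DCT-II of the log outputs, keeping terms 1..num_coeffs.
    return [sum(v * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j, v in enumerate(log_out))
            for c in range(1, num_coeffs + 1)]
```

Calling mfcc on one 320-sample frame returns a 12-element list; stacking such lists over the speech frames of a phrase gives the feature vectors passed on to training.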

Step 4: perform SVM training on the extracted speech feature parameters to obtain an SVM model;

As shown in Fig. 2, the SVM training process comprises the following steps:

Step 41: take the extracted MFCC speech feature parameters of each age group as feature vectors;

Step 42: add class labels to the speech feature parameters of each age group. In an embodiment of the invention there are 5 age groups (childhood, adolescence, youth, middle age and old age), i.e. 5 classes of data, which are assigned the 5 class labels {1, 2, 3, 4, 5};

Step 43: normalize the feature vectors, scaling them proportionally into the range [-1, +1];

Step 44: train on the normalized feature vectors of each age group to obtain a set of support vector machines. For example, the svmtrain tool of LIBSVM, developed by Prof. Chih-Jen Lin of National Taiwan University and colleagues, can be used (see C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, 2:27:1--27:27, 2011). Because an embodiment of the invention uses the "one-versus-one" method for 5-class classification, the training result contains 10 classifiers. The kernel function used in the SVM is the radial basis function kernel:

$$K(X, X_i) = \exp\left(-\gamma \lVert X - X_i \rVert^2\right)$$

where the parameter γ is set to 0.001, and X and X_i are input feature vectors.
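By way of illustration only, the radial basis function kernel with γ = 0.001 and the count of 10 pairwise classifiers for 5 classes can be checked with a short sketch. It does not train an SVM; the function and variable names are hypothetical.

```python
import math
from itertools import combinations

GAMMA = 0.001  # value given in the embodiment


def rbf_kernel(x, xi, gamma=GAMMA):
    """K(X, X_i) = exp(-gamma * ||X - X_i||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-gamma * sq_dist)


AGE_LABELS = [1, 2, 3, 4, 5]
# One-versus-one trains one binary classifier per unordered pair of classes.
class_pairs = list(combinations(AGE_LABELS, 2))

print(len(class_pairs))                   # 10
print(rbf_kernel([0.0] * 12, [0.0] * 12)) # 1.0 for identical vectors
```

With the LIBSVM command-line training tool, the radial basis function kernel and this γ would correspond to the options `-t 2 -g 0.001`; other settings, such as the penalty parameter, are not stated in the description.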

Step 5: as shown in Fig. 3, using the SVM model trained in step 4, predict on the speech feature parameters X of the voice to be identified, for example with LIBSVM's svmpredict. During prediction, the output of each support vector machine is passed through a logical decision and the class with the most votes is selected as the most probable age-group class, giving the final age-group recognition result.
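By way of illustration only, the voting described in step 5 can be sketched as follows. The pairwise classifiers here are toy stand-ins that compare a single number against made-up class centres; they are not trained SVMs, and the tie-breaking rule (smallest label wins) is an assumption not stated in the description.

```python
from collections import Counter
from itertools import combinations

AGE_LABELS = [1, 2, 3, 4, 5]


def one_vs_one_vote(x, pairwise_classifiers):
    """Return the label that wins the most pairwise decisions for input x."""
    votes = Counter(decide(x) for decide in pairwise_classifiers.values())
    top = max(votes.values())
    # Tie-break on the smallest label (assumption).
    return min(label for label, count in votes.items() if count == top)


# Hypothetical one-dimensional class centres, for demonstration only.
CENTRES = {1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0, 5: 4.0}


def make_toy_classifier(a, b):
    """Binary decision between labels a and b: pick the nearer centre."""
    def decide(x):
        return a if abs(x - CENTRES[a]) <= abs(x - CENTRES[b]) else b
    return decide


classifiers = {pair: make_toy_classifier(*pair)
               for pair in combinations(AGE_LABELS, 2)}

print(len(classifiers))                   # 10
print(one_vs_one_vote(2.1, classifiers))  # 3
```

In the toy run, label 3 wins all four of its pairwise contests and therefore collects the most votes.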

Before predicting on the feature parameters X of the voice to be identified, step 5 also normalizes the speech feature parameters to be identified, i.e. scales them by the same ratio used during training into the range [-1, +1].
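By way of illustration only, applying the same scaling at training and prediction time can be sketched as follows. The description says only that the vectors are scaled proportionally; the sketch assumes per-dimension linear (min-max) scaling whose limits are taken from the training data, and the function names are hypothetical.

```python
def fit_scaler(train_vectors):
    """Record the per-dimension minimum and maximum of the training vectors."""
    dims = range(len(train_vectors[0]))
    lo = [min(v[d] for v in train_vectors) for d in dims]
    hi = [max(v[d] for v in train_vectors) for d in dims]
    return lo, hi


def apply_scaler(vector, scaler):
    """Map each dimension linearly so the training range becomes [-1, +1]."""
    lo, hi = scaler
    return [0.0 if h == l else -1.0 + 2.0 * (x - l) / (h - l)
            for x, l, h in zip(vector, lo, hi)]


train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
scaler = fit_scaler(train)                   # limits come from training data only
print(apply_scaler([5.0, 20.0], scaler))     # [0.0, 0.0]
print(apply_scaler([20.0, 30.0], scaler))    # [3.0, 1.0]
```

Because the limits are fixed at training time, a vector to be identified can land slightly outside [-1, +1], as in the last line; the point is that it is scaled with the training ratio rather than with limits recomputed from the test data.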

The specific embodiments described above further explain the objects, technical solutions and beneficial effects of the present invention. It should be understood that the above are only specific embodiments of the invention and do not limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A speaker age-group recognition method based on a support vector machine (SVM), characterized in that the method comprises the following steps:

Step 2: preprocess the voice signals in the voice library;

2. The method according to claim 1, characterized in that the voice signals are in units of phrases.

3. The method according to claim 1, characterized in that, in step 2, the preprocessing further comprises the following steps:

Step 21: sample and quantize the voice signal;

Step 22: apply pre-emphasis to the quantized voice signal;

Step 23: use an endpoint detection method based on energy and zero-crossing rate to remove the silent segments from each voice signal.

4. The method according to claim 3, characterized in that the pre-emphasis is expressed as:

$$H(z) = 1 - 0.9375\,z^{-1}$$

where H(z) is the z-domain transfer function of the pre-emphasis filter applied to the voice signal, the output of the filter being the pre-emphasized voice signal.

5. The method according to claim 3, characterized in that detecting silent segments with the endpoint detection method comprises the following steps:

Step 231: divide the voice signal into short-time frames to obtain a plurality of speech frames;

Step 233: set a relatively high decision threshold E1 according to the average energy of all speech frames, and compare the short-time energy of each speech frame with E1 to obtain preliminary speech endpoints;

6. The method according to claim 5, characterized in that the frame length is 20 ms and the voice signal sampling rate is 16 kHz, i.e. 320 samples per frame.

7. The method according to claim 1, characterized in that the speech feature parameters are Mel-frequency cepstral coefficients (MFCC).

8. The method according to claim 7, characterized in that extracting the speech feature parameters comprises the following steps:

Step 33: take the logarithm of the outputs of all filters;

Step 34: apply a discrete cosine transform to the result of step 33 to obtain the MFCC.

9. The method according to claim 1, characterized in that the support vector machine training further comprises:

Step 41: take the extracted speech feature parameters of each age group as feature vectors;

Step 42: add class labels to the speech feature parameters of each age group;

Step 44: train on the normalized feature vectors of each age group to obtain a set of support vector machines.

10. The method according to claim 1, characterized in that step 5 further comprises, before predicting on the feature parameters X of the voice to be identified, a step of normalizing the speech feature parameters to be identified and scaling them into the range [-1, +1].