CN110136726A - Voice gender estimation method, device, system, and storage medium

Voice gender estimation method, device, system, and storage medium

Info

Publication number
CN110136726A
CN110136726A (application CN201910539105.7A)
Authority
CN
China
Prior art keywords
voice
identified
voice data
gender
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910539105.7A
Other languages
Chinese (zh)
Inventor
姚灿荣
尤俊生
高志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meiya Pico Information Co Ltd
Original Assignee
Xiamen Meiya Pico Information Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meiya Pico Information Co Ltd
Priority to CN201910539105.7A
Publication of CN110136726A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention provides a voice gender estimation method, device, system, and storage medium. The method comprises: obtaining voice data to be identified; performing feature extraction on the voice data to be identified to obtain voice features of the voice data to be identified; and inputting the voice features into a trained voice estimation model to obtain a gender estimation result for the voice data to be identified. With the method, device, system, and storage medium of the present invention, feature extraction is performed on the voice data and gender is then estimated with the trained voice gender estimation model, so that fast and accurate voice gender estimation is achieved in complex acoustic environments and across different languages, improving the user experience.

Description

Voice gender estimation method, device, system, and storage medium
Technical field
The present invention relates to the field of speech processing, and more specifically to voice gender estimation.
Background art
With the development of information technology and growing demands for social security, applications such as automatic identity authentication and personal-information profiling have created an urgent need for biometric recognition. Biometric recognition has therefore become one of the research hotspots of the computer industry. Current mainstream biometric technologies include facial recognition, fingerprint recognition, voiceprint recognition, gender recognition, age estimation, ethnicity recognition, expression recognition, gait recognition, and trajectory recognition. The main carriers of biometric information include the face, iris, fingerprint, voice, and gait. An individual's biometric traits are generally unique, so an individual's identity can be recognized by distinguishing one or more pieces of their biometric information. Meanwhile, biometric information of individuals within the same group often shows strong similarity and correlation, for example with respect to age, gender, and ethnicity.
However, with the diversification of social interaction, in many scenarios biometric images such as portraits and iris information cannot be collected, and only other information such as voice is available. Research on voice propagation, voice attributes, and feature analysis is therefore receiving increasing attention. Faced with noise from different scenes and environments, and with differences in age, language, and even mood, speaker recognition becomes much more complex. Current voice gender estimation methods are mainly based on temporal sequences; the key of such methods is to construct a recurrent neural network model, such as an RNN or LSTM, which has difficulty producing accurate estimates when the background is complex.
Therefore, in the prior art, voice gender estimation is affected by background noise and different language environments, resulting in low gender recognition accuracy and slow speed, which degrades the user experience.
Summary of the invention
The present invention is proposed in view of the above problems. The present invention provides a voice gender estimation method, device, system, and computer storage medium in which, after feature extraction is performed on the voice data, gender is estimated with a trained voice gender estimation model, achieving fast and accurate voice gender estimation in complex acoustic environments and across different languages.
According to a first aspect of the present invention, a voice gender estimation method is provided, comprising:
obtaining voice data to be identified;
performing feature extraction on the voice data to be identified to obtain voice features of the voice data to be identified;
inputting the voice features into a trained voice estimation model to obtain a gender estimation result for the voice data to be identified.
Optionally, obtaining the voice data to be identified further comprises: aligning and/or pre-emphasizing the voice data to be identified.
Optionally, performing feature extraction on the voice data to be identified to obtain the voice features of the voice data to be identified comprises:
framing the voice data to be identified, and applying a Hamming window to each frame after framing;
applying a Fourier transform, fast Fourier transform, or short-time Fourier transform to each windowed frame to obtain vector features;
converting the amplitude spectrum of the vector features into a power spectrum;
applying Mel filtering to the power spectrum to obtain Mel cepstrum features as the voice features of the voice data to be identified.
Optionally, the method further comprises:
performing feature extraction on labeled voice training data to obtain training voice features;
training a neural network based on the training voice features and the corresponding labels to obtain the trained voice estimation model.
Optionally, inputting the voice features into the trained voice estimation model to obtain the gender estimation result for the voice data to be identified comprises:
inputting the voice features into the trained voice estimation model to obtain label probabilities for the voice features;
taking the label with the highest probability among the label probabilities as the gender estimation result.
Optionally, the trained voice estimation model comprises a convolutional neural network.
Optionally, the gender estimation result is male, female, or no voice.
According to a second aspect of the present invention, a voice gender estimation device is provided, comprising:
a data acquisition module for obtaining voice data to be identified;
a feature extraction module for performing feature extraction on the voice data to be identified to obtain voice features of the voice data to be identified;
an identification module for inputting the voice features into a trained voice estimation model to obtain a gender estimation result for the voice data to be identified.
According to a third aspect of the present invention, a voice gender estimation system is provided, comprising a memory, a processor, and a computer program stored in the memory and run on the processor, wherein the processor, when executing the computer program, implements the steps of the method of the first aspect.
According to a fourth aspect of the present invention, a computer storage medium is provided, on which a computer program is stored, wherein the steps of the method of the first aspect are implemented when the computer program is executed by a computer.
With the voice gender estimation method, device, system, and computer storage medium according to the embodiments of the present invention, after feature extraction is performed on the voice data, gender is estimated with the trained voice gender estimation model, achieving fast and accurate voice gender estimation in complex acoustic environments and across different languages and improving the user experience.
Brief description of the drawings
The above and other objects, features, and advantages of the present invention will become more apparent from the following detailed description of embodiments of the present invention taken in conjunction with the accompanying drawings. The drawings are provided for a further understanding of the embodiments of the present invention, constitute a part of the specification, serve together with the embodiments to explain the present invention, and do not limit the present invention. In the drawings, identical reference numbers generally denote identical components or steps.
Fig. 1 is a schematic flowchart of a voice gender estimation method according to an embodiment of the present invention;
Fig. 2 is an example of a voice gender estimation method according to an embodiment of the present invention;
Fig. 3 is a schematic block diagram of a voice gender estimation device according to an embodiment of the present invention;
Fig. 4 is a schematic block diagram of a voice gender estimation system according to an embodiment of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention more apparent, example embodiments of the present invention are described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them, and it should be understood that the present invention is not limited by the example embodiments described herein. All other embodiments obtained by those skilled in the art based on the embodiments described herein without creative effort shall fall within the protection scope of the present invention.
Voice gender estimation extracts voiceprint features from a speaker's voice and analyzes them with computer deep learning techniques to judge the speaker's gender. Accurate gender prediction from a speaker's voice allows more related attributes and personal information to be extracted, can be applied in many scenarios and on many kinds of terminals, and is suitable for applications requiring automated biometric analysis in human-computer interaction environments, such as user profiling. It is of great significance in security, human-computer interaction, and business services.
Referring to Fig. 1, Fig. 1 shows a voice gender estimation method 100 according to an embodiment of the present invention. As shown in Fig. 1, the voice gender estimation method 100 comprises:
Step S110: obtaining voice data to be identified;
Step S120: performing feature extraction on the voice data to be identified to obtain voice features of the voice data to be identified;
Step S130: inputting the voice features into a trained voice estimation model to obtain a gender estimation result for the voice data to be identified.
Here, voice features conform to, or approximate, the auditory perception properties of the human ear and convert the voice signal in the voice data into a form a computer can process: feature extraction turns the waveform into a multi-dimensional vector containing the acoustic information. Performing feature extraction on the voice data to be identified separates the voice signal from background or environmental signals, preventing them from affecting the subsequent gender estimation and improving the accuracy of voice gender estimation. The voice gender estimation model is obtained by training a neural network on a sufficient amount of gender-labeled training data and can then estimate gender quickly and accurately from the voice features of the voice data to be identified. Because the voice gender estimation model is trained on multiple types and a sufficient amount of training data, it generalizes well; it does not characterize any specific identity, but instead assigns the voice features a probability over the gender distribution, from which the corresponding gender estimation result is obtained.
Optionally, the voice gender estimation method according to the embodiment of the present invention may be implemented in a unit or system having a memory and a processor.
The voice gender estimation method according to an embodiment of the present invention may be deployed on a personal terminal, or deployed in a distributed manner across a server (or cloud) and a personal terminal. For example, when the voice gender estimation method is deployed on a personal terminal, the personal terminal obtains the voice data to be identified and performs the voice gender estimation locally to obtain the gender estimation result for the voice data to be identified. When the voice gender estimation method is deployed across a server (or cloud) and a personal terminal, the personal terminal obtains the voice data to be identified, the server (or cloud) performs the voice gender estimation, and the gender estimation result for the voice data to be identified is then sent to the personal terminal.
According to an embodiment of the present invention, in step S110 the voice data to be identified may be acquired directly or obtained from another data source; the voice data may be a real-time signal or a non-real-time signal, without limitation here.
In one example, obtaining the voice data to be identified comprises picking it up directly with a microphone.
In one example, obtaining the voice data to be identified comprises obtaining it from another data source. For example, the voice data to be identified is collected by another voice acquisition device and then obtained from that device, or the voice data to be identified is obtained from the cloud.
According to an embodiment of the present invention, step S110 may further comprise: after the voice data to be identified is obtained, preprocessing the voice data to be identified.
Optionally, preprocessing the voice data to be identified comprises aligning and/or pre-emphasizing the voice data to be identified.
In one example, aligning the voice data to be identified comprises at least one of the following: converting the voice data to be identified into a unified encoding format, converting the voice data to be identified to the same sample rate and/or number of channels, cutting the voice data to be identified to the same length, and normalizing the voice data to be identified.
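As an illustrative sketch only (librosa and the 10-second clip length are assumptions of this example; only the 48000 Hz sample rate appears later in the description), these alignment operations might look like:

    import librosa
    import numpy as np

    def align(path, sr=48000, seconds=10.0):
        # Load with a unified sample rate and mix down to one channel.
        x, _ = librosa.load(path, sr=sr, mono=True)
        # Cut or zero-pad to the same length.
        n = int(sr * seconds)
        x = x[:n] if len(x) >= n else np.pad(x, (0, n - len(x)))
        # Peak-normalize.
        peak = np.max(np.abs(x))
        return x / peak if peak > 0 else x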
Pre-emphasizing the voice data to be identified compensates the high-frequency components of the voice signal that are suppressed by the vocal system, and highlights the high-frequency formants.
In one example, pre-emphasizing the voice data to be identified comprises: passing the voice data s(n) through a high-pass filter H(z) = 1 - a*z^(-1), where the pre-emphasis coefficient a is in the range 0.9 < a < 1.0. If the speech sample value at time n is x(n), the pre-emphasized result is y(n) = x(n) - a*x(n-1), where n is a natural number.
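A minimal sketch of this filter (the value a = 0.97 is one common choice inside the stated range, not a value fixed by the disclosure):

    import numpy as np

    def pre_emphasize(x, a=0.97):
        # y(n) = x(n) - a * x(n - 1); the first sample is passed through unchanged.
        return np.append(x[0], x[1:] - a * x[:-1])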
According to an embodiment of the present invention, step S120 may further comprise:
framing the voice data to be identified, and applying a Hamming window to each frame after framing;
applying a Fourier transform, fast Fourier transform, or short-time Fourier transform to each windowed frame to obtain vector features;
converting the amplitude spectrum of the vector features into a power spectrum;
applying Mel filtering to the power spectrum to obtain Mel cepstrum features as the voice features of the voice data to be identified.
Here, after pre-emphasis digital filtering is applied to the voice data to be identified, windowing and framing can be performed. Because the voice signal in the voice data is short-term stationary, the signal can be considered approximately unchanged within 10-30 ms, so the voice signal can be divided into short segments for processing, i.e., framing. For example, framing can be implemented by weighting the signal with a movable window of finite length; a typical rate is about 33-100 frames per second. Alternatively, overlapping segmentation can be used, where the overlap between consecutive frames is called the frame shift, and the ratio of frame shift to frame length is generally 0-0.5.
In one example, framing the voice data to be identified comprises framing it with a frame length of 20 ms and a step of 10 ms.
To make the signal more continuous overall and avoid the Gibbs effect, a Hamming window can be applied to the framed voice data: each frame is multiplied by a Hamming window that attenuates both ends of the frame toward 0. After windowing, the originally aperiodic voice signal exhibits some properties of a periodic function, which facilitates the Fourier expansion in subsequent feature extraction.
In one example, applying a Hamming window to each frame after framing comprises: let each frame of the voice data to be identified be S(n), n = 0 ... N-1, where N is the frame size; after multiplication by the Hamming window the frame is S'(n) = S(n) * W(n), where W(n, b) = (1-b) - b*cos(2*pi*n/(N-1)), 0 <= n <= N-1, and b is a coefficient. It will be appreciated that different values of b produce different Hamming windows; b = 0.46 is commonly used.
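A sketch of these two steps under the stated parameters (20 ms frames, 10 ms step, Hamming window with b = 0.46); the implementation itself is illustrative, not taken from the disclosure:

    import numpy as np

    def frame_and_window(x, sr, frame_ms=20, hop_ms=10, b=0.46):
        frame_len = int(sr * frame_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        # Hamming window: W(n) = (1 - b) - b * cos(2*pi*n / (N - 1)).
        n = np.arange(frame_len)
        window = (1 - b) - b * np.cos(2 * np.pi * n / (frame_len - 1))
        # Overlapping segmentation; a 10 ms step gives a frame-shift ratio of 0.5.
        count = 1 + max(0, (len(x) - frame_len) // hop)
        frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(count)])
        return frames * window  # shape: (count, frame_len)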
Because it is generally difficult to see the characteristics of a voice signal from its variation in the time domain, it is usually transformed into an energy distribution in the frequency domain for observation; different energy distributions represent the characteristics of different voices. Therefore, after multiplication by the Hamming window, each frame must further undergo a Fourier transform (FT), fast Fourier transform (FFT), or short-time Fourier transform (STFT) to obtain the energy distribution over the spectrum.
In one example, converting the amplitude spectrum of the vector features into a power spectrum comprises: taking the squared modulus of the amplitude spectrum of the vector features to obtain the power spectrum.
In one example, applying Mel filtering to the power spectrum to obtain Mel cepstrum features as the voice features of the voice data to be identified comprises:
multiplying the power spectrum by a bank of triangular filters to obtain the logarithmic energy output by each filter;
applying a discrete cosine transform to the logarithmic energies to obtain L-order Mel cepstrum features as the voice features of the voice data to be identified.
Here, the triangular filters smooth the spectrum, eliminate harmonics, and highlight the formants of the original voice; they also reduce the amount of computation and speed up feature extraction, thereby increasing the speed of the entire voice gender estimation method.
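The transform, power-spectrum, and Mel-filtering steps can be sketched as follows; the FFT size, the 26-filter Mel bank, and the use of librosa's filter-bank helper are assumptions of this example, while L = 13 matches the 13-dimensional features mentioned later in the description:

    import numpy as np
    import librosa
    from scipy.fftpack import dct

    def mfcc_from_frames(frames, sr, n_fft=1024, n_mels=26, L=13):
        # FFT of each windowed frame, then squared modulus -> power spectrum.
        amplitude = np.abs(np.fft.rfft(frames, n=n_fft))
        power = amplitude ** 2
        # Bank of triangular Mel filters; log energy of each filter's output.
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        log_energy = np.log(power @ mel_fb.T + 1e-10)
        # Discrete cosine transform, keeping the first L coefficients.
        return dct(log_energy, type=2, axis=1, norm='ortho')[:, :L]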
Optionally, feature extraction on the voice data to be identified comprises feature extraction using linear prediction analysis, perceptual linear prediction coefficients, Tandem and Bottleneck features, filter-bank (Fbank) features, linear prediction cepstrum coefficients, or Mel-frequency cepstrum coefficients.
Optionally, the voice features comprise one of the following: Mel cepstrum coefficients (MFCC), perceptual linear prediction coefficients (PLP), deep features (Deep Feature), and power-normalized cepstral coefficients (PNCC).
Optionally, the method 100 may further comprise:
performing feature extraction on labeled voice training data to obtain training voice features;
training a neural network based on the training voice features and the corresponding labels to obtain the trained voice estimation model.
Here, a voice training database can be established to store training data, with which the neural network is trained to obtain the trained voice estimation model. The voice training database may contain a sufficient amount of data, for example 300,000 speech samples, each with a corresponding gender label: male, female, or no voice (non-speech sound). The samples may be collected from videos, calls, and recordings, covering a wide range of environments including films, news, speeches, and conversations, and involving multiple languages; the diversity of the voice training data plays an important role in improving the generalization and robustness of the model.
In one example, the trained voice estimation model may include 7 one-dimensional convolutional layers, 7 BatchNorm layers, 1 pooling layer, 6 ReLU layers, and 3 fully connected layers. Specifically: (1) the first conv layer has kernel size 3, 1024 kernels, and stride 1, followed by a BatchNorm layer, a ReLU layer, and a one-dimensional MaxPool layer with kernel size 3 and stride 2, output 32*500; (2) the second layer has kernel size 5, 32 kernels, and stride 1, followed by a BatchNorm layer, a ReLU layer, and a one-dimensional MaxPool layer with kernel size 3 and stride 2, output 32*248; (3) the third conv layer has kernel size 3, 64 kernels, and stride 1, followed by a BatchNorm layer, a ReLU layer, and a one-dimensional MaxPool layer with kernel size 3 and stride 2, output 64*122; (4) the fourth layer has kernel size 3, 128 kernels, and stride 1, followed by a BatchNorm layer, a ReLU layer, and a one-dimensional MaxPool layer with kernel size 3 and stride 2, output 128*60; (5) the fifth, sixth, and seventh conv layers each have kernel size 3, 128 kernels, and stride 1, each followed by a BatchNorm layer, a ReLU layer, and a one-dimensional MaxPool layer with kernel size 3 and stride 2; the seventh layer outputs 128*5; (6) the eighth conv layer has kernel size 3, 256 kernels, and stride 1, followed by a BatchNorm layer, a ReLU layer, and a one-dimensional MaxPool layer with kernel size 5, output 256*1; (7) then come 2 fully connected layers, the first with input dimension 256 and output dimension 64, the second with input dimension 64 and output dimension 3; (8) finally a SoftMax layer.
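As an aid to reading the enumeration above, the following PyTorch sketch reconstructs a network of the same general shape (conv, BatchNorm, ReLU, and one-dimensional MaxPool stages followed by fully connected layers, with SoftMax applied at inference). The layer counts in the summary sentence and in the enumerated stages do not fully agree, and the channel widths below follow the stated stage outputs rather than the stated kernel counts, so this is an illustrative reconstruction under those assumptions, not the disclosure's exact network:

    import torch
    import torch.nn as nn

    def stage(in_ch, out_ch, k=3, pool_k=3, pool_s=2):
        # One stage: 1-D conv -> BatchNorm -> ReLU -> 1-D MaxPool.
        return nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=k, stride=1, padding=k // 2),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool1d(kernel_size=pool_k, stride=pool_s),
        )

    class VoiceGenderNet(nn.Module):
        def __init__(self, n_mfcc=13, n_classes=3):
            super().__init__()
            self.features = nn.Sequential(
                stage(n_mfcc, 32),
                stage(32, 32, k=5),
                stage(32, 64),
                stage(64, 128),
                stage(128, 128),
                stage(128, 128),
                stage(128, 128),
                stage(128, 256, pool_k=5, pool_s=5),  # collapses to length 1
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(256, 64),
                nn.ReLU(inplace=True),
                nn.Linear(64, n_classes),  # logits; SoftMax applied at inference
            )

        def forward(self, x):  # x: (batch, n_mfcc, frames), e.g. a 13 x 1005 MFCC
            return self.classifier(self.features(x))

With a 13 x 1005 input, the pooled length shrinks from roughly 500 after the first stage to 1 after the eighth, matching the progression of the stated stage outputs.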
According to an embodiment of the present invention, step S130 may further comprise:
inputting the voice features into the trained voice estimation model to obtain label probabilities for the voice features;
taking the label with the highest probability among the label probabilities as the gender estimation result.
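A minimal sketch of this decision step, assuming the three-label output and a label ordering that the disclosure does not fix:

    import torch
    import torch.nn.functional as F

    LABELS = ["male", "female", "no voice"]  # order assumed for illustration

    def estimate_gender(model, mfcc):
        # mfcc: tensor of shape (n_mfcc, frames); returns (label, probability).
        model.eval()
        with torch.no_grad():
            logits = model(mfcc.unsqueeze(0))             # add batch dimension
            probs = F.softmax(logits, dim=-1).squeeze(0)  # label probabilities
        idx = int(torch.argmax(probs))
        return LABELS[idx], float(probs[idx])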
Optionally, the trained voice estimation model comprises a convolutional neural network.
Optionally, the gender estimation result is male, female, or no voice.
According to an embodiment of the present invention, the method 100 may further comprise:
displaying the gender estimation result.
In one embodiment, referring to Fig. 2, Fig. 2 shows an example of the voice gender estimation method of the embodiment of the present invention. As shown in Fig. 2, the voice gender estimation method 200 comprises:
First, a voice training database is established. The voice training database contains a diverse and sufficient amount of training data, each item with a corresponding gender label, which can be male, female, or no voice. Specifically, this may comprise collecting and cleaning the training data and labeling it according to the gender labels.
Then, the training data in the voice training database is preprocessed, which may specifically comprise at least one of the following: converting the training data into a unified encoding format, converting the training data to the same sample rate (e.g., 48000) and/or number of channels, cutting the training data to the same length, and normalizing the training data.
Pre-emphasis and feature extraction are then applied to the training data in the voice training database, which may specifically comprise:
passing the training data through a high-pass filter to obtain pre-emphasized training data;
framing the pre-emphasized training data, and applying a Hamming window to each frame after framing;
applying a Fourier transform, fast Fourier transform, or short-time Fourier transform to each windowed frame to obtain the vector features of the training data;
converting the amplitude spectrum of the vector features of the training data into a power spectrum;
applying Mel filtering to the power spectrum of the training data to obtain Mel cepstrum features (e.g., of size 1005*13) as the voice features of the training data;
Then, a neural network is trained with the voice features of the training data and the corresponding gender labels to obtain the trained voice gender estimation model;
Then, after the voice data to be identified is obtained, it is preprocessed, which may specifically comprise at least one of the following: converting the voice data to be identified into a unified encoding format, converting it to the same sample rate (e.g., 48000) and/or number of channels, cutting it to the same length, and normalizing it;
Then, feature extraction is performed on the voice data to be identified to obtain its Mel cepstrum features, which may specifically comprise:
passing the voice data to be identified through a high-pass filter to obtain the pre-emphasized voice data to be identified;
framing the voice data to be identified, and applying a Hamming window to each frame after framing;
applying a Fourier transform, fast Fourier transform, or short-time Fourier transform to each windowed frame to obtain vector features;
converting the amplitude spectrum of the vector features into a power spectrum;
applying Mel filtering to the power spectrum to obtain Mel cepstrum features (of size 1005*13) as the voice features of the voice data to be identified;
The voice features of the voice data to be identified are input into the trained voice estimation model to obtain label probabilities for the voice features;
the label with the highest probability among the label probabilities is taken as the gender estimation result, which is male, female, or no voice.
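For orientation, the sketches given earlier in this description can be chained end to end; every function and model name below comes from those illustrative examples, not from the disclosure itself:

    import torch

    x = align("sample.wav")                      # unify rate/channels/length
    x = pre_emphasize(x)                         # high-pass pre-emphasis
    frames = frame_and_window(x, sr=48000)       # 20 ms frames + Hamming window
    feats = mfcc_from_frames(frames, sr=48000)   # (frames, 13) Mel cepstrum
    mfcc = torch.tensor(feats.T, dtype=torch.float32)  # model expects (13, frames)
    model = VoiceGenderNet()                     # untrained here; illustration only
    label, prob = estimate_gender(model, mfcc)
    print(label, prob)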
It can be seen that, with the voice gender estimation method according to the embodiment of the present invention, after feature extraction is performed on the voice data, gender is estimated with the trained voice gender estimation model, so that fast and accurate voice gender estimation is achieved in complex acoustic environments and across different languages, improving the user experience.
Fig. 3 shows a schematic block diagram of a voice gender estimation device 300 according to an embodiment of the present invention. As shown in Fig. 3, the voice gender estimation device 300 according to an embodiment of the present invention comprises:
a data acquisition module 310 for obtaining voice data to be identified;
a feature extraction module 320 for performing feature extraction on the voice data to be identified to obtain voice features of the voice data to be identified;
an identification module 330 for inputting the voice features into a trained voice estimation model to obtain a gender estimation result for the voice data to be identified.
According to an embodiment of the present invention, the data acquisition module 310 may acquire the voice data directly or obtain it from another data source; the voice data may be a real-time signal or a non-real-time signal, without limitation here.
In one example, the data acquisition module 310 obtains the voice data to be identified by picking it up directly with a microphone.
In one example, the data acquisition module 310 obtains the voice data to be identified from another data source. For example, the voice data to be identified is collected by another voice acquisition device and then obtained from that device, or the voice data to be identified is obtained from the cloud.
According to an embodiment of the present invention, the device 300 may further comprise:
a preprocessing module 340 for preprocessing the voice data to be identified after it is obtained.
Optionally, the preprocessing performed by the preprocessing module 340 comprises aligning and/or pre-emphasizing the voice data to be identified.
In one example, the preprocessing module 340 aligns the voice data to be identified by at least one of the following: converting the voice data to be identified into a unified encoding format, converting it to the same sample rate and/or number of channels, cutting it to the same length, and normalizing it.
Pre-emphasizing the voice data to be identified compensates the high-frequency components of the voice signal that are suppressed by the vocal system, and highlights the high-frequency formants.
In one example, the preprocessing module 340 pre-emphasizes the voice data to be identified by passing the voice data s(n) through a high-pass filter H(z) = 1 - a*z^(-1), where the pre-emphasis coefficient a is in the range 0.9 < a < 1.0; if the speech sample value at time n is x(n), the pre-emphasized result is y(n) = x(n) - a*x(n-1), where n is a natural number.
According to an embodiment of the present invention, the feature extraction module 320 may further comprise:
a framing module 321 for framing the voice data to be identified and applying a Hamming window to each frame after framing;
a Fourier transform module 322 for applying a Fourier transform, fast Fourier transform, or short-time Fourier transform to each windowed frame to obtain vector features;
a power module 323 for converting the amplitude spectrum of the vector features into a power spectrum;
a voice feature module 324 for applying Mel filtering to the power spectrum to obtain Mel cepstrum features as the voice features of the voice data to be identified.
Here, after pre-emphasis digital filtering is applied to the voice data to be identified, windowing and framing can be performed. Because the voice signal in the voice data is short-term stationary, the signal can be considered approximately unchanged within 10-30 ms, so the voice signal can be divided into short segments for processing, i.e., framing. For example, framing can be implemented by weighting the signal with a movable window of finite length; a typical rate is about 33-100 frames per second. Alternatively, overlapping segmentation can be used, where the overlap between consecutive frames is called the frame shift, and the ratio of frame shift to frame length is generally 0-0.5.
In one example, the framing module 321 frames the voice data to be identified with a frame length of 20 ms and a step of 10 ms.
To make the signal more continuous overall and avoid the Gibbs effect, a Hamming window can be applied to the framed voice data: each frame is multiplied by a Hamming window that attenuates both ends of the frame toward 0. After windowing, the originally aperiodic voice signal exhibits some properties of a periodic function, which facilitates the Fourier expansion in subsequent feature extraction.
In one example, the framing module 321 applies a Hamming window to each frame after framing as follows: let each frame of the voice data to be identified be S(n), n = 0 ... N-1, where N is the frame size; after multiplication by the Hamming window the frame is S'(n) = S(n) * W(n), where W(n, b) = (1-b) - b*cos(2*pi*n/(N-1)), 0 <= n <= N-1, and b is a coefficient. It will be appreciated that different values of b produce different Hamming windows; b = 0.46 is commonly used.
Because it is generally difficult to see the characteristics of a voice signal from its variation in the time domain, it is usually transformed into an energy distribution in the frequency domain for observation; different energy distributions represent the characteristics of different voices. Therefore, after multiplication by the Hamming window, each frame must further undergo a Fourier transform (FT), fast Fourier transform (FFT), or short-time Fourier transform (STFT) to obtain the energy distribution over the spectrum.
In one example, the power module 323 converts the amplitude spectrum of the vector features into a power spectrum by taking the squared modulus of the amplitude spectrum of the vector features.
In one example, the voice feature module 324 applies Mel filtering to the power spectrum to obtain Mel cepstrum features as the voice features of the voice data to be identified by:
multiplying the power spectrum by a bank of triangular filters to obtain the logarithmic energy output by each filter;
applying a discrete cosine transform to the logarithmic energies to obtain L-order Mel cepstrum features as the voice features of the voice data to be identified.
Here, the triangular filters smooth the spectrum, eliminate harmonics, and highlight the formants of the original voice; they also reduce the amount of computation and speed up feature extraction, thereby increasing the speed of the entire voice gender estimation method.
Optionally, feature extraction on the voice data to be identified comprises feature extraction using linear prediction analysis, perceptual linear prediction coefficients, Tandem and Bottleneck features, filter-bank (Fbank) features, linear prediction cepstrum coefficients, or Mel-frequency cepstrum coefficients.
Optionally, the voice features comprise one of the following: Mel cepstrum coefficients (MFCC), perceptual linear prediction coefficients (PLP), deep features (Deep Feature), and power-normalized cepstral coefficients (PNCC).
Optionally, the device 300 may further comprise:
a model module 350 for performing feature extraction on labeled voice training data to obtain training voice features, and for training a neural network based on the training voice features and the corresponding labels to obtain the trained voice estimation model.
Here, the model module 350 can establish a voice training database for storing training data, with which the neural network is trained to obtain the trained voice estimation model. The voice training database may contain a sufficient amount of data, for example 300,000 speech samples, each with a corresponding gender label: male, female, or no voice (non-speech sound). The samples may be collected from videos, calls, and recordings, covering a wide range of environments including films, news, speeches, and conversations, and involving multiple languages; the diversity of the voice training data plays an important role in improving the generalization and robustness of the model.
In one example, the trained voice estimation model may be the network described above for the method embodiment: 7 one-dimensional convolutional layers, 7 BatchNorm layers, 1 pooling layer, 6 ReLU layers, and 3 fully connected layers, with the same stage-by-stage configuration of conv, BatchNorm, ReLU, and one-dimensional MaxPool layers, the two fully connected layers (256 to 64, 64 to 3), and the final SoftMax layer.
According to an embodiment of the present invention, the identification module 330 may further comprise:
a probability estimation module 331 for inputting the voice features into the trained voice estimation model to obtain label probabilities for the voice features;
a target module 332 for taking the label with the highest probability among the label probabilities as the gender estimation result.
Optionally, the trained voice estimation model comprises a convolutional neural network.
Optionally, the gender estimation result is male, female, or no voice.
According to an embodiment of the present invention, the device 300 may further comprise:
a display module 360 for displaying the gender estimation result.
Fig. 4 shows a schematic block diagram of a voice gender estimation system 400 according to an embodiment of the present invention. The voice gender estimation system 400 comprises a storage device 410 and a processor 420.
The storage device 410 stores program code for implementing the corresponding steps of the voice gender estimation method according to the embodiment of the present invention.
The processor 420 runs the program code stored in the storage device 410 to execute the corresponding steps of the voice gender estimation method according to the embodiment of the present invention, and to implement the data acquisition module 310, feature extraction module 320, and identification module 330 of the voice gender estimation device according to the embodiment of the present invention.
In addition, according to an embodiment of the present invention, a storage medium is provided, on which program instructions are stored; when the program instructions are run by a computer or processor, they execute the corresponding steps of the voice gender estimation method of the embodiment of the present invention and implement the corresponding modules of the voice gender estimation device according to the embodiment of the present invention. The storage medium may comprise, for example, a memory card of a smartphone, a storage unit of a tablet computer, a hard disk of a personal computer, read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, or any combination of the above storage media. The computer-readable storage medium may be any combination of one or more computer-readable storage media; for example, one computer-readable storage medium contains computer-readable program code for randomly generating a sequence of action instructions, and another contains computer-readable program code for performing the voice gender estimation.
In one embodiment, the computer program instructions, when run by a computer, may implement the functional modules of the voice gender estimation device according to the embodiment of the present invention and/or may execute the voice gender estimation method according to the embodiment of the present invention.
The modules in the voice gender estimation system according to the embodiment of the present invention may be implemented by the processor of an electronic device for voice gender estimation according to the embodiment of the present invention running computer program instructions stored in a memory, or by computer instructions stored in the computer-readable storage medium of a computer program product according to the embodiment of the present invention when those instructions are run by a computer.
With the voice gender estimation method, device, system, and storage medium according to the embodiments of the present invention, after feature extraction is performed on the voice data, gender is estimated with the trained voice gender estimation model, achieving fast and accurate voice gender estimation in complex acoustic environments and across different languages and improving the user experience.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.
The above is only a description of specific embodiments of the present invention, and the protection scope of the present invention is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed by the present invention, and these should be covered by the protection scope of the present invention. The protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A voice gender estimation method, characterized in that the method comprises:
obtaining voice data to be identified;
performing feature extraction on the voice data to be identified to obtain voice features of the voice data to be identified;
inputting the voice features into a trained voice estimation model to obtain a gender estimation result for the voice data to be identified.
2. The method of claim 1, characterized in that obtaining the voice data to be identified further comprises: aligning and/or pre-emphasizing the voice data to be identified.
3. The method of claim 1, characterized in that performing feature extraction on the voice data to be identified to obtain the voice features of the voice data to be identified comprises:
framing the voice data to be identified, and applying a Hamming window to each frame after framing;
applying a Fourier transform, fast Fourier transform, or short-time Fourier transform to each windowed frame to obtain vector features;
converting the amplitude spectrum of the vector features into a power spectrum;
applying Mel filtering to the power spectrum to obtain Mel cepstrum features as the voice features of the voice data to be identified.
4. The method of claim 1, characterized in that the method further comprises:
performing feature extraction on labeled voice training data to obtain training voice features;
training a neural network based on the training voice features and the corresponding labels to obtain the trained voice estimation model.
5. The method of claim 1, characterized in that inputting the voice features into the trained voice estimation model to obtain the gender estimation result for the voice data to be identified comprises:
inputting the voice features into the trained voice estimation model to obtain label probabilities for the voice features;
taking the label with the highest probability among the label probabilities as the gender estimation result.
6. The method of claim 4, characterized in that the trained voice estimation model comprises a convolutional neural network.
7. The method of claim 1, characterized in that the gender estimation result is male, female, or no voice.
8. A voice gender estimation device, characterized in that the device comprises:
a data acquisition module for obtaining voice data to be identified;
a feature extraction module for performing feature extraction on the voice data to be identified to obtain voice features of the voice data to be identified;
an identification module for inputting the voice features into a trained voice estimation model to obtain a gender estimation result for the voice data to be identified.
9. A voice gender estimation system, comprising a memory, a processor, and a computer program stored on the memory and run on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.
10. A computer storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a computer, the steps of the method of any one of claims 1 to 7 are implemented.
CN201910539105.7A 2019-06-20 2019-06-20 Voice gender estimation method, device, system, and storage medium Pending CN110136726A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910539105.7A CN110136726A (en) 2019-06-20 2019-06-20 Voice gender estimation method, device, system, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910539105.7A CN110136726A (en) Voice gender estimation method, device, system, and storage medium

Publications (1)

Publication Number Publication Date
CN110136726A (en) 2019-08-16

Family

ID=67578869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910539105.7A Pending CN110136726A (en) Voice gender estimation method, device, system, and storage medium

Country Status (1)

Country Link
CN (1) CN110136726A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110931023A (en) * 2019-11-29 2020-03-27 厦门快商通科技股份有限公司 Gender identification method, system, mobile terminal and storage medium
CN111312286A (en) * 2020-02-12 2020-06-19 深圳壹账通智能科技有限公司 Age identification method, age identification device, age identification equipment and computer readable storage medium
CN112581942A (en) * 2020-12-29 2021-03-30 云从科技集团股份有限公司 Method, system, device and medium for recognizing target object based on voice
WO2021175031A1 (en) * 2020-03-03 2021-09-10 深圳壹账通智能科技有限公司 Information prompting method and apparatus, electronic device, and medium
CN114049881A (en) * 2021-11-23 2022-02-15 深圳依时货拉拉科技有限公司 Voice gender recognition method, device, storage medium and computer equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017113680A1 (en) * 2015-12-30 2017-07-06 百度在线网络技术(北京)有限公司 Method and device for voiceprint authentication processing
CN108962223A * 2018-06-25 2018-12-07 厦门快商通信息技术有限公司 Voice gender identification method, device, and medium based on deep learning
CN109545227A * 2018-04-28 2019-03-29 华中师范大学 Automatic speaker gender identification method and system based on deep autoencoder network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017113680A1 (en) * 2015-12-30 2017-07-06 百度在线网络技术(北京)有限公司 Method and device for voiceprint authentication processing
CN109545227A * 2018-04-28 2019-03-29 华中师范大学 Automatic speaker gender identification method and system based on deep autoencoder network
CN108962223A * 2018-06-25 2018-12-07 厦门快商通信息技术有限公司 Voice gender identification method, device, and medium based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黄珊 (Huang Shan): "Research on speaker gender feature recognition based on deep learning", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110931023A (en) * 2019-11-29 2020-03-27 厦门快商通科技股份有限公司 Gender identification method, system, mobile terminal and storage medium
CN110931023B (en) * 2019-11-29 2022-08-19 厦门快商通科技股份有限公司 Gender identification method, system, mobile terminal and storage medium
CN111312286A (en) * 2020-02-12 2020-06-19 深圳壹账通智能科技有限公司 Age identification method, age identification device, age identification equipment and computer readable storage medium
WO2021175031A1 (en) * 2020-03-03 2021-09-10 深圳壹账通智能科技有限公司 Information prompting method and apparatus, electronic device, and medium
CN112581942A (en) * 2020-12-29 2021-03-30 云从科技集团股份有限公司 Method, system, device and medium for recognizing target object based on voice
CN114049881A (en) * 2021-11-23 2022-02-15 深圳依时货拉拉科技有限公司 Voice gender recognition method, device, storage medium and computer equipment

Similar Documents

Publication Publication Date Title
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
CN107731233B (en) Voiceprint recognition method based on RNN
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
CN110909613A (en) Video character recognition method and device, storage medium and electronic equipment
CN106504768B Telephone test audio classification method and device based on artificial intelligence
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
CN110211565A (en) Accent recognition method, apparatus and computer readable storage medium
CN109658921B (en) Voice signal processing method, equipment and computer readable storage medium
CN110415701A Lip reading recognition method and device
CN112949708A (en) Emotion recognition method and device, computer equipment and storage medium
CN110570873A (en) voiceprint wake-up method and device, computer equipment and storage medium
CN110473552A (en) Speech recognition authentication method and system
CN114913859B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium
CN110782902A (en) Audio data determination method, apparatus, device and medium
CN112992155B Far-field voice speaker recognition method and device based on residual neural network
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device
CN109817223A (en) Phoneme marking method and device based on audio fingerprints
KR102220964B1 (en) Method and device for audio recognition
CN117935789A (en) Speech recognition method, system, equipment and storage medium
CN113823271B (en) Training method and device for voice classification model, computer equipment and storage medium
CN114333844A (en) Voiceprint recognition method, voiceprint recognition device, voiceprint recognition medium and voiceprint recognition equipment
CN114141271A (en) Psychological state detection method and system
Nguyen et al. Vietnamese speaker authentication using deep models
Alkhatib et al. ASR Features Extraction Using MFCC And LPC: A Comparative Study

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190816