CN110428842A - Speech model training method, device, equipment and computer readable storage medium - Google Patents

Speech model training method, device, equipment and computer readable storage medium

Info

Publication number
CN110428842A
Authority
CN
China
Prior art keywords
voiceprint
voice data
voiceprint feature
feature vector
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910744145.5A
Other languages
Chinese (zh)
Inventor
陈昊亮
罗伟航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou National Acoustic Intelligent Technology Co Ltd
Original Assignee
Guangzhou National Acoustic Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou National Acoustic Intelligent Technology Co Ltd
Priority to CN201910744145.5A
Publication of CN110428842A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L17/22 - Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a speech model training method, apparatus, device, and computer-readable storage medium. The method comprises: obtaining voice data of a target speaker, and dividing the voice data into multiple voice data segments; extracting voiceprint features from the multiple voice data segments respectively, to obtain multiple voiceprint feature vectors; sorting the multiple voiceprint feature vectors in a preset sorting manner, and selecting target voiceprint feature vectors based on the sorting result; and training a to-be-trained model based on the target voiceprint feature vectors, to obtain a speech recognition model of the target speaker. Even when the voice data of the target speaker is impure, the invention extracts from that impure voice data voiceprint feature vectors that accurately characterize the target speaker's voiceprint, and trains the target speaker's speech recognition model on those accurate vectors, thereby improving the accuracy of the model.

Description

Speech model training method, device, equipment and computer readable storage medium
Technical field
The present invention relates to the technical field of identity recognition, and more particularly to a speech model training method, apparatus, device, and computer-readable storage medium.
Background technique
Speaker recognition falls into two classes: speaker identification and speaker verification. The former determines which of several people uttered a given segment of speech (a one-from-many problem); the latter confirms whether a given segment of speech was uttered by a specified person (a one-to-one discrimination problem). Speaker recognition technology is widely applied in the military, national security, and criminal investigation fields, and in financial fields such as banking and securities.
Speaker recognition requires collecting a large amount of user voice data for training, so as to obtain a speech recognition model for each user; when speaker recognition is needed, identification is performed by pattern matching. At present, however, collection costs limit how the data can be gathered, so the collected user voice data is often non-standard and impure. For example, user voice data collected from call recordings is often mixed with noise and may contain other people's voices. Directly training a speech model on such voice data therefore yields a model with low recognition efficiency and poor accuracy.
Summary of the invention
The main purpose of the present invention is to provide a speech model training method, apparatus, device, and computer-readable storage medium, aimed at solving the technical problem that training a speech model on impure user voice data leads to a low recognition rate.
To achieve the above object, the present invention provides a speech model training method, comprising:
obtaining voice data of a target speaker, and dividing the voice data into multiple voice data segments;
extracting voiceprint features from the multiple voice data segments respectively, to obtain multiple voiceprint feature vectors;
sorting the multiple voiceprint feature vectors in a preset sorting manner, and selecting target voiceprint feature vectors based on the sorting result;
training a to-be-trained model based on the target voiceprint feature vectors, to obtain a speech recognition model of the target speaker.
Optionally, the step of obtaining voice data of a target speaker and dividing the voice data into multiple voice data segments comprises:
obtaining the voice data of the target speaker, and preprocessing the voice data;
segmenting the preprocessed voice data at a preset time interval, to obtain multiple voice data segments.
Optionally, the step of sorting the multiple voiceprint feature vectors in a preset sorting manner and selecting target voiceprint feature vectors based on the sorting result comprises:
computing the average vector of the multiple voiceprint feature vectors;
calculating the distance between each of the multiple voiceprint feature vectors and the average vector, and sorting the multiple voiceprint feature vectors in descending order of distance to the average vector;
selecting, based on the sorting result, the voiceprint feature vectors that fall within a preset range as the target voiceprint feature vectors.
Optionally, the to-be-trained model is a deep neural network model, and the step of training the to-be-trained model based on the target voiceprint feature vectors to obtain the speech recognition model of the target speaker comprises:
inputting the target voiceprint feature vectors into the initialized deep neural network model, and performing iterative training;
after the deep neural network model is detected to have converged, taking the converged deep neural network model as the speech recognition model of the target speaker.
Optionally, the step of extracting voiceprint features from the multiple voice data segments respectively to obtain multiple voiceprint feature vectors comprises:
performing adaptive processing on the multiple voice data segments respectively, based on a pre-trained universal background model, to obtain the multiple voiceprint feature vectors.
Optionally, before the step of performing adaptive processing on the multiple voice data segments respectively based on a pre-trained universal background model to obtain the multiple voiceprint feature vectors, the method further comprises:
preprocessing pre-collected training voice data;
extracting training speech features from the preprocessed training voice data;
performing universal background model training with the training speech features, to obtain the universal background model.
Optionally, the step of preprocessing the pre-collected training voice data comprises:
successively performing pre-emphasis, framing, windowing, and endpoint detection on the pre-collected training voice data.
In addition, to achieve the above object, the present invention further provides a speech model training apparatus, comprising:
a segmentation module, configured to obtain voice data of a target speaker and divide the voice data into multiple voice data segments;
an extraction module, configured to extract voiceprint features from the multiple voice data segments respectively, to obtain multiple voiceprint feature vectors;
a sorting module, configured to sort the multiple voiceprint feature vectors in a preset sorting manner and select target voiceprint feature vectors based on the sorting result;
a training module, configured to train a to-be-trained model based on the target voiceprint feature vectors, to obtain a speech recognition model of the target speaker.
In addition, to achieve the above object, the present invention further provides a speech model training device, comprising a memory, a processor, and a speech model training program stored on the memory and runnable on the processor, wherein the speech model training program, when executed by the processor, implements the steps of the speech model training method described above.
In addition, to achieve the above object, the present invention further provides a computer-readable storage medium on which a speech model training program is stored, wherein the speech model training program, when executed by a processor, implements the steps of the speech model training method described above.
In the present invention, voice data of a target speaker is obtained and divided into multiple voice data segments; voiceprint features are extracted from the multiple voice data segments respectively, yielding multiple voiceprint feature vectors; the multiple voiceprint feature vectors are sorted in a preset sorting manner, and target voiceprint feature vectors are selected based on the sorting result; and a to-be-trained model is trained on the target voiceprint feature vectors to obtain a speech recognition model of the target speaker. Thus, even when the voice data of the target speaker is impure, voiceprint feature vectors that accurately characterize the target speaker's voiceprint can be extracted from the impure voice data, and the target speaker's speech recognition model is trained on these accurate vectors, improving the accuracy of the model.
Detailed description of the invention
Fig. 1 is a schematic structural diagram of the hardware running environment involved in embodiments of the present invention;
Fig. 2 is a schematic flowchart of a first embodiment of the speech model training method of the present invention;
Fig. 3 is a schematic diagram of the functional modules of a preferred embodiment of the speech model training apparatus of the present invention.
The realization of the objects, the functional characteristics, and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Specific embodiment
It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
The present invention provides a speech model training device. Referring to Fig. 1, Fig. 1 is a schematic structural diagram of the hardware running environment involved in embodiments of the present invention.
It should be noted that Fig. 1 may serve as the structural diagram of the hardware running environment of the speech model training device. The speech model training device of the embodiments of the present invention may be a PC, or a terminal device with a display function such as a smartphone, a smart TV, a tablet computer, or a portable computer.
As shown in Fig. 1, the speech model training device may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 implements connection and communication among these components. The user interface 1003 may include a display and an input unit such as a keyboard, and optionally may further include standard wired and wireless interfaces. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory, or a non-volatile memory such as a disk memory; optionally, it may also be a storage device independent of the processor 1001.
Optionally, the speech model training device may further include a camera, an RF (Radio Frequency) circuit, sensors, an audio circuit, a Wi-Fi module, and the like. Those skilled in the art will understand that the device structure shown in Fig. 1 does not limit the speech model training device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in Fig. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a speech model training program.
In the speech model training device shown in Fig. 1, the network interface 1004 is mainly used to connect to a background server and communicate data with it; the user interface 1003 is mainly used to connect to a client (user terminal) and communicate data with it; and the processor 1001 may be used to call the speech model training program stored in the memory 1005 and perform the following operations:
obtaining voice data of a target speaker, and dividing the voice data into multiple voice data segments;
extracting voiceprint features from the multiple voice data segments respectively, to obtain multiple voiceprint feature vectors;
sorting the multiple voiceprint feature vectors in a preset sorting manner, and selecting target voiceprint feature vectors based on the sorting result;
training a to-be-trained model based on the target voiceprint feature vectors, to obtain a speech recognition model of the target speaker.
Further, the step of obtaining voice data of a target speaker and dividing the voice data into multiple voice data segments comprises:
obtaining the voice data of the target speaker, and preprocessing the voice data;
segmenting the preprocessed voice data at a preset time interval, to obtain multiple voice data segments.
Further, the step of sorting the multiple voiceprint feature vectors in a preset sorting manner and selecting target voiceprint feature vectors based on the sorting result comprises:
computing the average vector of the multiple voiceprint feature vectors;
calculating the distance between each of the multiple voiceprint feature vectors and the average vector, and sorting the multiple voiceprint feature vectors in descending order of distance to the average vector;
selecting, based on the sorting result, the voiceprint feature vectors that fall within a preset range as the target voiceprint feature vectors.
Further, the to-be-trained model is a deep neural network model, and the step of training the to-be-trained model based on the target voiceprint feature vectors to obtain the speech recognition model of the target speaker comprises:
inputting the target voiceprint feature vectors into the initialized deep neural network model, and performing iterative training;
after the deep neural network model is detected to have converged, taking the converged deep neural network model as the speech recognition model of the target speaker.
Further, the step of extracting voiceprint features from the multiple voice data segments respectively to obtain multiple voiceprint feature vectors comprises:
performing adaptive processing on the multiple voice data segments respectively, based on a pre-trained universal background model, to obtain the multiple voiceprint feature vectors.
Further, before the step of performing adaptive processing on the multiple voice data segments respectively based on a pre-trained universal background model to obtain the multiple voiceprint feature vectors, the processor 1001 may be used to call the speech model training program stored in the memory 1005 and perform the following operations:
preprocessing pre-collected training voice data;
extracting training speech features from the preprocessed training voice data;
performing universal background model training with the training speech features, to obtain the universal background model.
Further, the step of preprocessing the pre-collected training voice data comprises:
successively performing pre-emphasis, framing, windowing, and endpoint detection on the pre-collected training voice data.
Based on the above hardware structure, the embodiments of the speech model training method of the present invention are proposed.
Referring to Fig. 2, a first embodiment of the speech model training method of the present invention provides a speech model training method. It should be noted that although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order. The execution subject of each embodiment of the speech model training method may be a terminal device such as a PC, a smartphone, a smart TV, a tablet computer, or a portable computer; for ease of description, the execution subject is omitted in the following embodiments. The speech model training method includes:
Step S10: obtain voice data of a target speaker, and divide the voice data into multiple voice data segments.
The voice data of the target speaker, collected in advance, is obtained; the person whose speech recognition model is to be trained is taken as the target speaker. There are many ways to collect the target speaker's voice data in advance, for example through telephone recordings. The voice data of the target speaker is divided into multiple voice data segments. Specifically, the number of segments into which the voice data is divided can be specified in advance, for example dividing one minute of voice data into 20 segments.
Further, step S10 includes:
Step S101: obtain the voice data of the target speaker, and preprocess the voice data.
After the voice data of the target speaker is obtained, the voice data is preprocessed. The specific preprocessing may include pre-emphasis. Pre-emphasis is a signal processing technique that compensates the high-frequency components of the input signal at the transmitting end. As the signal rate increases, the signal suffers heavy loss during transmission; to obtain a good signal waveform at the receiving end, the impaired signal must be compensated. The idea of pre-emphasis is to strengthen the high-frequency components of the signal at the start of the transmission line, so as to compensate their excessive attenuation during transmission. Pre-emphasis has no effect on noise and therefore effectively improves the output signal-to-noise ratio. Applying pre-emphasis eliminates interference caused by the vocal cords and lips during phonation, effectively compensates the suppressed high-frequency part of the voice data, highlights the high-frequency formants, and strengthens the signal amplitude of the voice data, all of which help in extracting speech features.
Step S102: segment the preprocessed voice data at a preset time interval, to obtain multiple voice data segments.
The preprocessed voice data is segmented at a preset time interval to obtain multiple voice data segments. The preset time interval can be set as needed; for example, if it is set to 10 ms, the continuous voice data is cut every 10 ms, yielding multiple voice data segments.
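A minimal Python sketch of steps S101-S102, assuming a first-order pre-emphasis filter with coefficient 0.97 and a 10 ms interval (both illustrative values; the embodiment fixes neither):

```python
import numpy as np

def preprocess_and_segment(signal, sample_rate, alpha=0.97, interval_ms=10):
    """Pre-emphasize a speech signal, then split it into fixed-length segments.

    alpha and interval_ms are illustrative assumptions; the embodiment only
    requires some pre-emphasis step and some preset time interval.
    """
    # First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1],
    # boosting the high-frequency components attenuated during phonation.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Cut into segments of interval_ms milliseconds each.
    seg_len = int(sample_rate * interval_ms / 1000)
    n_segments = len(emphasized) // seg_len
    return [emphasized[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]
```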
Step S20: extract voiceprint features from the multiple voice data segments respectively, to obtain multiple voiceprint feature vectors.
Voiceprint features are extracted from the multiple voice data segments respectively, yielding multiple voiceprint feature vectors, i.e. one voiceprint feature vector per voice data segment. A voiceprint feature vector may be a vector of the voiceprint spectral features that identify a user. In this embodiment, there are many ways to extract the voiceprint feature vectors, for example extracting them from the voice data with a Gaussian mixture model.
Step S30: sort the multiple voiceprint feature vectors in a preset sorting manner, and select target voiceprint feature vectors based on the sorting result.
The multiple voiceprint feature vectors are sorted in a preset sorting manner, and target voiceprint feature vectors are selected based on the sorting result. Specifically, the multiple voiceprint feature vectors can be clustered by a clustering algorithm, grouping similar vectors into the same class; the number of classes can be specified in advance, for example 5 classes, and the vectors are then classified by the clustering algorithm. Understandably, vectors within one class are most similar to each other, while vectors in different classes have low similarity. The classes can be ranked by the number of voiceprint feature vectors they contain, and the class (or few classes) with the most vectors is selected; the voiceprint feature vectors in that class become the target voiceprint feature vectors, and the remaining vectors are discarded. It should be noted that the target speaker's voice data may contain considerable noise or other people's voices. By dividing the voice data into segments, extracting a voiceprint feature vector from each segment, and grouping similar vectors into the same class, the vectors that characterize the target speaker's voiceprint end up in the same class. Since the target speaker's voice dominates the voice data, the class (or few classes) with the most voiceprint feature vectors is taken as the target speaker's, and noise data with low similarity to the target speaker's voiceprint feature vectors is excluded, thereby purifying the target speaker's voiceprint feature vectors.
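A minimal sketch of this clustering variant, assuming k-means with a pre-specified class count of 5 and keeping only the largest cluster; the embodiment does not mandate a particular clustering algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_by_clustering(vectors, n_classes=5):
    """Cluster voiceprint vectors and keep those in the most populous cluster,
    on the assumption that the target speaker dominates the voice data."""
    X = np.stack(vectors)
    labels = KMeans(n_clusters=n_classes, n_init=10).fit_predict(X)
    # The largest cluster is taken to represent the target speaker;
    # smaller clusters (noise, other speakers) are discarded.
    majority = np.bincount(labels).argmax()
    return X[labels == majority]
```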
Further, this embodiment proposes another feasible way of obtaining the target voiceprint feature vectors. Step S30 includes:
Step S301: compute the average vector of the multiple voiceprint feature vectors.
After the multiple voiceprint feature vectors have been extracted, their average vector can be computed. The average vector is computed in the usual way: averaging each element across the voiceprint feature vectors yields the average vector of the multiple voiceprint feature vectors.
Step S302: calculate the distance between each of the multiple voiceprint feature vectors and the average vector, and sort the multiple voiceprint feature vectors by distance to the average vector.
The distance between each voiceprint feature vector and the average vector is calculated. Specifically, there are many formulas for the distance between vectors; for example, the cosine distance or the Euclidean distance between a voiceprint feature vector and the average vector can be used. After the distances are calculated, the multiple voiceprint feature vectors are sorted in ascending order of distance to the average vector.
Step S303: select, based on the sorting result, the voiceprint feature vectors that fall within a preset range as the target voiceprint feature vectors.
Based on the sorting result, the voiceprint feature vectors falling within a preset range are selected as the target voiceprint feature vectors. The preset range may be defined as the first few items of the ranking, for example taking the first 5 voiceprint feature vectors as the target voiceprint feature vectors. In this embodiment, since the target speaker's voice dominates the voice data, the average vector of the extracted voiceprint feature vectors best characterizes the target speaker's voiceprint; taking the voiceprint feature vectors nearest to the average vector as the target speaker's voiceprint feature vectors purifies them.
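Steps S301-S303 can be sketched as follows; cosine distance and a top-5 cut-off are assumptions for illustration (the embodiment also allows Euclidean distance and other preset ranges):

```python
import numpy as np

def select_by_mean_distance(vectors, top_k=5):
    """Rank voiceprint vectors by distance to their mean and keep the nearest ones."""
    X = np.stack(vectors)
    mean_vec = X.mean(axis=0)                      # step S301: average vector

    # Step S302: cosine distance of each vector to the average vector.
    cos_sim = (X @ mean_vec) / (np.linalg.norm(X, axis=1) * np.linalg.norm(mean_vec))
    dist = 1.0 - cos_sim
    order = np.argsort(dist)                       # ascending: nearest first

    # Step S303: vectors falling within the preset range (here, the top_k
    # nearest) become the target voiceprint feature vectors.
    return X[order[:top_k]]
```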
Step S40: train the to-be-trained model based on the target voiceprint feature vectors, to obtain the speech recognition model of the target speaker.
After the target voiceprint feature vectors have been selected, the to-be-trained model is trained on them to obtain the speech recognition model of the target speaker. The to-be-trained model may be a neural network model, such as a deep neural network model or a convolutional neural network model. By training the to-be-trained model on the target voiceprint feature vectors, a speech recognition model that recognizes the target speaker is obtained. After the target speaker's speech recognition model has been trained, to determine whether the speaker of a given segment of speech is the target speaker, the voiceprint feature vector of the speaker to be identified is extracted from that speaker's voice data, and the target speaker's speech recognition model scores the voiceprint feature vector to obtain a recognition probability value; if the recognition probability value is greater than a preset probability value, the speech is determined to be from the target speaker.
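The verification flow just described can be summarized in a short hypothetical sketch; model, extract_voiceprint, and the 0.5 threshold are placeholders introduced for illustration, not elements prescribed by the disclosure:

```python
def is_target_speaker(model, voice_data, extract_voiceprint, threshold=0.5):
    """Hypothetical verification flow: extract the unknown speaker's voiceprint
    vector, score it with the target speaker's trained recognition model, and
    accept only if the recognition probability exceeds the preset value."""
    vector = extract_voiceprint(voice_data)   # placeholder feature extractor
    probability = model(vector)               # placeholder model scoring call
    return probability > threshold
```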
In this embodiment, the voice data of the target speaker is obtained and divided into multiple voice data segments; voiceprint features are extracted from the segments respectively, yielding multiple voiceprint feature vectors; the vectors are sorted in a preset sorting manner, and target voiceprint feature vectors are selected based on the sorting result; and the to-be-trained model is trained on the target voiceprint feature vectors to obtain the speech recognition model of the target speaker. Thus, even when the target speaker's voice data is impure, voiceprint feature vectors that accurately characterize the target speaker's voiceprint can be extracted from it, and the target speaker's speech recognition model is trained on these accurate vectors, improving the accuracy of the model.
Further, based on the first embodiment above, a second embodiment of the speech model training method of the present invention provides a speech model training method. In this embodiment, the to-be-trained model is a deep neural network model, and step S40 includes:
Step S401: input the target voiceprint feature vectors into the initialized deep neural network model, and perform iterative training.
After the target voiceprint feature vectors have been obtained, they are input into the initialized deep neural network model for iterative training. A deep neural network (DNN) model comprises an input layer, hidden layers, and an output layer, all composed of neurons. The model contains the connection weights and biases of the neurons between layers, and these weights and biases determine the properties and recognition effect of the network. Initializing the deep neural network model means setting the initial values of its weights and biases; empirical values can be used directly as the initial settings.
The target voiceprint feature vectors are first divided into a preset number of groups of samples, and the groups are input into the neural network model for training, i.e. the grouped samples are fed into the network separately. The forward propagation algorithm performs, starting from the input layer, a series of linear operations and activation operations, layer by layer, based on the weights, biases, and input values of the connected neurons, until the output layer is reached and the output value is obtained. With forward propagation, the output value of each layer of the network can be computed up to the output value of the last layer. The output value is compared with the preset standard output value to compute a loss value, and whether the loss value is less than a preset loss value is checked to judge whether the deep neural network has converged: if it is less, the network is determined to have converged; if not, it is determined not to have converged, in which case error back-propagation is performed based on the model's output value to update the weights and biases of each layer, the target voiceprint feature vectors are fed into the updated model again to obtain a new output value, and the loop iterates until convergence of the deep neural network model is detected.
Step S402: after the deep neural network model is detected to have converged, take the converged deep neural network model as the speech recognition model of the target speaker.
After convergence is detected, the converged deep neural network model serves as the target speaker's speech recognition model. That is, after the loss value is detected to be less than the preset loss value, the weights and biases of each layer of the current deep neural network model are taken as the final weights and biases, yielding the speech recognition model of the target speaker.
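A minimal PyTorch-style sketch of the iterative training in steps S401-S402, assuming a small fully connected network, a mean-squared-error loss against the preset standard outputs, and a fixed loss threshold as the convergence test; all layer sizes and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

def train_dnn(features, targets, dim_in, dim_out,
              loss_threshold=1e-3, max_epochs=1000):
    """Iteratively train a DNN on target voiceprint vectors until the loss
    against the preset standard outputs falls below the preset threshold."""
    model = nn.Sequential(                   # input layer -> hidden -> output
        nn.Linear(dim_in, 256), nn.ReLU(),
        nn.Linear(256, dim_out),
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.MSELoss()

    for epoch in range(max_epochs):
        optimizer.zero_grad()
        loss = criterion(model(features), targets)  # forward propagation
        if loss.item() < loss_threshold:            # convergence test
            break
        loss.backward()                             # error back-propagation
        optimizer.step()                            # update weights and biases
    return model    # converged network = the target speaker's recognition model
```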
In this embodiment, the purified target voiceprint feature vectors of the target speaker are fed into a deep neural network model for training, yielding a speech recognition model that can accurately recognize the target speaker.
Further, based on the second embodiment above, a third embodiment of the speech model training method of the present invention provides a speech model training method. In this embodiment, step S20 includes:
Step S201: perform adaptive processing on the multiple voice data segments respectively, based on a pre-trained universal background model, to obtain multiple voiceprint feature vectors.
A universal background model (UBM) is trained in advance. A UBM is a Gaussian mixture model (GMM) that represents the distribution of speech features of a large number of non-specific speakers. Since UBM training generally uses a large amount of voice data that is independent of any specific speaker and channel, the UBM can be regarded as a model unrelated to any specific speaker: it only fits the distribution of human speech features and does not represent a particular speaker. A Gaussian mixture model quantifies a phenomenon precisely with Gaussian probability density functions (normal distribution curves), decomposing it into several models based on those density functions.
Based on the pre-trained universal background model, adaptive processing is performed on each of the multiple voice data segments, yielding multiple voiceprint feature vectors, i.e. one voiceprint feature vector extracted per voice data segment. Adaptive processing here means treating the part of the universal background model whose non-specific speaker speech features are close to the voice data segment as the target speaker's voice data. This can be realized with maximum a posteriori (MAP) estimation. MAP estimation infers a quantity that is hard to observe from empirical data: the posterior probability is obtained from the prior probability via Bayes' theorem, the objective function (the expression of the target voiceprint feature model) is the likelihood function of the posterior probability, and the parameter values maximizing this likelihood function are found (for example with a gradient-based method). This achieves the effect of treating the part of the universal background model close to the target speaker's voice data, i.e. a portion of the non-specific speakers' speech features, as if it had been trained together with the target speaker's voice data. From the maximizing parameter values, the target voiceprint feature model corresponding to the voice data segment is obtained. The target voiceprint feature model is the model used to compute the target voiceprint feature vector, and the target voiceprint feature vector is the feature vector, obtained through the target voiceprint feature model, that represents the voice data segment.
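Mean-only MAP adaptation is the standard way this kind of UBM adaptation is realized in GMM-UBM systems; the sketch below assumes a trained scikit-learn GaussianMixture as the UBM and a relevance factor of 16, a conventional choice not specified by the disclosure:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm: GaussianMixture, frames: np.ndarray, relevance=16.0):
    """Adapt UBM component means toward one segment's frames (mean-only MAP).

    Returns the adapted means; stacking them row-wise gives a supervector
    that can serve as the segment's voiceprint feature vector.
    """
    post = ubm.predict_proba(frames)        # responsibilities, shape (T, C)
    n_c = post.sum(axis=0)                  # soft frame counts per component
    # Expected frame per component (sufficient statistics of the segment).
    ex = (post.T @ frames) / np.maximum(n_c[:, None], 1e-10)
    alpha = n_c / (n_c + relevance)         # data-vs-prior weighting per component
    # Interpolate between the UBM prior means and the segment statistics.
    return alpha[:, None] * ex + (1 - alpha[:, None]) * ubm.means_
```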
Further, before step S201, the method further includes:
Step S50: preprocess pre-collected training voice data.
The pre-collected training voice data is preprocessed. The training voice data is voice data collected from a large number of non-specific users. The specific preprocessing may be to successively perform pre-emphasis, framing, windowing, and endpoint detection on the training voice data. Pre-emphasis has been explained in the embodiment above and is not repeated here. The pre-emphasized training voice data is then framed. Framing is a speech processing technique that cuts the whole speech signal into several segments; each frame is in the range of 10-30 ms, with a frame shift of generally 1/2 the frame length. The frame shift is the overlap region between two adjacent frames, which avoids excessive variation between them. Framing divides the training voice data into several segments, refining it and facilitating the extraction of the training speech features. The framed training voice data is then windowed. After framing, discontinuities appear at the beginning and end of each frame, so the more frames there are, the larger the error relative to the original signal. Windowing solves this problem: it makes the framed training voice data continuous, so that each frame exhibits the characteristics of a periodic function. Windowing specifically means processing the training voice data with a window function, for which a Hamming window may be selected. Windowing makes the time-domain signal of the framed training voice data continuous, which facilitates the extraction of the training speech features.
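A sketch of the framing and windowing stages, assuming a 25 ms frame (the text allows 10-30 ms) with the 1/2-frame shift the embodiment describes, and a Hamming window:

```python
import numpy as np

def frame_and_window(signal, sample_rate, frame_ms=25):
    """Split a pre-emphasized signal into overlapping frames (shift = 1/2 the
    frame length, per the embodiment) and apply a Hamming window to each.
    Assumes the signal is at least one frame long."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = frame_len // 2                        # half-frame overlap
    n_frames = 1 + (len(signal) - frame_len) // shift
    window = np.hamming(frame_len)                # smooths frame edges so each
    return np.stack([                             # frame behaves periodically
        signal[i * shift:i * shift + frame_len] * window
        for i in range(n_frames)
    ])
```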
Step S60: extract training speech features from the preprocessed training voice data.
Training speech features are extracted from the preprocessed training voice data. Specifically, a fast Fourier transform may be applied to the preprocessed training voice data to obtain its spectrum, from which the power spectrum of the training voice data is obtained. The fast Fourier transform (FFT) is the general term for efficient, fast computer methods for computing the discrete Fourier transform. This method greatly reduces the number of multiplications the computer needs for the discrete Fourier transform; the more sampling points are transformed, the more significant the savings of the FFT algorithm. The power spectrum of the training voice data is then processed with a mel-scale filter bank to obtain the mel power spectrum of the training voice data. Processing the power spectrum with a mel-scale filter bank amounts to mel-frequency analysis of the power spectrum, and mel-frequency analysis is based on the perception of human hearing. Observation shows that the human ear works like a filter bank, focusing only on certain specific frequency components (human hearing is frequency-selective): it lets signals of certain frequencies through and simply ignores frequency signals it does not want to perceive. Specifically, the mel-scale filter bank comprises multiple filters that are not uniformly distributed on the frequency axis: there are many filters in the low-frequency region, densely distributed, while in the high-frequency region the filters become fewer and sparsely distributed. Understandably, the mel-scale filter bank has high resolution in the low-frequency part, consistent with the auditory properties of the human ear; this is the physical meaning of the mel scale. The frequency-domain signal is cut by the mel-frequency-scale filter bank so that each frequency band yields an energy value; if the number of filters is 22, 22 energy values are obtained for the mel power spectrum of the training voice data. Through mel-frequency analysis of the power spectrum, the power spectrum retains the frequency parts closely related to the characteristics of the human ear, and these parts reflect the features of the training voice data well. Cepstral analysis is then performed on the mel power spectrum to obtain the mel-frequency cepstral coefficients (MFCCs) of the training voice data, and the MFCCs are taken as the training speech features.
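The FFT, power spectrum, mel filter bank, and cepstral analysis pipeline above is what librosa's MFCC routine implements end to end; a sketch assuming 22 mel filters (matching the 22-filter example in the text) and 13 cepstral coefficients, the latter a conventional assumption:

```python
import numpy as np
import librosa

def extract_training_features(signal, sample_rate):
    """MFCCs via FFT, mel-scale filtering, and cepstral analysis.

    n_mels=22 mirrors the 22-filter example in the text; n_mfcc=13 is a
    conventional choice, not one fixed by the disclosure."""
    mfcc = librosa.feature.mfcc(y=signal.astype(np.float32), sr=sample_rate,
                                n_mfcc=13, n_mels=22)
    return mfcc.T   # one 13-dimensional training feature vector per frame
```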
Step S70: perform universal background model training with the training speech features, to obtain the universal background model.
After the training speech features are obtained, universal background model training is performed with them, yielding the universal background model. The training speech features can be represented as vectors (matrices), which the computer device can read directly. During universal background model training, the training speech features are input frame by frame, and the parameters of the universal background model expression are obtained by iterative computation with the EM algorithm, thereby obtaining the universal background model. The EM algorithm is the standard mathematical method for estimating probability density functions with hidden variables and is not elaborated here.
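Training the UBM itself reduces to fitting a large GMM by EM over frames pooled from many non-specific speakers; a sketch with 512 diagonal-covariance components, a common choice assumed here rather than prescribed by the disclosure:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(feature_matrix: np.ndarray, n_components=512):
    """Fit the universal background model by EM on training speech features
    pooled from many non-specific speakers (rows = frames, cols = MFCC dims)."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type='diag',  # diagonal covariances keep EM cheap
                          max_iter=200)
    ubm.fit(feature_matrix)                        # EM iterations run inside fit()
    return ubm
```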
In this embodiment, the speech features of the multiple voice data segments are extracted with the universal background model, so that a voiceprint feature vector accurately characterizing each voice data segment's voiceprint can be obtained without requiring a large amount of voice data per segment.
Furthermore, an embodiment of the present invention also proposes a speech model training apparatus. Referring to Fig. 3, the speech model training apparatus includes:
a segmentation module 10, configured to obtain voice data of a target speaker and divide the voice data into multiple voice data segments;
an extraction module 20, configured to extract voiceprint features from the multiple voice data segments respectively, to obtain multiple voiceprint feature vectors;
a sorting module 30, configured to sort the multiple voiceprint feature vectors in a preset sorting manner and select target voiceprint feature vectors based on the sorting result;
a training module 40, configured to train a to-be-trained model based on the target voiceprint feature vectors, to obtain the speech recognition model of the target speaker.
Further, the segmentation module 10 includes:
a preprocessing unit, configured to obtain the voice data of the target speaker and preprocess the voice data;
a segmenting unit, configured to segment the preprocessed voice data at a preset time interval, to obtain multiple voice data segments.
Further, the sorting module 30 includes:
a computing unit, configured to compute the average vector of the multiple voiceprint feature vectors;
a sorting unit, configured to calculate the distance between each of the multiple voiceprint feature vectors and the average vector, and sort the multiple voiceprint feature vectors in descending order of distance to the average vector;
a selection unit, configured to select, based on the sorting result, the voiceprint feature vectors falling within a preset range as the target voiceprint feature vectors.
Further, the training module 40 includes:
a training unit, configured to input the target voiceprint feature vectors into the initialized deep neural network model and perform iterative training;
a detection unit, configured to, after the deep neural network model is detected to have converged, take the converged deep neural network model as the speech recognition model of the target speaker.
Further, the extraction module 20 includes:
a processing unit, configured to perform adaptive processing on the multiple voice data segments respectively, based on a pre-trained universal background model, to obtain multiple voiceprint feature vectors.
Further, the speech model training apparatus further includes:
a preprocessing module, configured to preprocess pre-collected training voice data.
The extraction module 20 is further configured to extract training speech features from the preprocessed training voice data.
The training module 40 is further configured to perform universal background model training with the training speech features, to obtain the universal background model.
Further, the preprocessing module is further configured to successively perform pre-emphasis, framing, windowing, and endpoint detection on the pre-collected training voice data.
The specific implementations of the speech model training apparatus of the present invention expand on essentially the same content as the embodiments of the speech model training method above and are not repeated here.
In addition, an embodiment of the present invention also proposes a computer-readable storage medium on which a speech model training program is stored; the speech model training program, when executed by a processor, implements the steps of the speech model training method described above.
The specific implementations of the speech model training device and the computer-readable storage medium of the present invention expand on essentially the same content as the embodiments of the speech model training method above and are not repeated here.
It should be noted that, in this document, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article, or system. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or system that includes it.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, or the part of it that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The above is only a preferred embodiment of the present invention and does not limit the patent scope of the present invention. Any equivalent structural or flow transformation made using the contents of the specification and accompanying drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present invention.

Claims (10)

1. A speech model training method, characterized in that the speech model training method comprises:
obtaining voice data of a target speaker, and dividing the voice data into multiple voice data segments;
extracting voiceprint features from the multiple voice data segments respectively, to obtain multiple voiceprint feature vectors;
sorting the multiple voiceprint feature vectors in a preset sorting manner, and selecting target voiceprint feature vectors based on the sorting result;
training a to-be-trained model based on the target voiceprint feature vectors, to obtain a speech recognition model of the target speaker.
2. The speech model training method of claim 1, characterized in that the step of obtaining voice data of a target speaker and dividing the voice data into multiple voice data segments comprises:
obtaining the voice data of the target speaker, and preprocessing the voice data;
segmenting the preprocessed voice data at a preset time interval, to obtain multiple voice data segments.
3. The speech model training method of claim 1, characterized in that the step of sorting the multiple voiceprint feature vectors in a preset sorting manner and selecting target voiceprint feature vectors based on the sorting result comprises:
computing the average vector of the multiple voiceprint feature vectors;
calculating the distance between each of the multiple voiceprint feature vectors and the average vector, and sorting the multiple voiceprint feature vectors in descending order of distance to the average vector;
selecting, based on the sorting result, the voiceprint feature vectors falling within a preset range as the target voiceprint feature vectors.
4. The speech model training method of claim 1, characterized in that the to-be-trained model is a deep neural network model, and the step of training the to-be-trained model based on the target voiceprint feature vectors to obtain the speech recognition model of the target speaker comprises:
inputting the target voiceprint feature vectors into the initialized deep neural network model, and performing iterative training;
after the deep neural network model is detected to have converged, taking the converged deep neural network model as the speech recognition model of the target speaker.
5. The speech model training method of any one of claims 1 to 4, characterized in that the step of extracting voiceprint features from the multiple voice data segments respectively to obtain multiple voiceprint feature vectors comprises:
performing adaptive processing on the multiple voice data segments respectively, based on a pre-trained universal background model, to obtain multiple voiceprint feature vectors.
6. The speech model training method of claim 5, characterized in that before the step of performing adaptive processing on the multiple voice data segments respectively based on a pre-trained universal background model to obtain multiple voiceprint feature vectors, the method further comprises:
preprocessing pre-collected training voice data;
extracting training speech features from the preprocessed training voice data;
performing universal background model training with the training speech features, to obtain the universal background model.
7. The speech model training method of claim 6, characterized in that the step of preprocessing the pre-collected training voice data comprises:
successively performing pre-emphasis, framing, windowing, and endpoint detection on the pre-collected training voice data.
8. A speech model training apparatus, characterized in that the speech model training apparatus comprises:
a segmentation module, configured to obtain voice data of a target speaker and divide the voice data into multiple voice data segments;
an extraction module, configured to extract voiceprint features from the multiple voice data segments respectively, to obtain multiple voiceprint feature vectors;
a sorting module, configured to sort the multiple voiceprint feature vectors in a preset sorting manner and select target voiceprint feature vectors based on the sorting result;
a training module, configured to train a to-be-trained model based on the target voiceprint feature vectors, to obtain the speech recognition model of the target speaker.
9. A speech model training device, characterized in that the speech model training device comprises a memory, a processor, and a speech model training program stored on the memory and runnable on the processor, wherein the speech model training program, when executed by the processor, implements the steps of the speech model training method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a speech model training program is stored on the computer-readable storage medium, and the speech model training program, when executed by a processor, implements the steps of the speech model training method of any one of claims 1 to 7.
CN201910744145.5A 2019-08-13 2019-08-13 Speech model training method, device, equipment and computer readable storage medium Pending CN110428842A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910744145.5A CN110428842A (en) 2019-08-13 2019-08-13 Speech model training method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910744145.5A CN110428842A (en) 2019-08-13 2019-08-13 Speech model training method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN110428842A true CN110428842A (en) 2019-11-08

Family

ID=68415931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910744145.5A Pending CN110428842A (en) 2019-08-13 2019-08-13 Speech model training method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN110428842A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102074236A (en) * 2010-11-29 2011-05-25 清华大学 Speaker clustering method for distributed microphone
CN106971724A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 Anti-tampering voiceprint recognition method and system
CN108777146A (en) * 2018-05-31 2018-11-09 平安科技(深圳)有限公司 Speech model training method, speaker recognition method, device, equipment and medium
CN109256137A (en) * 2018-10-09 2019-01-22 深圳市声扬科技有限公司 Voice acquisition method, device, computer equipment and storage medium
CN109686377A (en) * 2018-12-24 2019-04-26 龙马智芯(珠海横琴)科技有限公司 Audio recognition method and device, and computer-readable storage medium

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112863476A (en) * 2019-11-27 2021-05-28 阿里巴巴集团控股有限公司 Method and device for constructing personalized speech synthesis model, method and device for speech synthesis and testing
CN110942777A (en) * 2019-12-05 2020-03-31 出门问问信息科技有限公司 Training method and device for voiceprint neural network model and storage medium
CN110942777B (en) * 2019-12-05 2022-03-08 出门问问信息科技有限公司 Training method and device for voiceprint neural network model and storage medium
CN111179942A (en) * 2020-01-06 2020-05-19 泰康保险集团股份有限公司 Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and computer readable storage medium
US11908455B2 (en) 2020-01-07 2024-02-20 Tencent Technology (Shenzhen) Company Limited Speech separation model training method and apparatus, storage medium and computer device
CN111833851A (en) * 2020-06-16 2020-10-27 杭州云嘉云计算有限公司 Method for automatically learning and optimizing acoustic model
CN112002343A (en) * 2020-08-18 2020-11-27 海尔优家智能科技(北京)有限公司 Speech purity recognition method and device, storage medium and electronic device
CN112002343B (en) * 2020-08-18 2024-01-23 海尔优家智能科技(北京)有限公司 Speech purity recognition method and device, storage medium and electronic device
CN112034473A (en) * 2020-08-31 2020-12-04 福建省特种设备检验研究院 Method, device and equipment for measuring distance between guide rail brackets of elevator and storage medium
CN112034473B (en) * 2020-08-31 2024-02-27 福建省特种设备检验研究院 Elevator guide rail bracket spacing measuring method, device, equipment and storage medium
CN112233694A (en) * 2020-10-10 2021-01-15 中国电子科技集团公司第三研究所 Target identification method and device, storage medium and electronic equipment
CN112233694B (en) * 2020-10-10 2024-03-05 中国电子科技集团公司第三研究所 Target identification method and device, storage medium and electronic equipment
CN112489628B (en) * 2020-11-23 2024-02-06 平安科技(深圳)有限公司 Voice data selection method and device, electronic equipment and storage medium
CN112489628A (en) * 2020-11-23 2021-03-12 平安科技(深圳)有限公司 Voice data selection method and device, electronic equipment and storage medium
CN112712790A (en) * 2020-12-23 2021-04-27 平安银行股份有限公司 Voice extraction method, device, equipment and medium for target speaker
CN112712790B (en) * 2020-12-23 2023-08-15 平安银行股份有限公司 Speech extraction method, device, equipment and medium for target speaker
CN112509587A (en) * 2021-02-03 2021-03-16 南京大正智能科技有限公司 Method, device and equipment for dynamically matching mobile number and voiceprint and constructing index
CN113053365A (en) * 2021-03-12 2021-06-29 云知声智能科技股份有限公司 Voice separation method, device, equipment and storage medium
CN113053365B (en) * 2021-03-12 2023-03-24 云知声智能科技股份有限公司 Voice separation method, device, equipment and storage medium
WO2022227223A1 (en) * 2021-04-27 2022-11-03 平安科技(深圳)有限公司 Voice verification model training method and apparatus, and computer device
CN112992154A (en) * 2021-05-08 2021-06-18 北京远鉴信息技术有限公司 Voice identity determination method and system based on enhanced voiceprint library
CN113345466A (en) * 2021-06-01 2021-09-03 平安科技(深圳)有限公司 Main speaker voice detection method, device and equipment based on multi-microphone scene
CN113345466B (en) * 2021-06-01 2024-03-01 平安科技(深圳)有限公司 Main speaker voice detection method, device and equipment based on multi-microphone scene
CN113327622A (en) * 2021-06-02 2021-08-31 云知声(上海)智能科技有限公司 Voice separation method and device, electronic equipment and storage medium
CN113409794B (en) * 2021-06-30 2023-05-23 平安科技(深圳)有限公司 Voiceprint recognition model optimization method, voiceprint recognition model optimization device, computer equipment and storage medium
CN113409794A (en) * 2021-06-30 2021-09-17 平安科技(深圳)有限公司 Optimization method and device of voiceprint recognition model, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110428842A (en) Speech model training method, device, equipment and computer readable storage medium
CN110289003B (en) Voiceprint recognition method, model training method and server
CN104732978B (en) Text-dependent speaker recognition method based on combined deep learning
CN103971680B (en) Speech recognition method and apparatus
CN110457432B (en) Interview scoring method, device, equipment and storage medium
CN108922513B (en) Voice distinguishing method and device, computer equipment and storage medium
CN101404160B (en) Voice denoising method based on audio recognition
CN106504768B (en) Telephone test audio classification method and device based on artificial intelligence
CN106847292A (en) Voiceprint recognition method and device
CN107610707A (en) Voiceprint recognition method and device
CN110838286A (en) Model training method, language identification method, device and equipment
CN102486922B (en) Speaker recognition method, device and system
CN108922515A (en) Speech model training method, speech recognition method, device, equipment and medium
CN107993663A (en) Android-based voiceprint recognition method
CN108648769A (en) Voice activity detection method, apparatus and equipment
CN108986798B (en) Voice data processing method, device and equipment
CN108597505A (en) Audio recognition method, device and terminal device
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN109036470B (en) Voice distinguishing method, device, computer equipment and storage medium
CN109872713A (en) Voice wake-up method and device
CN102968990A (en) Speaker identification method and system
CN105654955B (en) Audio recognition method and device
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN110136726A (en) Voice gender estimation method, device, system and storage medium
CN110019741A (en) Question-answering system answer matching method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191108