CN108154371A - Electronic device, identity verification method, and storage medium - Google Patents
Electronic device, identity verification method, and storage medium
- Publication number
- CN108154371A (application number CN201810030621.2A)
- Authority
- CN
- China
- Prior art keywords
- voiceprint
- voice data
- training
- triphone
- data sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q20/00—Payment architectures, schemes or protocols
- G06Q20/38—Payment protocols; Details thereof
- G06Q20/40—Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
- G06Q20/401—Transaction verification
- G06Q20/4014—Identity check for transactions
- G06Q20/40145—Biometric identity checks
Abstract
The present invention relates to an electronic device, an identity verification method, and a storage medium. The method includes: after voice data of a target user is received, extracting a preset type of voiceprint feature from the voice data using a preset filter, and building a voiceprint feature vector for the voice data based on the preset type of voiceprint feature; inputting the voiceprint feature vector into a pre-trained first model to determine the triphone feature corresponding to each frame of the voice data, and constructing a triphone feature vector from all the triphone features of the voice data; inputting the triphone feature vector into a pre-trained second model to construct the current voiceprint discrimination vector of the target user; and calculating the space distance between the current voiceprint discrimination vector and a pre-stored standard voiceprint discrimination vector of the target user, performing identity verification on the user based on the space distance, and generating a verification result. The present invention can improve the accuracy of identity verification and thereby improve financial security.
Description
Technical field
The present invention relates to the field of communication technology, and more particularly to an electronic device, an identity verification method, and a storage medium.
Background technology
At present, the scope of business of many large financial companies covers multiple lines such as insurance, banking, and investment. Almost every line of business requires communicating with the same client, and identity verification or anti-fraud screening must be carried out before that communication to ensure business security. To meet real-time business demands, some financial companies verify a client's identity or screen for fraud by means of speech recognition. In speech recognition, the sound (waveform) of a word is not determined by its phonemes alone: because of coarticulation (a sound being altered by the influence of the sounds immediately before and after it), the perceived phoneme differs from its standard form. The sound of a word therefore also depends on the phoneme context, not just the phonemes themselves. The voiceprint recognition models used in existing speech recognition schemes do not take the phoneme context of the voice data to be recognized into account, so the accuracy of identity verification by speech recognition is low, criminals may exploit this weakness to commit financial fraud, and security suffers.
Invention content
The purpose of the present invention is to provide an electronic device, an identity verification method, and a storage medium, aiming to improve the accuracy of identity verification and thereby improve financial security.
To achieve the above object, the present invention provides an electronic device. The electronic device includes a memory and a processor connected to the memory. The memory stores a processing system that can run on the processor, and the processing system, when executed by the processor, implements the following steps:

an extraction step: after voice data of a target user pending identity verification is received, extracting a preset type of voiceprint feature from the voice data using a preset filter, and building a voiceprint feature vector for the voice data based on the preset type of voiceprint feature;

a first construction step: inputting the voiceprint feature vector into a pre-trained first model to determine the triphone feature corresponding to each frame of the voice data, and constructing a triphone feature vector from all the triphone features of the voice data;

a second construction step: inputting the triphone feature vector into a pre-trained second model to construct the current voiceprint discrimination vector of the target user;

a verification step: calculating the space distance between the current voiceprint discrimination vector and a pre-stored standard voiceprint discrimination vector of the target user, performing identity verification on the user based on the space distance, and generating a verification result.
Preferably, the second model is a Gaussian mixture model, and the training process of the second model includes the following steps:

obtaining a preset number of voice data samples, each voice data sample corresponding to one voiceprint discrimination vector;

extracting the preset type of voiceprint feature corresponding to each voice data sample, and building the voiceprint feature vector of each voice data sample from its preset type of voiceprint feature;

inputting each built voiceprint feature vector into the pre-trained first model, determining the triphone feature corresponding to each frame of each voice data sample, and constructing the triphone feature vector of each voice data sample from all its triphone features;

dividing all the constructed triphone feature vectors into a training set of a first percentage and a validation set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 1;

training the Gaussian mixture model on the triphone feature vectors in the training set, and after training is complete, verifying the accuracy of the trained Gaussian mixture model on the validation set;

if the accuracy is greater than a preset accuracy, ending training and taking the trained Gaussian mixture model as the second model; otherwise, if the accuracy is less than or equal to the preset accuracy, increasing the number of voice data samples and re-training on the increased samples.
Preferably, the verification step specifically includes:

calculating the cosine distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the target user:

d = 1 - (v_s · v_c) / (||v_s|| · ||v_c||),

where v_s is the standard voiceprint discrimination vector and v_c is the current voiceprint discrimination vector;

if the cosine distance is less than or equal to a preset distance threshold, generating information that the verification has passed;

if the cosine distance is greater than the preset distance threshold, generating information that the verification has failed.
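As a hedged illustration, the threshold comparison in this preferred verification step can be sketched as follows; the example vectors, the 0.3 threshold, and the function names are assumptions for the sketch, not values taken from the patent.

```python
import numpy as np

def cosine_distance(a, b):
    # distance = 1 - cosine similarity; smaller means more similar
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_identity(current_vec, standard_vec, threshold=0.3):
    # Verification passes when the space distance is within the threshold
    return cosine_distance(current_vec, standard_vec) <= threshold

# Illustrative check against a stored "standard" voiceprint discrimination vector
standard = np.array([0.2, 0.9, 0.4])
same_user = standard + 0.01               # nearly identical voiceprint
impostor = np.array([-0.8, 0.1, 0.5])
print(verify_identity(same_user, standard))  # True
print(verify_identity(impostor, standard))   # False
```

The threshold trades off false accepts against false rejects and would normally be tuned on held-out verification pairs.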
Preferably, the first model is a long short-term memory (LSTM) network model, and the training process of the first model includes the following steps:

obtaining a preset number of voice data samples, each voice data sample corresponding to one triphone feature vector;

extracting the preset type of voiceprint feature corresponding to each voice data sample, and building the voiceprint feature vector of each voice data sample from its preset type of voiceprint feature;

dividing all the built voiceprint feature vectors into a training set of a first percentage and a validation set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 1;

training the LSTM network model on the voiceprint feature vectors in the training set, and after training is complete, verifying the accuracy of the trained LSTM network model on the validation set;

if the accuracy is greater than a preset accuracy, ending training and taking the trained LSTM network model as the first model; otherwise, if the accuracy is less than or equal to the preset accuracy, increasing the number of voice data samples and re-training on the increased samples.
To achieve the above object, the present invention also provides an identity verification method, which includes:

S1: after voice data of a target user pending identity verification is received, extracting a preset type of voiceprint feature from the voice data using a preset filter, and building a voiceprint feature vector for the voice data based on the preset type of voiceprint feature;

S2: inputting the voiceprint feature vector into a pre-trained first model to determine the triphone feature corresponding to each frame of the voice data, and constructing a triphone feature vector from all the triphone features of the voice data;

S3: inputting the triphone feature vector into a pre-trained second model to construct the current voiceprint discrimination vector of the target user;

S4: calculating the space distance between the current voiceprint discrimination vector and a pre-stored standard voiceprint discrimination vector of the target user, performing identity verification on the user based on the space distance, and generating a verification result.
Preferably, the second model is a Gaussian mixture model, and the training process of the second model includes the following steps:

obtaining a preset number of voice data samples, each voice data sample corresponding to one voiceprint discrimination vector;

extracting the preset type of voiceprint feature corresponding to each voice data sample, and building the voiceprint feature vector of each voice data sample from its preset type of voiceprint feature;

inputting each built voiceprint feature vector into the pre-trained first model, determining the triphone feature corresponding to each frame of each voice data sample, and constructing the triphone feature vector of each voice data sample from all its triphone features;

dividing all the constructed triphone feature vectors into a training set of a first percentage and a validation set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 1;

training the Gaussian mixture model on the triphone feature vectors in the training set, and after training is complete, verifying the accuracy of the trained Gaussian mixture model on the validation set;

if the accuracy is greater than a preset accuracy, ending training and taking the trained Gaussian mixture model as the second model; otherwise, if the accuracy is less than or equal to the preset accuracy, increasing the number of voice data samples and re-training on the increased samples.
Preferably, step S4 specifically includes:

calculating the cosine distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the target user:

d = 1 - (v_s · v_c) / (||v_s|| · ||v_c||),

where v_s is the standard voiceprint discrimination vector and v_c is the current voiceprint discrimination vector;

if the cosine distance is less than or equal to a preset distance threshold, generating information that the verification has passed;

if the cosine distance is greater than the preset distance threshold, generating information that the verification has failed.
Preferably, the first model is a long short-term memory (LSTM) network model, and the training process of the first model includes the following steps:

obtaining a preset number of voice data samples, each voice data sample corresponding to one triphone feature vector;

extracting the preset type of voiceprint feature corresponding to each voice data sample, and building the voiceprint feature vector of each voice data sample from its preset type of voiceprint feature;

dividing all the built voiceprint feature vectors into a training set of a first percentage and a validation set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 1;

training the LSTM network model on the voiceprint feature vectors in the training set, and after training is complete, verifying the accuracy of the trained LSTM network model on the validation set;

if the accuracy is greater than a preset accuracy, ending training and taking the trained LSTM network model as the first model; otherwise, if the accuracy is less than or equal to the preset accuracy, increasing the number of voice data samples and re-training on the increased samples.
Preferably, step S1 specifically includes:

performing pre-emphasis, framing, and windowing on the voice data, performing a Fourier transform on each windowed frame to obtain the corresponding spectrum, and inputting the spectrum into a Mel filter bank to output a Mel spectrum;

performing cepstral analysis on the Mel spectrum to obtain Mel-frequency cepstrum coefficients (MFCC), and forming the corresponding voiceprint feature vector from the Mel-frequency cepstrum coefficients.
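The pre-emphasis, framing, windowing, Fourier transform, Mel filter bank, and cepstral analysis chain of step S1 can be sketched in numpy as follows; the sample rate, frame length, hop size, filter count, and coefficient count are illustrative assumptions, not values specified by the patent.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch: pre-emphasis -> framing -> windowing -> FFT
    -> Mel filter bank -> log -> DCT (cepstral analysis)."""
    # Pre-emphasis boosts high frequencies
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing with a Hamming window
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i*hop : i*hop+frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Power spectrum via the Fourier transform of each windowed frame
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft
    # Triangular Mel filter bank
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    mel_spec = np.log(power @ fbank.T + 1e-10)   # log Mel spectrum
    # DCT-II decorrelates the log Mel energies into cepstral coefficients
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return mel_spec @ basis.T                    # (n_frames, n_ceps) feature matrix

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s test tone
feats = mfcc(sig)
print(feats.shape)  # (98, 13)
```

The resulting per-frame coefficient matrix plays the role of the voiceprint feature vector (feature data matrix) that is fed to the first model.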
The present invention also provides a computer-readable storage medium on which a processing system is stored, and the processing system, when executed by a processor, implements the steps of the identity verification method described above.
The beneficial effects of the present invention are as follows: when performing identity verification or anti-fraud screening, the present invention first extracts the voiceprint feature of the voice data and builds the corresponding voiceprint feature vector; inputs the voiceprint feature vector into the pre-trained first model to determine the triphone features of the voice data and build the corresponding triphone feature vector; inputs the triphone feature vector into the pre-trained second model to obtain the current voiceprint discrimination vector of the target user; and calculates the space distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the target user, so as to verify the user's identity using that space distance. Because this embodiment considers not only the phonemes themselves but also the triphones representing the phoneme context of the voice data when performing identity verification by speech recognition, it can improve the accuracy of identity verification and thereby improve financial security.
Description of the drawings
Fig. 1 is a schematic diagram of the hardware architecture of an embodiment of the electronic device of the present invention;
Fig. 2 is a flow diagram of an embodiment of the identity verification method of the present invention.
Specific embodiment
To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.

It should be noted that descriptions involving "first", "second", and the like in the present invention are used for description purposes only, and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the technical features concerned. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments can be combined with each other, but only insofar as such a combination can be implemented by those of ordinary skill in the art; when a combination of technical solutions is contradictory or cannot be realized, it should be understood that such a combination does not exist and is not within the protection scope claimed by the present invention.
Referring to Fig. 1, Fig. 1 is a schematic diagram of the hardware architecture of an embodiment of the electronic device of the present invention. The electronic device 1 is an apparatus that can automatically perform numerical computation and/or information processing according to instructions set or stored in advance. The electronic device 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud-computing-based cloud composed of a large number of hosts or network servers, where cloud computing is a kind of distributed computing: one super virtual computer composed of a group of loosely coupled computers.
In this embodiment, the electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that can communicate with each other through a system bus, the memory 11 storing a processing system that can run on the processor 12. It should be noted that Fig. 1 only shows the electronic device 1 with components 11-13; it should be understood that not all of the shown components are required, and more or fewer components may be implemented instead.
The memory 11 includes an internal memory and at least one type of readable storage medium. The internal memory provides a cache for the operation of the electronic device 1; the readable storage medium may be a non-volatile storage medium such as a flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, or optical disk. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as the hard disk of the electronic device 1; in other embodiments, the non-volatile storage medium may also be an external storage device of the electronic device 1, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card equipped on the electronic device 1. In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and various kinds of application software installed on the electronic device 1, such as the program code of the processing system in an embodiment of the present invention. In addition, the memory 11 can also be used to temporarily store various kinds of data that have been output or will be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 12 is generally used to control the overall operation of the electronic device 1, for example, performing control and processing related to data interaction or communication with other devices. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example, to run the processing system.
The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the electronic device 1 and other electronic devices.
The processing system is stored in the memory 11 and includes at least one computer-readable instruction stored in the memory 11. The at least one computer-readable instruction can be executed by the processor 12 to implement the methods of the embodiments of the present application, and can be divided into different logical modules according to the functions realized by its parts.
In one embodiment, the above processing system, when executed by the processor 12, implements the following steps:

an extraction step: after voice data of a target user pending identity verification is received, extracting a preset type of voiceprint feature from the voice data using a preset filter, and building a voiceprint feature vector for the voice data based on the preset type of voiceprint feature;
In this embodiment, the voice data is collected by a voice capture device (for example, a microphone). When collecting voice data, interference from environmental noise and from the voice capture device itself should be prevented as far as possible: the voice capture device should be kept at a suitable distance from the target user, voice capture devices with large distortion should be avoided where possible, the power supply should preferably be mains electricity with a stable current, and a pickup sensor should be used for telephone recording. Before framing and sampling, the voice data may be de-noised to further reduce interference. In order to extract the voiceprint feature of the voice data, the collected voice data must be longer than a preset data length.
Voiceprint features come in multiple types, such as broadband voiceprints, narrowband voiceprints, and amplitude voiceprints. In this embodiment, the preset type of voiceprint feature is preferably the Mel-frequency cepstrum coefficients (MFCC) of the sampled voice data, and the preset filter is a Mel filter bank. When building the corresponding voiceprint feature vector, the voiceprint features of the sampled voice data form a feature data matrix, and this feature data matrix is the voiceprint feature vector of the sampled voice data.
a first construction step: inputting the voiceprint feature vector into the pre-trained first model to determine the triphone feature corresponding to each frame of the voice data, and constructing a triphone feature vector from all the triphone features of the voice data;

In this embodiment, a triphone state represents the state of a speech frame's phoneme itself together with its relationship to the preceding and following phonemes. The first model is preferably a long short-term memory (LSTM) network model; the LSTM network model processes the voiceprint feature vector, obtains the triphone feature corresponding to each frame, and constructs the corresponding triphone feature vector.
The triphone feature is the probability of the triphone state represented by the corresponding speech frame. All the triphone features of the voice data correspond to a triphone feature matrix, which corresponds to the input voiceprint feature vector (i.e. the voiceprint feature matrix); the triphone feature vector constructed from all the triphone features of the voice data is this triphone feature matrix.
In a preferred embodiment, the LSTM network model includes 1 input layer, 3 LSTM layers, and 1 classification layer, as shown in Table 1 below:

Layer Name | Batch Size
Input | 913
LSTM1/HLSTM1 | 1024
LSTM2/HLSTM2 | 1024
LSTM3/HLSTM3 | 1024
Softmax | 4773

Table 1
Here, Layer Name is the name of each layer, and Batch Size is the number of input voice pieces for the current layer. Input denotes the input layer; HLSTM (Highway Long Short-Term Memory) denotes a highway LSTM, a recurrent LSTM network based on memory-cell connections (the recurrent neural network introduces direct connections between adjacent memory cells, so that information in the LSTM network model can be transmitted directly between different layers, which improves the efficiency of speech recognition and the recognition effect); Softmax denotes a Softmax classifier. After the voiceprint feature vector is processed in turn by each of the above layers of the LSTM network model, the triphone feature corresponding to each frame is obtained and the corresponding triphone feature vector is constructed.
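The layer structure of Table 1 can be illustrated with a minimal numpy forward pass. The dimensions below are shrunk from the table, the weights are random rather than trained, and the highway (HLSTM) connections are omitted, so this is only a shape-level sketch of how a voiceprint feature sequence becomes per-frame triphone-state probabilities.

```python
import numpy as np

def lstm_layer(x_seq, W, U, b):
    """One LSTM layer forward pass.
    x_seq: (T, d_in); W: (4*d_h, d_in); U: (4*d_h, d_h); b: (4*d_h,)."""
    d_h = U.shape[1]
    h, c, out = np.zeros(d_h), np.zeros(d_h), []
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x in x_seq:
        z = W @ x + U @ h + b
        i, f, g, o = z[:d_h], z[d_h:2*d_h], z[2*d_h:3*d_h], z[3*d_h:]
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # cell state update
        h = sigmoid(o) * np.tanh(c)                    # hidden state output
        out.append(h)
    return np.stack(out)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Tiny stand-in dimensions (Table 1 uses 1024-unit layers and 4773 output states)
rng = np.random.default_rng(0)
T, d_in, d_h, n_tri = 5, 8, 16, 12
h = lstm_layer(rng.normal(size=(T, d_in)),
               rng.normal(size=(4*d_h, d_in)) * 0.1,
               rng.normal(size=(4*d_h, d_h)) * 0.1, np.zeros(4*d_h))
for _ in range(2):  # stack to 3 LSTM layers, as in Table 1
    h = lstm_layer(h, rng.normal(size=(4*d_h, d_h)) * 0.1,
                   rng.normal(size=(4*d_h, d_h)) * 0.1, np.zeros(4*d_h))
# Softmax classifier: per-frame triphone-state probability distributions
probs = softmax(h @ rng.normal(size=(d_h, n_tri)))
print(probs.shape)  # (5, 12)
```

Each row of `probs` corresponds to one speech frame and sums to 1, matching the description of the triphone feature as a per-frame triphone-state probability.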
In a preferred embodiment, the training process of the first model, the LSTM network model, includes the following steps: obtaining a preset number of voice data samples (for example, 10), each voice data sample corresponding to one triphone feature vector; extracting the preset type of voiceprint feature corresponding to each voice data sample, and building the voiceprint feature vector of each voice data sample from its preset type of voiceprint feature; dividing all the built voiceprint feature vectors into a training set of a first percentage (for example, 75%) and a validation set of a second percentage (for example, 20%), the sum of the first and second percentages being less than or equal to 1; training the LSTM network model on the voiceprint feature vectors in the training set, and after training is complete, verifying the accuracy of the trained LSTM network model on the validation set (i.e. verifying the accuracy of the triphone feature vectors output by the LSTM network model relative to the triphone feature vectors corresponding to the voice data samples); if the accuracy is greater than a preset accuracy (for example, 0.985), ending training and taking the trained LSTM network model as the first model; otherwise, if the accuracy is less than or equal to the preset accuracy, increasing the number of voice data samples and re-training on the increased samples until the accuracy exceeds the preset accuracy.
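The split/train/verify/enlarge loop described above is model-agnostic, so it can be sketched with toy stand-ins for the model and the sample source; none of these helper names come from the patent, and the toy model exists only to make the loop runnable.

```python
import random

def train_until_accurate(samples, train_model, evaluate, get_more_samples,
                         train_frac=0.75, val_frac=0.20, target_acc=0.985):
    """Split samples, train, verify accuracy on the validation set, and enlarge
    the sample set and re-train whenever the accuracy is not high enough."""
    while True:
        random.shuffle(samples)
        n_train = int(len(samples) * train_frac)
        n_val = int(len(samples) * val_frac)       # train_frac + val_frac <= 1
        model = train_model(samples[:n_train])
        acc = evaluate(model, samples[n_train:n_train + n_val])
        if acc > target_acc:
            return model, acc                      # training ends
        samples = get_more_samples(samples)        # re-train on more samples

# Toy stand-ins: the "model" just predicts the majority label seen in training
def toy_train(train_set):
    labels = [y for _, y in train_set]
    return max(set(labels), key=labels.count)

def toy_eval(model, val_set):
    return sum(y == model for _, y in val_set) / len(val_set) if val_set else 0.0

def toy_more(samples):
    return samples + [(len(samples) + i, 1) for i in range(10)]

data = [(i, 1) for i in range(20)]
model, acc = train_until_accurate(data, toy_train, toy_eval, toy_more)
print(acc > 0.985)  # True
```

The same loop applies unchanged to the Gaussian mixture model training described later; only the train and evaluate callbacks differ.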
a second construction step: inputting the triphone feature vector into the pre-trained second model to construct the current voiceprint discrimination vector of the target user;

In this embodiment, the second model is preferably a Gaussian mixture model. The Gaussian mixture model computes on the triphone feature vector (i.e. the triphone feature matrix) and obtains the corresponding current voiceprint discrimination vector (i.e. the i-vector) of the voice data.
Specifically, the process includes:

1) Selecting Gaussian models: first, using the parameters in the second model, the likelihood log value of each frame of data under the different Gaussian models is calculated; each column of the likelihood log value matrix is sorted in parallel and the top N Gaussian models are chosen, finally obtaining, for each frame of data, a matrix of the top-N Gaussian models in the mixture:

Loglike = E(X) · D(X)^(-1) · X^T − 0.5 · D(X)^(-1) · (X^(·2))^T,

where Loglike is the likelihood log value matrix, E(X) is the mean matrix produced by training the second model, D(X) is the covariance matrix, X is the data matrix, and X^(·2) denotes squaring each element of the matrix.

The likelihood log value of a single frame is calculated as:

loglikes_i = C_i + E_i · Cov_i^(-1) · X_i − X_i^T · X_i · Cov_i^(-1),

where loglikes_i is the i-th row vector of the likelihood log value matrix, C_i is the constant term of the i-th model, E_i is the mean matrix of the i-th model, Cov_i is the covariance matrix of the i-th model, and X_i is the i-th frame of data.
2) Calculating the posterior probability: each frame of data X is multiplied as X · X^T to obtain a symmetric matrix, which is reduced to a lower triangular matrix whose elements are arranged in order into one row, giving one vector per frame whose dimension is the number of elements of the lower triangular matrix; the vectors of all frames are combined into a new data matrix. At the same time, the covariance matrices used for the probability calculation in the second model are likewise reduced to lower triangular matrices and arranged into a matrix of the same kind as the new data matrix. Using the mean matrix and covariance matrix in the second model, the likelihood log value of each frame of data under its selected Gaussian models is calculated; Softmax regression is then performed, and finally a normalization operation is carried out to obtain the posterior probability distribution of each frame over the Gaussian mixture; the probability distribution vectors of all frames form the probability matrix.
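The likelihood-log-value, Softmax, and normalization sequence can be sketched for a diagonal-covariance mixture. This is a generic formulation of the posterior computation, not the patent's lower-triangular-matrix optimization, and all dimensions are illustrative.

```python
import numpy as np

def frame_posteriors(X, means, variances, weights):
    """Posterior probability of each diagonal-covariance Gaussian for each frame.
    X: (T, d); means, variances: (N, d); weights: (N,)."""
    T, d = X.shape
    # log N(x | mu, sigma^2) for every (frame, component) pair
    log_det = np.sum(np.log(variances), axis=1)                    # (N,)
    diff2 = (X[:, None, :] - means[None, :, :]) ** 2 / variances   # (T, N, d)
    loglikes = (np.log(weights) - 0.5 * (d * np.log(2 * np.pi) + log_det)
                - 0.5 * diff2.sum(axis=2))                         # (T, N)
    # Softmax over components normalizes each frame's likelihood log values
    m = loglikes.max(axis=1, keepdims=True)
    p = np.exp(loglikes - m)
    return p / p.sum(axis=1, keepdims=True)        # rows = posterior distributions

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))
post = frame_posteriors(X, means=rng.normal(size=(5, 3)),
                        variances=np.ones((5, 3)), weights=np.full(5, 0.2))
print(post.shape)  # (4, 5)
```

Each row of the returned probability matrix sums to 1, corresponding to one frame's posterior distribution over the Gaussian mixture.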
3) Extracting the current voiceprint discrimination vector: first, the first-order and second-order coefficients are calculated. The first-order coefficients can be obtained by summing the probability matrix over its rows:

Gamma_i = sum_j loglikes_ji,

where Gamma_i is the i-th element of the first-order coefficient vector and loglikes_ji is the element in row j, column i of the likelihood log value matrix.

The second-order coefficients can be obtained by multiplying the transpose of the probability matrix by the data matrix:

X = Loglike^T · feats,

where X is the second-order coefficient matrix, Loglike is the likelihood log value matrix, and feats is the feature data matrix.

After the first-order and second-order coefficients are calculated, the linear term and the quadratic term are computed in parallel, and the current voiceprint discrimination vector is calculated from them:

i-vector = quadratic^(-1) · linear.
Here the quadratic term is accumulated, for each model i in the second model, from the mean matrix M_i of the i-th model, the covariance matrix Σ_i of the i-th model, and the i-th row vector X_i of the second-order coefficient matrix; the linear term is accumulated from the first-order coefficient vector M together with the mean matrix M_i and covariance matrix Σ_i of each model i in the second model.
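Since the patent's exact quadratic- and linear-term formulas did not survive extraction, the closing step can be illustrated with the standard total-variability i-vector estimate, which has the same i-vector = quadratic^(-1) · linear shape. The total variability matrix `T_matrix` and all dimensions here are assumptions of the sketch, not parameters named by the patent.

```python
import numpy as np

def ivector_from_stats(post, X, means, variances, T_matrix):
    """Accumulate first/second-order terms from the posterior (probability)
    matrix and solve i = quadratic^(-1) * linear, total-variability style.
    post: (T, N); X: (T, d); means, variances: (N, d); T_matrix: (N*d, r)."""
    n_comp, d = means.shape
    r = T_matrix.shape[1]
    gamma = post.sum(axis=0)                       # first-order coefficients Gamma_i
    F = post.T @ X - gamma[:, None] * means        # centered second-order statistics
    quadratic = np.eye(r)
    linear = np.zeros(r)
    for i in range(n_comp):
        Ti = T_matrix[i*d:(i+1)*d, :]              # component block of T
        prec = Ti.T / variances[i]                 # T_i^T Sigma_i^{-1}
        quadratic += gamma[i] * prec @ Ti          # quadratic term accumulation
        linear += prec @ F[i]                      # linear term accumulation
    return np.linalg.solve(quadratic, linear)      # i-vector

rng = np.random.default_rng(2)
T_frames, n_comp, d, r = 6, 4, 3, 2
post = rng.dirichlet(np.ones(n_comp), size=T_frames)   # per-frame posteriors
iv = ivector_from_stats(post, rng.normal(size=(T_frames, d)),
                        rng.normal(size=(n_comp, d)), np.ones((n_comp, d)),
                        rng.normal(size=(n_comp * d, r)))
print(iv.shape)  # (2,)
```

The returned low-dimensional vector plays the role of the current voiceprint discrimination vector that is compared against the stored standard vector in the verification step.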
Preferably, the process of training the Gaussian mixture model includes: obtaining a preset number of voice data samples (for example, 100,000), each voice data sample corresponding to one voiceprint discrimination vector; extracting the preset type of voiceprint feature corresponding to each voice data sample, and building the voiceprint feature vector of each voice data sample from its preset type of voiceprint feature; inputting each built voiceprint feature vector into the pre-trained first model, determining the triphone feature corresponding to each frame of each voice data sample, and constructing the triphone feature vector of each voice data sample from all its triphone features; dividing all the constructed triphone feature vectors into a training set of a first percentage (for example, 75%) and a validation set of a second percentage (for example, 25%), the sum of the first and second percentages being less than or equal to 1; training the Gaussian mixture model on the triphone feature vectors in the training set, and after training is complete, verifying the accuracy of the trained Gaussian mixture model on the validation set (i.e. verifying the accuracy of the current voiceprint discrimination vectors output by the Gaussian mixture model relative to the current voiceprint discrimination vectors corresponding to the voice data samples); if the accuracy is greater than a preset accuracy (for example, 0.98), ending training and taking the trained Gaussian mixture model as the second model; otherwise, if the accuracy is less than or equal to the preset accuracy, increasing the number of voice data samples and re-training on the increased samples until the accuracy exceeds the preset accuracy.
Verification step: calculating the spatial distance between the current voiceprint discriminant vector and the pre-stored standard voiceprint discriminant vector of the target user, performing identity verification on the user based on the spatial distance, and generating a verification result.
In the present embodiment, there are many possible distances between vectors, including the cosine distance and the Euclidean distance; preferably, the spatial distance of the present embodiment is the cosine distance, which uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals.
The standard voiceprint discriminant vector is a voiceprint discriminant vector obtained and stored in advance; it is stored together with the identification information of its corresponding user and can accurately represent that user's identity. Before the spatial distance is calculated, the stored standard voiceprint discriminant vector is retrieved according to the identification information provided by the user.
When the calculated spatial distance is less than or equal to a preset distance threshold, the verification passes; otherwise, the verification fails.
In addition, the present embodiment can also be applied to anti-fraud identification, using the calculated spatial distance to identify whether the user is a blacklisted user, thereby improving safety.
Compared with the prior art, when performing identity verification or anti-fraud identification the present embodiment first extracts the voiceprint feature of the voice data and builds the corresponding voiceprint feature vector; inputs the voiceprint feature vector into the first model trained in advance to determine the triphone features of the voice data and build the corresponding triphone feature vector; inputs the triphone feature vector into the second model trained in advance to obtain the current voiceprint discriminant vector of the target user; and calculates the spatial distance between the current voiceprint discriminant vector and the pre-stored standard voiceprint discriminant vector of the target user, so as to verify the user's identity using that spatial distance. Since the present embodiment considers, in addition to each phoneme itself, the triphones that represent the phoneme context of the voice data when performing identity verification through speech recognition, the accuracy rate of identity verification can be improved, improving financial security.
In a preferred embodiment, on the basis of the embodiment of Fig. 1 above, the verification step specifically includes:
calculating the cosine distance between the current voiceprint discriminant vector and the pre-stored standard voiceprint discriminant vector of the target user: cos θ = (A · B) / (|A| |B|), where A is the standard voiceprint discriminant vector and B is the current voiceprint discriminant vector; if the cosine distance is less than or equal to a preset distance threshold, generating information that the verification passes; if the cosine distance is greater than the preset distance threshold, generating information that the verification fails.
In the present embodiment, the identification information of the target user can be carried when the standard voiceprint discriminant vector of the target user is stored. When verifying the user's identity, the corresponding standard voiceprint discriminant vector is obtained by matching against the identification information associated with the current voiceprint discriminant vector, and the cosine distance between the current voiceprint discriminant vector and the matched standard voiceprint discriminant vector is calculated; verifying the identity of the target user with the cosine distance improves the accuracy of identity verification.
As shown in Fig. 2, Fig. 2 is a flow diagram of an embodiment of the identity verification method of the present invention. The identity verification method includes the following steps:
Step S1: after the voice data of a target user awaiting identity verification is received, extracting the preset-type voiceprint feature of the voice data using a preset filter, and building the voiceprint feature vector corresponding to the voice data based on the preset-type voiceprint feature.
In the present embodiment, the voice data is collected by a voice capture device (for example, a microphone). When acquiring voice data, interference from ambient noise and from the voice capture device itself should be prevented as far as possible: the voice capture device is kept at a suitable distance from the target user, a capture device with large distortion is avoided where possible, mains power is preferably used and the supply current kept stable, and a sensor should be used when making telephone recordings. Before framing and sampling, the voice data may be subjected to noise processing to further reduce interference. So that the voiceprint feature of the voice data can be extracted, the acquired voice data is voice data of at least a preset data length.
Voiceprint features come in several types, such as broadband voiceprints, narrowband voiceprints and amplitude voiceprints. In the present embodiment, the preset-type voiceprint feature is preferably the Mel Frequency Cepstrum Coefficient (MFCC) of the voice sample data, and the preset filter is a Mel filter. When building the corresponding voiceprint feature vector, the voiceprint features of the voice sample data form a feature data matrix, and this feature data matrix is the voiceprint feature vector of the voice sample data.
Step S2: inputting the voiceprint feature vector into the first model trained in advance, determining the triphone feature corresponding to each frame of voice of the voice data, and constructing the triphone feature vector corresponding to all triphone features of the voice data.
In the present embodiment, a triphone state represents the phoneme of a speech frame itself plus its relationship with the preceding and following phonemes. The first model is preferably a Long Short-Term Memory (LSTM) network model; the LSTM model processes the voiceprint feature vector, obtains the triphone feature corresponding to each frame of voice and constructs the corresponding triphone feature vector.
A triphone feature is the probability of the triphone state represented by the corresponding speech frame. All triphone features of the voice data correspond to one triphone feature matrix; this triphone feature matrix corresponds to the input voiceprint feature vector (i.e. the voiceprint feature matrix) and constitutes the triphone feature vector corresponding to all triphone features of the voice data (i.e. the triphone feature matrix).
In a preferred embodiment, the LSTM model includes one input layer, three LSTM layers and one classification layer, as shown in Table 1 above; details are not repeated here.
In the table, Layer Name denotes the name of each layer and Batch Size denotes the number of input voice segments of the current layer; Input denotes the input layer; HLSTM (Highway Long Short-Term Memory) denotes a highway LSTM, a recurrent neural network based on memory-cell connections that introduces direct connections between adjacent memory cells, so that information inside the LSTM model can be passed directly between different layers, improving the efficiency of speech recognition and its recognition performance; Softmax denotes the Softmax classifier. After the voiceprint feature vector has passed in turn through each of the above layers of the LSTM model, the triphone feature corresponding to each frame of voice is obtained and the corresponding triphone feature vector is constructed.
In a preferred embodiment, the training process of the first model, an LSTM model, includes the following steps:
obtaining a preset number of voice data samples, each voice data sample corresponding to one triphone feature vector; extracting the preset-type voiceprint feature corresponding to each voice data sample, and building each sample's voiceprint feature vector based on that feature; dividing all constructed voiceprint feature vectors into a training set of a first percentage (for example, 75%) and a verification set of a second percentage (for example, 20%), the sum of the first percentage and the second percentage being less than or equal to 1; training the LSTM model with the voiceprint feature vectors in the training set, and after training verifying the accuracy rate of the trained LSTM model on the verification set (i.e. the accuracy of the triphone feature vectors output by the LSTM model relative to the triphone feature vectors corresponding to the voice data samples). If the accuracy rate exceeds a preset accuracy rate (for example, 0.985), training ends and the trained LSTM model is used as the first model; otherwise, if the accuracy rate is less than or equal to the preset accuracy rate, the number of voice data samples is increased and the model is retrained on the enlarged sample set until the accuracy rate exceeds the preset accuracy rate.
Step S3: inputting the triphone feature vector into the second model trained in advance to construct the current voiceprint discriminant vector of the target user.
In the present embodiment, the second model is preferably a Gaussian mixture model, which operates on the triphone feature vector (i.e. the triphone feature matrix) to obtain the corresponding current voiceprint discriminant vector (i.e. the i-vector) of the voice data.
Specifically, the construction process includes:
1) selecting Gaussian models: first, the likelihood logarithm of each frame of data under the different Gaussian models is calculated using the parameters of the second model; each column of the likelihood logarithm matrix is sorted in parallel and the top N Gaussian models are chosen, finally yielding a matrix of the N highest-scoring components of the Gaussian mixture model for each frame of data:
Loglike = E(X) * D(X)^{-1} * X^T − 0.5 * D(X)^{-1} * (X.^2)^T,
where Loglike is the likelihood logarithm matrix, E(X) is the mean matrix produced by training the second model, D(X) is the covariance matrix, X is the data matrix, and X.^2 squares each element of the matrix.
The per-frame likelihood logarithm is calculated as loglikes_i = C_i + E_i * Cov_i^{-1} * X_i − 0.5 * X_i^T * Cov_i^{-1} * X_i, where loglikes_i is the i-th row vector of the likelihood logarithm matrix, C_i is the constant term of the i-th model, E_i is the mean matrix of the i-th model, Cov_i is the covariance matrix of the i-th model, and X_i is the i-th frame of data.
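Step 1) can be sketched in numpy for the diagonal-covariance case. This is a hedged illustration under assumed shapes, not the patent's implementation; `consts` stands in for the per-Gaussian constant terms C_i.

```python
import numpy as np

def loglike_matrix(feats, means, inv_covs, consts):
    """Per-frame log-likelihood of each diagonal-covariance Gaussian.

    feats: (T, D) frames; means: (M, D) Gaussian means; inv_covs: (M, D)
    inverse diagonal covariances; consts: (M,) constant terms C_i.
    Implements loglikes[t, i] = C_i + E_i Cov_i^{-1} x_t - 0.5 x_t^T Cov_i^{-1} x_t.
    """
    quad = -0.5 * (feats ** 2) @ inv_covs.T      # -0.5 * sum_d x_d^2 / sigma_{i,d}^2
    lin = feats @ (means * inv_covs).T           # sum_d mu_{i,d} x_d / sigma_{i,d}^2
    return consts[None, :] + lin + quad          # (T, M) likelihood logarithm matrix

def top_n_gaussians(loglikes, n):
    """Indices of the N best-scoring Gaussians for every frame."""
    return np.argsort(loglikes, axis=1)[:, ::-1][:, :n]

rng = np.random.default_rng(0)
T, D, M = 5, 4, 8
feats = rng.normal(size=(T, D))
ll = loglike_matrix(feats, rng.normal(size=(M, D)), np.ones((M, D)), np.zeros(M))
sel = top_n_gaussians(ll, 3)    # top-3 Gaussian indices per frame
```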
2) calculating posterior probabilities: for each frame of data X, X * X^T is calculated, giving a symmetric matrix; this is reduced to its lower triangular part, whose elements are arranged in order into a single row, so that each frame becomes one vector whose length equals the number of lower-triangular elements. The vectors of all frames are combined into a new data matrix. At the same time, the covariance matrices used in the probability calculation in the second model are likewise reduced to lower triangular form, giving matrices of the same class as the new data matrix. Using the mean matrices and covariance matrices in the second model, the likelihood logarithm value of each frame of data under its selected Gaussian models is calculated; Softmax regression is then applied and a normalization operation is performed, yielding each frame's posterior probability distribution over the Gaussian mixture model. The per-frame probability distribution vectors are assembled into the probability matrix.
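The Softmax-plus-normalization step that turns per-frame log-likelihoods into the probability matrix can be sketched as follows (a minimal numpy illustration, not the patent's implementation):

```python
import numpy as np

def frame_posteriors(loglikes):
    """Softmax over each frame's log-likelihoods.

    loglikes: (T, M) likelihood logarithm matrix. Returns the (T, M)
    probability matrix; each row is one frame's posterior distribution
    over the mixture components and sums to 1 (the normalization step).
    """
    shifted = loglikes - loglikes.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted)
    return probs / probs.sum(axis=1, keepdims=True)

ll = np.array([[0.0, 1.0, 2.0],
               [3.0, 3.0, 3.0]])
post = frame_posteriors(ll)
```

The frame with equal log-likelihoods gets a uniform posterior, while the other frame concentrates mass on its highest-scoring component.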
3) extracting the current voiceprint discriminant vector: first the first-order and second-order coefficients are calculated. The first-order coefficients can be obtained by summing the columns of the probability matrix: Gamma_i = Σ_j loglikes_ji, where Gamma_i is the i-th element of the first-order coefficient vector and loglikes_ji is the element in row j, column i of the likelihood logarithm matrix.
The second-order coefficients can be obtained by multiplying the transpose of the probability matrix by the data matrix: X = Loglike^T * feats, where X is the second-order coefficient matrix, Loglike^T is the transpose of the probability matrix described above and feats is the feature data matrix.
After the first-order and second-order coefficients have been calculated, the linear term and the quadratic term are computed in parallel, and the current voiceprint discriminant vector is calculated from them: i-vector = quadratic^{-1} * linear.
Here quadratic = I + Σ_i Gamma_i M_i^T Σ_i^{-1} M_i, where M_i is the mean matrix of the i-th model in the second model, Σ_i is the covariance matrix of the i-th model and Gamma_i is the i-th element of the first-order coefficient vector; and linear = Σ_i M_i^T Σ_i^{-1} X_i^T, where X_i is the i-th row vector of the second-order coefficient matrix.
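Step 3) can be sketched end-to-end in numpy. This is a hedged illustration: it assumes diagonal covariances, reads the lost display formulas as the standard i-vector forms quadratic = I + Σ_i Gamma_i M_i^T Σ_i^{-1} M_i and linear = Σ_i M_i^T Σ_i^{-1} X_i, and uses made-up shapes throughout.

```python
import numpy as np

def extract_ivector(post, feats, M_mats, inv_covs):
    """i-vector = quadratic^{-1} * linear, as described in the text.

    post:     (T, G) probability matrix (per-frame posteriors)
    feats:    (T, D) feature data matrix
    M_mats:   (G, D, R) per-Gaussian mean/factor matrices M_i
    inv_covs: (G, D) inverse diagonal covariances Sigma_i^{-1}
    """
    gamma = post.sum(axis=0)                  # first-order coefficients Gamma_i
    X = post.T @ feats                        # second-order coefficient matrix (G, D)
    R = M_mats.shape[2]
    quadratic = np.eye(R)
    linear = np.zeros(R)
    for i, g in enumerate(gamma):
        Si_Mi = inv_covs[i][:, None] * M_mats[i]       # Sigma_i^{-1} M_i
        quadratic += g * (M_mats[i].T @ Si_Mi)         # + Gamma_i M_i^T Sigma_i^{-1} M_i
        linear += M_mats[i].T @ (inv_covs[i] * X[i])   # + M_i^T Sigma_i^{-1} X_i
    return np.linalg.solve(quadratic, linear)          # quadratic^{-1} * linear

rng = np.random.default_rng(1)
T, D, G, R = 20, 4, 3, 2
post = rng.random((T, G))
post /= post.sum(axis=1, keepdims=True)                # rows sum to 1
feats = rng.normal(size=(T, D))
ivec = extract_ivector(post, feats, rng.normal(size=(G, D, R)), np.ones((G, D)))
```

`np.linalg.solve` is used instead of an explicit matrix inverse; the quadratic term is positive definite here, so the solve is well conditioned.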
Preferably, the process of training the Gaussian mixture model includes:
obtaining a preset number (for example, 100,000) of voice data samples, each voice data sample corresponding to one voiceprint discriminant vector; extracting the preset-type voiceprint feature corresponding to each voice data sample, and building each sample's voiceprint feature vector based on that feature; inputting each constructed voiceprint feature vector into the first model trained in advance, determining the triphone feature corresponding to each frame of voice of each sample, and constructing the triphone feature vector corresponding to all triphone features of each sample; dividing all constructed triphone feature vectors into a training set of a first percentage (for example, 75%) and a verification set of a second percentage (for example, 25%), the sum of the first percentage and the second percentage being less than or equal to 1; training the Gaussian mixture model with the triphone feature vectors in the training set, and after training verifying the accuracy rate of the trained Gaussian mixture model on the verification set (i.e. the accuracy of the current voiceprint discriminant vectors output by the Gaussian mixture model relative to the voiceprint discriminant vectors corresponding to the voice data samples). If the accuracy rate exceeds a preset accuracy rate (for example, 0.98), training ends and the trained Gaussian mixture model is used as the second model; otherwise, if the accuracy rate is less than or equal to the preset accuracy rate, the number of voice data samples is increased and the model is retrained on the enlarged sample set until the accuracy rate exceeds the preset accuracy rate.
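The split-train-verify-retrain loop described above can be sketched generically. This is a hedged illustration: `train_fn`, `accuracy_fn` and `get_more` are caller-supplied stand-ins (the patent's actual accuracy check compares output i-vectors against reference voiceprint discriminant vectors, which is not reproduced here).

```python
import numpy as np

def split_train_verify(samples, train_fn, accuracy_fn,
                       first_pct=0.75, target=0.98,
                       get_more=None, max_rounds=5):
    """Split samples into training/verification sets, train, check the
    accuracy rate, and retrain on an enlarged sample set until the
    preset accuracy rate is exceeded (as described in the text)."""
    model = None
    for _ in range(max_rounds):
        n_train = int(len(samples) * first_pct)
        train, val = samples[:n_train], samples[n_train:]
        model = train_fn(train)
        if accuracy_fn(model, val) > target:
            return model                  # accuracy exceeds preset rate: done
        if get_more is None:
            break
        samples = get_more(samples)       # increase the number of samples
    return model

# Dummy stand-ins to exercise the loop: "accuracy" passes once the
# training set reaches 75 samples, forcing one enlargement round.
samples = np.zeros((50, 3))
train_fn = lambda t: ("gmm", len(t))
accuracy_fn = lambda m, v: 0.99 if m[1] >= 75 else 0.5
get_more = lambda s: np.concatenate([s, s])
model = split_train_verify(samples, train_fn, accuracy_fn, get_more=get_more)
```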
Step S4: calculating the spatial distance between the current voiceprint discriminant vector and the pre-stored standard voiceprint discriminant vector of the target user, performing identity verification on the user based on the spatial distance, and generating a verification result.
In the present embodiment, there are many possible distances between vectors, including the cosine distance and the Euclidean distance; preferably, the spatial distance of the present embodiment is the cosine distance, which uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals.
The standard voiceprint discriminant vector is a voiceprint discriminant vector obtained and stored in advance; it is stored together with the identification information of its corresponding user and can accurately represent that user's identity. Before the spatial distance is calculated, the stored standard voiceprint discriminant vector is retrieved according to the identification information provided by the user.
When the calculated spatial distance is less than or equal to a preset distance threshold, the verification passes; otherwise, the verification fails.
In addition, the present embodiment can also be applied to anti-fraud identification, using the calculated spatial distance to identify whether the user is a blacklisted user, thereby improving safety.
Compared with the prior art, when performing identity verification or anti-fraud identification the present embodiment first extracts the voiceprint feature of the voice data and builds the corresponding voiceprint feature vector; inputs the voiceprint feature vector into the first model trained in advance to determine the triphone features of the voice data and build the corresponding triphone feature vector; inputs the triphone feature vector into the second model trained in advance to obtain the current voiceprint discriminant vector of the target user; and calculates the spatial distance between the current voiceprint discriminant vector and the pre-stored standard voiceprint discriminant vector of the target user, so as to verify the user's identity using that spatial distance. Since the present embodiment considers, in addition to each phoneme itself, the triphones that represent the phoneme context of the voice data when performing identity verification through speech recognition, the accuracy rate of identity verification can be improved, improving financial security.
In a preferred embodiment, on the basis of the embodiment of Fig. 2 above, the step S4 specifically includes:
calculating the cosine distance between the current voiceprint discriminant vector and the pre-stored standard voiceprint discriminant vector of the target user: cos θ = (A · B) / (|A| |B|), where A is the standard voiceprint discriminant vector and B is the current voiceprint discriminant vector; if the cosine distance is less than or equal to a preset distance threshold, generating information that the verification passes; if the cosine distance is greater than the preset distance threshold, generating information that the verification fails.
In the present embodiment, the identification information of the target user can be carried when the standard voiceprint discriminant vector of the target user is stored. When verifying the user's identity, the corresponding standard voiceprint discriminant vector is obtained by matching against the identification information associated with the current voiceprint discriminant vector, and the cosine distance between the current voiceprint discriminant vector and the matched standard voiceprint discriminant vector is calculated; verifying the identity of the target user with the cosine distance improves the accuracy of identity verification.
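The threshold check in step S4 can be sketched as follows. This is a hedged illustration: it computes the cosine similarity from the formula above and treats 1 − cos θ as the "cosine distance" being compared to the threshold (the patent does not spell this convention out), and the threshold value 0.3 is an assumption.

```python
import numpy as np

def verify(current_vec, standard_vec, threshold=0.3):
    """Pass verification when the cosine distance between the current
    and standard voiceprint discriminant vectors is at or below the
    preset distance threshold (threshold value is illustrative)."""
    cos_sim = np.dot(current_vec, standard_vec) / (
        np.linalg.norm(current_vec) * np.linalg.norm(standard_vec))
    cos_dist = 1.0 - cos_sim          # assumed distance convention
    return bool(cos_dist <= threshold)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
same = verify(a, a)       # identical vectors: distance 0, passes
diff = verify(a, b)       # orthogonal vectors: distance 1, fails
```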
In a preferred embodiment, on the basis of the embodiment of Fig. 2 above, the step S1 specifically includes: performing pre-emphasis, framing and windowing on the voice data; performing a Fourier transform on each windowed frame to obtain the corresponding spectrum; inputting the spectrum into a Mel filter, which outputs the Mel spectrum; performing cepstral analysis on the Mel spectrum to obtain the Mel frequency cepstrum coefficients (MFCC); and forming the corresponding voiceprint feature vector based on the MFCC.
In the present embodiment, after the voice data of the user undergoing identity verification is received, the voice data is processed as follows. Pre-emphasis is in fact a high-pass filtering operation that removes low-frequency content so that the high-frequency characteristics of the voice data stand out; specifically, the transfer function of the high-pass filter is H(z) = 1 − αz^{-1}, where z is the voice data and α is a constant factor, preferably with a value of 0.97. Since a speech signal is only stationary over short periods, a segment of speech is divided into N short-time signals (i.e. N frames), and, to avoid losing the continuity characteristics of the sound, adjacent frames overlap in a repeated region that is generally 1/2 of the frame length. After framing, each frame is treated as a stationary signal; however, because of the Gibbs effect, the start and end of each frame are discontinuous, which after framing deviates further from the original speech, so windowing must be applied to the voice data.
Cepstral analysis consists, for example, of taking the logarithm and applying an inverse transform, the inverse transform generally being realized by a DCT (discrete cosine transform); the 2nd to 13th DCT coefficients are taken as the MFCC coefficients. The MFCC is the voiceprint feature of that frame of voice data; the MFCCs of every frame form a feature data matrix, and this feature data matrix is the voiceprint feature vector of the voice data.
The present embodiment forms the corresponding voiceprint feature vector from the Mel frequency cepstrum coefficients of the voice sample data. Because the Mel scale approximates the human auditory system more closely than the linearly spaced frequency bands used in the normal cepstrum, this can improve the accuracy of identity verification.
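The pipeline of step S1 (pre-emphasis, framing with 50% overlap, windowing, FFT, Mel filterbank, log, DCT, coefficients 2–13) can be sketched in numpy/scipy. This is a minimal illustration under assumed parameters (16 kHz sampling, 400-sample frames, 26 filters), not the patent's implementation.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, alpha=0.97, frame_len=400, hop=200,
         n_filters=26, n_coeffs=12):
    """Per-frame MFCCs: pre-emphasis H(z)=1-alpha*z^-1, half-overlapping
    Hamming-windowed frames, power spectrum, Mel filterbank, log, DCT,
    keeping the 2nd..13th coefficients."""
    emph = np.append(signal[0], signal[1:] - alpha * signal[:-1])  # pre-emphasis
    n_frames = 1 + (len(emph) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emph[idx] * np.hamming(frame_len)                     # windowing
    nfft = 512
    spec = np.abs(np.fft.rfft(frames, nfft)) ** 2                  # power spectrum
    # Triangular Mel filterbank between 0 and sr/2
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    mel_spec = np.log(spec @ fbank.T + 1e-10)                      # log Mel spectrum
    # cepstral analysis via DCT; keep coefficients 2..13
    return dct(mel_spec, type=2, axis=1, norm='ortho')[:, 1:1 + n_coeffs]

rng = np.random.default_rng(0)
feats = mfcc(rng.normal(size=16000))    # 1 s of noise -> (frames, 12) matrix
```

Stacking the per-frame rows of `feats` gives the feature data matrix that the text calls the voiceprint feature vector.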
The present invention also provides a computer-readable storage medium on which a processing system is stored; when the processing system is executed by a processor, the steps of the identity verification method described above are realized.
The serial numbers of the embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the embodiments can be realized by software plus the necessary general-purpose hardware platform, and naturally also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical scheme of the present invention, in essence the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk or optical disc) that includes instructions causing a terminal device (which may be a mobile phone, computer, server, air conditioner, network device or the like) to perform the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit the scope of the invention; every equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (10)
1. An electronic device, characterized in that the electronic device comprises a memory and a processor connected to the memory, the memory storing a processing system operable on the processor, the processing system realizing the following steps when executed by the processor:
an extraction step: after the voice data of a target user awaiting identity verification is received, extracting the preset-type voiceprint feature of the voice data using a preset filter, and building the voiceprint feature vector corresponding to the voice data based on the preset-type voiceprint feature;
a first construction step: inputting the voiceprint feature vector into a first model trained in advance, determining the triphone feature corresponding to each frame of voice of the voice data, and constructing the triphone feature vector corresponding to all triphone features of the voice data;
a second construction step: inputting the triphone feature vector into a second model trained in advance to construct the current voiceprint discriminant vector of the target user;
a verification step: calculating the spatial distance between the current voiceprint discriminant vector and the pre-stored standard voiceprint discriminant vector of the target user, performing identity verification on the user based on the spatial distance, and generating a verification result.
2. The electronic device according to claim 1, characterized in that the second model is a Gaussian mixture model and the training process of the second model comprises the following steps:
obtaining a preset number of voice data samples, each voice data sample corresponding to one voiceprint discriminant vector;
extracting the preset-type voiceprint feature corresponding to each voice data sample, and building the voiceprint feature vector corresponding to each voice data sample based on that feature;
inputting each constructed voiceprint feature vector into the first model trained in advance, determining the triphone feature corresponding to each frame of voice of each voice data sample, and constructing the triphone feature vector corresponding to all triphone features of each voice data sample;
dividing all constructed triphone feature vectors into a training set of a first percentage and a verification set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 1;
training the Gaussian mixture model with the triphone feature vectors in the training set, and after training verifying the accuracy rate of the trained Gaussian mixture model on the verification set;
if the accuracy rate is greater than a preset accuracy rate, ending training and using the trained Gaussian mixture model as the second model; or, if the accuracy rate is less than or equal to the preset accuracy rate, increasing the number of voice data samples and retraining based on the increased voice data samples.
3. The electronic device according to claim 1 or 2, characterized in that the verification step specifically comprises:
calculating the cosine distance between the current voiceprint discriminant vector and the pre-stored standard voiceprint discriminant vector of the target user: cos θ = (A · B) / (|A| |B|), where A is the standard voiceprint discriminant vector and B is the current voiceprint discriminant vector;
if the cosine distance is less than or equal to a preset distance threshold, generating information that the verification passes;
if the cosine distance is greater than the preset distance threshold, generating information that the verification fails.
4. The electronic device according to claim 1 or 2, characterized in that the first model is a long short-term memory network LSTM model and the training process of the first model comprises the following steps:
obtaining a preset number of voice data samples, each voice data sample corresponding to one triphone feature vector;
extracting the preset-type voiceprint feature corresponding to each voice data sample, and building the voiceprint feature vector corresponding to each voice data sample based on that feature;
dividing all constructed voiceprint feature vectors into a training set of a first percentage and a verification set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 1;
training the long short-term memory network LSTM model with the voiceprint feature vectors in the training set, and after training verifying the accuracy rate of the trained LSTM model on the verification set;
if the accuracy rate is greater than a preset accuracy rate, ending training and using the trained LSTM model as the first model; or, if the accuracy rate is less than or equal to the preset accuracy rate, increasing the number of voice data samples and retraining based on the increased voice data samples.
5. A method of identity verification, characterized in that the method comprises:
S1, after the voice data of a target user awaiting identity verification is received, extracting the preset-type voiceprint feature of the voice data using a preset filter, and building the voiceprint feature vector corresponding to the voice data based on the preset-type voiceprint feature;
S2, inputting the voiceprint feature vector into a first model trained in advance, determining the triphone feature corresponding to each frame of voice of the voice data, and constructing the triphone feature vector corresponding to all triphone features of the voice data;
S3, inputting the triphone feature vector into a second model trained in advance to construct the current voiceprint discriminant vector of the target user;
S4, calculating the spatial distance between the current voiceprint discriminant vector and the pre-stored standard voiceprint discriminant vector of the target user, performing identity verification on the user based on the spatial distance, and generating a verification result.
6. The identity authentication method according to claim 5, characterized in that the second model is a Gaussian mixture model, and the training process of the second model comprises the following steps:
obtaining a preset number of voice data samples, each voice data sample corresponding to one voiceprint identification vector;
extracting the preset-type voiceprint features of each voice data sample, and constructing each sample's corresponding voiceprint feature vector from those features;
inputting each constructed voiceprint feature vector into the pre-trained first model to determine the triphone feature corresponding to each voice frame of each sample, and constructing each sample's triphone feature vector from all of its triphone features;
dividing all constructed triphone feature vectors into a training set of a first percentage and a validation set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 1;
training the Gaussian mixture model on the triphone feature vectors in the training set, and verifying the accuracy of the trained Gaussian mixture model on the validation set after training is complete;
if the accuracy exceeds a preset accuracy, training ends and the trained Gaussian mixture model is used as the second model; otherwise, if the accuracy is less than or equal to the preset accuracy, the number of voice data samples is increased and the model is retrained on the enlarged sample set.
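The split–train–validate–retrain loop of this claim can be illustrated with a toy stand-in for the Gaussian mixture model. All names and numbers below are assumptions for illustration: `fit` stores one class mean per speaker rather than fitting a real mixture, and the 70/30 split and 0.9 target accuracy are arbitrary choices, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_samples(n_per_class):
    # hypothetical triphone feature vectors for two enrolled speakers,
    # drawn from two well-separated Gaussians
    x = np.concatenate([rng.normal(0.0, 1.0, (n_per_class, 4)),
                        rng.normal(4.0, 1.0, (n_per_class, 4))])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    idx = rng.permutation(len(x))
    return x[idx], y[idx]

def fit(train_x, train_y):
    # stand-in for Gaussian-mixture fitting: one class mean per speaker
    return {c: train_x[train_y == c].mean(axis=0) for c in np.unique(train_y)}

def accuracy(model, x, y):
    pred = np.array([min(model, key=lambda c: np.linalg.norm(v - model[c]))
                     for v in x])
    return float(np.mean(pred == y))

def train_until_accurate(target=0.9, n_per_class=5):
    while True:
        x, y = make_samples(n_per_class)
        cut = int(0.7 * len(x))          # first percentage: 70% training set
        model = fit(x[:cut], y[:cut])    # remaining 30%: validation set
        if accuracy(model, x[cut:], y[cut:]) > target:
            return model, len(x)
        n_per_class *= 2                 # accuracy too low: add samples, retrain

model, n_used = train_until_accurate()
print(len(model), n_used)
```

The key structural point mirrored from the claim is the stopping rule: training only terminates once validation accuracy exceeds the preset target, otherwise the sample set grows and training restarts.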
7. The identity authentication method according to claim 5 or 6, characterized in that step S4 specifically comprises:
calculating the cosine distance between the current voiceprint identification vector and the target user's stored standard voiceprint identification vector, d = 1 − (A · B) / (‖A‖ ‖B‖), where A is the standard voiceprint identification vector and B is the current voiceprint identification vector;
if the cosine distance is less than or equal to a preset distance threshold, generating verification-passed information;
if the cosine distance is greater than the preset distance threshold, generating verification-failed information.
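A minimal sketch of this distance check, assuming the standard 1 − cos θ definition of cosine distance; the 0.25 threshold and the example vectors are arbitrary illustrations, not values from the patent.

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cos(theta): 0 for identical directions, up to 2 for opposite ones
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def check(current, standard, threshold=0.25):
    # distance <= threshold -> verification passed, else verification failed
    return "passed" if cosine_distance(current, standard) <= threshold else "failed"

standard = [0.8, 0.1, 0.3]
print(check([0.79, 0.12, 0.31], standard))  # near-identical vector -> passed
print(check([-0.8, 0.5, -0.1], standard))   # dissimilar vector -> failed
```

Cosine distance compares only the directions of the two identification vectors, which is why it is a common choice for speaker vectors whose magnitudes vary with recording conditions.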
8. The identity authentication method according to claim 5 or 6, characterized in that the first model is a long short-term memory (LSTM) network model, and the training process of the first model comprises the following steps:
obtaining a preset number of voice data samples, each voice data sample corresponding to one triphone feature vector;
extracting the preset-type voiceprint features of each voice data sample, and constructing each sample's corresponding voiceprint feature vector from those features;
dividing all constructed voiceprint feature vectors into a training set of a first percentage and a validation set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 1;
training the LSTM network model on the voiceprint feature vectors in the training set, and verifying the accuracy of the trained LSTM network model on the validation set after training is complete;
if the accuracy exceeds a preset accuracy, training ends and the trained LSTM network model is used as the first model; otherwise, if the accuracy is less than or equal to the preset accuracy, the number of voice data samples is increased and the model is retrained on the enlarged sample set.
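For intuition about what the first model computes per frame, here is a single LSTM cell step in plain NumPy with random, untrained weights. The 13-dimensional input (matching one MFCC frame) and the 8-unit hidden state are illustrative choices, not values specified by the patent.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    # one LSTM time step: gate activations come from the current frame x
    # and the previous hidden state h; c is the cell (long-term) state
    z = W @ x + U @ h + b               # all four gate pre-activations, stacked
    n = len(h)
    i = sigmoid(z[:n])                  # input gate
    f = sigmoid(z[n:2 * n])             # forget gate
    o = sigmoid(z[2 * n:3 * n])         # output gate
    g = np.tanh(z[3 * n:])              # candidate cell update
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

d_in, d_hid = 13, 8                     # e.g. 13 cepstral features per frame
W = rng.normal(0.0, 0.1, (4 * d_hid, d_in))
U = rng.normal(0.0, 0.1, (4 * d_hid, d_hid))
b = np.zeros(4 * d_hid)

h = np.zeros(d_hid)
c = np.zeros(d_hid)
frames = rng.normal(0.0, 1.0, (20, d_in))   # 20 frames of toy features
for x in frames:
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)
```

In the claimed system the per-frame hidden state would feed a classification layer over triphone labels; training those weights is what the split/validate/retrain loop above governs.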
9. The identity authentication method according to claim 5 or 6, characterized in that step S1 specifically comprises:
performing pre-emphasis, framing, and windowing on the voice data, applying a Fourier transform to each windowed frame to obtain the corresponding spectrum, and feeding the spectrum into a Mel filter bank to output the Mel spectrum;
performing cepstral analysis on the Mel spectrum to obtain Mel-frequency cepstral coefficients (MFCC), and forming the corresponding voiceprint feature vector from the MFCCs.
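The S1 pipeline of this claim (pre-emphasis, framing, windowing, FFT, Mel filter bank, cepstral analysis) can be sketched end to end in NumPy. The sample rate, frame length, hop, filter count, and coefficient count below are common textbook defaults, not values specified by the patent.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13,
         frame_len=400, hop=160, pre=0.97):
    # pre-emphasis: boost high frequencies
    sig = np.append(signal[0], signal[1:] - pre * signal[:-1])
    # framing + Hamming window
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # power spectrum of each windowed frame
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # triangular Mel filter bank
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    mel_spec = np.log(spec @ fbank.T + 1e-10)   # log Mel spectrum
    # cepstral analysis: DCT-II of the log Mel energies -> MFCCs
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_mels)))
    return mel_spec @ dct.T                     # shape: (frames, n_ceps)

t = np.linspace(0, 1, 16000, endpoint=False)
feats = mfcc(np.sin(2 * np.pi * 440 * t))       # one second of a 440 Hz tone
print(feats.shape)
```

Each row of the result is one frame's MFCC vector; stacking or pooling these rows yields the voiceprint feature vector the claim feeds into the first model.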
10. A computer-readable storage medium, characterized in that a processing system is stored on the computer-readable storage medium, and the processing system, when executed by a processor, implements the steps of the identity authentication method according to any one of claims 5 to 9.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810030621.2A CN108154371A (en) | 2018-01-12 | 2018-01-12 | Electronic device, the method for authentication and storage medium |
PCT/CN2018/089461 WO2019136912A1 (en) | 2018-01-12 | 2018-06-01 | Electronic device, identity authentication method and system, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810030621.2A CN108154371A (en) | 2018-01-12 | 2018-01-12 | Electronic device, the method for authentication and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108154371A (en) | 2018-06-12 |
Family
ID=62461520
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810030621.2A Pending CN108154371A (en) | 2018-01-12 | 2018-01-12 | Electronic device, the method for authentication and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108154371A (en) |
WO (1) | WO2019136912A1 (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109360573A (en) * | 2018-11-13 | 2019-02-19 | 平安科技(深圳)有限公司 | Livestock method for recognizing sound-groove, device, terminal device and computer storage medium |
CN109378002A (en) * | 2018-10-11 | 2019-02-22 | 平安科技(深圳)有限公司 | Method, apparatus, computer equipment and the storage medium of voice print verification |
CN109493873A (en) * | 2018-11-13 | 2019-03-19 | 平安科技(深圳)有限公司 | Livestock method for recognizing sound-groove, device, terminal device and computer storage medium |
CN109800309A (en) * | 2019-01-24 | 2019-05-24 | 华中师范大学 | Classroom Discourse genre classification methods and device |
CN109982137A (en) * | 2019-02-22 | 2019-07-05 | 北京奇艺世纪科技有限公司 | Model generating method, video marker method, apparatus, terminal and storage medium |
CN111341325A (en) * | 2020-02-13 | 2020-06-26 | 平安科技(深圳)有限公司 | Voiceprint recognition method and device, storage medium and electronic device |
CN112243487A (en) * | 2018-06-14 | 2021-01-19 | 北京嘀嘀无限科技发展有限公司 | System and method for on-demand services |
CN113421573A (en) * | 2021-06-18 | 2021-09-21 | 马上消费金融股份有限公司 | Identity recognition model training method, identity recognition method and device |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112992156B (en) * | 2021-02-05 | 2022-01-04 | 浙江浙达能源科技有限公司 | Power distribution network dispatching identity authentication system based on voiceprint authentication |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5937381A (en) * | 1996-04-10 | 1999-08-10 | Itt Defense, Inc. | System for voice verification of telephone transactions |
CN106782564A (en) * | 2016-11-18 | 2017-05-31 | 百度在线网络技术(北京)有限公司 | Method and apparatus for processing speech data |
CN107068154A (en) * | 2017-03-13 | 2017-08-18 | 平安科技(深圳)有限公司 | The method and system of authentication based on Application on Voiceprint Recognition |
CN107331384A (en) * | 2017-06-12 | 2017-11-07 | 平安科技(深圳)有限公司 | Audio recognition method, device, computer equipment and storage medium |
CN107527620A (en) * | 2017-07-25 | 2017-12-29 | 平安科技(深圳)有限公司 | Electronic installation, the method for authentication and computer-readable recording medium |
History
- 2018-01-12: CN application CN201810030621.2A filed (publication CN108154371A, status pending)
- 2018-06-01: PCT application PCT/CN2018/089461 filed (publication WO2019136912A1)
Cited By (continued)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113421573B (en) * | 2021-06-18 | 2024-03-19 | 马上消费金融股份有限公司 | Identity recognition model training method, identity recognition method and device |
Also Published As
Publication number | Publication date |
---|---|
WO2019136912A1 (en) | 2019-07-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108154371A (en) | Electronic device, the method for authentication and storage medium | |
CN107993071A (en) | Electronic device, auth method and storage medium based on vocal print | |
TWI641965B (en) | Method and system of authentication based on voiceprint recognition | |
CN107610707A (en) | A kind of method for recognizing sound-groove and device | |
US11068571B2 (en) | Electronic device, method and system of identity verification and computer readable storage medium | |
CN102737633B (en) | Method and device for recognizing speaker based on tensor subspace analysis | |
CN108806695A (en) | Anti- fraud method, apparatus, computer equipment and the storage medium of self refresh | |
CN108281158A (en) | Voice biopsy method, server and storage medium based on deep learning | |
CN109378002A (en) | Method, apparatus, computer equipment and the storage medium of voice print verification | |
CN109584884A (en) | A kind of speech identity feature extractor, classifier training method and relevant device | |
CN110473552A (en) | Speech recognition authentication method and system | |
CN109378014A (en) | A kind of mobile device source discrimination and system based on convolutional neural networks | |
CN105096955A (en) | Speaker rapid identification method and system based on growing and clustering algorithm of models | |
DE102020133233A1 (en) | ENVIRONMENTAL CLASSIFIER FOR DETECTING LASER-BASED AUDIO IMPACT ATTACKS | |
CN110556126A (en) | Voice recognition method and device and computer equipment | |
CN108091326A (en) | A kind of method for recognizing sound-groove and system based on linear regression | |
CN106991312B (en) | Internet anti-fraud authentication method based on voiceprint recognition | |
CN108650266B (en) | Server, voiceprint verification method and storage medium | |
CN108694952A (en) | Electronic device, the method for authentication and storage medium | |
CN111161713A (en) | Voice gender identification method and device and computing equipment | |
CN113112992B (en) | Voice recognition method and device, storage medium and server | |
CN108630208A (en) | Server, auth method and storage medium based on vocal print | |
CN116665649A (en) | Synthetic voice detection method based on prosody characteristics | |
Zhang et al. | A highly stealthy adaptive decay attack against speaker recognition | |
CN115631748A (en) | Emotion recognition method and device based on voice conversation, electronic equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180612 |