CN110010133A - Voiceprint detection method, device, equipment and storage medium based on short text - Google Patents
Voiceprint detection method, device, equipment and storage medium based on short text
- Publication number
- CN110010133A CN110010133A CN201910167882.3A CN201910167882A CN110010133A CN 110010133 A CN110010133 A CN 110010133A CN 201910167882 A CN201910167882 A CN 201910167882A CN 110010133 A CN110010133 A CN 110010133A
- Authority
- CN
- China
- Prior art keywords
- voiceprint
- voice signal
- vector
- neural network
- deep neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L17/00 — Speaker identification or verification techniques
- G10L17/02 — Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04 — Training, enrolment or model building
- G10L17/18 — Artificial neural networks; connectionist approaches
- G10L25/18 — Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band
- G10L25/24 — Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
Abstract
The invention discloses a voiceprint detection method, device, equipment and storage medium based on short text. The method comprises: training a preset deep neural network with training samples; obtaining a voice signal to be identified; preprocessing the voice signal to be identified and performing feature extraction on the preprocessed voice signal to obtain Mel-frequency cepstral coefficients; feeding the Mel-frequency cepstral coefficients as input into the pre-trained deep neural network and obtaining the output vector of the last fully connected layer of the deep neural network as the voiceprint vector of the voice signal; comparing the voiceprint vector of the voice signal with the voiceprint vectors pre-stored in a voiceprint model library, and outputting a voiceprint detection result according to the comparison; wherein the training samples and the voice signal are short text. The invention solves the problems of existing voiceprint detection methods: lengthy voice signals, a large amount of sample information, and a high demand on computing resources.
Description
Technical field
The present invention relates to the field of information technology, and in particular to a voiceprint detection method, device, equipment and storage medium based on short text.
Background technique
Voiceprint detection is a common and effective method of identity verification. It can be applied to a range of scenarios that require authentication, such as network payment, voiceprint-based lock control, liveness authentication and IoT device verification, and it is especially useful for remote verification where video-based verification is inconvenient, since it is not limited by the device at all. During verification, performing a dual check with both content and voiceprint detection greatly raises the threshold for attacks and improves security. Currently used voiceprint detection methods include, but are not limited to, template matching, probabilistic models, artificial neural networks and the i-vector model. However, constrained by the structure of the models themselves, these methods find it difficult to complete text-dependent training with short text, so longer text carrying more features is generally used as the model input vector. Such voice signals are lengthy and carry many features, so the samples needed for training contain a large amount of information and occupy considerable computing resources.
Summary of the invention
Embodiments of the invention provide a voiceprint detection method, device, equipment and storage medium based on short text, to solve the problems of lengthy voice signals, large amounts of sample information and high computing-resource demands in existing voiceprint detection methods.
A voiceprint detection method based on short text, comprising:
obtaining training samples, and training a preset deep neural network with the training samples;
obtaining a voice signal to be identified;
preprocessing the voice signal to be identified, and performing feature extraction on the preprocessed voice signal to obtain Mel-frequency cepstral coefficients;
feeding the Mel-frequency cepstral coefficients as input into the pre-trained deep neural network, and obtaining the output vector of the last fully connected layer of the deep neural network as the voiceprint vector of the voice signal, wherein each element of the voiceprint vector represents a feature of the voice signal;
comparing the voiceprint vector of the voice signal with the voiceprint vectors pre-stored in a voiceprint model library, and outputting a voiceprint detection result according to the comparison;
wherein the training samples and the voice signal are short text.
Optionally, obtaining training samples and training the preset deep neural network with the training samples comprises:
obtaining the speech samples of multiple users as training samples;
preprocessing each user's training samples, and performing feature extraction on the preprocessed training samples to obtain Mel-frequency cepstral coefficients;
labelling each user's Mel-frequency cepstral coefficients with a user tag;
feeding the Mel-frequency cepstral coefficients with user tags as input vectors into the preset deep neural network for training;
using a preset loss function to compute the error between the deep neural network's recognition result for each Mel-frequency cepstral coefficient and the corresponding user tag, and modifying the parameters of the deep neural network according to the error;
feeding the Mel-frequency cepstral coefficients with user tags as input vectors into the parameter-modified deep neural network for the next training iteration, until the accuracy of the deep neural network's recognition results for the Mel-frequency cepstral coefficients reaches a specified threshold, then stopping the iteration.
Optionally, the deep neural network comprises an input layer, four fully connected layers and an output layer; each fully connected layer has 12 nodes and uses the maxout activation function, and the third and fourth fully connected layers are trained with a dropout strategy.
Optionally, comparing the voiceprint vector of the voice signal with the voiceprint vectors pre-stored in the voiceprint model library and outputting a voiceprint detection result according to the comparison comprises:
comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library;
if a pre-stored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, obtaining the user information corresponding to that pre-stored voiceprint vector and outputting the user information;
if no pre-stored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, outputting a prompt that detection has failed.
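The comparison step above can be sketched as follows. The claim speaks of an "identical" pre-stored vector; since voiceprint vectors are real-valued embeddings, a practical implementation would typically relax this to a similarity threshold. The cosine-similarity measure and the threshold value below are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def match_voiceprint(query, model_library, threshold=0.85):
    """Compare a query voiceprint vector against a library of
    pre-stored vectors (user_id -> vector).

    Returns the best-matching user_id, or None (detection failure)
    if no stored vector is similar enough. Cosine similarity and
    the 0.85 threshold are assumed conventions, not patent values.
    """
    best_user, best_score = None, -1.0
    for user_id, stored in model_library.items():
        # cosine similarity between the query and the stored vector
        score = np.dot(query, stored) / (
            np.linalg.norm(query) * np.linalg.norm(stored))
        if score > best_score:
            best_user, best_score = user_id, score
    return best_user if best_score >= threshold else None
```

A library lookup then either yields the matched user's information or reports a detection failure, mirroring the two output branches above.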
Optionally, preprocessing the voice signal to be identified and performing feature extraction on the preprocessed voice signal to obtain Mel-frequency cepstral coefficients comprises:
performing framing on the waveform of the voice signal to be identified;
after framing, applying a window to each frame;
performing a discrete Fourier transform on each windowed frame to obtain the spectrum of that frame;
computing the power spectrum of the voice signal from the spectra of all frames;
applying a Mel filterbank to the power spectrum;
taking the logarithm of the output of each Mel filter to obtain log energies;
performing a discrete cosine transform on the log energies to obtain the Mel-frequency cepstral coefficients of the voice signal.
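A minimal sketch of the extraction chain above (framing, windowing, DFT power spectrum, Mel filterbank, log energy, DCT). The parameter choices — 25 ms frames, 10 ms hop, 26 filters, 13 coefficients, 512-point FFT — are common defaults assumed for illustration; the patent does not specify them:

```python
import numpy as np

def mfcc(signal, sample_rate=16000, frame_ms=25, hop_ms=10,
         n_filters=26, n_ceps=13):
    """Minimal MFCC extraction following the steps above.
    Assumes signal is at least one frame long."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_fft = 512
    # 1. framing: cut the waveform into fixed-length segments
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    # 2. windowing: smooth each frame with a Hamming window
    frames = frames * np.hamming(frame_len)
    # 3. DFT of each frame -> power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. triangular Mel filterbank applied to the power spectrum
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 5. log energy of each filter's output
    feat = np.log(power @ fbank.T + 1e-10)
    # 6. DCT-II keeps the first n_ceps cepstral coefficients
    n = feat.shape[1]
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                    2 * np.arange(n) + 1) / (2 * n))
    return feat @ basis.T
```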
A voiceprint detection device based on short text, comprising:
a training module, for obtaining training samples and training a preset deep neural network with the training samples;
a signal acquisition module, for obtaining a voice signal to be identified;
a feature extraction module, for preprocessing the voice signal to be identified and performing feature extraction on the preprocessed voice signal to obtain Mel-frequency cepstral coefficients;
a feature obtaining module, for feeding the Mel-frequency cepstral coefficients as input into the pre-trained deep neural network and obtaining the output vector of the last fully connected layer of the deep neural network as the voiceprint vector of the voice signal, wherein each element of the voiceprint vector represents a feature of the voice signal;
a detection module, for comparing the voiceprint vector of the voice signal with the voiceprint vectors pre-stored in a voiceprint model library and outputting a voiceprint detection result according to the comparison;
wherein the training samples and the voice signal are short text.
Optionally, the detection module comprises:
a comparing unit, for comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library;
a first result output unit, for obtaining and outputting the user information corresponding to a pre-stored voiceprint vector when a pre-stored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library;
a second result output unit, for outputting a prompt that detection has failed when no pre-stored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library.
Optionally, the deep neural network comprises an input layer, four fully connected layers and an output layer; each fully connected layer has 12 nodes and uses the maxout activation function, and the third and fourth fully connected layers are trained with a dropout strategy.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the above voiceprint detection method based on short text when executing the computer program.
A computer-readable storage medium storing a computer program, wherein the computer program implements the above voiceprint detection method based on short text when executed by a processor.
In embodiments of the invention, a deep neural network suitable for short text is designed in advance and trained with short-text training samples; a voice signal to be identified is obtained, preprocessed, and feature extraction is performed on the preprocessed voice signal to obtain Mel-frequency cepstral coefficients; the Mel-frequency cepstral coefficients are fed as input into the pre-trained deep neural network, and the output vector of its last fully connected layer is taken as the voiceprint vector of the voice signal, each element of which represents a feature of the voice signal; that voiceprint vector is compared with the voiceprint vectors pre-stored in a voiceprint model library, and a voiceprint detection result is output according to the comparison; wherein the training samples and the voice signal are both short text. Voiceprint detection based on short text is thereby realized, greatly reducing the model's input vector and solving the problems of lengthy voice signals, large amounts of sample information and high computing-resource demands in existing voiceprint detection methods.
Detailed description of the invention
In order to illustrate the technical solution of the embodiments of the present invention more clearly, below by institute in the description to the embodiment of the present invention
Attached drawing to be used is needed to be briefly described, it should be apparent that, the accompanying drawings in the following description is only some implementations of the invention
Example, for those of ordinary skill in the art, without any creative labor, can also be according to these attached drawings
Obtain other attached drawings.
Fig. 1 is a flowchart of the voiceprint detection method based on short text in an embodiment of the invention;
Fig. 2 is a flowchart of step S101 of the voiceprint detection method based on short text in an embodiment of the invention;
Fig. 3 is a flowchart of step S103 of the voiceprint detection method based on short text in an embodiment of the invention;
Fig. 4 is a flowchart of step S105 of the voiceprint detection method based on short text in an embodiment of the invention;
Fig. 5 is a functional block diagram of the voiceprint detection device based on short text in an embodiment of the invention;
Fig. 6 is a schematic diagram of a computer device in an embodiment of the invention.
Specific embodiment
Following will be combined with the drawings in the embodiments of the present invention, and technical solution in the embodiment of the present invention carries out clear, complete
Site preparation description, it is clear that described embodiments are some of the embodiments of the present invention, instead of all the embodiments.Based on this hair
Embodiment in bright, every other implementation obtained by those of ordinary skill in the art without making creative efforts
Example, shall fall within the protection scope of the present invention.
The voiceprint detection method based on short text provided by embodiments of the invention is applied to a server. The server may be implemented as an independent server or as a server cluster composed of multiple servers. In one embodiment, as shown in Fig. 1, a voiceprint detection method based on short text is provided, comprising the following steps:
In step S101, training samples are obtained, and a preset deep neural network is trained with the training samples.
Here, the embodiment of the invention redesigns a deep neural network suitable for short text. The deep neural network comprises an input layer, four fully connected layers and an output layer; each fully connected layer has 12 nodes and uses the maxout activation function, and the third and fourth fully connected layers are trained with a dropout strategy. In this way the deep neural network is not limited by model structure and can take short text as training samples and input vectors, reducing the demand on data. Here, short text means a voice signal of short length, for example the voice signal of a single sentence. Optionally, short text may be defined by a specified length, e.g. a voice signal no longer than the specified length. The embodiment of the invention collects the speech samples of multiple users as training samples and trains the preset deep neural network on them. Optionally, as shown in Fig. 2, step S101 comprises:
In step S201, the speech samples of multiple users are obtained as training samples.
In this embodiment, the speech samples corresponding to multiple users may be collected in advance for the concrete application scenario; for example, each user's speech samples may be collected through channels such as specialized knowledge bases and network databases and used as training samples.
In step S202, each user's training samples are preprocessed, and feature extraction is performed on the preprocessed training samples to obtain MFCC features.
Here, MFCC (Mel-Frequency Cepstral Coefficients) features are a discriminative component of the voice signal: cepstral parameters extracted in the Mel-scale frequency domain. These parameters take into account the human ear's perception of different frequencies and are especially suitable for speech recognition and speaker identification. The embodiment of the invention designs the deep neural network around MFCC features and uses them as the input of the deep neural network. Before training the deep neural network, the user samples are first preprocessed and feature-extracted to obtain the corresponding MFCC features. Preprocessing and feature extraction of the user's training samples are identical to step S103; refer to the description of step S103, which is not repeated here.
By performing feature extraction on the preprocessed training samples, the embodiment of the invention obtains one group of 128-dimensional MFCC features for each training sample. The 128-dimensional MFCC features serve as the input vector of the deep neural network.
In step S203, the MFCC features of each user are labelled with a user tag.
In embodiments of the invention, the user tag identifies the speaker to whom the MFCC features belong; different users' MFCC features receive different user tags. Before training the deep neural network, each user's 128-dimensional MFCC features must be labelled with the corresponding user tag. For ease of understanding, an example follows. Suppose there are three users: user 1, user 2 and user 3. Through step S203, user 1's MFCC features are tagged "01", user 2's MFCC features are tagged "02", and user 3's MFCC features are tagged "03". It should be understood that this is only an example and does not limit the invention; in other embodiments the user tag may take other forms.
In step S204, the MFCC features with user tags are fed as input vectors into the preset deep neural network for training.
During training, for each user, the 128-dimensional MFCC features with the same user tag are fed as one input vector into the preset deep neural network for training, yielding the recognition result for that user.
Here, the preset deep neural network comprises an input layer, four fully connected layers and an output layer. Each fully connected layer has 12 nodes and uses the maxout activation function. The output of hidden-layer node i is:
h_i(x) = max_{j in [1,k]} z_{i,j}, where z_{i,j} = x^T W_{..ij} + b_{i,j}
In the formula, b denotes the bias, and W denotes the three-dimensional parameter tensor of size d × m × k, where d is the number of input-layer nodes, m is the number of hidden-layer nodes, and k is the number of implicit hidden nodes attached to each hidden-layer node; the k implicit hidden nodes all have linear outputs.
Each node of the maxout activation function takes the maximum of the k implicit hidden nodes' output values. In embodiments of the invention, the number of nodes m of each fully connected layer is 12; each of the 12 nodes takes the maximum of the k implicit hidden node outputs produced by the maxout activation, and the 12 maxima are combined into the output vector of the fully connected layer. By using the maxout activation function, the fully connected layers of the deep neural network perform a nonlinear transformation.
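A minimal sketch of one such maxout fully connected layer, with W as the d × m × k parameter tensor and b the bias described above; the exact tensor layout is an assumption consistent with the description:

```python
import numpy as np

def maxout_layer(x, W, b):
    """Maxout fully connected layer.

    x : input vector of size d
    W : parameter tensor of size (d, m, k) -- d inputs, m output
        nodes, k implicit linear hidden nodes per output node
    b : bias of size (m, k)

    Each of the m output nodes computes k linear responses
    z_{i,j} = x^T W_{..ij} + b_{i,j} and emits the maximum,
    giving a piecewise-linear nonlinearity.
    """
    z = np.einsum('d,dmk->mk', x, W) + b  # k linear outputs per node
    return z.max(axis=1)                  # max over the k pieces
```

With m = 12, as in the embodiment, the layer's output is the 12-dimensional vector of per-node maxima.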
Further, in embodiments of the invention, the deep neural network comprises four fully connected layers, denoted the first, second, third and fourth fully connected layers. During training, the MFCC features with user tags first pass through the first fully connected layer; the output vector of the first fully connected layer becomes the input vector of the second, the output vector of the second becomes the input vector of the third, the output vector of the third becomes the input vector of the fourth, and the output vector of the fourth fully connected layer becomes the input of the output layer. When training the third and fourth fully connected layers, the embodiment of the invention uses a dropout strategy. When the output vector of the second fully connected layer is passed into the third fully connected layer, elements are randomly dropped according to a preset first drop probability. It should be understood that dropping means these elements are "erased" from the network: in the current training pass, the erased elements do not participate. The maxout activation of the third fully connected layer is then applied to the remaining elements to produce the output vector of the third fully connected layer. Elements of that output vector are then randomly dropped according to a preset second drop probability, and the remaining elements are fed into the fourth fully connected layer for training. The first and second drop probabilities are set according to actual needs; in the embodiment of the invention both are preferably 0.5. Using the dropout strategy effectively weakens the co-adaptation between hidden-layer nodes and enhances generalization, preventing the deep neural network from over-fitting during training and improving the training effect.
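The random element-dropping described above can be sketched as follows. The "inverted" rescaling by 1/(1-p) is a common implementation convention assumed here so that no rescaling is needed at inference time; the patent itself only specifies the dropping:

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Randomly 'erase' elements of x with drop probability p.

    During training, each element is zeroed with probability p so it
    does not participate in the current pass; surviving elements are
    scaled by 1/(1-p) (inverted dropout, an assumed convention).
    At inference time the input passes through unchanged.
    """
    if not training:
        return x
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) >= p   # True = element survives
    return x * mask / (1.0 - p)
```

In the embodiment's pipeline this would be applied to the second layer's output before the third layer and to the third layer's output before the fourth, with p = 0.5.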
In step S205, a preset loss function is used to compute the error between the recognition result of each MFCC feature through the deep neural network and the corresponding user tag, and the parameters of the deep neural network are modified according to the error.
After the four fully connected layers, the output vector of the fourth fully connected layer serves as the input of the output layer. The output layer is a softmax layer, which classifies according to the output vector of the fourth fully connected layer and produces the recognition result of the MFCC features, i.e. the user that the deep neural network predicts the MFCC features belong to. As noted above, each fully connected layer uses the maxout activation function, which has a three-dimensional parameter tensor W and a bias b. After step S204 completes training on each MFCC feature and obtains the corresponding recognition result, the preset loss function is used to compute the error between the recognition result of each MFCC feature and the corresponding user tag, and the error is propagated back to modify the parameter tensor W and bias b of the maxout activation functions in the deep neural network. Optionally, the loss function includes but is not limited to the cross-entropy loss function and the quadratic loss function.
In step S206, the MFCC features with user tags are fed as input vectors into the parameter-modified deep neural network for the next training iteration, until the accuracy of the deep neural network's recognition results for the MFCC features reaches a specified threshold, at which point iteration stops.
After the parameters are modified in step S205, the next round of training feeds the tagged MFCC features as input vectors into the parameter-modified deep neural network again; the training process is identical to step S204, see the description above, and is not repeated here. Steps S204, S205 and S206 are iterated until the accuracy of the deep neural network's recognition results for all users' MFCC features reaches the specified threshold, i.e. the probability that the recognition result of each MFCC feature matches the corresponding user tag reaches the specified threshold. The parameters of the deep neural network are then considered adjusted into place, the deep neural network is deemed trained, and iteration stops.
The trained deep neural network can be used to extract voiceprint vectors from voice signals.
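The iterate-until-accuracy loop of steps S204-S206 can be sketched as below. A single softmax layer stands in for the four maxout layers so the loop stays readable; this is a schematic of the stopping rule (forward pass, loss gradient, parameter update, repeat until accuracy reaches the threshold), not the patent's actual network:

```python
import numpy as np

def train_until_accuracy(X, y, n_classes, threshold=0.99,
                         lr=0.5, max_iter=2000):
    """Repeat forward pass -> error -> parameter update until the
    recognition accuracy on the training features reaches the given
    threshold, then stop iterating (steps S204-S206 in miniature).
    X: (n_samples, n_features) feature matrix; y: integer labels."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(X.shape[1], n_classes))
    b = np.zeros(n_classes)
    for _ in range(max_iter):
        # forward pass: softmax over class scores
        logits = X @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # stopping rule: accuracy threshold reached -> stop iterating
        if (probs.argmax(axis=1) == y).mean() >= threshold:
            break
        # cross-entropy gradient w.r.t. logits: probs - one_hot(y)
        grad = probs.copy()
        grad[np.arange(len(y)), y] -= 1.0
        W -= lr * X.T @ grad / len(y)
        b -= lr * grad.sum(axis=0) / len(y)
    return W, b
```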
In step S102, a voice signal to be identified is obtained.
The voice signal to be identified is short text, i.e. a voice signal of short length, such as the voice signal of a single sentence, which reduces the demand on data. In each identification pass, the acquired voice signal to be identified should belong to one user to be identified. The voice signal to be identified may be a single voice signal or multiple voice signals.
In step s 103, the voice signal to be identified is pre-processed, and to the pretreated voice
Signal carries out feature extraction, obtains MFCC feature.
Before the deep neural network is used, feature extraction is first performed on the voice signal to be identified to obtain the corresponding MFCC features. Optionally, as shown in Figure 3, step S103 includes:
In step S301, framing is performed on the waveform of the voice signal to be identified.
Here, framing refers to cutting the waveform of a voice signal of indefinite length into segments of fixed length, usually 10-30 milliseconds per frame. A voice signal changes rapidly, whereas the Fourier transform is suited to analyzing stationary signals. Framing the waveform of the voice signal reduces the intensity of the side lobes after the Fourier transform and improves the quality of the obtained spectrum.
In step S302, after framing, windowing is performed on each frame signal.
The embodiment of the present invention performs windowing on each frame signal to smooth the voice signal. Optionally, a Hamming window may be used for smoothing; compared with a rectangular window function, the Hamming window strengthens the continuity between the left and right ends of the voice signal and can effectively attenuate the side-lobe intensity and spectral leakage after the Fourier transform.
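Steps S301 and S302 can be sketched together. A 25 ms frame with a 10 ms hop is assumed here (within the patent's 10-30 ms range); the function name and parameters are illustrative:

```python
import numpy as np

def frame_and_window(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Steps S301-S302: cut the waveform into fixed-length frames
    (10-30 ms; 25 ms here) and apply a Hamming window to each frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    frames = np.stack([signal[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)   # smooths the two ends of each frame
```

At a 16 kHz sampling rate this yields 400-sample frames advanced 160 samples at a time.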
In step S303, a discrete Fourier transform is performed on each windowed frame signal to obtain the spectrum corresponding to that frame.
Since the characteristics of a voice signal are difficult to discern from its variation in the time domain, the voice signal needs to be converted into an energy distribution in the frequency domain for observation; different energy distributions represent the characteristics of different speech. After each frame of the voice signal has been windowed, a discrete Fourier transform is performed to obtain the energy distribution of that frame on the spectrum. Performing a discrete Fourier transform on each framed and windowed frame signal yields the spectrum of each frame, and hence the spectrum of the voice signal.
In step S304, the power spectrum of the voice signal is calculated from the spectra corresponding to all frame signals.
After the discrete Fourier transform is completed, the resulting energy distribution is a frequency-domain signal. The energy of each frequency band differs in magnitude, and the energy spectra of different phonemes also differ, so the squared modulus of the spectrum of the voice signal is taken to obtain the power spectrum of the voice signal.
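A minimal sketch of steps S303 and S304, assuming a one-sided DFT and the common 1/N normalization of the power spectrum (the patent specifies only the squared modulus):

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """Steps S303-S304: one-sided discrete Fourier transform per frame,
    then the squared modulus of the spectrum (normalised by n_fft)."""
    spectrum = np.fft.rfft(frames, n=n_fft)    # frequency-domain signal per frame
    return (np.abs(spectrum) ** 2) / n_fft     # power spectrum
```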
In step S305, the Mel filter bank is computed from the power spectrum.
Here, the Mel filter bank is a group of non-linearly distributed filters: densely distributed in the low-frequency region and sparsely distributed in the high-frequency region, which better matches the characteristics of human hearing. The embodiment of the present invention applies a bank of n triangular filters to the voice signal, i.e., the power spectrum of the voice signal is multiplied by the bank of n triangular filters, converting the power spectrum of the voice signal into an n-dimensional vector. Here, the triangular filters can eliminate harmonics and highlight the formants of the original voice signal, thereby reducing the amount of data.
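The bank of n triangular filters of step S305 can be sketched with the widely used Mel-scale construction; the 2595·log10(1 + f/700) mapping is a common convention and an assumption here, since the patent does not specify the mapping:

```python
import numpy as np

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Step S305: n triangular filters spaced uniformly on the Mel scale,
    hence dense at low frequencies and sparse at high frequencies."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):                    # rising edge
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):                   # falling edge
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def apply_filter_bank(power_spec, fbank):
    """Multiply the power spectrum by the n triangular filters,
    converting each frame into an n-dimensional vector."""
    return power_spec @ fbank.T
```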
In step S306, a logarithm operation is performed on the output of each Mel filter to obtain the logarithmic energy.
Each element of the n-dimensional vector obtained in step S305 is the output of one Mel filter in the Mel filter bank. The embodiment of the present invention further takes the logarithm of each element of the obtained n-dimensional vector to obtain the logarithmic energies output by the Mel filter bank, i.e., the log-mel filter bank energies. These logarithmic energies are used in the subsequent cepstral analysis.
In step S307, a discrete cosine transform is performed on the logarithmic energy to obtain the MFCC features of the voice signal.
Having obtained the logarithmic energy of the voice signal in step S306, the embodiment of the present invention performs a discrete cosine transform on the logarithmic energy and takes the coefficients of the lowest 128 dimensions of the output result as the MFCC features of the voice signal. Here, the output result of the discrete cosine transform has a good energy-compaction property: the larger values concentrate in the low-frequency part near the upper-left corner, while the rest of the output consists largely of values that are zero or close to zero. The embodiment of the present invention takes the values of the lowest 128 dimensions of the output result as the MFCC features, so as to further compress the data.
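Steps S306 and S307 amount to a logarithm followed by a type-II DCT. In the sketch below the default of 13 kept coefficients is purely illustrative, to keep the example small; the patent keeps the lowest 128 dimensions:

```python
import numpy as np

def mfcc_from_filter_bank(fbank_energies, n_ceps=13):
    """Steps S306-S307: logarithm of each Mel-filter output, then a
    type-II DCT, keeping only the lowest coefficients."""
    log_energy = np.log(fbank_energies + 1e-10)   # log-mel filter bank energies
    n = log_energy.shape[-1]
    k = np.arange(n)
    j = np.arange(n_ceps)
    basis = np.cos(np.pi / n * (k[None, :] + 0.5) * j[:, None])  # DCT-II basis
    return log_energy @ basis.T                   # lowest n_ceps coefficients
```

The energy-compaction property described above is why only the lowest coefficients need to be kept.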
The MFCC features do not depend on the nature of the signal, impose no restrictions on the input signal, have high robustness, conform to the auditory characteristics of the human ear, and still deliver good recognition performance when the signal-to-noise ratio decreases. Using the MFCC features as the acoustic features of the voice signal to be identified and passing them into the deep neural network for identification can therefore improve the recognition accuracy of the deep neural network.
In step S104, the MFCC features are passed as input into the pre-trained deep neural network, and the output vector of the deep neural network at the last fully connected layer is obtained as the voiceprint vector of the voice signal, each element of the voiceprint vector representing a feature of the voice signal.
After the MFCC features of the voice signal have been obtained, they are passed as input into the pre-trained deep neural network, and the deep neural network identifies the voice signal based on the MFCC features. Here, the pre-trained deep neural network includes four fully connected layers, each containing 12 nodes, and a 12-dimensional output vector is obtained through the maxout excitation function. Once the deep neural network has finished identifying the voice signal, the output vector of the neural network at the last fully connected layer is obtained as the d-vector of the voice signal. The d-vector is the voiceprint vector of the voice signal, and each element in it represents a voiceprint feature of the voice signal.
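The d-vector extraction of step S104 can be sketched as follows. The patent specifies only four fully connected layers of 12 nodes with the maxout excitation function; the number of maxout pieces, the input dimension and the weight layout are illustrative assumptions:

```python
import numpy as np

def maxout(x, W, b):
    """Maxout excitation: several linear pieces, element-wise maximum."""
    return np.max(np.einsum('i,pij->pj', x, W) + b, axis=0)

def d_vector(mfcc, layers):
    """Step S104 sketch: pass the MFCC feature through four fully connected
    maxout layers of 12 nodes each; the output of the last fully connected
    layer is the d-vector, i.e. the voiceprint vector."""
    h = mfcc
    for W, b in layers:       # each W has shape (pieces, in_dim, 12)
        h = maxout(h, W, b)
    return h                  # 12-dimensional voiceprint vector
```

During enrollment and detection alike, this 12-dimensional output is what gets stored in, or compared against, the voiceprint model library.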
In step S105, the voiceprint vector of the voice signal is compared with the voiceprint vectors prestored in the voiceprint model library, and a voiceprint detection result is output according to the comparison result.
Here, the voiceprint model library is configured as needed for the identity-authentication application scenario, such as online payment, voiceprint lock control or proof-of-life authentication. The voiceprint model library holds multiple prestored voiceprint vectors together with their corresponding user information. In a specific application scenario, the deep neural network is first used in advance to identify the users to be authenticated, extract their voiceprint vectors, and enter them into the voiceprint model library.
When voiceprint detection is performed, the voiceprint vector of the voice signal to be identified is compared with the voiceprint vectors prestored in the voiceprint model library, so as to perform speaker discrimination on the voice signal. Optionally, as shown in Figure 4, step S105 includes:
In step S401, the voiceprint vector of the voice signal is compared with the voiceprint vectors prestored in the voiceprint model library.
Here, the embodiment of the present invention compares the voiceprint vector of the voice signal with each voiceprint vector prestored in the voiceprint model library, judging whether the elements of the two are identical.
In step S402, if a prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, the user information corresponding to that prestored voiceprint vector is obtained and output.
If a prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, this shows that the speaker of the voice signal to be identified has already been entered into the voiceprint model library and that the voice signal belongs to a user authenticated in the voiceprint model library; the user information corresponding to the prestored voiceprint vector is obtained and output, completing the identification of the voice signal to be identified.
In step S403, if no prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, a prompt indicating detection failure is output.
If no prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, this shows that the speaker of the voice signal to be identified has not been entered into the voiceprint model library and that the voice signal does not belong to any user authenticated in the voiceprint model library; a prompt indicating verification failure is then output.
The short-text-based voiceprint detection method described in the embodiment of the present invention can be applied to a series of application scenarios that require identity authentication, such as online payment, voiceprint lock control and proof-of-life authentication, and can also be used for Internet-of-Things device verification. It is especially suitable for remote verification where video-image verification is inconvenient: it is not limited by equipment at all, identity can be confirmed over the phone, and the cost of remote verification can thus be greatly reduced.
In conclusion the embodiment of the present invention is suitable for the deep neural network of short text by redesigning in advance, then
Preset deep neural network is trained using the training sample of short text;When carrying out vocal print detection, obtain to be identified
Voice signal, the voice signal be short text;The voice signal to be identified is pre-processed, and to pretreatment after
The voice signal carry out feature extraction, obtain MFCC feature;The MFCC feature is passed to as input and is trained in advance
Deep neural network, obtain the deep neural network in the output vector of the full articulamentum of the last layer, as the voice
The vocal print vector of signal, the feature of voice signal described in each element representation in the vocal print vector;By the voice signal
Vocal print vector be compared with the vocal print vector that prestores in sound-groove model library, and according to comparison result export vocal print detection knot
Fruit;To realize the vocal print detection based on short text, the input vector of model is greatly reduced, solves existing vocal print inspection
Voice signal is tediously long in survey method, amounts of specimen information is big, the demanding problem of calculation resources.
It should be understood that the sequence numbers of the steps in the above embodiment do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and does not constitute any limitation on the implementation process of the embodiments of the present invention.
In one embodiment, a short-text-based voiceprint detection device is provided, and this short-text-based voiceprint detection device corresponds one-to-one to the short-text-based voiceprint detection method in the above embodiment. As shown in Figure 5, the short-text-based voiceprint detection device includes a training module, a signal acquisition module, a feature extraction module, a feature obtaining module and a detection module. The functional modules are described in detail as follows:
Training module 51, for obtaining training samples and training a preset deep neural network with the training samples;
Signal acquisition module 52, for obtaining the voice signal to be identified;
Feature extraction module 53, for pre-processing the voice signal to be identified and performing feature extraction on the pre-processed voice signal to obtain Mel-frequency cepstral coefficients;
Feature obtaining module 54, for passing the Mel-frequency cepstral coefficients as input into the pre-trained deep neural network and obtaining the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the voice signal, each element of the voiceprint vector representing a feature of the voice signal;
Detection module 55, for comparing the voiceprint vector of the voice signal with the voiceprint vectors prestored in the voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
wherein the training samples and the voice signal are short text.
Optionally, the training module 51 includes:
A sample acquisition unit, for obtaining voice samples of multiple users as training samples;
A feature extraction unit, for pre-processing the training samples of each user and performing feature extraction on the pre-processed training samples to obtain Mel-frequency cepstral coefficients;
A tag unit, for attaching a user tag to the Mel-frequency cepstral coefficients of each user;
A training unit, for passing the Mel-frequency cepstral coefficients with user tags as input vectors into the preset deep neural network for training;
A parameter modification unit, for calculating, using a preset loss function, the error between the recognition result of each Mel-frequency cepstral coefficient through the deep neural network and the corresponding user tag, and modifying the parameters of the deep neural network according to the error;
The training unit is further used for passing the Mel-frequency cepstral coefficients with user tags as input vectors into the parameter-modified deep neural network for the next iteration of training, until the accuracy of the deep neural network's recognition results for each Mel-frequency cepstral coefficient reaches the specified threshold, at which point iteration stops.
Optionally, the deep neural network includes an input layer, four fully connected layers and an output layer; each fully connected layer has a 12-dimensional input and uses the maxout excitation function, and the third and fourth fully connected layers are trained using the dropout strategy.
Optionally, the feature extraction module 53 includes:
A framing unit, for performing framing on the waveform of the voice signal to be identified;
A windowing unit, for performing windowing on each frame signal after framing;
A transform unit, for performing a discrete Fourier transform on each windowed frame signal to obtain the spectrum corresponding to that frame;
A spectrum calculation unit, for calculating the power spectrum of the voice signal from the spectra corresponding to all frame signals;
A filter bank calculation unit, for computing the Mel filter bank from the power spectrum;
A logarithm unit, for performing a logarithm operation on the output of each Mel filter to obtain the logarithmic energy;
A cosine transform unit, for performing a discrete cosine transform on the logarithmic energy to obtain the Mel-frequency cepstral coefficients of the voice signal.
Optionally, the detection module 55 includes:
A comparison unit, for comparing the voiceprint vector of the voice signal with the voiceprint vectors prestored in the voiceprint model library;
A first result output unit, for, if a prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, obtaining the user information corresponding to that prestored voiceprint vector and outputting the user information;
A second result output unit, for, if no prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, outputting a prompt indicating detection failure.
For the specific limitations of the short-text-based voiceprint detection device, refer to the limitations of the short-text-based voiceprint detection method above, which are not repeated here. Each module in the above short-text-based voiceprint detection device may be realized in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware form in, or be independent of, the processor in the computer equipment, or may be stored in software form in the memory of the computer equipment, so that the processor can invoke and execute the operations corresponding to the above modules.
In one embodiment, computer equipment is provided. The computer equipment may be a server, and its internal structure may be as shown in Figure 6. The computer equipment includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer equipment provides computing and control capability. The memory of the computer equipment includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer equipment is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, realizes a short-text-based voiceprint detection method.
In one embodiment, computer equipment is provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the following steps are performed:
Obtaining training samples, and training a preset deep neural network with the training samples;
Obtaining a voice signal to be identified;
Pre-processing the voice signal to be identified, and performing feature extraction on the pre-processed voice signal to obtain MFCC features;
Passing the MFCC features as input into the pre-trained deep neural network, and obtaining the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the voice signal, each element of the voiceprint vector representing a feature of the voice signal;
Comparing the voiceprint vector of the voice signal with the voiceprint vectors prestored in the voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
wherein the training samples and the voice signal are short text.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the following steps are performed:
Obtaining training samples, and training a preset deep neural network with the training samples;
Obtaining a voice signal to be identified;
Pre-processing the voice signal to be identified, and performing feature extraction on the pre-processed voice signal to obtain MFCC features;
Passing the MFCC features as input into the pre-trained deep neural network, and obtaining the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the voice signal, each element of the voiceprint vector representing a feature of the voice signal;
Comparing the voiceprint vector of the voice signal with the voiceprint vectors prestored in the voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
wherein the training samples and the voice signal are short text.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of each of the above methods. Any reference to memory, storage, a database or other media used in the embodiments provided by the present invention may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
It is apparent to those skilled in the art that, for convenience and conciseness of description, only the division of the above functional units and modules is illustrated by example; in practical application, the above functions can be assigned to different functional units and modules as needed, i.e., the internal structure of the device is divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are merely illustrative of the technical solutions of the present invention and are not limiting; although the present invention has been explained in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced with equivalents, and that these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention; they should all be included within the protection scope of the present invention.
Claims (10)
1. A voiceprint detection method based on short text, characterized by comprising:
obtaining training samples, and training a preset deep neural network with the training samples;
obtaining a voice signal to be identified;
pre-processing the voice signal to be identified, and performing feature extraction on the pre-processed voice signal to obtain Mel-frequency cepstral coefficients;
passing the Mel-frequency cepstral coefficients as input into the pre-trained deep neural network, and obtaining the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the voice signal, each element of the voiceprint vector representing a feature of the voice signal;
comparing the voiceprint vector of the voice signal with the voiceprint vectors prestored in a voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
wherein the training samples and the voice signal are short text.
2. The voiceprint detection method based on short text according to claim 1, characterized in that the obtaining of training samples and the training of the preset deep neural network with the training samples comprise:
obtaining voice samples of multiple users as training samples;
pre-processing the training samples of each user, and performing feature extraction on the pre-processed training samples to obtain Mel-frequency cepstral coefficients;
attaching a user tag to the Mel-frequency cepstral coefficients of each user;
passing the Mel-frequency cepstral coefficients with user tags as input vectors into the preset deep neural network for training;
calculating, using a preset loss function, the error between the recognition result of each Mel-frequency cepstral coefficient through the deep neural network and the corresponding user tag, and modifying the parameters of the deep neural network according to the error;
passing the Mel-frequency cepstral coefficients with user tags as input vectors into the parameter-modified deep neural network for the next iteration of training, until the accuracy of the deep neural network's recognition results for each Mel-frequency cepstral coefficient reaches a specified threshold, at which point iteration stops.
3. The voiceprint detection method based on short text according to claim 2, characterized in that the deep neural network comprises an input layer, four fully connected layers and an output layer; each fully connected layer has a 12-dimensional input and uses the maxout excitation function, and the third and fourth fully connected layers are trained using the dropout strategy.
4. The voiceprint detection method based on short text according to any one of claims 1 to 3, characterized in that the comparing of the voiceprint vector of the voice signal with the voiceprint vectors prestored in the voiceprint model library and the outputting of a voiceprint detection result according to the comparison result comprise:
comparing the voiceprint vector of the voice signal with the voiceprint vectors prestored in the voiceprint model library;
if a prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, obtaining the user information corresponding to that prestored voiceprint vector and outputting the user information;
if no prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, outputting a prompt indicating detection failure.
5. The voiceprint detection method based on short text according to any one of claims 1 to 3, characterized in that the pre-processing of the voice signal to be identified and the feature extraction on the pre-processed voice signal to obtain Mel-frequency cepstral coefficients comprise:
performing framing on the waveform of the voice signal to be identified;
after framing, performing windowing on each frame signal;
performing a discrete Fourier transform on each windowed frame signal to obtain the spectrum corresponding to that frame;
calculating the power spectrum of the voice signal from the spectra corresponding to all frame signals;
computing the Mel filter bank from the power spectrum;
performing a logarithm operation on the output of each Mel filter to obtain the logarithmic energy;
performing a discrete cosine transform on the logarithmic energy to obtain the Mel-frequency cepstral coefficients of the voice signal.
6. A voiceprint detection device based on short text, characterized by comprising:
a training module, for obtaining training samples and training a preset deep neural network with the training samples;
a signal acquisition module, for obtaining a voice signal to be identified;
a feature extraction module, for pre-processing the voice signal to be identified and performing feature extraction on the pre-processed voice signal to obtain Mel-frequency cepstral coefficients;
a feature obtaining module, for passing the Mel-frequency cepstral coefficients as input into the pre-trained deep neural network and obtaining the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the voice signal, each element of the voiceprint vector representing a feature of the voice signal;
a detection module, for comparing the voiceprint vector of the voice signal with the voiceprint vectors prestored in a voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
wherein the training samples and the voice signal are short text.
7. The voiceprint detection device based on short text according to claim 6, characterized in that the detection module comprises:
a comparison unit, for comparing the voiceprint vector of the voice signal with the voiceprint vectors prestored in the voiceprint model library;
a first result output unit, for, if a prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, obtaining the user information corresponding to that prestored voiceprint vector and outputting the user information;
a second result output unit, for, if no prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, outputting a prompt indicating detection failure.
8. The voiceprint detection device based on short text according to claim 6 or 7, characterized in that the deep neural network comprises an input layer, four fully connected layers and an output layer; each fully connected layer has a 12-dimensional input and uses the maxout excitation function, and the third and fourth fully connected layers are trained using the dropout strategy.
9. Computer equipment, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that, when the processor executes the computer program, the voiceprint detection method based on short text according to any one of claims 1 to 5 is realized.
10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the voiceprint detection method based on short text according to any one of claims 1 to 5 is realized.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910167882.3A CN110010133A (en) | 2019-03-06 | 2019-03-06 | Vocal print detection method, device, equipment and storage medium based on short text |
PCT/CN2019/117731 WO2020177380A1 (en) | 2019-03-06 | 2019-11-13 | Voiceprint detection method, apparatus and device based on short text, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110010133A true CN110010133A (en) | 2019-07-12 |
Family
ID=67166562
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910167882.3A Pending CN110010133A (en) | 2019-03-06 | 2019-03-06 | Vocal print detection method, device, equipment and storage medium based on short text |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110010133A (en) |
WO (1) | WO2020177380A1 (en) |
WO2020177380A1 (en) * | 2019-03-06 | 2020-09-10 | 平安科技(深圳)有限公司 | Voiceprint detection method, apparatus and device based on short text, and storage medium |
CN112071322A (en) * | 2020-10-30 | 2020-12-11 | 北京快鱼电子股份公司 | End-to-end voiceprint recognition method, device, storage medium and equipment |
CN112185347A (en) * | 2020-09-27 | 2021-01-05 | 北京达佳互联信息技术有限公司 | Language identification method, language identification device, server and storage medium |
CN112242137A (en) * | 2020-10-15 | 2021-01-19 | 上海依图网络科技有限公司 | Training of human voice separation model and human voice separation method and device |
CN112259114A (en) * | 2020-10-20 | 2021-01-22 | 网易(杭州)网络有限公司 | Voice processing method and device, computer storage medium and electronic equipment |
WO2021051608A1 (en) * | 2019-09-20 | 2021-03-25 | 平安科技(深圳)有限公司 | Voiceprint recognition method and device employing deep learning, and apparatus |
CN112562656A (en) * | 2020-12-16 | 2021-03-26 | 咪咕文化科技有限公司 | Signal classification method, device, equipment and storage medium |
CN112562691A (en) * | 2020-11-27 | 2021-03-26 | 平安科技(深圳)有限公司 | Voiceprint recognition method and device, computer equipment and storage medium |
CN112802481A (en) * | 2021-04-06 | 2021-05-14 | 北京远鉴信息技术有限公司 | Voiceprint verification method, voiceprint recognition model training method, device and equipment |
WO2021115176A1 (en) * | 2019-12-09 | 2021-06-17 | 华为技术有限公司 | Speech recognition method and related device |
CN113223536A (en) * | 2020-01-19 | 2021-08-06 | Tcl集团股份有限公司 | Voiceprint recognition method and device and terminal equipment |
CN113407768A (en) * | 2021-06-24 | 2021-09-17 | 深圳市声扬科技有限公司 | Voiceprint retrieval method, device, system, server and storage medium |
CN113470653A (en) * | 2020-03-31 | 2021-10-01 | 华为技术有限公司 | Voiceprint recognition method, electronic equipment and system |
CN114003885A (en) * | 2021-11-01 | 2022-02-01 | 浙江大学 | Intelligent voice authentication method, system and storage medium |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115223568A (en) * | 2022-06-29 | 2022-10-21 | 厦门快商通科技股份有限公司 | Identity verification method, device and system based on voiceprint recognition and storage medium |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10540979B2 (en) * | 2014-04-17 | 2020-01-21 | Qualcomm Incorporated | User interface for secure access to a device using speaker verification |
CN105788592A (en) * | 2016-04-28 | 2016-07-20 | 乐视控股(北京)有限公司 | Audio classification method and apparatus thereof |
CN107808664B (en) * | 2016-08-30 | 2021-07-30 | 富士通株式会社 | Sparse neural network-based voice recognition method, voice recognition device and electronic equipment |
CN107610707B (en) * | 2016-12-15 | 2018-08-31 | 平安科技(深圳)有限公司 | Voiceprint recognition method and device |
CN107527620B (en) * | 2017-07-25 | 2019-03-26 | 平安科技(深圳)有限公司 | Electronic device, identity authentication method and computer-readable storage medium |
CN110010133A (en) * | 2019-03-06 | 2019-07-12 | 平安科技(深圳)有限公司 | Voiceprint detection method, device, equipment and storage medium based on short text |
2019
- 2019-03-06: CN application CN201910167882.3A filed (published as CN110010133A); status: Pending
- 2019-11-13: WO application PCT/CN2019/117731 filed (published as WO2020177380A1); status: Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160372121A1 (en) * | 2015-06-17 | 2016-12-22 | Baidu Online Network Technology (Beijing) Co., Ltd. | Voiceprint authentication method and apparatus |
CN105869644A (en) * | 2016-05-25 | 2016-08-17 | 百度在线网络技术(北京)有限公司 | Deep learning based voiceprint authentication method and device |
CN108369813A (en) * | 2017-07-31 | 2018-08-03 | 深圳和而泰智能家居科技有限公司 | Specific sound recognition method, equipment and storage medium |
CN108417217A (en) * | 2018-01-11 | 2018-08-17 | 苏州思必驰信息科技有限公司 | Speaker identification network model training method, speaker identification method and system |
CN108877812A (en) * | 2018-08-16 | 2018-11-23 | 桂林电子科技大学 | Voiceprint recognition method, device and storage medium |
Non-Patent Citations (1)
Title |
---|
EHSAN VARIANI et al.: "Deep Neural Networks for Small Footprint Text-Dependent Speaker Verification", IEEE Xplore * |
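The sole non-patent citation is the d-vector paper, in which activations of the last hidden layer of a text-dependent DNN are averaged over all frames of an utterance and the normalized average serves as the speaker's "d-vector". A minimal sketch of that averaging-and-normalizing step follows; a random affine layer with ReLU stands in for the trained network, and all sizes and weights are hypothetical:

```python
import math
import random

def hidden_activations(frame, weights):
    # Stand-in for the last hidden layer of a trained DNN:
    # one affine projection per row, followed by ReLU.
    return [max(0.0, sum(w * x for w, x in zip(row, frame))) for row in weights]

def d_vector(frames, weights):
    # Average frame-level hidden activations over the utterance,
    # then length-normalize, as in the d-vector scheme.
    dim = len(weights)
    acc = [0.0] * dim
    for frame in frames:
        h = hidden_activations(frame, weights)
        for i in range(dim):
            acc[i] += h[i]
    acc = [a / len(frames) for a in acc]
    norm = math.sqrt(sum(a * a for a in acc)) or 1.0
    return [a / norm for a in acc]

# Hypothetical 4-dim input features, 6-dim hidden layer, 30-frame utterance.
random.seed(1)
weights = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]
utterance = [[random.gauss(0, 1) for _ in range(4)] for _ in range(30)]
dv = d_vector(utterance, weights)
print(len(dv))  # 6
```

At enrollment and test time, two such d-vectors would be compared (typically by cosine similarity) to accept or reject the speaker.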
Cited By (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020177380A1 (en) * | 2019-03-06 | 2020-09-10 | 平安科技(深圳)有限公司 | Voiceprint detection method, apparatus and device based on short text, and storage medium |
CN110751944A (en) * | 2019-09-19 | 2020-02-04 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for constructing voice recognition model |
WO2021051608A1 (en) * | 2019-09-20 | 2021-03-25 | 平安科技(深圳)有限公司 | Voiceprint recognition method and device employing deep learning, and apparatus |
CN110570871A (en) * | 2019-09-20 | 2019-12-13 | 平安科技(深圳)有限公司 | TristouNet-based voiceprint recognition method, device and equipment |
CN110880327A (en) * | 2019-10-29 | 2020-03-13 | 平安科技(深圳)有限公司 | Audio signal processing method and device |
CN110875043A (en) * | 2019-11-11 | 2020-03-10 | 广州国音智能科技有限公司 | Voiceprint recognition method and device, mobile terminal and computer readable storage medium |
CN110875043B (en) * | 2019-11-11 | 2022-06-17 | 广州国音智能科技有限公司 | Voiceprint recognition method and device, mobile terminal and computer readable storage medium |
CN111128234A (en) * | 2019-12-05 | 2020-05-08 | 厦门快商通科技股份有限公司 | Spliced voice recognition detection method, device and equipment |
WO2021115176A1 (en) * | 2019-12-09 | 2021-06-17 | 华为技术有限公司 | Speech recognition method and related device |
CN111462757A (en) * | 2020-01-15 | 2020-07-28 | 北京远鉴信息技术有限公司 | Data processing method and device based on voice signal, terminal and storage medium |
CN111462757B (en) * | 2020-01-15 | 2024-02-23 | 北京远鉴信息技术有限公司 | Voice signal-based data processing method, device, terminal and storage medium |
CN111227839B (en) * | 2020-01-19 | 2023-08-18 | 中国电子科技集团公司电子科学研究院 | Behavior recognition method and device |
CN113223536B (en) * | 2020-01-19 | 2024-04-19 | Tcl科技集团股份有限公司 | Voiceprint recognition method and device and terminal equipment |
CN113223536A (en) * | 2020-01-19 | 2021-08-06 | Tcl集团股份有限公司 | Voiceprint recognition method and device and terminal equipment |
CN111227839A (en) * | 2020-01-19 | 2020-06-05 | 中国电子科技集团公司电子科学研究院 | Behavior identification method and device |
CN111326161B (en) * | 2020-02-26 | 2023-06-30 | 北京声智科技有限公司 | Voiceprint determining method and device |
CN111326161A (en) * | 2020-02-26 | 2020-06-23 | 北京声智科技有限公司 | Voiceprint determination method and device |
CN111341320A (en) * | 2020-02-28 | 2020-06-26 | 中国工商银行股份有限公司 | Phrase voice voiceprint recognition method and device |
CN111341320B (en) * | 2020-02-28 | 2023-04-14 | 中国工商银行股份有限公司 | Phrase voice voiceprint recognition method and device |
CN111341307A (en) * | 2020-03-13 | 2020-06-26 | 腾讯科技(深圳)有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN111582020A (en) * | 2020-03-25 | 2020-08-25 | 平安科技(深圳)有限公司 | Signal processing method, signal processing device, computer equipment and storage medium |
CN113470653A (en) * | 2020-03-31 | 2021-10-01 | 华为技术有限公司 | Voiceprint recognition method, electronic equipment and system |
CN111583935A (en) * | 2020-04-02 | 2020-08-25 | 深圳壹账通智能科技有限公司 | Loan intelligent delivery method, device and storage medium |
CN111326163A (en) * | 2020-04-15 | 2020-06-23 | 厦门快商通科技股份有限公司 | Voiceprint recognition method, device and equipment |
CN111524522B (en) * | 2020-04-23 | 2023-04-07 | 上海依图网络科技有限公司 | Voiceprint recognition method and system based on fusion of multiple voice features |
CN111524522A (en) * | 2020-04-23 | 2020-08-11 | 上海依图网络科技有限公司 | Voiceprint recognition method and system based on fusion of multiple voice features |
CN111488947B (en) * | 2020-04-28 | 2024-02-02 | 深圳力维智联技术有限公司 | Fault detection method and device for power system equipment |
CN111488947A (en) * | 2020-04-28 | 2020-08-04 | 深圳力维智联技术有限公司 | Fault detection method and device for power system equipment |
CN112185347A (en) * | 2020-09-27 | 2021-01-05 | 北京达佳互联信息技术有限公司 | Language identification method, language identification device, server and storage medium |
CN112242137A (en) * | 2020-10-15 | 2021-01-19 | 上海依图网络科技有限公司 | Training of human voice separation model and human voice separation method and device |
CN112242137B (en) * | 2020-10-15 | 2024-05-17 | 上海依图网络科技有限公司 | Training of human voice separation model and human voice separation method and device |
CN112259114A (en) * | 2020-10-20 | 2021-01-22 | 网易(杭州)网络有限公司 | Voice processing method and device, computer storage medium and electronic equipment |
CN112071322A (en) * | 2020-10-30 | 2020-12-11 | 北京快鱼电子股份公司 | End-to-end voiceprint recognition method, device, storage medium and equipment |
CN112562691A (en) * | 2020-11-27 | 2021-03-26 | 平安科技(深圳)有限公司 | Voiceprint recognition method and device, computer equipment and storage medium |
CN112562656A (en) * | 2020-12-16 | 2021-03-26 | 咪咕文化科技有限公司 | Signal classification method, device, equipment and storage medium |
CN112802481A (en) * | 2021-04-06 | 2021-05-14 | 北京远鉴信息技术有限公司 | Voiceprint verification method, voiceprint recognition model training method, device and equipment |
CN113407768A (en) * | 2021-06-24 | 2021-09-17 | 深圳市声扬科技有限公司 | Voiceprint retrieval method, device, system, server and storage medium |
CN113407768B (en) * | 2021-06-24 | 2024-02-02 | 深圳市声扬科技有限公司 | Voiceprint retrieval method, voiceprint retrieval device, voiceprint retrieval system, voiceprint retrieval server and storage medium |
CN114003885A (en) * | 2021-11-01 | 2022-02-01 | 浙江大学 | Intelligent voice authentication method, system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2020177380A1 (en) | 2020-09-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110010133A (en) | Voiceprint detection method, device, equipment and storage medium based on short text | |
US20200321008A1 (en) | Voiceprint recognition method and device based on memory bottleneck feature | |
Balamurali et al. | Toward robust audio spoofing detection: A detailed comparison of traditional and learned features | |
CN107610707B (en) | Voiceprint recognition method and device | |
CN111276131B (en) | Multi-class acoustic feature integration method and system based on deep neural network | |
CN110378228A (en) | Face-verification video data processing method, device, computer equipment and storage medium | |
CN109346086A (en) | Voiceprint recognition method, device, computer equipment and computer-readable storage medium | |
DE69831076T2 (en) | Method and device for speech analysis and synthesis using all-pass chain filters | |
CN110232932A (en) | Speaker identification method, device, equipment and medium based on residual time-delay network | |
WO2017218465A1 (en) | Neural network-based voiceprint information extraction method and apparatus | |
CN109461073A (en) | Risk management method, device, computer equipment and storage medium based on intelligent recognition | |
CN108922544A (en) | General vector training method, voice clustering method, device, equipment and medium | |
CN109257362A (en) | Voiceprint verification method, apparatus, computer equipment and storage medium | |
CN109065027A (en) | Speech differentiation model training method, device, computer equipment and storage medium | |
CN108922543A (en) | Model library construction method, speech recognition method, device, equipment and medium | |
CN110246503A (en) | Blacklist voiceprint database construction method, device, computer equipment and storage medium | |
CN110570870A (en) | Text-independent voiceprint recognition method, device and equipment | |
CN109448732A (en) | Digit string processing method and device | |
Karthikeyan | Adaptive boosted random forest-support vector machine based classification scheme for speaker identification | |
Alashban et al. | Speaker gender classification in mono-language and cross-language using BLSTM network | |
Mandalapu et al. | Multilingual voice impersonation dataset and evaluation | |
Pan et al. | Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection | |
CN116778910A (en) | Voice detection method | |
Mallikarjunan et al. | Text-independent speaker recognition in clean and noisy backgrounds using modified VQ-LBG algorithm | |
Choudhary et al. | Automatic speaker verification using gammatone frequency cepstral coefficients |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190712 |