CN110223429A - Voice access control system

Info

Publication number: CN110223429A
Application number: CN201910534516.7A
Authority: CN
Inventors: 沈希忠; 孙陈影
Original assignee: Shanghai Institute of Technology
Current assignee: Shanghai Institute of Technology
Priority date: 2019-06-19
Filing date: 2019-06-19
Publication date: 2019-09-10

Abstract

The present invention provides a voice access control system that uses an anti-spoofing Turing test and deep learning to recognize genuine human speech and implement the access control function. The system comprises a processor loaded with a Turing test module, a generative adversarial network, and a bidirectional GRU neural network; the processor is communicatively connected to an access control drive mechanism and opens or closes the door according to the voice features. The invention is suited to the specific setting in which the speaker is not on site: a Turing test is performed on the acquired speech, and only after the speech is judged to be genuine human speech is it enhanced by the generative adversarial network. Features are then extracted from the enhanced speech using parameters such as Mel-frequency cepstral coefficients (MFCC), and speaker recognition is completed by a deep bidirectional gated recurrent unit (GRU) network.

Description

Voice access control system

Technical field

The present invention relates to the field of voice processing technology, and in particular to a voice access control system.

Background technique

With the rapid development of electronic information technology, conventional door locks are steadily evolving in a high-tech, intelligent direction, and intelligent recognition systems that combine biometric recognition with conventional locks are gradually entering people's lives. More and more enterprises apply such intelligent recognition systems to access control management and attendance management.

Currently, the voiceprint in voice is incorporated into access control system as biological characteristic.

However, existing voice access control systems are easily defeated by recordings, edited audio, and similar means, which poses a security risk.

Summary of the invention

In view of the defects in the prior art, the object of the present invention is to provide a voice access control system.

A voice access control system provided according to the present invention comprises: a voice acquisition module, a processor, and an access control drive mechanism. A Turing test module, a generative adversarial network, and a bidirectional GRU neural network are loaded in the processor; the voice acquisition module is communicatively connected to the processor, and the processor is communicatively connected to the access control drive mechanism; wherein:

The voice acquisition module is configured to acquire voice information and transmit the voice information to the processor;

The Turing test module is configured to analyze the voice information to determine whether it is genuine human voice information and, if so, to input the voice information into the generative adversarial network;

The generative adversarial network is configured to enhance the received voice information to obtain enhanced voice information;

The bidirectional GRU neural network is configured to extract features from the enhanced voice information to obtain voice features, and to judge whether the voice features match those of the target person; wherein the bidirectional GRU neural network extracts features from the enhanced voice information by means of MFCC, and is a learning model that has been trained in advance to recognize voice features;

The processor is configured to control the access control drive mechanism to open the door when the voice features match those of the target person.

Optionally, the Turing test module is specifically configured to: upon receiving the voice information, randomly generate a preset number of questions; if the correct answers to the questions are received within a preset time, the Turing test is passed.

Optionally, the generative adversarial network includes a discriminator and a generator. The discriminator judges whether the enhanced voice information output by the generator is real speech; the generator enhances the voice information and inputs the enhanced voice information into the discriminator.

Optionally, the bidirectional GRU neural network uses Mel-frequency cepstral coefficients to extract 39-dimensional MFCC feature parameters as the voice features.

Optionally, the processor is a TMS320DM8168 development board, on which the Turing test module, the generative adversarial network, and the bidirectional GRU neural network are loaded.

Compared with the prior art, the present invention has the following beneficial effects:

The voice access control system provided by the invention can tell whether the person being identified is genuine and improves voice recognition accuracy, thereby achieving accurate access control with higher security and a good user experience.

Detailed description of the invention

Other features, objects, and advantages of the invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:

Fig. 1 is a schematic diagram of the voice access control system provided by one embodiment of the invention;

Fig. 2 is a flow diagram of the SEGAN training method;

Fig. 3 is a structural diagram of the deep bidirectional GRU in one embodiment of the invention;

Fig. 4 is a block diagram of the hardware implementation in one embodiment of the invention.

Specific embodiment

The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to understand the invention further, but do not limit the invention in any way. It should be noted that those of ordinary skill in the art can make several changes and improvements without departing from the inventive concept. These all fall within the protection scope of the present invention.

Fig. 1 is a schematic diagram of the voice access control system provided by one embodiment of the invention. As shown in Fig. 1, the speaker's voice information is acquired first. When the voice information is received, the Turing test is started; if the Turing test judges the voice information to be genuine human voice information, the voice information that passed the test is input to the deep learning module. The deep learning module includes a generative adversarial network and a deep bidirectional GRU network.

In this embodiment, the Turing test is first used to judge that the speaker being identified is a real, live person rather than a recording or a machine-synthesized voice. In a Turing test, the tester puts arbitrary questions to the testee through a special channel. After repeated tests, if more than 30% of the testers cannot determine whether the testee is a human or a machine, the machine has passed the test. In the specific environment addressed here, the present invention uses the Turing test for this judgment: questions that require a spoken answer are posed at random, so a recording cannot answer them in real time, and machine speech has not yet reached the level of a real human. To date, very few robots can pass the Turing test, so the test can accurately judge whether the speaker being identified is a real person or something else.
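The challenge-response check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the question bank, the function names `pick_questions` and `passes_challenge`, and the choice to apply the time limit per question are all assumptions made for the example. The `ask` callback stands in for whatever component poses a question and returns the recognized answer text.

```python
import random
import time

# Hypothetical question bank: each spoken prompt maps to its expected answer.
QUESTION_BANK = {
    "What is three plus four?": "seven",
    "Which day follows Monday?": "tuesday",
    "What color is a clear daytime sky?": "blue",
    "How many legs does a cat have?": "four",
}


def pick_questions(count, rng=random):
    """Randomly choose `count` distinct questions so a recording cannot anticipate them."""
    return rng.sample(sorted(QUESTION_BANK), count)


def passes_challenge(questions, ask, time_limit_s, clock=time.monotonic):
    """Return True only if every answer is correct and arrives within the time limit.

    `ask` is a caller-supplied function that poses one question and returns
    the recognized answer text (assumed interface).
    """
    for question in questions:
        started = clock()
        answer = ask(question)
        elapsed = clock() - started
        if elapsed > time_limit_s:
            return False  # answer arrived too late
        if answer.strip().lower() != QUESTION_BANK[question]:
            return False  # wrong answer
    return True
```

Drawing the questions at random for each attempt is what defeats a fixed recording; the time limit addresses attempts to look up or synthesize an answer after the fact.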

In this embodiment, the speaker's voice is enhanced using a GAN. The generative adversarial network (GAN, Generative Adversarial Networks) is a deep learning model and one of the most promising methods of recent years for unsupervised learning on complex distributions. A GAN contains two adversarial models. The generative model (G) takes a noisy image as input and outputs an image that looks real, in order to confuse the discriminative model. The discriminative model (D) judges whether a given image is a real one (its inputs include both images from the data set and images output by the generative network). At the start, neither model has been trained. The two models are trained adversarially together: the generative model produces images to deceive the discriminative model, and the discriminative model then judges whether each image is real or fake. Both models become progressively stronger until they reach a steady state.

The present invention uses the GAN-based speech enhancement method SEGAN (Speech Enhancement GAN). The advantages of this method are: 1) it provides a fast speech enhancement process, with no causality requirement and no recursive operations such as those in an RNN; 2) it works directly on the raw audio, extracting no hand-crafted features and making no specific assumptions about the raw data; 3) it learns from different speakers and noise types and merges them into the same shared parameters, so the system is simple and generalizes better.

The input of SEGAN is the noisy speech signal and a latent representation, and its output is the enhanced signal. The generator is designed to be fully convolutional (with no fully connected layers), which reduces the number of training parameters and thus shortens training time. An important feature of the generative network is its end-to-end structure: it processes the raw speech signal directly and avoids extracting acoustic features through an intermediate transform. During training, the discriminator passes information about what is real and what is fake in the input data to the generator, so that the generator can fine-tune its output waveform toward the real distribution and eliminate interfering signals. Speaker recognition is then performed on the speech signal enhanced by SEGAN.

The whole SEGAN network is built from CNNs and has an encoder-decoder structure. The structure of D is an encoder followed by a dimensionality-reduction layer that reduces 8 × 1024 parameters to 8. The encoder consists of one-dimensional convolutional layers with a stride of 2. The input is the noisy speech signal and the output is the enhanced speech signal. Speech enhancement is performed by the G network, whose input is the noisy speech signal together with the latent vector z and whose output is the enhanced speech signal. G is a fully convolutional network, similar to an autoencoder. During encoding, the input signal is projected and compressed by a series of strided convolutional layers, each followed by an activation function, producing one convolution result every N steps. Experiments show that strided convolution trains GANs better than pooling. Skip connections pass the fine structure of the waveform directly to the decoding stage, and they let gradients flow more deeply through the whole network; this prevents low-level details from being lost when the speech waveform is reconstructed. Decoding is the reverse of encoding: it uses fractionally strided (transposed) convolutions, followed by the same activation function as in encoding.
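To make "one convolution result every N steps" concrete, the following sketch shows a single-channel strided 1-D convolution in plain Python. It is an illustration only: the function name `strided_conv1d`, the kernel values, and the absence of padding are assumptions, and a real SEGAN layer has many channels and learned kernels.

```python
def strided_conv1d(signal, kernel, stride=2):
    """Valid 1-D convolution (cross-correlation form) evaluated every `stride` samples."""
    width = len(kernel)
    return [
        sum(k * s for k, s in zip(kernel, signal[start:start + width]))
        for start in range(0, len(signal) - width + 1, stride)
    ]


# Eight input samples and a two-tap averaging kernel with stride 2
# give four outputs, so the time axis is compressed by half.
compressed = strided_conv1d([1, 2, 3, 4, 5, 6, 7, 8], [0.5, 0.5], stride=2)
print(compressed)  # [1.5, 3.5, 5.5, 7.5]
```

Unlike pooling, the down-sampling here is performed by the kernel weights themselves, which are trainable in a real network.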

The training process of SEGAN has three stages. (1) The discriminator D is given noisy speech together with the corresponding clean speech, the label is set to real, and the parameters of D are trained. (2) The generator G is given noisy speech and generates enhanced speech, which is fed to D together with the noisy speech; the label is set to fake, and the parameters of D are updated. (3) D is fixed, step (2) is repeated, and the parameters of G are updated. After these three steps, G is the speech enhancement network. The training process is shown in Fig. 2.
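The three stages can be expressed as two loss functions over the discriminator's scores. The sketch below assumes a least-squares objective with target 1 for "real" and 0 for "fake", and assumes that in stage (3) the generator is rewarded when the frozen discriminator scores its output as real; the text above does not specify the loss, so these choices and the function names are illustrative.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def discriminator_loss(scores_real_pairs, scores_fake_pairs):
    """Stages (1) and (2): push D toward 1 on (clean, noisy) pairs
    and toward 0 on (enhanced, noisy) pairs."""
    real_term = mean((s - 1.0) ** 2 for s in scores_real_pairs)
    fake_term = mean(s ** 2 for s in scores_fake_pairs)
    return 0.5 * real_term + 0.5 * fake_term


def generator_loss(scores_fake_pairs):
    """Stage (3): with D frozen, the loss is small when D scores
    G's enhanced output close to 1 (assumed target)."""
    return 0.5 * mean((s - 1.0) ** 2 for s in scores_fake_pairs)
```

A perfect discriminator gives `discriminator_loss([1.0], [0.0]) == 0.0`, and a generator whose outputs are all scored as real gives `generator_loss([1.0]) == 0.0`; training alternates between minimizing the two.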

Specifically, Fig. 3 is a structural diagram of the deep bidirectional GRU in one embodiment of the invention. As shown in Fig. 3, the deep bidirectional GRU is used for speech recognition. The speech feature parameters are Mel-frequency cepstral coefficients (MFCC), and 39-dimensional MFCC feature parameters are taken as the input of deep learning. The present invention uses deep bidirectional GRUs (BiGRUs) for speaker recognition. LSTM (Long Short-Term Memory) was proposed to overcome the inability of the RNN (Recurrent Neural Network) to handle long-range dependencies well. The GRU (Gated Recurrent Unit) is a variant of LSTM: by analyzing which parts of the LSTM architecture are really needed, the GRU merges the forget gate and the input gate into a single update gate. It also merges the cell state and the hidden state, so the final model is simpler and more efficient than the bidirectional LSTM model. The GRU has only two gates, the update gate $z^{(t)}$ and the reset gate $r^{(t)}$. The update gate controls the degree to which the state information of the previous time step is carried into the current state; the larger its value, the more previous state information is carried in. The reset gate controls the degree to which the state information of the previous time step is ignored; the smaller its value, the more is ignored. The forward propagation formulas are defined as follows:

$$net_{R,t} = w_{rh} h_{t-1} + w_{rx} x_t + b_r$$

$$net_{Z,t} = w_{zh} h_{t-1} + w_{zx} x_t + b_z$$

$$net_{G,t} = w_{gh} (r_t * h_{t-1}) + w_{gx} x_t + b_g \qquad (1)$$

Here $net_{R,t}$ denotes the reset-gate network state at time $t$; $w_{rh}$ denotes the reset-gate weight applied to the hidden state of time $t-1$; $h_{t-1}$ denotes the hidden state at time $t-1$; $w_{rx}$ denotes the reset-gate weight applied to the input at time $t$; $x_t$ denotes the input at time $t$; $b_r$ denotes the reset-gate bias at time $t$; $net_{Z,t}$ denotes the update-gate network state at time $t$; $w_{zh}$ denotes the update-gate weight applied to the hidden state of time $t-1$; $w_{zx}$ denotes the update-gate weight applied to the input at time $t$; and $b_z$ denotes the update-gate bias at time $t$.

From the definition of the GRU network structure, it follows that:

$$r_t = \mathrm{sigmoid}(net_{R,t})$$

$$z_t = \mathrm{sigmoid}(net_{Z,t})$$

$$g_t = \tanh(net_{G,t})$$

$$h_t = (1 - z_t) * h_{t-1} + z_t * g_t \qquad (2)$$

Here $r_t$ denotes the reset-gate output state at time $t$, $z_t$ denotes the update-gate output state at time $t$, $g_t$ denotes the candidate state controlled by the reset gate at time $t$, $h_t$ denotes the hidden state at time $t$, and $h_{t-1}$ denotes the hidden state at time $t-1$.
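Formulas (1) and (2) translate directly into a single forward step. The sketch below uses plain Python lists so that it runs without external libraries; the helper names (`matvec`, `affine`, `gru_step`) and the dictionary layout of the parameters are assumptions for illustration, while the arithmetic follows the formulas as written, including the form of (2) in which $z_t$ weights the candidate state $g_t$.

```python
import math


def sigmoid(v):
    """Logistic function used for both gates."""
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def affine(w_h, h, w_x, x, b):
    """Compute w_h @ h + w_x @ x + b element by element."""
    return [a + c + d for a, c, d in zip(matvec(w_h, h), matvec(w_x, x), b)]


def gru_step(x_t, h_prev, p):
    """One forward step following formulas (1) and (2).

    `p` is a dict with keys w_rh, w_rx, b_r, w_zh, w_zx, b_z, w_gh, w_gx, b_g.
    """
    r_t = [sigmoid(n) for n in affine(p["w_rh"], h_prev, p["w_rx"], x_t, p["b_r"])]
    z_t = [sigmoid(n) for n in affine(p["w_zh"], h_prev, p["w_zx"], x_t, p["b_z"])]
    gated = [r * h for r, h in zip(r_t, h_prev)]  # r_t * h_{t-1}
    g_t = [math.tanh(n) for n in affine(p["w_gh"], gated, p["w_gx"], x_t, p["b_g"])]
    # h_t = (1 - z_t) * h_{t-1} + z_t * g_t
    return [(1.0 - z) * h + z * g for z, h, g in zip(z_t, h_prev, g_t)]
```

With all weights and biases set to zero, both gates evaluate to 0.5 and the candidate state to 0, so the step simply halves the previous hidden state; this makes a convenient sanity check.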

For GRU network layer $l$: when $t = T$, the error term of the layer is the error fed back from layer $l+1$; when $t \in [0, T)$, it consists of two parts, namely the error propagated back from layer $l+1$ at time $t$ and the error fed back from time $t+1$. Error definition: at time $t$ the output value of the GRU is $h_t$, and the error at time $t$ is defined accordingly [formula not reproduced in the text].

From formulas (1) and (2), $net_{R,t}$, $net_{Z,t}$, $net_{G,t}$, and $h_t$ are all functions of $h_{t-1}$. Applying the error definition and the total derivative formula yields the recursion for the backward error [formula not reproduced in the text].

In that recursion, $E$ denotes the identity matrix; $\delta_{t-1}$ denotes the backward error at time $t-1$, and its layer-indexed form denotes the backward error of layer $l-1$ at time $t-1$; $\delta_{Z,t}$ denotes the backward error of the update-gate output at time $t$; $\delta_{R,t}$ denotes the backward error of the reset-gate output at time $t$; $\delta_{G,t}$ denotes the backward error of the candidate state controlled by the reset gate at time $t$; and $\delta_t$ denotes the backward error at time $t$.

The errors $\delta_{Z,t}$, $\delta_{R,t}$, and $\delta_{G,t}$ of each unit at time $t$ are given by corresponding formulas [formulas not reproduced in the text].

Substituting formula (4) recursively yields the error at every time step.

From the errors $\delta_{Z,t}$, $\delta_{R,t}$, and $\delta_{G,t}$ at each time step, the weight and bias gradients are calculated, starting with $\Delta w_{Zh,t}$, $\Delta w_{Rh,t}$, and $\Delta w_{Gh,t}$ [formulas not reproduced in the text].

Here $w^{+}$ denotes the weight after the error-driven update at time $t$, $w$ denotes the weight at time $t$, $\Delta w$ denotes the weight gradient at time $t$, $\eta$ denotes a coefficient, $b^{+}$ denotes the bias after the error-driven update at time $t$, $b$ denotes the bias at time $t$, $\Delta b$ denotes the bias gradient at time $t$, $\Delta w_{Zh,t}$ denotes the weight gradient of the update-gate output at time $t$, $\Delta w_{Rh,t}$ denotes the weight gradient of the reset-gate output at time $t$, and $\Delta w_{Gh,t}$ denotes the weight gradient of the candidate state controlled by the reset gate at time $t$.

The gradients at all time steps are summed to obtain the total gradient [formula not reproduced in the text].

The input of the GRU is the output of the previous network layer, defined through $f^{l-1}$, the activation function of layer $l-1$. By the definition in formula (1), the gate network states are functions of this input, so the total derivative formula can be applied once more [formula not reproduced in the text].

The above is the computation procedure of the bidirectional GRU. Deep learning has been applied successfully to speech recognition, and the deep bidirectional GRU can realize speaker recognition more efficiently, which is why the present invention uses BiGRUs.

Specifically, the training and discrimination process of the bidirectional GRU neural network is as follows:

(1) The feature parameters are Mel-frequency cepstral coefficients (MFCC); 39-dimensional MFCC feature parameters are taken as the input of deep learning.

(2) The feature data are processed by a fully connected layer.

(3) In the Bi-GRU, forward propagation and backward propagation are combined to train on the data.

(4) A softmax classifier outputs the discrimination result.
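Steps (3) and (4) can be sketched as follows, under the assumption that "combining forward and backward propagation" means running one recurrent pass over the frames in time order and one in reverse order, then concatenating the two final states. The recurrent step is passed in as a function so the sketch stays independent of any particular GRU implementation; all names here are hypothetical.

```python
import math


def run_direction(frames, step, h0):
    """Apply `step(x_t, h_prev) -> h_t` over the frames and return the final hidden state."""
    h = list(h0)
    for x_t in frames:
        h = step(x_t, h)
    return h


def bidirectional_summary(frames, step_fwd, step_bwd, h0):
    """Concatenate the final states of a forward pass and a time-reversed pass."""
    forward = run_direction(frames, step_fwd, h0)
    backward = run_direction(list(reversed(frames)), step_bwd, h0)
    return forward + backward


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(summary, weights, biases):
    """Linear layer followed by softmax; returns (speaker index, probabilities)."""
    logits = [sum(w * s for w, s in zip(row, summary)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    return probs.index(max(probs)), probs
```

The index returned by `classify` would identify the enrolled speaker with the highest probability; how that probability is thresholded before the door is opened is a separate design decision not covered by this sketch.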

Fig. 4 is a block diagram of the hardware implementation in one embodiment of the invention. As shown in Fig. 4, voice information is acquired first, and a Turing test is then carried out on a computer. If the test is passed, the analysis and comparison of the voice features are executed on the TMS320DM8168 development board, which is preloaded with program instructions (the program instructions may also be stored in DDR or NVRAM and called at run time by the processing chip of the TMS320DM8168 development board), to obtain a discrimination result. When the discrimination result is that the collected voice features match the pre-stored voice features of the target person, the result is transmitted to the access control drive mechanism over the bus (PCI), and the drive mechanism opens the door. If the discrimination result is that the collected voice features do not match the pre-stored voice features of the target person, the door remains closed.

In this embodiment, the program algorithms can be loaded onto the TMS320DM8168 development board; combining software and hardware makes testing more convenient and faster. The TMS320DM8168 is a high-end floating-point DSP + ARM dual-core development board that is stable, convenient, and reliable, and it can perform speaker recognition effectively.

The present invention enhances the collected speech with a generative adversarial network to attenuate noise, and performs speaker recognition with a deep bidirectional GRU, which is highly adaptive, broadly applicable, and efficient, thereby improving the security and reliability of the access control system. Implementing the above method on the TMS320DM8168 makes it faster and more reliable.

Specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to the particular embodiments above; those skilled in the art can make various changes or modifications within the scope of the claims without affecting the substance of the invention. Where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with one another arbitrarily.

Claims

1. A voice access control system, characterized by comprising: a voice acquisition module, a processor, and an access control drive mechanism, wherein a Turing test module, a generative adversarial network, and a bidirectional GRU neural network are loaded in the processor, the voice acquisition module is communicatively connected to the processor, and the processor is communicatively connected to the access control drive mechanism; wherein:

The Turing test module is configured to analyze the voice information to determine whether it is genuine human voice information and, if so, to input the voice information into the generative adversarial network;

The generative adversarial network is configured to enhance the received voice information to obtain enhanced voice information;

The bidirectional GRU neural network is configured to extract features from the enhanced voice information to obtain voice features, and to judge whether the voice features match those of the target person; wherein the bidirectional GRU neural network extracts features from the enhanced voice information by means of MFCC, and is a learning model that has been trained in advance to recognize voice features;

The processor is configured to control the access control drive mechanism to open the door when the voice features match those of the target person.

2. The voice access control system according to claim 1, characterized in that the Turing test module is specifically configured to: upon receiving the voice information, randomly generate a preset number of questions; if the correct answers to the questions are received within a preset time, the Turing test is passed.

3. The voice access control system according to claim 1, characterized in that the generative adversarial network includes a discriminator and a generator; the discriminator judges whether the enhanced voice information output by the generator is real speech; the generator enhances the voice information and inputs the enhanced voice information into the discriminator.

4. The voice access control system according to claim 1, characterized in that the bidirectional GRU neural network uses Mel-frequency cepstral coefficients to extract 39-dimensional MFCC feature parameters as the voice features.

5. The voice access control system according to claim 1, characterized in that the processor is a TMS320DM8168 development board, on which the Turing test module, the generative adversarial network, and the bidirectional GRU neural network are loaded.