CN107784215B - User authentication method and system for lip reading based on the audio unit of an intelligent terminal - Google Patents

User authentication method and system for lip reading based on the audio unit of an intelligent terminal

Info

Publication number
CN107784215B
CN107784215B (application CN201710952236.9A; published as CN107784215A)
Authority
CN
China
Prior art keywords
user
imitator
classifier
lip motion
authentication
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710952236.9A
Other languages
Chinese (zh)
Other versions
CN107784215A (en)
Inventor
俞嘉地
卢立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201710952236.9A
Publication of CN107784215A
Application granted
Publication of CN107784215B
Legal status: Active (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • G06F21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Abstract

A user authentication method performs lip reading with the audio unit of an intelligent terminal. In the registration phase, a deep autoencoder network extracts the user's lip-motion features at different levels of granularity to express the user's specificity; the extracted features are used to train multiple classifiers and an imitator detector, which are then organized into a binary-tree authentication framework. In the login phase, the binary-tree authentication scheme produces an authentication result for the lip-motion feature under test for each single word, and the multiple authentication results are finally fused to realize user authentication. The invention solves the problem of user authentication based on acoustic lip reading, modeling and predicting the distinctive behavioral features of the user's lip motion with mature acoustic-signal processing and deep-learning methods.

Description

User authentication method and system for lip reading based on the audio unit of an intelligent terminal
Technical field
The present invention relates to a technique in the field of speech processing, specifically a user authentication method and system that perform lip reading based on the audio unit of an intelligent terminal.
Background technology
The main function of a lip-reading user authentication method based on the audio unit of an intelligent terminal is to capture, through the acoustic signal emitted by the device, the lip motion of the user while speaking, exploit the fact that the behavioral characteristics of lip motion differ between users, and finally use those characteristics to identify users and detect imitators. Methods that authenticate users by biometric features have been widely studied; the most broadly deployed are fingerprint recognition, face recognition, and voiceprint recognition, around which many products have been built, such as Apple Touch ID, Alipay face login, and WeChat voiceprint login. However, none of these methods resists replay attacks well. Although liveness-detection techniques have quickly been adopted in these authentication methods to improve authentication precision, for example Alipay asks the user to blink during face login to verify that the subject being authenticated is not a photograph, such methods remain heavily constrained by the environment.
Summary of the invention
Addressing the deficiencies of existing biometric user authentication techniques combined with liveness detection, the present invention proposes a user authentication method and system that perform lip reading based on the audio unit of an intelligent terminal. It solves the problem of user authentication based on acoustic lip reading, modeling and predicting the distinctive behavioral features of the user's lip motion with mature acoustic-signal processing and deep-learning methods.
The present invention is achieved through the following technical solution:
The present invention relates to a user authentication method performing lip reading based on the audio unit of an intelligent terminal. In the registration phase, a deep autoencoder network extracts the user's lip-motion features at different levels of granularity to express the user's specificity; the extracted features are then used to train multiple classifiers and an imitator detector, and the trained classifiers and imitator detector are organized into a binary-tree authentication framework. In the login phase, the binary-tree authentication scheme yields the authentication result of the lip-motion feature under test for each single word, and the multiple authentication results are finally fused to realize user authentication.
The deep autoencoder network comprises an input layer, a noise-reduction layer, hidden layers and an output layer, wherein the hidden layers are formed by a three-layer stack of autoencoders that encode successively at the coarse-grained word level, the fine-grained word level and the user level, the lip-motion feature being finally emitted by the output layer.
The lip-motion feature specifically refers to the coded sequence obtained by passing through the deep autoencoder network the subtle Doppler-effect signal formed when the acoustic signal emitted by the intelligent terminal is reflected by the speaker's lips.
The classifier is specifically a binary classifier realized with a support vector machine (SVM).
The imitator detector is specifically a one-class classifier realized with support vector data description (SVDD).
The binary-tree authentication scheme refers to the following: in an environment with n registered users, the lip-motion feature is first input to the classifier corresponding to the n-th user, i.e. the n-th classifier; then:
1. when the n-th classifier classifies the lip-motion feature as the n-th user, the imitator detector further judges whether the lip-motion feature belongs to an imitator; otherwise:
2. the n-th classifier classifies the lip-motion feature as one of the first n-1 users, and the feature is further classified by the (n-1)-th classifier.
And so on: when the i-th classifier judges the lip-motion feature to be the i-th user, it is known that the owner of the feature cannot be any of the first i-1 users; meanwhile, classifiers i+1 through n have already judged that the login user is none of users i+1 through n, so the owner of the lip-motion feature is judged to be the i-th user. Otherwise the feature keeps being judged by the preceding classifiers.
In particular, for the 1st user, the imitator detector directly judges whether the lip-motion feature belongs to the user or to an imitator.
The fusion is realized with a weighted voting mechanism: using the authentication precision under each single word as its weight, a confidence value is computed for every registered user and for the imitator class, and the user class with the maximum confidence is regarded as the login user.
Technical effects
Compared with the prior art, the present invention uses acoustic techniques to capture the distinctive behavioral features of lip motion while the user speaks. The method both distinguishes different users well and adapts to complex environments, resisting, for example, the influence of ambient light and noise: it attains high recognition accuracy (above 91.7%) together with a low false acceptance rate (1.2%) and false rejection rate (1.6%), and a short system response time (0.67 s).
Description of the drawings
Fig. 1 is a schematic diagram of the structure of the present system;
Fig. 2 is a schematic flow diagram of the present invention;
Fig. 3 shows the segmentation result of the lip-sensing data under one passphrase in the embodiment;
Fig. 4 shows the three-layer autoencoder network used for feature extraction in the embodiment;
Fig. 5 shows the binary-tree authentication framework used for user authentication in the embodiment;
Fig. 6 shows the average confusion matrix under the 4 scenarios in the embodiment;
Fig. 7 compares the authentication accuracy with WeChat voiceprint login and Alipay face login in the embodiment;
Fig. 8 shows the false acceptance rate and false rejection rate under the 4 scenarios in the embodiment;
Fig. 9 is the cumulative probability density of the number of attempts required for a successful login in the embodiment;
Fig. 10 is the cumulative probability density of the response time in the embodiment.
Detailed description of the embodiments
As shown in Fig. 1, this embodiment relates to a user authentication system performing lip reading based on the audio unit of an intelligent terminal, comprising: an acoustic sensing module, a passphrase segmentation module, a deep-learning-based feature extraction module, a classifier and detector training module, and a user identification and imitator detection module, wherein: the acoustic sensing module connects to the passphrase segmentation module and transmits the raw acoustic information reflected by the lip motion and collected at the intelligent terminal's microphone; the passphrase segmentation module connects to the deep-learning-based feature extraction module and transmits the acoustic signal of a passphrase segmented into the acoustic signals of single words; the feature extraction module connects to the classifier and detector training module and to the user identification and imitator detection module and transmits the coded features of each single word's acoustic signal; and the classifier and detector training module connects to the user identification and imitator detection module and transmits the trained classifier models.
As shown in Fig. 2, the authentication process of the above system comprises two stages: registration (recording the user's lip-motion data) and login (user authentication):
In the registration phase, the acoustic signals reflected by the moving lips are collected first, then passphrase segmentation is performed and features are extracted from each word with the deep autoencoder network; finally the classifiers and the imitator detector are trained with the support vector machine and support vector data description methods.
In the login phase, this embodiment likewise collects the acoustic signals reflected by the user's lip motion and performs passphrase segmentation and feature extraction; it then uses the binary-tree authentication framework to obtain the authentication result under each single word, and finally combines the classification results of the multiple words by weighted voting to obtain the final authentication result.
Specifically, this embodiment requires the user to speak a passphrase containing multiple words, so the collected acoustic signal must first be segmented. The user speaks the passphrase to the intelligent terminal while the terminal emits a 20 kHz acoustic signal. The terminal then receives the reflections from the surrounding environment, which include the reflections from the user's moving lips. Since lip motion is usually slow, the resulting Doppler shift typically falls within a narrow range, roughly ±40 Hz, so this embodiment only considers the signal within that frequency band.
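As a quick plausibility check on this band (an illustrative calculation, not part of the patent): for a reflector moving at speed $v$ toward the microphone, the round-trip Doppler shift of a carrier at frequency $f_0$ is

$$f_d = \frac{2 v f_0}{c},$$

so for a typical peak lip speed of about 0.3 m/s, $f_d \approx \frac{2 \times 0.3 \times 20000}{343} \approx 35\ \text{Hz}$, which indeed lies inside the ±40 Hz band considered here.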
Moreover, while the user speaks the passphrase, there is an interval of roughly 300 ms between consecutive words, so a sliding window can be used to capture these intervals.
This embodiment first converts the collected signal into a time-frequency representation with the short-time Fourier transform (STFT). When the user is not speaking, the signal received by the terminal comes from reflections off distant objects, and its strength is generally far smaller than that of the signal reflected by the moving lips. This embodiment therefore sets a threshold to detect speaking intervals. Specifically, a segment of the time-domain signal covered by the sliding window is converted to the frequency domain with the STFT, and the embodiment checks whether the value of the frequency-domain signal at every monitored frequency is below the threshold. If so, the time covered by the sliding window belongs to a speaking interval; otherwise, that segment is part of the effective speaking time.
Fig. 3 shows the segmentation result of the lip-sensing signal data under one passphrase: whenever the signal strength values at the four monitored frequencies are all below the threshold, the segment is regarded as an inactive phase, i.e. a speaking interval.
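The segmentation logic described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the sampling rate, window length, hop size, threshold handling and function names are all hypothetical, and the sketch monitors every STFT bin inside the ±40 Hz band rather than the four fixed frequencies of Fig. 3; only the STFT-plus-threshold test and the ~300 ms inter-word gap come from the description.

```python
import numpy as np
from scipy.signal import stft

FS = 48000          # assumed microphone sampling rate
CARRIER = 20_000    # 20 kHz tone emitted by the terminal (from the description)
BAND = 40           # Doppler band of interest: roughly +/-40 Hz around the carrier

def speaking_mask(signal, threshold, win=2048, hop=512):
    """One boolean per STFT frame: True = effective speaking time."""
    freqs, _, spec = stft(signal, fs=FS, nperseg=win, noverlap=win - hop)
    mag = np.abs(spec)
    # Keep only the narrow band around the carrier where lip-motion
    # Doppler shifts live; everything else is irrelevant reflection.
    in_band = (freqs >= CARRIER - BAND) & (freqs <= CARRIER + BAND)
    # A frame belongs to a speaking interval only if EVERY monitored bin
    # stays below the threshold; any bin above it means the lips moved.
    return (mag[in_band] >= threshold).any(axis=0)

def split_words(mask, min_gap_s=0.3, hop=512):
    """Group speaking frames into words, splitting at ~300 ms silent gaps."""
    gap = int(min_gap_s * FS / hop)
    words, start, silent = [], None, 0
    for i, active in enumerate(mask):
        if active:
            if start is None:
                start = i
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= gap:                  # gap long enough: the word ended
                words.append((start, i - silent))
                start, silent = None, 0
    if start is not None:                      # word still open at the end
        words.append((start, len(mask) - 1))
    return words                               # inclusive (start, end) frame pairs
```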
After segmenting the received signal, this embodiment extracts effective and reliable features to express the user's specificity. To this end, this embodiment proposes a three-layer deep autoencoder network:
As shown in Fig. 4, the three-layer autoencoder network is used for feature extraction. First, to ensure the robustness of the extracted features, this embodiment applies denoising to the raw input: according to a given probability p, some data points of the raw data are randomly set to 0. The denoised data is then fed into the three-layer autoencoder network.
For each autoencoder layer, the input X is mapped to a compact compressed representation C via $C=\sigma(wX+b)$, where $\sigma(\cdot)$ is the logistic function $\sigma(x)=\frac{1}{1+e^{-x}}$, and w and b are the weights and bias of the autoencoder. The autoencoder is trained with the objective

$$\min_{w,b}\ \frac{1}{N}\sum_{i=1}^{N}\left\|X^{(i)}-X'^{(i)}\right\|^{2}+\lambda\,\Omega_{\mathrm{weights}}+\beta\,\Omega_{\mathrm{sparsity}},$$

where N is the number of training samples, $X^{(i)}$ and $X'^{(i)}=\sigma(w^{T}C^{(i)}+b')$ are the i-th elements of the raw input and of the reconstructed input, $\Omega_{\mathrm{weights}}$ and $\Omega_{\mathrm{sparsity}}$ are regularization terms on the parameters and on sparsity, and λ and β are their coefficients. The objective simply minimizes the gap between the raw input and the reconstructed input, i.e. it makes the compressed representation C express the raw input X as faithfully as possible.
As also shown in Fig. 4, given the collected reflected acoustic signal G, each layer of the network contains one autoencoder $h_i$ (i = 1, 2, 3) as described above, expressing the input signal as features at different levels to describe the user's specificity. The first autoencoder layer expresses the raw input G as a coarse-grained word-level feature $C_1$; the second layer further expresses the output of the first layer as a fine-grained word-level feature $C_2$; and the third layer finally expresses the output of the second layer as the user-level distinctive feature, which also serves as the final output for subsequent model training.
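A minimal sketch of one denoising autoencoder layer and the three-layer stacking is given below, assuming plain NumPy with tied encoder/decoder weights; the layer sizes, learning rate and noise probability p are illustrative assumptions, and the $\Omega_{\mathrm{weights}}$ and $\Omega_{\mathrm{sparsity}}$ regularizers of the objective are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DenoisingAutoencoder:
    """One layer: corrupt the input, encode C = sigmoid(wX + b), reconstruct
    with tied weights, and minimize the reconstruction error."""
    def __init__(self, n_in, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        self.w = self.rng.normal(0.0, 0.01, (n_hidden, n_in))  # weights w
        self.b = np.zeros(n_hidden)                            # encoder bias b
        self.b2 = np.zeros(n_in)                               # decoder bias b'

    def encode(self, x):
        return sigmoid(self.w @ x + self.b)                    # C = sigma(wX + b)

    def train_step(self, x, p=0.1, lr=0.01):
        xn = x * (self.rng.random(x.shape) > p)                # zero points w.p. p
        c = self.encode(xn)
        xr = sigmoid(self.w.T @ c + self.b2)                   # X' = sigma(w^T C + b')
        err = xr - x
        d_dec = err * xr * (1.0 - xr)                          # delta at decoder output
        d_enc = (self.w @ d_dec) * c * (1.0 - c)               # delta at hidden code
        self.w -= lr * (np.outer(d_enc, xn) + np.outer(c, d_dec))
        self.b -= lr * d_enc
        self.b2 -= lr * d_dec
        return 0.5 * float(err @ err)                          # reconstruction loss

def extract_user_feature(g, layers):
    """Feed signal G through the stack: C1 (coarse word level) -> C2 (fine
    word level) -> user-level feature used for classifier training."""
    c = g
    for ae in layers:
        c = ae.encode(c)
    return c
```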
With the above three-layer deep autoencoder network, effective and reliable features can be extracted to express the user's specificity. This embodiment then builds the classifiers and the imitator detector with the support vector machine and support vector data description methods. For a single-user system, when a user registers, the system collects the lip motion while the user speaks the passphrase through acoustic sensing, and obtains the corresponding features through passphrase segmentation and feature extraction. Since a single-user system holds only that user's information and lacks information about other imitators, this embodiment builds a one-class classifier with the support vector data description method to distinguish the sole registered user from imitators. For an intelligent terminal, however, it is very likely that several people share one device. In such a multi-user system, users typically register one by one. Since constructing a multi-class classifier would usually introduce a notable computational burden, this embodiment instead realizes multi-class functionality with a group of binary classifiers, training one binary classifier per registered user. Suppose n-1 users have already registered and the n-th user is about to register: for that user, this embodiment trains a binary classifier with the support vector machine method, using the data collected at this user's registration against the data collected at the previous n-1 users' registrations, to distinguish this user from the previously registered ones. Furthermore, for the last registered user, this embodiment additionally trains an imitator detector with the support vector data description method on the registered users' data, to distinguish registered users from imitators.
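The classifier-group construction can be sketched with scikit-learn as below. One substitution is worth flagging: scikit-learn ships no SVDD estimator, so OneClassSVM, a closely related one-class formulation that coincides with SVDD for the RBF kernel, stands in for the imitator detector; all hyperparameter values are assumptions.

```python
import numpy as np
from sklearn.svm import SVC, OneClassSVM

def register_users(features_per_user):
    """features_per_user: list of (n_samples, n_dims) arrays, one per user,
    in registration order. Returns per-user binary classifiers + detector."""
    classifiers = []
    for n, pos in enumerate(features_per_user):
        if n == 0:
            classifiers.append(None)   # user 1 gets no binary classifier
            continue
        neg = np.vstack(features_per_user[:n])             # earlier users' data
        X = np.vstack([pos, neg])
        y = np.r_[np.ones(len(pos)), np.zeros(len(neg))]   # 1 = this user
        classifiers.append(SVC(kernel="rbf", C=10.0).fit(X, y))
    # One-class detector trained on ALL registered users' data, used to
    # reject imitators (OneClassSVM substitutes for SVDD here).
    detector = OneClassSVM(kernel="rbf", nu=0.05).fit(np.vstack(features_per_user))
    return classifiers, detector
```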
In the login phase, the system distinguishes user identities with the trained classifiers and imitator detector. As in the registration phase, the user requesting login speaks the predetermined passphrase to the intelligent terminal; the terminal collects the lip motion while the user speaks through acoustic sensing and performs passphrase segmentation and feature extraction. For each registered user, this embodiment has trained one binary classifier to distinguish that user from the users registered before; this embodiment therefore proposes a binary-tree-based authentication framework.
As shown in Fig. 5, the binary-tree authentication framework is used for user authentication. Suppose n users are registered in the system. When a user attempts to log in, the system collects that user's lip motion and extracts the corresponding feature with the deep autoencoder network. The feature is first input to the classifier corresponding to the n-th user, i.e. the n-th classifier. If the n-th classifier regards the login user as the n-th user, the feature is further input to the imitator detector to judge whether the login user is an imitator; otherwise, i.e. the n-th classifier regards the login user as one of the first n-1 users, the feature is further input to the (n-1)-th classifier. And so on: if the i-th classifier judges the login user to be the i-th user, the system knows that the login user cannot be any of the first i-1 users; meanwhile, classifiers i+1 through n have already judged that the login user is none of users i+1 through n, so the system can conclude that the login user is the i-th user. Otherwise, the system keeps feeding the feature to the preceding classifiers. In particular, for the first user, the system uses the imitator detector to judge whether the login user is the first user or an imitator.
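A sketch of this cascade follows, reusing the hypothetical interfaces from the training sketch above (user i's binary classifier at index i-1, index 0 left empty for user 1, and the string "imitator" as the rejection label):

```python
def authenticate_word(feature, classifiers, detector):
    """Binary-tree cascade over n registered users for one word's feature."""
    x = feature.reshape(1, -1)
    n = len(classifiers)
    for i in range(n, 1, -1):                      # classifiers n .. 2
        if classifiers[i - 1].predict(x)[0] == 1:  # "this is user i"
            if i == n:                             # top of the tree: vet with detector
                return n if detector.predict(x)[0] == 1 else "imitator"
            return i                               # already rejected by i+1 .. n
        # else: one of the earlier users; descend to the previous classifier
    # User 1 has no binary classifier of its own: the detector alone decides.
    return 1 if detector.predict(x)[0] == 1 else "imitator"
```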
To ensure the robustness of authentication, the system usually requires the user to speak a passphrase containing multiple words. For each word, the system obtains one authentication result through the framework above. To integrate the results of the multiple words, this embodiment proposes a weighted voting mechanism: since the authentication precision differs from word to word, this embodiment uses the precision of each word as its weight. Suppose a passphrase contains m words; the preceding method then yields m authentication results $(L_1,\dots,L_m)$ and m corresponding precisions $(w_1,\dots,w_m)$. This embodiment computes a confidence value for every registered user and for the imitator class, i.e. $conf_i=\sum_{j\in\{k\mid L_k=U_i\}} w_j$. Based on these confidences, the system regards the login user as the user class with the maximum confidence.
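The weighted-voting fusion then reduces to a few lines; the per-word precisions are assumed to have been estimated beforehand, e.g. on held-out registration data:

```python
from collections import defaultdict

def fuse_results(labels, weights):
    """labels: per-word results L_1..L_m (user index or 'imitator');
    weights: per-word authentication precisions w_1..w_m as vote weights."""
    conf = defaultdict(float)
    for label, w in zip(labels, weights):
        conf[label] += w              # conf_i = sum of w_j over words with L_j = U_i
    return max(conf, key=conf.get)    # user class with maximum confidence

# Example: 4-word passphrase, three votes for user 2, one for user 5.
print(fuse_results([2, 2, 5, 2], [0.93, 0.88, 0.90, 0.91]))   # -> 2
```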
Since this embodiment primarily builds a user authentication system performing lip reading with the audio unit of an intelligent terminal, it chiefly assesses the accuracy of user identification and of imitator detection. Experiments were run in 4 different environments, namely an ordinary laboratory, a railway station, an unlit laboratory and a bar, each with 12 volunteers; 10 of the 12 were registered in the system while the remaining 2 acted as imitators for testing. As shown in Fig. 6, the average confusion matrices over the 4 environments show that the system attains above 83.7% precision in user discrimination, with an average user-discrimination accuracy of 90.21% and a standard deviation of only 3.52%; the system also achieves 93.1% accuracy in imitator detection. Fig. 7 compares the authentication accuracy with WeChat voiceprint login and Alipay face login: in the best environment, the system attains 95.3% accuracy, quite close to the 96.1% of voiceprint login and the 97.2% of face login; moreover, across the different environments the system maintains above 91.7% accuracy, whereas voiceprint login degrades markedly in noisy environments and face login in dark ones.
Fig. 8 shows the system's false acceptance rate and false rejection rate. The overall false acceptance rate is only 1.2%, which demonstrates that the system is sufficiently secure and reliable; meanwhile, the overall false rejection rate is only 1.6%, which shows that the user experience is good enough and wrongful login rejections are rare. Fig. 9 gives the cumulative probability density of the number of attempts required for a successful login: 95% of users need to speak the passphrase no more than 4 times to log in successfully, which further demonstrates that the system provides a good user experience.
Fig. 10 gives the cumulative probability density of the time required for login: for 90% of users the system response time is within 0.73 s, and the average response time is within 0.67 s. This result shows that the system responds quickly to login requests without causing noticeable delay.
The above specific implementation may be locally adjusted in different ways by those skilled in the art without departing from the principle and spirit of the present invention. The protection scope of the present invention is defined by the claims and is not limited by the specific implementation above; every implementation within that scope is bound by the present invention.

Claims (3)

1. A user authentication method performing lip reading based on the audio unit of an intelligent terminal, characterized in that, in the registration phase, a deep autoencoder network extracts the user's lip-motion features at different levels of granularity to express the user's specificity, the extracted features are then used to train multiple classifiers and an imitator detector, and the trained classifiers and imitator detector are organized into a binary-tree authentication framework; in the login phase, the binary-tree authentication scheme yields the authentication result of the lip-motion feature under test for each single word, and the multiple authentication results are finally fused to realize user authentication;
the deep autoencoder network comprises an input layer, a noise-reduction layer, hidden layers and an output layer, wherein the hidden layers are formed by a three-layer stack of autoencoders that encode successively at the coarse-grained word level, the fine-grained word level and the user level, the lip-motion feature being finally emitted by the output layer;
the lip-motion feature specifically refers to the coded sequence obtained by passing through the deep autoencoder network the subtle Doppler-effect signal formed when the acoustic signal emitted by the intelligent terminal is reflected by the speaker's lips;
the binary-tree authentication scheme refers to the following: in an environment with n registered users, the lip-motion feature is first input to the classifier corresponding to the n-th user, i.e. the n-th classifier; then:
1. when the n-th classifier classifies the lip-motion feature as the n-th user, the imitator detector further judges whether the lip-motion feature belongs to an imitator; otherwise:
2. the n-th classifier classifies the lip-motion feature as one of the first n-1 users, and the feature is further classified by the (n-1)-th classifier;
and so on: when the i-th classifier judges the lip-motion feature to be the i-th user, it is known that the owner of the lip-motion feature cannot be any of the first i-1 users; meanwhile, classifiers i+1 through n have already judged that the login user is none of users i+1 through n, so the owner of the lip-motion feature is judged to be the i-th user; otherwise the lip-motion feature keeps being judged by the preceding classifiers; for the 1st user, the imitator detector directly judges whether the lip-motion feature belongs to the user or to an imitator;
the fusion is realized with a weighted voting mechanism: using the authentication precision under each single word as its weight, a confidence value is computed for every registered user and for the imitator class, and the user class with the maximum confidence is regarded as the login user.
2. The method according to claim 1, characterized in that the classifier is specifically a binary classifier realized with a support vector machine, and the imitator detector is specifically a one-class classifier realized with support vector data description.
3. A user authentication system performing lip reading based on the audio unit of an intelligent terminal and realizing the method of claim 1 or 2, characterized by comprising: an acoustic sensing module, a passphrase segmentation module, a deep-learning-based feature extraction module, a classifier and detector training module, and a user identification and imitator detection module, wherein: the acoustic sensing module connects to the passphrase segmentation module and transmits the raw acoustic information reflected by the lip motion and collected at the intelligent terminal's microphone; the passphrase segmentation module connects to the deep-learning-based feature extraction module and transmits the acoustic signal of a passphrase segmented into the acoustic signals of single words; the deep-learning-based feature extraction module connects to the classifier and detector training module and to the user identification and imitator detection module and transmits the coded features of each single word's acoustic signal; and the classifier and detector training module connects to the user identification and imitator detection module and transmits the trained classifier models.
CN201710952236.9A 2017-10-13 2017-10-13 User authentication method and system for lip reading based on the audio unit of an intelligent terminal Active CN107784215B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710952236.9A CN107784215B (en) 2017-10-13 2017-10-13 User authentication method and system for lip reading based on the audio unit of an intelligent terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710952236.9A CN107784215B (en) 2017-10-13 2017-10-13 User authentication method and system for lip reading based on the audio unit of an intelligent terminal

Publications (2)

Publication Number Publication Date
CN107784215A CN107784215A (en) 2018-03-09
CN107784215B 2018-10-26

Family

ID=61434688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710952236.9A Active CN107784215B (en) 2017-10-13 2017-10-13 User authentication method and system for lip reading based on the audio unit of an intelligent terminal

Country Status (1)

Country Link
CN (1) CN107784215B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545227B (en) * 2018-04-28 2023-05-09 华中师范大学 Depth self-coding network-based speaker sex automatic identification method and system
CN110580336B (en) * 2018-06-08 2022-03-01 北京得意音通技术有限责任公司 Lip language word segmentation method and device, storage medium and electronic equipment
CN109413057B (en) * 2018-10-17 2020-01-17 上海交通大学 Smart home continuous user authentication method and system based on fine-grained finger gesture
CN110850386B (en) * 2019-11-20 2023-04-18 中北大学 Rotor wing type unmanned aerial vehicle deep learning identification method based on fractional order domain features
CN116127365B (en) * 2023-04-14 2023-07-25 山东大学 Authentication method based on vibration and applied to intelligent head-mounted equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463120B (en) * 2014-12-08 2017-10-17 中国人民解放军国防科学技术大学 Finger vein identification method based on binary tree
CN106782511A (en) * 2016-12-22 2017-05-31 太原理工大学 Amendment linear depth autoencoder network audio recognition method
CN106778179B (en) * 2017-01-05 2021-07-09 南京大学 Identity authentication method based on ultrasonic lip language identification

Also Published As

Publication number Publication date
CN107784215A (en) 2018-03-09

Similar Documents

Publication Publication Date Title
CN107784215B (en) User authentication method and system for lip reading based on the audio unit of an intelligent terminal
CN104732978B (en) The relevant method for distinguishing speek person of text based on combined depth study
CN105869630B (en) Speaker's voice spoofing attack detection method and system based on deep learning
Singh et al. Applications of speaker recognition
CN106778179B (en) Identity authentication method based on ultrasonic lip language identification
CN107305774A (en) Speech detection method and device
CN103971690A (en) Voiceprint recognition method and device
CN106663203A (en) Living body fingerprint identification method and device
US20140195232A1 (en) Methods, systems, and circuits for text independent speaker recognition with automatic learning features
KR20010009081A (en) Speaker verification system using continuous digits with flexible figures and method thereof
CN110767239A (en) Voiceprint recognition method, device and equipment based on deep learning
CN107481736A (en) A kind of vocal print identification authentication system and its certification and optimization method and system
CN112507311A (en) High-security identity verification method based on multi-mode feature fusion
CN113886792A (en) Application method and system of print control instrument combining voiceprint recognition and face recognition
Orken et al. Development of security systems using DNN and i & x-vector classifiers
Folorunso et al. A review of voice-base person identification: state-of-the-art
CN113343198B (en) Video-based random gesture authentication method and system
CN100547655C (en) A kind of speech lock
Bigun et al. Combining biometric evidence for person authentication
Kartik et al. Multimodal biometric person authentication system using speech and signature features
Dennis et al. Combining robust spike coding with spiking neural networks for sound event classification
CN114882906A (en) Novel environmental noise identification method and system
Chauhan et al. A review of automatic speaker recognition system
CN114003883A (en) Portable digital identity authentication equipment and identity authentication method
CN108735221A (en) A kind of Speaker Recognition System and recognition methods based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant