CN112188487B - Method and system for improving user authentication accuracy - Google Patents


Info

Publication number
CN112188487B
CN112188487B (application CN202011374374.1A)
Authority
CN
China
Prior art keywords
user data
type
user
encoder
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011374374.1A
Other languages
Chinese (zh)
Other versions
CN112188487A (en)
Inventor
邵俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Soxinda Beijing Data Technology Co ltd
Original Assignee
Soxinda Beijing Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Soxinda Beijing Data Technology Co ltd filed Critical Soxinda Beijing Data Technology Co ltd
Priority to CN202011374374.1A priority Critical patent/CN112188487B/en
Publication of CN112188487A publication Critical patent/CN112188487A/en
Application granted granted Critical
Publication of CN112188487B publication Critical patent/CN112188487B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W12/00Security arrangements; Authentication; Protecting privacy or anonymity
    • H04W12/06Authentication

Abstract

The invention discloses a method and a system for improving user authentication accuracy. The method comprises the following steps: constructing a variational self-encoder based on a local database and a third-party database; dividing the mass historical user data into three types based on tags, namely first type user data, second type user data and third type user data; performing a sample increment operation on the third type user data based on the variational self-encoder and the third type user data; establishing a binary classification model based on the mass historical user data and the third type user data obtained by the sample increment operation; receiving a user request; and authenticating the user based on the binary classification model. The method provided by the invention makes full use of the known information to generate more negative samples, and the generated negative samples bring more information to the model. Meanwhile, the VAE also makes full use of the information of rejected samples, and its encoder part reduces the feature dimensionality, thereby improving the effect and stability of the model.

Description

Method and system for improving user authentication accuracy
Technical Field
The invention belongs to the field of big data analysis and data mining, and particularly relates to a method and a system for improving user authentication accuracy.
Background
The rapid development of the mobile internet has driven the rapid growth of mobile-phone-side services: a user only needs to submit application data in a mobile phone APP to enjoy the corresponding application functions quickly. Meanwhile, a set of authentication measures is deployed on the operator's server side to protect the rights and interests of legitimate users and to prevent losses caused to the operator by bad users. In a competitive market, it is important that the server side can feed back an accurate result quickly.
Generally, authentication is performed as follows: the application data submitted by the user at the mobile phone end is received, including personal information such as the user's gender, age, occupation, education background and residential area, and mobile-phone-related information such as the IP address, the number of installed APPs and the phone brand; in addition, third-party data about the user is queried, with the user's authorization, to obtain a 360-degree panorama of the user's characteristics. After the business has accumulated to a certain extent, a historical database is built from past data, and an authentication model is built and optimized on the historical data to better control risk.
This authentication is actually a binary classification task, i.e. predicting whether a user is a bad user. The binary classification problem is solved with a supervised learning algorithm, but in practical applications the number of negative samples is very small and the sample classes are highly imbalanced, which makes it difficult for the binary classification algorithm to train an accurate model. For such a binary classification problem, undersampling, oversampling, and the SMOTE method are commonly used in the art to address the sample imbalance. However, undersampling loses information, and although oversampling generates more negative samples, those samples are still produced from the known negative samples and cannot bring more information to the model. This ultimately results in insufficient authentication accuracy: a user who should be granted access is blocked, while a user who should not be granted access can use the service freely and consume the corresponding resources, which is obviously undesirable.
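For reference only (not part of the original disclosure), the conventional oversampling baseline discussed above can be sketched with the imbalanced-learn library; the synthetic data and variable names are illustrative assumptions:

```python
# Conventional SMOTE oversampling baseline, for contrast with the VAE approach.
# Synthetic data; variable names are illustrative assumptions.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                   # 1000 users, 8 features
y = (rng.random(1000) > 0.03).astype(int)        # label 0 ("bad user") is roughly 3% of samples

# SMOTE interpolates between existing minority-class samples only, so the
# synthetic negatives add little information beyond the known negatives.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(X_res.shape, np.bincount(y_res))           # classes are now balanced
```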
Disclosure of Invention
Based on the above background, the present invention proposes an authentication method based on modeling with a Variational Auto-Encoder (VAE) to solve such problems. To this end, the invention provides a method for improving the accuracy of user authentication, which comprises the following steps:
collecting mass historical user data based on a local database and a third-party database;
constructing a variational self-encoder based on the mass historical user data;
dividing the mass historical user data into three types based on tags, namely first type user data, second type user data and third type user data, wherein the third type users represent users with negative sample tags, and the sample increment operation refers to generating more, distinct third type user data based on the existing third type user data;
performing sample increment operation on the third type of user data based on the variational self-encoder and the third type of user data;
establishing a binary model based on the mass historical user data and the third type user data obtained by sample increment operation;
receiving a user request;
and authenticating the user based on the binary classification model.
Wherein the constructing a variational self-encoder based on the massive historical user data comprises:
establishing a historical user data set based on the mass historical user data;
and constructing a variational self-encoder based on the historical user data set.
The first type of user data is non-tag user data, the second type of user data is user data with a tag of 1, and the third type of user data is user data with a tag of 0.
Wherein performing a sample increment operation on the third type of user data based on the variational auto-encoder and the third type of user data comprises:
extracting the characteristics of third-class user data, inputting the characteristics into an encoder of the variational self-encoder, and outputting the mean value and the variance of normal distribution;
generating a plurality of random noise samples based on the normal distribution of the mean and the variance;
and inputting the plurality of random noise samples into a decoder of the variational self-encoder to generate new sample data, wherein the new sample data are third type user data different from the existing third type user data.
And marking the new sample data as the additional third type user data, and constructing an additional third type user data set.
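A minimal sketch of this sample increment step is given below; the encoder and decoder are written as trivial stand-ins for the trained variational self-encoder, and all names, dimensions and values are illustrative assumptions:

```python
# Sketch of the sample increment operation on third-type (negative-sample) user data.
# encoder()/decoder() are stand-ins for the trained variational self-encoder networks.
import numpy as np

rng = np.random.default_rng(0)

def encoder(x):                       # stand-in: returns (mean, variance) of the latent distribution
    return x[:4], np.full(4, 0.1)

def decoder(z):                       # stand-in: maps a latent vector back to feature space
    return np.concatenate([z, z])

def sample_increment(x_negative, n_new=5):
    mu, var = encoder(x_negative)                 # 1. encode the features; get mean and variance
    new_samples = []
    for _ in range(n_new):
        eps = rng.standard_normal(mu.shape)       # 2. random noise from the standard normal
        z = mu + np.sqrt(var) * eps               #    z ~ N(mu, var)
        new_samples.append(decoder(z))            # 3. decode into a new third-type sample
    return np.stack(new_samples)

x_neg = rng.normal(size=8)                        # one existing negative-sample feature vector
extra_negatives = sample_increment(x_neg)
print(extra_negatives.shape)                      # (5, 8): additional third-type user data
```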
The establishing of the two-classification model based on the historical user data and the third-class user data obtained by sample increment operation comprises the following steps:
and fusing the historical user data set and the additional third-class user data set, and training by using a two-class supervised learning algorithm to obtain a two-class model.
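A minimal sketch of this fusing and training step, assuming scikit-learn and synthetic data (all names and shapes are illustrative assumptions):

```python
# Sketch: fuse the historical data set with the additional (generated) third-type
# samples and train a two-class supervised model. Synthetic data; sklearn assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_hist = rng.normal(size=(500, 8))
y_hist = (rng.random(500) > 0.05).astype(int)     # historical labels: mostly 1 (good users)
X_extra = rng.normal(size=(200, 8))               # stand-in for VAE-generated negatives
y_extra = np.zeros(200, dtype=int)                # additional third-type data, label 0

X_train = np.vstack([X_hist, X_extra])            # fused training set
y_train = np.concatenate([y_hist, y_extra])

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
```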
The mass historical user data at least comprises the gender, the age, the professional information, the education background, the residential area information, the mobile phone IP address, the number of mobile phone APPs, the mobile phone brand and/or the third-party data of the user.
Wherein the variational self-encoder comprises an encoder and a decoder.
Wherein the binary classification model is a logistic regression model, represented by the following formula:

f(x) = \frac{1}{1 + \exp\left(-\left(\sum_{i=1}^{m} w_i x_i + b\right)\right)}

where x denotes the user's features, m is the dimension of x, w = (w_1, ..., w_m) and b are the model parameters, f(x) outputs the probability that x is a negative sample, and exp denotes the exponential function.
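A minimal numerical sketch of this logistic regression score follows; the weights, bias and feature values are illustrative assumptions:

```python
# Logistic regression score f(x): probability that user x is a negative sample.
# Weights w, bias b and the feature vector x are illustrative assumptions.
import numpy as np

def f(x, w, b):
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

w = np.array([0.8, -0.3, 0.05, 1.2])    # one weight per feature dimension (m = 4)
b = -0.5
x = np.array([1.0, 0.2, 35.0, 0.0])     # encoded user features
print(f(x, w, b))                        # a value in (0, 1)
```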
The invention also provides a system for improving the accuracy of user authentication, which comprises:
the variational self-encoder is used for carrying out a sample increment operation on the third type of user data based on the mass historical user data, wherein the mass historical user data is divided into three types based on the tags, namely first type user data, second type user data and third type user data;
the model establishing module is used for establishing a two-classification model based on the mass historical user data and the third-class user data obtained by sample increment operation;
an input module for receiving a user request;
an authentication module for authenticating the user based on the binary classification model.
Compared with the prior art, the method provided by the invention can generate more negative samples by fully utilizing the known information, and the generated negative samples can bring more information to the model. Meanwhile, the VAE also fully utilizes the information of rejected samples, and the encoder part can realize the dimension reduction of characteristic dimensions so as to improve the effect and stability of the model.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 is a flow chart illustrating a method for improving user authentication accuracy in accordance with an embodiment of the present invention;
FIG. 2 is a flow diagram illustrating a sample increment operation on a third type of user data according to an embodiment of the invention;
FIG. 3 is a diagram illustrating a sample increment operation on a third type of user data according to an embodiment of the invention;
FIG. 4 is a schematic diagram illustrating the establishment of neural networks in the encoder of the variational self-encoder according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating the generation of random variables according to one embodiment of the invention;
FIG. 6 is a schematic diagram illustrating decoder generated samples according to an embodiment of the present invention; and
fig. 7 is a schematic diagram illustrating a system for improving user authentication accuracy according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and "a plurality" typically includes at least two.
It should be understood that although the terms first, second, third, etc. may be used in embodiments of the present invention to describe the various types of user data, these types of user data should not be limited by these terms. The terms are only used to distinguish the types of user data from one another. For example, the first type of user data may also be referred to as the second type of user data, and similarly, the second type of user data may also be referred to as the first type of user data, without departing from the scope of the embodiments of the present invention.
Alternative embodiments of the present invention are described in detail below with reference to the accompanying drawings.
A conventional self-encoder consists of an encoder and a decoder, wherein the encoder can achieve feature dimension reduction. However, a conventional self-encoder does not have the capability to generate new samples. The variational self-encoder additionally assumes that the posterior distribution over the latent space follows a normal distribution, so the invention can sample in the corresponding latent space based on the observed negative samples and use the decoder to generate new samples. With the variational self-encoder, the invention can generate more negative samples to balance the numbers of positive and negative samples.
Embodiment 1
As shown in fig. 1, the present invention discloses a method for improving the accuracy of user authentication, which comprises the following steps:
collecting mass historical user data based on a local database and a third-party database;
constructing a variational self-encoder based on the mass historical user data;
dividing the mass historical user data into three types based on tags, namely first type user data, second type user data and third type user data, wherein the third type users represent users with negative sample tags, and the sample increment operation refers to generating more, distinct third type user data based on the existing third type user data;
performing sample increment operation on the third type of user data based on the variational self-encoder and the third type of user data;
establishing a binary model based on the mass historical user data and the third type user data obtained by sample increment operation;
receiving a user request;
and authenticating the user based on the binary classification model.
Embodiment 2
A method for improving the accuracy of user authentication comprises the following steps:
collecting mass historical user data based on a local database and a third-party database;
constructing a variational self-encoder based on the mass historical user data;
dividing the mass historical user data into three types based on tags, namely first type user data, second type user data and third type user data, wherein the third type users represent users with negative sample tags, and the sample increment operation refers to generating more, distinct third type user data based on the existing third type user data;
performing sample increment operation on the third type of user data based on the variational self-encoder and the third type of user data;
establishing a binary model based on the mass historical user data and the third type user data obtained by sample increment operation;
receiving a user request;
and authenticating the user based on the binary classification model.
In one embodiment, many APPs are currently freely available to a user, and some APPs need to authenticate the user's identity to determine the user's right to use the APP. The user's historical usage records are usually considered comprehensively, including the usage records of other APPs of the same type and historical records in other systems, such as telecommunications, to determine whether the user has some, part, or all of the rights to use the APP. Generally, when a user uses the APP for the first time, the use request of the user is received; after obtaining the user's authorization, user information is extracted based on the current database contents and converted into feature vectors, which are substituted into the trained model to authenticate the user and finally confirm the user's usage rights.
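A minimal sketch of scoring such a request at authentication time, assuming a trained classifier with a predict-probability interface; the feature encoding, the request fields and the threshold are illustrative assumptions:

```python
# Sketch: authenticate a user request with the trained binary classification model.
# featurize(), the request fields and the 0.5 threshold are illustrative assumptions.
import numpy as np

def featurize(request: dict) -> np.ndarray:
    # Encode a few of the fields mentioned in the disclosure into a numeric vector.
    return np.array([
        1.0 if request.get("gender") == "F" else 0.0,
        float(request.get("age", 0)),
        float(request.get("app_count", 0)),
        1.0 if request.get("third_party_ok") else 0.0,
    ])

def authenticate(clf, request: dict, threshold: float = 0.5) -> bool:
    x = featurize(request).reshape(1, -1)
    p_bad = clf.predict_proba(x)[0, 0]   # probability of class 0 ("bad user"), assuming classes_ == [0, 1]
    return p_bad < threshold             # grant the requested rights only if the risk is low

# Example (clf is a model trained as in the embodiments above):
# allowed = authenticate(clf, {"gender": "F", "age": 30, "app_count": 42, "third_party_ok": True})
```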
Wherein the constructing a variational self-encoder based on the massive historical user data comprises:
establishing a historical user data set based on the mass historical user data;
and constructing a variational self-encoder based on the historical user data set.
The first type of user data is non-tag user data, the second type of user data is user data with a tag of 1, and the third type of user data is user data with a tag of 0.
As shown in fig. 2 and fig. 3, performing a sample increment operation on the third type of user data based on the variational self-encoder and the third type of user data includes:
extracting the characteristics of third-class user data, inputting the characteristics into an encoder of the variational self-encoder, and outputting the mean value and the variance of normal distribution;
generating a plurality of random noise samples based on the normal distribution of the mean and the variance;
and inputting the plurality of random noise samples into a decoder of the variational self-encoder to generate new sample data, wherein the new sample data are third type user data different from the existing third type user data.
And marking the new sample data as the additional third type user data, and constructing an additional third type user data set.
The establishing of the two-classification model based on the historical user data and the third-class user data obtained by sample increment operation comprises the following steps:
and fusing the historical user data set and the additional third-class user data set, and training by using a two-class supervised learning algorithm to obtain a two-class model.
The mass historical user data at least comprises the gender, the age, the occupation information, the education background and the residential area information of the user, the IP address of the mobile phone, the number of the mobile phone APP, the brand of the mobile phone and/or third-party data of the user.
Wherein the variational self-encoder comprises an encoder and a decoder.
Wherein the binary classification model is a logistic regression model, represented by the following formula:

f(x) = \frac{1}{1 + \exp\left(-\left(\sum_{i=1}^{m} w_i x_i + b\right)\right)}

where x denotes the user's features, m is the dimension of x, w = (w_1, ..., w_m) and b are the model parameters, f(x) outputs the probability that x is a negative sample, and exp denotes the exponential function.
Embodiment 3
Historical data preparation for modeling:
A large amount of user data is accumulated during the business process and forms the historical data set used for modeling. For each user, the invention collects personal information including gender, age, occupation, education background and residential area, mobile-phone-related information such as the IP address, the number of mobile phone APPs and the phone brand, and, with the user's authorization, queries the user's third-party data, such as telecommunication data.
When users have used a given kind of APP, such as a shopping APP, a video-browsing APP, an RPG game APP or a chess-and-card game APP, their tags can be obtained, that is, whether each user is a good user or a bad user, with a value of 1 or 0. Bad users are users with low honesty or reputation; since such users are a minority, they are marked as 0.
Users who have not used this kind of APP have no tag data, but their untagged feature data can still be used.
The small amount of tagged data mentioned above is recorded as

\{(x_i^1, x_i^2, \ldots, x_i^m, y_i)\}_{i=1}^{l},

and the large amount of unlabeled data (historically rejected users) is recorded as

\{(x_i^1, x_i^2, \ldots, x_i^m)\}_{i=l+1}^{n},

where the superscripts of x index different features and the subscripts index different observations. There are n observations in total: observations 1 to l are labeled data and observations l+1 to n are unlabeled data, with l << n. y is the label corresponding to the features, taking the value 0 or 1, where 0 indicates that the user is a bad client and 1 a good client.
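A minimal sketch of how such a partially labeled data set might be organized; the shapes, names and synthetic values are illustrative assumptions:

```python
# Sketch: n observations with m features, of which only the first l carry labels.
# Shapes and synthetic values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, m, l = 10_000, 12, 400                    # l << n: only a few labeled observations

X = rng.normal(size=(n, m))                  # feature matrix, one row per observation
y = np.full(n, -1)                           # -1 marks "no tag" (first-type user data)
y[:l] = (rng.random(l) > 0.03).astype(int)   # labeled rows: 1 = good client, 0 = bad client

X_labeled, y_labeled = X[:l], y[:l]          # second-type (tag 1) and third-type (tag 0) data
X_unlabeled = X[l:]                          # first-type (untagged) data
X_negative = X_labeled[y_labeled == 0]       # third-type data used for the sample increment
```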
Embodiment 4
In the embodiment of the invention, the posterior probability distribution assumption makes it difficult to back-propagate gradients through the sampling step during training, so the variational self-encoder performs sampling using the reparameterization trick. Specifically, two neural networks are constructed, one for the conditional mean and one for the conditional variance of the normal distribution; a new normally distributed variable is then generated by the method shown in fig. 5 and fed into the decoder to generate more samples.
The variational self-encoder consists of an encoder module and a decoder module. It assumes that the hidden variable Z is a random variable whose posterior distribution follows a normal distribution, i.e.

z \mid x \sim \mathcal{N}\left(\mu(x), \sigma^2(x)\right),

where x is the observed sample and \mathcal{N}\left(\mu(x), \sigma^2(x)\right) denotes the normal distribution with mean \mu(x) and variance \sigma^2(x),
so for each different observation sample, Z follows a different normal distribution. As shown in fig. 4, since a normal distribution can be described by two variables, i.e., the mean and the variance, the present invention establishes two neural networks in the encoder part, one for generating the mean and one for generating the variance.
With the mean and variance obtained, the present invention generates (samples) a random variable z such that z follows the normal distribution with mean \mu and variance \sigma^2: a variable \varepsilon following the standard normal distribution is randomly generated, and z is computed as

z = \mu + \sigma \cdot \varepsilon,

where \sigma is the standard deviation.
After this step is completed, the decoder can proceed to generate a new sample through a neural network using the generated variables, the dimensions of which are the same as those of the original X sample, as shown in fig. 6.
The purpose of training the variational self-encoder is to make the samples produced by the decoder restore the original samples as closely as possible, thereby also achieving dimension reduction. But because the variational self-encoder introduces random sampling in the intermediate (latent) layer, the present invention can vary this sampling to obtain different variants of the original variable that have properties similar to the original sample.
The present invention also generates more negative samples based on this principle, and generates various variants of the original negative samples by using a variational self-encoder for the original negative samples.
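A compact sketch of such a variational self-encoder is given below, with the two encoder heads (mean and log-variance), the reparameterization step z = \mu + \sigma \varepsilon, and a decoder that maps latent vectors back to the original feature dimension. PyTorch, the layer sizes and the loss weighting are illustrative assumptions, not part of the original disclosure:

```python
# Minimal variational self-encoder sketch (PyTorch assumed; sizes are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, feat_dim: int = 8, latent_dim: int = 3):
        super().__init__()
        self.enc = nn.Linear(feat_dim, 16)
        self.enc_mu = nn.Linear(16, latent_dim)       # network generating the mean
        self.enc_logvar = nn.Linear(16, latent_dim)   # network generating the (log-)variance
        self.dec = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, feat_dim))

    def encode(self, x):
        h = F.relu(self.enc(x))
        return self.enc_mu(h), self.enc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)   # sigma, the standard deviation
        eps = torch.randn_like(std)     # eps ~ N(0, I)
        return mu + std * eps           # z = mu + sigma * eps

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.dec(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")                  # restore the original sample
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # keep the posterior close to N(0, I)
    return recon + kl

# Generating variants of an original negative sample with a trained model:
# mu, logvar = vae.encode(x_neg); z = vae.reparameterize(mu, logvar); new_sample = vae.dec(z)
```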
Embodiment 5
The invention also provides a system for improving the accuracy of user authentication, which comprises:
the variational self-encoder is used for performing a sample increment operation on the third type of user data based on the mass historical user data, wherein the mass historical user data is divided into three types based on the tags, namely the first type of user data, the second type of user data and the third type of user data, and the sample increment operation refers to generating more, distinct third type user data based on the existing third type user data;
the model establishing module is used for establishing a two-classification model based on the mass historical user data and the third-class user data obtained by sample increment operation;
an input module for receiving a user request;
an authentication module for authenticating the user based on the binary classification model.
Embodiment 6
The disclosed embodiments provide a non-volatile computer storage medium having stored thereon computer-executable instructions that may perform the method steps as described in the embodiments above.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of an element does not in some cases constitute a limitation on the element itself.
The foregoing describes preferred embodiments of the present invention, and is intended to provide a clear and concise description of the spirit and scope of the invention, and not to limit the same, but to include all modifications, substitutions, and alterations falling within the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A method for improving the accuracy of user authentication is characterized by comprising the following steps:
collecting mass historical user data based on a local database and a third-party database;
constructing a variational self-encoder based on the mass historical user data;
dividing mass historical user data into three types based on tags, namely first type user data, second type user data and third type user data, wherein the first type user data are non-tag user data, the second type user data are user data with tags of 1, and the third type user data are user data with tags of 0;
based on the variational self-encoder and third-class user data, performing sample increment operation on the third-class user data, wherein the third-class users represent negative sample label users, and the sample increment operation refers to generating more different third-class user data based on the existing third-class user data;
establishing a binary classification model based on the mass historical user data and the third type user data obtained by the sample increment operation, wherein the binary classification model is a logistic regression model represented by the following formula:

f(x) = \frac{1}{1 + \exp\left(-\left(\sum_{i=1}^{m} w_i x_i + b\right)\right)}

where x denotes the user's features, m is the dimension of x, w = (w_1, ..., w_m) and b are the model parameters, f(x) outputs the probability that x is a negative sample, and exp denotes the exponential function;
receiving a user request;
authenticating the user based on the binary classification model;
wherein performing a sample increment operation on a third type of user data based on the variational autocoder and the third type of user data comprises:
extracting the characteristics of third-class user data, inputting the characteristics into an encoder of the variational self-encoder, and outputting the mean value and the variance of normal distribution;
generating a plurality of random noise samples based on the normal distribution with the mean and the variance, wherein each random noise sample is a random variable z given by

z = \mu + \sigma \cdot \varepsilon,

where \mu is the mean, \sigma is the standard deviation, and \varepsilon is a randomly generated variable following the standard normal distribution;
and inputting the plurality of random noise samples into a decoder of the variational self-encoder to generate new sample data, wherein the new sample data are third type user data different from the existing third type user data.
2. The method of claim 1, wherein said constructing a variational self-coder based on said mass historical user data comprises:
establishing a historical user data set based on the mass historical user data;
and constructing a variational self-encoder based on the historical user data set.
3. The method of claim 1, wherein the new sample data is marked as additional third type user data and an additional third type user data set is constructed.
4. The method of claim 3, wherein building a classification model based on the historical user data and the third type of user data from the sample incremental operations comprises:
and fusing the historical user data set and the additional third-class user data set, and training by using a two-class supervised learning algorithm to obtain a two-class model.
5. The method of claim 1, wherein the mass historical user data comprises at least gender, age, professional information, education background, residential area information, mobile phone IP address, number of mobile phone APPs, mobile phone brand, and/or third party data of the user.
6. The method of claim 1, wherein the variational self-encoder comprises an encoder and a decoder.
7. A system for improving the accuracy of user authentication implementing the method of any one of claims 1 to 6, comprising:
the variational self-encoder is used for performing a sample increment operation on the third type of user data based on the mass historical user data, wherein the mass historical user data is divided into three types based on the tags, namely the first type of user data, the second type of user data and the third type of user data, the third type of users represent users with negative sample tags, and the sample increment operation refers to generating more, distinct third type user data based on the existing third type user data;
the model establishing module is used for establishing a two-classification model based on the mass historical user data and the third-class user data obtained by sample increment operation;
an input module for receiving a user request;
an authentication module for authenticating the user based on the binary classification model.
CN202011374374.1A 2020-12-01 2020-12-01 Method and system for improving user authentication accuracy Active CN112188487B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011374374.1A CN112188487B (en) 2020-12-01 2020-12-01 Method and system for improving user authentication accuracy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011374374.1A CN112188487B (en) 2020-12-01 2020-12-01 Method and system for improving user authentication accuracy

Publications (2)

Publication Number Publication Date
CN112188487A (en) 2021-01-05
CN112188487B (en) 2021-03-12

Family

ID=73918274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011374374.1A Active CN112188487B (en) 2020-12-01 2020-12-01 Method and system for improving user authentication accuracy

Country Status (1)

Country Link
CN (1) CN112188487B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113901436A (en) * 2021-12-10 2022-01-07 Nanqi Xiance (Nanjing) Technology Co., Ltd. Authority distribution method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107633511A (en) * 2017-09-14 2018-01-26 Nantong University Fan vision detection system based on an auto-encoding neural network
CN108647226A (en) * 2018-03-26 2018-10-12 Zhejiang University Hybrid recommendation method based on a variational auto-encoder
CN111046655A (en) * 2019-11-14 2020-04-21 Tencent Technology (Shenzhen) Co., Ltd. Data processing method and device and computer readable storage medium
CN111428853A (en) * 2020-01-16 2020-07-17 Donghua University Negative sample adversarial generation method with noise learning function

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3616198A4 (en) * 2017-04-24 2021-01-06 Virginia Tech Intellectual Properties, Inc. Radio signal identification, identification system learning, and identifier deployment
KR102165645B1 (en) * 2017-05-03 2020-10-14 버지니아 테크 인터렉추얼 프라퍼티스, 인크. Learning and deployment of adaptive wireless communication
US10417556B1 (en) * 2017-12-07 2019-09-17 HatchB Labs, Inc. Simulation-based controls optimization using time series data forecast
US11205121B2 (en) * 2018-06-20 2021-12-21 Disney Enterprises, Inc. Efficient encoding and decoding sequences using variational autoencoders
CN109492193B (en) * 2018-12-28 2020-11-27 Tongji University Abnormal network data generation and prediction method based on deep machine learning model
CN111144466B (en) * 2019-12-17 2022-05-13 Wuhan University Image sample adaptive deep metric learning method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107633511A (en) * 2017-09-14 2018-01-26 Nantong University Fan vision detection system based on an auto-encoding neural network
CN108647226A (en) * 2018-03-26 2018-10-12 Zhejiang University Hybrid recommendation method based on a variational auto-encoder
CN111046655A (en) * 2019-11-14 2020-04-21 Tencent Technology (Shenzhen) Co., Ltd. Data processing method and device and computer readable storage medium
CN111428853A (en) * 2020-01-16 2020-07-17 Donghua University Negative sample adversarial generation method with noise learning function

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Study on the Impact of the Housing Price-to-Income Ratio on Residents' Rent-or-Buy Choices; Chen Xinyan et al.; Management Review; 2020-11-30; Vol. 32, No. 11; p. 71 *
Research on Key Technologies of Behavior Mining for Trajectory Data; Peng Yongsheng; China Master's Theses Full-text Database, Information Science and Technology (Monthly); 2020-07-15; Sections 4.2.1-4.2.2 *

Also Published As

Publication number Publication date
CN112188487A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN111401558B (en) Data processing model training method, data processing device and electronic equipment
CN111428881A (en) Recognition model training method, device, equipment and readable storage medium
WO2022252363A1 (en) Data processing method, computer device and readable storage medium
CN111311136A (en) Wind control decision method, computer equipment and storage medium
CN112307472A (en) Abnormal user identification method and device based on intelligent decision and computer equipment
CN112785086A (en) Credit overdue risk prediction method and device
CN111681091A (en) Financial risk prediction method and device based on time domain information and storage medium
CN111324370B (en) Method and device for carrying out risk processing on to-be-on-line small program
CN110543565A (en) Auditing method, system and readable storage medium based on convolutional neural network model
CN113240505A (en) Graph data processing method, device, equipment, storage medium and program product
CN110660466A (en) Personal health data chaining method and system of Internet of things by combining block chains
CN112188487B (en) Method and system for improving user authentication accuracy
CN109783381B (en) Test data generation method, device and system
CN116011640A (en) Risk prediction method and device based on user behavior data
CN110197426A (en) A kind of method for building up of credit scoring model, device and readable storage medium storing program for executing
CN113497723A (en) Log processing method, log gateway and log processing system
CN107528822A (en) A kind of business performs method and device
CN108388811A (en) Personalized study under wechat public platform
CN112487453A (en) Data security sharing method and device based on central coordinator
CN112115443B (en) Terminal user authentication method and system
CN112019642B (en) Audio uploading method, device, equipment and storage medium
CN111210279B (en) Target user prediction method and device and electronic equipment
CN114528496B (en) Multimedia data processing method, device, equipment and storage medium
CN116841650B (en) Sample construction method, device, equipment and storage medium
CN110366009A (en) The recognition methods of multimedia resource request and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant