Background
With the development of the mobile internet, smartphones have become a common part of daily life. The mobile internet brings convenience, but it also brings the threat of user information loss. In most smartphone apps, scenarios such as login and payment verify that the user is the account owner by requiring a password. Relying on a password alone still carries great risk: once an attacker steals the password, for example by shoulder surfing, the attacker only needs to enter the correct password to steal all of the user's information in the app, and even to transfer money in the app to other accounts.
Although some current apps authenticate users with methods such as fingerprint recognition and face recognition, these methods have their own limitations. Fingerprint recognition is highly accurate but strongly affected by environmental factors. When the user's finger is dirty, the fingerprint recognition system has difficulty making a judgment; some users' fingerprints are blurred and hard to recognize; and a user's fingerprint may even be stolen. Face recognition is also heavily restricted: with current technical schemes, recognition is difficult when the surroundings are dark or when the face is covered by accessories, and face recognition has been successfully spoofed with a photo, which poses a serious threat to user privacy. Behavior recognition, by contrast, is not affected by the surrounding environment, is unique to each user, and is highly accurate. It needs only the sensors of the mobile phone and no additional equipment, and it can run covertly as a second authentication factor: behavior recognition is performed at the same time as password entry, so the user's privacy is doubly protected.
With the development of machine learning, machine learning algorithms have found many applications in the identification field, for example supervised classification algorithms such as SVM, RandomForest and XGBoost. However, classification algorithms are not suitable for the identity recognition system of the present invention: to keep the samples balanced, their training process generally requires a large number of abnormal samples, and for most target users it is difficult to add samples of non-target users entering the same password to the training set. Anomaly detection algorithms are therefore considered for identity recognition.
When no labeled abnormal samples are available for training, two cases can be distinguished:
In the first case, it is known that the data contains both normal-behavior data and abnormal-behavior data; this is the outlier detection problem of unsupervised learning. Common algorithms include IsolationForest and LocalOutlierFactor.
In the second case, the data contains only normal-behavior data; this is the novelty detection problem of semi-supervised learning. Common algorithms include OneClassSVM.
In the identity recognition system of the invention, the historical data used all belongs to the target user and contains no data from non-target users, so the OneClassSVM algorithm (hereinafter OCSVM) is adopted.
The basic idea of the OCSVM (one-class support vector machine) algorithm is as follows. In the linear case, a hyperplane is sought in the space such that its distance from the origin is as large as possible while it separates the normal sample points from the abnormal sample points as well as possible. In the nonlinear case, the sample points of the original space are first mapped into a high-dimensional feature space, where such a hyperplane is sought.
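For reference, the idea above corresponds to the standard one-class SVM formulation of Schölkopf et al., written out below. The notation ($w$, $\rho$, $\xi_i$, $\phi$, $\nu$, $n$) is the customary textbook notation and is not taken from this document: $\phi$ is the feature map, $\rho / \lVert w \rVert$ is the distance of the hyperplane from the origin, $\xi_i$ are slack variables, and $\nu \in (0, 1]$ bounds the fraction of training samples allowed to fall on the abnormal side.

```latex
\min_{w,\,\xi,\,\rho}\; \frac{1}{2}\lVert w\rVert^{2}
  + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho
\quad\text{s.t.}\quad
\langle w, \phi(x_i)\rangle \ge \rho - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n

f(x) = \operatorname{sgn}\bigl(\langle w, \phi(x)\rangle - \rho\bigr)
```

A sample with $f(x) = +1$ is treated as normal and one with $f(x) = -1$ as abnormal.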
Zheng et al. propose a non-intrusive user authentication mechanism in "You Are How You Touch". The authors study identification by analyzing the way a user touches a mobile phone, taking into account data from the accelerometer, the gyroscope and the touch screen sensors. Finally, they use a Z-score method for classification and achieve good results.
That prior art uses pressing pressure and pressing area data. The sensors for these two types of data are unavailable on most Android phones, so the approach is not suitable for production environments. Even on iOS phones that have the relevant sensors, we found these two types of data to be inaccurate. We therefore discard the pressing pressure and pressing area data.
The prior art does not consider device orientation sensor data, which in our study is a useful dimension for distinguishing different users.
The prior art uses Z-score as the classification algorithm. Although Z-score achieves good results, it generalizes less well than OCSVM, so we introduce the OCSVM algorithm.
In our research, the accuracy and stability of judging the target user are improved by adding preset behavior habit mechanisms and combining them with the OCSVM model.
Disclosure of Invention
To solve the problem of identity recognition based on smartphone sensor data, the invention provides an identity recognition method based on abnormal behavior detection. Only samples of the target user are used in training, and the method can accurately and stably judge whether a new sample belongs to the target user.
An identity recognition method based on abnormal behavior detection comprises the following steps:
step 1): collecting the data of each sensor while a user enters the same password multiple times on the same smartphone;
step 2): collecting the user's responses to preset behavior habit mechanisms before and after entering the password;
step 3): cleaning and converting the sensor data obtained in step 1) to obtain normalized feature values;
step 4): cleaning and converting the responses to the behavior habit mechanisms obtained in step 2) to obtain feature values of the behavior habit mechanisms;
step 5): repeating steps 1), 2) and 3) to obtain a plurality of normalized feature values, and training on the training set samples with the OCSVM algorithm, using the sklearn package of the Python language, to obtain a model $M_1$;
step 6): repeating steps 1), 2) and 4) to obtain feature values of a plurality of behavior habit mechanisms, constructing a behavior baseline for each habit of the target user, and integrating all behavior baselines to construct a model $M_2$;
step 7): integrating the models obtained in step 5) and step 6) and assigning weights to obtain a fusion model $F = a M_1 + b M_2$, wherein $a$ and $b$ are the assigned weights;
step 8): detecting abnormal behavior with the fusion model, so as to identify whether the user is the owner.
The method pre-embeds several behavior habit mechanisms, collects behavior data from the sensors, and generates features with a standardized preprocessing method. An OCSVM model trained on the historical behavior features of the target user's password entry yields the probability that a behavior is abnormal. This is combined, by linear combination, with the results of the preset behavior habit mechanisms to decide whether the behavior belongs to the target user.
In step 1), the sensors comprise a device orientation sensor, an acceleration sensor, a magnetic field sensor and a gyroscope sensor;
the sensor data comprises the per-axis data of the device orientation sensor, the acceleration sensor, the magnetic field sensor and the gyroscope sensor;
in step 2), the responses to the preset behavior habit mechanisms specifically include:
when the user name being entered contains digits, whether the user switches to the numeric keyboard or types the digits directly on the alphabetic keyboard is taken as the response to the first behavior habit mechanism;
the acceleration sensor data of the mobile phone during a short time after the user finishes entering the password is taken as the response to the second behavior habit mechanism. The short time is 5 to 6 s, specifically 5.5 s.
Step 3), cleaning and converting the sensor data obtained in step 1) to obtain normalized feature values, specifically comprises the following steps:
A) removing, in timestamp order, the records with empty values from the sensor data obtained in step 1) to obtain valid sensor data;
B) computing the modulus of the valid sensor data from step A): $m = \sqrt{x^2 + y^2 + z^2}$, wherein $m$ is the modulus, $x$ is the sensor's x-axis value, $y$ is the sensor's y-axis value and $z$ is the sensor's z-axis value, to obtain the modulus corresponding to the timestamp sequence;
C) computing, from the timestamp sequence and the corresponding moduli obtained in step B), five values for each sensor: the modulus at press, the modulus at release, and the maximum, minimum and mean modulus over the release period;
D) for a 6-digit numeric password, processing the timestamp sequence into a time feature sequence $T = (h_1, r_1, h_2, r_2, h_3, r_3, h_4, r_4, h_5, r_5, h_6, r_6)$, wherein $h_i$ is the hold time while the $i$-th digit is pressed and $r_i$ is the gap time during the release of the $i$-th digit ($i = 1, \dots, 6$);
E) normalizing the five values of each sensor obtained in step C) and the time feature sequence obtained in step D) to obtain the normalized feature values;
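Steps B) and C) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `(timestamp, x, y, z)` sample layout, and the reading of "release period" as the interval from a key's release to the next key press are all assumptions made for the example.

```python
import math
from statistics import mean


def modulus(x, y, z):
    """Vector magnitude of one three-axis sensor reading (step B)."""
    return math.sqrt(x * x + y * y + z * z)


def press_release_features(samples, press_ts, release_ts, next_press_ts):
    """Five modulus features for one key press of one sensor (step C).

    samples: list of (timestamp, x, y, z) tuples, sorted by timestamp.
    Assumption: the release period is [release_ts, next_press_ts) and
    contains at least one sample.
    """
    moduli = [(t, modulus(x, y, z)) for t, x, y, z in samples]

    def at(ts):
        # Modulus of the sample closest in time to ts.
        return min(moduli, key=lambda tm: abs(tm[0] - ts))[1]

    during_release = [m for t, m in moduli if release_ts <= t < next_press_ts]
    return [
        at(press_ts),          # modulus at press
        at(release_ts),        # modulus at release
        max(during_release),   # maximum over the release period
        min(during_release),   # minimum over the release period
        mean(during_release),  # mean over the release period
    ]
```

Calling `press_release_features` once per digit and per sensor yields the per-sensor values that step E) then normalizes.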
Step 4), cleaning and converting the responses to the behavior habit mechanisms obtained in step 2) to obtain the feature values of the behavior habit mechanisms, specifically comprises the following steps:
converting the response to the first behavior habit mechanism into a binary value of -1 or 1, according to whether the user switches to the numeric keyboard or types the digits directly on the alphabetic keyboard;
for the response to the second behavior habit mechanism, computing the Euclidean distance between each group of acceleration sensor vectors recorded after the password is entered and the initial sensor vector: $d = \sqrt{(x - x_0)^2 + (y - y_0)^2 + (z - z_0)^2}$, wherein $x$, $y$, $z$ are the three-axis values of each group of acceleration sensor data and $x_0$, $y_0$, $z_0$ are the initial values of the acceleration sensor; if the Euclidean distance $d$ is larger than a preset threshold, the feature value is 1, otherwise it is -1.
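The two habit features can be sketched as below. The function names are hypothetical, and two details are assumptions the text leaves open: which keyboard choice maps to 1 rather than -1, and that the first reading after password entry serves as the initial sensor vector.

```python
import math


def keyboard_habit_feature(switched_to_numeric_keyboard):
    """First habit: encode the keyboard choice as 1 or -1.

    Assumption: switching to the numeric keyboard maps to 1.
    """
    return 1 if switched_to_numeric_keyboard else -1


def motion_habit_features(readings, threshold):
    """Second habit: compare each accelerometer reading with the initial one.

    readings: list of (x, y, z) tuples recorded after password entry;
    readings[0] is treated as the initial sensor vector (an assumption).
    Returns one value per later reading: 1 if its Euclidean distance to
    the initial vector exceeds the threshold, otherwise -1.
    """
    initial = readings[0]
    return [1 if math.dist(r, initial) > threshold else -1 for r in readings[1:]]
```

How the per-reading values are aggregated into a single feature per login attempt is not specified in the text, so the sketch returns them unaggregated.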
Step 6) specifically comprises the following steps:
a) obtaining the weight of each behavior habit: $w_j = \left| \frac{1}{n} \sum_{i=1}^{n} x_{ji} \right|$, wherein $w_j$ is the weight of the $j$-th behavior habit mechanism, $x_{ji}$ is the $i$-th feature value of the $j$-th behavior habit mechanism, and $n$ is the number of feature values of each behavior habit mechanism;
b) normalizing the weights of all behavior habit mechanisms obtained in the previous step to obtain the normalized weight $\hat{w}_j$ of each behavior habit;
c) combining all behavior baselines to obtain the model $M_2$, wherein $x_{j,n+1}$ is the $(n + 1)$-th feature value of the $j$-th behavior habit mechanism.
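A possible reading of steps a) to c) is sketched below. The weight follows the description (absolute value of the sum of a habit's historical feature values divided by their count). The combination rule in `baseline_score` is an assumption, because the text does not spell out how the baselines are combined: here each habit's baseline is taken as the sign of its historical sum, and the new $(n+1)$-th value contributes its normalized weight positively when it agrees with the baseline and negatively otherwise. All names are hypothetical.

```python
def habit_weights(history):
    """Normalized weight per habit.

    history: dict mapping habit name -> list of past feature values (-1 or 1).
    Raw weight = |sum(values)| / n, then normalized so the weights sum to 1.
    """
    raw = {name: abs(sum(values)) / len(values) for name, values in history.items()}
    total = sum(raw.values())
    if total == 0:
        # No habit shows a consistent pattern: fall back to equal weights.
        return {name: 1.0 / len(raw) for name in raw}
    return {name: w / total for name, w in raw.items()}


def baseline_score(history, new_values):
    """Assumed combination: weighted agreement of new values with baselines."""
    weights = habit_weights(history)
    score = 0.0
    for name, values in history.items():
        baseline = 1 if sum(values) >= 0 else -1  # dominant past behavior
        score += weights[name] * baseline * new_values[name]
    return score
```

With this reading, a habit the user follows consistently receives a large weight, while a habit with mixed history contributes little to the score.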
Compared with the prior art, the invention has the following beneficial effects:
1) implicit identification: the user still goes through conventional password verification, while behavior recognition is embedded implicitly, so the process is imperceptible and user-friendly;
2) high accuracy: the invention uses the OCSVM algorithm and the behavior habit mechanisms for identity recognition, and the accuracy reaches 96%;
3) quick response: the invention can rapidly identify whether the user is legitimate with a limited data set;
4) low overhead: the invention uses the smartphone's own sensor data and incurs no other cost;
5) hard to forge: even if an attacker observes and imitates the user's behavior, it is difficult to deceive the system;
6) environment independent: the recognition result is not limited by the environment of the device, such as the application or the user's posture;
7) extensible: the behavior habit mechanisms can be preset according to actual conditions.
Detailed Description
As shown in Fig. 3, an identity recognition method based on abnormal behavior detection includes the following steps:
step 1): collecting the data of each sensor while a user enters the same password multiple times on the same smartphone;
the sensors comprise a device orientation sensor, an acceleration sensor, a magnetic field sensor and a gyroscope sensor;
the sensor data comprises the per-axis data of the device orientation sensor, the acceleration sensor, the magnetic field sensor and the gyroscope sensor;
step 2): collecting the user's responses to preset behavior habit mechanisms before and after entering the password;
the responses to the preset behavior habit mechanisms specifically comprise:
when the user name being entered contains digits, whether the user switches to the numeric keyboard or types the digits directly on the alphabetic keyboard is taken as the response to the first behavior habit mechanism;
the acceleration sensor data of the mobile phone during a short time after the user finishes entering the password is taken as the response to the second behavior habit mechanism. The short time is 5 to 6 s, specifically 5.5 s.
Step 3), cleaning and converting the sensor data obtained in step 1) to obtain normalized feature values, specifically comprises the following steps:
A) removing, in timestamp order, the records with empty values from the sensor data obtained in step 1) to obtain valid sensor data;
B) computing the modulus of the valid sensor data from step A): $m = \sqrt{x^2 + y^2 + z^2}$, wherein $m$ is the modulus and $x^2$, $y^2$, $z^2$ are the squares of the sensor's x-axis, y-axis and z-axis values, to obtain the modulus corresponding to the timestamp sequence;
C) computing, from the timestamp sequence and the corresponding moduli obtained in step B), five values for each sensor: the modulus at press, the modulus at release, and the maximum, minimum and mean modulus over the release period;
D) for a 6-digit numeric password, processing the timestamp sequence into a time feature sequence $T = (h_1, r_1, h_2, r_2, \dots, h_6, r_6)$, wherein $h_i$ is the hold time while the $i$-th digit is pressed and $r_i$ is the gap time during the release of the $i$-th digit;
E) normalizing the five values of each sensor obtained in step C) and the time feature sequence obtained in step D) to obtain the normalized feature values;
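Step D) can be illustrated with a short sketch. The function name and the `(press_ts, release_ts)` event layout are hypothetical. One assumption is needed for the last digit, which has no following key press: its gap time is measured up to an `end_ts` supplied by the caller (for example, the moment the form is submitted).

```python
def time_feature_sequence(key_events, end_ts):
    """Build the hold/gap sequence for one password entry.

    key_events: list of (press_ts, release_ts) per digit, in typing order.
    end_ts: timestamp closing the gap after the last digit (an assumption).
    Returns [hold_1, gap_1, hold_2, gap_2, ...]; 12 values for 6 digits.
    """
    features = []
    for i, (press_ts, release_ts) in enumerate(key_events):
        if i + 1 < len(key_events):
            next_press_ts = key_events[i + 1][0]
        else:
            next_press_ts = end_ts
        features.append(release_ts - press_ts)       # hold time of digit i
        features.append(next_press_ts - release_ts)  # gap time after digit i
    return features
```

For a 6-digit password, the result has the same layout as the time feature sequence described in step D).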
Step 4), cleaning and converting the responses to the behavior habit mechanisms obtained in step 2) to obtain the feature values of the behavior habit mechanisms, specifically comprises the following steps:
converting the response to the first behavior habit mechanism into a binary value of -1 or 1, according to whether the user switches to the numeric keyboard or types the digits directly on the alphabetic keyboard;
for the response to the second behavior habit mechanism, computing the Euclidean distance between each group of acceleration sensor vectors recorded after the password is entered and the initial sensor vector: $d = \sqrt{(x - x_0)^2 + (y - y_0)^2 + (z - z_0)^2}$, wherein $x$, $y$, $z$ are the three-axis values of each group of acceleration sensor data and $x_0$, $y_0$, $z_0$ are the initial values of the acceleration sensor. If the Euclidean distance $d$ is larger than a preset threshold, the feature value is 1, otherwise it is -1.
Step 5): as shown in Fig. 1, repeating steps 1), 2) and 3) to obtain a plurality of normalized feature values, and training on the training set samples with the OCSVM algorithm, using the sklearn package of the Python language, to obtain a model $M_1$;
Step 6): as shown in Fig. 2, repeating steps 1), 2) and 4) to obtain feature values of a plurality of behavior habit mechanisms, constructing a behavior baseline for each habit of the target user, and integrating all behavior baselines to construct a model $M_2$, which specifically comprises the following steps:
a) obtaining the weight of each behavior habit: $w_j = \left| \frac{1}{n} \sum_{i=1}^{n} x_{ji} \right|$, wherein $w_j$ is the weight of the $j$-th behavior habit mechanism and $x_{ji}$ is the $i$-th feature value of the $j$-th behavior habit mechanism;
b) normalizing the weights of all behavior habit mechanisms obtained in the previous step to obtain the normalized weight $\hat{w}_j$ of each behavior habit;
c) combining all behavior baselines to obtain the model $M_2$, wherein $x_{j,n+1}$ is the $(n + 1)$-th feature value of the $j$-th behavior habit mechanism.
Step 7): integrating the models obtained in step 5) and step 6) and assigning weights to obtain a fusion model $F = a M_1 + b M_2$, wherein $a$ and $b$ are the assigned weights;
step 8): as shown in Fig. 4, inputting new sample data, preprocessing the data, detecting abnormal behavior with the fusion model, and performing identity recognition, whereupon the recognition ends.
Specifically, the identity recognition method based on abnormal behavior detection proceeds as follows:
First, data collection
Step one: collecting the data of each sensor while the user enters the same password multiple times on the same smartphone.
1) Our SDK service is embedded in the app that requires password entry.
2) While the user enters the password, the SDK collects sensor data at a certain frequency, as well as the sensor data at the moments each digit is pressed and released.
Step two: collecting the user's responses to some preset behavior habit mechanisms before and after entering the password, for example:
1) Behavior habit mechanism 1: when the user name being entered contains digits, whether the user switches to the numeric keyboard or types the digits directly on the alphabetic keyboard.
2) Behavior habit mechanism 2: the acceleration sensor data of the mobile phone during a short time after the password has been entered.
Second, data preprocessing
The collected sensor data is cleaned and converted:
Step one: data cleaning, i.e. selecting the data that meets the following conditions:
1) the data must correspond to the same password;
2) the time series data and sensor data required by the algorithm model (device orientation sensor, acceleration sensor, magnetic field sensor, gyroscope sensor and so on) must not be null; records with null sensor data are discarded.
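The two cleaning conditions can be sketched as a simple filter. The record layout is hypothetical: each record is assumed to be a dict with a `timestamp`, a `password_id` that identifies which password was typed (an identifier rather than the password itself), and one `(x, y, z)` tuple per sensor.

```python
SENSORS = ("orientation", "acceleration", "magnetic_field", "gyroscope")


def clean_records(records, password_id):
    """Keep complete records for one password, sorted by timestamp."""
    kept = []
    for rec in records:
        # Condition 1: the record must belong to the same password.
        if rec.get("password_id") != password_id:
            continue
        # Condition 2: timestamp and every sensor axis must be present.
        if rec.get("timestamp") is None:
            continue
        complete = all(
            rec.get(name) is not None and all(v is not None for v in rec[name])
            for name in SENSORS
        )
        if complete:
            kept.append(rec)
    return sorted(kept, key=lambda r: r["timestamp"])
```

Records that fail either condition are dropped rather than repaired, which matches the rule above that null sensor data is discarded.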
Step two: data conversion, i.e. processing the raw data into a format that can be fed to the model.
1) The acceleration, device orientation, magnetic field and gyroscope sensors collect data on every axis, so the modulus must be computed: $m = \sqrt{x^2 + y^2 + z^2}$.
2) Five values are then computed from the sensor data: the value at press, the value at release, and the maximum, minimum and mean over the release period.
3) The time series data is processed as follows. Taking a 6-digit numeric password as an example, the time series data is converted into a time feature sequence $T = (h_1, r_1, h_2, r_2, \dots, h_6, r_6)$, where $h_i$ and $r_i$ are the hold time while the $i$-th digit is pressed and the gap time during its release, respectively.
4) To make data processing convenient and to speed up convergence in later model training, all feature values are normalized; experiments also show that normalization improves model accuracy. The normalization is $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$, where $x$ and $x'$ are the data before and after normalization, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the sample data.
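The min-max normalization in item 4) can be sketched as follows. The function name and the optional `lo`/`hi` parameters are hypothetical; they are included so that the minimum and maximum learned from the training samples can be reused when a new sample is normalized, which is a design assumption rather than something the text specifies.

```python
def min_max_normalize(values, lo=None, hi=None):
    """Scale values with x' = (x - x_min) / (x_max - x_min).

    lo/hi default to the minimum and maximum of `values`; pass the
    training-set bounds explicitly when normalizing new samples.
    """
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        # Constant feature: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

When training-set bounds are reused, a new value outside the training range maps outside [0, 1]; whether to clip such values is left open here.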
Step three: processing the behavior habit data:
1) Behavior habit mechanism 1: whether the user switches to the numeric keyboard or types the digits directly on the alphabetic keyboard is converted into a binary value of -1 or 1.
2) Behavior habit mechanism 2: for the acceleration sensor data recorded after the user has entered the password, the Euclidean distance between each group of vectors and the initial sensor vector is computed: $d = \sqrt{(x - x_0)^2 + (y - y_0)^2 + (z - z_0)^2}$, where $x$, $y$, $z$ are the three-axis values of each group of acceleration sensor data and $x_0$, $y_0$, $z_0$ are the initial values of the acceleration sensor. When $d$ is larger than a preset threshold, the feature value is 1, otherwise it is -1.
Third, model establishment
Step one: as shown in Fig. 1, based on the historical sensor data of the target user, the training set samples are trained with the OCSVM algorithm, using the sklearn package of the Python language, to obtain a model.
1) Determine the evaluation indices: the model is evaluated with three indices, the false acceptance rate (FAR), the false rejection rate (FRR) and the equal error rate (EER).
2) Determine the kernel function: after repeated experiments, the Gaussian kernel (RBF) was finally selected.
3) Set the training error parameter (nu).
4) Train on the training set to obtain the OCSVM model $M_1$ of the corresponding target user.
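The training and evaluation steps above can be sketched with scikit-learn's `OneClassSVM`. This is a minimal illustration on synthetic data, not the authors' experiment: the feature dimension, the random data, and the parameter values `nu=0.05` and `gamma="scale"` are assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-ins for normalized feature vectors (assumption).
rng = np.random.default_rng(0)
owner_train = rng.normal(0.5, 0.05, size=(200, 4))    # target user, training
owner_test = rng.normal(0.5, 0.05, size=(50, 4))      # target user, held out
impostor_test = rng.normal(0.9, 0.05, size=(50, 4))   # non-target users

# Gaussian (RBF) kernel; nu bounds the fraction of training errors.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
model.fit(owner_train)  # trained on target-user samples only

# predict() returns +1 for samples judged normal and -1 for abnormal ones.
owner_pred = model.predict(owner_test)
impostor_pred = model.predict(impostor_test)

# FRR: share of genuine samples rejected; FAR: share of impostors accepted.
frr = float(np.mean(owner_pred == -1))
far = float(np.mean(impostor_pred == 1))
print(f"FAR={far:.2f} FRR={frr:.2f}")
```

The EER would be obtained by sweeping a threshold over `model.decision_function(...)` scores until FAR and FRR coincide; that sweep is omitted here for brevity.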
Step two: as shown in Fig. 2, based on the historical behavior habit data of the target user, a behavior baseline is constructed for each habit of the target user, and all behavior baselines are integrated to construct a model:
1) Obtain the weight of each behavior habit: $w_j = \left| \frac{1}{n} \sum_{i=1}^{n} x_{ji} \right|$, where $x_{ji}$ is the $i$-th value, $\sum$ sums all the values, $n$ is the number of values and $|\cdot|$ takes the absolute value.
2) Normalize all the weights obtained in the previous step to obtain the weight $\hat{w}_j$ of each behavior habit.
3) Combine all behavior baselines to obtain the model $M_2$, where $x_{j,n+1}$ is the $(n + 1)$-th feature value of the $j$-th behavior habit mechanism.
Step three: integrating the models obtained in step one and step two and assigning weights to obtain a fusion model: $F = a M_1 + b M_2$, where $a$ and $b$ are the assigned weights.
Fourth, detection and identification
As shown in Fig. 4, when one or more new samples to be tested enter the system, the pre-trained fusion model is invoked. The new sample data goes through the same data cleaning and data preprocessing steps to be converted into a format the fusion model accepts, and the fusion model result is then obtained.
When the result output by the fusion model is greater than or equal to the threshold, the sample is considered to belong to the target user; when it is smaller than the threshold, the sample is considered not to belong to the target user.
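The decision rule can be sketched as a weighted linear combination followed by a threshold test. The function names, the default weights `a=0.7` and `b=0.3`, and the default threshold `0.0` are placeholders chosen for illustration; the text does not state the values actually used.

```python
def fuse(ocsvm_score, habit_score, a, b):
    """Linear combination of the OCSVM output and the habit-model output."""
    return a * ocsvm_score + b * habit_score


def is_target_user(ocsvm_score, habit_score, a=0.7, b=0.3, threshold=0.0):
    """Accept the sample when the fused score reaches the threshold.

    The weights and threshold are hypothetical defaults.
    """
    return fuse(ocsvm_score, habit_score, a, b) >= threshold
```

For example, with these placeholder values a sample that the OCSVM model rejects (score -1) is not rescued by a matching habit score of 1, because the fused score is -0.4.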