CN117521042B - High-risk authorized user identification method based on ensemble learning - Google Patents
High-risk authorized user identification method based on ensemble learning
- Publication number
- CN117521042B (CN202410014208.2A)
- Authority
- CN
- China
- Prior art keywords
- user
- risk
- data
- model
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/30—Authentication, i.e. establishing the identity or authorisation of security principals
- G06F21/31—User authentication
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computer Security & Cryptography (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Computer Hardware Design (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention belongs to the technical field of security protection and discloses a high-risk authorized user identification method based on ensemble learning. The method builds a high-risk authorized user identification model in a bastion host (fortress machine) by means of ensemble learning, which can achieve higher prediction performance than a single model and capture potential risk patterns more accurately. Identification takes into account multidimensional features covering user behavior, personal information and device information, so that when the model trained by ensemble learning is used to detect abnormal behavior, users who may exhibit high-risk authorization behavior can be identified accurately. Applied in the bastion host, ensemble learning thus identifies potential high-risk authorized users effectively and in real time during actual access. This strengthens system security and the identity verification process and, compared with conventional techniques, greatly improves real-time performance and accuracy.
Description
Technical Field
The invention belongs to the technical field of security protection, and particularly relates to a high-risk authorized user identification method based on ensemble learning.
Background
With the evolution and expansion of various applications, managing user authorization effectively and preventing high-risk authorization by users have become increasingly important. Application management services in the prior art can screen out high-risk authorized users in advance and apply security restriction operations to the users so screened, which improves the ability to manage high-risk authorized users. However, as technology and threats continue to develop, this identification approach cannot accurately identify each user during actual access: some abnormal features of a user are easily missed, which leads to missed detections, and real-time performance and accuracy are low. Because it cannot be determined accurately whether an accessing user is a high-risk authorized user, the user's session operations may also be affected. How to provide a high-risk authorized user identification method with high real-time performance and accuracy has therefore become a problem to be solved.
Disclosure of Invention
The invention aims to provide a high-risk authorized user identification method based on ensemble learning, which is used for solving the problem of low real-time performance and accuracy in the prior art.
In order to achieve the above object, the present invention provides a high risk authorized user identification method based on ensemble learning, comprising:
Acquiring user data of a target user and performing an ETL operation on it, comprising data extraction, data conversion and data loading steps, to eliminate erroneous and duplicate user data; wherein the user data includes behavior data, personal information data and device information data of the target user;
Performing feature extraction on the behavior data, the personal information data and the device information data in the user data to obtain user features of the target user;
Performing feature selection on the user features of the target user, reducing the dimension of the user data by eliminating irrelevant or redundant features and thereby reducing computational complexity;
Mapping the user data from a high-dimensional space to a low-dimensional space using Linear Discriminant Analysis (LDA);
Acquiring a high-risk authorized user identification model based on ensemble learning, wherein the high-risk authorized user identification model is obtained by training with the user features of a plurality of samples as input and the user identification results of the samples as output;
Inputting the user features of the target user into the high-risk authorized user identification model to obtain a user identification result of the target user; the high-risk authorized user identification model quantifies the difference between abnormal data and normal data and gives a score in the user identification result, where a smaller score indicates a higher degree of abnormality; and determining whether the user is a high-risk authorized user according to the score, the sources of the abnormal data and the influence of the data features.
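For reference, the LDA projection mentioned in the dimensionality-reduction step is conventionally defined through the Fisher criterion. The patent does not spell out the formula; the symbols below (class means \mu_c, overall mean \mu, class sizes n_c, projection matrix W) are standard textbook notation introduced here for illustration only.

```latex
S_W = \sum_{c} \sum_{x_i \in c} (x_i - \mu_c)(x_i - \mu_c)^{\top}, \qquad
S_B = \sum_{c} n_c \, (\mu_c - \mu)(\mu_c - \mu)^{\top}, \qquad
W^{*} = \arg\max_{W} \frac{\left| W^{\top} S_B W \right|}{\left| W^{\top} S_W W \right|}
```

Here S_W is the within-class scatter and S_B the between-class scatter; the low-dimensional representation of a feature vector x is W^{*\top} x.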
Based on the above disclosure, the invention constructs in advance a high-risk authorized user identification model based on ensemble learning, trained with the sample user features of a large number of sample users as input and the user identification results of those sample users as output. In application, it is only necessary to obtain the behavior data, personal information data and device information data of a target user, clean the data through ETL, and extract features from it to obtain the user features of the target user; feature selection is then applied to the user features to reduce the amount of data to be processed. Finally, the user features of the target user are input into the high-risk authorized user identification model, and whether the target user is a high-risk authorized user is determined, after further optimization, from the resulting anomaly score. An authorized user receives their risk score at each login, so that authorized users can be managed and reminded.
Through the above design, the invention builds the high-risk authorized user identification model in the bastion host by means of ensemble learning. Ensemble learning can achieve higher prediction performance than other existing models, is applicable to anomaly detection on different data types, handles difficult anomaly identification problems in high-dimensional, nonlinearly separable data spaces more effectively, and can therefore capture potential risk patterns accurately without missing anomaly information. Identification also takes into account multidimensional features covering user behavior, personal information and device information, so that when the trained model is used to detect abnormal behavior, users who may exhibit high-risk authorization behavior can be identified accurately. Specifically, the invention applies a high-risk authorized user identification model based on the gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT) in the bastion host, and this approach has clear advantages for identifying high-risk users there. On the one hand, the gradient boosting tree is an additive model based on the boosting idea of ensemble learning: it trains a series of weak classifiers over multiple iterations, and in each iteration fits a new decision tree to the residuals of the previous classifier. On the other hand, the gradient boosting decision tree model can handle a large number of features and samples and copes well with nonlinear problems, so it can be applied effectively in this scenario to identify potential high-risk users from massive data. These properties give the gradient boosting tree model excellent predictive performance and generalization ability.
In the actual access process, potential high-risk authorized users are effectively identified in real time and a corresponding alarm is raised. This strengthens system security and the identity verification process; compared with conventional techniques, real-time performance and accuracy are greatly improved, making the method well suited to large-scale application and popularization in the technical field of system security protection.
In one possible design, the user identification result of the target user further includes a high-risk user confidence, and the target user is securely authenticated based on this confidence, the method including:
Determining the risk level of the target user based on the high-risk user confidence level of the target user;
And according to the risk level of the target user, adopting an operation strategy corresponding to the risk level of the target user to carry out security authentication on the target user.
In one possible design, determining the risk level of the target user based on the high-risk user confidence level of the target user includes:
If the high-risk user confidence of the target user is between the confidence threshold and a first risk threshold, determining that the risk level of the target user is a third-level risk user;
If the high-risk user confidence of the target user is between the first risk threshold and a second risk threshold, determining that the risk level of the target user is a second-level risk user;
If the high-risk user confidence of the target user is greater than the second risk threshold, determining that the risk level of the target user is a first-level risk user, wherein the risk levels of the first-level, second-level and third-level risk users decrease in that order.
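The three-way threshold rule above can be sketched as a small function. This is only an illustration: the patent gives no numeric thresholds, so the default values, the `RiskLevel` names, the `NONE` level for users below the confidence threshold, and the handling of values that fall exactly on a boundary are all assumptions.

```python
from enum import Enum


class RiskLevel(Enum):
    NONE = "not high-risk"      # below the confidence threshold (assumed extra case)
    THIRD = "third-level"       # lowest of the three risk levels
    SECOND = "second-level"
    FIRST = "first-level"       # highest risk level


def risk_level(confidence: float,
               confidence_threshold: float = 0.5,    # hypothetical value
               first_risk_threshold: float = 0.7,    # hypothetical value
               second_risk_threshold: float = 0.9):  # hypothetical value
    """Map a high-risk user confidence to a risk level."""
    if not confidence_threshold <= first_risk_threshold <= second_risk_threshold:
        raise ValueError("thresholds must be in non-decreasing order")
    if confidence > second_risk_threshold:
        return RiskLevel.FIRST
    # Boundary handling is an assumption: a value equal to a threshold
    # is assigned to the band that starts at that threshold.
    if confidence >= first_risk_threshold:
        return RiskLevel.SECOND
    if confidence >= confidence_threshold:
        return RiskLevel.THIRD
    return RiskLevel.NONE
```

With the default thresholds, `risk_level(0.95)` falls in the first-level band and `risk_level(0.6)` in the third-level band.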
In one possible design, after the target user is securely authenticated, the method further includes:
Judging whether the target user passes the security authentication;
If not, generating a blocking instruction of the target user, and executing the blocking instruction to block the access of the target user to the sensitive system or the resource.
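A minimal sketch of the authenticate-then-block flow follows. The patent does not say what the operation strategy for each risk level is, so the strategies are passed in as hypothetical callables (for example a second-factor check), and `BlockInstruction` and `enforce` are names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class BlockInstruction:
    """Hypothetical record telling the bastion host to cut off a user's access."""
    user_id: str
    reason: str


def enforce(user_id: str,
            level: str,
            strategies: Dict[str, Callable[[str], bool]]) -> Optional[BlockInstruction]:
    """Run the security authentication registered for `level`.

    Returns None when the user passes (or no strategy is registered for the
    level), and a BlockInstruction when authentication fails.
    """
    challenge = strategies.get(level)
    if challenge is None:
        return None
    if challenge(user_id):
        return None
    return BlockInstruction(user_id, f"failed security authentication at risk level {level!r}")
```

A caller would execute the returned instruction, if any, to block the user's access to the sensitive system or resource.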
In one possible design, before acquiring the user data of the target user, the method further includes:
Acquiring historical user data of a plurality of sample users, wherein the historical user data of any sample user comprises historical behavior data, historical personal information data and historical equipment information data of any sample user and a user tag of any sample user, and the user tag comprises a high-risk authorized user or a low-risk authorized user;
Performing feature extraction processing on historical behavior data, historical personal information data and historical equipment information data in each piece of historical user data to obtain sample user features of each sample user;
Carrying out association processing on sample user characteristics of each sample user and user labels of each sample user to obtain a plurality of association characteristic data, and dividing the plurality of association characteristic data into a training set and a testing set;
Training a gradient boosting tree model by taking each training data in the training set as input and the user identification result of the sample user corresponding to each training data as output, and, after training is completed, testing the trained gradient boosting tree model with the test set, so that when the test result meets a preset condition, the trained gradient boosting tree model is used as the high-risk authorized user identification model.
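The split-and-accept procedure can be sketched as follows. The patent does not specify the split ratio or the preset condition, so the 80/20 split, the fixed seed and the accuracy threshold used here are assumptions, and `predict` stands in for any trained model.

```python
import random


def split_train_test(records, test_ratio=0.2, seed=0):
    """Shuffle (features, label) records and split them into train and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def accept_model(predict, test_set, min_accuracy=0.9):
    """Assumed preset condition: accuracy on the test set reaches min_accuracy."""
    correct = sum(1 for features, label in test_set if predict(features) == label)
    return correct / len(test_set) >= min_accuracy
```

Only a model for which `accept_model` returns `True` would be adopted as the identification model; otherwise training would be repeated or tuned.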
In one possible design, the feature extraction process is performed on the historical behavior data, the historical personal information data and the historical equipment information data in each historical user data to obtain sample user features of each sample user, including:
For the historical user data corresponding to any sample user, performing primary feature extraction processing on the historical behavior data, the historical personal information data and the historical equipment information data in the historical user data corresponding to any sample user to obtain initial user features of any sample user;
Performing feature selection processing on each feature in the initial user features corresponding to any sample user to extract key features in the initial user features corresponding to any sample user;
and using the extracted key features to form sample user features corresponding to any sample user.
In one possible design, training the gradient boosting tree model by taking each training data in the training set as input and the user identification result of the sample user corresponding to each training data as output includes:
1. Initializing: first, the basic parameters of the model are determined, and log data of authorized user operations, such as login time, login location, login device and operation frequency, are extracted as input variables of the learning model.
2. Iteratively training decision trees: a series of weak classifiers is trained iteratively, and their results are weighted and combined to obtain the final strong classifier. The goal of the gradient boosting tree is to minimize a loss function, which is solved by gradient descent: in each iteration, the gradient boosting tree computes the residual of the current model and fits a new decision tree to that residual, thereby updating the model.
3. Calculating an anomaly score: the information entropy and the conditional entropy of the feature data are determined. Information entropy measures the degree of uncertainty or disorder in a data set, while conditional entropy measures the uncertainty remaining in the data given a certain feature. Using the information gain method, the contribution of each feature to the information in the training data set is quantified, and each item of abnormal data receives its own anomaly score. After an abnormality is identified, the final anomaly score is evaluated according to which type of abnormal data gave rise to the high-risk behavior.
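The entropy quantities named in step 3 can be computed as in the sketch below, which assumes discrete feature values and labels. How the patent turns these quantities into a per-record anomaly score is not specified, so only the entropy, conditional entropy and information gain themselves are shown.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Information entropy H(Y) of a list of discrete labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def conditional_entropy(feature_values, labels):
    """Conditional entropy H(Y | X): entropy of the labels within each feature value,
    weighted by how often that feature value occurs."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(group) / n * entropy(group) for group in groups.values())


def information_gain(feature_values, labels):
    """Information gain of a feature: H(Y) - H(Y | X)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)
```

A feature that separates high-risk from low-risk samples perfectly has a gain equal to the label entropy, while a feature unrelated to the label has a gain of zero.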
The beneficial effects of the invention are as follows: the invention builds the high-risk authorized user identification model in the bastion host by means of ensemble learning, which can achieve higher prediction performance than a single model and capture potential risk patterns more accurately. Identification takes into account multidimensional features covering user behavior, personal information and device information, so that when the model trained by ensemble learning is used to detect abnormal behavior, users who may exhibit high-risk authorization behavior can be identified accurately. By applying the ensemble-learning-based high-risk authorized user identification model in the bastion host, the invention can effectively identify potential high-risk authorized users in real time during actual access and raise a corresponding alarm. This strengthens system security and the identity verification process; compared with conventional techniques, real-time performance and accuracy are greatly improved, making the method well suited to large-scale application and popularization in the technical field of system security protection.
Drawings
Fig. 1 is a schematic flow chart of steps of a high-risk authorized user identification method based on ensemble learning according to an embodiment of the present invention.
Detailed Description
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the invention is briefly described below with reference to the accompanying drawings and the description of the embodiments or the prior art. Obviously, the drawings described below show only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort. It should be noted that the description of these embodiments is intended to aid understanding of the invention and does not limit it.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of example embodiments of the present invention.
It should be understood that the term "and/or", where it appears herein, merely describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may represent: A alone, B alone, or both A and B. The term "/and", where it appears herein, describes another kind of association and indicates that two relationships are possible; for example, "A/and B" may represent: A alone, or both A and B. In addition, the character "/", where it appears herein, generally indicates an "or" relationship between the objects before and after it.
Examples: referring to Fig. 1, the high-risk authorized user identification method based on ensemble learning provided by this embodiment uses the gradient boosting tree algorithm in ensemble learning to identify high-risk authorized users. By aggregating a plurality of decision trees, the performance of the model is improved step by step, so that users with high-risk authorization behavior can be identified accurately. In each iteration, the gradient boosting tree uses the error of the previous iteration to focus progressively on the features of high-risk authorization patterns, which further improves the accuracy and stability of the model. Applying this ensemble method in a bastion host therefore identifies potential high-risk authorized users effectively; compared with conventional techniques, practicality and accuracy are greatly improved and system security is effectively enhanced. The method may, for example but without limitation, run in a bastion host; it should be understood that this execution subject does not limit the embodiments of the present application. Accordingly, the steps of the method may be, but are not limited to, steps S1 to S4 below.
S1, obtaining user data of a target user, wherein the user data comprises behavior data, personal information data and device information data of the target user; in this embodiment, the behavior data may include, but is not limited to, login data, operation data and transaction data; the personal information data may include, but is not limited to, user authentication information, profile information and account type; the device information data may include, but is not limited to, device identification information (such as the type of device the user logs in from, the operating system and the browser), the IP address, and the device history (a list of the devices the user has used in the past and how often). The user data can be obtained by crawling, and can be obtained before each authorization or access of the target user, so that high-risk authorization behavior of the target user can be identified in real time from the user data. An ETL operation, comprising data extraction, data conversion and data loading, is performed on the user data of the target user to eliminate erroneous and duplicate data. After the user data of the target user is obtained, feature extraction may be performed so that users with high-risk authorization behavior can subsequently be identified from the extracted features; the feature extraction process is described in step S2 below.
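The ETL step in S1 can be illustrated with a small in-memory sketch. The record fields (`user_id`, `login_time`, `ip`) and the cleaning rules (drop records with a missing field, drop exact duplicates after trimming whitespace) are assumptions made for this example; the patent does not define a schema or a target store.

```python
REQUIRED_FIELDS = ("user_id", "login_time", "ip")  # hypothetical schema


def etl(raw_records):
    """Extract raw records, normalize them, and load the clean, de-duplicated rows."""
    seen = set()
    loaded = []
    for record in raw_records:                                   # extract
        row = tuple(str(record.get(field) or "").strip()         # transform: trim whitespace
                    for field in REQUIRED_FIELDS)
        if not all(row):                                         # erroneous: a field is missing or empty
            continue
        if row in seen:                                          # duplicate of an earlier record
            continue
        seen.add(row)
        loaded.append(dict(zip(REQUIRED_FIELDS, row)))           # load into the cleaned list
    return loaded
```

In a real deployment the load step would write to a database or feature store rather than return a list.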
Before acquiring the user data of the target user, the method further comprises:
Acquiring historical user data of a plurality of sample users, wherein the historical user data of any sample user comprises historical behavior data, historical personal information data and historical equipment information data of any sample user and a user tag of any sample user, and the user tag comprises a high-risk authorized user or a low-risk authorized user;
Performing feature extraction processing on historical behavior data, historical personal information data and historical equipment information data in each piece of historical user data to obtain sample user features of each sample user;
Carrying out association processing on sample user characteristics of each sample user and user labels of each sample user to obtain a plurality of association characteristic data, and dividing the plurality of association characteristic data into a training set and a testing set;
Training a gradient boosting tree model by taking each training data in the training set as input and the user identification result of the sample user corresponding to each training data as output, and, after training is completed, testing the trained gradient boosting tree model with the test set, so that when the test result meets a preset condition, the trained gradient boosting tree model is used as the high-risk authorized user identification model.
Further, performing feature extraction processing on historical behavior data, historical personal information data and historical equipment information data in each historical user data to obtain sample user features of each sample user, including:
For the historical user data corresponding to any sample user, performing primary feature extraction processing on the historical behavior data, the historical personal information data and the historical equipment information data in the historical user data corresponding to any sample user to obtain initial user features of any sample user;
Performing feature selection processing on each feature in the initial user features corresponding to any sample user to extract key features in the initial user features corresponding to any sample user;
S2, performing feature extraction processing on the behavior data, personal information data and device information data in the user data to obtain the user features of the target user; in this embodiment, feature extraction amounts to selecting the important features (i.e., key features) from the user data; for example, extracting features such as hour, minute and date from the login time in the login data, or extracting keywords from text data; the key features are determined during model training, and the process of determining them is described in detail in the model training process below; thus, once the key features covering behavior, personal information and device information have been extracted from the user data, they form the user features of the target user.
After the key features in the user data have been extracted, a high-risk authorized user identification model based on ensemble learning can be acquired, so that high-risk authorized users can subsequently be identified based on the model and the extracted key features; the model acquisition process may be, but is not limited to, as shown in step S3 below.
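As an illustration of the login-time example in S2, the sketch below derives hour, minute and date features from a timestamp. It assumes the login time is stored as an ISO 8601 string; the `weekday` and `is_weekend` fields are extra illustrative features not named in the patent.

```python
from datetime import datetime


def login_time_features(login_time: str) -> dict:
    """Derive time-related features from an ISO 8601 login timestamp."""
    ts = datetime.fromisoformat(login_time)
    return {
        "hour": ts.hour,
        "minute": ts.minute,
        "day": ts.day,
        "month": ts.month,
        "weekday": ts.weekday(),          # Monday is 0, Sunday is 6
        "is_weekend": ts.weekday() >= 5,
    }
```

For a login at `2024-01-05T09:30:00` this yields hour 9, minute 30 and day 5.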
S3, acquiring a high-risk authorized user identification model based on ensemble learning, wherein the high-risk authorized user identification model is obtained by training with the sample user features of a plurality of sample users as input and the user identification results of the sample users as output; in this embodiment, this amounts to training the gradient boosting tree model in ensemble learning with the user features of a plurality of sample users to obtain the high-risk authorized user identification model; the gradient boosting tree is a heterogeneous ensemble method that obtains the final model by training multiple decision trees on different subsets of the data; the trees are built sequentially, each tree correcting the prediction error of the previous one, so that different feature subsets are used to train the model during iteration, and in each iteration the gradient boosting tree focuses progressively on the features of high-risk authorization patterns, further improving the accuracy and stability of the model.
In specific applications, one training method for the gradient boosting tree may include, but is not limited to, the following steps:
Step one: acquiring historical user data of a plurality of sample users, wherein the historical user data of any sample user comprises historical behavior data, historical personal information data and historical device information data of that sample user and a user tag of that sample user, and the user tag comprises a high-risk authorized user or a low-risk authorized user; depending on the application, the tags may also be multi-class, for example distinguishing different classes of high-risk behavior; the content of the historical behavior data, the historical personal information data and the historical device information data is the same as that of the user data described above and is not repeated here.
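The sequential, residual-correcting construction described in S3 can be sketched in plain Python. This is a deliberately simplified illustration, not the patent's model: it uses depth-1 regression stumps, squared-error loss on 0/1 labels (so the residual is the negative gradient), no subsampling, and a score where larger means more likely high-risk. All function names and default parameters are assumptions.

```python
def fit_stump(X, residuals):
    """Find the single-split regression stump that best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lv, rv)
    return best[1:] if best else None


def stump_predict(stump, row):
    j, t, lv, rv = stump
    return lv if row[j] <= t else rv


def fit_gbdt(X, y, n_rounds=20, learning_rate=0.5):
    """Boosting loop: each new stump is fitted to the residuals of the current model."""
    base = sum(y) / len(y)                       # initial constant prediction
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]   # negative gradient of squared loss
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row) for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def gbdt_predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)
```

A production system would more likely rely on an established gradient boosting library with a classification loss; the loop above only shows how each tree corrects the error left by the previous ones.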
After obtaining the historical user data of the plurality of sample users, feature extraction may be performed, as shown in step two below.
Step two: performing feature extraction processing on the historical behavior data, historical personal information data and historical device information data in each piece of historical user data to obtain the sample user features of each sample user; in this embodiment, initial features may first be extracted from each piece of historical user data, and key features are then extracted from the initial features to form the final sample user features; since the feature extraction process is the same for every piece of historical user data, the two feature extraction stages are described below taking any one sample user as an example, and may be, but are not limited to, steps (1) to (3) below.
Step (1): for the historical user data corresponding to any sample user, performing primary feature extraction processing on the historical behavior data, the historical personal information data and the historical device information data in that historical user data to obtain the initial user features of the sample user; in specific applications, effective information (such as the login time and operation behavior in the login data, or the device name and IP address in the device information) may be extracted from each of the historical behavior data, the historical personal information data and the historical device information data, and the extracted effective information is used as the initial user features.
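As an illustration of step (1), the following minimal Python sketch pulls a few directly observable fields out of one raw record. The record layout and every field name (`behavior`, `login_time`, `operations`, `device_name`, `ip_address`, `role`) are hypothetical and chosen only for this example; the embodiment does not prescribe a data schema.

```python
from datetime import datetime


def extract_initial_features(record):
    """Extract simple initial features from one raw user record.

    The record layout is an assumption made for this sketch.
    """
    login_time = datetime.fromisoformat(record["behavior"]["login_time"])
    ip_address = record["device"]["ip_address"]
    return {
        "login_hour": login_time.hour,
        "is_weekend": int(login_time.weekday() >= 5),   # Saturday=5, Sunday=6
        "operation_count": len(record["behavior"]["operations"]),
        "device_name": record["device"]["device_name"],
        "ip_prefix": ".".join(ip_address.split(".")[:2]),
        "role": record["personal"]["role"],
    }


# Hypothetical sample record: behavior, personal and device information.
sample_record = {
    "behavior": {"login_time": "2024-01-06T02:15:00",
                 "operations": ["sudo", "scp", "rm"]},
    "personal": {"role": "operator"},
    "device": {"device_name": "laptop-17", "ip_address": "10.20.30.40"},
}
features = extract_initial_features(sample_record)
```

The resulting dictionary plays the role of the initial user features; categorical fields such as `device_name` would still need encoding before being passed to a model.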
After the initial user features corresponding to any sample user are obtained, key features can be extracted from the initial user features, wherein the key feature extraction process can be, but is not limited to, as shown in the following step (2).
Step (2): performing feature selection processing on each feature in the initial user features corresponding to any sample user to extract the key features from those initial user features; in specific applications, the importance or correlation of each initial user feature can be calculated using analysis of variance (ANOVA) and a mutual information algorithm, so that feature selection is performed according to importance and correlation and the key features are extracted; the feature importance of the gradient boosting tree model can also be used for feature selection; importance and correlation analysis based on analysis of variance and mutual information is a common feature selection technique, and its principle is not described in detail here.
After the extraction of the key features is completed, the extracted key features can be utilized to form the sample user features corresponding to any sample user, as shown in the following step (3).
Step (3): using the extracted key features to form the sample user features corresponding to the sample user.
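Steps (2) and (3) can be sketched as follows. This is only one possible reading: it ranks discrete features by their mutual information with the user tag and keeps the top `k`; the analysis-of-variance scoring mentioned above is not shown. The feature names (`night_login`, `vpn`, `os`) and the helper names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi


def select_key_features(rows, labels, k):
    """Rank features by mutual information with the label and keep the top k."""
    names = list(rows[0].keys())
    scores = {name: mutual_information([row[name] for row in rows], labels)
              for name in names}
    ranked = sorted(names, key=lambda name: scores[name], reverse=True)
    return ranked[:k], scores


# Hypothetical initial user features for four sample users.
rows = [
    {"night_login": 1, "vpn": 1, "os": "a"},
    {"night_login": 1, "vpn": 0, "os": "b"},
    {"night_login": 0, "vpn": 0, "os": "a"},
    {"night_login": 0, "vpn": 0, "os": "b"},
]
labels = [1, 1, 0, 0]          # 1 = high-risk tag, 0 = low-risk tag

key_names, scores = select_key_features(rows, labels, k=2)
# Step (3): keep only the key features as the sample user features.
sample_user_features = [{name: row[name] for name in key_names} for row in rows]
```

In this toy data `night_login` determines the tag completely while `os` is independent of it, so `os` is the feature that gets dropped.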
The key features can be extracted from the initial user features of each sample user through the steps (1) to (3), so that the sample user features of each sample user are formed; then, each sample user characteristic can be associated with the user label of the corresponding sample user, so that associated data are obtained; the data association process is as follows in step three.
Step three: associating the sample user features of each sample user with the user tag of that sample user to obtain a plurality of associated feature data, and dividing the plurality of associated feature data into a training set and a test set; in this embodiment, for example and without limitation, the associated feature data may be divided into the training set and the test set according to a preset ratio; preferably, the data is divided in an 8:2 ratio.
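A minimal sketch of step three, assuming the associated feature data is held as a list of `(features, tag)` pairs. The shuffle and the fixed seed are assumptions added for reproducibility; the embodiment only states the preset ratio.

```python
import random


def split_train_test(samples, train_ratio=0.8, seed=0):
    """Shuffle the associated feature data and split it by a preset ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical associated feature data: (sample user features, user tag).
associated = [({"login_hour": hour}, hour % 2) for hour in range(10)]
train_set, test_set = split_train_test(associated)   # 8:2 split
```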
After the training set and the test set are obtained, the gradient boosting tree can be trained; the training process is shown in step four below.
Step four: taking each piece of training data in the training set as input and the user identification result of the sample user corresponding to that training data as output, training a gradient boosting tree model, and, after training is completed, testing the trained gradient boosting tree model with the test set, so that when the test result meets a preset condition the trained gradient boosting tree model is used as the high-risk authorized user identification model; in this embodiment, the user identification result of any sample user is a high-risk user confidence, so that the model can be adjusted according to the confidence in combination with the tag data.
In this embodiment, the model training process is shown in the following steps a to f.
a. Initializing the iteration count t to 1, and extracting the training subset for the t-th iteration from the training set; in this embodiment, for example and without limitation, a number of training data may be selected at random from the training set (for example, by sampling with replacement) at each iteration to form the training subset for that iteration.
After obtaining the training subset at the t-th iteration, model training may be performed based thereon, as shown in steps b-d below.
b. Training the gradient boosting tree model with the training subset as input to obtain the model residual at the t-th iteration; in this embodiment, the training input and output at the t-th iteration are the same as described above and are not repeated here; the model residual at the t-th iteration may be calculated by, for example and without limitation, the following formula (1).
$$ r_{it} = y_i - F_{t-1}(x_i), \quad i = 1, 2, \ldots, N \qquad (1) $$
In the above formula (1), $r_{it}$ represents the model residual at the t-th iteration, and $y_i$ represents the true value (i.e., the tag data) of the i-th training data in the training subset at the t-th iteration; $x_i$ represents the i-th training data, $F_{t-1}$ represents the model prediction value at the (t-1)-th iteration, and N represents the total number of data in the training subset at the (t-1)-th iteration; when t is 1, $F_{t-1}$ is an initial value (the expression for the initial value is not reproduced in this text).
After the model residual at the t-th iteration is obtained, the residual can in turn be used to train the model; the training process is shown in step c below.
c. Training the gradient boosting tree model using the model residual to obtain the model prediction function at the t-th iteration; in this embodiment, the model prediction function is equivalent to a new regression tree trained on the residuals.
After the model prediction function is obtained, the model can be updated, wherein the updating process is as shown in the following step d.
d. Updating the gradient boosting tree model using the model prediction function to obtain an updated gradient boosting tree model; in practice, the model may be updated by, but is not limited to, the following formula (2).
$$ F_{t+1}(x) = F_t(x) + \eta \, h_t(x) \qquad (2) $$
In the above formula (2), $F_{t+1}(x)$ represents the updated gradient boosting tree model, $F_t(x)$ represents the model trained at the t-th iteration (i.e., the model output obtained in step b), $h_t(x)$ is the model prediction function at the t-th iteration, and $\eta$ represents the learning rate.
After the model is updated based on formula (2), the t-th iteration ends; it is then judged whether the end condition is met, and if not, the process jumps back to step a and repeats until the end condition is met; the judgment and the loop iteration are shown in steps e and f below.
e. Judging whether the training end condition is reached, wherein the training end condition comprises t being equal to the maximum number of iterations.
f. If not, adding 1 to t, replacing the gradient boosting tree model with the updated gradient boosting tree model, and extracting a new training subset for the t-th iteration from the training set to train the gradient boosting tree model again, until the training end condition is reached, at which point the trained gradient boosting tree model is obtained.
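Steps a to f can be sketched in plain Python as below. This is a simplified illustration under several assumptions that are not stated in the embodiment: a single numeric feature, one-split regression trees (stumps) as the per-iteration prediction function, the label mean as the initial value, and random subsets drawn without replacement. All function and variable names are hypothetical.

```python
import random


def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a stump) on one numeric feature."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        # If nothing falls to the right, reuse the left mean (constant tree).
        right_mean = sum(right) / len(right) if right else left_mean
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def train_gbdt(xs, ys, max_iterations=20, learning_rate=0.5,
               sample_fraction=0.75, seed=0):
    """Boosting loop following steps a-f; returns a prediction function."""
    rng = random.Random(seed)
    n = len(xs)
    subset_size = max(1, round(sample_fraction * n))
    initial = sum(ys) / n            # assumed initial value of the model
    trees = []

    def predict(x):
        # Current model: initial value plus the scaled sum of all trees so far.
        return initial + learning_rate * sum(h(x) for h in trees)

    t = 1
    while True:
        idx = rng.sample(range(n), subset_size)               # step a: subset
        residuals = [ys[i] - predict(xs[i]) for i in idx]     # step b: formula (1)
        h_t = fit_stump([xs[i] for i in idx], residuals)      # step c: fit residuals
        trees.append(h_t)                                     # step d: formula (2)
        if t == max_iterations:                               # step e: end condition
            break
        t += 1                                                # step f: next iteration
    return predict


# Hypothetical one-feature training set: login hour versus high-risk tag.
login_hours = [1, 2, 3, 4, 10, 11, 12, 13]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
model = train_gbdt(login_hours, labels)
confidence = model(12)    # plays the role of the high-risk user confidence
```

Appending `h_t` to `trees` is how the update of formula (2) is realized here: the model after the iteration is the previous model plus the learning rate times the new tree. A production system would more likely rely on an established gradient boosting library than on hand-written stumps.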
The training of the gradient boosting tree can be completed through steps a to f, yielding a trained gradient boosting tree model; the model can then be tested with the test set; the model may be tested using K-fold cross-validation, with its performance measured by metrics such as accuracy, recall, F1 score and mean squared error; after the K rounds of cross-validation, the K performance evaluation results are aggregated (the average is generally taken as the final measure of model performance) to judge whether the model meets the preset condition.
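The evaluation described above might look like the following sketch, which builds contiguous folds, computes accuracy, precision, recall and F1 per fold, and averages them. The `train_midpoint_classifier` function is a hypothetical stand-in for the real training routine, used only so the example runs on its own; mean squared error is omitted for brevity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples)
                     if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = high risk)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def cross_validate(xs, ys, k, train_fn, threshold=0.5):
    """Average the per-fold metrics over K rounds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = train_fn([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = [int(model(xs[i]) >= threshold) for i in test_idx]
        scores.append(classification_metrics([ys[i] for i in test_idx], preds))
    return {name: sum(s[name] for s in scores) / k for name in scores[0]}


def train_midpoint_classifier(xs, ys):
    """Hypothetical stand-in model: threshold halfway between class means."""
    low = [x for x, y in zip(xs, ys) if y == 0]
    high = [x for x, y in zip(xs, ys) if y == 1]
    midpoint = (sum(low) / len(low) + sum(high) / len(high)) / 2
    return lambda x: 1.0 if x > midpoint else 0.0


login_hours = [1, 10, 2, 11, 3, 12, 4, 13]
labels = [0, 1, 0, 1, 0, 1, 0, 1]
summary = cross_validate(login_hours, labels, k=4,
                         train_fn=train_midpoint_classifier)
```

The averaged dictionary in `summary` is what would be compared against the preset condition.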
In addition, in this embodiment, data preprocessing such as cleaning, conversion and normalization is required before feature extraction; these are common data preprocessing techniques, and their principles are not described in detail here.
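A minimal sketch of such preprocessing, assuming records are dictionaries: incomplete and duplicate records are dropped (cleaning) and a numeric field is min-max normalized. The field names and the choice of min-max scaling are assumptions for illustration only.

```python
def clean_records(records, required_fields):
    """Drop records with missing required fields and exact duplicates."""
    seen, cleaned = set(), []
    for record in records:
        if any(record.get(field) is None for field in required_fields):
            continue                      # incomplete record
        key = tuple(record[field] for field in required_fields)
        if key in seen:
            continue                      # duplicate record
        seen.add(key)
        cleaned.append(record)
    return cleaned


def min_max_normalize(values):
    """Scale numeric values linearly into the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Hypothetical raw records with one duplicate and one missing value.
raw = [
    {"user": "a", "login_hour": 2},
    {"user": "a", "login_hour": 2},
    {"user": "b", "login_hour": None},
    {"user": "c", "login_hour": 14},
    {"user": "d", "login_hour": 8},
]
cleaned = clean_records(raw, ["user", "login_hour"])
normalized_hours = min_max_normalize([r["login_hour"] for r in cleaned])
```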
With the above design, the bastion host collects historical user data of all online users and trains an ensemble learning model on that data, which improves identification accuracy; meanwhile, the gradient boosting tree automatically adapts to the complexity of the data during training, so it adapts well to the identification of different types of high-risk authorized users.
After the training and testing of the gradient boosting tree model are completed through the above steps, the high-risk authorized user identification model based on ensemble learning is constructed; the user features obtained in step S2 are then input into the high-risk authorized user identification model to obtain the identification result of the target user; the identification process is shown in step S4 below.
S4, inputting the user features of the target user into the high-risk authorized user identification model to obtain the user identification result of the target user, and raising a security alarm when the user identification result indicates that the target user is a high-risk authorized user; in this embodiment, for example, the user identification result of the target user includes a high-risk user confidence, so that when the high-risk user confidence is greater than or equal to a confidence threshold, the target user is determined to be a high-risk authorized user; the confidence threshold may be set according to actual use and is not specifically limited here.
Further, after determining that the target user belongs to the high-risk authorized user, the embodiment is further provided with a corresponding security restriction step, and the operation process is as shown in the following steps S41 and S42.
S41, determining the risk level of the target user based on the high-risk user confidence of the target user; in this embodiment, different risk thresholds may be set, and the risk level of the target user is determined from the relationship between the high-risk user confidence and the risk thresholds; if the high-risk user confidence of the target user is between the confidence threshold and a first risk threshold, the target user is determined to be a third-level risk user; if the high-risk user confidence of the target user is between the first risk threshold and a second risk threshold, the target user is determined to be a second-level risk user; if the high-risk user confidence of the target user is greater than the second risk threshold, the target user is determined to be a first-level risk user; in specific applications, the risk decreases in order from first-level to second-level to third-level risk users, the confidence threshold is smaller than the first risk threshold, and the first risk threshold is smaller than the second risk threshold; of course, the risk thresholds may be set according to actual use and are not specifically limited here.
After determining the risk level of the target user, a corresponding operation policy may be executed according to the risk level, as shown in step S42 below.
S42, according to the risk level of the target user, applying the operation strategy corresponding to that risk level to perform security authentication on the target user; in this embodiment, if the target user is a first-level risk user, stricter operations such as multiple verification and additional identity verification may be applied to ensure the security of the user's identity and behavior, and if the threat persists, the user's login is restricted; if the target user is a second-level risk user, medium-level operations are applied, such as restricting access to certain sensitive functions and sending risk prompts, to improve security; if the target user is a third-level risk user, lighter operations such as ordinary identity verification and monitoring may be applied; of course, the operation strategies corresponding to the different levels may be set according to actual use and are not limited to the examples above.
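Steps S41 and S42 amount to a threshold lookup followed by a strategy lookup, as in the sketch below. The numeric thresholds (0.5, 0.7, 0.9), the handling of values exactly on a boundary, and the strategy strings are all assumptions for illustration; the embodiment leaves them to actual use.

```python
def risk_level(confidence, confidence_threshold=0.5,
               first_risk_threshold=0.7, second_risk_threshold=0.9):
    """Map a high-risk user confidence to a risk level (1 is the highest).

    Returns None when the user is not identified as high-risk. Threshold
    values and boundary handling are assumptions made for this sketch.
    """
    if confidence < confidence_threshold:
        return None
    if confidence > second_risk_threshold:
        return 1
    if confidence >= first_risk_threshold:
        return 2
    return 3


# Hypothetical operation strategies per risk level, paraphrasing step S42.
OPERATION_STRATEGIES = {
    1: ["multiple verification", "additional identity verification"],
    2: ["restrict sensitive functions", "send risk prompt"],
    3: ["ordinary identity verification", "monitoring"],
}


def strategy_for(confidence):
    """Return the operation strategy for a confidence, or [] if not high-risk."""
    level = risk_level(confidence)
    return OPERATION_STRATEGIES.get(level, [])


actions = strategy_for(0.95)   # a confidence above the second risk threshold
```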
The user identification result of the target user further includes a high-risk user confidence, and performing security authentication based on the high-risk user confidence comprises the following steps:
Determining the risk level of the target user based on the high-risk user confidence of the target user;
And, according to the risk level of the target user, applying the operation strategy corresponding to that risk level to perform security authentication on the target user.
Further, determining the risk level of the target user based on the high-risk user confidence of the target user includes:
if the high-risk user confidence of the target user is between the confidence threshold and the first risk threshold, determining that the target user is a third-level risk user;
If the high-risk user confidence of the target user is between the first risk threshold and the second risk threshold, determining that the target user is a second-level risk user;
If the high-risk user confidence of the target user is greater than the second risk threshold, determining that the target user is a first-level risk user, wherein the risk decreases in order from the first-level risk user to the second-level risk user to the third-level risk user.
In addition, after the security authentication in the foregoing step, this embodiment further judges whether the target user passes the security authentication; if not, a blocking instruction for the target user can be generated and executed to block the target user's access to sensitive systems or resources; the actual high-risk authorized user is thereby forced offline and the dangerous permissions obtained previously are revoked; in this way, high-risk users can no longer access sensitive systems or resources, and their unsafe permission settings are removed from the system.
The invention uses the gradient boosting tree algorithm from ensemble learning to identify high-risk authorized users, that is, the performance of the model is improved step by step by combining a plurality of decision trees, so that users with high-risk authorization behavior can be accurately distinguished; in each iteration, the gradient boosting tree gradually focuses on the features of the high-risk authorization pattern according to the error of the previous iteration, further improving the accuracy and stability of the model.
Finally, it should be noted that: the foregoing description is only of the preferred embodiments of the invention and is not intended to limit the scope of the invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (5)
1. A high-risk authorized user identification method based on ensemble learning, characterized by being applied to a bastion host and comprising the following steps:
Acquiring user data of a target user, and performing an ETL operation on the user data of the target user, comprising data extraction, data conversion and data loading steps, to eliminate erroneous and duplicate user data; wherein the user data includes behavior data, personal information data, and device information data of the target user;
Performing feature extraction on the behavior data, the personal information data and the equipment information data in the user data to obtain user features of the target user;
Feature selection is carried out on the user features of the target user, and the dimension of user data is reduced by eliminating irrelevant or redundant features, so that the computational complexity is reduced;
mapping user data from a high dimension to a low dimension space using a Linear Discriminant Analysis (LDA);
Acquiring a high-risk authorized user identification model based on ensemble learning, wherein the high-risk authorized user identification model is obtained by training with user characteristics of a plurality of samples as input and user identification results of the samples as output;
Inputting the user characteristics of the target user into the high-risk authorized user identification model to obtain a user identification result of the target user; the high-risk authorized user identification model gives a score according to the difference between the quantized abnormal data and the normal data in the user identification result, and the smaller the score is, the higher the degree of abnormality is;
calculating an anomaly score: determining the information entropy and the conditional entropy of the feature data;
Using the information gain method to quantify the information contribution of the feature data to the training data set, wherein each piece of abnormal data has a corresponding anomaly score, and after an anomaly is identified, the final anomaly score is determined according to the type of abnormal data of the high-risk behavior;
tallying the sources of the abnormal data according to the scores and, together with the influence of the data features, determining whether the user is a high-risk authorized user;
Before acquiring the user data of the target user, the method further comprises:
Acquiring historical user data of a plurality of sample users, wherein the historical user data of any sample user comprises historical behavior data, historical personal information data and historical equipment information data of any sample user and a user tag of any sample user, and the user tag comprises a high-risk authorized user or a low-risk authorized user;
Performing feature extraction processing on historical behavior data, historical personal information data and historical equipment information data in each piece of historical user data to obtain sample user features of each sample user;
Carrying out association processing on sample user characteristics of each sample user and user labels of each sample user to obtain a plurality of association characteristic data, and dividing the plurality of association characteristic data into a training set and a testing set;
Taking each piece of training data in the training set as input and the user identification result of the sample user corresponding to that training data as output, training a gradient boosting tree model, and, after training is completed, testing the trained gradient boosting tree model with the test set, so that when the test result meets a preset condition the trained gradient boosting tree model is used as the high-risk authorized user identification model;
The training process of the high-risk authorized user identification model is shown in the following steps a-f;
a. Initializing the iteration count t to 1, and extracting the training subset for the t-th iteration from the training set; at each iteration, randomly selecting a plurality of training data from the training set to form the training subset for that iteration;
After obtaining the training subset at the t-th iteration, model training can be performed based on the training subset, as shown in the following steps b-d;
b. training the gradient boosting tree model with the training subset as input to obtain the model residual at the t-th iteration; calculating the model residual at the t-th iteration using the following formula (1);
$$ r_{it} = y_i - F_{t-1}(x_i), \quad i = 1, 2, \ldots, N \qquad (1) $$
In the above formula (1), $r_{it}$ represents the model residual at the t-th iteration, and $y_i$ represents the true value of the i-th training data in the training subset at the t-th iteration; $x_i$ represents the i-th training data, $F_{t-1}$ represents the model prediction value at the (t-1)-th iteration, and N represents the total number of data in the training subset at the (t-1)-th iteration; when t is 1, $F_{t-1}$ is an initial value (the expression for the initial value is not reproduced in this text);
After the model residual at the t-th iteration is obtained, the residual can in turn be used to train the model, the training process being shown in step c below;
c. training the gradient boosting tree model using the model residual to obtain the model prediction function at the t-th iteration;
After the model prediction function is obtained, the model can be updated, the updating process being shown in step d below;
d. Updating the gradient boosting tree model using the model prediction function to obtain an updated gradient boosting tree model; in specific implementation, the model is updated using the following formula (2);
$$ F_{t+1}(x) = F_t(x) + \eta \, h_t(x) \qquad (2) $$
In the above formula (2), $F_{t+1}(x)$ represents the updated gradient boosting tree model, $F_t(x)$ represents the model trained at the t-th iteration, $h_t(x)$ is the model prediction function at the t-th iteration, and $\eta$ represents the learning rate;
After the model is updated based on formula (2), the t-th iteration ends; it is then judged whether the end condition is met, and if not, the process jumps back to step a and repeats until the end condition is met; the judgment and the loop iteration are shown in steps e and f below;
e. Judging whether the training end condition is reached, wherein the training end condition comprises t being equal to the maximum number of iterations;
f. if not, adding 1 to t, replacing the gradient boosting tree model with the updated gradient boosting tree model, and extracting a new training subset for the t-th iteration from the training set to train the gradient boosting tree model again, until the training end condition is reached, at which point the trained gradient boosting tree model is obtained.
2. The high-risk authorized user identification method based on ensemble learning according to claim 1, wherein the user identification result of the target user further includes a high-risk user confidence, and security authentication is performed based on the high-risk user confidence, the method comprising:
Determining the risk level of the target user based on the high-risk user confidence level of the target user;
And according to the risk level of the target user, adopting an operation strategy corresponding to the risk level of the target user to carry out security authentication on the target user.
3. The high-risk authorized user identification method based on ensemble learning of claim 2, wherein determining the risk level of the target user based on the high-risk user confidence level of the target user comprises:
if the high-risk user confidence of the target user is between the confidence threshold and the first risk threshold, determining that the target user is a third-level risk user;
If the high-risk user confidence of the target user is between the first risk threshold and the second risk threshold, determining that the target user is a second-level risk user;
If the high-risk user confidence of the target user is greater than the second risk threshold, determining that the target user is a first-level risk user, wherein the risk decreases in order from the first-level risk user to the second-level risk user to the third-level risk user.
4. The high risk authorized user identification method based on ensemble learning of claim 2, wherein after security authentication of the target user, the method further comprises:
Judging whether the target user passes the security authentication;
If not, generating a blocking instruction of the target user, and executing the blocking instruction to block the access of the target user to the sensitive system or the resource.
5. The method for identifying high-risk authorized users based on ensemble learning according to claim 1, wherein the feature extraction processing is performed on the historical behavior data, the historical personal information data and the historical equipment information data in each historical user data to obtain sample user features of each sample user, and the method comprises the steps of:
For the historical user data corresponding to any sample user, performing primary feature extraction processing on the historical behavior data, the historical personal information data and the historical equipment information data in the historical user data corresponding to any sample user to obtain initial user features of any sample user;
Performing feature selection processing on each feature in the initial user features corresponding to any sample user to extract key features in the initial user features corresponding to any sample user;
and using the extracted key features to form sample user features corresponding to any sample user.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410014208.2A CN117521042B (en) | 2024-01-05 | 2024-01-05 | High-risk authorized user identification method based on ensemble learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117521042A (en) | 2024-02-06 |
CN117521042B (en) | 2024-05-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||