CN115409115A - Time sequence clustering abnormal terminal identification method based on user log - Google Patents

Time sequence clustering abnormal terminal identification method based on user log

Info

Publication number
CN115409115A
Authority
CN
China
Prior art keywords
data
user
abnormal
model
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211060899.7A
Other languages
Chinese (zh)
Inventor
温时豪
朱正亮
吴帅帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qichacha Technology Co ltd
Original Assignee
Qichacha Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qichacha Technology Co ltd
Priority to CN202211060899.7A
Publication of CN115409115A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0677: Localisation of faults

Abstract

The application relates to a time-series clustering method for identifying abnormal terminals based on user logs. In one embodiment, historical user access log data is explored and preprocessed, and feature variance screening is then performed to obtain data variables with higher correlation, which can effectively avoid possible overfitting. A Gaussian mixture clustering model is trained iteratively, an hour-level anomaly identification model is built from the training result, and real-time identification is completed online through big data stream processing, so that identification is faster and more accurate.

Description

Time sequence clustering abnormal terminal identification method based on user log
Technical Field
The disclosure relates to the field of data statistical analysis, in particular to a time sequence clustering abnormal terminal identification method based on a user log.
Background
At present, in the big data age of information explosion, data is transmitted and presented on the internet in many ways, and more and more companies pay attention to protecting their own data. Excessive abnormal users, crawler robots and the like are a serious test of a company's data security and server resources. The core problem with losing data resources is the loss of company competitiveness. In addition, if abnormal users or crawler robots account for a high share of access, a large amount of server resources is consumed and wasted; in mild cases normal user access is affected, and in severe cases the website service becomes unavailable.
At present, abnormal users are identified mainly through traditional statistical data analysis. Existing analysis can only identify abnormal users in small batches, is inefficient, relies on rigid identification rules, and requires a large amount of manual calculation. Many problems also remain afterwards, such as the need for personnel to maintain the rules regularly and to handle customer complaints. In addition, the existing anomaly identification rules are strongly subjective, so abnormal user groups cannot be located automatically, intelligently and quickly.
Therefore, a method for automatically, intelligently and rapidly locating abnormal user groups is needed.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method for identifying an abnormal time-series clustering terminal based on a user log. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, a method for identifying a time-series clustering abnormal terminal based on a user log is provided, which includes:
obtaining historical user access log data;
performing data exploration and data preprocessing on the historical user access log data to obtain user basic data;
creating data characteristic variables based on the user basic data;
carrying out standardization processing on the data characteristic variables to obtain standard characteristic variables;
performing characteristic correlation analysis on the standard characteristic variables, and determining general characteristic variables of which the correlation is higher than a preset threshold;
performing feature variance analysis on the standard feature variables, and removing the general feature variables of which the variance does not reach a preset standard to obtain sample feature variables;
putting the sample characteristic variables into a Gaussian mixture clustering model for iterative training;
obtaining an hour-level anomaly identification model based on the iterative training result;
acquiring online user access log data through big data stream processing, and identifying the online user access log data in real time based on the hour-level anomaly identification model to obtain abnormal user data;
and determining abnormal terminal information according to the abnormal user data.
In one embodiment, the data exploration comprises:
acquiring field information and quantity information in the historical user access log data;
carrying out anomaly analysis on the field information, and determining whether an abnormal value exists or not;
and analyzing the missing value of the field information to determine whether the missing value exists.
In one embodiment, the data preprocessing comprises:
and finishing data cleaning and data filling according to the missing value to obtain the user basic data.
In one embodiment, the creating of the data characteristic variable based on the user basic data includes:
carrying out user identification based on the user basic data to obtain a user basic data identification;
performing aggregation statistics based on the user basic data to obtain aggregation data characteristics;
calculating a time window based on the user basic data to obtain time sequence data characteristics;
performing category coding based on the user basic data to obtain category label data characteristics;
and creating the data characteristic variable according to the user basic data identification, the aggregation data characteristic, the time sequence data characteristic and the category label data characteristic.
In one embodiment, the iteratively training the sample feature variables into the gaussian mixture clustering model comprises:
determining the initial number of clusters;
initializing Gaussian distribution parameters of each cluster, and constructing Gaussian mixture probability density;
traversing the standard characteristic variables and calculating the conditional probability of meeting the distribution;
calculating a new Gaussian distribution parameter based on the conditional probability to obtain a new probability density;
and repeating the steps, and performing the iterative training until the effect of the Gaussian mixture clustering model is converged.
In one embodiment, the obtaining an hourly abnormality recognition model based on the result of the iterative training includes:
extracting key parameter information in the iterative training result, and serializing the key parameter information into a local model file;
and constructing the hour-level anomaly identification model according to the local model file.
In one embodiment, the real-time identification of the online user access log data based on the hour-level anomaly identification model includes:
putting the online user access log data into a Kafka queue;
consuming, by Spark, the online user access log data in the Kafka queue in real time;
processing the online user access log data through Spark Structured Streaming to obtain user data to be tested that conforms to the format of the hour-level anomaly identification model;
and importing the user data to be tested into the hour-level anomaly identification model for real-time identification to obtain the abnormal user data.
In one embodiment, after the online user access log data is identified in real time based on the hour-level anomaly identification model to obtain anomalous user data, the method further includes:
marking the abnormal users in real time according to the abnormal user data, and carrying out risk management and control measures.
According to a second aspect of the embodiments of the present disclosure, there is also provided an apparatus for identifying a time-series clustering abnormal terminal based on a user log, including:
the data acquisition module is used for acquiring historical user access log data and online user access log data;
the data processing module is used for carrying out data exploration and data preprocessing on the acquired historical user access log data to obtain user basic data; the system is also used for carrying out characteristic correlation analysis and characteristic variance analysis on the standard characteristic variables to obtain sample characteristic variables;
the characteristic engineering module is used for creating a data characteristic variable according to the user basic data and carrying out standardization processing on the data characteristic variable to obtain the standard characteristic variable;
the model training module is used for putting the sample characteristic variables into a Gaussian mixture clustering model for iterative training to obtain key parameter information required for constructing an hour-level anomaly identification model;
the model construction module is used for constructing the hour-level anomaly identification model based on the key parameter information obtained by the iterative training;
the identification module is used for identifying the online user access log data in real time through the hour-level anomaly identification model to obtain abnormal user data;
and the exception handling module is used for determining the exception terminal information according to the exception user data.
According to a third aspect of the embodiments of the present disclosure, there is also provided a computer device including a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the above method when executing the computer program.
According to a fourth aspect of the embodiments of the present disclosure, there is also provided a computer-readable storage medium, on which a computer program is stored, wherein the computer program is configured to implement the steps of the above method when executed by a processor.
According to the technical scheme, basic log data that is more relevant to identifying abnormal users is obtained by exploring and preprocessing historical user access log data. The basic log data is processed through aggregation and a time window so that the data contains time-series characteristics. Feature variance screening is then performed to obtain data variables with higher correlation, which can effectively avoid possible overfitting. After standardization, the variables are put into a Gaussian mixture clustering model for iterative training, an hour-level anomaly identification model is built from the training result, and real-time identification is completed online through big data stream processing, so that identification is faster and more accurate.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the specification, and other drawings can be obtained by those skilled in the art without inventive labor.
FIG. 1 is a schematic flow chart of a method for identifying an abnormal terminal in time-series clustering based on a user log according to an embodiment;
FIG. 2 is a flow diagram that illustrates the data exploration process, in one embodiment;
FIG. 3 is a schematic flow diagram that illustrates the creation of data characteristic variables based on user base data in one embodiment;
FIG. 4 is a schematic flow chart of iterative training for putting sample feature variables into a Gaussian mixture clustering model in one embodiment;
FIG. 5 is a diagram illustrating determining the number of clusters based on the Akaike information criterion in one embodiment;
FIG. 6 is a schematic flow diagram that illustrates real-time identification of logged data for online user access based on an hourly anomaly identification model, under an embodiment;
FIG. 7 is a schematic diagram of an apparatus for identifying a chronological clustering abnormal terminal based on a user log according to an embodiment;
FIG. 8 is a schematic diagram of the internal structure of a computer device in one embodiment;
FIG. 9 is a schematic diagram of an internal structure of a computer device in another embodiment.
Reference numerals:
802-a data acquisition module; 804-a data processing module; 806-feature engineering module; 808-a model training module; 810-a model building module; 812-an identification module; 814-exception handling module.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in a process, method, article, or apparatus that comprises the recited elements is not excluded. For example, if the terms first, second, etc. are used to denote names, they do not denote any particular order.
In the present disclosure, when an element is referred to as being "connected" to another element or "coupled" to another element, it can be directly connected to the other element or intervening elements may be present, and the same is to be understood in a broad sense, e.g., fixedly connected, detachably connected, or integrally connected; either mechanically or electrically. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
As used herein, the terms "vertical," "horizontal," "left," "right," "upper," "lower," "front," "rear," "circumferential," "direction of travel," and similar expressions are based on the orientations and positional relationships shown in the drawings and are intended only to facilitate the description of the invention and to simplify the description, but do not indicate or imply that the device or element so referred to must have a particular orientation, be constructed and operated in a particular orientation, and is therefore not to be considered limiting of the invention.
Unless defined otherwise, technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the terms "and/or" and "at least one of ..." include any and all combinations of one or more of the associated listed items. It should be noted that the connections and the like described in this disclosure may be made directly through interfaces or pins between devices, or through wires.
In user data management, existing methods identify abnormal users mainly through traditional statistical data analysis. Existing analysis can only identify abnormal users in small batches, is inefficient, relies on rigid identification rules, and requires a large amount of manual calculation. Many problems also remain afterwards, and personnel are needed to maintain the rules regularly and handle customer complaints. Meanwhile, the existing anomaly identification rules depend heavily on the subjective experience and awareness of business personnel, so abnormal user groups cannot be located automatically, intelligently and quickly.
In one embodiment, as shown in fig. 1, a method for identifying an abnormal terminal in time-series clustering based on a user log is provided, which includes the following steps:
Step S202, obtaining historical user access log data.
The historical user access log data is used as sample data for determining the optimal identification model during training.
Specifically, the data acquired may be user access log data covering all 24 hours of every day of a complete week, including field information, total amount, and the like.
Step S204, performing data exploration and data preprocessing on the historical user access log data to obtain user basic data.
The data exploration mainly comprises analyzing the overall condition of the data, obtaining the field information and total amount of the data, and determining whether the data contains missing values and abnormal values; the data preprocessing comprises missing value processing and other operations on the data.
Optionally, distribution analysis, comparative analysis and statistical analysis can be performed on the data characteristics of each data field during data exploration.
Step S206, creating data characteristic variables based on the user basic data.
Specifically, the data characteristic variables are created from the user basic data after data preprocessing. In some other embodiments, the data characteristic variables may also be visualized.
Step S208, carrying out standardization processing on the data characteristic variables to obtain standard characteristic variables.
Specifically, the data is standardized according to the requirements of the Gaussian mixture clustering model used, yielding standard characteristic variables that the Gaussian mixture clustering model can use.
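The source does not say which standardization is applied. As a minimal sketch, the following assumes min-max scaling of each feature column to the range [0, 1]; the function name `min_max_scale` and the column names are hypothetical.

```python
def min_max_scale(columns):
    """Scale each named feature column to [0, 1] (assumed scaling method)."""
    scaled = {}
    for name, values in columns.items():
        lo, hi = min(values), max(values)
        span = hi - lo
        # A constant column carries no information; map it to all zeros.
        scaled[name] = [0.0 if span == 0 else (v - lo) / span for v in values]
    return scaled


# Hypothetical feature columns, one value per user.
raw = {"request_count": [10, 20, 30], "constant_flag": [5, 5, 5]}
standard = min_max_scale(raw)
```

Min-max scaling is chosen here rather than z-scoring because a later step screens the standardized variables by variance, and z-scored columns would all have unit variance.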
Step S210, performing characteristic correlation analysis on the standard characteristic variables, and determining general characteristic variables whose correlation is higher than a preset threshold.
Specifically, the analysis is based on the correlation between data features, and the feature variables whose correlation is higher than the preset threshold are determined. In some implementations, the variables whose correlation reaches the threshold can be identified from a heat map of the data. It should be noted that the preset threshold serves to screen out the feature variables with higher correlation and can be set by the user according to actual needs. In some embodiments, the preset threshold may be set to 80% or 90%.
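One way to picture the correlation analysis is a pairwise Pearson correlation over the feature columns, which is also what a correlation heat map displays. The sketch below assumes Pearson correlation (the source does not name the coefficient) and only lists the pairs above the threshold; the function names and feature names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant column
    return cov / sqrt(vx * vy)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical feature columns, one value per user.
features = {
    "requests": [1, 2, 3, 4],
    "bytes": [2, 4, 6, 8.5],
    "ip_switches": [1, 0, 1, 0],
}
pairs = correlated_pairs(features, threshold=0.8)
```

With these illustrative values only the `bytes` and `requests` columns exceed the 80% threshold.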
Step S212, performing feature variance analysis on the standard feature variables, and removing the general feature variables whose variance does not reach a preset standard to obtain sample feature variables.
The preset standard is set in advance according to the correlation of the user data features. In the variance analysis, the more highly correlated a data feature is, the smaller its variance and the less it discriminates between users. Screening against the preset variance standard therefore removes data that does not sufficiently distinguish users, and the feature variables that contribute most to predicting and identifying user anomalies are retained as the sample feature variables.
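The variance screening can be sketched as a simple filter that drops columns whose variance falls below a cutoff. The cutoff value `0.01`, the function name and the feature names are assumptions for illustration; the source only refers to a "preset standard".

```python
from statistics import pvariance


def variance_filter(columns, min_variance=0.01):
    """Keep only the feature columns whose population variance reaches the cutoff."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) >= min_variance
    }


# Hypothetical standardized columns: one varies across users, one does not.
standard = {
    "request_rate": [0.0, 0.5, 1.0],
    "status_ok": [1.0, 1.0, 1.0],
}
sample_features = variance_filter(standard)
```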
Step S214, putting the sample characteristic variables into a Gaussian mixture clustering model for iterative training.
The Gaussian mixture clustering model is a linear combination of multiple Gaussian distribution functions and is generally used when the data in one set follows several different distributions. The Gaussian mixture model uses Gaussian distributions as its parametric model and is trained with the expectation-maximization (EM) algorithm.
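To make the expectation-maximization loop concrete, here is a minimal sketch for a one-dimensional mixture. It is a simplification: the model in the source works on multi-dimensional feature vectors, and the quantile-based initialization, variance floor and stopping tolerance below are assumptions, as are all names.

```python
import math


def gaussian_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(xs, k, iters=200, tol=1e-8):
    """Fit a k-component 1-D Gaussian mixture with EM; returns (weights, means, variances, log-likelihood)."""
    n = len(xs)
    ordered = sorted(xs)
    # Assumed initialization: evenly spaced quantiles as starting means.
    mus = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    mean_all = sum(xs) / n
    var_all = max(sum((x - mean_all) ** 2 for x in xs) / n, 1e-6)
    variances = [var_all] * k
    weights = [1.0 / k] * k
    prev_ll = -math.inf
    ll = prev_ll
    for _ in range(iters):
        # E-step: conditional probability of each component given each point.
        resp = []
        ll = 0.0
        for x in xs:
            p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)]
            total = sum(p)
            ll += math.log(total)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate the parameters from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, 1e-6)  # floor avoids a collapsed component
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, mus, variances, ll


# Illustrative data with two well-separated groups.
data = [0.9, 1.0, 1.1, 1.2, 9.8, 10.0, 10.1, 10.3]
weights, means, variances, log_likelihood = fit_gmm_1d(data, k=2)
```

On this toy data the two fitted means settle near the two group averages (about 1.05 and 10.05) with roughly equal weights.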
Step S216, obtaining an hour-level anomaly identification model based on the result of the iterative training.
Specifically, the model parameter information of the optimal result obtained by iterative training is determined, and an hour-level anomaly identification model for predicting abnormal user behavior is constructed based on that model parameter information.
It should be noted that the sample feature variables used in training relate to the data features of users within one hour, and the data feature variables in step S206 are also created from the data features within one hour. After iterative training, 24 different hour-level anomaly identification models can therefore be obtained, which are used for hour-level real-time anomaly prediction and identification for the 24 time periods of a day.
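A sketch of how per-hour model parameters might be serialized to a local file and loaded back for scoring is shown below. The JSON layout, the one-dimensional mixture, the parameter values and the log-density threshold used as the anomaly rule are all assumptions for illustration; the source only states that key parameter information is serialized into a local model file.

```python
import json
import math
import os
import tempfile


def save_models(models, path):
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(models, fh)


def load_models(path):
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def log_density(x, params):
    """Log of the mixture density at x for one hour's parameters."""
    total = 0.0
    for w, mu, var in zip(params["weights"], params["means"], params["variances"]):
        total += w * math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
    return math.log(total) if total > 0 else -math.inf


def is_abnormal(x, hour, models, threshold=-10.0):
    # Assumed rule: a value that is very unlikely under that hour's mixture is abnormal.
    return log_density(x, models[str(hour)]) < threshold


# Hypothetical parameters, repeated for each of the 24 hours (JSON keys must be strings).
models = {
    str(h): {"weights": [0.9, 0.1], "means": [0.2, 0.8], "variances": [0.01, 0.01]}
    for h in range(24)
}
path = os.path.join(tempfile.mkdtemp(), "hourly_models.json")
save_models(models, path)
loaded = load_models(path)
```

Keeping one parameter set per hour mirrors the 24 hour-level models described above; at identification time the hour of the incoming record selects which set to apply.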
Step S218, acquiring online user access log data through big data stream processing, and identifying the online user access log data in real time based on the hour-level anomaly identification model to obtain abnormal user data.
Stream processing is a way of processing big data in which data can be operated on at any time as it enters the system.
Specifically, the online user access log data can be acquired through big data stream processing, and operations such as data cleaning, sliding time window calculation and standardization are performed on it. The processed data is then fed into the corresponding hour-level anomaly identification model for real-time identification to obtain abnormal user data. By using stream processing from big data technology, users are analyzed and identified in real time, abnormal users can be picked out of a large user population more quickly, and data security risks are ultimately guarded against.
Step S220, determining abnormal terminal information according to the abnormal user data.
The abnormal terminal information is information about the terminal equipment on which the abnormal behavior was performed.
Specifically, the abnormal behavior and the abnormal user are determined from the abnormal user data identified by the hour-level anomaly identification model, and the information about the terminal equipment used by the abnormal user during the abnormal behavior is obtained from the access log data.
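The per-record flow of steps S218 and S220 can be sketched in plain Python as a generator that keeps a sliding window per user, selects the model for the record's hour, and reports the terminal IP as soon as a record is flagged. This is only a stand-in for a real streaming pipeline: the record fields `uid`, `ts` and `ip`, the UTC hour lookup and the callable models are hypothetical, and records are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque
from datetime import datetime, timezone


def process_stream(records, models, window_s=300):
    """Yield (uid, ip, features) for each record that its hour's model flags."""
    windows = defaultdict(deque)  # per-user events inside the sliding window
    for rec in records:
        win = windows[rec["uid"]]
        win.append((rec["ts"], rec["ip"]))
        # Drop events that have fallen out of the window.
        while rec["ts"] - win[0][0] >= window_s:
            win.popleft()
        features = {
            "requests_5min": len(win),
            "distinct_ips_5min": len({ip for _, ip in win}),
        }
        hour = datetime.fromtimestamp(rec["ts"], tz=timezone.utc).hour
        if models[hour](features):
            yield rec["uid"], rec["ip"], features


# Stand-in models: a fixed rule per hour instead of a trained mixture model.
models = {h: (lambda f: f["distinct_ips_5min"] >= 3) for h in range(24)}
records = [
    {"uid": "u1", "ts": 0, "ip": "10.0.0.1"},
    {"uid": "u1", "ts": 10, "ip": "10.0.0.2"},
    {"uid": "u2", "ts": 15, "ip": "10.0.1.1"},
    {"uid": "u1", "ts": 20, "ip": "10.0.0.3"},
    {"uid": "u1", "ts": 400, "ip": "10.0.0.4"},
]
alerts = list(process_stream(records, models))
```

Because each record is scored on arrival, the third IP switch by `u1` is reported immediately, without waiting for the rest of the hour's data.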
In the technical scheme provided by the embodiments of the disclosure, historical user access log data is explored and preprocessed, and feature variance screening is then performed to obtain data variables with higher correlation, which can effectively avoid possible overfitting. A Gaussian mixture clustering model is trained iteratively, an hour-level anomaly identification model is built from the training result, and real-time identification is completed online through big data stream processing, in which data arrives as a stream and processing latency reaches the millisecond level. Compared with the existing batch processing mode, the technical scheme of this embodiment does not need to operate on the whole data set but operates on each data item as it is transmitted, so online user access log data is read in real time. Combined with the hour-level anomaly identification model, there is no need to wait for all the data to be tested to arrive: abnormal users and abnormal terminals can be identified immediately, and the risks that abnormal behavior may bring can be avoided earlier. The method analyzes in real time, in units of one hour, whether abnormal users were present in the last hour. It effectively addresses the problems caused by current abnormal users and traffic and by rigid anomaly identification rules, fills a gap in streaming-data prediction, and makes prediction more stable and intelligent.
In one embodiment, as shown in FIG. 2, data exploration includes:
Step S302, acquiring field information and quantity information in the historical user access log data.
The field information includes data fields such as the user's uid (user identification), ip, token, referrer and status. Here, ip is the IP address used by the user; token is the access token used in computer identity authentication and represents the right to perform an operation; referrer indicates the route by which the user entered the website or reached the web page; status represents the status of the user's access to the web page.
Step S304, carrying out anomaly analysis on the field information, and determining whether an abnormal value exists.
Specifically, whether an abnormal value exists in the field information is determined by means of correlation analysis, variance analysis, or the like. In this way, the presence of abnormal values in a field indicates whether the field has a large entropy or a large data variance, which in turn determines whether the field is able to distinguish normal users from abnormal users.
Step S306, carrying out missing value analysis on the field information, and determining whether a missing value exists.
A missing value is data that is absent because of errors or losses during data acquisition and transmission.
In this embodiment, data exploration and analysis of the historical user access log data yields the basic information of the acquired data and each field's ability to distinguish abnormal users, laying the groundwork for the subsequent screening and processing of data characteristics.
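As a rough illustration of this kind of exploration, the following sketch profiles each log field by counting missing values and distinct values and computing the Shannon entropy of the observed values. The function name and the sample rows are hypothetical; the source does not prescribe a specific profiling routine.

```python
from collections import Counter
from math import log2


def profile_fields(rows, fields):
    """Per-field missing count, distinct count and Shannon entropy (in bits)."""
    report = {}
    for field in fields:
        values = [row.get(field) for row in rows]
        present = [v for v in values if v is not None and v != ""]
        counts = Counter(present)
        total = len(present)
        entropy = 0.0
        if total:
            entropy = -sum(c / total * log2(c / total) for c in counts.values())
        report[field] = {
            "missing": len(values) - total,
            "distinct": len(counts),
            "entropy": entropy,
        }
    return report


# Hypothetical access-log rows.
rows = [
    {"uid": "u1", "ip": "10.0.0.1", "status": "200"},
    {"uid": "u2", "ip": "10.0.0.2", "status": "200"},
    {"uid": "u3", "ip": "10.0.0.3", "status": "200"},
    {"uid": "u4", "ip": None, "status": "200"},
]
report = profile_fields(rows, ["ip", "status"])
```

A field such as `status` that takes a single value has zero entropy and therefore cannot separate normal from abnormal users, while `ip` varies and also shows one missing value to be handled in preprocessing.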
In one embodiment, the data pre-processing comprises:
and finishing data cleaning and data filling according to the missing value to obtain the user basic data.
Specifically, useless data with missing values is cleaned out, and the remaining gaps are filled according to the feature type of the missing data, yielding the user basic data. When the feature is numerical, the mean value of the feature can be used for filling; when the feature is non-numerical, the most frequently occurring value of the feature can be used for filling.
In some optional embodiments, the filled data may be further subjected to aggregation analysis, in which the data features of a single user are aggregated to obtain an index result for a specific data feature of that user, such as the user's longest login time. In some other embodiments, the filled data may be further subjected to sliding window analysis to obtain user data characteristics that contain information from different time periods.
In this embodiment, data cleaning and data filling are applied to the missing values, so that the sample data is complete, which benefits the stability of the final identification result.
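The filling rule described above (mean for numerical features, most frequent value for non-numerical features) can be sketched as follows. The function name and the `latency_ms` field are hypothetical; `status` is one of the log fields named in the source.

```python
from collections import Counter
from statistics import mean


def fill_missing(rows, numeric_fields, categorical_fields):
    """Return copies of rows with None values filled by mean or most frequent value."""
    fill = {}
    for field in numeric_fields:
        present = [r[field] for r in rows if r.get(field) is not None]
        fill[field] = mean(present) if present else None
    for field in categorical_fields:
        present = [r[field] for r in rows if r.get(field) is not None]
        fill[field] = Counter(present).most_common(1)[0][0] if present else None
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for field, value in fill.items():
            if new_row.get(field) is None:
                new_row[field] = value
        filled.append(new_row)
    return filled


# Hypothetical rows with one missing numeric and one missing categorical value.
rows = [
    {"uid": "u1", "latency_ms": 100, "status": "200"},
    {"uid": "u2", "latency_ms": None, "status": "200"},
    {"uid": "u3", "latency_ms": 300, "status": None},
]
filled = fill_missing(rows, ["latency_ms"], ["status"])
```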
In one embodiment, as shown in FIG. 3, creating data feature variables based on the user base data comprises:
step S402, based on the basic data of the user, user identification is carried out to obtain the basic data identification of the user.
The user basic data identification comprises the use time of the user, the identification of whether the user pays or not, the use terminal ip identification and the like.
And S404, carrying out aggregation statistics based on the user basic data to obtain aggregation data characteristics.
Specifically, aggregation operation can be performed on the user basic access log data through different aggregation statistical modes, so that different aggregation data characteristics are obtained.
Wherein the different aggregated data characteristics comprise the total amount of requests of a single user within one hour, the vip identification condition of the user within one hour and the like.
Step S406, time window calculation is carried out based on the user basic data, and time sequence data characteristics are obtained.
Specifically, a sliding time window function may be used to perform window calculation on the user basic access log data, so as to obtain a time series data characteristic of the user in a time series. In this way, the user data characteristics of the user in a certain time interval can be obtained.
The time sequence data characteristics of the user in the time sequence comprise the number of the user switching the ip in a 5-minute time window or the number of different ips.
And step S408, performing category coding based on the user basic data to obtain category label data characteristics.
Specifically, the corresponding encoding processing may be performed on the category type data in the user basic data to obtain the corresponding category label data characteristics. For example, the vip Identifier of the user is encoded by one hot code, and the index of the ip and access URI (Uniform Resource Identifier) used by the user is encoded by tagging. Where a URI is a string used to identify the name of an internet resource, allowing a user to interact with any resource (including local and internet) via a specific protocol.
Step S410, the data characteristic variable is created according to the user basic data identification, the aggregation data characteristic, the time sequence data characteristic and the category label data characteristic.
Specifically, the data characteristic variables are created based on the user basic data identification, the aggregated data characteristic, the time sequence data characteristic and the category label data characteristic obtained in the previous steps.
In this embodiment, aggregation, time-window calculation and similar operations on the user basic data yield data characteristics of the user over different time ranges, and the categorical data of the user is encoded separately, so that the created characteristic variables are more comprehensive. This improves the ability of the trained model to identify abnormal users within a given time interval, facilitates real-time anomaly identification or monitoring based on large volumes of user data and log data, and improves both data processing efficiency and the accuracy of model identification.
In one embodiment, as shown in fig. 4, putting the sample feature variables into the Gaussian mixture clustering model for iterative training comprises:
Step S502, determining the initial number of clusters.
The number of clusters is the number of groups into which the users are finally divided; a corresponding number of Gaussian distributions is formed.
Specifically, different cluster numbers are traversed, and the optimal number of clusters is determined through subsequent model calculation and data evaluation. In some other embodiments, the Akaike information criterion (AIC) and the silhouette coefficient may be used to select the optimal number of clusters. The Akaike information criterion measures the goodness of fit of a statistical model while balancing the complexity of the estimated model against how well it fits the data. The silhouette coefficient evaluates clustering quality by combining cohesion and separation, and can be used to compare, on the same original data, how different algorithms or different settings of one algorithm affect the clustering result.
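The cohesion-and-separation score described above (commonly called the silhouette coefficient) can be sketched for one-dimensional data as follows. For each point, `a` is the mean distance to the other members of its own cluster (cohesion) and `b` is the smallest mean distance to the members of any other cluster (separation); the point's score is `(b - a) / max(a, b)`. The function name, the convention that singleton clusters score 0, and the sample data are assumptions for illustration.

```python
def silhouette_score_1d(values, labels):
    """Mean silhouette coefficient for 1-D values under a given labelling."""
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    total = 0.0
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # Cohesion: mean distance to the other members of the same cluster.
        a = sum(abs(v - o) for o in own) / (len(own) - 1)
        # Separation: smallest mean distance to any other cluster.
        b = min(
            sum(abs(v - o) for o in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(values)

good = silhouette_score_1d([0, 1, 10, 11], [0, 0, 1, 1])
bad = silhouette_score_1d([0, 1, 10, 11], [0, 1, 0, 1])
```

Scores close to 1 indicate compact, well-separated clusters; negative scores indicate points that sit closer to another cluster than to their own.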
Step S504, initializing Gaussian distribution parameters of each cluster and constructing Gaussian mixture probability density.
Specifically, the standard characteristic variables fall into multiple categories, and the data in each category follows a Gaussian distribution. The probability densities of the Gaussian distributions of the different categories are weighted and summed to give the final probability density function, denoted P(X) and calculated as follows:
$$P(X) = \sum_{k=1}^{K} \pi_k \, N(X \mid \mu_k, \varepsilon_k)$$
$$N(X \mid \mu_k, \varepsilon_k) = \frac{1}{(2\pi)^{d/2}\,|\varepsilon_k|^{1/2}} \exp\left(-\frac{1}{2}(X-\mu_k)^{T}\,\varepsilon_k^{-1}\,(X-\mu_k)\right)$$
wherein K represents the number of clusters, i.e. K Gaussian distributions are formed, and is a positive integer; π_k is the weight of the k-th Gaussian distribution in the weighted sum, with the weights summing to 1; μ and ε are the Gaussian distribution parameters, μ_k denoting the mean and ε_k the variance (the covariance matrix when d > 1) of the k-th category; d represents the dimension of the variable; N(X | μ_k, ε_k) is the Gaussian density of the k-th category, i.e. the probability density of X under the k-th category.
It should be noted that μ and ε are initialized with random values, and their final values are determined through the subsequent iterative training process.
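For a single feature (d = 1), the weighted sum of Gaussian densities described in step S504 can be sketched as below. The function names and the numeric parameters are illustrative assumptions; the weights are assumed to be non-negative and to sum to 1.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_pdf(x, weights, means, variances):
    """Weighted sum of the per-cluster Gaussian densities at x."""
    return sum(
        w * gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )

# Two hypothetical clusters: a broad "typical" group and a narrow group far away.
density_at_zero = mixture_pdf(0.0, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
```

With well-separated clusters, the density at a point is dominated by the nearest component, which is what makes the later maximum-probability assignment meaningful.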
Step S506, traversing the standard characteristic variables and calculating the conditional probability of meeting the distribution.
Specifically, by traversing all the data, the conditional probability that the standard feature variable of each user meets each distribution is calculated.
Step S508, calculating a new gaussian distribution parameter based on the conditional probability, and obtaining a new probability density.
Specifically, new values of μ and ε are calculated from the conditional probabilities obtained in the previous step, and the Gaussian distribution model is then updated with the new μ and ε to obtain a new probability density.
Step S510, repeating the above steps to iteratively update the Gaussian distribution parameters of the Gaussian distribution model until the model converges.
In some embodiments, the final number of clusters may be determined according to the Akaike information criterion. The conditional probability of each user is then obtained from the Gaussian distribution parameters obtained in step S510, and each user's data is assigned to the cluster with the highest probability, thereby completing the division of the user data.
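Steps S506 to S510 follow the pattern of the expectation-maximization procedure for a Gaussian mixture. The sketch below shows that pattern for one-dimensional data using only the standard library. It is an illustration under stated assumptions rather than the implementation of the application: the function name `fit_gmm_1d`, the log-likelihood convergence test, the variance floor of `1e-6`, the fixed (non-random) initial parameters, and the sample data are all choices made for the example.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(data, means, variances, weights, max_iter=200, tol=1e-8):
    """Fit a 1-D Gaussian mixture; return (means, variances, weights, labels)."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    prev_ll = None
    for _ in range(max_iter):
        # S506: conditional probability of each point belonging to each cluster.
        resp = []
        ll = 0.0
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([p / total for p in joint])
        # S510: stop once the log-likelihood no longer changes (assumed criterion).
        if prev_ll is not None and abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
        # S508: recompute the parameters from the conditional probabilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the density well defined
            weights[j] = nj / len(data)
    # Assign each point to the cluster with the highest conditional probability.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, variances, weights, labels

data = [1.0, 1.2, 0.8, 1.1, 9.0, 9.3, 8.8, 9.1]
means, variances, weights, labels = fit_gmm_1d(data, [0.0, 10.0], [1.0, 1.0], [0.5, 0.5])
```

On this toy data the two groups are far apart, so the procedure settles within a few iterations and the final assignment separates the low values from the high ones.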
In this embodiment, the Gaussian mixture clustering model is iteratively trained on the data. Once the model has converged, the optimal number of clusters can be obtained, giving a model that better matches the actual anomaly patterns and therefore a higher final identification accuracy.
FIG. 5 is a diagram illustrating the determination of the number of clusters based on the Akaike information criterion during the process of FIG. 4 in one embodiment.
In this figure, the horizontal axis represents the number of clusters, and the vertical axis represents the AIC (Akaike information criterion) value. The smaller the AIC value, the better the model; the optimal number of clusters is the one at which the AIC value is minimized.
In fig. 5, the AIC value is smallest when the number of clusters is 2, so the optimal number of clusters is determined to be 2.
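The selection rule can be made concrete with the standard definition AIC = 2p - 2 ln(L), where p is the number of free parameters and L is the maximized likelihood. In the sketch below the log-likelihood values are made up for illustration and are not the values plotted in fig. 5; the parameter count assumes a one-dimensional mixture.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: AIC = 2p - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def gmm_param_count_1d(k):
    """Free parameters of a 1-D mixture: k means, k variances, k - 1 weights."""
    return 3 * k - 1

# Hypothetical log-likelihoods of models fitted with 1, 2 and 3 clusters.
log_likelihoods = {1: -120.0, 2: -60.0, 3: -59.5}

scores = {k: aic(ll, gmm_param_count_1d(k)) for k, ll in log_likelihoods.items()}
best_k = min(scores, key=scores.get)
```

Going from 2 to 3 clusters barely improves the fit here, so the penalty for the extra parameters outweighs the gain and the 2-cluster model has the lowest AIC.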
In one embodiment, obtaining the hour-level anomaly identification model based on the result of the iterative training includes:
extracting key parameter information in the iterative training result, and serializing the key parameter information into a local model file;
constructing the hour-level anomaly identification model according to the local model file;
the key parameter information comprises the number of the optimal model clusters and the weight information of the data characteristics.
In some optional embodiments, the mapping information, weight parameter information and the like produced during iterative training of the Gaussian mixture clustering model may be serialized into a local model file. Before anomaly identification starts, the local model file is loaded and the time-series Gaussian mixture model is reconstructed. The reconstructed time-series Gaussian mixture model is the hour-level anomaly identification model.
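Serializing the key parameters and reloading them before identification can be sketched as follows. The application does not specify a file format; JSON, the file name `hour_level_gmm.json`, the dictionary keys, and the parameter values are assumptions chosen for the example.

```python
import json
import os
import tempfile

def save_model(path, params):
    """Serialize the key parameter information into a local model file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(params, fh)

def load_model(path):
    """Load the local model file so the model can be reconstructed."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

# Hypothetical training result: optimal cluster count plus mixture parameters.
params = {
    "n_clusters": 2,
    "weights": [0.5, 0.5],
    "means": [1.025, 9.05],
    "variances": [0.02, 0.04],
}

path = os.path.join(tempfile.gettempdir(), "hour_level_gmm.json")
save_model(path, params)
restored = load_model(path)
```

Because only a handful of numbers are stored, the file is small and can be loaded quickly by whichever process performs the online identification.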
In one embodiment, as shown in fig. 6, identifying the online user access log data in real time based on the hour-level anomaly identification model includes:
Step S602, putting the online user access log data into a Kafka queue.
Kafka is a message queue capable of handling message streams at big-data scale.
Step S604, Spark consumes the online user access log data in the Kafka queue in real time.
Spark is a distributed computing framework; real-time consumption means that Spark processes the online user access log data delivered by the Kafka message queue as it arrives.
Step S606, processing the online user access log data through Spark Structured Streaming to obtain user data to be tested that conforms to the input format of the hour-level anomaly identification model.
Specifically, data processing such as standardization may be applied to the online user access log data to obtain user data to be tested that meets the input requirements of the hour-level anomaly identification model. Those skilled in the art will appreciate that this step ensures the data to be tested meets the requirements for input to the model.
Step S608, importing the user data to be tested into the hour-level anomaly identification model for real-time identification to obtain the abnormal user data.
Specifically, the hour-level anomaly identification model is built from the local model file obtained through iterative training, the user data to be tested obtained in the previous step is imported into the model for identification, and the abnormal user data is determined according to the resulting classification of the user groups.
The abnormal user data comprises identified abnormal behavior operation data and user information data for performing abnormal behavior operation.
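The flow of steps S602 to S608 can be illustrated with a small stand-in that runs without any external service. In this sketch a standard-library `queue.Queue` takes the place of the Kafka queue and a plain loop takes the place of the Spark consumer; it is not Kafka or Spark code. The model parameters, the single feature `requests_per_hour`, and the choice of which cluster counts as abnormal are all hypothetical.

```python
import math
import queue

# Hypothetical model reconstructed from a local model file.
MODEL = {
    "weights": [0.9, 0.1],
    "means": [20.0, 400.0],
    "variances": [100.0, 2500.0],
    "abnormal_cluster": 1,  # assumption: the high-volume cluster is the abnormal one
}

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify(x, model):
    """Return the index of the cluster with the highest weighted density at x."""
    scores = [
        w * gaussian_pdf(x, m, v)
        for w, m, v in zip(model["weights"], model["means"], model["variances"])
    ]
    return max(range(len(scores)), key=scores.__getitem__)

def consume(stream, model):
    """Drain the queue and collect the records assigned to the abnormal cluster."""
    abnormal = []
    while True:
        try:
            record = stream.get_nowait()
        except queue.Empty:
            break
        if classify(record["requests_per_hour"], model) == model["abnormal_cluster"]:
            abnormal.append(record)
    return abnormal

stream = queue.Queue()  # stand-in for the message queue
for rec in [
    {"user": "u1", "ip": "10.0.0.1", "requests_per_hour": 18},
    {"user": "u2", "ip": "10.0.0.2", "requests_per_hour": 430},
    {"user": "u3", "ip": "10.0.0.3", "requests_per_hour": 25},
]:
    stream.put(rec)

abnormal_users = consume(stream, MODEL)
```

Each flagged record carries both the behaviour that triggered the flag and the user and IP it belongs to, matching the two parts of the abnormal user data described above.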
In this embodiment, the Kafka queue is used as middleware to buffer peaks in data traffic in the message queue, which relieves the request-processing pressure on the server. In addition, Spark consumes the data in Kafka in real time, so processing is faster: each hour's online user access log data can be sent in real time to the corresponding hour-level anomaly identification model for identification, allowing risks caused by abnormal behavior to be headed off earlier and more effectively.
In one embodiment, after identifying the online user access log data based on the hour-level anomaly identification model to obtain anomalous user data, the method further includes:
marking the abnormal users in real time according to the abnormal user data, and carrying out risk management and control measures.
The risk management and control measures include: triggering security verification, for example requiring the user to complete a slider puzzle, an object identification challenge or a Chinese-character click challenge before continuing with the next operation; forcing the user to log in again when the abnormal user data reaches a certain threshold; displaying a pop-up window warning of abnormal usage; banning the abnormal user's account for a certain period, with the ban liftable only through an appeal; and the like.
Specifically, after the abnormal user is determined according to the abnormal user data, the abnormal user can be marked in real time by calling a risk management interface and sent to the blacklist library.
According to a second aspect of the embodiments of the present disclosure, as shown in fig. 7, there is also provided an apparatus for identifying a time-series clustering abnormal terminal based on a user log, including:
a data obtaining module 802, configured to obtain historical user access log data and online user access log data;
a data processing module 804, configured to perform data exploration and data preprocessing on the acquired historical user access log data to obtain user basic data, and further configured to perform feature correlation analysis and feature variance analysis on the standard feature variables to obtain sample feature variables;
a feature engineering module 806, configured to create a data feature variable according to the user basic data, and perform standardization processing on the data feature variable to obtain the standard feature variable;
a model training module 808, configured to put the sample feature variables into a Gaussian mixture clustering model for iterative training to obtain the key parameter information required for constructing the hour-level anomaly identification model;
a model construction module 810, configured to construct the hour-level anomaly identification model based on the key parameter information obtained by the iterative training;
an identification module 812, configured to identify online user access log data in real time through the hour-level anomaly identification model to obtain abnormal user data; the identification module 812 includes a data integration unit (not shown in the figure) for integrating the acquired online user access log data into user data to be tested conforming to the input format of the hour-level anomaly identification model;
and an exception handling module 814, configured to determine abnormal terminal information according to the abnormal user data.
According to a third aspect of the embodiments of the present disclosure, there is provided a computer device, which may be a server, and an internal structure diagram of which may be as shown in fig. 8. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store a local model file. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the above-mentioned identification method.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is used for communicating with an external terminal in a wired or wireless manner, and the wireless manner can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program, when executed by the processor, implements the above-mentioned identification method. The display screen of the computer device can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device can be a touch layer covering the display screen, a key, a track ball or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, mouse or the like.
Those skilled in the art will appreciate that the configurations shown in fig. 8 or 9 are merely block diagrams of some configurations relevant to the present disclosure, and do not constitute a limitation on the computing devices to which the present disclosure may be applied, and that a particular computing device may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps in the above-described method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others. The databases involved in the embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the various embodiments provided herein may be, without limitation, general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, or the like.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination of the technical features should be considered within the scope of the present specification as long as the combination contains no contradiction.
The above-mentioned embodiments express only several embodiments of the present application, and although they are described in specific detail, they should not be construed as limiting the scope of the present application. It is noted that other embodiments of the present disclosure will become readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described and illustrated in the drawings, and that various modifications and changes may be made without departing from the scope thereof.

Claims (11)

1. A time sequence clustering abnormal terminal identification method based on a user log is characterized by comprising the following steps:
acquiring historical user access log data;
performing data exploration and data preprocessing on the historical user access log data to obtain user basic data;
creating data characteristic variables based on the user basic data;
carrying out standardization processing on the data characteristic variables to obtain standard characteristic variables;
performing characteristic correlation analysis on the standard characteristic variables, and determining general characteristic variables with the correlation higher than a preset threshold;
performing feature variance analysis on the standard feature variables, and removing the general feature variables of which the variance does not reach the preset standard to obtain sample feature variables;
putting the sample characteristic variables into a Gaussian mixture clustering model for iterative training;
obtaining an hour-level anomaly identification model based on the iterative training result;
acquiring online user access log data through big data stream processing, and identifying the online user access log data in real time based on the hour-level anomaly identification model to obtain abnormal user data;
and determining abnormal terminal information according to the abnormal user data.
2. The identification method of claim 1, wherein the data exploration comprises:
acquiring field information and quantity information in the historical user access log data;
carrying out anomaly analysis on the field information to determine whether an abnormal value exists;
and analyzing the missing value of the field information to determine whether the missing value exists.
3. The identification method according to claim 2, wherein the data preprocessing comprises:
and finishing data cleaning and data filling according to the missing value to obtain the user basic data.
4. The method of claim 1, wherein the creating data characteristic variables based on the user base data comprises:
carrying out user identification based on the user basic data to obtain a user basic data identification;
performing aggregation statistics based on the user basic data to obtain aggregation data characteristics;
calculating a time window based on the user basic data to obtain time sequence data characteristics;
performing category coding based on the user basic data to obtain category label data characteristics;
and creating the data characteristic variable according to the user basic data identification, the aggregation data characteristic, the time sequence data characteristic and the category label data characteristic.
5. The identification method according to claim 1, wherein the iteratively training the sample feature variables into the Gaussian mixture clustering model comprises:
determining the initial number of clusters;
initializing Gaussian distribution parameters of each cluster, and constructing Gaussian mixture probability density;
traversing the standard characteristic variables and calculating the conditional probability of meeting the distribution;
calculating new Gaussian distribution parameters based on the conditional probability to obtain new probability density;
and repeating the steps, and performing the iterative training until the effect of the Gaussian mixture clustering model is converged.
6. The identification method according to claim 1, wherein the obtaining an hour-level anomaly identification model based on the result of the iterative training comprises:
extracting key parameter information in the iterative training result, and serializing the key parameter information into a local model file;
and constructing the hour-level anomaly identification model according to the local model file.
7. The identification method according to claim 1, wherein the identifying the online user access log data in real time based on the hour-level anomaly identification model comprises:
putting the online user access log data into a Kafka queue;
consuming, by Spark, the online user access log data in the Kafka queue in real time;
processing the online user access log data through Spark Structured Streaming to obtain user data to be tested conforming to the input format of the hour-level anomaly identification model;
and importing the user data to be tested into the hour-level anomaly identification model for real-time identification to obtain the abnormal user data.
8. The identification method according to claim 1, wherein after the online user access log data is identified in real time based on the hour-level anomaly identification model to obtain anomalous user data, the method further comprises:
marking the abnormal users in real time according to the abnormal user data, and carrying out risk management and control measures.
9. An identification device for a time sequence clustering abnormal terminal based on a user log is characterized by comprising:
the data acquisition module is used for acquiring historical user access log data and online user access log data;
the data processing module is used for carrying out data exploration and data preprocessing on the acquired historical user access log data to obtain user basic data, and is further used for carrying out feature correlation analysis and feature variance analysis on the standard feature variables to obtain sample feature variables;
the feature engineering module is used for creating a data feature variable according to the user basic data and carrying out standardization processing on the data feature variable to obtain the standard feature variable;
the model training module is used for putting the sample feature variables into a Gaussian mixture clustering model for iterative training to obtain key parameter information required for constructing an hour-level anomaly identification model;
the model construction module is used for constructing the hour-level anomaly identification model based on the key parameter information obtained by the iterative training;
the identification module is used for identifying the online user access log data in real time through the hour-level anomaly identification model to obtain abnormal user data;
and the exception handling module is used for determining the abnormal terminal information according to the abnormal user data.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the identification method of any one of claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the identification method of any one of claims 1 to 8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211060899.7A CN115409115A (en) 2022-08-31 2022-08-31 Time sequence clustering abnormal terminal identification method based on user log

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211060899.7A CN115409115A (en) 2022-08-31 2022-08-31 Time sequence clustering abnormal terminal identification method based on user log

Publications (1)

Publication Number Publication Date
CN115409115A true CN115409115A (en) 2022-11-29

Family

ID=84164031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211060899.7A Pending CN115409115A (en) 2022-08-31 2022-08-31 Time sequence clustering abnormal terminal identification method based on user log

Country Status (1)

Country Link
CN (1) CN115409115A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117033464A (en) * 2023-08-11 2023-11-10 上海鼎茂信息技术有限公司 Log parallel analysis algorithm based on clustering and application
CN117033464B (en) * 2023-08-11 2024-04-02 上海鼎茂信息技术有限公司 Log parallel analysis algorithm based on clustering and application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination