CN112738088A - Behavior sequence anomaly detection method and system based on unsupervised algorithm - Google Patents

Behavior sequence anomaly detection method and system based on unsupervised algorithm

Publication number
CN112738088A
Authority
CN
China
Prior art keywords
user
sequence
opr
session
behavior sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011589236.5A
Other languages
Chinese (zh)
Other versions
CN112738088B (en)
Inventor
梁淑云
刘胜
马影
陶景龙
王启凡
魏国富
徐�明
殷钱安
余贤喆
周晓勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Data Security Solutions Co Ltd
Original Assignee
Information and Data Security Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Data Security Solutions Co Ltd
Priority to CN202011589236.5A
Publication of CN112738088A
Application granted
Publication of CN112738088B
Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Computer Hardware Design (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a behavior sequence anomaly detection method based on an unsupervised algorithm. Using operation data from an enterprise web system, the method calculates the time interval between each pair of consecutive operations in the order of the user's operations, segments the user behavior sequence according to whether that interval exceeds a preset threshold, and then trains a probabilistic suffix tree model. The probabilistic suffix tree model outputs the probability values corresponding to each user behavior sequence; these probability values are used as features and input to an isolation forest model, and whether the user's behavior is abnormal is judged from the output of that model.

Description

Behavior sequence anomaly detection method and system based on unsupervised algorithm
Technical Field
The invention relates to the technical field of information security, and in particular to a behavior sequence anomaly detection method and system based on an unsupervised algorithm.
Background
In recent years, with the continuous development of cloud computing, big data, Internet of Things, and mobile internet technologies and of market demand, the business systems of all industries have grown rapidly, and the accompanying network attack methods have also diversified. Conventional security protection measures provide only traditional protection, and their effectiveness is gradually eroding in today's complex network environment. Rapidly and accurately uncovering attack threats, malicious users, and malicious behaviors is becoming more and more difficult. Malicious behaviors such as website attacks, "wool pulling" (abuse of promotional offers), and theft of internal enterprise data are hidden among large volumes of normal network behavior through various disguises, and cause serious harm to individuals, enterprises, and society. A method is therefore needed that can process massive, complex web log data and detect abnormal behaviors or attack threats in a timely and accurate manner.
Existing user behavior sequence anomaly detection falls mainly into four approaches. The first is rule-engine based: drawing on business experience, a series of behavior sequence combinations is summarized manually or semi-manually to form a behavior sequence library, and a user behavior sequence that is not in the preset library is treated as suspected abnormal behavior. The second summarizes users' behavior characteristics from historical cases and trains a supervised or unsupervised model to detect whether user behavior is abnormal, or, as in patent No. CN201810668279.9, trains a Markov model on historical behavior patterns and uses the model to score how abnormal a user's behavior is. The third compares the similarity of behavior sequences between users to detect users whose behavior sequences are abnormal. The fourth, similar to the processing method mentioned in patent No. CN201810668279.9, cuts user behaviors at equal time intervals, computes statistical features of the user's various behaviors within each time slice, and then uses an unsupervised model such as an isolation forest to obtain an anomaly score for the user.
Based on the above discussion, the existing user behavior sequence anomaly detection methods have the following disadvantages:
Defining rules through manual or semi-manual summarization adapts poorly and is very difficult. First, the experts who summarize the rules must be familiar with every system and business, must understand the likely attack modes (that is, the security domain), must have rich rule-configuration experience, and must take the users' actual usage into account, so the approach is hard to put into practice. Second, when a system is updated or a business process changes, the corresponding rules may need to be updated and iterated at the same time, so flexibility is relatively poor. Moreover, the rule base keeps growing over time; because several people may collaborate on it and staff may turn over, the rule base becomes hard to maintain.
Summarizing users' behavior features, such as statistical features of fluctuation and frequency, from historical cases and training a supervised or unsupervised model yields a model that captures only statistical information about user behavior, for example obvious business statistics such as excessively high operation frequency or large operation fluctuation. Information carried by the behavior sequence itself, such as the temporal significance of the order of behaviors and the business significance of behavior combinations, is hard to capture. Attackers today typically evade detection by statistical-feature models through means such as low-frequency operation and changing IP addresses.
Constructing a Markov model by training on historical behavior patterns produces a transition probability matrix, from which the probability of a user's behavior sequence is calculated and used as the criterion for judging whether the user's behavior is abnormal. Although this method can capture users with abnormal behavior sequences to some extent, it has difficulty capturing abnormal behavior that extends beyond the preset time-sequence window. For example, with a preset window of 1, if behavior a is normal and behaviors ab and bc are normal, but behavior abc is abnormal, a model constructed in this way is unlikely to catch it. Furthermore, such a model is generally used to calculate the occurrence probability of the whole sequence, is insensitive to whether subsequences within the sequence are abnormal, and is affected by sequence length.
Cutting user behaviors at equal time intervals, computing statistical features of the user's various behaviors within each time slice, and then using an unsupervised model such as an isolation forest to obtain the user's anomaly score or anomaly label is still, in essence, a matter of constructing statistical features over behaviors, and part of the business information carried by the behaviors themselves may be lost.
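The window limitation described above for the Markov approach can be illustrated with a small counting sketch. The training sequences and the helper `transition_probs` are hypothetical and are not taken from the patent or from CN201810668279.9; they only show that a window of 1 rates every step of "abc" as normal, while a window of 2 exposes the unseen combination.

```python
from collections import Counter, defaultdict

def transition_probs(sequences, order):
    """Estimate P(next | previous `order` symbols) by simple counting."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return {
        ctx: {sym: n / sum(nxt.values()) for sym, n in nxt.items()}
        for ctx, nxt in counts.items()
    }

# Hypothetical history: "ab" and "bc" both occur, but "abc" never does.
history = ["abd", "abd", "xbc", "xbc"]

first_order = transition_probs(history, order=1)
second_order = transition_probs(history, order=2)

# With a window of 1, each step of "abc" has been seen before.
p_b_given_a = first_order["a"]["b"]
p_c_given_b = first_order["b"]["c"]

# With a window of 2, the continuation "ab" -> "c" was never observed.
p_c_given_ab = second_order["ab"].get("c", 0.0)

print(p_b_given_a, p_c_given_b, p_c_given_ab)
```

Under these assumed counts the first-order model assigns "abc" a non-zero probability, whereas the second-order lookup returns zero for the final step.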
Disclosure of Invention
The technical problem to be solved by the invention is that behavior sequence anomaly detection in the prior art suffers from a lack of labeled samples, low accuracy, a high false alarm rate, and the like.
The invention solves the technical problems through the following technical means:
A behavior sequence anomaly detection method based on an unsupervised algorithm comprises the following steps:
S01, collecting the operation log of an enterprise web system, and processing it to obtain a first operation table T1_opr of web users;
S02, based on the web user operation table T_opr, taking the data of a preset period and processing it to obtain a second operation table T2_opr; the second operation table T2_opr includes at least the field USER_ID and a session value determined from the time interval between two adjacent operations of the user;
S03, combining, for each USER_ID and session and in chronological order, the short-code column corresponding to the interfaces, to generate a behavior sequence table T_opr_seq;
S04, training a probabilistic suffix tree model of a preset order on the behavior sequence field of the behavior sequence table T_opr_seq to obtain a target model;
S05, processing the next day's web system operation log data according to steps S01-S03 to obtain the next day's user operation behavior sequence table T_opr_seq_test, inputting the behavior sequence field of T_opr_seq_test into the target model trained in S04 to convert it into a numerical sequence, and calculating the proportion ABNORMAL_LV of values in the numerical sequence that are below a preset probability threshold;
S06, cutting the numerical sequence obtained in step S05 to a preset width, and padding sequences shorter than the preset width on the right with the value -1, so that every record has the same length, that is, every user has the same number of features;
S07, taking the equal-length numerical sequence from step S06 and the proportion ABNORMAL_LV from step S05 as the feature input for each user, calculating the anomaly score of each record with the isolation forest algorithm, and outputting the abnormal users, that is, the records with a label of -1, thereby realizing anomaly detection of user behavior sequences.
The method provided by the invention is unsupervised, adapts well, and does not depend heavily on business experience. Segmenting the user behavior sequence by the concept of a "session" is logical and realistic. In addition, since most attackers in practice disguise themselves as normal users, they may be hard to detect by calculating the probability of the whole sequence or the proportion of abnormal behaviors in the whole sequence. The method converts the behavior sequence into a numerical sequence, uses the numerical sequence as feature input, and detects whether the user is abnormal with the isolation forest algorithm; it is effective at detecting anomalies of the whole sequence and is also sensitive to local sequence anomalies.
Further, the first operation table T1_opr in step S01 includes at least the following fields: user unique identifier USER_ID, user IP address IP_ADDR, operation time OPR_DATE, and operation type OPERATION_TYPE.
Further, the session value in step S02 is calculated as follows: the records of each USER_ID are sorted in ascending order of operation time OPR_DATE to form a first sequence number rd1 (1, 2, 3, …, n, …); following rd1, the interval between consecutive operations of each USER_ID is calculated, that is, the operation time at sequence number n is subtracted from the operation time at sequence number n+1, to generate the operation time interval field OPR_DUR. The initial session value of every user is preset to 1. For each record, it is judged whether the user's operation time interval is smaller than a preset threshold: if so, the record's session is the current session; if not, the record's session is the current session plus 1. Proceeding in this way assigns every record to a session, thereby generating a third column, the operation session identifier session.
Further, in step S03, the behavior sequence table T_opr_seq includes the fields USER_ID, session, and behavior sequence OPR_SEQ, where the content of the OPR_SEQ field is the combination of all operations within each session of each user.
The invention also provides a behavior sequence anomaly detection system based on an unsupervised algorithm, comprising:
a data acquisition module, configured to collect the operation log of the enterprise web system and process it to obtain a first operation table T1_opr of web users;
a first data processing module, configured to take the data of a preset period from the web user operation table T_opr and process it to obtain a second operation table T2_opr; the second operation table T2_opr includes at least the field USER_ID and a session value determined from the time interval between two adjacent operations of the user;
a second data processing module, configured to combine, for each USER_ID and session and in chronological order, the short-code column corresponding to the interfaces, to generate a behavior sequence table T_opr_seq;
a model training module, configured to train a probabilistic suffix tree model of a preset order on the behavior sequence field of the behavior sequence table T_opr_seq to obtain a target model;
a ratio calculation module, configured to process the next day's web system operation log data through the data acquisition module, the first data processing module, and the second data processing module to obtain the next day's user operation behavior sequence table T_opr_seq_test, to input the behavior sequence field of T_opr_seq_test into the trained target model to convert it into a numerical sequence, and to calculate the proportion ABNORMAL_LV of values in the numerical sequence that are below a preset probability threshold;
a numerical sequence cutting module, configured to cut the numerical sequence obtained by the ratio calculation module to a preset width and to pad sequences shorter than the preset width on the right with the value -1, so that every record has the same length, that is, every user has the same number of features;
and an anomaly detection module, configured to take the equal-length numerical sequence from the numerical sequence cutting module and the proportion ABNORMAL_LV from the ratio calculation module as the feature input for each user, to calculate the anomaly score of each record with the isolation forest algorithm, and to output the abnormal users, that is, the records with a label of -1, thereby realizing anomaly detection of user behavior sequences.
Further, the first operation table T1_opr in the data acquisition module includes at least the following fields: user unique identifier USER_ID, user IP address IP_ADDR, operation time OPR_DATE, and operation type OPERATION_TYPE.
Further, the session value in the first data processing module is calculated as follows: the records of each USER_ID are sorted in ascending order of operation time OPR_DATE to form a first sequence number rd1 (1, 2, 3, …, n, …); following rd1, the interval between consecutive operations of each USER_ID is calculated, that is, the operation time at sequence number n is subtracted from the operation time at sequence number n+1, to generate the operation time interval field OPR_DUR. The initial session value of every user is preset to 1. For each record, it is judged whether the user's operation time interval is smaller than a preset threshold: if so, the record's session is the current session; if not, the record's session is the current session plus 1. Proceeding in this way assigns every record to a session, thereby generating a third column, the operation session identifier session.
Further, the behavior sequence table T_opr_seq in the second data processing module includes the fields USER_ID, session, and behavior sequence OPR_SEQ, where the content of the OPR_SEQ field is the combination of all operations within each session of each user.
The present invention also provides a processing device comprising at least one processor, and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, which when called by the processor are capable of performing the methods described above.
The present invention also provides a computer-readable storage medium storing computer instructions for causing a computer to perform the method described above.
The invention has the advantages that:
First, the method provided by the invention is unsupervised, adapts well, and does not depend heavily on business experience. Second, segmenting the user behavior sequence by the concept of a "session" is logical and realistic. Third, since most attackers in practice disguise themselves as normal users, they may be hard to detect by calculating the probability of the whole sequence or the proportion of abnormal behaviors in the whole sequence. The method converts the behavior sequence into a numerical sequence, uses the numerical sequence as feature input, and detects whether the user is abnormal with the isolation forest algorithm; it is effective at detecting anomalies of the whole sequence and is also sensitive to local sequence anomalies.
Drawings
FIG. 1 is a flow chart of a detection method in an embodiment of the invention;
FIG. 2 is a schematic diagram of a probabilistic suffix tree structure with a depth of 3 according to an embodiment of the present invention;
FIG. 3 is an exemplary diagram of the conversion into a numerical sequence according to a probabilistic suffix tree model in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
This embodiment provides a behavior sequence anomaly detection method based on an unsupervised algorithm. Using operation data from an enterprise web system, the method calculates the time interval between each pair of consecutive operations in the order of the user's operations, segments the user behavior sequence according to whether that interval exceeds a preset threshold, and then trains a probabilistic suffix tree model. The probabilistic suffix tree model outputs the probability values corresponding to each user behavior sequence; these probability values are used as features and input to an isolation forest model, and whether the user's behavior is abnormal is judged from the output of that model. The specific steps are as follows:
S01, collecting the operation logs of the enterprise web system and standardizing them into a web user operation table T_opr, which contains the following fields: user unique identifier (USER_ID), user IP address (IP_ADDR), operation time (OPR_DATE), operation type (OPERATION_TYPE), and so on. Each operation type is mapped to a code in a preset single-character code set (a, b, c, d, …); if there are many operation types, the code set can be extended with common Chinese characters and the like. For example, the operation type "whole_page_load" (operation type name: page loading) corresponds to code "a" in the preset code set, the operation type "submit" (operation type name: submitting) corresponds to code "b", and the operation type "down" (operation type name: downloading) corresponds to code "c". Each operation type thus corresponds to a short code (SHORT_OPR_TYPE), which reduces memory usage and simplifies processing.
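The short-code mapping of S01 can be sketched as follows. This is only an illustration under stated assumptions: the patent uses a preset code set, whereas the hypothetical helper `build_code_map` assigns lowercase letters in first-seen order, and the sample rows of T_opr are invented.

```python
import string

def build_code_map(operation_types):
    """Assign each distinct operation type a single-character code, in first-seen order."""
    alphabet = string.ascii_lowercase  # assumed code set; extend it if there are more types
    code_map = {}
    for op in operation_types:
        if op not in code_map:
            if len(code_map) >= len(alphabet):
                raise ValueError("code set exhausted; extend the alphabet")
            code_map[op] = alphabet[len(code_map)]
    return code_map

# Hypothetical rows of T_opr: (USER_ID, IP_ADDR, OPR_DATE, OPERATION_TYPE)
t_opr = [
    ("u1", "10.0.0.1", "2020-12-01 09:00:00", "whole_page_load"),
    ("u1", "10.0.0.1", "2020-12-01 09:01:00", "submit"),
    ("u1", "10.0.0.1", "2020-12-01 09:02:00", "down"),
    ("u1", "10.0.0.1", "2020-12-01 09:03:00", "submit"),
]

code_map = build_code_map(row[3] for row in t_opr)

# Append SHORT_OPR_TYPE as a fifth column.
t_opr_coded = [row + (code_map[row[3]],) for row in t_opr]
print(code_map)
```

A fixed, preset mapping (as the text describes) keeps codes stable between the training period and the following day; a first-seen mapping like this one would need to be saved and reused for that purpose.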
S02, taking the data of a preset period (7 days) from the web user operation table T_opr processed above. In the invention, the data contains no field similar to "session"; if the original data already contains such a field, this step can be omitted. The records of each USER_ID are sorted in ascending order of operation time (OPR_DATE) to form a first sequence number rd1 (1, 2, 3, …, n, …). Following this sequence, the interval between consecutive operations of each user (USER_ID) is calculated, that is, the operation time at sequence number n+1 minus the operation time at sequence number n, to generate the operation time interval field OPR_DUR. The initial session value of every user is preset to 1. For each record, it is judged whether the user's operation time interval is smaller than a preset threshold (20 minutes): if so, the record's session is the current session; if not, the record's session is the current session plus 1. Proceeding in this way assigns every record to a session, thereby generating a third column, the operation session identifier session.
As for the session: in a web application, the web server checks whether the user already has a session. If so, the existing session continues to be used; if not, a session is created for the user; and when the session expires or is abandoned, the server terminates it. The invention uses this idea to construct a field similar to "session".
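The session assignment of S02 can be sketched in a few lines. The row layout, the helper name `assign_sessions`, and the choice to attach OPR_DUR to the later of the two operations are assumptions made for illustration; the 20-minute threshold comes from the embodiment.

```python
from datetime import datetime, timedelta
from itertools import groupby

SESSION_GAP = timedelta(minutes=20)  # preset threshold from the embodiment

def assign_sessions(rows, gap=SESSION_GAP):
    """rows: iterable of (user_id, opr_date, short_code).
    Returns a list of (user_id, opr_date, short_code, opr_dur, session)."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))  # per user, ascending time
    for user_id, user_rows in groupby(ordered, key=lambda r: r[0]):
        session = 1       # every user starts at session 1
        prev_time = None
        for _, opr_date, code in user_rows:
            # OPR_DUR: time since the user's previous operation (None for the first one).
            opr_dur = None if prev_time is None else opr_date - prev_time
            # An interval that is not smaller than the threshold opens a new session.
            if opr_dur is not None and opr_dur >= gap:
                session += 1
            out.append((user_id, opr_date, code, opr_dur, session))
            prev_time = opr_date
    return out

# Hypothetical operations for two users.
t0 = datetime(2020, 12, 1, 9, 0)
rows = [
    ("u1", t0, "a"),
    ("u1", t0 + timedelta(minutes=5), "b"),
    ("u1", t0 + timedelta(minutes=40), "c"),  # 35 minutes after the previous operation
    ("u2", t0, "a"),
]
result = assign_sessions(rows)
print([r[4] for r in result])
```

In this example the third operation of "u1" falls into session 2 because it follows a 35-minute pause.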
S03, for each user (USER_ID) and session, combining the short-code column (SHORT_OPR_TYPE) corresponding to the interfaces in chronological order to generate a behavior sequence table T_opr_seq, which contains the following fields: user unique identifier (USER_ID), session, and behavior sequence (OPR_SEQ). The content of the behavior sequence field is the combination of all operations within each session of each user. For example, if user "a 151" performs operations a, b, c, b, and c in sequence in session 15, the behavior sequence (OPR_SEQ) is abcbc.
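A minimal sketch of S03, assuming each row already carries a session number and a short code. The helper `build_sequences`, the row layout, and the timestamps are hypothetical; only the user, session, and resulting sequence mirror the example in the text.

```python
from collections import defaultdict

def build_sequences(rows):
    """rows: iterable of (user_id, session, opr_date, short_code).
    Returns {(user_id, session): OPR_SEQ string}."""
    grouped = defaultdict(list)
    for user_id, session, opr_date, code in rows:
        grouped[(user_id, session)].append((opr_date, code))
    # Join the codes of each (user, session) group in chronological order.
    return {
        key: "".join(code for _, code in sorted(ops, key=lambda op: op[0]))
        for key, ops in grouped.items()
    }

# Rows deliberately out of order to show that sorting by time matters.
rows = [
    ("a 151", 15, "09:00:03", "c"),
    ("a 151", 15, "09:00:01", "a"),
    ("a 151", 15, "09:00:05", "c"),
    ("a 151", 15, "09:00:02", "b"),
    ("a 151", 15, "09:00:04", "b"),
]
t_opr_seq = build_sequences(rows)
print(t_opr_seq[("a 151", 15)])  # abcbc
```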
S04, training a probabilistic suffix tree model of a preset order (order 3) on the behavior sequence field (OPR_SEQ) of the behavior sequence table T_opr_seq generated in S03.
A probabilistic suffix tree is a compact form of a variable-order Markov chain model that uses a suffix tree as its index structure. Once the subsequences form a probabilistic suffix tree, the probability of the next behavior can be predicted by examining nodes near the root of the tree. In short, a probabilistic suffix tree is a suffix tree built from subsequences that have predictive power: each node represents an element; as each node is generated, the full probability distribution of the next node is counted; and each edge lies on a path from the root node to the current node, that is, a subsequence inserted into the tree. The model is trained as follows:
1. Preset the probabilistic suffix tree depth L (3);
2. Initialize the root node (root), whose probability vector holds the probability of each symbol appearing in the sequences, and take all symbols as the candidate child node set;
3. For each candidate child node, calculate the probability of that candidate child node appearing in the subsequences, and take all symbols as the new candidate child node set;
4. Recurse on the above process until the tree depth on the current branch reaches the preset depth L or the candidate child node set is empty.
FIG. 2 shows a probabilistic suffix tree with a depth of 3.
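The training procedure above can be approximated by a simplified counting sketch. It is not the patent's implementation: it stores every context of length 0 to `depth` in a flat dictionary instead of an explicit tree, applies no pruning, and treats `depth` as the maximum context length, all of which are assumptions. The names `train_pst` and `next_prob` and the training sequences are hypothetical.

```python
from collections import Counter, defaultdict

def train_pst(sequences, depth=3):
    """Count next-symbol frequencies for every context of length 0..depth.
    The empty context "" plays the role of the root node."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for k in range(min(depth, i) + 1):
                counts[seq[i - k:i]][symbol] += 1
    # Normalize the counts into one probability vector per node.
    return {
        ctx: {s: n / sum(nxt.values()) for s, n in nxt.items()}
        for ctx, nxt in counts.items()
    }

def next_prob(pst, history, symbol, depth=3):
    """P(symbol | history), using the longest suffix of `history` stored in the model."""
    for k in range(min(depth, len(history)), -1, -1):
        ctx = history[len(history) - k:]
        if ctx in pst:
            return pst[ctx].get(symbol, 0.0)
    return 0.0

# Hypothetical behavior sequences (OPR_SEQ values).
pst = train_pst(["abcbc", "abab"], depth=3)
print(pst[""])                    # root node: overall symbol probabilities
print(next_prob(pst, "ab", "c"))  # probability of c after the context "ab"
```

Falling back to the longest known suffix is what gives the model its variable order: well-observed contexts use up to `depth` previous operations, while unseen ones fall back to shorter contexts and finally to the root distribution.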
S05, processing the next day's web system operation log data according to steps S01-S03 to obtain the next day's user operation behavior sequence table T_opr_seq_test, and converting the behavior sequence (OPR_SEQ) field of T_opr_seq_test into a numerical sequence with the probabilistic suffix tree model trained in S04, as shown in FIG. 3.
At the same time, the proportion ABNORMAL_LV of values below the preset probability threshold is calculated. For example, if the operation sequence of user "a 832" is "bbaaa", the corresponding numerical sequence is [0.523, 0.617, 0.038, 0.187, 0.191]. Only one value (0.038) is below the preset probability threshold, so the proportion ABNORMAL_LV is 1/5 = 0.2.
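The two calculations of S05 can be sketched as below. The helper names are hypothetical, `next_prob` stands for any function that returns the trained model's probability of an operation given the preceding operations, and the threshold of 0.1 is the value the embodiment gives in S07.

```python
def to_numeric_sequence(opr_seq, next_prob):
    """Replace each operation with its probability given the operations before it."""
    return [next_prob(opr_seq[:i], opr_seq[i]) for i in range(len(opr_seq))]

def abnormal_ratio(probabilities, threshold=0.1):
    """Share of positions whose probability falls below the threshold (ABNORMAL_LV)."""
    if not probabilities:
        return 0.0
    below = sum(1 for p in probabilities if p < threshold)
    return below / len(probabilities)

# Values from the "bbaaa" example in the text.
values = [0.523, 0.617, 0.038, 0.187, 0.191]
abnormal_lv = abnormal_ratio(values)
print(abnormal_lv)  # 0.2

# Toy stand-in for a trained model: every operation gets probability 0.5.
toy = to_numeric_sequence("ab", lambda history, symbol: 0.5)
```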
S06, because the length of the operation behavior sequence differs between sessions and users, the numerical sequence generated in S05 is cut to the preset width (9), and sequences shorter than the preset width are padded on the right with the value -1, so that every record has the same length, that is, every user has the same number of features.
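One possible reading of S06 is sketched below: the numerical sequence is split into consecutive records of the preset width, and the last record is padded with -1. The text does not say whether longer sequences are chunked or truncated, so the chunking here is an assumption, and `cut_and_pad` is a hypothetical helper; taking only the first returned record corresponds to the truncation reading.

```python
def cut_and_pad(values, width=9, pad=-1):
    """Split `values` into consecutive records of `width`, right-padding the last one."""
    records = []
    # max(..., 1) makes an empty sequence yield one fully padded record.
    for start in range(0, max(len(values), 1), width):
        record = list(values[start:start + width])
        record += [pad] * (width - len(record))
        records.append(record)
    return records

short = cut_and_pad([0.523, 0.617, 0.038, 0.187, 0.191])
long = cut_and_pad(list(range(11)))
print(short)
print(long)
```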
S07, the isolation forest (iForest) algorithm is a non-parametric, unsupervised algorithm: no data model needs to be assumed, no labeled data is needed for training, and large-scale data can be processed quickly. During model construction, the isolation forest uses a binary tree to cut the data space with a random hyperplane; each cut produces two subspaces, the left and right children. Each subspace is then cut with another random hyperplane, and the process repeats until no subspace can be cut further. Intuitively, dense clusters need many cuts before cutting stops, whereas points in low-density regions end up alone in a subspace early. The anomaly score of each point is finally obtained from the path length from its leaf node to the root node; points with consistent behavior are cut into the same subspace, so their paths to the root node are the same, that is, their anomaly scores are the same. The equal-length numerical sequence from S06 and the proportion ABNORMAL_LV of values below the preset probability threshold (0.1) from S05 are taken as the feature input for each user, the anomaly score of each record is calculated with the isolation forest algorithm, and the abnormal users, that is, the records with a label of -1, are output, thereby realizing anomaly detection of user behavior sequences.
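The isolation idea described above can be sketched in pure Python. This is a teaching sketch and not the patent's implementation: it uses axis-parallel random splits (the usual form of the algorithm) rather than general hyperplanes, the tree count, sample size, and `contamination` share are assumed values, and the feature vectors are invented. In practice a library implementation such as scikit-learn's IsolationForest, whose predict method also returns -1 for outliers, would normally be used instead.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split `points` on a random feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    # Only features that still vary can separate the points.
    dims = [d for d in range(len(points[0]))
            if min(p[d] for p in points) != max(p[d] for p in points)]
    if not dims:
        return {"size": len(points)}
    dim = rng.choice(dims)
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(tree, point, depth=0):
    """Depth at which `point` lands, plus the expected depth of the unbuilt subtree."""
    if "size" in tree:
        return depth + c_factor(tree["size"])
    branch = "left" if point[tree["dim"]] < tree["split"] else "right"
    return path_length(tree[branch], point, depth + 1)

def anomaly_scores(data, n_trees=100, sample_size=256, seed=0):
    """Scores lie between 0 and 1; higher means the record is isolated sooner."""
    if len(data) < 2:
        raise ValueError("need at least two records")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(sample_size)
    return [2 ** (-sum(path_length(t, x) for t in trees) / (n_trees * norm))
            for x in data]

def label_records(scores, contamination=0.05):
    """Label the highest-scoring share of records -1 (abnormal) and the rest 1."""
    k = max(1, int(len(scores) * contamination))
    cutoff = sorted(scores, reverse=True)[k - 1]
    return [-1 if s >= cutoff else 1 for s in scores]

# Hypothetical features: 9 probability values plus ABNORMAL_LV per record.
gen = random.Random(1)
normal = [[gen.uniform(0.4, 0.6) for _ in range(9)] + [0.0] for _ in range(20)]
outlier = [0.01] * 9 + [1.0]  # consistently improbable operations
data = normal + [outlier]

scores = anomaly_scores(data)
labels = label_records(scores)
print(labels)
```

Because the last record differs from the cluster on every feature, it is separated after very few cuts and receives the highest score in this example.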
Based on the above detection method, this embodiment further provides a behavior sequence anomaly detection system based on an unsupervised algorithm, including:
A data acquisition module, used for collecting the operation log of the enterprise web system and standardizing it into a web user operation table T_opr, which contains the following fields: user unique identifier (USER_ID), user IP address (IP_ADDR), operation time (OPR_DATE), operation type (OPERATION_TYPE), and so on. Each operation type is mapped to a code in a preset single-character code set (a, b, c, d, …); if there are many operation types, the code set can be extended with, for example, common Chinese characters. For example, the operation type 'whole_page_load' (page loading) corresponds to code 'a', the operation type 'submit' (submitting) corresponds to code 'b', and the operation type 'down' (downloading) corresponds to code 'c'. Each operation type thus maps to a short code (SHORT_OPR_TYPE), which reduces memory usage and simplifies processing.
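The mapping from operation types to single-character codes could be sketched as follows. The function name `build_code_map` and the choice to assign codes in first-seen order are assumptions; the patent only states that a preset code set is used.

```python
import string

def build_code_map(operation_types, alphabet=string.ascii_lowercase):
    """Assign one single-character code per distinct operation type, in first-seen order."""
    code_map = {}
    for op in operation_types:
        if op not in code_map:
            if len(code_map) >= len(alphabet):
                raise ValueError("code set exhausted; extend the alphabet")
            code_map[op] = alphabet[len(code_map)]
    return code_map

operations = ["whole_page_load", "submit", "down", "submit", "whole_page_load"]
code_map = build_code_map(operations)
short_codes = [code_map[op] for op in operations]  # the SHORT_OPR_TYPE column
print(code_map)
print(short_codes)
```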
A first data processing module, used for taking the data of a preset period (7 days) from the processed web user operation table T_opr. In the present invention the data contain no field resembling a "session"; if the original data already contain such a field, this step can be omitted. The records of each USER_ID are sorted in ascending order of operation time (OPR_DATE) to form a first sequence rd1 (1, 2, 3, …, n, …). Based on this sequence, the interval between each pair of consecutive operations of each user (USER_ID) is calculated, that is, the operation time at position n+1 minus the operation time at position n, which generates the operation time interval field OPR_DUR. The initial session value of every user is preset to 1. For each record, it is judged whether the user's operation time interval is smaller than a preset threshold (20 minutes): if so, the record belongs to the current session; if not, the record belongs to the current session plus 1. Proceeding in this way assigns every record to a session and generates a third column, the operation session identifier session.
Regarding the session: in a web application, the web server checks whether the user already has a session. If so, the existing session continues to be used; if not, a session is created for the user. When the session expires or is abandoned, the server terminates it. The invention uses this idea to construct a field resembling a "session".
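The session assignment rule can be sketched as below. The tuple layout, the function name `assign_sessions`, and the sample timestamps are assumptions for illustration; the 20-minute threshold and the starting session value of 1 follow the text.

```python
from datetime import datetime, timedelta

def assign_sessions(records, gap=timedelta(minutes=20)):
    """records: (user_id, opr_date) pairs. Returns (user_id, opr_date, session) tuples."""
    result = []
    last_seen = {}  # user_id -> (previous opr_date, current session)
    for user_id, opr_date in sorted(records):
        if user_id not in last_seen:
            session = 1  # every user starts at session 1
        else:
            previous, session = last_seen[user_id]
            # An interval smaller than the gap keeps the current session.
            if opr_date - previous >= gap:
                session += 1
        last_seen[user_id] = (opr_date, session)
        result.append((user_id, opr_date, session))
    return result

day = datetime(2021, 1, 4)  # hypothetical date
records = [
    ("a151", day.replace(hour=9, minute=0)),
    ("a151", day.replace(hour=9, minute=5)),
    ("a151", day.replace(hour=9, minute=40)),
    ("a151", day.replace(hour=9, minute=45)),
    ("a832", day.replace(hour=10, minute=0)),
]
sessions = [session for _, _, session in assign_sessions(records)]
print(sessions)  # [1, 1, 2, 2, 1]
```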
A second data processing module, used for merging the short code column (SHORT_OPR_TYPE) corresponding to the interface in chronological order, grouped by the user (USER_ID) and session fields, to generate a behavior sequence table T_opr_seq, which contains the following fields: user unique identifier (USER_ID), session, and behavior sequence (OPR_SEQ). The content of the behavior sequence field is the concatenation of all operations in each session of each user. For example, if user "a 151" performs operations a, b, c, b, and c in order during session 15, the behavior sequence (OPR_SEQ) is abcbc.
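A minimal sketch of this grouping step follows. The row layout and the name `build_sequences` are assumptions, and integer timestamps stand in for OPR_DATE.

```python
from collections import defaultdict

def build_sequences(rows):
    """rows: (user_id, session, opr_date, short_code). Returns {(user_id, session): OPR_SEQ}."""
    grouped = defaultdict(list)
    # Sort by user, session, then time so codes are joined chronologically.
    for user_id, session, _, code in sorted(rows, key=lambda r: (r[0], r[1], r[2])):
        grouped[(user_id, session)].append(code)
    return {key: "".join(codes) for key, codes in grouped.items()}

rows = [
    ("a 151", 15, 3, "c"),
    ("a 151", 15, 1, "a"),
    ("a 151", 15, 2, "b"),
    ("a 151", 15, 5, "c"),
    ("a 151", 15, 4, "b"),
    ("a 151", 16, 9, "a"),
]
sequences = build_sequences(rows)
print(sequences[("a 151", 15)])  # abcbc
```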
A model training module, used for training a probabilistic suffix tree model of a preset order (order 3) on the behavior sequence (OPR_SEQ) field of the behavior sequence table T_opr_seq generated in S03.
The probabilistic suffix tree is a compact representation of a variable-order Markov chain model that uses a suffix tree as its index structure. Once the subsequences form a probabilistic suffix tree, the probability of the next action can be predicted by examining nodes near the root of the tree. In short, the probabilistic suffix tree is a suffix tree built from subsequences that have predictive power. Each node in the tree represents an element; when a node is generated, the full probability distribution of the next node is counted; and each edge represents a path from the root node to the current node, that is, a subsequence inserted into the tree. The model is trained as follows:
1. Preset the probabilistic suffix tree depth L (3);
2. Initialize a root node (root) whose probability vector holds the probability of each symbol appearing in the sequences, and take all symbols as the candidate child node set;
3. For each candidate child node, calculate the probability of it appearing in the subsequences, and take all symbols as a new candidate child node set;
4. Recurse until the tree depth on the current branch reaches the preset depth L or the candidate child node set is empty.
FIG. 2 shows a probabilistic suffix tree with a depth of 3.
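The training steps above can be approximated by counting, for every context of length 0 to L, how often each symbol follows it. The sketch below is a simplification: it stores the tree as a flat dictionary keyed by context string, applies no pruning or smoothing, and the name `train_pst` is hypothetical.

```python
from collections import Counter, defaultdict

def train_pst(sequences, max_depth=3):
    """Return {context: {next_symbol: probability}} for contexts of length 0..max_depth."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, symbol in enumerate(seq):
            # The empty context "" plays the role of the root node.
            for k in range(min(max_depth, i) + 1):
                counts[seq[i - k:i]][symbol] += 1
    model = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        model[context] = {s: n / total for s, n in counter.items()}
    return model

model = train_pst(["abcbc", "abab"], max_depth=3)
print(model[""])   # symbol frequencies at the root
print(model["a"])  # distribution of the symbol that follows "a"
```

In this toy corpus "a" is always followed by "b", so the node for context "a" predicts "b" with probability 1.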
A ratio calculation module, used for processing the next day's operation log data of the web system through the data acquisition module, the first data processing module, and the second data processing module to obtain the next day's user operation behavior sequence table T_opr_seq_test, and for converting the behavior sequence (OPR_SEQ) field of the T_opr_seq_test table into a numerical sequence with the probabilistic suffix tree model trained in S04, as shown in FIG. 3.
Meanwhile, if the operation sequence of the user "a 832" is "bbaaa", the corresponding numerical sequence is [0.523, 0.617, 0.038, 0.187, 0.191]. Only one value (0.038) is below the preset probability threshold, so the ratio ABNORMAL_LV is 1/5 = 0.2.
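Converting a behavior sequence into a numerical sequence can be sketched as a longest-suffix lookup: for each position, use the longest preceding context that the model knows. The toy model values, the function name `score_sequence`, and the probability assigned to unseen symbols are all assumptions for illustration and do not reproduce the numbers in the example above.

```python
def score_sequence(model, seq, max_depth=3, unseen=0.0):
    """Probability of each symbol given the longest known preceding context."""
    probabilities = []
    for i, symbol in enumerate(seq):
        context = seq[max(0, i - max_depth):i]
        # Drop the oldest symbol until the context exists in the model.
        while context and context not in model:
            context = context[1:]
        distribution = model.get(context, {})
        probabilities.append(distribution.get(symbol, unseen))
    return probabilities

# Hypothetical model in the {context: {symbol: probability}} layout.
toy_model = {
    "": {"a": 0.5, "b": 0.5},
    "a": {"a": 0.2, "b": 0.8},
    "b": {"a": 0.7, "b": 0.3},
    "ab": {"a": 0.9, "b": 0.1},
}
print(score_sequence(toy_model, "abab"))  # [0.5, 0.8, 0.9, 0.8]
```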
A numerical sequence cutting module which, because the length of the operation behavior sequence differs across sessions and users, cuts the numerical sequence generated in S05 to the preset width (9) and pads sequences shorter than the preset width on the right with the value -1, so that every record has the same length, that is, every user has the same number of features.
An anomaly detection module, which uses the isolation forest (iForest) algorithm. The isolation forest is a non-parametric, unsupervised algorithm: it does not need to assume a data model, does not require labeled training data, and can process large-scale data quickly. During model construction, the isolation forest uses binary trees to cut the data space with random hyperplanes. One cut produces two subspaces (the left and right children); each subspace is then cut again with a random hyperplane, and the process repeats until no subspace can be cut further. Intuitively, high-density clusters need many cuts before cutting stops, whereas low-density points are isolated in their own subspace early. The anomaly score of each point is then obtained from the path length between its leaf node and the root node, so that points with consistent behavior fall into the same subspace, have the same path to the root node, and therefore receive the same anomaly score. The equal-length data sequence produced in S06 and the ratio ABNORMAL_LV of values below the preset probability threshold (0.1) from S05 are taken as the feature input of each user; the anomaly score of each record is calculated with the isolation forest algorithm, and the abnormal users, that is, the records whose label is -1, are output, thereby achieving anomaly detection of user behavior sequences.
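The feature input described here combines the equal-length sequence with the ratio ABNORMAL_LV. A minimal sketch of assembling one feature row is shown below; the name `build_features`, the position of the ratio as the last feature, and the choice to compute the ratio before cutting are assumptions.

```python
def build_features(prob_seq, width=9, threshold=0.1, fill=-1):
    """One feature row: `width` padded probabilities followed by ABNORMAL_LV."""
    clipped = list(prob_seq[:width])
    padded = clipped + [fill] * (width - len(clipped))
    if prob_seq:
        ratio = sum(1 for p in prob_seq if p < threshold) / len(prob_seq)
    else:
        ratio = 0.0  # assumption: an empty sequence has no abnormal steps
    return padded + [ratio]

features = build_features([0.523, 0.617, 0.038, 0.187, 0.191])
print(features)  # nine sequence features followed by the ratio 0.2
```

Rows built this way all have the same length (10 here), which is what a tree-based detector such as the isolation forest needs as input.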
The present embodiment also provides a processing device, including at least one processor, and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, which when called by the processor are capable of performing the methods described above.
The present embodiments also provide a computer-readable storage medium storing computer instructions that cause a computer to perform the method according to the above claims.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A behavior sequence anomaly detection method based on an unsupervised algorithm, characterized by comprising the following steps:
S01, collecting an operation log of the enterprise web system, and processing it to obtain a first operation table T1_opr of the web user;
S02, based on the web user operation table T_opr, taking data of a preset period and processing the data to obtain a second operation table T2_opr; the second operation table T2_opr includes at least a field USER_ID and a session value derived from the time interval between two adjacent operations of the user;
S03, merging the short code column corresponding to the interface in chronological order according to the USER_ID and session fields to generate a behavior sequence table T_opr_seq;
S04, training a probabilistic suffix tree model of a preset order on the behavior sequence field of the behavior sequence table T_opr_seq to obtain a target model;
S05, processing the next day's operation log data of the web system according to steps S01-S03 to obtain the next day's user operation behavior sequence table T_opr_seq_test, inputting the behavior sequence field of the behavior sequence table T_opr_seq_test into the target model trained in S04 to convert it into a numerical sequence, and counting the ratio ABNORMAL_LV of values in the numerical sequence that are below a preset probability threshold;
S06, cutting the numerical sequence obtained in step S05 according to a preset width, and padding sequences shorter than the preset width on the right with the value -1, so that every record has the same length, that is, every user has the same number of features;
S07, taking the equal-length data sequence processed in step S06 and the ratio ABNORMAL_LV below the preset probability threshold from step S05 as the feature input of each user, calculating the anomaly score of each record with an isolation forest algorithm, and outputting the abnormal users, that is, the records whose label is -1, thereby achieving anomaly detection of the user behavior sequence.
2. The unsupervised algorithm-based behavior sequence anomaly detection method according to claim 1, characterized in that: the first operation table T1_opr in step S01 includes at least the following fields: USER_ID, user IP address IP_ADDR, operation time OPR_DATE, and operation type OPERATION_TYPE.
3. The unsupervised algorithm-based behavior sequence anomaly detection method according to claim 1, characterized in that: the session value in step S02 is calculated as follows: the records of each USER_ID are sorted in ascending order of operation time OPR_DATE to form a first sequence rd1 (1, 2, 3, …, n, …); based on the first sequence rd1, the interval between two consecutive operations of each USER_ID is calculated, that is, the operation time at position n+1 minus the operation time at position n, to generate the operation time interval field OPR_DUR; the initial session value of every user is preset to 1, and it is judged whether the user's operation time interval is smaller than a preset threshold: if so, the record belongs to the current session; if not, the record belongs to the current session plus 1; proceeding in this way assigns every record to a session, thereby generating a third column, the operation session identifier session.
4. The unsupervised algorithm-based behavior sequence anomaly detection method according to claim 1, characterized in that: in step S03, the behavior sequence table T_opr_seq includes the fields USER_ID, session, and behavior sequence OPR_SEQ, where the content of the behavior sequence OPR_SEQ field is the concatenation of all operations in each session of each user.
5. A behavior sequence anomaly detection system based on an unsupervised algorithm, characterized by comprising:
a data acquisition module, used for collecting the operation log of the enterprise web system and processing it to obtain a first operation table T1_opr of the web user;
a first data processing module, used for taking data of a preset period based on the web user operation table T_opr and processing the data to obtain a second operation table T2_opr; the second operation table T2_opr includes at least a field USER_ID and a session value derived from the time interval between two adjacent operations of the user;
a second data processing module, used for merging the short code column corresponding to the interface in chronological order according to the USER_ID and session fields to generate a behavior sequence table T_opr_seq;
a model training module, used for training a probabilistic suffix tree model of a preset order on the behavior sequence field of the behavior sequence table T_opr_seq to obtain a target model;
a ratio calculation module, used for processing the next day's operation log data of the web system through the data acquisition module, the first data processing module, and the second data processing module to obtain the next day's user operation behavior sequence table T_opr_seq_test, inputting the behavior sequence field of the behavior sequence table T_opr_seq_test into the trained target model to convert it into a numerical sequence, and counting the ratio ABNORMAL_LV of values in the numerical sequence that are below a preset probability threshold;
a numerical sequence cutting module, used for cutting the numerical sequence obtained by the ratio calculation module according to a preset width and padding sequences shorter than the preset width on the right with the value -1, so that every record has the same length, that is, every user has the same number of features;
an anomaly detection module, used for taking the equal-length data sequence processed by the numerical sequence cutting module and the ratio ABNORMAL_LV below the preset probability threshold from the ratio calculation module as the feature input of each user, calculating the anomaly score of each record with an isolation forest algorithm, and outputting the abnormal users, that is, the records whose label is -1, thereby achieving anomaly detection of the user behavior sequence.
6. The unsupervised algorithm-based behavior sequence anomaly detection system according to claim 5, wherein: the first operation table T1_opr in the data acquisition module includes at least the following fields: USER_ID, user IP address IP_ADDR, operation time OPR_DATE, and operation type OPERATION_TYPE.
7. The unsupervised algorithm-based behavior sequence anomaly detection system according to claim 5, wherein: the session value in the first data processing module is calculated as follows: the records of each USER_ID are sorted in ascending order of operation time OPR_DATE to form a first sequence rd1 (1, 2, 3, …, n, …); based on the first sequence rd1, the interval between two consecutive operations of each USER_ID is calculated, that is, the operation time at position n+1 minus the operation time at position n, to generate the operation time interval field OPR_DUR; the initial session value of every user is preset to 1, and it is judged whether the user's operation time interval is smaller than a preset threshold: if so, the record belongs to the current session; if not, the record belongs to the current session plus 1; proceeding in this way assigns every record to a session, thereby generating a third column, the operation session identifier session.
8. The unsupervised algorithm-based behavior sequence anomaly detection system according to claim 5, wherein: the behavior sequence table T_opr_seq in the second data processing module includes the fields USER_ID, session, and behavior sequence OPR_SEQ, where the content of the behavior sequence OPR_SEQ field is the concatenation of all operations in each session of each user.
9. A processing device comprising at least one processor and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 4.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 4.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011589236.5A CN112738088B (en) 2020-12-28 2020-12-28 Behavior sequence anomaly detection method and system based on unsupervised algorithm

Publications (2)

Publication Number Publication Date
CN112738088A true CN112738088A (en) 2021-04-30
CN112738088B CN112738088B (en) 2023-03-21

Family

ID=75607372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011589236.5A Active CN112738088B (en) 2020-12-28 2020-12-28 Behavior sequence anomaly detection method and system based on unsupervised algorithm

Country Status (1)

Country Link
CN (1) CN112738088B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569949A (en) * 2021-07-28 2021-10-29 广州博冠信息科技有限公司 Abnormal user identification method and device, electronic equipment and storage medium
CN113609933A (en) * 2021-07-21 2021-11-05 广州大学 Fault detection method, system, device and storage medium based on suffix tree
CN113934616A (en) * 2021-12-16 2022-01-14 深圳市活力天汇科技股份有限公司 Method for judging abnormal user based on user operation time sequence
CN116070206A (en) * 2023-03-28 2023-05-05 上海观安信息技术股份有限公司 Abnormal behavior detection method, system, electronic equipment and storage medium

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9516053B1 (en) * 2015-08-31 2016-12-06 Splunk Inc. Network security threat detection by user/user-entity behavioral analysis
US20170132523A1 (en) * 2015-11-09 2017-05-11 Nec Laboratories America, Inc. Periodicity Analysis on Heterogeneous Logs
CN108038049A (en) * 2017-12-13 2018-05-15 西安电子科技大学 Real-time logs control system and control method, cloud computing system and server
CN108154029A (en) * 2017-10-25 2018-06-12 上海观安信息技术股份有限公司 Intrusion detection method, electronic equipment and computer storage media
CN108985632A (en) * 2018-07-16 2018-12-11 国网上海市电力公司 A kind of electricity consumption data abnormality detection model based on isolated forest algorithm
CN109829733A (en) * 2019-01-31 2019-05-31 重庆大学 A kind of false comment detection system and method based on Shopping Behaviors sequence data
CN109889538A (en) * 2019-03-20 2019-06-14 中国工商银行股份有限公司 User's anomaly detection method and system
US20190243743A1 (en) * 2018-02-07 2019-08-08 Apple Inc. Unsupervised anomaly detection
CN110334488A (en) * 2019-06-14 2019-10-15 北京大学 User authentication password security appraisal procedure and device based on Random Forest model
CN110347724A (en) * 2019-07-12 2019-10-18 深圳众赢维融科技有限公司 Abnormal behaviour recognition methods, device, electronic equipment and medium
CN110570244A (en) * 2019-09-04 2019-12-13 深圳创新奇智科技有限公司 hot-selling commodity construction method and system based on abnormal user identification
CN111275547A (en) * 2020-03-19 2020-06-12 重庆富民银行股份有限公司 Wind control system and method based on isolated forest
CN111814436A (en) * 2020-07-27 2020-10-23 上海观安信息技术股份有限公司 User behavior sequence detection method and system based on mutual information and entropy
WO2020248291A1 (en) * 2019-06-11 2020-12-17 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for anomaly detection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
朱佳俊等: "基于用户画像的异常行为检测", 《通信技术》 *
郑天宇等: "基于变长马尔科夫模型的用户购物行为分析", 《现代计算机(专业版)》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609933A (en) * 2021-07-21 2021-11-05 广州大学 Fault detection method, system, device and storage medium based on suffix tree
CN113609933B (en) * 2021-07-21 2022-09-16 广州大学 Fault detection method, system, device and storage medium based on suffix tree
CN113569949A (en) * 2021-07-28 2021-10-29 广州博冠信息科技有限公司 Abnormal user identification method and device, electronic equipment and storage medium
CN113934616A (en) * 2021-12-16 2022-01-14 深圳市活力天汇科技股份有限公司 Method for judging abnormal user based on user operation time sequence
CN113934616B (en) * 2021-12-16 2022-03-18 深圳市活力天汇科技股份有限公司 Method for judging abnormal user based on user operation time sequence
CN116070206A (en) * 2023-03-28 2023-05-05 上海观安信息技术股份有限公司 Abnormal behavior detection method, system, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112738088B (en) 2023-03-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant