CN111538642A - Abnormal behavior detection method and device, electronic equipment and storage medium - Google Patents

Abnormal behavior detection method and device, electronic equipment and storage medium

Info

Publication number
CN111538642A
Authority
CN
China
Prior art keywords
log
category
vector
sample
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010625852.5A
Other languages
Chinese (zh)
Other versions
CN111538642B (en
Inventor
王滨
张峰
王星
李志强
万里
徐文渊
殷丽华
李超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202010625852.5A
Publication of CN111538642A
Application granted
Publication of CN111538642B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention provide a method and a device for detecting abnormal behavior, electronic equipment, and a storage medium. The method comprises the following steps: acquiring a log to be detected; dividing adjacent log records whose generation times are separated by no more than a preset threshold into the same data slice, to obtain a plurality of data slices; generating a log vector according to the log types of the log records included in each data slice; and determining whether the log records are generated by abnormal behavior based on the distance between each log vector and each category center point, wherein the category center points are obtained by clustering log vector samples that correspond to data slice samples obtained by dividing a pre-acquired log sample. Because adjacent log records whose generation times are separated by no more than the preset threshold are placed in the same data slice, log records generated by one continuous physical behavior are not split across different data slices, and the log records corresponding to abnormal behavior can therefore be determined accurately.

Description

Abnormal behavior detection method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of device abnormal behavior detection technologies, and in particular, to a method and an apparatus for detecting an abnormal behavior, an electronic device, and a storage medium.
Background
Detecting abnormal behavior is particularly important because abnormal behavior threatens the security of network devices. The logs generated by physical behaviors follow certain regularities and conform to certain behavior patterns, so abnormal behavior can be identified by analyzing and processing the logs. The log-based abnormal behavior detection in wide use today detects log data that does not conform to a behavior pattern and then determines that the behavior corresponding to that log data is abnormal.
The current log-based abnormal behavior detection works as follows: log data is first obtained and then sliced by fixed time or by a fixed number of logs; the number of log records of each log type contained in each data slice is counted; log data that does not conform to a behavior pattern is then mined from these per-type counts; and abnormal behavior detection is performed based on that log data.
In this detection method, because the data is sliced by fixed time or by a fixed number of logs, the log records generated by one continuous physical behavior can be divided into different data slices. The physical behavior information carried by the log data is thereby lost, and the log records in a given data slice are not the log records generated by one coherent physical behavior, so the abnormal behavior detection result is inaccurate.
Disclosure of Invention
An embodiment of the present invention provides a method and an apparatus for detecting an abnormal behavior, an electronic device, and a storage medium, so as to solve the problem that a detection result of a current abnormal behavior detection method is inaccurate. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present application provides a method for detecting an abnormal behavior, where the method includes:
acquiring a log to be detected, wherein the log to be detected comprises a plurality of log records, and each log record comprises generation time and a log type;
dividing adjacent log records of which the interval of the generation time is not more than a preset threshold value into the same data slice to obtain a plurality of data slices, wherein the log records included in each data slice correspond to one physical behavior;
generating a log vector corresponding to each data slice according to the log type of the log records included in each data slice, wherein the log vector is related to the log type;
determining whether the log records included in each data slice are log records generated by abnormal behavior based on the distance between each log vector and each category center point obtained by pre-clustering, wherein the category center point is the center point of a category obtained by clustering log vector samples, the log vector samples corresponding to data slice samples obtained by dividing a pre-acquired log sample.
Optionally, the step of determining whether the log records included in each data slice are log records generated by abnormal behavior based on the distance between each log vector and each category center point obtained by pre-clustering includes:
calculating the distance between each log vector and each category center point obtained by pre-clustering, wherein the distance between a log vector and a category center point represents the probability that the log vector belongs to the category corresponding to that category center point;
for each data slice, determining the category corresponding to the minimum of the distances as the target category of the data slice;
if the target category is an abnormal category, determining that the log records included in the data slice corresponding to the target category are log records generated by abnormal behavior; or,
for each data slice, determining whether the log records included in the data slice are log records generated by abnormal behavior based on the corresponding minimum distance and the distances between the log vector samples included in the target category and the category center point, wherein the distance between a log vector sample and the category center point represents the probability that the log vector sample belongs to the category.
Optionally, the step of determining, for each data slice, whether the log records included in the data slice are log records generated by abnormal behavior based on the corresponding minimum distance and the distances between the log vector samples included in the target category and the category center point includes:
for each data slice, if the minimum distance corresponding to the data slice is greater than the maximum distance between the log vector samples included in the target category corresponding to the data slice and the category center point, determining that the log records included in the data slice are log records generated by abnormal behavior.
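The maximum-distance rule described above can be sketched as follows. This is an illustrative sketch only: the function name `exceeds_category_radius`, the use of Euclidean distance, and the sample values are assumptions, since the text does not fix a distance metric.

```python
import math

def exceeds_category_radius(log_vector, center, category_samples):
    """Return True when the vector lies farther from the category center
    than every log vector sample of that (target) category."""
    # Largest distance between any training sample of the category and its center.
    max_sample_distance = max(math.dist(s, center) for s in category_samples)
    return math.dist(log_vector, center) > max_sample_distance

# Hypothetical target category: its center and the samples assigned to it.
center = [2.0, 1.0]
samples = [[2.0, 1.0], [3.0, 1.0], [2.0, 2.0]]  # farthest sample is 1.0 from the center

inside = exceeds_category_radius([2.5, 1.0], center, samples)   # distance 0.5
outside = exceeds_category_radius([6.0, 4.0], center, samples)  # distance 5.0
print(inside, outside)
```

Here the center passed in is assumed to be the one closest to the log vector, so the distance computed inside the function is the minimum distance referred to in the text.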
Optionally, the step of generating a log vector corresponding to each data slice according to a log type of a log record included in each data slice includes:
counting the number of log records of each log type in each data slice;
and generating a log vector corresponding to each data slice based on the number.
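A minimal sketch of the counting step follows, assuming a fixed and hypothetical ordering of log types (`LOG_TYPES`) and records represented as dictionaries with an `action` key; neither detail is specified in the text.

```python
from collections import Counter

# Hypothetical fixed ordering of log types. The same ordering would have to be
# used for the log vector samples and for the log vectors to be detected.
LOG_TYPES = ["login", "playback", "logout"]

def log_vector(slice_records, log_types=LOG_TYPES):
    """Count the log records of each log type in one data slice."""
    counts = Counter(record["action"] for record in slice_records)
    # Log types absent from the slice contribute a count of 0.
    return [counts.get(log_type, 0) for log_type in log_types]

data_slice = [
    {"action": "login"},
    {"action": "playback"},
    {"action": "playback"},
    {"action": "logout"},
]
vector = log_vector(data_slice)
print(vector)  # [1, 2, 1]
```

In this sketch, log types that are not listed in `LOG_TYPES` are silently ignored; a real implementation would need to decide how to treat them.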
Optionally, before the step of determining whether the log record included in each data slice is a log record generated by abnormal behavior based on the distance between each log vector and the center point of each category obtained by pre-clustering, the method further includes:
according to a preset standardization rule, carrying out standardization processing on each log vector to obtain a processed log vector;
the step of determining whether the log records included in each data slice are log records generated by abnormal behavior based on the distance between each log vector and each category center point obtained by pre-clustering comprises the following step:
determining whether the log records included in each data slice are log records generated by abnormal behavior based on the distance between each processed log vector and each category center point obtained by pre-clustering.
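The text leaves the "preset standardization rule" open. As one possible choice, the sketch below uses z-score standardization per vector dimension, with the mean and standard deviation taken from the log vector samples; the function names and the fallback of 1.0 for constant dimensions are assumptions.

```python
import statistics

def fit_standardizer(sample_vectors):
    """Compute the per-dimension mean and population standard deviation."""
    columns = list(zip(*sample_vectors))
    means = [statistics.fmean(column) for column in columns]
    # A constant dimension has standard deviation 0; fall back to 1.0 to avoid
    # division by zero (an assumption of this sketch).
    stds = [statistics.pstdev(column) or 1.0 for column in columns]
    return means, stds

def standardize(vector, means, stds):
    """Apply z-score standardization to one log vector."""
    return [(x - m) / s for x, m, s in zip(vector, means, stds)]

samples = [[1, 2, 1], [1, 4, 1], [1, 6, 1]]
means, stds = fit_standardizer(samples)
print(standardize([1, 4, 1], means, stds))  # [0.0, 0.0, 0.0]
```

Whatever rule is chosen, applying the same fitted parameters to both the log vector samples and the log vectors to be detected keeps their distances to the category center points comparable.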
Optionally, the determining manner of the category center point includes:
obtaining a log sample, wherein the log sample comprises a plurality of log record samples, and each log record sample comprises generation time and a log type;
dividing adjacent log record samples of which the interval of the generation time is not more than a preset threshold value into the same data slice to obtain a plurality of data slice samples, wherein the log record sample included in each data slice sample corresponds to one physical behavior;
generating a log vector sample corresponding to each data slice sample according to a log type of a log record sample included in each data slice sample, wherein the log vector sample is related to the log type;
and clustering the log vector samples, and determining the central point of each category as a category central point.
Optionally, the step of clustering the log vector samples and determining the central point of each category includes:
determining the number of categories of clustering processing through an elbow algorithm based on the log vector samples;
and clustering the log vector samples based on the number of the categories, and determining the central point of each category.
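The text names the elbow algorithm without giving details. One common heuristic, sketched below under that assumption, clusters the samples for several candidate category numbers k, records the within-cluster sum of squared errors (SSE) for each, and picks the k at which the SSE curve bends most sharply. The function name `elbow_k` and the SSE values are made up for illustration, and the candidate values of k are assumed to be consecutive.

```python
def elbow_k(sse_by_k):
    """Pick the k with the largest drop-off in SSE improvement.

    sse_by_k maps each candidate number of categories k to the SSE obtained
    by clustering the log vector samples into k categories.
    """
    ks = sorted(sse_by_k)
    best_k, best_bend = ks[0], float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        # Improvement gained by reaching k, minus improvement gained after k.
        bend = (sse_by_k[prev_k] - sse_by_k[k]) - (sse_by_k[k] - sse_by_k[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

sse = {1: 100.0, 2: 60.0, 3: 15.0, 4: 12.0, 5: 10.0}  # illustrative values only
chosen = elbow_k(sse)
print(chosen)  # 3
```

The chosen number of categories would then be passed to the clustering step, for example K-means, to obtain the category center points.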
Optionally, after the step of clustering the log vector samples and determining a center point of each category as a category center point, the method further includes:
counting the number of log vector samples corresponding to each category;
for each category, calculating the ratio of the number corresponding to the category to the total number of the log vector samples;
and if the ratio is smaller than a preset threshold value, determining that the category corresponding to the ratio is an abnormal category.
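The ratio test above can be sketched as follows, assuming the clustering step yields one category label per log vector sample. The function name `abnormal_categories` and the threshold of 0.05 are illustrative assumptions.

```python
from collections import Counter

def abnormal_categories(labels, ratio_threshold):
    """Return the categories whose share of all samples is below the threshold."""
    total = len(labels)
    counts = Counter(labels)
    return {category for category, n in counts.items() if n / total < ratio_threshold}

# Hypothetical clustering result: 100 samples spread over three categories.
labels = [0] * 60 + [1] * 37 + [2] * 3
rare = abnormal_categories(labels, ratio_threshold=0.05)
print(rare)  # {2}
```

The rule rests on the premise that abnormal behavior is rare, so a category holding only a small fraction of the samples is treated as abnormal.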
In a second aspect, an embodiment of the present application provides an apparatus for detecting abnormal behavior, where the apparatus includes:
the log acquiring module is used for acquiring a log to be detected, wherein the log to be detected comprises a plurality of log records, and each log record comprises generation time and a log type;
the data slice dividing module is used for dividing adjacent log records of which the interval of the generation time is not more than a preset threshold value into the same data slice to obtain a plurality of data slices, wherein the log record included in each data slice corresponds to one physical behavior;
the log vector generation module is used for generating a log vector corresponding to each data slice according to the log type of the log records included in each data slice, wherein the log vector is related to the log type;
an abnormal behavior detection module, configured to determine, based on the distance between each log vector and each category center point obtained by pre-clustering, whether the log records included in each data slice are log records generated by abnormal behavior, wherein the category center point is the center point of a category obtained by clustering log vector samples, the log vector samples corresponding to data slice samples obtained by dividing a pre-acquired log sample.
Optionally, the abnormal behavior detection module includes:
the distance calculation unit is used for calculating the distance between each log vector and each category central point obtained by pre-clustering, wherein the distance between each log vector and each category central point is used for representing the probability that each log vector belongs to the category corresponding to the category central point;
a target category determining unit, configured to determine, for each data slice, a category corresponding to a minimum distance in the distances, as a target category of the data slice;
the first abnormal behavior detection unit is used for determining that the log records included in the data slice corresponding to the target category are log records generated by abnormal behavior if the target category is an abnormal category; or, for each data slice, determining whether the log records included in the data slice are log records generated by abnormal behavior based on the corresponding minimum distance and the distances between the log vector samples included in the target category and the category center point, wherein the distance between a log vector sample and the category center point represents the probability that the log vector sample belongs to the category.
Optionally, the first abnormal behavior detection unit includes:
and the abnormal behavior detection subunit is configured to, for each data slice, determine that a log record included in the data slice is a log record generated by an abnormal behavior if the minimum distance corresponding to the data slice is greater than the maximum distance between the log vector sample included in the target category corresponding to the data slice and the category center point.
Optionally, the log vector generating module includes:
a number counting unit, configured to count the number of log records of each log type included in each data slice;
and the vector generating unit is used for generating a log vector corresponding to each data slice based on the number.
Optionally, the apparatus further comprises:
the normalization module is used for standardizing each log vector according to a preset standardization rule to obtain a processed log vector, before it is determined, based on the distance between each log vector and each category center point obtained by pre-clustering, whether the log records included in each data slice are log records generated by abnormal behavior;
the abnormal behavior detection module comprises:
the second abnormal behavior detection unit, which is used for determining whether the log records included in each data slice are log records generated by abnormal behavior based on the distance between each processed log vector and each category center point obtained by pre-clustering.
Optionally, the apparatus further includes a category center point determining module, configured to determine a category center point, where the category center point determining module includes:
the log sample acquiring unit is used for acquiring a log sample, wherein the log sample comprises a plurality of log record samples, and each log record sample comprises a generation time and a log type;
the data slice sample dividing unit is used for dividing adjacent log record samples of which the interval of the generation time is not more than a preset threshold value into the same data slice to obtain a plurality of data slice samples, wherein the log record sample included in each data slice sample corresponds to one physical behavior;
the log vector sample generating unit is used for generating a log vector sample corresponding to each data slice sample according to the log type of the log record samples included in each data slice sample, wherein the log vector sample is related to the log type;
and the clustering unit is used for clustering the log vector samples and determining the central point of each category as the category central point.
Optionally, the clustering unit includes:
a category number determination subunit, configured to determine, based on the log vector sample, a category number of clustering processing by an elbow algorithm;
and the clustering subunit is used for clustering the log vector samples based on the category number and determining the central point of each category.
Optionally, the apparatus further comprises:
the sample number counting module is used for counting the number of log vector samples corresponding to each category after the log vector samples have been clustered and the center point of each category has been determined;
an abnormal category determination module, configured to calculate, for each category, a ratio of a number corresponding to the category to a total number of the log vector samples; and if the ratio is smaller than a preset threshold value, determining that the category corresponding to the ratio is an abnormal category.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor adapted to perform the method steps of any of the above first aspects when executing a program stored in the memory.
In a fourth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method steps of any one of the above first aspects.
The embodiment of the invention has the following beneficial effects:
in the scheme provided by the embodiment of the present invention, an electronic device may obtain a log to be detected, where the log to be detected includes a plurality of log records and each log record includes a generation time and a log type. Adjacent log records whose generation times are separated by no more than a preset threshold are divided into the same data slice to obtain a plurality of data slices, where the log records included in each data slice correspond to one physical behavior. A log vector corresponding to each data slice is generated according to the log type of the log records included in that data slice, the log vector being related to the log type. Whether the log records included in each data slice are log records generated by abnormal behavior is then determined based on the distance between each log vector and each category center point obtained by pre-clustering, where the category center point is the center point of a category obtained by clustering log vector samples, the log vector samples corresponding to data slice samples obtained by dividing a pre-acquired log sample. Because adjacent log records whose generation times are separated by no more than the preset threshold are divided into the same data slice, log records generated by one continuous physical behavior are not divided into different data slices, and the log records included in each data slice are those generated by one coherent physical behavior. Log records that do not conform to a behavior pattern can therefore be mined accurately from the log vector corresponding to each data slice and the category center points obtained by pre-clustering, and the log records corresponding to abnormal behavior can be determined accurately.
Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of a method for detecting abnormal behavior according to an embodiment of the present invention;
FIG. 2 is a detailed flowchart of step S104 in the embodiment shown in FIG. 1;
FIG. 3 is a detailed flowchart of step S103 in the embodiment shown in FIG. 1;
FIG. 4 is a flowchart of the manner of determining the category center point based on the embodiment shown in FIG. 1;
FIG. 5 is a detailed flowchart of step S404 in the embodiment shown in FIG. 4;
FIG. 6 is a flowchart of the manner of determining the abnormal category based on the embodiment shown in FIG. 4;
FIG. 7 is a schematic structural diagram of an apparatus for detecting abnormal behavior according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of the abnormal behavior detection module 740 in the embodiment shown in FIG. 7;
FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to improve the accuracy of a detection result of abnormal behavior detection based on a log, embodiments of the present invention provide a method and an apparatus for detecting abnormal behavior, an electronic device, and a computer-readable storage medium. First, a method for detecting an abnormal behavior according to an embodiment of the present invention is described below.
The method for detecting abnormal behavior provided by the embodiment of the invention can be applied to any electronic equipment that needs to detect abnormal behavior, such as a computer or a processor. For convenience and clarity of description, the equipment that executes the method is hereinafter referred to as the electronic device.
As shown in fig. 1, a method for detecting abnormal behavior, the method comprising:
S101, acquiring a log to be detected;
wherein the log to be detected comprises a plurality of log records, and each log record comprises a generation time and a log type.
S102, dividing adjacent log records whose generation times are separated by no more than a preset threshold into the same data slice to obtain a plurality of data slices;
wherein the log records included in each data slice correspond to one physical behavior.
S103, generating a log vector corresponding to each data slice according to the log type of the log records included in each data slice;
wherein the log vector is related to the log type.
S104, determining whether the log records included in each data slice are log records generated by abnormal behavior based on the distance between each log vector and each category center point obtained by pre-clustering.
Wherein the category center point is the center point of a category obtained by clustering log vector samples, the log vector samples corresponding to data slice samples obtained by dividing a pre-acquired log sample.
As can be seen, in the scheme provided in the embodiment of the present invention, the electronic device may obtain a log to be detected, where the log to be detected includes a plurality of log records and each log record includes a generation time and a log type. Adjacent log records whose generation times are separated by no more than a preset threshold are divided into the same data slice to obtain a plurality of data slices, where the log records included in each data slice correspond to one physical behavior. A log vector corresponding to each data slice is generated according to the log type of the log records included in that data slice, the log vector being related to the log type. Whether the log records included in each data slice are log records generated by abnormal behavior is then determined based on the distance between each log vector and each category center point obtained by pre-clustering, where the category center point is the center point of a category obtained by clustering log vector samples, the log vector samples corresponding to data slice samples obtained by dividing a pre-acquired log sample. Because adjacent log records whose generation times are separated by no more than the preset threshold are divided into the same data slice, log records generated by one continuous physical behavior are not divided into different data slices, and the log records included in each data slice are those generated by one coherent physical behavior. Log records that do not conform to a behavior pattern can therefore be mined accurately from the log vector corresponding to each data slice and the category center points obtained by pre-clustering, and the log records corresponding to abnormal behavior can be determined accurately.
When abnormal behavior detection is required, the electronic device may obtain a log to be detected. The log to be detected may be a log generated by a security device, for example a log generated during operation of a Network Camera (IP Camera, IPC), a Digital Video Recorder (DVR), a Network Video Recorder (NVR), and the like. The log to be detected may also be a web log, a system log, and the like, which is not specifically limited herein.
The log to be detected may include a plurality of log records, each log record may include a log type, and may further include a log source and a generation time. The log source is a device generating the log, and may be an IP (Internet Protocol) address, a device identifier, and the like of the device generating the log.
The log type identifies the type of the physical behavior to which a log record corresponds. For example, for a log generated by a security device, a physical behavior is a physical operation executed while the security device is running, such as a remote login behavior, a behavior of starting local video recording, or a behavior of viewing device video recordings. For the behavior of viewing device video recordings, the log types of the corresponding log records may include login, playback, logout, and the like.
When the log to be detected is a log generated by a security device, the log source is the information in the source_ip field, the log type is the information in the action field, and the generation time is the information in the @timestamp field, so the electronic device can acquire the information in the source_ip, action, and @timestamp fields of each log record.
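As an illustration of reading these three fields, the sketch below assumes that each log record is serialized as one JSON object and that `@timestamp` holds an ISO 8601 string without a time zone suffix. Both are assumptions; the text names only the fields, not the storage format. The function name `parse_record` and the output keys are likewise hypothetical.

```python
import json
from datetime import datetime

def parse_record(line):
    """Extract the log source, log type, and generation time from one record."""
    raw = json.loads(line)
    return {
        "source": raw["source_ip"],                          # log source
        "log_type": raw["action"],                           # log type
        "time": datetime.fromisoformat(raw["@timestamp"]),   # generation time
    }

# Made-up example record.
line = '{"source_ip": "10.0.0.5", "action": "login", "@timestamp": "2020-07-01T08:00:00"}'
record = parse_record(line)
print(record["source"], record["log_type"], record["time"])
```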
After obtaining the log to be detected, the electronic device may execute step S102 and divide adjacent log records whose generation times are separated by no more than the preset threshold into the same data slice, so as to obtain a plurality of data slices. Because the generation times of the series of log records produced by one physical behavior are not far apart, log records separated by short time intervals are generally generated by the same physical behavior.
Therefore, in order to place the series of log records generated by one physical behavior in one data slice, the electronic device may divide adjacent log records whose generation times are separated by no more than the preset threshold into the same data slice, so that the log records included in each resulting data slice correspond to one physical behavior. The preset threshold may be set according to the general timing pattern of the log records that a physical behavior generates. For example, for a log generated by a security device, the preset threshold may be 10 minutes, 8 minutes, 12 minutes, and the like, which is not specifically limited herein.
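The gap-based division can be sketched as follows, assuming the records are already sorted by generation time and that each record is a dictionary with a `time` key (a hypothetical representation). The 10-minute threshold is one of the example values given in the text.

```python
from datetime import datetime, timedelta

def split_into_slices(records, max_gap):
    """Group time-sorted records; a new data slice starts whenever the gap to
    the previous record exceeds max_gap."""
    slices = []
    for record in records:
        if slices and record["time"] - slices[-1][-1]["time"] <= max_gap:
            slices[-1].append(record)   # same physical behavior continues
        else:
            slices.append([record])     # gap too large: start a new data slice
    return slices

t0 = datetime(2020, 7, 1, 8, 0)
records = [
    {"time": t0, "action": "login"},
    {"time": t0 + timedelta(minutes=3), "action": "playback"},
    {"time": t0 + timedelta(minutes=9), "action": "logout"},
    {"time": t0 + timedelta(minutes=45), "action": "login"},
]
slices = split_into_slices(records, max_gap=timedelta(minutes=10))
print([len(s) for s in slices])  # [3, 1]
```

Note that the comparison is between adjacent records, so a data slice may span more than the threshold in total as long as no single gap exceeds it.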
In one embodiment, the electronic device may sort the log records of the log to be detected by generation time. For example, at a given time (the symbol is an image in the source and is not reproduced here), the log records whose log source IP is S1 and whose generation time falls within a given interval (the interval endpoints are likewise images in the source) are obtained from the database and sorted in ascending order of generation time, that is, from earliest to latest.
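A minimal sketch of this selection and sorting step is given below. It works on an in-memory list rather than a database, and the function name `records_for_source`, the record layout, and the inclusive interval bounds are assumptions made for illustration.

```python
from datetime import datetime

def records_for_source(records, source, start, end):
    """Select one source's records within [start, end], earliest first."""
    selected = [
        r for r in records
        if r["source"] == source and start <= r["time"] <= end
    ]
    return sorted(selected, key=lambda r: r["time"])

records = [
    {"source": "S1", "time": datetime(2020, 7, 1, 8, 9), "action": "logout"},
    {"source": "S2", "time": datetime(2020, 7, 1, 8, 1), "action": "login"},
    {"source": "S1", "time": datetime(2020, 7, 1, 8, 0), "action": "login"},
    {"source": "S1", "time": datetime(2020, 7, 1, 8, 3), "action": "playback"},
]
ordered = records_for_source(
    records, "S1", datetime(2020, 7, 1, 0, 0), datetime(2020, 7, 1, 23, 59)
)
print([r["action"] for r in ordered])  # ['login', 'playback', 'logout']
```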
Next, the electronic device may generate a log vector corresponding to each data slice according to the log type of the log record included in each data slice, that is, execute step S103. The electronic device may determine, according to the number of log records of each log type included in each data slice, a log vector corresponding to each data slice, that is, the log vector is related to the log type.
After generating the log vector corresponding to each data slice, the electronic device may execute step S104, that is, determine whether the log record included in each data slice is the log record generated by the abnormal behavior based on the distance between each log vector and the central point of each category obtained by clustering in advance, thereby completing the detection of the abnormal behavior.
Here, a category center point is the center point of a category obtained by clustering the log vector samples that correspond to the data slice samples into which pre-acquired log samples are divided. The pre-acquired log samples are divided in the same way as the log to be detected, and the log vector samples corresponding to the data slice samples are generated in the same way as the log vectors corresponding to the data slices.
The category center point is thus the central position of the log vector samples of a category. The distance between a log vector sample and the corresponding center point can represent the probability that the log vector sample belongs to the category: the smaller the distance, the higher the probability.
Therefore, the distance between each log vector and the central point of each category can represent the probability that each log vector belongs to the category corresponding to the central point of the category, and whether each category is an abnormal category can be determined according to the number of log vector samples included in the category, so that the electronic device can determine whether the log record included in each data sheet is the log record generated by abnormal behaviors based on the distance between each log vector and the central point of each category obtained by clustering in advance.
With the abnormal behavior detection method provided by the embodiment of the invention, adjacent log records whose generation time interval is not greater than the preset threshold are divided into the same data slice. The log records generated by one continuous physical behavior are therefore not split across different data slices, and each data slice contains the log records generated by one coherent physical behavior. As a result, log records that do not conform to the behavior pattern can be accurately detected based on the log vector corresponding to each data slice and the category center points obtained by pre-clustering.
In addition, with the rapid development of the internet of things, the number of security devices in the internet of things is also rapidly increased, and the problems that the security devices are difficult to manage and maintain are accompanied, so that the detection of abnormal behaviors of the security devices is an important problem to be solved urgently. The security equipment logs have the problems of less field information and repeated log field information, and meanwhile, the number of log types of the security equipment logs is limited, so that accurate detection results cannot be obtained by adopting other existing abnormal behavior detection modes. Each log type of the security equipment log can correspond to one physical behavior, and the behavior mode of the security equipment is relatively fixed, so that behavior mode mining is facilitated. Therefore, for the logs of the security equipment, the method for detecting the abnormal behavior provided by the embodiment of the invention can be used for well mining the behavior pattern to obtain an accurate abnormal behavior detection result, and is favorable for the management and maintenance of the security equipment.
As an implementation manner of the embodiment of the present invention, as shown in fig. 2, the step of determining whether the log record included in each data slice is a log record generated by an abnormal behavior based on a distance between each log vector and a center point of each category obtained by clustering in advance may include:
S201, calculating the distance between each log vector and each category center point obtained by pre-clustering;
Because the distance between a log vector and a category center point can represent the probability that the log vector belongs to the corresponding category, the electronic device can calculate the distance between each log vector and each category center point obtained by pre-clustering.
In one embodiment, the euclidean distance between each log vector and the center point of each category obtained by pre-clustering may be used as the distance between each log vector and the center point of each category obtained by pre-clustering. Certainly, the distance between each log vector and the center point of each category obtained by pre-clustering may also be determined by adopting other calculation methods, for example, the distance may be a mahalanobis distance, a hamming distance, a manhattan distance, a cosine distance, and the like, which are all reasonable.
S202, for each data slice, determining the category corresponding to the minimum of the distances as the target category of the data slice;
Since a smaller distance means a higher probability that the data slice belongs to the corresponding category, the electronic device may, for each data slice, take the category corresponding to the minimum distance between the log vector of the data slice and the category center points as the target category of the data slice.
For example, the log to be detected is divided into Figure 204327DEST_PATH_IMAGE004 data slices, so that Figure 18699DEST_PATH_IMAGE005 log vectors can be generated, written as {Figure 841162DEST_PATH_IMAGE006}. The category center points are {Figure 544413DEST_PATH_IMAGE007}, Figure 743313DEST_PATH_IMAGE008 category center points in total. Then, for the log vector Figure 349875DEST_PATH_IMAGE009, the electronic device can calculate its distance to each category center point, obtaining Figure 343239DEST_PATH_IMAGE010 distances, namely {Figure 97568DEST_PATH_IMAGE011}. Likewise, Figure 772263DEST_PATH_IMAGE008 distances can be calculated for each of the other log vectors.
For the log vector Figure 295649DEST_PATH_IMAGE012, the electronic device may determine the minimum distance in {Figure 663176DEST_PATH_IMAGE011}. If the minimum distance is Figure 904801DEST_PATH_IMAGE013 and the category corresponding to Figure 179925DEST_PATH_IMAGE013 is category B, the electronic device can determine that the target category of the data slice corresponding to the log vector Figure 259614DEST_PATH_IMAGE012 is category B.
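Steps S201 and S202 can be sketched as follows, using the Euclidean distance named in the description. The category labels, the center coordinates, and the function names are hypothetical values chosen only to make the example concrete.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_category(vector, centers):
    """Return (target category, minimum distance) for one log vector."""
    distances = {label: euclidean(vector, center) for label, center in centers.items()}
    target = min(distances, key=distances.get)  # category with the smallest distance
    return target, distances[target]

# Hypothetical category center points for 5 log types.
centers = {
    "A": [1.0, 1.0, 0.0, 0.0, 1.0],
    "B": [2.0, 3.0, 4.0, 0.0, 1.0],
    "C": [4.0, 3.0, 1.0, 2.0, 2.0],
}
target, min_distance = nearest_category([2, 3, 3, 0, 1], centers)
```

Here the vector differs from center B in a single element, so B is selected as the target category with a distance of 1.0.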
S203, if the target type is an abnormal type, determining that the log record included in the data sheet corresponding to the target type is the log record generated by the abnormal behavior; or, for each data slice, determining whether the log record included in each data slice is the log record generated by abnormal behaviors based on the corresponding minimum distance and the distance between the log vector sample included in the target category and the category center point.
After determining the target category of each data slice, the electronic device can determine whether the log records included in each data slice are log records generated by abnormal behavior. In one embodiment, after the category center points are obtained by pre-clustering, it can be determined whether each clustered category is an abnormal category. The determination may be made, for example, according to whether the log samples included in the category are log records generated by abnormal behavior, which is not specifically limited herein. If the target category of a data slice is an abnormal category, the log records included in the data slice can be determined to be log records generated by abnormal behavior.
In another embodiment, for each data slice, the minimum distance between the corresponding log vector and the category center points obtained by pre-clustering indicates the highest probability with which the log vector belongs to any category, namely the target category. Likewise, the distance between each log vector sample included in the target category and the category center point indicates the probability that the log vector sample belongs to the target category. Comparing the minimum distance with these sample distances therefore shows whether the log vector plausibly belongs to the target category.
It can be seen that, in this embodiment, the electronic device may calculate a distance between each log vector and a central point of each category obtained by clustering in advance, determine, for each data slice, a category corresponding to a minimum distance in the distances, as a target category of the data slice, and if the target category is an abnormal category, determine that a log record included in the data slice corresponding to the target category is a log record generated by abnormal behavior. Or, for each data slice, determining whether the log record included in each data slice is the log record generated by the abnormal behavior based on the corresponding minimum distance and the distance between the log vector sample included in the target category and the category center point. By comparing with each category obtained by clustering in advance, whether the log record included in each data piece is the log record generated by abnormal behaviors can be accurately determined.
As an implementation manner of the embodiment of the present invention, the step of determining, for each data slice, whether a log record included in each data slice is a log record generated by an abnormal behavior based on the corresponding minimum distance and the distance between the log vector sample included in the target category and the category center point may include:
and for each data sheet, if the minimum distance corresponding to the data sheet is greater than the maximum distance between the log vector sample included in the target category corresponding to the data sheet and the category center point, determining that the log record included in the data sheet is the log record generated by abnormal behaviors.
The distance between each log vector sample included in the target category and the category center point represents the probability that the log vector sample belongs to the target category, and the maximum of these distances corresponds to the lowest probability with which a sample still belongs to the target category. Therefore, if the minimum distance corresponding to the data slice is greater than this maximum distance, the probability that the log vector corresponding to the data slice belongs to the target category is very small.
Because the target category is the category to which the log vector most probably belongs, the log vector is then unlikely to belong to any of the categories. It probably corresponds to log records generated by a new physical behavior, which is likely to be an abnormal behavior, so the electronic device can determine that the log records included in the data slice are log records generated by abnormal behavior.
As can be seen, in this embodiment, for each data slice, if the minimum distance corresponding to the data slice is greater than the maximum distance between the log vector sample included in the target category corresponding to the data slice and the category center point, the electronic device may determine that the log record included in the data slice is a log record generated by the abnormal behavior, and may detect the abnormal behavior quickly and accurately.
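The two decision rules of step S203 can be sketched as follows. Combining both rules in one function is an illustrative choice, since the description presents them as alternatives; the function name, the category labels, and the stored maximum distances are hypothetical.

```python
def is_abnormal_slice(target, min_distance, max_sample_distance, abnormal_categories=()):
    """Decide whether a data slice holds log records generated by abnormal behavior.

    target: target category of the slice (category of the nearest center point)
    min_distance: distance from the slice's log vector to that center point
    max_sample_distance: per-category maximum distance between a training
        sample and the category center point (recorded during clustering)
    abnormal_categories: categories already marked abnormal during mining
    """
    if target in abnormal_categories:
        return True  # rule 1: the nearest category is itself abnormal
    # Rule 2: farther from the center than every sample seen in training.
    return min_distance > max_sample_distance[target]

max_sample_distance = {"A": 1.5, "B": 2.0}  # hypothetical stored values
inside = is_abnormal_slice("B", 1.0, max_sample_distance)
outside = is_abnormal_slice("B", 2.5, max_sample_distance)
flagged = is_abnormal_slice("A", 0.3, max_sample_distance, abnormal_categories={"A"})
```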
As an implementation manner of the embodiment of the present invention, as shown in fig. 3, the step of generating a log vector corresponding to each data slice according to a log type of a log record included in each data slice may include:
S301, counting the number of log records of each log type included in each data slice;
Each data slice includes at least one log record, and the log types of the log records may differ. To determine the log vector corresponding to each data slice, the electronic device may count the number of log records of each log type included in each data slice.
As an implementation, the electronic device can determine in advance all the log types that may appear, written as Figure 594781DEST_PATH_IMAGE014, where Figure 323702DEST_PATH_IMAGE015 is the number of log types. The data slices obtained by dividing the log to be detected are written as Figure 340200DEST_PATH_IMAGE016, where Figure 838177DEST_PATH_IMAGE005 is the number of data slices. The electronic device may count the number of log records in each data slice that belong to log type Figure 547507DEST_PATH_IMAGE017.
For example, Figure 498146DEST_PATH_IMAGE018 is 5, and data slice Figure 380651DEST_PATH_IMAGE019 includes 10 log records. Of these 10 log records, 2 have log type Figure 670818DEST_PATH_IMAGE020, 3 have log type Figure 613366DEST_PATH_IMAGE021, 4 have log type Figure 51301DEST_PATH_IMAGE022, 1 has log type Figure 908136DEST_PATH_IMAGE023, and none has log type Figure 380706DEST_PATH_IMAGE024. The numbers of log records of each log type included in data slice Figure 494155DEST_PATH_IMAGE019 are then 2, 3, 4, 0, 1, respectively.
S302, generating a log vector corresponding to each data sheet based on the number.
After the numbers corresponding to each data slice are obtained, the electronic device can generate a numeric log vector corresponding to each data slice, which facilitates subsequent processing. The elements of the log vector indicate the number of log records in the corresponding data slice that belong to each log type.
In one embodiment, the electronic device determines in advance all possible log types, written as Figure 357069DEST_PATH_IMAGE025, and the data slices obtained by dividing the log to be detected are written as {Figure 581377DEST_PATH_IMAGE016}. The electronic device may count the number of log records in each data slice that belong to log type Figure 580557DEST_PATH_IMAGE017 and, in turn, generate Figure 864908DEST_PATH_IMAGE005 numeric vectors of dimension Figure 277435DEST_PATH_IMAGE018, namely the log vectors, which can be written as {Figure 243117DEST_PATH_IMAGE026}.
For example, Figure 424699DEST_PATH_IMAGE018 is 5, and the data slices obtained by dividing the log to be detected are written as {Figure 50590DEST_PATH_IMAGE027}. Data slice Figure 950413DEST_PATH_IMAGE028 includes 8 log records, and the numbers of log records of each log type are 1, 3, 1 and 2, respectively; data slice Figure 516524DEST_PATH_IMAGE019 includes 10 log records, and the numbers of log records of each log type are 2, 3, 4, 0 and 1, respectively; data slice Figure 755875DEST_PATH_IMAGE029 includes 12 log records, and the numbers of log records of each log type are 4, 3, 1, 2 and 2, respectively; data slice Figure 116449DEST_PATH_IMAGE030 includes 5 log records, and the numbers of log records of each log type are 1, 2, 0, 1 and 1, respectively. Then 4 log vectors of dimension 5 can be generated, namely Figure 503568DEST_PATH_IMAGE031.
As can be seen, in this embodiment, the electronic device may count the number of log records of each log type included in each data slice, and further generate a log vector corresponding to each data slice based on the number. The elements of the log vector can represent the number of log records contained in the corresponding data sheet belonging to each log type, and are digital vectors, so that the distance can be conveniently calculated between the subsequent data sheet and the category center point obtained by pre-clustering, and the accuracy of abnormal behavior detection is improved.
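Steps S301 and S302 can be sketched as follows, reproducing the 10-record data slice from the example above. The type names `t1` to `t5` and the function name `slice_to_vector` are placeholders assumed for illustration; the original symbols for the log types are not reproduced here.

```python
from collections import Counter

def slice_to_vector(slice_types, all_types):
    """Build a count vector with one element per known log type, in a fixed order."""
    counts = Counter(slice_types)
    # Types that never occur in the slice contribute 0.
    return [counts.get(log_type, 0) for log_type in all_types]

all_types = ["t1", "t2", "t3", "t4", "t5"]  # all log types, determined in advance
# A data slice with 10 records: 2 of t1, 3 of t2, 4 of t3, none of t4, 1 of t5.
slice_types = ["t1"] * 2 + ["t2"] * 3 + ["t3"] * 4 + ["t5"]
vector = slice_to_vector(slice_types, all_types)
```

Keeping the order of `all_types` fixed matters: every log vector and every category center point must use the same dimension order for the distances to be meaningful.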
As an implementation manner of the embodiment of the present invention, before the step of determining whether the log record included in each data slice is a log record generated by an abnormal behavior based on a distance between each log vector and a center point of each category obtained by clustering in advance, the method may further include:
and carrying out normalization processing on each log vector according to a preset normalization rule to obtain a processed log vector.
Because the number of log records included in the data slice corresponding to one or some log vectors may be large, a phenomenon that the numerical range of one or some log vectors is too large may occur, and in order to avoid the influence of the too large numerical range of individual log vectors on subsequent processing, the electronic device may perform normalization processing on each log vector according to a preset normalization rule to obtain a processed log vector.
In one embodiment, each log vector can be processed with the Z-score normalization method. After Z-score normalization, the values for each log type have a mean of 0 and a standard deviation of 1, which eliminates the influence of individual values with an excessively large range in the log vectors. The vectors after Z-score normalization can be written as {Figure 811053DEST_PATH_IMAGE032}, and the standardization information for each log type can be written as {Figure 701649DEST_PATH_IMAGE033}. The standardization information may be the standard deviation and the mean corresponding to each log type.
For example, the log vectors are Figure 436386DEST_PATH_IMAGE031. For log type Figure 45222DEST_PATH_IMAGE020, the corresponding values are currently 1, 2, 4 and 1. Normalizing them with the Z-score method gives the corresponding standardization information Figure 218715DEST_PATH_IMAGE034, namely: mean $= (1 + 2 + 4 + 1)/4 = 2$, standard deviation $= \sqrt{\left((1-2)^2 + (2-2)^2 + (4-2)^2 + (1-2)^2\right)/4} \approx 1.22$.
Of course, it is also reasonable to normalize the log vectors with other normalization methods, which are not specifically limited or described herein.
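The Z-score arithmetic in the example above can be checked with a short sketch. It assumes the population standard deviation (dividing by the number of values, 4), since that is what yields the stated value of about 1.22; the variable names are illustrative.

```python
import statistics

values = [1, 2, 4, 1]                 # counts of one log type across 4 log vectors
mean = statistics.fmean(values)       # (1 + 2 + 4 + 1) / 4
std = statistics.pstdev(values)       # population standard deviation, sqrt(6 / 4)
z_scores = [(v - mean) / std for v in values]
```

The sample standard deviation (`statistics.stdev`, dividing by 3) would give about 1.41 instead, so the choice of estimator should be kept consistent between mining and detection.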
Correspondingly, the step of determining whether each data sheet is a log record generated by abnormal behavior based on the distance between each log vector and the central point of each category obtained by pre-clustering may include:
and determining whether each data sheet is a log record generated by abnormal behaviors or not based on the distance between each processed log vector and the central point of each category obtained by pre-clustering.
After the normalized log vectors are obtained, the electronic device may determine whether the log records included in each data slice are log records generated by abnormal behavior based on the distance between each processed log vector and each category center point obtained by pre-clustering.
Therefore, in this embodiment, before determining whether the log record included in each data piece is the log record generated by the abnormal behavior based on the distance between each log vector and the central point of each category obtained by clustering in advance, the electronic device may perform normalization processing on each log vector according to a preset normalization rule to obtain a processed log vector, so that an adverse effect on subsequent processing due to an excessively large numerical range of an individual log vector may be avoided, and the accuracy of detecting the abnormal behavior may be further improved.
As an implementation manner of the embodiment of the present invention, as shown in fig. 4, the determining manner of the category center point may include:
S401, obtaining a log sample;
The electronic device may perform behavior pattern mining in advance to determine information such as the categories of the log and which of them are abnormal. First, the electronic device may obtain a log sample. The log sample includes a plurality of log record samples; each log record sample includes a log type and also includes information such as the generation time and the log source. When abnormality detection is needed for security equipment, the log sample can be historical log data generated by the security equipment.
In one embodiment, the electronic device may obtain, according to a log source IP, historical log data of the security equipment over a period of time as the log sample, and sort the log sample by generation time for subsequent division. For example, at time Figure 633570DEST_PATH_IMAGE036, all log data whose log source IP is S1 and whose generation time falls between the start time Figure 729702DEST_PATH_IMAGE037 and the end time of the log sample are selected from the database and sorted in ascending order of generation time.
S402, dividing adjacent log record samples of which the interval of the generation time is not more than a preset threshold value into the same data slice to obtain a plurality of data slice samples;
wherein, the log record sample included in each data slice sample corresponds to a physical behavior.
S403, generating a log vector sample corresponding to each data sheet sample according to the log type of the log record sample included in each data sheet sample;
The log record samples are divided into a plurality of data slice samples in the same way as the log to be detected is divided into a plurality of data slices, so the details are not repeated here. The data slice samples can be written as {Figure 244177DEST_PATH_IMAGE039}. The log vector samples are generated in the same way as the log vectors, so the details are likewise not repeated here. The log vector samples can be written as {Figure 383035DEST_PATH_IMAGE040}.
S404, clustering the log vector samples, and determining the central point of each category as the category central point.
After obtaining the log vector samples, in order to divide the plurality of log vector samples into different categories, the electronic device may perform clustering on the log vector samples, and further determine a center point of each category as a category center point.
In one embodiment, the K-Means algorithm may be used to perform unsupervised learning on the log vector samples to determine the category number of each log vector sample. K-Means is a typical non-hierarchical clustering algorithm that uses the distance between samples as the measure of similarity and, through iterative optimization, divides the input vector samples into Figure 904146DEST_PATH_IMAGE008 categories, each of which includes at least one vector sample. The center point of each category can then be determined, written as {Figure 153862DEST_PATH_IMAGE041}, and the maximum distance between the vector samples and the center point in each category can also be determined, written as {Figure 76818DEST_PATH_IMAGE042}.
As can be seen, in this embodiment, the electronic device may obtain log samples, divide adjacent log record samples whose generation time interval is not greater than a preset threshold into the same data slice to obtain a plurality of data slice samples, generate a log vector sample corresponding to each data slice sample according to a log type of the log record sample included in each data slice sample, and further perform clustering on the log vector samples to determine a central point of each category as a category central point. The behavior patterns of the physical behaviors can be mined, the category center points of the categories are accurately determined, and the log vector samples are divided into the categories.
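Step S404 can be sketched with a plain K-Means loop that also records the per-category maximum distance used later in detection. This is a simplified illustration: the deterministic initialization (first `k` samples as starting centers), the function name, and the 2-dimensional toy samples are assumptions, and a production system would more likely rely on an existing clustering library.

```python
import math

def kmeans(samples, k, max_iter=100):
    """Return (centers, labels, max_dist) for a basic K-Means clustering."""
    centers = [list(map(float, s)) for s in samples[:k]]  # assumed simple init
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: math.dist(s, centers[j]))
                      for s in samples]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    # Largest sample-to-center distance within each category.
    max_dist = [max((math.dist(s, centers[j])
                     for s, lab in zip(samples, labels) if lab == j), default=0.0)
                for j in range(k)]
    return centers, labels, max_dist

samples = [[1, 1], [9, 9], [1, 2], [8, 9], [2, 1], [9, 8]]
centers, labels, max_dist = kmeans(samples, k=2)
```

The returned `centers` and `max_dist` correspond to the category center points and per-category maximum distances that the description says are kept for the detection stage.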
As an implementation manner of the embodiment of the present invention, as shown in fig. 5, the step of clustering the log vector samples and determining the central point of each category may include:
S501, determining the number of categories for clustering through an elbow algorithm based on the log vector samples;
An unsupervised clustering algorithm requires the number of categories to be specified. The elbow algorithm can determine the optimal number of categories by computing the sum of squared errors of the samples for different numbers of categories and the rate at which it decreases. The electronic device can therefore use the elbow algorithm to determine the number of categories for clustering based on the log vector samples, which can be written as Figure 386577DEST_PATH_IMAGE043.
S502, clustering the log vector samples based on the category number, and determining the central point of each category.
After the category number is determined, the electronic device can cluster the log vector samples according to the category number, and then determine the central points of the categories with the category number.
Therefore, in this embodiment, the electronic device may determine the number of categories of clustering processing by using an elbow algorithm based on the log vector samples, perform clustering on the log vector samples based on the number of categories, and determine the central point of each category. Therefore, the optimal category number can be determined, the accuracy of the clustering result is ensured, and the accuracy of subsequent abnormal behavior detection is further improved.
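One way to automate the elbow choice in step S501 is sketched below. The description does not say how the elbow is located, so the rule used here (pick the number of categories where the decrease in the sum of squared errors slows down the most, that is, the largest second difference) is an assumed heuristic, and the error values are made-up numbers for illustration only.

```python
def pick_elbow(sse_by_k):
    """Pick the k at which the drop in sum of squared errors flattens the most.

    sse_by_k maps each candidate number of categories k to the sum of squared
    errors obtained when clustering with that k.
    """
    ks = sorted(sse_by_k)
    best_k, best_bend = ks[0], float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = sse_by_k[prev_k] - sse_by_k[k]
        drop_after = sse_by_k[k] - sse_by_k[next_k]
        bend = drop_before - drop_after  # large when the curve flattens after k
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Illustrative values: the error falls sharply up to k = 3, then levels off.
sse_by_k = {1: 400.0, 2: 250.0, 3: 60.0, 4: 50.0, 5: 45.0}
chosen_k = pick_elbow(sse_by_k)
```

In practice the curve is often inspected visually as well, because an automatic rule can be misled when the error decreases smoothly without a clear bend.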
As an implementation manner of the embodiment of the present invention, after the step of clustering the log vector samples, determining a center point of each category, and using the center point as a category center point, the method may further include:
counting the number of log vector samples corresponding to each category; and determining whether each category is an abnormal category or not based on the number corresponding to each category and the total number of the log vector samples.
In order to determine whether each of the clustered categories has an abnormal category, the electronic device may count the number of log vector samples corresponding to each category, and if the number of log vector samples corresponding to a certain category is small, it indicates that, in a large number of log vector samples, the probability of occurrence of the log vector sample of the category is very low, and it is likely that the log vector sample corresponds to the abnormal behavior, so the electronic device may determine whether each category is an abnormal category based on the number corresponding to each category and the total number of the log vector samples.
Therefore, in this embodiment, the electronic device may count the number of log vector samples corresponding to each category and then determine whether each category is an abnormal category based on the number corresponding to each category and the total number of log vector samples. In subsequent abnormal behavior detection, if a log vector corresponding to the log to be detected belongs to an abnormal category, that log vector can be quickly and accurately identified as one generated from log records produced by abnormal behavior.
As an implementation manner of the embodiment of the present invention, as shown in fig. 6, the step of determining whether each category is an abnormal category based on the number corresponding to each category and the total number of the log vector samples may include:
S601, for each category, calculating the ratio of the number corresponding to the category to the total number of the log vector samples; if the ratio is smaller than a preset threshold, executing step S602; if the ratio is not smaller than the preset threshold, executing step S603;
Since the ratio of the number of log vector samples corresponding to a category to the total number of log vector samples reflects the occurrence probability of the log vector samples of that category, the electronic device can calculate the ratio of the number corresponding to each category to the total number of log vector samples.
S602, determining the category corresponding to the ratio as an abnormal category;
If the ratio corresponding to a category is smaller than the preset threshold, the occurrence probability of the log vector samples of that category is very low, and the log vector samples of the category are likely to have been generated from log records produced by abnormal behavior, so the category corresponding to the ratio can be determined to be an abnormal category. For the convenience of subsequent processing, the category numbers of the abnormal categories can be recorded, written as Figure 457301DEST_PATH_IMAGE044, where Figure 970363DEST_PATH_IMAGE045 is the number of abnormal categories.
S603, determining the category corresponding to the ratio as a normal category.
If the ratio corresponding to a certain category is not less than the preset threshold, it indicates that the occurrence probability of the log vector sample of the category is not low, then the log vector sample corresponding to the category is likely to be the log vector sample generated by the log record generated by the normal behavior, so that the category corresponding to the ratio can be determined to be the normal category.
As can be seen, in this embodiment, the electronic device may calculate, for each category, a ratio of the number corresponding to the category to the total number of log vector samples, and if the ratio is smaller than a preset threshold, determine that the category corresponding to the ratio is an abnormal category; and if the ratio is not smaller than the preset threshold, determining that the category corresponding to the ratio is a normal category. Thus, the abnormal category in each category obtained by clustering can be accurately determined.
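Steps S601 to S603 can be sketched as follows. The function name, the label values, and the 5% threshold are assumptions for illustration; the description leaves the threshold value open.

```python
from collections import Counter

def find_abnormal_categories(labels, ratio_threshold):
    """Return the categories whose share of all log vector samples is below the threshold."""
    counts = Counter(labels)  # number of log vector samples per category
    total = len(labels)
    return sorted(category for category, n in counts.items()
                  if n / total < ratio_threshold)

# Category labels of 100 log vector samples after clustering (made-up split).
labels = [0] * 60 + [1] * 37 + [2] * 3
abnormal = find_abnormal_categories(labels, ratio_threshold=0.05)
```

Category 2 holds only 3% of the samples and is marked abnormal, while a category whose ratio equals the threshold exactly would be treated as normal, matching the "not smaller than" condition of step S603.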
As an implementation manner of the embodiment of the present invention, before the step of clustering the log vector samples, determining a center point of each category, and using the center point as a category center point, the method may further include:
according to a preset normalization rule, each log vector sample is subjected to normalization processing to obtain a processed log vector sample,
correspondingly, the step of clustering the log vector samples to determine the center point of each category as the category center point may include:
clustering the processed log vector samples, and determining the center point of each category as a category center point.
In one embodiment, each log vector sample may be normalized with the Z-score normalization method, and the normalized log vector samples may be written as {Figure 810143DEST_PATH_IMAGE046}. To ensure that log vectors in the subsequent abnormal behavior detection are normalized in the same way as in the behavior pattern mining process (that is, the clustering process), the electronic device can record the standardization information Figure 290803DEST_PATH_IMAGE033 and use the standardization information Figure 520927DEST_PATH_IMAGE033 to normalize the log vectors in subsequent processing.
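Reusing the recorded standardization information at detection time can be sketched as follows. The function names `fit_zscore` and `apply_zscore`, the sample vectors, and the handling of a zero standard deviation (centering without scaling) are assumptions; the description does not cover the zero-deviation case.

```python
import statistics

def fit_zscore(samples):
    """Compute (mean, population std) per log type from the log vector samples."""
    return [(statistics.fmean(column), statistics.pstdev(column))
            for column in zip(*samples)]

def apply_zscore(vector, info):
    """Normalize one log vector with previously recorded (mean, std) pairs."""
    # Assumption: a log type with zero deviation is only centered, not scaled.
    return [(x - mean) / std if std else x - mean
            for x, (mean, std) in zip(vector, info)]

# Mining stage: fit on log vector samples and keep the result.
samples = [[2, 3, 4, 0, 1], [4, 3, 1, 2, 2], [1, 2, 0, 1, 1]]
info = fit_zscore(samples)

# Detection stage: normalize a new log vector with the stored information.
normalized = apply_zscore([2, 3, 4, 0, 1], info)
```

The point of the design is that detection never recomputes the mean and standard deviation from the log to be detected; doing so would place new vectors in a different coordinate system from the stored category center points.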
To facilitate subsequent detection of abnormal behavior, the electronic device may record, as the result of behavior pattern mining, the log source S1, the behavior pattern mining time Figure 378025DEST_PATH_IMAGE036, the start time of the log sample Figure 72312DEST_PATH_IMAGE037, the end time of the log sample Figure 661556DEST_PATH_IMAGE038, the log types {Figure 706872DEST_PATH_IMAGE047}, the standardization information {Figure 39765DEST_PATH_IMAGE048}, the center point information of each category {Figure 588558DEST_PATH_IMAGE007}, the distance between each log vector sample in a category and the center point {Figure 676599DEST_PATH_IMAGE049}, the category numbers of the abnormal categories Figure 379851DEST_PATH_IMAGE044, and the number of clustered categories Figure 578751DEST_PATH_IMAGE008. In this way, the stored information can be read when abnormal behavior detection is performed subsequently.
As an implementation manner of the embodiment of the present invention, the method may further include:
outputting alarm information when the log records included in a data slice are determined to be log records generated by abnormal behavior.
In one implementation, for each data slice of the log to be detected that belongs to an abnormal category, or whose log records are determined to be log records generated by abnormal behavior, the electronic device may construct and output alarm information to prompt the relevant personnel to handle it in time.
The alarm information may include: the log source of the log to be detected, the abnormal behavior detection time, and the log start time, end time, log types and other information of the target data slice. The target data slice is a data slice of the log to be detected that belongs to an abnormal category, or a data slice whose log records are log records generated by abnormal behavior.
Corresponding to the above abnormal behavior detection method, the embodiment of the present invention further provides an abnormal behavior detection apparatus. The following describes an abnormal behavior detection apparatus provided in an embodiment of the present invention.
As shown in Fig. 7, an apparatus for detecting abnormal behavior includes:
a log obtaining module 710, configured to obtain a log to be detected;
the log to be detected comprises a plurality of log records, and each log record comprises generation time and a log type.
The data slice dividing module 720 is configured to divide the adjacent log records, of which the interval of the generation time is not greater than a preset threshold, into the same data slice to obtain a plurality of data slices;
wherein, the log record included in each data slice corresponds to a physical behavior.
A log vector generating module 730, configured to generate a log vector corresponding to each data slice according to a log type of a log record included in each data slice;
wherein the log vector is associated with the log type.
An abnormal behavior detection module 740, configured to determine whether the log records included in each data slice are log records generated by abnormal behavior, based on the distance between each log vector and each category center point obtained by pre-clustering.
The category center point is the center point of a category obtained by clustering the log vector samples corresponding to the data slice samples into which a pre-obtained log sample is divided.
As can be seen, in the scheme provided in this embodiment of the present invention, the electronic device may obtain a log to be detected, where the log to be detected includes a plurality of log records and each log record includes a generation time and a log type. Adjacent log records whose generation-time interval is not greater than a preset threshold are divided into the same data slice to obtain a plurality of data slices, and the log records included in each data slice correspond to one physical behavior. A log vector corresponding to each data slice is generated according to the log types of the log records included in that data slice, so the log vector is related to the log type. Whether the log records included in each data slice are log records generated by abnormal behavior is then determined based on the distance between each log vector and each category center point obtained by pre-clustering, where a category center point is the center point of a category obtained by clustering the log vector samples corresponding to the data slice samples into which a pre-obtained log sample is divided. Because adjacent log records whose generation-time interval is not greater than the preset threshold are placed in the same data slice, log records generated by one continuous physical behavior are not split across different data slices, and each data slice contains the log records generated by one coherent physical behavior. Therefore, log records that do not conform to the behavior patterns can be accurately mined based on the log vector of each data slice and the pre-clustered category center points, and the log records corresponding to abnormal behaviors can be accurately determined.
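The data slice division summarized above can be illustrated with a short sketch. The record layout `(timestamp, log_type)`, the numeric timestamps, the function name `split_into_slices`, and the gap of 5 time units are assumptions made for the example only.

```python
def split_into_slices(records, max_gap):
    """Group (timestamp, log_type) records into data slices.

    Adjacent records whose generation-time gap is not greater than max_gap
    are placed in the same slice; a larger gap starts a new slice.
    """
    slices = []
    for record in sorted(records, key=lambda r: r[0]):
        # Compare with the last record of the most recent slice.
        if slices and record[0] - slices[-1][-1][0] <= max_gap:
            slices[-1].append(record)
        else:
            slices.append([record])
    return slices


# Hypothetical log records: two bursts of activity separated by a long pause.
records = [(0, "login"), (2, "preview"), (3, "preview"), (60, "login"), (61, "config")]
slices = split_into_slices(records, max_gap=5)
```

With these inputs the first three records form one slice and the last two form another, so each slice corresponds to one burst of activity.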
As an implementation manner of the embodiment of the present invention, as shown in fig. 8, the abnormal behavior detection module 740 may include:
a distance calculation unit 741, configured to calculate a distance between each log vector and each category center point obtained by pre-clustering, where the distance between each log vector and each category center point is used to represent a probability that each log vector belongs to a category corresponding to the category center point;
a target class determining unit 742, configured to determine, for each data slice, a class corresponding to a minimum distance in the distances as a target class of the data slice;
a first abnormal behavior detection unit 743, configured to determine, if the target category is an abnormal category, that the log records included in the data slice corresponding to the target category are log records generated by abnormal behavior; or, for each data slice, to determine whether the log records included in the data slice are log records generated by abnormal behavior based on the corresponding minimum distance and the distances between the log vector samples included in the target category and the category center point, wherein the distance between a log vector sample and the category center point is used to represent the probability that the log vector sample belongs to the category.
As an implementation manner of the embodiment of the present invention, the first abnormal behavior detection unit 743 may include:
an abnormal behavior detection subunit, configured to, for each data slice, determine that the log records included in the data slice are log records generated by abnormal behavior if the minimum distance corresponding to the data slice is greater than the maximum distance between the log vector samples included in the target category corresponding to the data slice and the category center point.
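The two detection criteria handled by these units can be sketched together as follows. The text presents them as alternatives; applying both in one function is an assumption made to keep the example short, as are the Euclidean distance, the function name `detect`, and all numeric values.

```python
import math


def detect(vector, centers, abnormal_categories, max_sample_distance):
    """Return (target_category, is_abnormal) for one log vector.

    centers[i] is the center point of category i, and max_sample_distance[i]
    is the largest distance from any sample of category i to that center.
    """
    distances = [math.dist(vector, c) for c in centers]
    # The target category is the one whose center point is nearest.
    target = min(range(len(centers)), key=distances.__getitem__)
    min_distance = distances[target]
    is_abnormal = (
        target in abnormal_categories                      # criterion 1
        or min_distance > max_sample_distance[target]      # criterion 2
    )
    return target, is_abnormal


# Hypothetical mining results: two categories, the second one abnormal.
centers = [[0.0, 0.0], [10.0, 10.0]]
abnormal_categories = {1}
max_sample_distance = [2.0, 1.5]

normal_case = detect([1.0, 1.0], centers, abnormal_categories, max_sample_distance)
far_case = detect([3.0, 4.0], centers, abnormal_categories, max_sample_distance)
abnormal_category_case = detect([10.0, 10.5], centers, abnormal_categories, max_sample_distance)
```

The second call shows the distance criterion: the vector is nearest to a normal category, but it lies farther from the center point than any sample seen during mining.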
As an implementation manner of the embodiment of the present invention, the log vector generating module 730 may include:
a number counting unit, configured to count the number of log records of each log type included in each data slice;
a vector generating unit, configured to generate a log vector corresponding to each data slice based on the numbers.
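A count-based log vector of this kind can be sketched as below. The fixed ordering of log types in `type_dictionary` and the log type names are assumptions for illustration.

```python
from collections import Counter


def slice_to_vector(log_types_in_slice, type_dictionary):
    """Build a log vector whose i-th entry counts records of the i-th log type."""
    counts = Counter(log_types_in_slice)
    return [counts.get(log_type, 0) for log_type in type_dictionary]


# Hypothetical dictionary fixing the position of each log type in the vector.
type_dictionary = ["login", "preview", "config"]
vector = slice_to_vector(["login", "preview", "preview"], type_dictionary)
```

Using one shared dictionary for both mining and detection keeps the vector dimensions aligned between the two phases.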
As an implementation manner of the embodiment of the present invention, the apparatus may further include:
a normalization module, configured to, before it is determined whether the log records included in each data slice are log records generated by abnormal behavior based on the distance between each log vector and each category center point obtained by pre-clustering, normalize each log vector according to a preset normalization rule to obtain a processed log vector;
the abnormal behavior detection module 740 may include:
a second abnormal behavior detection unit, configured to determine whether the log records included in each data slice are log records generated by abnormal behavior, based on the distance between each processed log vector and each category center point obtained by pre-clustering.
As an implementation manner of the embodiment of the present invention, the apparatus may further include a category center point determining module, configured to determine the category center point, where the category center point determining module may include:
a log sample obtaining unit, configured to obtain a log sample;
wherein the log sample includes a plurality of log record samples, and each log record sample includes a generation time and a log type.
A data slice sample dividing unit, configured to divide adjacent log record samples whose generation-time interval is not greater than a preset threshold into the same data slice to obtain a plurality of data slice samples;
wherein the log record samples included in each data slice sample correspond to one physical behavior.
A log vector sample generating unit, configured to generate a log vector sample corresponding to each data slice sample according to the log types of the log record samples included in each data slice sample;
wherein the log vector sample is related to the log type.
A clustering unit, configured to cluster the log vector samples and determine the center point of each category as a category center point.
As an implementation manner of the embodiment of the present invention, the clustering unit may include:
a category number determination subunit, configured to determine, based on the log vector sample, a category number of clustering processing by an elbow algorithm;
a clustering subunit, configured to cluster the log vector samples based on the number of categories and determine the center point of each category.
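One way to choose the number of categories with an elbow criterion is sketched below. Several points are assumptions, since the text only names the elbow algorithm: k-means is used as the clustering method, the elbow is taken as the k with the largest second difference of the sum of squared errors (SSE), and the initialization is a deterministic farthest-point rule chosen so the example is reproducible.

```python
import math


def kmeans(points, k, iterations=50):
    """Small Lloyd-style k-means; returns (centers, sum of squared errors)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
    return centers, sse


def choose_k_by_elbow(points, max_k):
    """Pick k where the SSE curve bends most (largest second difference)."""
    sse = [kmeans(points, k)[1] for k in range(1, max_k + 1)]
    best_k, best_bend = 1, -math.inf
    for i in range(1, len(sse) - 1):
        bend = (sse[i - 1] - sse[i]) - (sse[i] - sse[i + 1])
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k, sse


# Hypothetical log vector samples forming three well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [20, 0], [20, 1], [21, 0]]
best_k, sse = choose_k_by_elbow(points, max_k=5)
centers, _ = kmeans(points, best_k)
```

In practice the elbow is often judged visually from a plot of SSE against k; the second-difference rule here is only one simple way to automate that judgment.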
As an implementation manner of the embodiment of the present invention, the apparatus may further include:
a sample number counting module, configured to count the number of log vector samples corresponding to each category after the log vector samples are clustered and the center point of each category is determined as a category center point;
an abnormal category determination module, configured to calculate, for each category, a ratio of a number corresponding to the category to a total number of the log vector samples; and if the ratio is smaller than a preset threshold value, determining that the category corresponding to the ratio is an abnormal category.
An embodiment of the present invention further provides an electronic device, as shown in fig. 9, which includes a processor 901, a communication interface 902, a memory 903, and a communication bus 904, where the processor 901, the communication interface 902, and the memory 903 complete mutual communication through the communication bus 904,
a memory 903 for storing computer programs;
the processor 901 is configured to implement the steps of the method for detecting an abnormal behavior according to any one of the embodiments when executing the program stored in the memory 903.
As can be seen, in the scheme provided in this embodiment of the present invention, the electronic device may obtain a log to be detected, where the log to be detected includes a plurality of log records and each log record includes a generation time and a log type. Adjacent log records whose generation-time interval is not greater than a preset threshold are divided into the same data slice to obtain a plurality of data slices, and the log records included in each data slice correspond to one physical behavior. A log vector corresponding to each data slice is generated according to the log types of the log records included in that data slice, so the log vector is related to the log type. Whether the log records included in each data slice are log records generated by abnormal behavior is then determined based on the distance between each log vector and each category center point obtained by pre-clustering, where a category center point is the center point of a category obtained by clustering the log vector samples corresponding to the data slice samples into which a pre-obtained log sample is divided. Because adjacent log records whose generation-time interval is not greater than the preset threshold are placed in the same data slice, log records generated by one continuous physical behavior are not split across different data slices, and each data slice contains the log records generated by one coherent physical behavior. Therefore, log records that do not conform to the behavior patterns can be accurately mined based on the log vector of each data slice and the pre-clustered category center points, and the log records corresponding to abnormal behaviors can be accurately determined.
The communication bus mentioned for the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
In a further embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the method for detecting abnormal behavior described in any of the above embodiments.
It can be seen that, in the scheme provided in this embodiment of the present invention, the computer program, when executed by a processor, may obtain a log to be detected, where the log to be detected includes a plurality of log records and each log record includes a generation time and a log type. Adjacent log records whose generation-time interval is not greater than a preset threshold are divided into the same data slice to obtain a plurality of data slices, and the log records included in each data slice correspond to one physical behavior. A log vector corresponding to each data slice is generated according to the log types of the log records included in that data slice, so the log vector is related to the log type. Whether the log records included in each data slice are log records generated by abnormal behavior is then determined based on the distance between each log vector and each category center point obtained by pre-clustering, where a category center point is the center point of a category obtained by clustering the log vector samples corresponding to the data slice samples into which a pre-obtained log sample is divided. Because adjacent log records whose generation-time interval is not greater than the preset threshold are placed in the same data slice, log records generated by one continuous physical behavior are not split across different data slices, and each data slice contains the log records generated by one coherent physical behavior. Therefore, log records that do not conform to the behavior patterns can be accurately mined based on the log vector of each data slice and the pre-clustered category center points, and the log records corresponding to abnormal behaviors can be accurately determined.
The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (14)

1. A method for detecting abnormal behavior, the method comprising:
acquiring a log to be detected, wherein the log to be detected comprises a plurality of log records, and each log record comprises generation time and a log type;
dividing adjacent log records of which the interval of the generation time is not more than a preset threshold value into the same data slice to obtain a plurality of data slices, wherein the log records included in each data slice correspond to one physical behavior;
generating a log vector corresponding to each data slice according to the log types of the log records included in each data slice, wherein the log vector is related to the log type;
determining whether the log records included in each data slice are log records generated by abnormal behavior based on the distance between each log vector and each category center point obtained by pre-clustering, wherein the category center point is the center point of a category obtained by clustering the log vector samples corresponding to the data slice samples into which a pre-obtained log sample is divided.
2. The method of claim 1, wherein the step of determining whether the log record included in each data piece is a log record generated by abnormal behavior based on the distance between each log vector and the center point of each category obtained by pre-clustering comprises:
calculating the distance between each log vector and each category central point obtained by pre-clustering, wherein the distance between each log vector and each category central point is used for representing the probability that each log vector belongs to the category corresponding to the category central point;
for each data slice, determining the category corresponding to the minimum distance among the distances as the target category of the data slice;
if the target category is an abnormal category, determining that the log records included in the data slice corresponding to the target category are log records generated by abnormal behavior; or,
for each data slice, determining whether the log records included in the data slice are log records generated by abnormal behavior based on the corresponding minimum distance and the distances between the log vector samples included in the target category and the category center point, wherein the distance between a log vector sample and the category center point is used for representing the probability that the log vector sample belongs to the category.
3. The method of claim 2, wherein the step of determining, for each of the data slices, whether the log record included in each of the data slices is a log record generated by abnormal behavior based on the corresponding minimum distance and the distance between the log vector sample included in the target category and the category center point comprises:
for each data slice, if the minimum distance corresponding to the data slice is greater than the maximum distance between the log vector samples included in the target category corresponding to the data slice and the category center point, determining that the log records included in the data slice are log records generated by abnormal behavior.
4. The method of claim 1, wherein the step of generating a log vector corresponding to each data slice according to a log type of a log record included in each data slice comprises:
counting the number of log records of each log type in each data slice;
and generating a log vector corresponding to each data slice based on the number.
5. The method of claim 1, wherein prior to the step of determining whether the log record included in each of the data pieces is a log record generated by abnormal behavior based on a distance between each of the log vectors and the center points of the respective categories clustered in advance, the method further comprises:
performing normalization processing on each log vector according to a preset normalization rule to obtain a processed log vector;
the step of determining whether the log records included in each data slice are log records generated by abnormal behavior based on the distance between each log vector and each category center point obtained by pre-clustering comprises:
determining whether the log records included in each data slice are log records generated by abnormal behavior based on the distance between each processed log vector and each category center point obtained by pre-clustering.
6. The method according to any one of claims 1 to 5, wherein the determination of the class center point comprises:
obtaining a log sample, wherein the log sample comprises a plurality of log record samples, and each log record sample comprises generation time and a log type;
dividing adjacent log record samples of which the interval of the generation time is not more than a preset threshold value into the same data slice to obtain a plurality of data slice samples, wherein the log record sample included in each data slice sample corresponds to one physical behavior;
generating a log vector sample corresponding to each data slice sample according to a log type of a log record sample included in each data slice sample, wherein the log vector sample is related to the log type;
and clustering the log vector samples, and determining the central point of each category as a category central point.
7. The method of claim 6, wherein the step of clustering the log vector samples to determine a center point for each category comprises:
determining the number of categories of clustering processing through an elbow algorithm based on the log vector samples;
and clustering the log vector samples based on the number of the categories, and determining the central point of each category.
8. The method of claim 6, wherein after the step of clustering the log vector samples to determine a center point for each category as a category center point, the method further comprises:
counting the number of log vector samples corresponding to each category;
for each category, calculating the ratio of the number corresponding to the category to the total number of the log vector samples;
and if the ratio is smaller than a preset threshold value, determining that the category corresponding to the ratio is an abnormal category.
9. An apparatus for detecting abnormal behavior, the apparatus comprising:
the log acquiring module is used for acquiring a log to be detected, wherein the log to be detected comprises a plurality of log records, and each log record comprises generation time and a log type;
the data slice dividing module is used for dividing adjacent log records of which the interval of the generation time is not more than a preset threshold value into the same data slice to obtain a plurality of data slices, wherein the log record included in each data slice corresponds to one physical behavior;
a log vector generation module, configured to generate a log vector corresponding to each data slice according to the log types of the log records included in each data slice, wherein the log vector is related to the log type;
an abnormal behavior detection module, configured to determine, based on the distance between each log vector and each category center point obtained by pre-clustering, whether the log records included in each data slice are log records generated by abnormal behavior, wherein the category center point is the center point of a category obtained by clustering the log vector samples corresponding to the data slice samples into which a pre-obtained log sample is divided.
10. The apparatus of claim 9, wherein the abnormal behavior detection module comprises:
the distance calculation unit is used for calculating the distance between each log vector and each category central point obtained by pre-clustering, wherein the distance between each log vector and each category central point is used for representing the probability that each log vector belongs to the category corresponding to the category central point;
a target category determining unit, configured to determine, for each data slice, a category corresponding to a minimum distance in the distances, as a target category of the data slice;
a first abnormal behavior detection unit, configured to determine, if the target category is an abnormal category, that the log records included in the data slice corresponding to the target category are log records generated by abnormal behavior; or, for each data slice, to determine whether the log records included in the data slice are log records generated by abnormal behavior based on the corresponding minimum distance and the distances between the log vector samples included in the target category and the category center point, wherein the distance between a log vector sample and the category center point is used for representing the probability that the log vector sample belongs to the category.
11. The apparatus of claim 9 or 10, wherein the apparatus further comprises a categorical center point determination module for determining a categorical center point, the categorical center point determination module comprising:
the log sample acquiring unit is used for acquiring a log sample, wherein the log sample comprises a plurality of log record samples, and each log record sample comprises a generation time and a log type;
the data slice sample dividing unit is used for dividing adjacent log record samples of which the interval of the generation time is not more than a preset threshold value into the same data slice to obtain a plurality of data slice samples, wherein the log record sample included in each data slice sample corresponds to one physical behavior;
a log vector sample generating unit, configured to generate a log vector sample corresponding to each data slice sample according to the log types of the log record samples included in each data slice sample, wherein the log vector sample is related to the log type;
and the clustering unit is used for clustering the log vector samples and determining the central point of each category as the category central point.
12. The apparatus of claim 11, wherein the apparatus further comprises:
a sample number counting module, configured to count the number of log vector samples corresponding to each category after the log vector samples are clustered and the center point of each category is determined as a category center point;
an abnormal category determination module, configured to calculate, for each category, a ratio of a number corresponding to the category to a total number of the log vector samples; and if the ratio is smaller than a preset threshold value, determining that the category corresponding to the ratio is an abnormal category.
13. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 8 when executing a program stored in the memory.
14. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of the claims 1-8.
CN202010625852.5A 2020-07-02 2020-07-02 Abnormal behavior detection method and device, electronic equipment and storage medium Active CN111538642B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010625852.5A CN111538642B (en) 2020-07-02 2020-07-02 Abnormal behavior detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010625852.5A CN111538642B (en) 2020-07-02 2020-07-02 Abnormal behavior detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111538642A (en) 2020-08-14
CN111538642B (en) 2020-10-02

Family

ID=71974628

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010625852.5A Active CN111538642B (en) 2020-07-02 2020-07-02 Abnormal behavior detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111538642B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514398A (en) * 2013-10-18 2014-01-15 中国科学院信息工程研究所 Real-time online log detection method and system
US20160286351A1 (en) * 2015-03-24 2016-09-29 Exactigo, Inc. Indoor navigation anomaly detection
US9516053B1 (en) * 2015-08-31 2016-12-06 Splunk Inc. Network security threat detection by user/user-entity behavioral analysis
CN107528832A (en) * 2017-08-04 2017-12-29 北京中晟信达科技有限公司 Baseline structure and the unknown anomaly detection method of a kind of system-oriented daily record
CN110933115A (en) * 2019-12-31 2020-03-27 上海观安信息技术股份有限公司 Analysis object behavior abnormity detection method and device based on dynamic session
CN111178380A (en) * 2019-11-15 2020-05-19 腾讯科技(深圳)有限公司 Data classification method and device and electronic equipment
CN111240942A (en) * 2019-12-02 2020-06-05 华为技术有限公司 Log abnormity detection method and device

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112367222A (en) * 2020-10-30 2021-02-12 中国联合网络通信集团有限公司 Network anomaly detection method and device
CN112306982A (en) * 2020-11-16 2021-02-02 杭州海康威视数字技术股份有限公司 Abnormal user detection method and device, computing equipment and storage medium
CN112612887A (en) * 2020-12-25 2021-04-06 北京天融信网络安全技术有限公司 Log processing method, device, equipment and storage medium
WO2023272851A1 (en) * 2021-06-29 2023-01-05 未鲲(上海)科技服务有限公司 Anomaly data detection method and apparatus, device, and storage medium
CN113360313A (en) * 2021-07-07 2021-09-07 时代云英(深圳)科技有限公司 Behavior analysis method based on massive system logs
WO2023284132A1 (en) * 2021-07-15 2023-01-19 苏州浪潮智能科技有限公司 Method and system for analyzing cloud platform logs, device, and medium
CN114268451A (en) * 2021-11-15 2022-04-01 中国南方电网有限责任公司 Method, device, equipment and medium for constructing power monitoring network security buffer area
CN114268451B (en) * 2021-11-15 2024-04-16 中国南方电网有限责任公司 Method, device, equipment and medium for constructing safety buffer zone of power monitoring network
WO2024031930A1 (en) * 2022-08-12 2024-02-15 苏州元脑智能科技有限公司 Error log detection method and apparatus, and electronic device and storage medium
CN115204322A (en) * 2022-09-16 2022-10-18 成都新希望金融信息有限公司 Behavioral link abnormity identification method and device

Also Published As

Publication number Publication date
CN111538642B (en) 2020-10-02

Similar Documents

Publication Publication Date Title
CN111538642B (en) Abnormal behavior detection method and device, electronic equipment and storage medium
CN110321371B (en) Log data anomaly detection method, device, terminal and medium
Bodik et al. Fingerprinting the datacenter: automated classification of performance crises
He et al. An evaluation study on log parsing and its use in log mining
CN112149757B (en) Abnormity detection method and device, electronic equipment and storage medium
US9678822B2 (en) Real-time categorization of log events
US10600002B2 (en) Machine learning techniques for providing enriched root causes based on machine-generated data
US20190095417A1 (en) Content aware heterogeneous log pattern comparative analysis engine
WO2017113677A1 (en) User behavior data processing method and system
CN109165691B (en) Training method and device for model for identifying cheating users and electronic equipment
CN110083475B (en) Abnormal data detection method and device
US20160255109A1 (en) Detection method and apparatus
CN106201886A (en) The Proxy Method of the checking of a kind of real time data task and device
US10613525B1 (en) Automated health assessment and outage prediction system
US20200380117A1 (en) Aggregating anomaly scores from anomaly detectors
CN112596964B (en) Disk fault prediction method and device
CN113535454A (en) Method and device for detecting log data abnormity
CN114662602A (en) Outlier detection method and device, electronic equipment and storage medium
US10637878B2 (en) Multi-dimensional data samples representing anomalous entities
CN117094184B (en) Modeling method, system and medium of risk prediction model based on intranet platform
US11265232B2 (en) IoT stream data quality measurement indicator and profiling method and system therefor
CN117370548A (en) User behavior risk identification method, device, electronic equipment and medium
CN117170915A (en) Data center equipment fault prediction method and device and computer equipment
Pan et al. An anomaly detection method for system logs using Venn-Abers predictors
CN115793990A (en) Memory health state determination method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant