CN112131199A

CN112131199A - Log processing method, device, equipment and medium

Info

Publication number: CN112131199A
Application number: CN202011023270.6A
Authority: CN
Inventors: 张欢; 范渊; 刘博�
Original assignee: Hangzhou Dbappsecurity Technology Co Ltd
Current assignee: DBAPPSecurity Co Ltd; Hangzhou Dbappsecurity Technology Co Ltd
Priority date: 2020-09-25
Filing date: 2020-09-25
Publication date: 2020-12-25

Abstract

The application discloses a log processing method, a device, equipment and a medium, wherein the method comprises the following steps: acquiring a log to be classified; extracting characteristic items of each log in the logs to be classified to obtain a log characteristic item set corresponding to each log in the logs to be classified; determining log vectors corresponding to all logs in the logs to be classified based on the log feature item sets corresponding to all logs in the logs to be classified; and classifying log vectors corresponding to all logs in the logs to be classified by utilizing an ant colony clustering algorithm so as to classify the logs to be classified. Therefore, the logs can be classified, the accuracy and consistency of classification results are improved, and the applicability is strong.

Description

Log processing method, device, equipment and medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for processing a log.

Background

Clustering analysis is an important branch of data mining that groups data objects into multiple classes or clusters, such that objects in the same cluster are highly similar and objects in different clusters differ substantially. Existing clustering algorithms fall mainly into four categories: partitioning methods, hierarchical methods, density-based methods and grid-based methods.

The inventors have found that the above prior art may have problems. First, the user is required to provide certain prior clustering information, which makes the clustering result very sensitive to the input parameters and greatly reduces the adaptability of the classification method. Second, the prior art relies on heuristic algorithms, which solve efficiently but easily fall into local optima, so the accuracy and consistency of the clustering result are difficult to guarantee.

Disclosure of Invention

In view of this, an object of the present application is to provide a log processing method, apparatus, device, and medium, which can classify logs, improve accuracy and consistency of classification results, and have strong applicability. The specific scheme is as follows:

In a first aspect, the present application discloses a log processing method, including:

acquiring a log to be classified;

extracting characteristic items of each log in the logs to be classified to obtain a log characteristic item set corresponding to each log in the logs to be classified;

determining log vectors corresponding to all logs in the logs to be classified based on the log feature item sets corresponding to all logs in the logs to be classified;

and classifying log vectors corresponding to all logs in the logs to be classified by utilizing an ant colony clustering algorithm so as to classify the logs to be classified.

Optionally, after the obtaining the log to be classified, the method further includes:

and acquiring log classification parameters corresponding to the logs to be classified.

Optionally, the classifying the log vectors corresponding to the logs to be classified by using the ant colony clustering algorithm includes:

A01: determining initial clustering centers from the log vectors, and determining the pheromones from each log vector to be classified, that is, each log vector other than the initial clustering centers, to each initial clustering center;

A02: dividing each log vector to be classified into the class corresponding to an initial clustering center based on the pheromones;

A03: determining the sum of the distances from each log vector to the other log vectors;

A04: updating the initial clustering centers based on the distance sums, and updating the pheromones;

and executing step A02 again until the updated clustering centers are the same as the clustering centers before updating, or the current iteration number equals the preset maximum iteration number, whereupon classification of the log vectors is finished.

Optionally, the determining pheromones from the log vectors to be classified to the initial clustering centers except the initial clustering centers in the log vectors includes:

determining pheromones from log vectors to be classified to all the initial clustering centers except the initial clustering centers in the log vectors based on a first operation formula, wherein the first operation formula is as follows:

wherein tau_ij represents the pheromone from the i-th log vector to be classified to the j-th initial clustering center, d_ij represents the Euclidean distance from the i-th log vector to be classified to the j-th initial clustering center, and r represents the preset cluster center radius.

Optionally, the dividing, based on the pheromone, each log vector to be classified into a class corresponding to the initial clustering center includes:

determining the probability of dividing each log vector to be classified into the class corresponding to each initial clustering center based on the pheromone and a second operation formula;

and dividing the log vectors to be classified into classes corresponding to the initial clustering centers according to the probabilities, wherein the second operation formula is as follows:

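wherein P_ij represents the probability of dividing the i-th log vector to be classified into the class corresponding to the j-th initial clustering center, alpha and beta are both preset adjusting factors, and S represents the set of log vectors to be classified whose Euclidean distance to the j-th initial clustering center is less than or equal to the preset cluster center radius.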

Optionally, the determining a sum of distances from each log vector to other log vectors except for the log vector comprises:

determining the distance sum of each log vector to other log vectors except the log vector based on a third operation formula, wherein the third operation formula is as follows:

wherein L_m represents the sum of the distances from the m-th log vector to the other log vectors, x_m represents the m-th log vector, x_mp is the p-th value of the m-th log vector, N represents the total number of log vectors, c_mp is the p-th value of a transition vector c_m, and ||x_m-c_m||² represents the square of the modulus of the difference between the log vector x_m and the transition vector c_m.

Optionally, the updating the initial clustering center based on the distance sum and the updating the pheromone includes:

determining the log vector corresponding to the minimum distance sum as a new clustering center, and updating the initial clustering center by using the new clustering center;

updating the pheromone by using a fourth operation formula, wherein the fourth operation formula is as follows:

wherein tau_ij' represents the updated pheromone from the i-th log vector to be classified to the j-th clustering center, tau_ij represents the pheromone from the i-th log vector to be classified to the j-th clustering center before updating, d_ij represents the Euclidean distance from the i-th log vector to be classified to the j-th clustering center, rho represents the pheromone volatility, and Q represents the preset total amount of pheromone.
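The fourth operation formula itself is not reproduced in this text. As a rough illustration of how the listed quantities (the old pheromone, the distance d_ij, the volatility rho and the total amount Q) might interact, the following minimal Python sketch assumes the evaporate-and-deposit form commonly used in ant colony algorithms. The formula, the function name `update_pheromone` and the `eps` guard are all assumptions of this sketch, not taken from the application.

```python
def update_pheromone(tau, distance, rho, q, eps=1e-9):
    """Evaporate the old pheromone, then deposit an amount that shrinks with distance.

    ASSUMPTION: tau' = (1 - rho) * tau + q / distance, a common ant colony
    form; it is not confirmed by the source text. `eps` is a hypothetical
    guard against a zero distance (identical binary log vectors).
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return (1.0 - rho) * tau + q / max(distance, eps)


# Half of the old pheromone evaporates and 1.0 / 2.0 is deposited.
print(update_pheromone(tau=1.0, distance=2.0, rho=0.5, q=1.0))  # 1.0
```

Under this assumed form, a larger rho forgets earlier assignments faster, while a smaller distance deposits more pheromone and so strengthens the link between a log vector and a nearby clustering center.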

In a second aspect, the present application discloses a log processing apparatus, including:

the data acquisition module is used for acquiring the logs to be classified;

the characteristic item extraction module is used for extracting characteristic items of all the logs to be classified to obtain a log characteristic item set corresponding to all the logs to be classified;

the log vector determining module is used for determining log vectors corresponding to all logs in the logs to be classified based on the log feature item sets corresponding to all logs in the logs to be classified;

and the log classification module is used for classifying the log vectors corresponding to the logs to be classified by utilizing an ant colony clustering algorithm so as to classify the logs to be classified.

In a third aspect, the present application discloses an electronic device, comprising:

a memory and a processor;

wherein the memory is used for storing a computer program;

the processor is configured to execute the computer program to implement the log processing method disclosed in the foregoing.

In a fourth aspect, the present application discloses a computer readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the log processing method disclosed above.

As can be seen, the present application first acquires the logs to be classified, extracts feature items from each of them to obtain a log feature item set corresponding to each log, then determines the log vector corresponding to each log based on its log feature item set, and finally classifies the log vectors by using an ant colony clustering algorithm so as to classify the logs to be classified. In other words, the acquired logs are converted into log vectors through feature extraction and related processing, and those log vectors are then classified with the ant colony clustering algorithm. In addition, the ant colony clustering algorithm is simple in structure and operation and easy to implement.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flow chart of a log processing method disclosed in the present application;

FIG. 2 is a partial flow diagram of a particular log processing method disclosed herein;

FIG. 3 is a flowchart of a specific log processing method disclosed in the present application;

FIG. 4 is a schematic diagram of a log processing apparatus according to the disclosure;

FIG. 5 is a schematic structural diagram of an electronic device disclosed in the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Referring to fig. 1, an embodiment of the present application discloses a log processing method, including:

Step S11: acquiring logs to be classified.

In a specific implementation, the logs to be classified must be obtained first. The logs to be classified comprise a plurality of logs and may be obtained, for example, from the access log library of a website; the specific way of obtaining them is not limited herein.

After the logs to be classified are obtained, the log classification parameters corresponding to them are also obtained; these parameters include, but are not limited to, a preset cluster center radius.

Step S12: extracting feature items from each log in the logs to be classified to obtain a log feature item set corresponding to each log in the logs to be classified.

After the log to be classified is obtained, the log to be classified needs to be correspondingly processed, so that log vectors corresponding to all logs in the log to be classified can be obtained, and corresponding equipment can conveniently perform classification processing.

Specifically, feature item extraction needs to be performed on each log in the logs to be classified to obtain a log feature item set corresponding to each log in the logs to be classified, wherein the feature items needing to be extracted from each log include, but are not limited to, excessive outbound traffic, excessive inbound traffic, off-hours VPN login, firewall acceptance, firewall rejection, login from outside an internal network, multiple continuous failed logins, at least one successful login, multiple target IPs probed from a single source, multiple target IPs and ports probed from a single source.

Step S13: determining the log vector corresponding to each log in the logs to be classified based on the log feature item set corresponding to each log in the logs to be classified.

After the log feature item sets corresponding to the logs in the logs to be classified are obtained, the log vectors corresponding to the logs in the logs to be classified can be determined based on the log feature item sets corresponding to the logs in the logs to be classified.

Specifically, for a current feature item corresponding to any log, if the feature item is extracted, a value corresponding to the feature item is represented as 1, and if the feature item is not extracted, a value corresponding to the feature item is represented as 0, so that a log vector corresponding to the log is obtained.

For example, suppose that excessive outbound traffic, excessive inbound traffic, off-hours VPN login, firewall acceptance and firewall rejection are extracted from log A, while login from outside the internal network, multiple consecutive failed logins, at least one successful login, a single source probing multiple target IPs, and a single source probing multiple target IPs and ports are not. The log vector corresponding to log A is then represented as (1, 1, 1, 1, 1, 0, 0, 0, 0, 0).
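The encoding rule can be sketched in a few lines of Python. This is only an illustration: the identifier names in `FEATURE_ITEMS` and the helper `to_log_vector` are hypothetical labels chosen for this sketch, and the order simply follows the order in which the feature items are listed in step S12.

```python
# Feature items in the order listed in the description; the identifier
# names are hypothetical labels chosen for this sketch.
FEATURE_ITEMS = [
    "excessive_outbound_traffic",
    "excessive_inbound_traffic",
    "off_hours_vpn_login",
    "firewall_accept",
    "firewall_reject",
    "login_from_outside_internal_network",
    "multiple_consecutive_failed_logins",
    "at_least_one_successful_login",
    "single_source_probing_multiple_target_ips",
    "single_source_probing_multiple_target_ips_and_ports",
]


def to_log_vector(extracted_items):
    """Return a tuple with 1 for each extracted feature item and 0 otherwise."""
    extracted = set(extracted_items)
    unknown = extracted - set(FEATURE_ITEMS)
    if unknown:
        raise ValueError(f"unknown feature items: {sorted(unknown)}")
    return tuple(1 if item in extracted else 0 for item in FEATURE_ITEMS)


# Log A from the example: the first five feature items were extracted.
log_a = to_log_vector(FEATURE_ITEMS[:5])
print(log_a)  # (1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
```

Fixing the order of the feature items once keeps every log vector the same length and makes position p mean the same thing in every vector, which the distance calculations in the later steps rely on.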

Step S14: classifying the log vectors corresponding to the logs in the logs to be classified by using an ant colony clustering algorithm, so as to classify the logs to be classified.

After the log vectors corresponding to the logs in the logs to be classified are obtained, the ant colony clustering algorithm can be used for classifying the log vectors corresponding to the logs in the logs to be classified so as to classify the logs to be classified.

The foraging process of ants can be divided into two stages: searching for food and carrying food. While moving, each ant releases pheromone on the path it travels and can sense pheromone and its intensity. The more ants pass along a path, the stronger its pheromone becomes, while the pheromone itself evaporates over time. Ants tend to move in the direction of higher pheromone intensity, so the more ants travel a given path, the more likely later ants are to choose it; the colony as a whole thus exhibits positive information feedback. The basic idea of the ant colony clustering algorithm is to regard data items as ants with different attributes and clustering centers as the 'food sources' the ants search for, so that data clustering can be viewed as the process of ants finding food sources.

Referring to fig. 2, classifying the log vectors corresponding to each log in the logs to be classified by using an ant colony clustering algorithm may specifically include:

Specifically, to classify the log vectors corresponding to the logs to be classified with the ant colony clustering algorithm, initialization is needed first: a certain number of initial clustering centers are determined randomly from the log vectors, and the pheromones from each log vector to be classified (that is, each log vector other than the initial clustering centers) to each initial clustering center are then determined based on a first operation formula, wherein the first operation formula is as follows:

Then, based on the pheromone, dividing each log vector to be classified into a class corresponding to the initial clustering center, specifically, determining a probability of dividing each log vector to be classified into a class corresponding to each initial clustering center based on the pheromone and a second operation formula, and then dividing each log vector to be classified into a class corresponding to the initial clustering center according to the probability, wherein the second operation formula is as follows:

wherein P_ij represents the probability of dividing the i-th log vector to be classified into the class corresponding to the j-th initial clustering center, alpha and beta are both preset adjusting factors, and S represents the set of log vectors to be classified whose Euclidean distance to the j-th initial clustering center is less than or equal to the preset cluster center radius. That is, the probability of dividing each log vector to be classified into the class corresponding to each initial clustering center is determined based on the pheromone and the second operation formula, and the current log vector to be classified is then divided into the class of the initial clustering center with the maximum probability.

In practice, alpha and beta may be set to 0.9 and 0.01 respectively. These factors help prevent the search stagnation that arises when all ants follow the same path to the same result, which would merely reproduce the idea of a classical greedy algorithm.

For example, suppose the initial clustering centers include log vector A and log vector B, and the log vectors to be classified include log vector C. If the probability of dividing log vector C into the class corresponding to log vector A is 0.7 and the probability of dividing it into the class corresponding to log vector B is 0.3, then log vector C is divided into the class corresponding to log vector A.
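The selection rule in this example, where each log vector goes to the clustering center with the largest probability, can be sketched as follows. The function name `assign_to_most_probable_center` and the dictionary layout are hypothetical; the probabilities are taken as given inputs rather than computed, since the second operation formula is not reproduced in this text.

```python
def assign_to_most_probable_center(probabilities):
    """Map each log vector name to the center with the highest probability.

    `probabilities` maps a log vector name to a dict of
    {center name: probability}; the values are assumed to have been
    computed already from the pheromones.
    """
    return {
        vector: max(by_center, key=by_center.get)
        for vector, by_center in probabilities.items()
    }


# Log vector C from the example: 0.7 for center A, 0.3 for center B.
assignment = assign_to_most_probable_center({"C": {"A": 0.7, "B": 0.3}})
print(assignment)  # {'C': 'A'}
```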

When all log vectors to be classified have been divided into the classes of their corresponding initial clustering centers, the first round of clustering is complete, and the sum of the distances from each log vector to the other log vectors then needs to be determined.

Specifically, the sum of the distances from each log vector to other log vectors except the log vector is determined based on a third operation formula, where the third operation formula is:

wherein L_m represents the sum of the distances from the m-th log vector to the other log vectors, x_m represents the m-th log vector, x_mp is the p-th value of the m-th log vector, N represents the total number of log vectors, c_mp is the p-th value of a transition vector c_m, and ||x_m-c_m||² represents the square of the modulus of the difference between the log vector x_m and the transition vector c_m. In the third operation formula, c_m is an intermediate transition vector.

The initial clustering center then needs to be updated based on the distance sums, and the pheromone needs to be updated. Specifically, the log vector corresponding to the minimum distance sum is determined as a new clustering center, and the initial clustering center is updated by using the new clustering center; the pheromone is then updated by using a fourth operation formula, wherein the fourth operation formula is as follows:
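A minimal Python sketch of the distance sum follows. Because the third operation formula is not reproduced in this text, the sketch assumes plain Euclidean distance between log vectors, which is consistent with the squared-modulus term described above but is not confirmed by it. The function name `distance_sums` is hypothetical.

```python
import math


def distance_sums(vectors):
    """For each vector, sum its Euclidean distance to every other vector.

    ASSUMPTION: Euclidean distance; the exact formula is not in the text.
    Vectors are skipped by index, so duplicate vectors are still counted.
    """
    sums = []
    for m, x_m in enumerate(vectors):
        total = 0.0
        for n, c in enumerate(vectors):
            if n != m:
                total += math.dist(x_m, c)
        sums.append(total)
    return sums


vectors = [(0, 0), (0, 1), (0, 3)]
print(distance_sums(vectors))  # [4.0, 3.0, 5.0]
```

In this toy input the middle vector has the smallest distance sum, so it is the one that best represents the group as a whole.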

After the new clustering center is determined, it can be judged whether the new clustering center is the same as the clustering center before the update. If so, the clustering center has stabilized and classification is finished. If not, it is judged whether the current iteration number is less than the preset maximum iteration number: if so, step A02 is executed again; if not, classification is finished. That is, step A02 is executed again until the updated clustering center is the same as the clustering center before updating, or the current iteration number equals the preset maximum iteration number, at which point classification of the log vectors is complete.

Referring to FIG. 3, a specific log processing method is shown. First, the logs to be classified are input and the related classification parameters are initialized. The preset maximum iteration number is then decremented by 1. Each log to be classified other than the clustering centers is treated as an ant, and the state transition probability of ant i is calculated, that is, the probability of dividing the ant into the class corresponding to each clustering center; the ant is then divided into the class (cluster) of the corresponding clustering center according to that probability. It is then judged whether every ant has been divided into a cluster. If so, the clustering centers are recalculated, the pheromones from each ant to the clustering centers are updated, and it is judged whether the termination condition is met. If it is met, the final solution is output; if not, the steps of calculating the state transition probability of ant i and dividing the ant into the class of the corresponding clustering center according to the probability are repeated.
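The two stopping conditions can be sketched as a small driver loop. This is a sketch under stated assumptions: `iterate_until_stable` and `update_centers` are hypothetical names, and `update_centers` stands in for one full round of steps A02 to A04, which is passed in as a callable rather than implemented here.

```python
def iterate_until_stable(initial_centers, update_centers, max_iterations):
    """Repeat the clustering round until the centers stop changing
    or the iteration budget is used up.

    `update_centers` is a hypothetical callable standing in for steps
    A02 to A04: it takes the current centers and returns the updated ones.
    Centers are compared with ==, so they should use a consistent order.
    """
    centers = initial_centers
    iteration = 0
    while iteration < max_iterations:
        iteration += 1
        new_centers = update_centers(centers)
        if new_centers == centers:
            break  # centers are stable: classification is finished
        centers = new_centers
    return centers, iteration


# Toy stand-in for one round: each center value moves toward 3 and stays there.
def toy_update(centers):
    return tuple(min(c + 1, 3) for c in centers)


print(iterate_until_stable((0,), toy_update, max_iterations=10))  # ((3,), 4)
print(iterate_until_stable((0,), toy_update, max_iterations=2))   # ((2,), 2)
```

The first call stops because the centers no longer change; the second stops because the iteration count reaches the preset maximum, which bounds the running time even when the centers keep moving.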

Referring to fig. 4, an embodiment of the present application discloses a log processing apparatus, including:

the data acquisition module 11 is used for acquiring logs to be classified;

a feature item extraction module 12, configured to perform feature item extraction on each log in the logs to be classified to obtain a log feature item set corresponding to each log in the logs to be classified;

a log vector determining module 13, configured to determine, based on the log feature item set corresponding to each log in the logs to be classified, a log vector corresponding to each log in the logs to be classified;

the log classifying module 14 is configured to classify the log vectors corresponding to the logs to be classified by using an ant colony clustering algorithm, so as to classify the logs to be classified.

Specifically, the data obtaining module 11 is further configured to:

Further, the log classification module 14 is configured to:

Further, the log classification module 14 is configured to:

wherein P_ij represents the probability of dividing the i-th log vector to be classified into the class corresponding to the j-th initial clustering center, alpha and beta are both preset adjusting factors, and S represents the set of log vectors to be classified whose Euclidean distance to the j-th initial clustering center is less than or equal to the preset cluster center radius.

Further, the log classification module 14 is configured to:

Further, the log classification module 14 is configured to:

wherein tau_ij' represents the updated pheromone from the i-th log vector to be classified to the j-th clustering center, tau_ij represents the pheromone from the i-th log vector to be classified to the j-th clustering center before updating, d_ij represents the Euclidean distance from the i-th log vector to be classified to the j-th clustering center, rho represents the pheromone volatility, and Q represents the preset total amount of pheromone.

Referring to fig. 5, a schematic structural diagram of an electronic device 20 provided in the embodiment of the present application is shown, where the electronic device 20 may specifically include, but is not limited to, a notebook computer, a desktop computer, a server, or the like.

In general, the electronic device 20 in the present embodiment includes: a processor 21 and a memory 22.

The processor 21 may include one or more processing cores, such as a four-core processor, an eight-core processor, and so on. The processor 21 may be implemented by at least one hardware of a DSP (digital signal processing), an FPGA (field-programmable gate array), and a PLA (programmable logic array). The processor 21 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 21 may be integrated with a GPU (graphics processing unit) which is responsible for rendering and drawing images to be displayed on the display screen. In some embodiments, the processor 21 may include an AI (artificial intelligence) processor for processing computing operations related to machine learning.

Memory 22 may include one or more computer-readable storage media, which may be non-transitory. Memory 22 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 22 is at least used for storing the following computer program 221, wherein after being loaded and executed by the processor 21, the steps of the log processing method disclosed in any one of the foregoing embodiments can be implemented.

In some embodiments, the electronic device 20 may further include a display 23, an input/output interface 24, a communication interface 25, a sensor 26, a power supply 27, and a communication bus 28.

Those skilled in the art will appreciate that the configuration shown in FIG. 5 is not limiting of electronic device 20 and may include more or fewer components than those shown.

Further, an embodiment of the present application also discloses a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the log processing method disclosed in any of the foregoing embodiments.

For the specific process of the log processing method, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.

The embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be understood by reference to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief; for the relevant details, refer to the description of the method.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Finally, it is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of other elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The log processing method, device, equipment and medium provided by the present application are introduced in detail, and a specific example is applied in the present application to explain the principle and the implementation of the present application, and the description of the above embodiment is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A log processing method, comprising:

acquiring a log to be classified;

2. The log processing method according to claim 1, wherein after the obtaining of the log to be classified, the method further comprises:

3. The log processing method according to claim 1 or 2, wherein the classifying the log vectors corresponding to the logs to be classified by using an ant colony clustering algorithm comprises:

4. The log processing method according to claim 3, wherein the determining pheromones from the log vectors to be classified to each of the initial clustering centers except the initial clustering center in the log vectors comprises:

5. The log processing method according to claim 4, wherein the classifying the log vectors to be classified into the classes corresponding to the initial clustering centers based on the pheromone comprises:

6. The log processing method according to claim 5, wherein said determining a sum of distances of each of the log vectors to other log vectors except for itself comprises:

c_m＝(c_m1,c_m2,c_m3,···c_mp),

7. The log processing method of claim 6, wherein the updating the initial cluster center and the updating the pheromone based on the distance sum comprises:

8. A log processing apparatus, comprising:

the data acquisition module is used for acquiring the logs to be classified;

9. An electronic device, comprising:

a memory and a processor;

wherein the memory is used for storing a computer program;

the processor is configured to execute the computer program to implement the log processing method according to any one of claims 1 to 7.

10. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the log processing method according to any one of claims 1 to 7.