CN112800016A

CN112800016A - Log data classification and sorting method and device

Info

Publication number: CN112800016A
Application number: CN202011639754.3A
Authority: CN
Inventors: 黄伟
Original assignee: Wuhan Sipuling Technology Co Ltd
Current assignee: Wuhan Sipuleng Technology Co Ltd; Wuhan Sipuling Technology Co Ltd
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-05-14

Abstract

The invention relates to a log data classification and sorting method, which comprises the following steps: querying a log database based on specified keywords, and grouping and deduplicating the log data in the log database to obtain a log data set; customizing a classification type, and setting the type attributes of the classification type and the keywords associated with it; and classifying and sorting the log data set based on the classification type to obtain a result set. The invention enables user-defined classification and sorting of log data; the process does not depend on manual work, is efficient, and is less prone to errors.

Description

Log data classification and sorting method and device

Technical Field

The invention relates to the technical field of log data management, in particular to a log data classification and sorting method, a log data classification and sorting device and a computer storage medium.

Background

Network devices, systems, service programs and the like all generate log event records, i.e. logs, in which each entry records a description of a relevant operation, such as the date, time, user and action. In a more complex network environment, the log data generated during operation by network devices (such as firewalls, servers, routers, gateways, switches, repeaters, bridges and the like) is massive and disordered, and even after it has been parsed and stored it needs to be combed, classified and sorted again. Classification means grouping by category, grade or property; sorting is the process of arranging disordered data elements in order of a key according to a given sorting rule.

Logs are usually classified and sorted after they have been parsed. Alternatively, the categories and dimensions of the logs can be constrained by extracting specified keywords during parsing, so that the logs are classified manually. Both approaches rely on manual configuration or manual classification to determine the target logs to be analyzed. Manually specifying the logs to be analyzed has low fault tolerance: if the specified target logs differ from the logs that actually need to be analyzed, all later services deviate from their goals. The manual process is also cumbersome, inefficient and error-prone.

The problem to be solved is therefore how to quickly parse and store the log data generated during operation by different types of network devices from various manufacturers, and on that basis perform convenient and fast custom classification and comprehensive sorting, so that the logs have definite attributes (classification name, classification type, danger level, sequence number and the like) and the boundaries of the services corresponding to the log data are more accurate and clear.

Disclosure of Invention

In view of the above, a log data classification and sorting method and device are needed to solve the problems that the classification and sorting of log data currently depends on manual work, is inefficient, and is prone to errors.

The invention provides a log data classification and sorting method, which comprises the following steps:

inquiring a log database based on the specified keywords, and performing grouping duplicate removal on log data in the log database to obtain a log data set;

customizing a classification type, and setting type attributes of the classification type and keywords associated with the classification type;

and classifying and sequencing the log data set based on the classification type to obtain a result set.

Further, querying a log database based on the specified keyword, and performing grouping deduplication on log data in the log database to obtain a log data set, specifically:

setting different data statistical durations and corresponding statistical moments;

inquiring an original log data table in the log database at each statistical time, extracting log data in the original log data table within corresponding statistical duration, and grouping and de-duplicating the log data based on specified keywords to obtain a log data table of a corresponding time period; and combining the log data tables of all the time periods to obtain the log data set.

Further, the specified keyword is a log keyword or a destination ip.

Further, grouping and duplicate removal are performed on the log data based on the specified keywords, so as to obtain a log data table of a corresponding time period, specifically:

dividing log data with the same value of the specified keywords into a group;

and carrying out duplicate removal on the same log data in the same group to obtain a log data table of the corresponding time period.

Further, setting a type attribute of the classification type, specifically:

and setting the name, security level and sequence number of the classification type.

Further, based on the classification type, the log data set is subjected to classification sorting to obtain a result set, which specifically comprises:

screening the log data set according to the keywords associated with the classification type to obtain log data associated with the classification type;

and sorting the associated log data based on the sorting mode of the classification type to obtain the result set.

Further, the method also comprises the following steps: and warehousing and storing the result set, and analyzing data based on the stored result set.

Further, data analysis is performed based on the stored result set, specifically:

performing reverse statistical analysis on the log data based on the result set to obtain corresponding indexes;

and displaying and viewing the result set from the perspective of the classification type.

The invention also provides a log data classification and sorting device, which comprises a processor and a memory, wherein the memory is stored with a computer program, and the computer program is executed by the processor to realize the log data classification and sorting method.

The invention also provides a computer storage medium, on which a computer program is stored, which, when executed by a processor, implements the log data sorting method.

Beneficial effects: the invention first queries a log database to obtain a log data set that has been grouped and deduplicated based on the specified keywords, which realizes aggregate statistics of the log data based on those keywords. The aggregated log data goes a first step toward solving the problem of disordered data, and the grouping by specified keyword provides the basis on which the subsequent classification and sorting can proceed smoothly. Based on the grouped and deduplicated data, the additional classification and sorting information is set and a mapping between classification types and keywords is established, so that the log data can subsequently be classified by keyword. Finally, the log data is classified and sorted based on the user-defined classification to obtain a result set. The invention can quickly, conveniently and efficiently extract the keywords to be classified and perform custom classification and comprehensive sorting on the log data, so that the log data is no longer cluttered or out of order, which improves the accuracy of later aggregate business analysis.

Drawings

FIG. 1 is a flowchart of a first embodiment of the log data classification and sorting method provided by the present invention;

FIG. 2a is a schematic diagram of the aggregate statistics flow based on log keywords in the first embodiment of the log data classification and sorting method provided by the present invention;

FIG. 2b is a schematic diagram of the aggregate statistics result based on log keywords in the first embodiment of the log data classification and sorting method provided by the present invention;

FIG. 3a is a schematic diagram of the aggregate statistics flow based on the destination ip in the first embodiment of the log data classification and sorting method provided by the present invention;

FIG. 3b is a schematic diagram of the aggregate statistics result based on the destination ip in the log data classification and sorting method provided by the present invention;

FIG. 4a is a schematic diagram of the classification and sorting flow for a custom classification type based on log keywords in the first embodiment of the log data classification and sorting method provided by the present invention;

FIG. 4b is a schematic diagram of the classification and sorting result for a custom classification type based on log keywords in the first embodiment of the log data classification and sorting method provided by the present invention;

FIG. 5 is a flowchart of the data analysis process in the first embodiment of the log data classification and sorting method provided by the present invention;

FIG. 6 is an overall detailed flowchart of the first embodiment of the log data classification and sorting method provided by the present invention.

Detailed Description

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.

Example 1

As shown in fig. 1, embodiment 1 of the present invention provides a log data sorting method, including the following steps:

S1, querying a log database based on the specified keywords, and grouping and deduplicating the log data in the log database to obtain a log data set;

S2, customizing a classification type, and setting the type attributes of the classification type and the keywords associated with the classification type;

S3, classifying and sorting the log data set based on the classification type to obtain a result set.

In this embodiment, the log database ClickHouse is queried first. The log database stores an original log data table, which contains log data that has already been parsed and stored for further analysis. Querying ClickHouse yields a log data set that has been grouped and deduplicated based on the specified keywords, which realizes aggregate statistics of the log data based on those keywords. The aggregated log data goes a first step toward solving the problem of disordered data, and the grouping by specified keyword provides the basis on which the subsequent classification and sorting can proceed smoothly. Based on the grouped and deduplicated data, the classification and its additional information are then set: the classification type is user-defined, its type attributes are set, and the keywords associated with it are set, which establishes a mapping between classification types and keywords so that the log data can subsequently be classified by keyword. The sorting mode is set at the same time, for example from high to low, from large to small, or in time order, to facilitate sorting after classification. Once the additional classification and sorting information has been set, the log data can be classified and sorted to obtain a result set. The resulting set has more definite attributes (classification name, classification type, danger level, sequence number and the like), and the results of data analysis based on the classified and sorted log data are more accurate and more valuable as reference and guidance.

The invention can quickly, conveniently and efficiently extract the keywords to be classified and perform custom classification and comprehensive sorting on the log data, so that the log data is no longer cluttered or out of order, which improves the accuracy of later aggregate business analysis.

Preferably, the log database is queried based on the specified keyword, and the log data in the log database is subjected to grouping deduplication to obtain a log data set, specifically:

This embodiment sets two statistical durations: hourly and daily. The hourly statistical moment is 5 minutes past each hour, and the daily statistical moment is 00:05 each day.

The hourly log data is aggregated to obtain a log data hour table, specifically: a timed data aggregation job is set up whose trigger time jobhour is 5 minutes past each hour; it executes an SQL statement (the specified keyword is determined by a predefined field of the SQL statement) that aggregates, deduplicates and counts the log data of the previous hour in the original log data table, and inserts the result into the log data hour table in the ClickHouse database. For example, at 3:05 the log data from 2:00 to 3:00 is aggregated.

The log data in the hour table is aggregated to obtain a log data day table, specifically: a timed data aggregation job is set up whose trigger time jobday is 00:05 each day; it executes an SQL statement (the specified keyword is determined by a predefined field of the SQL statement) that aggregates, deduplicates and counts the log data of the previous day in the log data hour table, and inserts the result into the log data day table in the ClickHouse database. For example, at 00:05 on the 24th the log data from the 23rd to the 24th is aggregated.

The log data hour table and the log data day table are combined to obtain the log data set.
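The time windows covered by the two timed jobs can be sketched as follows. This is a minimal illustration that assumes Python's standard `datetime` module; the function names `hourly_window` and `daily_window` are hypothetical and do not come from the original disclosure. Given the moment a job fires, each function returns the start and end of the interval whose log data would be aggregated.

```python
from datetime import datetime, timedelta
from typing import Tuple


def hourly_window(run_time: datetime) -> Tuple[datetime, datetime]:
    """Return (start, end) of the previous full hour for a job fired at run_time."""
    end = run_time.replace(minute=0, second=0, microsecond=0)
    return end - timedelta(hours=1), end


def daily_window(run_time: datetime) -> Tuple[datetime, datetime]:
    """Return (start, end) of the previous full day for a job fired at run_time."""
    end = run_time.replace(hour=0, minute=0, second=0, microsecond=0)
    return end - timedelta(days=1), end


# A job fired at 03:05 covers 02:00 to 03:00.
hour_start, hour_end = hourly_window(datetime(2020, 12, 24, 3, 5))
# A job fired at 00:05 on the 24th covers the whole of the 23rd.
day_start, day_end = daily_window(datetime(2020, 12, 24, 0, 5))
```

Deriving the window from the trigger time, rather than hard-coding it, keeps the 5-minute offset from leaking into the queried range.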

Preferably, the specified keyword is a log keyword or a destination ip.

The specified keywords can be set as required according to the subsequent classification needs. This embodiment describes two specified keywords as examples.

FIG. 2a shows the flow of obtaining the log data set by grouping, deduplicating and aggregating the original log data table based on the log keyword. The hourly job jobhour for log keywords runs at 5 minutes past each hour; it executes an SQL statement (the log keyword is determined by a predefined field of the SQL statement) that aggregates, deduplicates and counts the log keyword data of the previous hour in the original log data table, and inserts the result into the log keyword hour table in the ClickHouse database. For example, at 3:05 the data from 2:00 to 3:00 is aggregated. The daily job jobday for log keywords runs at 00:05 each day; it executes an SQL statement (the log keyword is determined by a predefined field of the SQL statement) that aggregates, deduplicates and counts the log keyword data of the previous day in the log keyword hour table, and inserts the result into the log keyword day table in the ClickHouse database. For example, at 00:05 on the 24th the log data from the 23rd to the 24th is aggregated.

The log data page obtained by aggregate statistics based on the log keyword is shown in FIG. 2b. As can be seen from FIG. 2b, the attack type is selected as the specified field for grouping and deduplication. FIG. 2b shows 10 different specified keywords (HTTP_IIS_directory list, HTTP/ect/name_access, HTTP sensitive information scanning, XSS_url attack, malicious scanning, HTTP protocol violation, HTTP script embedding attack, FINGER_root user query, HTTP_/etc/access); the data is grouped and deduplicated by these keywords, and the statistical items total attack count, high-level total attack count, medium-level total attack count and low-level total attack count are then aggregated.
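The per-keyword statistics described for FIG. 2b can be sketched in a few lines. This is only an illustration under stated assumptions: the sample records, the three severity labels and the function name `aggregate_by_keyword` are hypothetical, and the embodiment itself performs this aggregation with SQL in ClickHouse rather than in application code.

```python
from collections import defaultdict

# Hypothetical parsed log records: (attack type, severity level).
records = [
    ("XSS_url attack", "high"),
    ("XSS_url attack", "high"),
    ("malicious scanning", "low"),
    ("XSS_url attack", "medium"),
]


def aggregate_by_keyword(rows):
    """Group records by attack type and count totals per severity level."""
    stats = defaultdict(lambda: {"total": 0, "high": 0, "medium": 0, "low": 0})
    for keyword, severity in rows:
        stats[keyword]["total"] += 1
        stats[keyword][severity] += 1  # assumes severity is high, medium or low
    return dict(stats)


summary = aggregate_by_keyword(records)
```

Each distinct attack type ends up as one row carrying its total, high-level, medium-level and low-level counts, which mirrors the columns listed above.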

FIG. 3a shows the flow of obtaining the log data set by grouping, deduplicating and aggregating the original log data table based on the destination ip. The hourly job jobhour for the destination ip runs at 5 minutes past each hour; it executes an SQL statement (the destination ip is determined by a predefined field of the SQL statement) that aggregates and deduplicates the destination ip data of the previous hour in the original log data table, and inserts the result into the destination ip aggregation hour table in the ClickHouse database. For example, at 3:05 the data from 2:00 to 3:00 is aggregated. The daily job jobday for the destination ip runs at 00:05 each day; it executes an SQL statement (the destination ip is determined by a predefined field of the SQL statement) that aggregates and deduplicates the destination ip data of the previous day in the destination ip aggregation hour table, and inserts the result into the destination ip aggregation day table in the ClickHouse database. For example, at 00:05 on the 24th the log data from the 23rd to the 24th is aggregated.

The log data page obtained by aggregate statistics based on the destination ip is shown in FIG. 3b. The attack address in FIG. 3b is the destination ip; the data is grouped and deduplicated by attack address, and statistical items such as the asset host ip (the address originating the attack), the log keyword (attack type) and the total attack count are then aggregated. FIG. 3b shows the aggregate result for destination ip 192.168.5.1.
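Grouping by destination ip differs from grouping by log keyword mainly in what is collected per group. The sketch below is a hypothetical illustration: the record layout, the host addresses and the function name `aggregate_by_destination` are assumptions, and only the address 192.168.5.1 is taken from the text.

```python
from collections import defaultdict

# Hypothetical records: (destination ip, asset host ip, attack type).
dest_records = [
    ("192.168.5.1", "10.0.0.7", "malicious scanning"),
    ("192.168.5.1", "10.0.0.7", "XSS_url attack"),
    ("192.168.5.1", "10.0.0.9", "malicious scanning"),
    ("192.168.5.2", "10.0.0.7", "malicious scanning"),
]


def aggregate_by_destination(rows):
    """Group by destination ip; sets deduplicate hosts and attack types."""
    stats = defaultdict(lambda: {"hosts": set(), "keywords": set(), "total": 0})
    for dest_ip, host_ip, keyword in rows:
        entry = stats[dest_ip]
        entry["hosts"].add(host_ip)
        entry["keywords"].add(keyword)
        entry["total"] += 1
    return dict(stats)


by_dest = aggregate_by_destination(dest_records)
```

The keyword set collected per destination ip is the kind of data the later analysis step looks up when it matches a destination ip against the custom classification types.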

Preferably, the log data is subjected to group deduplication based on the specified keyword to obtain a log data table of a corresponding time period, specifically:

dividing log data with the same value of the specified keywords into a group;

and executing the SQL statement (the specified keyword is determined by a predefined field of the SQL statement) to group and deduplicate the log data in the original log data table.
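The grouping and deduplication query can be illustrated with a small runnable sketch. Note the assumptions: Python's built-in `sqlite3` module stands in for ClickHouse so that the example is self-contained, the table `raw_log` and its columns are hypothetical, and the actual SQL statement used in the embodiment is not given in the source and may differ in dialect and detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_log (attack_type TEXT, dest_ip TEXT, src_ip TEXT)")
conn.executemany(
    "INSERT INTO raw_log VALUES (?, ?, ?)",
    [
        ("malicious scanning", "192.168.5.1", "10.0.0.7"),
        ("malicious scanning", "192.168.5.1", "10.0.0.7"),  # exact duplicate
        ("malicious scanning", "192.168.5.2", "10.0.0.7"),
        ("XSS_url attack", "192.168.5.1", "10.0.0.9"),
    ],
)

# Group rows by the specified keyword (here the attack_type column) and
# count both all rows and the distinct rows inside each group.
rows = conn.execute(
    """
    SELECT attack_type,
           COUNT(*) AS total,
           COUNT(DISTINCT dest_ip || '|' || src_ip) AS distinct_rows
    FROM raw_log
    GROUP BY attack_type
    ORDER BY attack_type
    """
).fetchall()
conn.close()
```

The predefined field mentioned in the text corresponds to the column named in `GROUP BY`; swapping `attack_type` for `dest_ip` would give the destination ip variant.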

Preferably, the type attribute of the classification type is set, specifically:

The classification attributes can be set as required. In this embodiment, for example, the attack events in firewall logs are counted, so a security level is set in addition to the name and the sequence number. The security level describes the severity of the custom classification type, so that the security level of each type of log data can be identified visually and quickly during data analysis.

Specifically, FIG. 4b shows a classification type customized in this embodiment: the name is "HTTP intrusion attack", the security level is "high", the sequence number is "6", and the associated keywords include "HTTP_IIS_directory list", "HTTP sensitive information scan", and the like.
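One way to represent such a custom classification type is a small record type. The sketch below is an assumption for illustration: the class name `ClassificationType` and its field names are hypothetical, while the example values are the ones given for FIG. 4b.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClassificationType:
    """A user-defined classification type and its associated log keywords."""
    name: str
    security_level: str
    sequence_number: int
    keywords: List[str] = field(default_factory=list)


http_intrusion = ClassificationType(
    name="HTTP intrusion attack",
    security_level="high",
    sequence_number=6,
    keywords=["HTTP_IIS_directory list", "HTTP sensitive information scan"],
)
```

Keeping the keywords on the type itself is what establishes the mapping between classification types and keywords that the later classification step relies on.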

Preferably, the log data set is sorted based on the classification type to obtain a result set, specifically:

This embodiment again takes the log keyword as the specified keyword to explain the classification and sorting process. Specifically, as shown in FIG. 4a, after the classification type has been user-defined, its type attributes (classification name, security level, sequence number and so on) have been set, and the associated log keywords have been selected, the log keyword hour table or the log keyword day table is queried, the log data containing the log keywords associated with the classification type is screened out, and that log data is stored in the MySQL database. Which table is queried depends on the statistical time range selected by the user: when the selected range is longer than the set duration (one week), the log keyword day table is queried; otherwise the log keyword hour table is queried.
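The choice between the hour table and the day table reduces to a single comparison. In this minimal sketch the table names `log_keyword_hour` and `log_keyword_day` and the function name `choose_table` are hypothetical; only the one-week threshold comes from the text.

```python
from datetime import datetime, timedelta

ONE_WEEK = timedelta(weeks=1)  # the set duration given in the embodiment


def choose_table(start: datetime, end: datetime) -> str:
    """Pick the day table for ranges longer than one week, else the hour table."""
    return "log_keyword_day" if end - start > ONE_WEEK else "log_keyword_hour"


short_range = choose_table(datetime(2020, 12, 1), datetime(2020, 12, 3))
long_range = choose_table(datetime(2020, 12, 1), datetime(2020, 12, 20))
```

A range of exactly one week falls back to the hour table here, because the text says the day table is used only when the range is longer than the set duration.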

FIG. 4b shows the result of classifying and sorting by the custom classification type named "HTTP intrusion attack" in this embodiment: it lists all the log data associated with "HTTP intrusion attack", sorted from high to low by total attack count.
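Screening by associated keyword and then sorting by total attack count can be sketched as follows. The aggregated rows and their counts are made-up sample data, and the function name `classify_and_sort` is hypothetical; the descending order by total attack count is the sorting mode described for FIG. 4b.

```python
# Hypothetical aggregated rows: (log keyword, total attack count).
aggregated = [
    ("HTTP sensitive information scan", 12),
    ("malicious scanning", 40),
    ("HTTP_IIS_directory list", 31),
]
# Keywords associated with the custom type "HTTP intrusion attack".
associated = {"HTTP_IIS_directory list", "HTTP sensitive information scan"}


def classify_and_sort(rows, keywords):
    """Keep rows whose keyword belongs to the type, sorted by count, high to low."""
    matched = [row for row in rows if row[0] in keywords]
    return sorted(matched, key=lambda row: row[1], reverse=True)


result_set = classify_and_sort(aggregated, associated)
```

Rows whose keyword is not associated with the type, such as "malicious scanning" here, are simply left out of the result set.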

Preferably, as shown in FIG. 1, the method further includes: S4, storing the result set in a database; and S5, performing data analysis based on the stored result set.

The classified and sorted log data is stored in a MySQL database for subsequent statistical analysis.

Preferably, the data analysis is performed based on the stored result set, specifically:

There are many ways to analyze log data statistically; this embodiment describes two as examples: performing reverse statistical analysis on the logs to obtain the required indexes, and viewing the analyzed data from the perspective of the configured classification.

Specifically, FIG. 5 shows the process of analysis based on the classified and sorted data, and FIG. 6 shows the overall detailed flow of classification, sorting and analysis, including the data interaction between the databases:

1. Display an asset host list on the page; the display dimension of the page is the asset ip.

2. Acquire all the data produced by custom classification and comprehensive sorting from the MySQL database, including the result set and additional information such as the custom classification types and sorting modes.

3. Loop over the data in the result set and reorganize it into the two data structures 'customNameInfoMap' and 'customNameEventNamesMap' shown in the figure; customNameInfoMap contains the name, security level, sequence number, classification type and associated keywords of every custom classification type, and customNameEventNamesMap contains the mapping between the name of each custom classification type and its associated keywords.

4. Query the 'destination ip aggregation hour table' or the 'destination ip aggregation day table' in ClickHouse by asset host ip and time range to acquire the keywords associated with the destination ip, obtaining the keyword set of the destination ip.

5. Loop over the keywords queried in step 4 and match them against customNameEventNamesMap:

Condition 1: judge in a loop whether the number of times the keywords appear in customNameEventNamesMap is greater than or equal to the set number of keywords;

Condition 2: if condition 1 holds, acquire the key (name) stored in customNameEventNamesMap;

Condition 3: if conditions 1 and 2 hold, obtain the value (all data of the classification type corresponding to the name) from customNameInfoMap through the key (name);

Condition 4: if conditions 1, 2 and 3 are met, put all the obtained values (all data of the classification types corresponding to the names) into the object associated with the destination ip, and return it for page display.

6. Put the data calculated in step 5 into the object (HTML_view) displayed for the asset ip for display.
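The matching in step 5 can be sketched as below. This is one possible reading rather than the definitive algorithm: the text does not say precisely how "the set number of keywords" is chosen, so the parameter `min_matches` (defaulting to 1) is an assumption, as are the function name `match_classifications`, the snake_case variable names and the sample map contents.

```python
from typing import Dict, List, Set

# Hypothetical contents, modelled on the classification type described for FIG. 4b.
custom_name_info_map: Dict[str, dict] = {
    "HTTP intrusion attack": {
        "security_level": "high",
        "sequence_number": 6,
        "keywords": ["HTTP_IIS_directory list", "HTTP sensitive information scan"],
    },
}
custom_name_event_names_map: Dict[str, List[str]] = {
    name: info["keywords"] for name, info in custom_name_info_map.items()
}


def match_classifications(
    dest_keywords: Set[str],
    event_names_map: Dict[str, List[str]],
    info_map: Dict[str, dict],
    min_matches: int = 1,  # assumed meaning of "the set number of keywords"
) -> List[dict]:
    """Return the classification data whose keywords occur for a destination ip."""
    matched = []
    for name, keywords in event_names_map.items():
        # Condition 1: enough of this type's keywords occur for the destination ip.
        hits = sum(1 for keyword in keywords if keyword in dest_keywords)
        if hits < min_matches:
            continue
        # Conditions 2 and 3: use the name as the key into the info map.
        info = info_map.get(name)
        if info is not None:
            # Condition 4: collect the full classification data for display.
            matched.append({"name": name, **info})
    return matched


view = match_classifications(
    {"HTTP sensitive information scan", "malicious scanning"},
    custom_name_event_names_map,
    custom_name_info_map,
)
```

Splitting the data into a name-to-keywords map and a name-to-details map, as the text describes, lets the matching loop touch only keyword lists and fetch the full details once a type is known to match.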

Example 2

Embodiment 2 of the present invention provides a log data sorting and sorting apparatus, which includes a processor and a memory, where the memory stores a computer program, and when the computer program is executed by the processor, the log data sorting and sorting apparatus implements the log data sorting and sorting method provided in embodiment 1.

The log data classification and sorting device provided by this embodiment is used to implement the log data classification and sorting method, so the device has the same technical effects as the method, which are not repeated here.

Example 3

Embodiment 3 of the present invention provides a computer storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the log data sorting method provided in embodiment 1.

The computer storage medium provided by this embodiment is used to implement the log data classification and sorting method, so the storage medium has the same technical effects as the method, which are not repeated here.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims

1. A log data classification and sorting method is characterized by comprising the following steps:

2. The log data sorting method according to claim 1, wherein the log database is queried based on a specified keyword, and the log data in the log database is subjected to group deduplication to obtain a log data set, specifically:

3. The log data sorting method according to claim 1, wherein the specified keyword is a log keyword or a destination ip.

4. The log data sorting method according to claim 1, wherein the log data is subjected to group deduplication based on a specified keyword to obtain a log data table of a corresponding time period, specifically:

dividing log data with the same value of the specified keywords into a group;

5. The log data classification and sorting method according to claim 1, wherein a type attribute of the classification type is set, specifically:

6. The log data sorting method according to claim 1, wherein the log data sets are sorted based on the classification type to obtain a result set, specifically:

7. The log data sorting method of claim 1, further comprising: and warehousing and storing the result set, and analyzing data based on the stored result set.

8. The log data sorting method according to claim 7, wherein the data analysis is performed based on the stored result set, specifically:

9. A log data sorting device, comprising a processor and a memory, wherein the memory stores a computer program, and the computer program is executed by the processor to implement the log data sorting method according to any one of claims 1 to 8.

10. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the log data sorting method of any one of claims 1-8.