CN112632020A - Log information type extraction method and mining method based on spark big data platform - Google Patents

Info

Publication number
CN112632020A
Authority
CN
China
Prior art keywords
log
dimensional array
elements
dimensional
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011560919.8A
Other languages
Chinese (zh)
Other versions
CN112632020B (en)
Inventor
王红伟
文占婷
刘恕涛
薛彬彬
岳桂华
陈锦
王禹
成林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 30 Research Institute
China Information Technology Security Evaluation Center
Original Assignee
CETC 30 Research Institute
China Information Technology Security Evaluation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 30 Research Institute and China Information Technology Security Evaluation Center
Priority to CN202011560919.8A
Publication of CN112632020A
Application granted
Publication of CN112632020B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to the technical field of computer information systems and discloses a method for extracting log information types based on a Spark big data platform, comprising the following steps: preprocessing offline log data, filtering out log entries that cannot be identified, and storing the remaining entries in HDFS; replacing common variables with wildcards while normalizing the log entries to complete simple wildcard processing, and temporarily storing the processed data in HDFS; filtering the wildcard-processed data by time window, splitting the log data into a valid log set and an invalid log set, and temporarily storing both in HDFS after deduplication; and computing the log information types of the valid logs and the invalid logs separately with an iterative grouping mining method, and storing the results in HDFS. The scheme analyzes logs automatically, facilitates data recovery and reuse, and identifies different log information types efficiently and accurately. The invention also discloses a time window filtering method and an iterative grouping mining method.

Description

Log information type extraction method and mining method based on spark big data platform
Technical Field
The invention relates to the technical field of computer information systems, in particular to a method for extracting and mining log information types based on a spark big data platform.
Background
The Spark big data processing platform mainly comprises two parts: the efficient and easy-to-use Spark big data processing framework and the highly reliable distributed file storage system HDFS. HDFS stores large volumes of data; it divides each file into blocks and keeps several copies of every block, so that data survives the unexpected failure of a server or hard disk. The core concept of the Spark framework is the Resilient Distributed Dataset (RDD), an immutable data set that has been partitioned and can be processed in parallel. Operations on an RDD fall into two categories: transformations and actions. A transformation builds a new RDD from an ordinary data set or from an existing RDD; an action computes a final result from an RDD. Recent Spark versions offer a rich set of transformation and action interfaces, which shortens the development cycle and reduces the difficulty of program development.
In a computer information system the volume of system log data is huge. In today's cloud environments in particular, log data grows by hundreds of gigabytes per day, and it often contains important information such as the operating condition of the system, user access and usage, and intrusions or illegal operations by malicious users. Analyzing and processing cloud platform logs is therefore very important.
Because the number of logs is so large, a method that processes log data automatically, in place of traditional manual inspection and analysis, is urgently needed. Since logs are produced by program code, the number of log information types in a system is limited, and these types must be known before logs can be analyzed automatically. The log information types of the cloud platform therefore need to be extracted from a large amount of offline log data to provide basic support for subsequent automatic log analysis and processing.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the existing problems, a log information type extraction method and a log information type mining method based on a spark big data platform are provided.
The technical scheme adopted by the invention is as follows: a cloud platform log information type extraction method based on a spark big data platform comprises the following steps:
step S1, preprocessing the off-line log data, filtering out unidentifiable log entries, and storing the filtered log data into the HDFS;
step S2, replacing the conventional variables with wildcards, and meanwhile, conducting normalization processing on the log entries to complete simple wildcard processing, and temporarily storing the data after the wildcard processing into the HDFS;
step S3, filtering the time of the data after the wildcard processing according to a time window, filtering and splitting the log data into an effective log set and an invalid log set, and temporarily storing the log data into an HDFS after duplication removal;
and step S4, for the valid log set and the invalid log set obtained in step S3, converting each log into a one-dimensional array whose elements are the words or symbols of the log information, grouping the one-dimensional arrays by array length to form two-dimensional arrays, performing feature position segmentation, merging each resulting two-dimensional array into a one-dimensional array to form a complete log message type, and storing the final result in HDFS.
Further, the filtering in step S1 is performed as follows: a time regular expression is matched against the header information of each log in the log data, and the log is stored in HDFS if the match succeeds.
Further, the method for filtering according to the time window in step S3 is as follows:
step S31, taking out data (one day, several days, or all of the data, depending on the situation) from the wildcard-processed data temporarily stored in HDFS, and denoting it C;
step S32, setting a time window T, which takes the base value T = T1 on the first execution (for example, T1 = 5 s; this is an empirical value), and dividing the log data C by T according to the log header time to form several time-window log sets whose set time is less than or equal to T:
C = {C_1, C_2, ..., C_K}
where C_i denotes the i-th set, K is the number of sets, and the set time is the difference between the latest and the earliest log time in log set C_i;
step S33, taking each log in the log set that contains the most logs; if a log appears in all log sets, it is considered an invalid log, stored in a temporary invalid log set C_invalid, and deleted from the original log sets;
step S34, resetting the time window to T = T1 × 2^N, where N is the number of loop executions (incremented by 1 after each calculation), and repeating steps S32 to S33; if T is greater than the difference between the latest log time T_max and the earliest log time T_min of the data C taken out (one day, several days, or all of the data), exiting the loop and storing the remaining log data in a temporary valid log set C_valid.
Further, in step S31, when data is taken out from the data temporarily stored in the HDFS after the wildcard processing, one or several days or all of the data is taken out.
Further, in step S33, if the log data C is incomplete and the set time of the first log set C_1 and of the last log set C_K is less than T, the first log set C_1 and the last log set C_K are excluded.
Further, the method for performing iterative packet mining in step S4 includes:
step S41, respectively taking out valid and invalid log sets from the HDFS, splitting each log according to a blank space, and converting the split log into a one-dimensional array, wherein elements of the one-dimensional array are words or symbols in log information;
step S42, counting the length of each one-dimensional array, and placing the arrays with the same length together to form a two-dimensional array;
step S43, for each two-dimensional array generated in step S42, counting the number of non-repetitive elements in each column in the two-dimensional array, assuming that the number of non-repetitive elements in the X-th column is the smallest (if there are a plurality of columns with the same number and the smallest number, the column with the smallest sequence number is selected), considering that the position X is the feature division position, dividing the two-dimensional array according to the elements in the feature division position X to form a plurality of new two-dimensional arrays, wherein the elements in the X-th column of the newly generated two-dimensional arrays are all the same;
setting a partition coefficient value; if the ratio of the number of rows of a newly generated two-dimensional array to the number of rows of its parent (pre-partition) two-dimensional array is less than the partition coefficient value, the newly generated two-dimensional array is temporarily placed in C_temp and is not processed further;
step S44, for each two-dimensional array generated in step S43, counting the number of non-repeating elements in each column; if several columns have the same number of non-repeating elements, taking the count shared by the largest number of columns and defining the two lowest-numbered columns with that count as the bijective positions, denoted P1 and P2; the elements in columns P1 and P2 form a one-dimensional array C1 and a one-dimensional array C2, respectively;
step S45, if the elements in the one-dimensional array C1 and the one-dimensional array C2 are in one-to-one mapping relation, the two-dimensional array is divided according to the elements on the characteristic division position P1 or position P2 to form a plurality of new two-dimensional arrays;
if the mapping relation of the elements 1 to M or M to 1 exists in the set C1 and the set C2, wherein M is a natural number larger than 1, the ratio of the number of the non-repetitive elements in the set of the M end to the number of the columns of the two-dimensional array is calculated, if the ratio is larger than a preset upper ratio limit, the characteristic segmentation position is selected to be a position corresponding to the set of the 1 end, and if the ratio is smaller than a preset lower ratio limit, the characteristic segmentation position is selected to be a position corresponding to the set of the M end; then, the two-dimensional array is divided according to elements on the characteristic division positions to form a plurality of new two-dimensional arrays;
if the mapping relation of the element N to the element M exists in the set C1 and the set C2, wherein N, M are both natural numbers larger than 1, the two-dimensional array is reserved;
comparing the ratio of the number of rows of each newly generated two-dimensional array to the number of rows of its parent (pre-partition) two-dimensional array with the partition coefficient value, and if the ratio is less than the partition coefficient value, temporarily placing the two-dimensional array in C_temp without further processing;
step S46, within each two-dimensional array, counting the number of non-repeating elements in each column; if the number for a column is not 1, replacing the elements in that column with the wildcard character; if the number is 1, keeping the element unchanged; then merging the two-dimensional array into a one-dimensional array, joining its elements with spaces, and storing it in HDFS.
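Steps S41 to S43 can be illustrated with a small sketch in plain Python. This is only an illustration under stated assumptions, not the patented implementation: the function names `group_by_length` and `split_on_min_column` are hypothetical, the data is held in ordinary lists rather than Spark RDDs, and the sample messages are invented.

```python
from collections import defaultdict

def group_by_length(messages):
    """S41-S42: split each log on spaces and group token rows of equal length."""
    groups = defaultdict(list)
    for msg in messages:
        tokens = msg.split()
        groups[len(tokens)].append(tokens)
    return dict(groups)

def split_on_min_column(rows):
    """S43: split a 2-D array on the column with the fewest distinct tokens."""
    distinct = [len(set(col)) for col in zip(*rows)]
    x = distinct.index(min(distinct))  # ties resolve to the lowest column index
    parts = defaultdict(list)
    for row in rows:
        parts[row[x]].append(row)
    return x, dict(parts)

# Invented sample messages for illustration only.
messages = [
    "user alice login ok",
    "user bob login failed",
    "disk sda usage high",
    "disk sdb usage low",
    "session closed",
]
groups = group_by_length(messages)
x, parts = split_on_min_column(groups[4])
```

Here the four-token group is split on column 0, giving one sub-array for "user" and one for "disk". If a column holds a single distinct token, the rule as written selects it and the split returns the array unchanged.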
The invention also discloses a spark big data platform-based time window filtering method, which comprises the following steps:
step S31, taking out data (one day, several days, or all of the data, depending on the situation) from the wildcard-processed data temporarily stored in HDFS, and denoting it C;
step S32, setting a time window T, which takes the base value T = T1 on the first execution (for example, T1 = 5 s; this is an empirical value), and dividing the log data C by T according to the log header time to form several time-window log sets whose set time is less than or equal to T:
C = {C_1, C_2, ..., C_K}
where C_i denotes the i-th set, K is the number of sets, and the set time is the difference between the latest and the earliest log time in log set C_i;
step S33, taking each log in the log set that contains the most logs; if a log appears in all log sets, it is considered an invalid log, stored in a temporary invalid log set C_invalid, and deleted from the original log sets;
step S34, resetting the time window to T = T1 × 2^N, where N is the number of loop executions (incremented by 1 after each calculation), and repeating steps S32 to S33; if T is greater than the difference between the latest log time T_max and the earliest log time T_min of the data C taken out (one day, several days, or all of the data), exiting the loop and storing the remaining log data in a temporary valid log set C_valid.
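The growth of the time window in step S34 can be made concrete with a short sketch. It assumes that the window doubles on every pass (T = T1 × 2^N, with N = 0 on the first pass) and that the loop ends as soon as T exceeds T_max − T_min; the function name `window_schedule` and the sample values are hypothetical.

```python
def window_schedule(t1, t_min, t_max):
    """Yield window lengths T = t1 * 2**n until T exceeds the span t_max - t_min."""
    span = t_max - t_min
    n = 0
    while True:
        t = t1 * 2 ** n
        if t > span:
            break
        yield t
        n += 1

# With T1 = 5 s and one minute of data, the windows used are 5, 10, 20 and 40 s.
schedule = list(window_schedule(5, 0, 60))
```

Because the window doubles each time, the number of passes grows only logarithmically with the time span of the data.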
The invention also discloses an iterative packet mining method based on the spark big data platform, which comprises the following steps:
step S41, respectively taking out valid and invalid log sets from the HDFS, splitting each log according to a blank space, and converting the split log into a one-dimensional array, wherein elements of the one-dimensional array are words or symbols in log information;
step S42, counting the length of each one-dimensional array, and placing the arrays with the same length together to form a two-dimensional array;
step S43, for each two-dimensional array generated in step S42, counting the number of non-repetitive elements in each column in the two-dimensional array, assuming that the number of non-repetitive elements in the X-th column is the smallest (if there are a plurality of columns with the same number and the smallest number, the column with the smallest sequence number is selected), considering that the position X is the feature division position, dividing the two-dimensional array according to the elements in the feature division position X to form a plurality of new two-dimensional arrays, wherein the elements in the X-th column of the newly generated two-dimensional arrays are all the same;
setting a partition coefficient value; if the ratio of the number of rows of a newly generated two-dimensional array to the number of rows of its parent (pre-partition) two-dimensional array is less than the partition coefficient value, the newly generated two-dimensional array is temporarily placed in C_temp and is not processed further;
step S44, for each two-dimensional array generated in step S43, counting the number of non-repeating elements in each column; if several columns have the same number of non-repeating elements, taking the count shared by the largest number of columns and defining the two lowest-numbered columns with that count as the bijective positions, denoted P1 and P2; the elements in columns P1 and P2 form a one-dimensional array C1 and a one-dimensional array C2, respectively;
step S45, if the elements in the one-dimensional array C1 and the one-dimensional array C2 are in one-to-one mapping relation, the two-dimensional array is divided according to the elements on the characteristic division position P1 or position P2 to form a plurality of new two-dimensional arrays;
if the mapping relation of the elements 1 to M or M to 1 exists in the set C1 and the set C2, wherein M is a natural number larger than 1, the ratio of the number of the non-repetitive elements in the set of the M end to the number of the columns of the two-dimensional array is calculated, if the ratio is larger than a preset upper ratio limit, the characteristic segmentation position is selected to be a position corresponding to the set of the 1 end, and if the ratio is smaller than a preset lower ratio limit, the characteristic segmentation position is selected to be a position corresponding to the set of the M end; then, the two-dimensional array is divided according to elements on the characteristic division positions to form a plurality of new two-dimensional arrays;
if the mapping relation of the element N to the element M exists in the set C1 and the set C2, wherein N, M are both natural numbers larger than 1, the two-dimensional array is reserved;
comparing the ratio of the number of rows of each newly generated two-dimensional array to the number of rows of its parent (pre-partition) two-dimensional array with the partition coefficient value, and if the ratio is less than the partition coefficient value, temporarily placing the two-dimensional array in C_temp without further processing;
step S46, within each two-dimensional array, counting the number of non-repeating elements in each column; if the number for a column is not 1, replacing the elements in that column with the wildcard character; if the number is 1, keeping the element unchanged; then merging the two-dimensional array into a one-dimensional array, joining its elements with spaces, and storing it in HDFS.
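The selection of the bijective positions (step S44) and the classification of the mapping between the two columns (step S45) might look as follows. This is a hedged sketch: `bijective_positions` and `mapping_type` are hypothetical names, and ignoring columns that contain a single distinct token is an assumption made here (the text does not say so), chosen because splitting on a constant column changes nothing.

```python
from collections import Counter, defaultdict

def bijective_positions(rows):
    """S44: pick the two lowest columns sharing the most common distinct-token count."""
    distinct = [len(set(col)) for col in zip(*rows)]
    # Assumption: constant columns (one distinct token) are ignored.
    common = Counter(d for d in distinct if d > 1).most_common(1)
    if not common or common[0][1] < 2:
        return None
    target = common[0][0]
    cols = [c for c, d in enumerate(distinct) if d == target]
    return cols[0], cols[1]

def mapping_type(col1, col2):
    """S45: classify the relation between two columns as 1-1, 1-M, M-1 or M-N."""
    forward, backward = defaultdict(set), defaultdict(set)
    for a, b in zip(col1, col2):
        forward[a].add(b)
        backward[b].add(a)
    one_to_many = any(len(v) > 1 for v in forward.values())
    many_to_one = any(len(v) > 1 for v in backward.values())
    if one_to_many and many_to_one:
        return "M-N"
    if one_to_many:
        return "1-M"
    if many_to_one:
        return "M-1"
    return "1-1"

# Invented rows for illustration only.
rows = [
    ["job", "start", "queue", "a"],
    ["job", "start", "queue", "b"],
    ["job", "stop", "worker", "c"],
    ["job", "stop", "worker", "d"],
]
p1, p2 = bijective_positions(rows)
relation = mapping_type([r[p1] for r in rows], [r[p2] for r in rows])
```

In this example columns 1 and 2 are chosen and their relation is one-to-one, so the array would be split on either column. The upper and lower ratio limits used for the 1-M and M-1 cases are left out because the text gives no values for them.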
The invention also discloses a cloud platform log information type extraction device based on the Spark big data platform, which comprises:
the log preprocessing device is used for preprocessing the offline log data, filtering log entries which cannot be identified, and storing the filtered log data into the HDFS;
the simple wildcard device is used for replacing the conventional variable with a wildcard character, simultaneously performing normalization processing on the log entries to finish simple wildcard processing, and temporarily storing the data after the wildcard processing into the HDFS;
the time window filtering device is used for filtering the time of the data subjected to the wildcard processing according to a time window, filtering and splitting the log data into an effective log set and an invalid log set, and temporarily storing the log data into the HDFS after duplication removal;
and the iterative grouping mining device is used for converting each log into a one-dimensional array whose elements are the words or symbols of the log information, grouping the one-dimensional arrays by array length to form two-dimensional arrays, performing feature position segmentation, merging each resulting two-dimensional array into a one-dimensional array to form a complete log message type, and storing the final result in HDFS.
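The final merging of a two-dimensional array into one log message type can be sketched as below. The wildcard symbol "*" and the function name `to_template` are assumptions made for illustration; the rows are invented.

```python
WILDCARD = "*"  # assumed wildcard symbol

def to_template(rows):
    """Collapse a 2-D token array into one message type.

    A column with a single distinct token keeps that token; any other
    column is replaced by the wildcard. Tokens are joined with spaces.
    """
    out = []
    for col in zip(*rows):
        out.append(col[0] if len(set(col)) == 1 else WILDCARD)
    return " ".join(out)

template = to_template([
    ["user", "alice", "login", "ok"],
    ["user", "bob", "login", "ok"],
])
```

For the two sample rows the result is "user * login ok": only the second column varies, so only that column becomes a wildcard.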
Compared with the prior art, the beneficial effects of adopting the technical scheme are as follows:
1) the invention provides a cloud platform log information type extraction method based on a Spark big data platform, which can analyze the log information type of the system in a large amount of off-line log data so as to facilitate automatic analysis and processing of logs;
2) the time window filtering method provided by the invention can identify and segment invalid periodic logs and valid event logs in the cloud platform logs, and can quickly sort invalid log information according to the obtained types;
3) the iterative packet mining method provided by the invention can efficiently and accurately identify different log information types;
4) because the volume of offline log data is huge, the invention introduces Spark big data technology, which is used in the log preprocessing, filtering and partitioning devices so that results are obtained quickly and efficiently;
5) the invention introduces HDFS distributed storage, and stores the original offline log data and the result output in the HDFS distributed storage, thereby being beneficial to the recovery and the use of the data.
Drawings
Fig. 1 is a schematic diagram of a principle of a cloud platform log information type extraction device based on a spark big data platform according to the present invention.
Fig. 2 is a schematic diagram illustrating a log preprocessing principle in an embodiment of the present invention.
Fig. 3 is a schematic diagram of the simple wildcard principle in an embodiment of the present invention.
Fig. 4 is a schematic diagram illustrating a principle of extracting log information types of a cloud platform based on a spark big data platform according to the present invention.
Fig. 5 is a schematic diagram illustrating a time window filtering method according to an embodiment of the present invention.
Fig. 6 is a schematic diagram illustrating a principle of an iterative packet mining method according to an embodiment of the present invention.
Detailed Description
To analyze and process cloud platform system logs automatically, the technical scheme of the invention is a device that extracts as many log information types as possible from a large amount of offline cloud platform log data on a Spark big data platform. As shown in fig. 1, the device comprises a log preprocessing module, a simple wildcard module, a time window filtering module, an iterative packet mining module, and so on. The original offline log data and the intermediate output of each module are temporarily stored in HDFS (Hadoop Distributed File System), the final result is also stored in HDFS, and all processing of the device is performed on the Spark big data platform.
Offline log data
The offline log data includes historical log data generated by the cloud platform system within a period of continuous time, for example, in a distributed cloud operating system, log files on all distributed servers need to be merged and collected, and then all log message types of the system can be completely analyzed. If the system is large or the running time is long, the log data volume can be large, and the cloud platform can reach TB magnitude.
HDFS
HDFS stands for Hadoop Distributed File System. It stores large amounts of data in blocks, and every block has additional backup copies, which facilitates recovery. Original logs are split and stored in HDFS by day, and the extracted log information types are stored separately for valid and invalid logs.
Log preprocessing module
As shown in fig. 2, offline log data must be preprocessed before it is stored in HDFS, and the header information of each log must be identified. The preprocessing device mainly matches a time regular expression against the header information of each log in the offline log data (the log header has to be identified from manual experience). Different systems may use different time formats, and even modules of the same system may differ, so the time format has to be obtained by machine learning or from experience. A typical time-matching regular expression is r'^\d{4}-\d{2}-\d{2} (([0-1][0-9]|2[0-4]):[0-5][0-9]:[0-5][0-9])(,|-)(\d+)', which matches times of the form "%Y-%m-%d %H:%M:%S,%f". For each log, if the time format is matched at the head of the log, the log is stored in the HDFS file of the corresponding day; otherwise the log is skipped and the next log is examined.
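The header check described above can be sketched with Python's `re` module. The pattern below is an assumption written for headers of the form "%Y-%m-%d %H:%M:%S,%f" and is not taken from the patent; the function name `split_by_header` and the sample lines are likewise hypothetical, and writing to HDFS is left out.

```python
import re

# Assumed pattern for "%Y-%m-%d %H:%M:%S,%f" headers; other systems need other patterns.
HEADER_TIME = re.compile(
    r"^\d{4}-\d{2}-\d{2} ([01]\d|2[0-3]):[0-5]\d:[0-5]\d[,.]\d+"
)

def split_by_header(lines):
    """Return (kept, skipped): lines whose head matches the time pattern, and the rest."""
    kept, skipped = [], []
    for line in lines:
        (kept if HEADER_TIME.match(line) else skipped).append(line)
    return kept, skipped

# Invented sample lines for illustration only.
sample = [
    "2020-12-25 10:15:30,123 INFO compute instance started",
    "Traceback (most recent call last):",
    "2020-12-25 10:15:31,456 WARNING disk usage high",
]
kept, skipped = split_by_header(sample)
```

Lines without a recognizable header, such as continuation lines of a stack trace, end up in `skipped`, which corresponds to the entries that the preprocessing step filters out.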
Simple wildcard module
As shown in fig. 3, this module mainly performs simple regular-expression replacement on each day's logs: IP addresses, times other than the log header time, and common UUIDs are replaced with the wildcard character; delimiter characters in the log such as "/", "\t", "=", "(", ")", "[", "]", "{", "}" and quotation marks are replaced with a blank character; and multiple consecutive spaces are merged into one. The replaced log is then split into an array on spaces, and any element that can be converted to an integer or floating-point number is replaced with the wildcard character. Finally, the array is joined with spaces and combined with the header time to form a log in the format "header time, log information body"; the logs are sorted by header time and temporarily stored in HDFS.
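A minimal sketch of this normalization, applied to the log information body only, is given below. Everything specific in it is an assumption: the wildcard symbol "*", the IP and UUID patterns, the exact delimiter set, the choice of a space as the replacement character, and the function names.

```python
import re

WILDCARD = "*"  # assumed wildcard symbol

# Assumed patterns; a real deployment would tune these to its own log formats.
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
DELIMS_RE = re.compile(r"[/\t=()\[\]{}\"']")

def is_number(token):
    """True if the token contains a digit and parses as an int or float."""
    if not any(ch.isdigit() for ch in token):
        return False
    try:
        float(token)
        return True
    except ValueError:
        return False

def simple_wildcard(body):
    """Replace UUIDs, IP addresses and numbers with the wildcard and normalize spacing."""
    body = UUID_RE.sub(WILDCARD, body)
    body = IP_RE.sub(WILDCARD, body)
    body = DELIMS_RE.sub(" ", body)
    tokens = body.split()  # split() also merges runs of whitespace
    tokens = [WILDCARD if is_number(t) else t for t in tokens]
    return " ".join(tokens)

normalized = simple_wildcard(
    "connect from 10.0.0.5 port=8080 "
    "id=[123e4567-e89b-12d3-a456-426614174000]   took 3.5 s"
)
```

The sample line becomes "connect from * port * id * took * s". UUIDs are replaced before the delimiter step so that their hyphenated form is still intact when the pattern is applied.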
Based on the log preprocessing module, the simple wildcard module, the time window filtering module, the iterative packet mining module and so on, and as shown in fig. 4, the cloud platform log information type extraction method based on the Spark big data platform comprises the following steps: step S1, preprocessing the offline log data, filtering out unidentifiable log entries, and storing the filtered log data in HDFS; step S2, replacing common variables (such as IP addresses, UUIDs and time strings) with wildcards while normalizing the log entries to complete simple wildcard processing, and temporarily storing the wildcard-processed data in HDFS; step S3, filtering the wildcard-processed data by time window, splitting the log data into a valid log set and an invalid log set, and temporarily storing both in HDFS after deduplication; and step S4, for the valid log set and the invalid log set obtained in step S3, converting each log into a one-dimensional array whose elements are the words or symbols of the log information, grouping the one-dimensional arrays by array length to form two-dimensional arrays, performing feature position segmentation, merging each resulting two-dimensional array into a one-dimensional array to form a complete log message type, and storing the final result in HDFS.
Example 1:
a time window filtering module: in cloud platform log data, log types can be roughly divided into two categories: periodic logs and event logs. A periodic log is log information that a cloud platform system program outputs at regular intervals (for example, a cloud operating system checking and outputting the state of all virtual machines every 30 seconds). An event log is generated when some event occurs in the cloud platform system (for example, a user creating a virtual machine in a cloud operating system) and has no periodicity. Periodic logs have little value for log processing and analysis, so they are called invalid logs; event logs are highly valuable, so they are called valid logs. The log information types of both invalid and valid logs nevertheless need to be determined, which makes subsequent filtering easier. The main purpose of the time window filtering module is to separate the two kinds of logs as far as possible.
As shown in fig. 5, the process of time window filtering is as follows:
step S31, taking out data (taking out a day or collecting several days or all data according to the situation) from the data temporarily stored in the HDFS after the wildcard processing, and marking as C;
step S32 is to set a time stamp T, execute the first time of taking the base time stamp T to T1 (for example, T1 to 5S, this value is an empirical value), divide the taken log data C by T according to the time of the log header, and form a plurality of time window log sets with an aggregation time T or less
Figure BDA0002859347980000071
Wherein C isiRepresenting the ith set, K being the number of divided sets, and the set time being the log set CiThe difference between the latest and earliest log times;
in step S33, a log set with the largest number of logs (C is assumed to be used)k) If a log is in all log sets CiK (if the log data C start-stop is incomplete, the first log set C1And the last Log set CkIf the set time is less than T, excluding the first log set C1And the last Log set Ck) If the log is an invalid log, the invalid log is stored in a temporary invalid log set CinvalidMeanwhile, deleting the log in the original log set;
step S34, re-taking the timestamp T-T1 × (2^ N), where N is the number of loop executions (automatically increased by 1 after each calculation), repeating steps S32-S33, if T is greater than the earliest log time T of the fetched data C (data of one or several days or all fetched in step S31)minAnd the latest log time TmaxThe difference exits the loop and the number of the remaining logs is countedBy logging into a temporary valid log collection Cvalid
In the invalid log set C_invalid and the valid log set C_valid, to reduce the data volume, the header time part of each log is deleted and only the log message body is kept; redundant duplicate log data in each set is then deleted, and the result is temporarily stored in HDFS.
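The loop in steps S31 to S34 can be sketched in a few lines of single-machine Python. This is only an illustration under stated assumptions, not the patented Spark implementation: the function names, the `(timestamp, message)` input format, and the choice to anchor fixed-width windows at the earliest timestamp (so that empty windows also count) are all assumptions, and the exclusion of incomplete first and last sets is omitted.

```python
from datetime import datetime, timedelta


def split_into_windows(logs, start, end, window):
    """S32: place each message in the fixed-width window its timestamp falls in."""
    count = (end - start) // window + 1
    windows = [set() for _ in range(count)]
    for ts, message in logs:
        windows[(ts - start) // window].add(message)
    return windows


def time_window_filter(logs, base_seconds=5):
    """Return (valid, invalid) message sets from (timestamp, message) pairs."""
    if not logs:
        return set(), set()
    start = min(ts for ts, _ in logs)   # T_min of the fetched data C
    end = max(ts for ts, _ in logs)     # T_max of the fetched data C
    remaining, invalid, n = list(logs), set(), 0
    while remaining:
        window = timedelta(seconds=base_seconds * 2 ** n)   # T = T1 * 2^N
        if window > end - start:                            # exit condition of S34
            break
        windows = split_into_windows(remaining, start, end, window)
        largest = max(windows, key=len)
        # S33: a message present in every window is treated as periodic (invalid).
        periodic = {m for m in largest if all(m in w for w in windows)}
        invalid |= periodic
        remaining = [(ts, m) for ts, m in remaining if m not in periodic]
        n += 1
    valid = {m for _, m in remaining}   # header time dropped, duplicates removed
    return valid, invalid
```

Returning sets of message bodies mirrors the final clean-up described above: the header time is discarded and duplicates collapse automatically.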
Example 2:
Iterative packet mining module: iterative packet mining continues, on the basis of time window filtering, to mine the message types of the logs in depth.
As shown in fig. 6, the method of iterative packet mining is:
step S41, splitting: taking the valid log set C_valid and the invalid log set C_invalid out of HDFS, splitting each log on spaces, and converting it into a one-dimensional array whose elements are the words or symbols in the log message;
step S42, element number division: counting the length of each one-dimensional array, and placing arrays of the same length together to form a two-dimensional array;
step S43, feature value segmentation: for each two-dimensional array generated in step S42, counting the number of distinct elements in each column; assuming column X has the fewest distinct elements (if several columns tie for the fewest, the one with the smallest column index is chosen), X is taken as the feature division position, and the two-dimensional array is divided according to the elements at position X into several new two-dimensional arrays, in each of which the elements in column X are all the same. A partition coefficient PT is defined here (PT can be set to 0.15 as an empirical value): if the ratio of the number of rows in a newly generated two-dimensional array to the number of rows in its parent two-dimensional array is less than PT, the new array is considered to have too few rows to form a log information type, and it is temporarily placed in C_temp and not processed further;
step S44, bijective segmentation: for each two-dimensional array generated in the previous step, counting the number of distinct elements in each column, denoted Q_i (i = 1, 2, ..., T, where T here is the total number of columns in the two-dimensional array), and then counting the frequency of each value of Q_i (for example, Q_i equals 3 five times and equals 6 three times). Suppose the value with the highest frequency is shared by Q_a, Q_b, Q_c, ... (a, b, c being values between 1 and T); that is, these columns have the same number of distinct elements, and they form the largest such group of columns. The two columns in this group with the smallest column indices are defined as the bijective positions, denoted P1 and P2, and the elements in columns P1 and P2 form the one-dimensional array C1 and the one-dimensional array C2, respectively. The mapping relationship between the elements of C1 and C2 is then examined.
If the mapping relationship is a "1-1" mapping (that is, the elements of C1 and C2 correspond one to one), either bijective position can be taken as the feature division position, and the two-dimensional array is divided according to the elements at that position (P1 or, equivalently, P2) to form several new two-dimensional arrays;
if the mapping relationship is a "1-M" or "M-1" mapping (that is, the elements of C1 and C2 are one-to-many or many-to-one), the "1" end position holds a constant (if the mapping from C1 to C2 is one-to-many, i.e. "1-M", the "1" end position is position P1, corresponding to C1), while the "M" end position may also hold a constant. The ratio of the number of distinct elements in the "M" end set to the number of columns of the two-dimensional array is therefore calculated. If the ratio is greater than a preset UpperBound (for example, UpperBound = 0.9), the "M" end is regarded as a variable and the feature division position is the "1" end; if the ratio is less than a preset LowerBound (for example, LowerBound = 0.1), the "M" end is regarded as a constant and the feature division position is the "M" end (position P2 for a "1-M" mapping; position P1 for an "M-1" mapping). The two-dimensional array is then divided according to the elements at the feature division position to form several new two-dimensional arrays;
if the mapping relationship is an "N-M" mapping (that is, the elements of C1 and C2 are many-to-many), the feature division position is not clear-cut; the two-dimensional array is kept as it is, temporarily placed in C_temp, and not processed further.
Finally, the ratio of the total number of rows of each newly generated two-dimensional array to the total number of rows of its parent two-dimensional array (the array before division) is compared with the partition coefficient PT; if the ratio is less than PT, the new two-dimensional array is temporarily placed in C_temp and not processed further;
and step S45, merging the two-dimensional arrays to form log information types. The two-dimensional arrays generated in step S44 and those held in C_temp are generalized and merged. Within each two-dimensional array, the number of distinct elements in each column is counted: if the count for a column is not 1, the elements in that column are replaced with a wildcard character; if it is 1, the element is kept as is. Each two-dimensional array is thereby merged into a one-dimensional array, whose elements are finally joined with spaces to form a complete log message type. These message types are ultimately saved to HDFS.
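Steps S41 to S45 can be illustrated with a short single-machine Python sketch. It is a hedged illustration rather than the patented Spark implementation: all function names are hypothetical, the `"*"` wildcard is an assumed placeholder (the text does not show the actual character), and the UpperBound/LowerBound choice in step S44 is left out, with only the classification of the mapping between the two bijective columns shown.

```python
from collections import defaultdict


def tokenize(message):
    """S41: split a log message on spaces into a one-dimensional array."""
    return message.split(" ")


def group_by_length(rows):
    """S42: put arrays of equal length together as two-dimensional arrays."""
    groups = defaultdict(list)
    for row in rows:
        groups[len(row)].append(row)
    return list(groups.values())


def split_on_min_cardinality(table, pt=0.15):
    """S43: divide on the column with the fewest distinct elements."""
    distinct = [len({row[c] for row in table}) for c in range(len(table[0]))]
    x = distinct.index(min(distinct))        # smallest column index wins ties
    parts = defaultdict(list)
    for row in table:
        parts[row[x]].append(row)
    kept, deferred = [], []                  # deferred plays the role of C_temp
    for part in parts.values():
        (deferred if len(part) / len(table) < pt else kept).append(part)
    return kept, deferred


def mapping_kind(col_a, col_b):
    """S44: classify the relation between the columns at P1 and P2."""
    a_to_b, b_to_a = defaultdict(set), defaultdict(set)
    for a, b in zip(col_a, col_b):
        a_to_b[a].add(b)
        b_to_a[b].add(a)
    a_many = any(len(v) > 1 for v in a_to_b.values())  # one a, several b
    b_many = any(len(v) > 1 for v in b_to_a.values())  # one b, several a
    if a_many and b_many:
        return "N-M"
    if a_many:
        return "1-M"
    if b_many:
        return "M-1"
    return "1-1"


def to_template(table, wildcard="*"):
    """S45: keep constant columns, replace varying columns with a wildcard."""
    columns = zip(*table)
    return " ".join(c[0] if len(set(c)) == 1 else wildcard for c in columns)
```

For example, the messages `create a1`, `create b2`, `delete c3` and `delete d4` form one two-dimensional array of width 2; dividing on the first column yields two arrays whose templates are `create *` and `delete *`.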
The invention is not limited to the foregoing embodiments. The invention extends to any novel feature or any novel combination of features disclosed in this specification and any novel method or process steps or any novel combination of features disclosed. Those skilled in the art to which the invention pertains will appreciate that insubstantial changes or modifications can be made without departing from the spirit of the invention as defined by the appended claims.

Claims (9)

1. The method for extracting the log information type based on the spark big data platform is characterized by comprising the following steps:
step S1, preprocessing the off-line log data, filtering out unidentifiable log entries, and storing the filtered log data into the HDFS;
step S2, replacing the conventional variables with wildcards, and meanwhile, conducting normalization processing on the log entries to complete simple wildcard processing, and temporarily storing the data after the wildcard processing into the HDFS;
step S3, filtering the wildcard-processed data by time window, splitting the log data into a valid log set and an invalid log set, and temporarily storing them in the HDFS after deduplication;
and step S4, for the valid log set and the invalid log set obtained in step S3, converting each log into a one-dimensional array whose elements are the parts of the log message, grouping the one-dimensional arrays by array length to form two-dimensional arrays, dividing the two-dimensional arrays at feature positions, merging each two-dimensional array into a one-dimensional array to form a complete log message type, and storing the final result in the HDFS.
2. The spark big data platform-based log information type extraction method as claimed in claim 1, wherein the filtering in step S1 is performed as follows: the header information of each log in the log data is matched against a time regular expression, and the log is stored in the HDFS if the match succeeds.
3. The method for extracting log information type based on spark big data platform as claimed in claim 1, wherein the filtering according to time window in step S3 is performed by:
step S31, taking out data from the data temporarily stored in the HDFS after the wildcard processing, and recording as C;
step S32, setting a timestamp T, which on the first pass takes the base timestamp T = T1, and dividing the fetched log data C by T according to the time in each log header, forming a number of time-window log sets whose set time is less than or equal to T:
C = {C_1, C_2, ..., C_K}
where C_i denotes the i-th set, K is the number of sets produced by the division, and the set time is the difference between the latest and earliest log times in log set C_i;
step S33, taking each log in the log set that contains the most logs: if the log appears in all log sets, it is regarded as an invalid log, stored in the temporary invalid log set C_invalid, and deleted from the original log sets;
step S34, updating the timestamp to T = T1 × 2^N, where N is the number of loop executions, and repeating steps S32 to S33; if T is greater than the difference between the latest log time T_max and the earliest log time T_min of the fetched data C, exiting the loop and storing the remaining log data in the temporary valid log set C_valid.
4. The spark big data platform based log information type extraction method as claimed in claim 3, wherein in step S31, when data is extracted from the data temporarily stored in the HDFS after the wildcard processing, one or more days or all of the data are extracted.
5. The spark big data platform based log information type extraction method as claimed in claim 3, wherein in step S33, if the start and end of the log data C are incomplete and the set time of the first log set C_1 and the last log set C_K is less than T, the first log set C_1 and the last log set C_K are excluded.
6. The spark big data platform-based log information type extraction method as claimed in claim 1, wherein the iterative packet mining method in step S4 is:
step S41, respectively taking out valid and invalid log sets from the HDFS, splitting each log according to a blank space, and converting the split log into a one-dimensional array, wherein elements of the one-dimensional array are words or symbols in log information;
step S42, counting the length of each one-dimensional array, and placing the arrays with the same length together to form a two-dimensional array;
step S43, for each two-dimensional array generated in the step S42, counting the number of non-repetitive elements on each column in the two-dimensional array, assuming that the number of non-repetitive elements on the X-th column is the minimum, considering that the X-th column is a feature division position, dividing the two-dimensional array according to the elements on the feature division position X to form a plurality of new two-dimensional arrays, wherein the elements on the X-th column of the newly generated two-dimensional arrays are all the same;
setting a partition coefficient value, and if the ratio of the number of rows of a newly generated two-dimensional array to the number of rows of its parent two-dimensional array is smaller than the partition coefficient value, temporarily placing the newly generated two-dimensional array in C_temp without further processing;
step S44, for each two-dimensional array generated in step S43, counting the number of non-repetitive elements in each column of the two-dimensional array, and if the number of elements in some columns is consistent and the number of columns with consistent number of elements is the largest, defining the two columns with the smallest number of columns as bijective positions, which are marked as positions P1 and P2, and forming a one-dimensional array C1 and a one-dimensional array C2 by the elements in the P1 and P2 columns, respectively;
if the elements in the one-dimensional array C1 and the one-dimensional array C2 are in a one-to-one mapping relation, the two-dimensional array is divided according to the elements at the feature division position P1 or P2 to form a plurality of new two-dimensional arrays;
if the mapping relation of the elements 1 to M or M to 1 exists in the set C1 and the set C2, wherein M is a natural number larger than 1, the ratio of the number of the non-repetitive elements in the set of the M end to the number of the columns of the two-dimensional array is calculated, if the ratio is larger than a preset upper ratio limit, the characteristic segmentation position is selected to be a position corresponding to the set of the 1 end, and if the ratio is smaller than a preset lower ratio limit, the characteristic segmentation position is selected to be a position corresponding to the set of the M end; then, the two-dimensional array is divided according to elements on the characteristic division positions to form a plurality of new two-dimensional arrays;
if the mapping relation of the element N to the element M exists in the set C1 and the set C2, wherein N, M are both natural numbers larger than 1, the two-dimensional array is reserved;
comparing the ratio of the total number of rows of each newly generated two-dimensional array to the total number of rows of its parent two-dimensional array with the partition coefficient value, and if the ratio is smaller than the partition coefficient value, temporarily placing the two-dimensional array in C_temp without further processing;
step S45, within each two-dimensional array, counting the number of non-repetitive elements in each column; if the number for a column is not 1, replacing the elements in that column with a wildcard; if it is 1, keeping the element as is; thereby merging the two-dimensional array into a one-dimensional array, joining its elements with spaces, and storing the result in the HDFS.
7. The spark big data platform based log information type extraction method as claimed in claim 6, wherein in step S43, if there are a plurality of columns with the same number and the smallest number, the column with the smallest sequence number is selected as the feature dividing position.
8. The log information type mining method based on spark big data platform is characterized by comprising the following steps:
step S41, respectively taking out valid and invalid log sets from the HDFS, splitting each log according to a blank space, and converting the split log into a one-dimensional array, wherein elements of the one-dimensional array are words or symbols in log information;
step S42, counting the length of each one-dimensional array, and placing the arrays with the same length together to form a two-dimensional array;
step S43, for each two-dimensional array generated in step S42, counting the number of non-repetitive elements in each column in the two-dimensional array, assuming that the number of non-repetitive elements in the X-th column is the smallest (if there are a plurality of columns with the same number and the smallest number, the column with the smallest sequence number is selected), considering that the position X is the feature division position, dividing the two-dimensional array according to the elements in the feature division position X to form a plurality of new two-dimensional arrays, wherein the elements in the X-th column of the newly generated two-dimensional arrays are all the same;
setting a partition coefficient value, and if the ratio of the number of rows of a newly generated two-dimensional array to the number of rows of its parent two-dimensional array is smaller than the partition coefficient value, temporarily placing the newly generated two-dimensional array in C_temp without further processing;
step S44, for each two-dimensional array generated in step S43, counting the number of non-repetitive elements in each column of the two-dimensional array, and if the number of elements in some columns is consistent and the number of columns with consistent number of elements is the largest, defining the two columns with the smallest number of columns as bijective positions, which are marked as positions P1 and P2, and forming a one-dimensional array C1 and a one-dimensional array C2 by the elements in the P1 and P2 columns, respectively;
if the elements in the one-dimensional array C1 and the one-dimensional array C2 are in a one-to-one mapping relation, the two-dimensional array is divided according to the elements at the feature division position P1 or P2 to form a plurality of new two-dimensional arrays;
if the mapping relation of the elements 1 to M or M to 1 exists in the set C1 and the set C2, wherein M is a natural number larger than 1, the ratio of the number of the non-repetitive elements in the set of the M end to the number of the columns of the two-dimensional array is calculated, if the ratio is larger than a preset upper ratio limit, the characteristic segmentation position is selected to be a position corresponding to the set of the 1 end, and if the ratio is smaller than a preset lower ratio limit, the characteristic segmentation position is selected to be a position corresponding to the set of the M end; then, the two-dimensional array is divided according to elements on the characteristic division positions to form a plurality of new two-dimensional arrays;
if the mapping relation of the element N to the element M exists in the set C1 and the set C2, wherein N, M are both natural numbers larger than 1, the two-dimensional array is reserved;
comparing the ratio of the total number of rows of each newly generated two-dimensional array to the total number of rows of its parent two-dimensional array with the partition coefficient value, and if the ratio is smaller than the partition coefficient value, temporarily placing the two-dimensional array in C_temp without further processing;
step S45, within each two-dimensional array, counting the number of non-repetitive elements in each column; if the number for a column is not 1, replacing the elements in that column with a wildcard; if it is 1, keeping the element as is; thereby merging the two-dimensional array into a one-dimensional array, joining its elements with spaces, and storing the result in the HDFS.
9. The spark big data platform based log information type mining method as claimed in claim 8, wherein in step S43, if there are a plurality of columns with the same number and the smallest number, the column with the smallest sequence number is selected as the feature dividing position.
CN202011560919.8A 2020-12-25 2020-12-25 Log information type extraction method and mining method based on spark big data platform Active CN112632020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011560919.8A CN112632020B (en) 2020-12-25 2020-12-25 Log information type extraction method and mining method based on spark big data platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011560919.8A CN112632020B (en) 2020-12-25 2020-12-25 Log information type extraction method and mining method based on spark big data platform

Publications (2)

Publication Number Publication Date
CN112632020A true CN112632020A (en) 2021-04-09
CN112632020B CN112632020B (en) 2022-03-18

Family

ID=75325159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011560919.8A Active CN112632020B (en) 2020-12-25 2020-12-25 Log information type extraction method and mining method based on spark big data platform

Country Status (1)

Country Link
CN (1) CN112632020B (en)

Also Published As

Publication number Publication date
CN112632020B (en) 2022-03-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant