CN112632020A - Log information type extraction method and mining method based on spark big data platform - Google Patents


Info

Publication number
CN112632020A
CN112632020A (application CN202011560919.8A)
Authority
CN
China
Prior art keywords: log, dimensional array, elements, data
Legal status: Granted
Application number: CN202011560919.8A
Other languages: Chinese (zh)
Other versions: CN112632020B (en)
Inventors: 王红伟, 文占婷, 刘恕涛, 薛彬彬, 岳桂华, 陈锦, 王禹, 成林
Current Assignee: CETC 30 Research Institute; China Information Technology Security Evaluation Center
Original Assignee: CETC 30 Research Institute; China Information Technology Security Evaluation Center
Application filed by CETC 30 Research Institute, China Information Technology Security Evaluation Center filed Critical CETC 30 Research Institute
Priority to CN202011560919.8A priority Critical patent/CN112632020B/en
Publication of CN112632020A publication Critical patent/CN112632020A/en
Application granted granted Critical
Publication of CN112632020B publication Critical patent/CN112632020B/en
Status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to the technical field of computer information systems and discloses a method for extracting log information types based on the Spark big data platform, comprising the following steps: preprocess the offline log data, filter out unidentifiable log entries, and store the rest into the HDFS; replace regular variables with wildcards while normalizing the log entries to complete simple wildcard processing, and temporarily store the wildcard-processed data into the HDFS; filter the wildcard-processed data by time window, splitting the log data into a valid log set and an invalid log set, de-duplicate them, and temporarily store them into the HDFS; and compute the log information types of the valid and invalid logs respectively with an iterative grouping mining method, storing the results into the HDFS. The scheme analyzes and processes logs automatically, facilitates data recovery and reuse, and can identify different log information types efficiently and accurately. The invention also discloses a time window filtering method and an iterative grouping mining method.

Description

Log information type extraction method and mining method based on spark big data platform
Technical Field
The invention relates to the technical field of computer information systems, in particular to a method for extracting and mining log information types based on a spark big data platform.
Background
The Spark big data processing platform mainly comprises two parts: the efficient, easy-to-use Spark big data processing framework and the highly reliable distributed file storage system HDFS. HDFS stores large amounts of data; it divides files into blocks and keeps multiple replicas of each block to guard against a server or hard disk becoming unexpectedly unavailable. The core concept of the Spark framework is the Resilient Distributed Dataset (RDD), which represents an immutable data set that has been partitioned and can be processed in parallel. Operations on RDDs fall into two categories: transformations and actions. A transformation builds a new RDD from a plain data set or from an existing RDD; an action computes a final result from an RDD. Spark offers a rich set of transformation and action interfaces, which greatly shortens the development cycle and reduces the difficulty of program development.
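The lazy-transformation/eager-action split described above can be illustrated, by analogy, with plain Python iterators standing in for RDDs (this is a didactic sketch, not the Spark API):

```python
# Transformations vs. actions, illustrated with plain Python lazy iterators as a
# stand-in for RDDs: the generator expression builds a lazy pipeline (a
# "transformation" - nothing is computed yet), and only the terminal list()
# call (an "action") forces evaluation of the whole pipeline.
lines = ["ERROR disk full", "INFO boot ok", "ERROR net down"]

# "transformations": filter + map, evaluated lazily
pipeline = (line.split()[1] for line in lines if line.startswith("ERROR"))

# "action": forces evaluation
errors = list(pipeline)
```

In Spark the same shape appears as `rdd.filter(...).map(...)` (transformations) followed by `collect()` or `count()` (actions).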
In a computer information system the volume of system log data is huge; in today's cloud environments in particular, log data grows by hundreds of gigabytes per day. This data often contains important information, such as system operating conditions, user access and usage, and intrusions and illegal operations by malicious users, so log analysis and processing are very important to a cloud platform.
Because the number of logs is huge, a method that automatically processes log data in place of traditional manual inspection and analysis is urgently needed. Owing to the programmatic nature of a system, its set of log information types is finite, and these types must be known before automatic log analysis and processing can be performed. Therefore the log information types of the cloud platform need to be extracted from a large amount of offline log data, providing basic support for subsequent automatic log analysis and processing.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the existing problems, a log information type extraction method and a log information type mining method based on a spark big data platform are provided.
The technical scheme adopted by the invention is as follows: a cloud platform log information type extraction method based on a spark big data platform comprises the following steps:
step S1, preprocessing the off-line log data, filtering out unidentifiable log entries, and storing the filtered log data into the HDFS;
step S2, replacing the conventional variables with wildcards, and meanwhile, conducting normalization processing on the log entries to complete simple wildcard processing, and temporarily storing the data after the wildcard processing into the HDFS;
step S3, filter the wildcard-processed data by time window, splitting the log data into a valid log set and an invalid log set, de-duplicate them, and temporarily store them into the HDFS;
and step S4, for the valid log set and the invalid log set obtained in step S3, convert each log set into one-dimensional arrays whose elements are the log information, group the one-dimensional arrays by length to form two-dimensional arrays, perform feature-position segmentation, merge each resulting two-dimensional array back into a one-dimensional array to form a complete log message type, and store the final results into the HDFS.
Further, the filtering in step S1 is performed as follows: perform regular-expression time matching on the header information of each log entry, and store the entry into the HDFS if the match succeeds.
Further, the method for filtering according to the time window in step S3 is as follows:
step S31, take data out of the wildcard-processed data temporarily stored in the HDFS (one day's worth, several days' worth, or all of the data, as appropriate) and denote it C;
step S32, set a time window T; on the first execution take the base window T = T1 (for example T1 = 5 s, an empirical value), and divide the fetched log data C by T according to the log-header times, forming K time-window log sets each spanning at most T:
C = {C_1, C_2, ..., C_K}
where C_i denotes the i-th set, K is the number of sets, and a set's span is the difference between the latest and earliest log times in C_i;
step S33, take the log set containing the most logs; for each log in it, if the log appears in every log set, treat it as an invalid log, store it in the temporary invalid log set C_invalid, and delete it from the original log sets;
step S34, set the window anew as T = T1 × 2^N, where N is the number of loop executions (incremented by 1 after each pass), and repeat steps S32-S33; when T exceeds the difference between the earliest log time T_min and the latest log time T_max of the fetched data C, exit the loop and store the remaining log data into the temporary valid log set C_valid.
Further, in step S31, when data is taken out from the data temporarily stored in the HDFS after the wildcard processing, one or several days or all of the data is taken out.
Further, in step S33, if the log data C is truncated at either end and the span of the first log set C_1 or the last log set C_K is less than T, the first log set C_1 and the last log set C_K are excluded.
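A minimal, Spark-free sketch of the time-window filtering loop of steps S31-S34 (function and variable names are illustrative, and the single-window guard is a simplification of the exit condition described above):

```python
def time_window_filter(logs, t1=5.0):
    """logs: list of (timestamp_seconds, message_body). Returns (valid, invalid) sets."""
    t_min = min(t for t, _ in logs)
    t_max = max(t for t, _ in logs)
    remaining = list(logs)
    invalid = set()
    n = 0
    window = t1  # base window T1, an empirical value (step S32)
    while window <= t_max - t_min and remaining:
        # split remaining logs into windows of width `window`
        buckets = {}
        for t, msg in remaining:
            buckets.setdefault(int((t - t_min) // window), set()).add(msg)
        sets = list(buckets.values())
        if len(sets) < 2:  # simplification: one window left, nothing is periodic
            break
        # a message appearing in every window is periodic, hence invalid (step S33)
        for msg in max(sets, key=len):
            if all(msg in s for s in sets):
                invalid.add(msg)
        remaining = [(t, m) for t, m in remaining if m not in invalid]
        n += 1
        window = t1 * (2 ** n)  # T = T1 * 2^N (step S34)
    valid = {m for _, m in remaining}
    return valid, invalid
```

A heartbeat logged every second lands in every window and is classified invalid, while a one-off event survives into the valid set.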
Further, the iterative grouping mining in step S4 is performed as follows:
step S41, respectively taking out valid and invalid log sets from the HDFS, splitting each log according to a blank space, and converting the split log into a one-dimensional array, wherein elements of the one-dimensional array are words or symbols in log information;
step S42, counting the length of each one-dimensional array, and placing the arrays with the same length together to form a two-dimensional array;
step S43, for each two-dimensional array generated in step S42, count the number of distinct elements in each column; suppose column X has the fewest distinct elements (if several columns tie for the fewest, take the one with the smallest index); treat position X as the feature-segmentation position and split the two-dimensional array by the elements at position X, forming several new two-dimensional arrays in each of which all elements of column X are identical;
set a segmentation-coefficient value; if the ratio of the row count of a newly generated two-dimensional array to the row count of its parent (pre-split) two-dimensional array is less than the segmentation coefficient, place the new array temporarily in C_temp and do not process it further;
step S44, for each two-dimensional array generated in step S43, count the number of distinct elements in each column; among the columns that share the same distinct-element count, where that group of columns is the largest, define the two lowest-indexed such columns as bijective positions, denoted P1 and P2, and form one-dimensional arrays C1 and C2 from the elements of columns P1 and P2 respectively;
step S45, if the elements of C1 and C2 are in a one-to-one mapping, split the two-dimensional array by the elements at feature-segmentation position P1 or P2, forming several new two-dimensional arrays;
if a 1-to-M or M-to-1 mapping exists between C1 and C2, where M is a natural number greater than 1, compute the ratio of the number of distinct elements on the M side to the number of columns of the two-dimensional array; if the ratio exceeds a preset upper limit, choose the position corresponding to the 1 side as the feature-segmentation position, and if it is below a preset lower limit, choose the position corresponding to the M side; then split the two-dimensional array by the elements at the chosen position, forming several new two-dimensional arrays;
if an N-to-M mapping exists between C1 and C2, where N and M are both natural numbers greater than 1, keep the two-dimensional array unchanged;
compare the ratio of the row count of each newly generated two-dimensional array to the row count of its parent (pre-split) array with the segmentation coefficient; if the ratio is less than the segmentation coefficient, place the array temporarily in C_temp and do not process it further;
step S46, within each two-dimensional array, count the number of distinct elements in every column; if a column has more than one distinct element, replace its elements with the wildcard "*"; if it has exactly one, keep the column as-is; then merge each two-dimensional array into a one-dimensional array, join its elements with spaces to form the complete log message type, and store the result into the HDFS.
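The core of the grouping steps can be sketched as follows, in plain Python rather than Spark; this covers the split-by-length, minimum-distinct-column split, segmentation coefficient, and final wildcard merge, while the bijective-position refinement is omitted for brevity (all names are illustrative):

```python
def mine_templates(messages, pt=0.15):
    """messages: wildcarded log bodies. Returns a set of log-message-type templates."""
    rows = [m.split() for m in messages]
    # group one-dimensional arrays by length into two-dimensional arrays (step S42)
    by_len = {}
    for r in rows:
        by_len.setdefault(len(r), []).append(r)

    templates = set()
    for group in by_len.values():
        width = len(group[0])
        # split on the column with the fewest distinct elements (step S43)
        distinct = [len({r[c] for r in group}) for c in range(width)]
        x = distinct.index(min(distinct))
        parts = {}
        for r in group:
            parts.setdefault(r[x], []).append(r)
        for part in parts.values():
            if len(part) / len(group) < pt:
                continue  # too few rows to form a type (C_temp in the text)
            # final merge: columns with more than one distinct value become "*"
            merged = [part[0][c] if len({r[c] for r in part}) == 1 else "*"
                      for c in range(width)]
            templates.add(" ".join(merged))
    return templates
```

For example, four three-token logs that differ only in their last token collapse into two templates, one per value of the split column.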
The invention also discloses a spark big data platform-based time window filtering method, which comprises the following steps:
step S31, take data out of the wildcard-processed data temporarily stored in the HDFS (one day's worth, several days' worth, or all of the data, as appropriate) and denote it C;
step S32, set a time window T; on the first execution take the base window T = T1 (for example T1 = 5 s, an empirical value), and divide the fetched log data C by T according to the log-header times, forming K time-window log sets each spanning at most T:
C = {C_1, C_2, ..., C_K}
where C_i denotes the i-th set, K is the number of sets, and a set's span is the difference between the latest and earliest log times in C_i;
step S33, take the log set containing the most logs; for each log in it, if the log appears in every log set, treat it as an invalid log, store it in the temporary invalid log set C_invalid, and delete it from the original log sets;
step S34, set the window anew as T = T1 × 2^N, where N is the number of loop executions (incremented by 1 after each pass), and repeat steps S32-S33; when T exceeds the difference between the earliest log time T_min and the latest log time T_max of the fetched data C, exit the loop and store the remaining log data into the temporary valid log set C_valid.
The invention also discloses an iterative grouping mining method based on the spark big data platform, comprising the following steps:
step S41, respectively taking out valid and invalid log sets from the HDFS, splitting each log according to a blank space, and converting the split log into a one-dimensional array, wherein elements of the one-dimensional array are words or symbols in log information;
step S42, counting the length of each one-dimensional array, and placing the arrays with the same length together to form a two-dimensional array;
step S43, for each two-dimensional array generated in step S42, count the number of distinct elements in each column; suppose column X has the fewest distinct elements (if several columns tie for the fewest, take the one with the smallest index); treat position X as the feature-segmentation position and split the two-dimensional array by the elements at position X, forming several new two-dimensional arrays in each of which all elements of column X are identical;
set a segmentation-coefficient value; if the ratio of the row count of a newly generated two-dimensional array to the row count of its parent (pre-split) two-dimensional array is less than the segmentation coefficient, place the new array temporarily in C_temp and do not process it further;
step S44, for each two-dimensional array generated in step S43, count the number of distinct elements in each column; among the columns that share the same distinct-element count, where that group of columns is the largest, define the two lowest-indexed such columns as bijective positions, denoted P1 and P2, and form one-dimensional arrays C1 and C2 from the elements of columns P1 and P2 respectively;
step S45, if the elements of C1 and C2 are in a one-to-one mapping, split the two-dimensional array by the elements at feature-segmentation position P1 or P2, forming several new two-dimensional arrays;
if a 1-to-M or M-to-1 mapping exists between C1 and C2, where M is a natural number greater than 1, compute the ratio of the number of distinct elements on the M side to the number of columns of the two-dimensional array; if the ratio exceeds a preset upper limit, choose the position corresponding to the 1 side as the feature-segmentation position, and if it is below a preset lower limit, choose the position corresponding to the M side; then split the two-dimensional array by the elements at the chosen position, forming several new two-dimensional arrays;
if an N-to-M mapping exists between C1 and C2, where N and M are both natural numbers greater than 1, keep the two-dimensional array unchanged;
compare the ratio of the row count of each newly generated two-dimensional array to the row count of its parent (pre-split) array with the segmentation coefficient; if the ratio is less than the segmentation coefficient, place the array temporarily in C_temp and do not process it further;
step S46, within each two-dimensional array, count the number of distinct elements in every column; if a column has more than one distinct element, replace its elements with the wildcard "*"; if it has exactly one, keep the column as-is; then merge each two-dimensional array into a one-dimensional array, join its elements with spaces to form the complete log message type, and store the result into the HDFS.
The invention also discloses a cloud platform log information type extraction device based on the spark big data platform, which comprises:
the log preprocessing device is used for preprocessing the offline log data, filtering log entries which cannot be identified, and storing the filtered log data into the HDFS;
the simple wildcard device is used for replacing the conventional variable with a wildcard character, simultaneously performing normalization processing on the log entries to finish simple wildcard processing, and temporarily storing the data after the wildcard processing into the HDFS;
the time window filtering device is used for filtering the time of the data subjected to the wildcard processing according to a time window, filtering and splitting the log data into an effective log set and an invalid log set, and temporarily storing the log data into the HDFS after duplication removal;
and the iterative grouping mining device is used for converting the log sets into one-dimensional arrays whose elements are the log information, grouping the one-dimensional arrays by length to form two-dimensional arrays, performing feature-position segmentation, merging each two-dimensional array back into a one-dimensional array to form a complete log message type, and storing the final results into the HDFS.
Compared with the prior art, the beneficial effects of adopting the technical scheme are as follows:
1) the invention provides a cloud platform log information type extraction method based on a Spark big data platform, which can analyze the log information type of the system in a large amount of off-line log data so as to facilitate automatic analysis and processing of logs;
2) the time window filtering method provided by the invention can identify and segment invalid periodic logs and valid event logs in the cloud platform logs, and can quickly sort invalid log information according to the obtained types;
3) the iterative grouping mining method provided by the invention can identify different log information types efficiently and accurately;
4) because the offline log data volume is huge, the invention introduces Spark big data technology into the preprocessing, filtering and segmentation of the logs, and can obtain results quickly and efficiently;
5) the invention introduces HDFS distributed storage, and stores the original offline log data and the result output in the HDFS distributed storage, thereby being beneficial to the recovery and the use of the data.
Drawings
Fig. 1 is a schematic diagram of a principle of a cloud platform log information type extraction device based on a spark big data platform according to the present invention.
Fig. 2 is a schematic diagram illustrating a log preprocessing principle in an embodiment of the present invention.
Fig. 3 is a schematic diagram of the simple wildcard principle in an embodiment of the present invention.
Fig. 4 is a schematic diagram illustrating a principle of extracting log information types of a cloud platform based on a spark big data platform according to the present invention.
FIG. 5 is a schematic diagram illustrating a time window filtering method according to an embodiment of the present invention.
Fig. 6 is a schematic diagram illustrating a principle of an iterative packet mining method according to an embodiment of the present invention.
Detailed Description
In order to automatically analyze and process cloud platform system logs, the technical scheme of the invention is a device that extracts as many log information types as possible from a large amount of offline cloud platform log data, based on the Spark big data platform. As shown in fig. 1, the device comprises a log preprocessing module, a simple wildcard module, a time window filtering module and an iterative grouping mining module; the original offline log data and the intermediate outputs of the modules are temporarily stored in the HDFS (Hadoop Distributed File System), the final results are stored in the HDFS, and all processing is performed on the Spark big data platform.
Offline log data
The offline log data comprises the historical log data generated by the cloud platform system over a continuous period of time; for example, in a distributed cloud operating system the log files on all distributed servers need to be merged and collected before all log message types of the system can be completely analyzed. If the system is large or has run for a long time, the log data volume can be very large, reaching terabyte scale on a cloud platform.
HDFS
HDFS (Hadoop Distributed File System) stores large volumes of data in blocks, each block with extra replicas to facilitate recovery. The original logs are stored split by day, and the extracted log information types are stored separately as valid and invalid.
Log preprocessing module
As shown in fig. 2, the offline log data must be preprocessed before being stored into the HDFS: header information is identified for each piece of log data, and the preprocessing module performs regular-expression time matching on the header of each log entry (the log header must first be identified by manual experience). Time formats may differ between systems, and even between modules of the same system, so the time format needs to be obtained by a machine learning (or experience-based) method. A typical time-matching regular expression such as r'^\d{4}-\d{2}-\d{2} (([0-1][0-9]|2[0-4]):[0-5][0-9]:[0-5][0-9])(,|\.)(\d+)' matches times of the form "%Y-%m-%d %H:%M:%S,%f". For each log entry, if the time format matches at the head, the entry is stored into the HDFS file for the corresponding day; otherwise the entry is skipped and the next log entry is examined.
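The header-time check can be sketched with Python's `re` module; the pattern below is a best-effort reconstruction of the expression quoted above and may differ in detail from the original filing:

```python
import re

# Best-effort reconstruction of the header-time regex quoted above
# (matches "%Y-%m-%d %H:%M:%S,%f" style timestamps).
TIME_RE = re.compile(
    r'^\d{4}-\d{2}-\d{2} (([0-1][0-9]|2[0-4]):[0-5][0-9]:[0-5][0-9])(,|\.)(\d+)'
)

def has_time_header(line):
    """True if the log line starts with a recognizable timestamp."""
    return TIME_RE.match(line) is not None
```

Entries for which `has_time_header` returns False would be skipped rather than stored.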
Simple wildcard module
As shown in fig. 3, this module performs simple regular replacements on each day's logs: IP addresses, times other than the log header, and standard UUIDs are replaced with the wildcard "*"; characters such as "/", "\t", "=", "(", ")", "[", "]", "{", "}" and double quotes are replaced with spaces, and consecutive spaces are merged into one. The replaced log is then split on spaces into an array; if an element of the array can be parsed as an integer or floating-point number, it is replaced with the wildcard. Finally the array is rejoined with spaces and combined with the header time to form a log of the form "header-time log-information-body", sorted by header time and temporarily stored into the HDFS.
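An illustrative sketch of this normalisation step; the IP/UUID patterns and the exact separator set are assumptions based on the description above:

```python
import re

# Assumed patterns for the "regular variables" named in the text.
IP_RE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')
UUID_RE = re.compile(
    r'\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-'
    r'[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b'
)

def simple_wildcard(body):
    """Replace regular variables with "*", separators with spaces, merge spaces."""
    body = IP_RE.sub('*', body)
    body = UUID_RE.sub('*', body)
    body = re.sub(r'[/\t=()\[\]{}",]', ' ', body)  # separators -> space
    def is_number(tok):
        try:
            float(tok)
            return True
        except ValueError:
            return False
    # split on spaces (also merges runs of spaces), wildcard numeric tokens
    return ' '.join('*' if is_number(tok) else tok for tok in body.split())
```

After this pass, logs that differ only in their variable parts share one normalised form, which is what the later grouping steps rely on.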
Based on the log preprocessing module, the simple wildcard module, the time window filtering module and the iterative grouping mining module, as shown in fig. 4, a cloud platform log information type extraction method based on the spark big data platform comprises the following steps: step S1, preprocess the offline log data, filter out unidentifiable log entries, and store the filtered log data into the HDFS; step S2, replace regular variables (such as IP addresses, UUIDs and time strings) with wildcards while normalizing the log entries, completing simple wildcard processing, and temporarily store the wildcard-processed data into the HDFS; step S3, filter the wildcard-processed data by time window, splitting the log data into a valid log set and an invalid log set, de-duplicate them, and temporarily store them into the HDFS; and step S4, for the valid and invalid log sets obtained in step S3, convert each log set into one-dimensional arrays whose elements are the log information, group the arrays by length to form two-dimensional arrays, perform feature-position segmentation, merge each two-dimensional array back into a one-dimensional array to form a complete log message type, and store the final results into the HDFS.
Example 1:
a time window filtering module: in cloud platform log data, logs can be roughly divided into two categories: periodic logs and event logs. A periodic log is log information output periodically by a cloud platform system program (for example, the states of all virtual machines reported every 30 seconds in a cloud operating system); an event log is generated by some event occurring in the cloud platform system (for example, a user creating a virtual machine in a cloud operating system) and has no periodicity. Periodic logs have little value for log processing and analysis, so we call them invalid logs; event logs are highly valuable, so we call them valid logs. The log information types of both must nevertheless be determined, to support subsequent filtering. The main purpose of the time window filtering module is to separate these two kinds of logs as far as possible.
As shown in fig. 5, the process of time window filtering is as follows:
step S31, taking out data (taking out a day or collecting several days or all data according to the situation) from the data temporarily stored in the HDFS after the wildcard processing, and marking as C;
step S32 is to set a time stamp T, execute the first time of taking the base time stamp T to T1 (for example, T1 to 5S, this value is an empirical value), divide the taken log data C by T according to the time of the log header, and form a plurality of time window log sets with an aggregation time T or less
Figure BDA0002859347980000071
Wherein C isiRepresenting the ith set, K being the number of divided sets, and the set time being the log set CiThe difference between the latest and earliest log times;
in step S33, a log set with the largest number of logs (C is assumed to be used)k) If a log is in all log sets CiK (if the log data C start-stop is incomplete, the first log set C1And the last Log set CkIf the set time is less than T, excluding the first log set C1And the last Log set Ck) If the log is an invalid log, the invalid log is stored in a temporary invalid log set CinvalidMeanwhile, deleting the log in the original log set;
step S34, re-taking the timestamp T-T1 × (2^ N), where N is the number of loop executions (automatically increased by 1 after each calculation), repeating steps S32-S33, if T is greater than the earliest log time T of the fetched data C (data of one or several days or all fetched in step S31)minAnd the latest log time TmaxThe difference exits the loop and the number of the remaining logs is countedBy logging into a temporary valid log collection Cvalid
In the invalid log set Cinvalid and the valid log set Cvalid, in order to reduce the data volume, the header time portion of each log is deleted and only the log information body is retained; redundant duplicate log data in each set is then deleted, and the result is temporarily stored in the HDFS.
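As a rough illustration, the loop of steps S31-S34 can be sketched in Python as follows. The function name, the (time, body) record layout, and the use of a set intersection to test "appears in every window" are our assumptions, not part of the patent; handling of incomplete first/last windows is omitted for brevity.

```python
def time_window_filter(logs, t1=5.0):
    """Sketch of steps S32-S34: logs is a list of (timestamp_seconds, body) pairs.

    A body that occurs in every time window of span <= T is treated as periodic
    (invalid); the window T starts at t1 and doubles on every pass (T = T1 * 2^N).
    Returns de-duplicated valid and invalid log bodies.
    """
    logs = sorted(logs, key=lambda r: r[0])
    t_min, t_max = logs[0][0], logs[-1][0]
    invalid, n = [], 0
    while True:
        t = t1 * (2 ** n)                    # S34: T = T1 * 2^N
        if t > t_max - t_min:                # stop once T exceeds the data's time span
            break
        # S32: split the (remaining) logs into windows whose span is <= t
        windows, current, start = [], [], None
        for rec in logs:
            if start is None:
                current, start = [rec], rec[0]
            elif rec[0] - start <= t:
                current.append(rec)
            else:
                windows.append(current)
                current, start = [rec], rec[0]
        if current:
            windows.append(current)
        # S33: a body present in every window is periodic -> move it to the invalid set
        if len(windows) > 1:
            common = set.intersection(*({body for _, body in w} for w in windows))
            if common:
                invalid.extend(common)
                logs = [r for r in logs if r[1] not in common]
        n += 1
    valid = sorted({body for _, body in logs})  # drop header times, de-duplicate
    return valid, sorted(set(invalid))
```

For example, a heartbeat emitted every second for 30 seconds is caught in the first pass (T = 5 s), while a one-off "created vm" event survives into the valid set.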
Example 2:
an iterative packet mining module: iterative packet mining is a process of continuing to dig deep into the message types of the log based on time window filtering.
As shown in fig. 6, the method of iterative packet mining is:
step S41, splitting: taking the valid log set Cvalid and the invalid log set Cinvalid out of the HDFS respectively, splitting each log on spaces, and converting it into a one-dimensional array whose elements are the words or symbols in the log information;
step S42, element number division: counting the length of each one-dimensional array, and placing the arrays with the same length together to form a two-dimensional array;
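Steps S41-S42 amount to a tokenize-and-bucket pass; a minimal sketch follows (the function and variable names are ours, not from the patent):

```python
from collections import defaultdict

def group_by_length(log_bodies):
    """S41/S42 sketch: split each log body on spaces, bucket token rows by length."""
    groups = defaultdict(list)              # array length -> two-dimensional array
    for body in log_bodies:
        tokens = body.split(" ")            # S41: 1-D array of words/symbols
        groups[len(tokens)].append(tokens)  # S42: same-length rows share a 2-D array
    return dict(groups)
```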
step S43, feature value segmentation: for each two-dimensional array generated in step S42, counting the number of non-repeating elements in each column of the two-dimensional array; assuming the number of non-repeating elements in column X is the smallest (if several columns share the same smallest count, the column with the smallest index is selected), X is regarded as the feature division position, and the two-dimensional array is divided according to the elements at the feature division position X to form a plurality of new two-dimensional arrays, in each of which the elements in column X are all identical. Here we define a partition coefficient PT (PT can be set to 0.15 based on experience): if the ratio of the number of rows of a newly generated two-dimensional array to the number of rows of its parent two-dimensional array is less than PT, we consider the new array to have too few rows to form a log information type; it is temporarily placed into Ctemp and not processed further;
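Step S43 can be sketched as below. The tie-breaking via `index(min(...))` (lowest column index wins) follows the text; the function name and the return of set-aside sub-arrays as a separate list standing in for Ctemp are our assumptions.

```python
def feature_split(rows, pt=0.15):
    """S43 sketch: split a 2-D array on the column with the fewest distinct values.

    Sub-arrays whose row count falls below pt * parent rows are set aside (Ctemp).
    """
    n_cols = len(rows[0])
    distinct = [len({row[c] for row in rows}) for c in range(n_cols)]
    x = distinct.index(min(distinct))       # ties -> smallest column index
    parts = {}                              # value at column x -> new 2-D array
    for row in rows:
        parts.setdefault(row[x], []).append(row)
    keep, c_temp = [], []                   # c_temp stands in for the set-aside pool
    for sub in parts.values():
        (keep if len(sub) / len(rows) >= pt else c_temp).append(sub)
    return keep, c_temp
```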
step S44, bijective segmentation: for each two-dimensional array generated in the previous step, counting the number of non-repeating elements in each column of the two-dimensional array, denoted Qi (i = 1, 2, …, T, where T is the total number of columns); then counting the frequency of the values of Qi (i = 1, 2, …, T), i.e., how often each numerical value occurs among Q1, …, QT (for example, Qi = 3 occurs 5 times and Qi = 6 occurs 3 times). Assume the most frequent value is taken by Qa, Qb, Qc, … (a, b, c are values between 1 and T); that is, several columns have the same number of distinct elements, and this number occurs in the largest number of columns. We define the two such columns with the smallest column indices as the bijective positions, denoted P1 and P2; the elements in columns P1 and P2 constitute the one-dimensional array C1 and the one-dimensional array C2, respectively. The mapping relationship between the elements of the one-dimensional arrays C1 and C2 now needs to be examined.
If the mapping is a 1-1 mapping (i.e., the elements of the one-dimensional arrays C1 and C2 correspond one to one), the selected bijective positions can be regarded as feature division positions, and the two-dimensional array is divided according to the elements at a feature division position (P1, or equivalently P2) to form a plurality of new two-dimensional arrays;
if the mapping is a "1-M" or "M-1" mapping (i.e., a one-to-many or many-to-one correspondence exists between the elements of the one-dimensional arrays C1 and C2), the "1" end position necessarily holds a constant (if the mapping from C1 to C2 is one-to-many, i.e., "1-M", the "1" end position is the position P1 corresponding to the one-dimensional array C1), while the "M" end position may hold either a constant or a variable. We therefore calculate the ratio of the number of non-repeating elements in the "M" end set to the number of rows of the two-dimensional array: if the ratio is greater than a preset UpperBound (e.g., UpperBound = 0.9), the "M" end is considered a variable and the feature division position is the "1" end; if the ratio is less than a preset LowerBound (e.g., LowerBound = 0.1), the "M" end is considered a constant and the feature division position is the "M" end (for a "1-M" mapping, position P2; for an "M-1" mapping, position P1). The two-dimensional array is then divided according to the elements at the feature division position to form a plurality of new two-dimensional arrays;
if the mapping is an "N-M" mapping (i.e., a many-to-many correspondence exists between the elements of the one-dimensional arrays C1 and C2), there is no obvious feature division position; the two-dimensional array is retained directly and temporarily placed into Ctemp, with no subsequent processing.
Finally, the ratio of the total number of rows of each newly generated two-dimensional array to the total number of rows of its parent two-dimensional array (the two-dimensional array before division) is compared with the partition coefficient PT; if the ratio is less than PT, the new two-dimensional array is temporarily placed into Ctemp and not processed further;
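A possible rendering of the mapping test in step S44 follows. The thresholds UpperBound/LowerBound come from the text; the function name, the use of the row count as the denominator of the ratio (the translated text literally says "columns", which we read as a slip), and the fallback when the ratio lands between the two bounds are our assumptions.

```python
def bijective_split_column(rows, p1, p2, upper=0.9, lower=0.1):
    """S44 sketch: classify the mapping between columns p1 and p2 of a 2-D array.

    Returns the column index to split on, or None for an "N-M" mapping
    (the array would then be parked in Ctemp).
    """
    fwd, bwd = {}, {}
    for row in rows:
        fwd.setdefault(row[p1], set()).add(row[p2])
        bwd.setdefault(row[p2], set()).add(row[p1])
    one_to_many = any(len(v) > 1 for v in fwd.values())   # C1 -> C2 is "1-M"
    many_to_one = any(len(v) > 1 for v in bwd.values())   # C1 -> C2 is "M-1"
    if not one_to_many and not many_to_one:
        return p1                      # 1-1 mapping: either position works
    if one_to_many and many_to_one:
        return None                    # N-M mapping: no clear split position
    m_col = p2 if one_to_many else p1  # the "M" end column
    one_col = p1 if one_to_many else p2
    ratio = len({row[m_col] for row in rows}) / len(rows)
    if ratio > upper:
        return one_col                 # "M" end looks like a variable -> split "1" end
    if ratio < lower:
        return m_col                   # "M" end looks like a constant -> split "M" end
    return one_col                     # in-between case: default to "1" end (assumption)
```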
step S45, merging the two-dimensional arrays to form the log information types: the two-dimensional arrays generated in step S44 and those in Ctemp are summarized and merged. Within each two-dimensional array, the number of non-repeating elements in each column is counted; if the count for a column is not 1, the elements of that column are changed into a wildcard character; if it is 1, the element is retained directly. Each two-dimensional array is thus merged into a one-dimensional array, whose elements are finally joined with spaces to form a complete log message type. These message types are ultimately saved to the HDFS.
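The merge in step S45 reduces each surviving two-dimensional array to one template line; a sketch is below. The concrete wildcard character "*" is our assumption (the text only says "wildcard"), as is the function name.

```python
def to_message_type(rows, wildcard="*"):
    """S45 sketch: constant columns keep their token, varying columns become a wildcard."""
    columns = zip(*rows)                # transpose to iterate column-wise
    tokens = [col[0] if len(set(col)) == 1 else wildcard for col in columns]
    return " ".join(tokens)             # join the 1-D array with spaces
```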
The invention is not limited to the foregoing embodiments. The invention extends to any novel feature, or any novel combination of features, disclosed in this specification, and to any novel step, or any novel combination of steps, of any method or process disclosed. Those skilled in the art will appreciate that insubstantial changes or modifications can be made without departing from the spirit of the invention as defined by the appended claims.

Claims (9)

1. The method for extracting the log information type based on the spark big data platform is characterized by comprising the following steps:
step S1, preprocessing the off-line log data, filtering out unidentifiable log entries, and storing the filtered log data into the HDFS;
step S2, replacing the conventional variables with wildcards, and meanwhile, conducting normalization processing on the log entries to complete simple wildcard processing, and temporarily storing the data after the wildcard processing into the HDFS;
step S3, filtering the wildcard-processed data according to a time window, splitting the log data into a valid log set and an invalid log set, and temporarily storing them into the HDFS after de-duplication;
and step S4, for the valid log set and the invalid log set obtained in step S3, converting each log into a one-dimensional array whose elements are taken from the log information, combining the one-dimensional arrays by array length to form two-dimensional arrays, performing feature position segmentation, merging each two-dimensional array back into a one-dimensional array to form a complete log message type, and storing the final result into the HDFS.
2. The spark big data platform-based log information type extraction method as claimed in claim 1, wherein the filtering in step S1 is performed as follows: regular-expression matching is performed on the time in the header information of each log of the log data, and the log is put into the HDFS if the matching succeeds.
3. The method for extracting log information type based on spark big data platform as claimed in claim 1, wherein the filtering according to time window in step S3 is performed by:
step S31, taking out data from the data temporarily stored in the HDFS after the wildcard processing, and recording as C;
step S32, setting a timestamp T: on the first execution, take the base timestamp T = T1, divide the fetched log data C by T according to the time in each log header, and form several time-window log sets whose aggregation time is at most T:

C = C1 ∪ C2 ∪ … ∪ CK,

where Ci denotes the i-th set, K is the number of sets after division, and the aggregation time of a set is the difference between the latest and earliest log times in the log set Ci;
step S33, for each log in the log set with the largest number of logs: if the log appears in every log set, it is considered an invalid log, the invalid log is stored into a temporary invalid log set Cinvalid, and it is simultaneously deleted from the original log sets;
step S34, re-taking the timestamp T = T1 × 2^N, where N is the number of loop executions, and repeating steps S32-S33; if T is greater than the difference between the latest log time Tmax and the earliest log time Tmin of the fetched data C, the loop exits and the remaining log data is stored into a temporary valid log set Cvalid.
4. The spark big data platform based log information type extraction method as claimed in claim 3, wherein in step S31, when data is extracted from the data temporarily stored in the HDFS after the wildcard processing, one or more days or all of the data are extracted.
5. The spark big data platform-based log information type extraction method as claimed in claim 3, wherein in step S33, if the log data C is incomplete at its start or end and the aggregation time of the first log set C1 or the last log set CK is less than T, the first log set C1 and the last log set CK are excluded.
6. The spark big data platform-based log information type extraction method as claimed in claim 1, wherein the iterative packet mining method in step S4 is:
step S41, respectively taking out valid and invalid log sets from the HDFS, splitting each log according to a blank space, and converting the split log into a one-dimensional array, wherein elements of the one-dimensional array are words or symbols in log information;
step S42, counting the length of each one-dimensional array, and placing the arrays with the same length together to form a two-dimensional array;
step S43, for each two-dimensional array generated in the step S42, counting the number of non-repetitive elements on each column in the two-dimensional array, assuming that the number of non-repetitive elements on the X-th column is the minimum, considering that the X-th column is a feature division position, dividing the two-dimensional array according to the elements on the feature division position X to form a plurality of new two-dimensional arrays, wherein the elements on the X-th column of the newly generated two-dimensional arrays are all the same;
setting a partition coefficient value; if the ratio of the number of rows of a newly generated two-dimensional array to the number of rows of its parent two-dimensional array is smaller than the partition coefficient value, the newly generated two-dimensional array is temporarily placed into Ctemp and not processed further;
step S44, for each two-dimensional array generated in step S43, counting the number of non-repetitive elements in each column of the two-dimensional array, and if the number of elements in some columns is consistent and the number of columns with consistent number of elements is the largest, defining the two columns with the smallest number of columns as bijective positions, which are marked as positions P1 and P2, and forming a one-dimensional array C1 and a one-dimensional array C2 by the elements in the P1 and P2 columns, respectively;
step S45, if the elements in the one-dimensional array C1 and the one-dimensional array C2 are in one-to-one mapping relation, the two-dimensional array is divided according to the elements on the characteristic division position P1 or position P2 to form a plurality of new two-dimensional arrays;
if a 1-to-M or M-to-1 mapping relationship exists between the elements of the set C1 and the set C2, wherein M is a natural number larger than 1, the ratio of the number of non-repeating elements in the "M" end set to the number of rows of the two-dimensional array is calculated; if the ratio is larger than a preset upper bound, the position corresponding to the "1" end set is selected as the feature division position, and if the ratio is smaller than a preset lower bound, the position corresponding to the "M" end set is selected as the feature division position; the two-dimensional array is then divided according to the elements at the feature division position to form a plurality of new two-dimensional arrays;
if the mapping relation of the element N to the element M exists in the set C1 and the set C2, wherein N, M are both natural numbers larger than 1, the two-dimensional array is reserved;
comparing the ratio of the total number of rows of each newly generated two-dimensional array to the total number of rows of its parent two-dimensional array with the partition coefficient value; if the ratio is less than the partition coefficient value, the two-dimensional array is temporarily placed into Ctemp and not processed further;
step S46, counting, within each two-dimensional array, the number of non-repeating elements in each column; if the count for a column is not 1, the elements of that column are changed into wildcards; if it is 1, they are retained directly; each two-dimensional array is then merged into a one-dimensional array, the elements of the one-dimensional array are joined with spaces, and the result is stored into the HDFS.
7. The spark big data platform-based log information type extraction method as claimed in claim 6, wherein in step S43, if there are a plurality of columns sharing the same smallest count, the column with the smallest index is selected as the feature division position.
8. The log information type mining method based on spark big data platform is characterized by comprising the following steps:
step S41, respectively taking out valid and invalid log sets from the HDFS, splitting each log according to a blank space, and converting the split log into a one-dimensional array, wherein elements of the one-dimensional array are words or symbols in log information;
step S42, counting the length of each one-dimensional array, and placing the arrays with the same length together to form a two-dimensional array;
step S43, for each two-dimensional array generated in step S42, counting the number of non-repetitive elements in each column in the two-dimensional array, assuming that the number of non-repetitive elements in the X-th column is the smallest (if there are a plurality of columns with the same number and the smallest number, the column with the smallest sequence number is selected), considering that the position X is the feature division position, dividing the two-dimensional array according to the elements in the feature division position X to form a plurality of new two-dimensional arrays, wherein the elements in the X-th column of the newly generated two-dimensional arrays are all the same;
setting a partition coefficient value; if the ratio of the number of rows of a newly generated two-dimensional array to the number of rows of its parent two-dimensional array is smaller than the partition coefficient value, the newly generated two-dimensional array is temporarily placed into Ctemp and not processed further;
step S44, for each two-dimensional array generated in step S43, counting the number of non-repetitive elements in each column of the two-dimensional array, and if the number of elements in some columns is consistent and the number of columns with consistent number of elements is the largest, defining the two columns with the smallest number of columns as bijective positions, which are marked as positions P1 and P2, and forming a one-dimensional array C1 and a one-dimensional array C2 by the elements in the P1 and P2 columns, respectively;
step S45, if the elements in the one-dimensional array C1 and the one-dimensional array C2 are in one-to-one mapping relation, the two-dimensional array is divided according to the elements on the characteristic division position P1 or position P2 to form a plurality of new two-dimensional arrays;
if a 1-to-M or M-to-1 mapping relationship exists between the elements of the set C1 and the set C2, wherein M is a natural number larger than 1, the ratio of the number of non-repeating elements in the "M" end set to the number of rows of the two-dimensional array is calculated; if the ratio is larger than a preset upper bound, the position corresponding to the "1" end set is selected as the feature division position, and if the ratio is smaller than a preset lower bound, the position corresponding to the "M" end set is selected as the feature division position; the two-dimensional array is then divided according to the elements at the feature division position to form a plurality of new two-dimensional arrays;
if the mapping relation of the element N to the element M exists in the set C1 and the set C2, wherein N, M are both natural numbers larger than 1, the two-dimensional array is reserved;
comparing the ratio of the total number of rows of each newly generated two-dimensional array to the total number of rows of its parent two-dimensional array with the partition coefficient value; if the ratio is less than the partition coefficient value, the two-dimensional array is temporarily placed into Ctemp and not processed further;
step S46, counting, within each two-dimensional array, the number of non-repeating elements in each column; if the count for a column is not 1, the elements of that column are changed into wildcards; if it is 1, they are retained directly; each two-dimensional array is then merged into a one-dimensional array, the elements of the one-dimensional array are joined with spaces, and the result is stored into the HDFS.
9. The spark big data platform-based log information type mining method as claimed in claim 8, wherein in step S43, if there are a plurality of columns sharing the same smallest count, the column with the smallest index is selected as the feature division position.
CN202011560919.8A 2020-12-25 2020-12-25 Log information type extraction method and mining method based on spark big data platform Active CN112632020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011560919.8A CN112632020B (en) 2020-12-25 2020-12-25 Log information type extraction method and mining method based on spark big data platform


Publications (2)

Publication Number Publication Date
CN112632020A true CN112632020A (en) 2021-04-09
CN112632020B CN112632020B (en) 2022-03-18

Family

ID=75325159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011560919.8A Active CN112632020B (en) 2020-12-25 2020-12-25 Log information type extraction method and mining method based on spark big data platform

Country Status (1)

Country Link
CN (1) CN112632020B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636494A (en) * 2015-03-04 2015-05-20 浪潮电子信息产业股份有限公司 Spark-based log auditing and reversed checking system for big data platforms
US9135324B1 (en) * 2013-03-15 2015-09-15 Ca, Inc. System and method for analysis of process data and discovery of situational and complex applications
WO2017092444A1 (en) * 2015-12-02 2017-06-08 中兴通讯股份有限公司 Log data mining method and system based on hadoop
CN107480190A (en) * 2017-07-11 2017-12-15 国家计算机网络与信息安全管理中心 A kind of filter method and device of non-artificial access log
CN107704594A (en) * 2017-10-13 2018-02-16 东南大学 Power system daily record data real-time processing method based on SparkStreaming
CN108399199A (en) * 2018-01-30 2018-08-14 武汉大学 A kind of collection of the application software running log based on Spark and service processing system and method
CN109271349A (en) * 2018-09-29 2019-01-25 四川长虹电器股份有限公司 A kind of rules process method based on log versatility regulation engine
CN109460339A (en) * 2018-10-16 2019-03-12 北京趣拿软件科技有限公司 The streaming computing system of log
CN110427298A (en) * 2019-07-10 2019-11-08 武汉大学 A kind of Automatic Feature Extraction method of distributed information log
CN110690984A (en) * 2018-07-05 2020-01-14 上海宝信软件股份有限公司 Spark-based big data weblog acquisition, analysis and early warning method and system
CN111950263A (en) * 2020-08-10 2020-11-17 中山大学 Log analysis method and system and electronic equipment
CN111949633A (en) * 2020-08-03 2020-11-17 杭州电子科技大学 ICT system operation log analysis method based on parallel stream processing
CN112069048A (en) * 2020-09-09 2020-12-11 北京明略昭辉科技有限公司 Log processing method, device and storage medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
OIFENGO: "Spark-based analysis of Sogou log data", https://blog.csdn.net/weixin_39381833/article/details/85938712 *
WANG RUI et al.: "Model Construction and Data Management of Running Log in Supporting SaaS Software Performance Analysis", DOI:10.18293/SEKE2017-128 *
Lin Zongmiu et al.: "Research and design of a Spark-based network log analysis platform", Automation & Instrumentation *
Tu Jinlin: "Spark-based analysis and processing of power system log data", China Masters' Theses Full-text Database, Engineering Science and Technology II *

Also Published As

Publication number Publication date
CN112632020B (en) 2022-03-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant