CN117473094B - Log classification method and system - Google Patents


Info

Publication number
CN117473094B
Authority
CN
China
Prior art keywords
log
classified
prime
congruence
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311811547.5A
Other languages
Chinese (zh)
Other versions
CN117473094A (en
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Juming Network Technology Co ltd
Original Assignee
Nanjing Juming Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Juming Network Technology Co ltd
Priority to CN202311811547.5A
Publication of CN117473094A
Application granted
Publication of CN117473094B
Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The application discloses a log classification method and system. The method comprises the following steps: a log acquisition node generates a prime number set and forms it into a prime matrix; a congruence operation is performed on the logs to be classified using the prime matrix; the logs to be classified, the prime matrix and the congruence operation result are transmitted to a storage node for storage; the storage node initiates a classification request to a cloud service node and uploads the logs to be classified, the prime matrix and the congruence operation result to the cloud service node using an encryption algorithm; the cloud service node divides the logs to be classified according to the prime matrix and the congruence operation result; and the division results are clustered using a clustering algorithm. Computing tasks are thereby balanced and shared across nodes, a simple encryption algorithm replaces a complex one for communication, and clustering is based on the congruence operation results, which improves the accuracy of classifying new patterns and reduces the computational load. The application thus addresses the technical problems of low accuracy in classifying new patterns and heavy computational load.

Description

Log classification method and system
Technical Field
The application relates to the field of network security, in particular to a log classification method and system.
Background
In the field of network security, analyzing and generalizing the logs or alarms generated by various systems, devices or products is one of the most basic capabilities. Whether logs can be reasonably classified or clustered determines whether they can be correctly generalized and whether users can be given the key information they need. Related research and methods therefore continue to be proposed, mainly focusing on the automatic classification of log information.
At present, general automatic log classification mainly applies the TF-IDF method to the text after word segmentation. Although the TF-IDF-based method is simple and fast, it cannot reflect the word-order relationship between tokens, which may lead to poor results and misclassification. A more accurate approach vectorizes the word segmentation results and combines hidden Markov models, recurrent neural networks or long short-term memory networks to classify logs of different possible patterns.
In addition, log classification is computationally expensive, while the distributed acquisition devices that collect log information directly have limited computing power. They can only work from specific, pre-organized generalization models, and classifying a new, historically unrecognized pattern directly on them harms classification accuracy. Classification is therefore usually performed on dedicated nodes with more computing power, which may be local to the user or in the cloud. Even so, the computing tasks remain too concentrated, and a relatively strong encryption algorithm is used during transmission, so the computational load on both the cloud and the acquisition devices is large and practical needs are not met.
For the problems in the related art of low accuracy in classifying new patterns and heavy computational load, no effective solution has yet been proposed.
Disclosure of Invention
The main purpose of the application is to provide a log classification method and system that solve the problems of low accuracy in classifying new patterns and heavy computational load.
To achieve the above object, according to one aspect of the present application, there is provided a log classification method.
The log classification method according to the application comprises the following steps: the log acquisition node generates a prime number set and forms the prime number set into a prime number matrix; performing congruence operation on the logs to be classified by using the prime number matrix; transmitting the logs to be classified, the prime matrix and the congruence operation result to a storage node for storage; the storage node initiates a classification request to the cloud service node, and uploads a log to be classified, a prime matrix and a congruence operation result to the cloud service node by using an encryption algorithm; the cloud service node divides the logs to be classified according to the prime matrix and the congruence operation result; and clustering the division results by using a clustering algorithm.
Further, obtaining the logs to be classified includes: the log acquisition node performs word segmentation on the original logs and removes stop words to obtain the logs to be classified.
Further, performing the congruence operation on the logs to be classified using the prime matrix further comprises: performing a finite-field number-theoretic inverse operation on the congruence operation result to generate check data; transmitting the logs to be classified, the prime matrix, the congruence operation result and the check data to the storage node for storage; the storage node initiates a classification request to the cloud service node and uploads the logs to be classified, the prime matrix, the congruence operation result and the check data to the cloud service node using an encryption algorithm; the cloud service node performs number-theoretic inverse verification of the logs to be classified based on the check data; the logs to be classified that pass verification are divided according to the congruence operation result; and the division results are clustered using a clustering algorithm.
Further, the prime number sets are regenerated at a fixed frequency.
Further, performing the congruence operation on the logs to be classified using the prime matrix comprises: defining a prime matrix $P = \{p_{ij}\}$; defining the logs to be classified $W = \{w_i\}$ and converting them into a full string set $S = \{c_i\}$; grouping the strings in $S$ into 16-byte packets and padding any packet shorter than 16 bytes with a preset string; performing the congruence operation on one packet in $S$ with the column $\{p_{\cdot j}\}$, and then performing the congruence operation on the next packet with the column $\{p_{\cdot,(j+1) \bmod n}\}$.
Further, the clustering algorithm is a DBSCAN algorithm, a cosine similarity algorithm or a Jaccard similarity algorithm.
Further, after the division results are clustered using the clustering algorithm, the method further comprises: the cloud service node transmits the clustering operation result back to the storage node using an encryption algorithm; and the storage node stores data according to the clustering operation result.
Further, the columns of the prime matrix are longitudinal congruence operation factors, used to perform simple congruence calculations with different moduli on the same block of content to be calculated; and the rows of the prime matrix are transverse congruence calculation segments, used to calculate over different content to be calculated.
To achieve the above object, according to another aspect of the present application, there is provided a log classification system.
The log classification system according to the present application includes: the log acquisition node is used for generating a prime number set and forming the prime number set into a prime number matrix; performing congruence operation on the logs to be classified by using the prime number matrix; transmitting the logs to be classified, the prime matrix and the congruence operation result to a storage node for storage; the storage node is used for initiating a classification request to the cloud service node, and uploading the log to be classified, the prime matrix and the congruence operation result to the cloud service node by using an encryption algorithm; the cloud service node is used for dividing logs to be classified according to the prime matrix and the congruence operation result; and clustering the division results by using a clustering algorithm.
Further, the log acquisition node is also used for performing a finite-field number-theoretic inverse operation on the congruence operation result to generate check data, and for transmitting the logs to be classified, the prime matrix, the congruence operation result and the check data to the storage node for storage; the storage node is also used for initiating a classification request to the cloud service node and uploading the logs to be classified, the prime matrix, the congruence operation result and the check data to the cloud service node using an encryption algorithm; the cloud service node is also used for performing number-theoretic inverse verification of the logs to be classified based on the check data, dividing the logs to be classified that pass verification according to the congruence operation result, and clustering the division results using a clustering algorithm.
In the embodiments of the application, a multi-level log classification architecture in which log acquisition nodes, storage nodes and cloud service nodes cooperate is adopted, so that computing tasks can be balanced and shared. Because a degree of congruence operation is applied to the logs to be classified using several different prime moduli, a simple encryption algorithm can replace a complex one for communication, and clustering is performed on the congruence operation results. This achieves the technical effects of improving the accuracy of classifying new patterns and reducing the computational load, thereby solving the technical problems of low accuracy in classifying new patterns and heavy computational load.
Drawings
The accompanying drawings, which form a part of this application, are included to provide a further understanding of the application, so that its other features, objects and advantages become more apparent. The drawings of the illustrative embodiments of the present application and their descriptions are for the purpose of illustrating the present application and are not to be construed as unduly limiting the present application. In the drawings:
FIG. 1 is a flow diagram of a log classification method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a log classification device according to an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments are described below in detail with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the present application described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the present application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal" and the like indicate an azimuth or a positional relationship based on that shown in the drawings. These terms are only used to better describe the present invention and its embodiments and are not intended to limit the scope of the indicated devices, elements or components to the particular orientations or to configure and operate in the particular orientations.
Also, some of the terms described above may be used to indicate other meanings in addition to orientation or positional relationships, for example, the term "upper" may also be used to indicate some sort of attachment or connection in some cases. The specific meaning of these terms in the present invention will be understood by those of ordinary skill in the art according to the specific circumstances.
Furthermore, the terms "mounted," "configured," "provided," "connected," "coupled," and "sleeved" are to be construed broadly. For example, it may be a fixed connection, a removable connection, or a unitary construction; may be a mechanical connection, or an electrical connection; may be directly connected, or indirectly connected through intervening media, or may be in internal communication between two devices, elements, or components. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
According to an embodiment of the present invention, there is provided a log classification method, as shown in fig. 1, including steps S101 to S106 as follows:
Step S101, the log acquisition node generates a prime number set and forms it into a prime matrix;
Specifically, the log acquisition node may automatically generate the prime number set and regenerate it at a fixed frequency, that is, at intervals such as hourly or daily, depending on the user's security requirements. This prevents differential attacks to some extent.
The generated prime number set is arranged into an m × n matrix, whose columns are called longitudinal congruence operation factors and whose rows are called transverse congruence calculation segments. To speed up calculation, the numbers of rows and columns should not be too large, for example m = 3 and n = 3. The selected primes should not be too large either; this patent specifies that they lie within $2^{32}$.
Note that prime numbers are chosen as moduli because the finite-field property of the residue system modulo a prime lets the calculation results be distributed relatively uniformly. Quadratic residues could also be used, but they require more computation and yield an unbalanced distribution.
The acquisition node generates a set of prime moduli (not too large, to keep calculation efficient) and organizes it into an m × n matrix. Each column is called a longitudinal congruence operation factor; its function is to perform simple congruence calculations with different moduli on the same block of content to be calculated, which guarantees clustering accuracy to a certain extent. Each row is called a transverse congruence calculation segment; its function is mainly to calculate over different content to be calculated, which prevents differential attacks to a certain extent. Preventing a differential attack here means that even if an attacker obtains the decrypted congruence calculation results (the results obtained from the prime number set and the logs to be classified), the attacker cannot use known information to deduce the original information by calculation, even when that information is not encrypted.
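As an illustration of step S101, the following minimal Python sketch builds an m × n matrix of random primes below 2^32. The function names, the lower bound of 2^16 on candidates, the use of trial division and the optional seed are assumptions made for this example; the text only states that the primes should lie within 2^32 and that m and n should be small.

```python
import random

def is_prime(n):
    """Trial division; adequate for candidates below 2**32."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def random_prime(rng, low=2**16, high=2**32):
    """Draw random odd candidates in [low, high) until one is prime.

    The lower bound is an assumption of this sketch, not taken from the text.
    """
    while True:
        candidate = rng.randrange(low, high) | 1
        if is_prime(candidate):
            return candidate

def generate_prime_matrix(m=3, n=3, seed=None):
    """Return an m x n list of lists of primes below 2**32.

    Entries are drawn independently, so they are not necessarily distinct.
    """
    rng = random.Random(seed)
    return [[random_prime(rng) for _ in range(n)] for _ in range(m)]

matrix = generate_prime_matrix(3, 3)
```

Regenerating the set at a fixed frequency would then amount to calling `generate_prime_matrix` again on whatever schedule the deployment uses, for example hourly or daily.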
The embodiments of the application target only the classification of unrecognized patterns, which simplifies the handling of unidentified and new types of logs in the system. Otherwise, the staff who write generalization scripts face a large amount of unclassified information and find it difficult to cover newly appearing log types in one pass. It should be noted that this classification is unsupervised.
It should be understood that generalizable logs are generalized further by the deployed log acquisition nodes from the related logs they collect. Because generalization depends on patterns that have been identified historically, logs of identified patterns are typically processed on a regular basis. Such patterns are based on historical accumulation: the product provider performs a detailed analysis of historically obtained logs and derives from them the relevant classification information and other content, such as TCP/IP five-tuples, user information and file access information.
It will be appreciated by those skilled in the art that a prime number is an integer greater than 1 that is divisible only by 1 and itself.
It will be appreciated by those skilled in the art that congruence means the following: given a positive integer m, if two integers a and b are such that a - b is divisible by m, that is, (a - b)/m is an integer, then a is said to be congruent to b modulo m, denoted a ≡ b (mod m). Congruence modulo m is an equivalence relation on the integers.
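The definition of congruence can be checked numerically. The short Python sketch below is only an illustration; the helper name `is_congruent` is chosen for this example.

```python
def is_congruent(a, b, m):
    """Return True when a ≡ b (mod m), that is, when m divides a - b."""
    if m <= 0:
        raise ValueError("modulus must be a positive integer")
    return (a - b) % m == 0

print(is_congruent(38, 14, 12))  # True: 38 - 14 = 24 is divisible by 12
print(is_congruent(38, 15, 12))  # False: 38 - 15 = 23 is not divisible by 12
```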
Step S102, congruence operation is carried out on the logs to be classified by utilizing a prime matrix;
Specifically, performing the congruence operation on the logs to be classified using the prime matrix comprises:
defining a prime matrix $P = \{p_{ij}\}$; the number of primes is the product of the number of rows and the number of columns, and the primes are written in matrix form:
$$P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m1} & p_{m2} & \cdots & p_{mn} \end{pmatrix}$$
these primes are not necessarily all different;
defining the logs to be classified $W = \{w_i\}$ and converting them into a full string set $S = \{c_i\}$, whose length in bytes is written $\mathrm{len}(S)$;
grouping the strings in $S$ into packets of 16 bytes and padding any packet shorter than 16 bytes with a preset string;
for one packet in $S$, performing the congruence operation with the column $\{p_{\cdot j}\}$ and concatenating the results; the number of possible values of a packet thus drops from $2^{128}$ to at most $\prod_{i=1}^{m} p_{ij}$, because the column contains $m$ primes (the prime matrix has $m$ rows);
for the next packet, performing the congruence operation with the column $\{p_{\cdot,(j+1) \bmod n}\}$, and so on, cycling through all packets so that successive packets use different columns of the prime matrix.
The processing of one packet is illustrated below (the original packet content is denoted $seg_k$, and $j$ is the column used for that packet):
$$seg_k \equiv a_{1k} \pmod{p_{1j}}$$
$$seg_k \equiv a_{2k} \pmod{p_{2j}}$$
$$\cdots$$
$$seg_k \equiv a_{mk} \pmod{p_{mj}}$$
Finally, the packet $seg_k$ is represented as $a_{1k} a_{2k} \cdots a_{mk}$. Other packets are handled in the same way, except that the primes of a different column of the matrix are used;
after all packets are processed, $seg_1 seg_2 \cdots seg_K$ is converted into $a_{11} a_{21} \cdots a_{m1} \cdots a_{1K} a_{2K} \cdots a_{mK}$, where $K$ is the number of packets:
$$K = \begin{cases} \mathrm{len}(S)/16, & 16 \mid \mathrm{len}(S) \\ [\mathrm{len}(S)/16] + 1, & 16 \nmid \mathrm{len}(S) \end{cases}$$
where $\nmid$ indicates that it is not divisible and '[ ]' is a rounding operation.
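The packet-wise congruence operation of step S102 can be sketched in Python as follows. Several details are assumptions of this example rather than statements from the text: tokens are joined without a separator and encoded as UTF-8, each 16-byte packet is read as a big-endian integer, the preset padding string is a single zero byte, columns are indexed from 0, and the first packet uses column `start_col`.

```python
BLOCK_SIZE = 16  # bytes per packet, as stated in the text

def congruence_encode(tokens, prime_matrix, pad=b"\x00", start_col=0):
    """Reduce each 16-byte packet modulo the m primes of one matrix column.

    Returns one tuple (a_1k, ..., a_mk) per packet k. Packet k uses
    column (start_col + k) mod n, so successive packets cycle through
    the columns of the prime matrix.
    """
    m = len(prime_matrix)
    n = len(prime_matrix[0])
    data = "".join(tokens).encode("utf-8")
    # Pad the last packet up to 16 bytes (pad is assumed to be one byte).
    remainder = len(data) % BLOCK_SIZE
    if remainder:
        data += pad * (BLOCK_SIZE - remainder)
    residues = []
    for k in range(len(data) // BLOCK_SIZE):
        packet = data[k * BLOCK_SIZE:(k + 1) * BLOCK_SIZE]
        seg = int.from_bytes(packet, "big")
        col = (start_col + k) % n
        residues.append(tuple(seg % prime_matrix[i][col] for i in range(m)))
    return residues

# Toy 2 x 2 matrix of small primes, for illustration only.
example = congruence_encode(["Failed", "password", "root"], [[5, 7], [11, 13]])
```

With 18 bytes of input the example yields two packets: the first is reduced modulo the primes of column 0 (5 and 11), the second modulo those of column 1 (7 and 13).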
Step S103, transmitting the log to be classified, the prime matrix and the congruence operation result to a storage node for storage;
The log acquisition node transmits the congruence operation result, the prime matrix and the original content of the logs to be classified to the storage node, and the storage node stores the data.
Step S104, the storage node initiates a classification request to the cloud service node, and the log to be classified, the prime matrix and the congruence operation result are uploaded to the cloud service node by using an encryption algorithm;
The storage node initiates a classification request task to the cloud service node and, at the same time, uploads the logs to be classified, the prime matrix and the congruence operation result to the cloud service node using a simple encryption algorithm. The task uploaded by the storage node may include data from several different log acquisition nodes.
It should be understood that, because a degree of congruence operation has been applied to the words to be classified using several different prime moduli, the network transmission of the prime modulus set can use a general commercial or national cryptographic algorithm. Such an algorithm requires much less computation than other encryption algorithms such as 3DES, AES or homomorphic encryption (a common homomorphic method being the Paillier algorithm).
Step S105, the cloud service node divides the logs to be classified according to the prime matrix and the congruence operation result;
Step S106, a clustering operation is performed on the division results using a clustering algorithm.
The data are divided according to the upload task and its prime matrix: data to be classified that share the same prime matrix form one classification calculation task. The classification task clusters the data using algorithms such as DBSCAN, cosine similarity or Jaccard similarity; if the similarity of two pieces of data exceeds a set threshold (for example 85%), they are placed in the same class. The prime matrix must be uploaded, rather than an acquisition node identifier, because the prime matrix may change dynamically.
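Steps S105 and S106 can be illustrated with a small standard-library Python sketch. It groups uploaded records by their prime matrix and then applies a greedy single-pass grouping with Jaccard similarity and the 85% threshold mentioned above. The greedy grouping is a simplified stand-in chosen for this example and is not DBSCAN; the record layout (a pair of prime matrix and residue list) and all function names are likewise assumptions.

```python
from collections import defaultdict

def jaccard(a, b):
    """Jaccard similarity of two collections of hashable items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def partition_by_matrix(records):
    """Group (prime_matrix, residues) records into one task per matrix."""
    tasks = defaultdict(list)
    for matrix, residues in records:
        key = tuple(tuple(row) for row in matrix)  # hashable form of the matrix
        tasks[key].append(residues)
    return tasks

def threshold_cluster(items, threshold=0.85):
    """Greedy grouping: an item joins the first cluster whose first member
    it resembles with similarity above the threshold, else starts a new one."""
    clusters = []
    for item in items:
        for cluster in clusters:
            if jaccard(item, cluster[0]) > threshold:
                cluster.append(item)
                break
        else:
            clusters.append([item])
    return clusters
```

Only records that share a prime matrix are compared with one another, since residues computed under different matrices are not comparable.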
It should be appreciated that a three-level structure is adopted to classify the original log information. The structure comprises a log acquisition component, an intermediate storage node and a cloud log classification computing node. The log acquisition component and the intermediate storage node are deployed in the user's internal environment, and the cloud log classification computing node is deployed outside it. Each acquisition node generates its own set of prime moduli, so the sets generated by different acquisition nodes differ, and the intermediate storage node forwards them to the cloud log classification computing node.
From the above description, it can be seen that the following technical effects are achieved:
In the embodiments of the application, a multi-level log classification architecture in which log acquisition nodes, storage nodes and cloud service nodes cooperate is adopted, so that computing tasks can be balanced and shared. Because a degree of congruence operation is applied to the logs to be classified using several different prime moduli, a simple encryption algorithm can replace a complex one for communication, and clustering is performed on the congruence operation results. This achieves the technical effects of improving the accuracy of classifying new patterns and reducing the computational load, thereby solving the technical problems of low accuracy in classifying new patterns and heavy computational load.
According to an embodiment of the present invention, preferably, the obtaining of the log to be classified includes:
The log acquisition node performs word segmentation on the original logs and removes stop words to obtain the logs to be classified.
The log acquisition node segments the original log with a word segmentation algorithm based on a dictionary (containing Chinese and English) and removes stop words. The stop words mainly comprise punctuation marks, digit strings that are common but carry no meaning, and month information in the log (such as Jan and Feb). Stop words are removed and do not appear in the final word segmentation result; in general, only dictionary-based tokens are kept (the same applies to Chinese).
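A rough Python sketch of this preprocessing is given below. It replaces the dictionary-based segmenter described above with a regular expression, which is a simplification made for this example: runs of Chinese characters are kept whole rather than segmented, and the month list and function name are assumptions.

```python
import re

# English month abbreviations treated as stop words (assumed list).
MONTHS = {"jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"}

# Letter runs, digit runs, or runs of CJK characters; punctuation is skipped.
TOKEN_RE = re.compile(r"[A-Za-z_]+|\d+|[\u4e00-\u9fff]+")

def to_classifiable_log(raw_line):
    """Tokenize a raw log line and drop digit strings and month names."""
    tokens = TOKEN_RE.findall(raw_line)
    return [t for t in tokens if not t.isdigit() and t.lower() not in MONTHS]

print(to_classifiable_log("Jan 12 08:15:01 host sshd[2201]: Failed password for root"))
# ['host', 'sshd', 'Failed', 'password', 'for', 'root']
```

Note that this simple filter also drops the ordinary word "may"; a dictionary-based segmenter would be needed to tell the two apart.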
According to an embodiment of the present invention, preferably, performing congruence operation on the log to be classified by using the prime matrix further includes:
Performing a finite-field number-theoretic inverse operation on the congruence operation result to generate check data;
transmitting the logs to be classified, the prime matrix, the congruence operation result and the check data to the storage node for storage;
the storage node initiates a classification request to the cloud service node and uploads the logs to be classified, the prime matrix, the congruence operation result and the check data to the cloud service node using an encryption algorithm;
the cloud service node performs number-theoretic inverse verification of the logs to be classified based on the check data;
the logs to be classified that pass verification are divided according to the congruence operation result;
and the division results are clustered using a clustering algorithm.
The log acquisition node generates check data for a small fraction of the data (for example 1%), depending on the data volume. The check data is produced by a finite-field number-theoretic inverse operation: the congruence calculation results of the data to be classified are selected, and the number-theoretic inverse is computed packet by packet.
The log acquisition node transmits the congruence operation result, the prime matrix, the check data and the original content of the logs to be classified to the storage node, and the storage node stores the data.
The storage node initiates a classification request task to the cloud service node and, at the same time, uploads the logs to be classified, the prime matrix, the check data and the congruence operation result to the cloud service node using a simple encryption algorithm. The task uploaded by the storage node may include data from several different log acquisition nodes.
After receiving the data, the cloud service node verifies the data for which check data is available. Once the data pass the number-theoretic inverse verification, they are divided according to the upload task and its prime matrix: data to be classified that share the same prime matrix form one classification calculation task. The classification task clusters the data using algorithms such as DBSCAN, cosine similarity or Jaccard similarity; if the similarity of two pieces of data exceeds a set threshold (for example 85%), they are placed in the same class. The prime matrix must be uploaded, rather than an acquisition node identifier, because the prime matrix may change dynamically.
Using the prime matrix and its inverse operation on the part of the data that carries check data, the cloud service node can check whether the data to be processed has been tampered with. Because the inverse calculation over the prime matrix is not computationally heavy, the computational load is further reduced while data security is maintained.
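One possible reading of the check-data step is sketched below in Python. The exact formula is not given in this text, so the sketch assumes that the check value for each residue a is its multiplicative inverse modulo the corresponding prime p, and that verification tests a · a⁻¹ ≡ 1 (mod p). The sampling scheme, the handling of a zero residue (which has no inverse), the column indexing and all function names are assumptions of this example. The three-argument `pow` with exponent -1 requires Python 3.8 or later.

```python
import random

def make_check_data(residues, prime_matrix, sample_rate=0.01, start_col=0, rng=None):
    """Sample packets with probability sample_rate and store modular inverses.

    Returns {packet_index: tuple_of_inverses}. A zero residue has no
    inverse; it is recorded as 0 (an assumption of this sketch).
    """
    rng = rng or random.Random()
    n = len(prime_matrix[0])
    checks = {}
    for k, packet in enumerate(residues):
        if rng.random() >= sample_rate:
            continue
        col = (start_col + k) % n
        checks[k] = tuple(
            pow(a, -1, prime_matrix[i][col]) if a else 0
            for i, a in enumerate(packet)
        )
    return checks

def verify_check_data(residues, prime_matrix, checks, start_col=0):
    """Return True when every sampled residue matches its stored inverse."""
    n = len(prime_matrix[0])
    for k, inverses in checks.items():
        col = (start_col + k) % n
        for i, (a, inv) in enumerate(zip(residues[k], inverses)):
            p = prime_matrix[i][col]
            if a == 0:
                if inv != 0:
                    return False
            elif (a * inv) % p != 1:
                return False
    return True
```

Verification costs one multiplication and one reduction per sampled residue, which is consistent with the remark above that the inverse check is not computationally heavy.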
According to the embodiment of the present invention, preferably, the clustering algorithm is further used to perform clustering operation on the division result, and the method further includes:
The cloud service node transmits the clustering operation result back to the storage node using an encryption algorithm;
and the storage node stores data according to the clustering operation result.
If the classification nodes are deployed in the cloud, especially on a public cloud, this work may leak sensitive information. Even if a strong encryption algorithm is used during transmission, the complete information is stored in the cloud, and once a cloud host is compromised the sensitive log information may be leaked. To solve this problem, the information to be classified, which has already undergone congruence calculation, is classified according to the classification task and the received prime modulus matrix, and the classification result is fed back to the intermediate storage node. The cloud computing node does not persist the classified data on any storage resource in the cloud, which avoids leakage of log information when a cloud host is compromised.
In addition, the classification algorithm proposed in this patent operates only on the common pattern information shared by different original logs, so a data leak in the cloud has no substantial impact.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
According to an embodiment of the present invention, there is also provided a system for implementing the above log classification method, as shown in fig. 2, the system includes:
the log acquisition node is used for generating a prime number set and forming the prime number set into a prime number matrix; performing congruence operation on the logs to be classified by using the prime number matrix; transmitting the logs to be classified, the prime matrix and the congruence operation result to a storage node for storage;
Specifically, the log acquisition node automatically generates the prime number set and regenerates it at a fixed frequency, for example hourly or daily, depending on the user's requirements for security strength. Regenerating the set helps prevent differential attacks to some extent.
The generated primes are arranged into an m × n matrix, in which each column is called a longitudinal congruence operation factor and each row is called a transverse congruence calculation segment. To speed up computation, the numbers of rows and columns should not be too large, for example m = 3 and n = 3. The primes themselves should also not be too large; this patent limits them to values below 2^32.
Primes are chosen as moduli because the finite-field property of the residue system modulo a prime distributes the calculation results relatively uniformly. Quadratic residues could also be used, but they require more computation and yield an unbalanced distribution.
The acquisition node generates a set of prime moduli (the primes should not be too large, to keep computation efficient) and organizes it into an m × n matrix. Each column is called a longitudinal congruence operation factor: it applies simple congruence calculations with different moduli to the same block of content, which helps ensure clustering accuracy to a certain extent. Each row is called a transverse congruence calculation segment: it is applied across different blocks of content, which helps prevent differential attacks to a certain extent. Preventing a differential attack here means that even if an attacker obtains the decrypted congruence results (the results produced from the prime number set and the logs to be classified), the attacker cannot use known information to deduce the original information from them through computation, even though that information is not otherwise encrypted.
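The matrix generation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names (`is_prime`, `random_prime`, `generate_prime_matrix`) are hypothetical, the use of Python's `secrets` module as the random source is an assumption, and forcing the top bit of each candidate is an added assumption to avoid very small moduli. The source only requires that the primes stay below 2^32 and that m and n be small (for example 3 × 3).

```python
import secrets


def is_prime(n: int) -> bool:
    """Trial division; adequate for moduli below 2**32."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True


def random_prime(bits: int = 32) -> int:
    """Draw a random prime below 2**bits."""
    while True:
        # Assumption: set the top bit (avoid tiny moduli) and the low bit (odd).
        candidate = secrets.randbits(bits) | (1 << (bits - 1)) | 1
        if is_prime(candidate):
            return candidate


def generate_prime_matrix(m: int = 3, n: int = 3, bits: int = 32) -> list[list[int]]:
    """Build an m x n matrix of primes; entries are not required to be distinct."""
    return [[random_prime(bits) for _ in range(n)] for _ in range(m)]


if __name__ == "__main__":
    for row in generate_prime_matrix():
        print(row)
```

Regenerating the matrix at a fixed interval, as the text describes, would amount to calling `generate_prime_matrix` again from whatever scheduler the acquisition node uses.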
The embodiment of the application targets only the classification of unrecognized patterns, which simplifies the handling of unidentified and new types of logs in the system. Otherwise, the personnel who write generalization scripts face a large amount of unclassified information and can hardly cover newly appearing log types in one pass. It should be noted that this classification is unsupervised.
It should be understood that logs that can be generalized are further generalized by the deployed log acquisition node. Because generalization depends on patterns that have already been identified, logs with identified patterns are typically handled on a regular basis. Such patterns come from historical accumulation: the product provider performs a fine-grained analysis of historically obtained logs and extracts classification information and other content from them, such as TCP/IP five-tuples, user information, and file access information.
It will be appreciated by those skilled in the art that a prime number is an integer greater than 1 that is divisible only by 1 and itself.
It will be appreciated by those skilled in the art that congruence is defined as follows: given a positive integer m, if two integers a and b are such that a - b is divisible by m, i.e., (a - b)/m is an integer, then a is said to be congruent to b modulo m, denoted a ≡ b (mod m). Congruence modulo m is an equivalence relation on the integers.
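As a quick numeric check of this definition, the short sketch below tests whether a - b is divisible by m. The helper name `is_congruent` is hypothetical and is used only for illustration.

```python
def is_congruent(a: int, b: int, m: int) -> bool:
    """Return True when a is congruent to b modulo m, i.e. m divides (a - b)."""
    if m <= 0:
        raise ValueError("the modulus m must be a positive integer")
    return (a - b) % m == 0


if __name__ == "__main__":
    print(is_congruent(17, 5, 12))   # 17 - 5 = 12 is divisible by 12
    print(is_congruent(17, 6, 12))   # 17 - 6 = 11 is not divisible by 12
    print(is_congruent(-1, 6, 7))    # -1 - 6 = -7 is divisible by 7
```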
The method for carrying out congruence operation on the logs to be classified by using the prime number matrix comprises the following steps:
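Define a prime matrix $P = \{p_{ij}\}$; the number of primes is the product of the number of rows and the number of columns, and the matrix is written as:

$$P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m1} & p_{m2} & \cdots & p_{mn} \end{pmatrix}$$

These primes are not necessarily all different;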
Define the log to be classified $W = \{w_i\}$ and convert it into a full string set $S = \{c_i\}$, where the string length is [expression missing in the source text];
Group the character strings in S into 16-byte packets, padding any packet shorter than 16 bytes with a preset character string;
For one packet in S, perform the congruence operation with the column $\{p_{\cdot,j}\}$ and concatenate the results; the size of one packet is thereby reduced from $2^{128}$ to [expression missing in the source text], because the column contains m primes (the prime matrix has m rows);
For the next packet, perform the congruence operation with the column $\{p_{\cdot,(j+1) \bmod n}\}$, and so on, cycling through all packets so that different packets are processed with different columns of the prime matrix.
The following is an example of processing one packet (the original packet content is denoted $seg_k$, and i is the column used for this packet):

$$seg_k \equiv a_{1k} \pmod{p_{1i}}$$
$$seg_k \equiv a_{2k} \pmod{p_{2i}}$$
$$\vdots$$
$$seg_k \equiv a_{mk} \pmod{p_{mi}}$$

Finally, the packet $seg_k$ is represented as $a_{1k} a_{2k} \ldots a_{mk}$. Other packets are handled similarly, except that a different column of primes from the matrix is used;
After all packets have been processed, $seg_1 seg_2 \ldots seg_K$ is converted into $a_{11} a_{21} \ldots a_{m1} \ldots a_{1K} a_{2K} \ldots a_{mK}$, where K is the number of packets. [The formula defining K is missing in the source text; in it, $\nmid$ indicates "does not divide" and $[\,]$ is a rounding operation.]
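The packet-wise congruence operation can be sketched as below. This is an illustrative reading of the steps, under several assumptions that the source does not fix: the padding byte (`PAD`), the big-endian interpretation of a 16-byte packet as an integer, and starting with column 0 for the first packet. The names `split_packets` and `congruence_transform` are hypothetical.

```python
BLOCK = 16       # packet size in bytes, as stated in the text
PAD = b"\x00"    # assumption: the "preset character string" is a single zero byte


def split_packets(data: bytes, block: int = BLOCK, pad: bytes = PAD) -> list[bytes]:
    """Split data into fixed-size packets, padding the last one if it is short."""
    packets = [data[i:i + block] for i in range(0, len(data), block)]
    if packets and len(packets[-1]) < block:
        packets[-1] = packets[-1].ljust(block, pad)
    return packets


def congruence_transform(data: bytes, P: list[list[int]]) -> list[tuple[int, ...]]:
    """Map packet k to (a_1k, ..., a_mk), using column (k mod n) of the matrix P."""
    n = len(P[0])
    result = []
    for k, packet in enumerate(split_packets(data)):
        seg = int.from_bytes(packet, "big")   # assumption: big-endian integer
        j = k % n                             # cycle through the columns
        result.append(tuple(seg % row[j] for row in P))
    return result


if __name__ == "__main__":
    # Small primes chosen only to keep the output readable.
    P = [[5, 7, 11], [13, 17, 19], [23, 29, 31]]
    for residues in congruence_transform(b"Failed password for root from host", P):
        print(residues)
```

Each packet yields m residues, one per row of the selected column, and consecutive packets use consecutive columns, which matches the cycling described in the text.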
The log acquisition node transmits the processed congruence operation result, prime matrix and the original content of the log to be classified to the storage node, and the storage node stores the data.
The storage node is used for initiating a classification request to the cloud service node, and uploading the log to be classified, the prime matrix and the congruence operation result to the cloud service node by using an encryption algorithm;
The storage node initiates a classification request task to the cloud service node and, at the same time, uploads the log to be classified, the prime matrix, and the congruence operation result to the cloud service node using a simple encryption algorithm. The uploaded task may include data from several different log acquisition nodes.
It is to be understood that, because several different prime moduli are used to apply a degree of congruence transformation to the words to be classified, the network transmission of the prime modulus set can use a general commercial-cryptography or national-cryptography algorithm; according to this patent, such an algorithm is much less computationally intensive than other encryption approaches such as 3DES, AES, or homomorphic encryption (for example, the Paillier algorithm).
The cloud service node is used for dividing logs to be classified according to the prime matrix and the congruence operation result; and clustering the division results by using a clustering algorithm.
The division is performed according to the upload task and its prime matrix: data to be classified that share the same prime matrix form one classification task. The task clusters the data using a clustering algorithm or similarity measure such as DBSCAN, cosine similarity, or Jaccard similarity, and records whose similarity exceeds a set threshold (for example, 85%) are placed in the same class. The prime matrix, rather than the acquisition node identifier, must be uploaded because the prime matrix may change dynamically.
It should be appreciated that a three-level structure is used to classify the original log information: a log acquisition component, an intermediate storage node, and a cloud log classification computing node. The log acquisition component and the intermediate storage node are deployed inside the user's environment, while the cloud log classification computing node is deployed outside it. Each acquisition node generates its own set of prime moduli, so the sets generated by different acquisition nodes differ, and the intermediate storage node forwards these sets to the cloud log classification computing node.
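A rough sketch of the division and clustering step follows. It is not DBSCAN: for brevity it substitutes a simple greedy "leader" clustering driven by Jaccard similarity, comparing each record with the first member of each existing cluster. The record representation (a set of residue tuples per log), the function names, and the input shape of `classify` are all assumptions made for illustration; only the grouping by prime matrix and the 85% threshold come from the text.

```python
from collections import defaultdict


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b| (defined as 1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_by_threshold(records: list[set], threshold: float = 0.85) -> list[list[int]]:
    """Greedy clustering: join the first cluster whose leader is similar enough."""
    clusters: list[list[int]] = []
    for idx, record in enumerate(records):
        for cluster in clusters:
            if jaccard(records[cluster[0]], record) > threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


def classify(tasks, threshold: float = 0.85) -> dict:
    """Group (prime_matrix, record) pairs by matrix, then cluster each group."""
    by_matrix = defaultdict(list)
    for matrix, record in tasks:
        key = tuple(tuple(row) for row in matrix)   # hashable form of the matrix
        by_matrix[key].append(record)
    return {key: cluster_by_threshold(recs, threshold) for key, recs in by_matrix.items()}


if __name__ == "__main__":
    m1 = [[5, 7], [11, 13]]
    a = set(range(20))
    b = set(range(19)) | {99}
    c = set(range(100, 120))
    print(classify([(m1, a), (m1, b), (m1, c)]))
```

Keying the groups on the matrix itself, rather than on a node identifier, reflects the point made in the text that the matrix may change over time.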
From the above description, it can be seen that the following technical effects are achieved:
In the embodiment of the application, a multi-level log classification architecture is adopted in which log acquisition nodes, storage nodes, and cloud service nodes cooperate, so that computing tasks are balanced and shared. Several different prime moduli are used to apply a degree of congruence transformation to the logs to be classified, so a simple encryption algorithm can replace a complex one for communication, and clustering is performed on the congruence operation results. This improves the accuracy of classifying new patterns and effectively reduces the computational load, thereby solving the technical problems of low classification accuracy for new patterns and high computational load.
According to an embodiment of the present invention, preferably, the obtaining of the log to be classified includes:
and the log acquisition node performs word segmentation on the original log and removes stop words to obtain the log to be classified.
The log acquisition node segments the original log into words using a dictionary-based word segmentation algorithm (the dictionary covers Chinese and English) and removes stop words. The stop words mainly comprise punctuation marks, common but meaningless digit strings, and month information in the log (such as Jan and Feb). Removed stop words are excluded from the final segmentation result; in general, only dictionary-based tokens are kept (this also applies to Chinese).
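The preprocessing step can be illustrated with the sketch below. It is deliberately simplified: a regular expression stands in for the dictionary-based segmenter described in the text (an assumption, and runs of Chinese characters are kept as single tokens rather than segmented), and the `MONTHS` set and the `tokenize` name are hypothetical. Punctuation, pure digit strings, and month abbreviations are dropped, as the text specifies.

```python
import re

# Assumption: English month abbreviations treated as stop words.
MONTHS = {"jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"}

# Runs of ASCII word characters, or runs of CJK characters; punctuation is skipped.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|[\u4e00-\u9fff]+")


def tokenize(line: str) -> list[str]:
    """Split a raw log line into tokens and drop digit strings and month names."""
    tokens = TOKEN_RE.findall(line)
    return [t for t in tokens if not t.isdigit() and t.lower() not in MONTHS]


if __name__ == "__main__":
    print(tokenize("Jan 12 10:32:01 host sshd[4821]: Failed password for root"))
```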
According to the embodiment of the invention, preferably, the log acquisition node is further used for carrying out finite field number theory inverse operation on the congruence operation result to generate check data; transmitting the logs to be classified, the prime number matrix, the congruence operation result and the check data to a storage node for storage;
the storage node is also used for initiating a classification request to the cloud service node, and uploading the log to be classified, the prime number matrix, the congruence operation result and the check data to the cloud service node by using an encryption algorithm;
the cloud service node is further used for carrying out the inverse verification of the number theory of the logs to be classified based on the verification data;
dividing the logs to be classified which pass verification according to the congruence operation result;
and clustering the division results by using a clustering algorithm.
Depending on the data scale, the log acquisition node generates check data with a low probability (for example, 1%). The check data is produced by a finite-field number-theoretic inverse operation, as follows:
select the congruence calculation result of the data to be classified and apply the number-theoretic inverse to each packet: [formula missing in the source text]
the log acquisition node transmits the processed congruence operation result, prime number matrix, check data and the original content of the log to be classified to the storage node, and the storage node stores the data.
And the storage node initiates a classification request task to the cloud service node, and simultaneously uploads the log to be classified, the prime matrix, the check data and the congruence operation result to the cloud service node by using a simple encryption algorithm. Here, the task uploaded by the storage node may include data from a plurality of different log collection nodes.
After receiving the data, the cloud service node first verifies the records that carry check data. Once those records pass the number-theoretic inverse verification, the node divides the data to be classified according to the upload task and its prime matrix, treating data that share the same prime matrix as one classification task. The task clusters the data using a clustering algorithm or similarity measure such as DBSCAN, cosine similarity, or Jaccard similarity, and records whose similarity exceeds a set threshold (for example, 85%) are placed in the same class. The prime matrix, rather than the acquisition node identifier, must be uploaded because the prime matrix may change dynamically.
Using the prime matrix and its inverse operation, the cloud service node can check, on the subset of data that carries check data, whether the data to be processed has been tampered with. Because the inverse computation on the prime matrix is inexpensive, the computational load is further reduced while data security is preserved.
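The exact inverse formula is not reproduced in this text, so the sketch below is only one plausible reading, stated as an assumption: the check value for each residue is taken to be its modular multiplicative inverse in the field of integers modulo the corresponding prime, and verification checks that the product is congruent to 1. A zero residue has no inverse, so it is recorded as 0 here by convention. The names `inverse_check`, `verify`, and `should_check`, and the 1% sampling via `random.Random`, are hypothetical.

```python
import random


def inverse_check(residues: tuple[int, ...], primes: tuple[int, ...]) -> tuple[int, ...]:
    """Assumed check data: modular inverse of each residue (0 when the residue is 0)."""
    # pow(a, -1, p) computes the modular inverse (Python 3.8+).
    return tuple(pow(a, -1, p) if a % p else 0 for a, p in zip(residues, primes))


def verify(residues: tuple[int, ...], primes: tuple[int, ...], check: tuple[int, ...]) -> bool:
    """Accept when a * a^-1 is 1 mod p for every non-zero residue, 0 otherwise."""
    return all(
        (a * c) % p == (1 if a % p else 0)
        for a, c, p in zip(residues, check, primes)
    )


def should_check(rng: random.Random, rate: float = 0.01) -> bool:
    """Sample a record for check-data generation with the given probability."""
    return rng.random() < rate


if __name__ == "__main__":
    primes = (5, 13, 23)        # one column of a toy prime matrix
    residues = (3, 0, 22)       # residues of one packet
    check = inverse_check(residues, primes)
    print(check, verify(residues, primes, check))
    print(verify((4, 0, 22), primes, check))   # a tampered residue fails
```

Under this reading, verification costs one multiplication and one reduction per residue, which is consistent with the text's remark that the inverse check is cheap; whether the patent intends this specific operation cannot be confirmed from the text.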
According to the embodiment of the invention, the cloud service node is preferably further used for transmitting the clustering operation result back to the storage node by using an encryption algorithm;
and the storage node is also used for storing according to the clustering operation result.
If the classification nodes are deployed in the cloud, especially on a public cloud, this work may leak sensitive information: even if a strong encryption algorithm is used in transit, the complete information would be stored in the cloud, and once the cloud host is compromised the sensitive log information may be leaked. To address this, the information to be classified, which has already undergone congruence calculation, is classified according to the classification task and the associated prime modulus matrix, and the classification result is fed back to the intermediate storage node. The cloud computing node does not persist the classified data on any cloud storage resource, which avoids leaking log information when a cloud host is compromised.
In addition, the classification algorithm provided by this patent operates only on the common pattern information shared across different original logs, so a leak of cloud data has no substantial impact.
It will be apparent to those skilled in the art that the modules or steps of the invention described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, or they may alternatively be implemented in program code executable by computing devices, such that they may be stored in a memory device for execution by the computing devices, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the same, but rather, various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (10)

1. A log classification method, comprising:
the log acquisition node generates a prime number set and forms the prime number set into a prime number matrix;
performing congruence operation on the logs to be classified by using the prime number matrix;
transmitting the logs to be classified, the prime matrix and the congruence operation result to a storage node for storage;
the storage node initiates a classification request to the cloud service node, and uploads a log to be classified, a prime matrix and a congruence operation result to the cloud service node by using an encryption algorithm;
the cloud service node divides the logs to be classified according to the prime matrix and the congruence operation result;
and clustering the division results by using a clustering algorithm.
2. The log classification method according to claim 1, wherein the acquisition of the log to be classified includes:
and the log acquisition node performs word segmentation on the original log and removes stop words to obtain the log to be classified.
3. The log classification method of claim 1, wherein the performing a congruence operation on the log to be classified using the prime matrix further comprises:
performing finite field number theory inverse operation on the congruence operation result to generate check data;
transmitting the logs to be classified, the prime number matrix, the congruence operation result and the check data to a storage node for storage;
the storage node initiates a classification request to the cloud service node, and uploads a log to be classified, a prime matrix, a congruence operation result and check data to the cloud service node by using an encryption algorithm;
the cloud service node performs the inverse verification of the number theory of the logs to be classified based on the verification data;
dividing the logs to be classified which pass verification according to the congruence operation result;
and clustering the division results by using a clustering algorithm.
4. The log classification method of claim 1, wherein the prime number sets are regenerated at a fixed frequency.
5. The log classification method of claim 1, wherein performing congruence operations on the log to be classified using the prime matrix comprises:
defining a prime matrix $P = \{p_{ij}\}$;
defining the log to be classified $W = \{w_i\}$ and converting it into a full string set $S = \{c_i\}$;
grouping the character strings in S into 16-byte packets, and padding packets shorter than 16 bytes with a preset character string;
for one packet in S, performing the congruence operation with $\{p_{\cdot,j}\}$, then for the next packet, performing the congruence operation with $\{p_{\cdot,(j+1) \bmod n}\}$.
6. A method of log classification as claimed in claim 1 or 3 wherein the clustering algorithm is a DBSCAN algorithm, a cosine similarity algorithm or a Jaccard similarity algorithm.
7. The log classification method according to claim 1 or 3, wherein, after the clustering algorithm is used to cluster the division results, the method further comprises:
the cloud service node transmits the clustering operation result back to the storage node by using an encryption algorithm;
and the storage node stores according to the clustering operation result.
8. The method according to claim 1, wherein the columns of the prime matrix are longitudinal congruence operation factors for performing simple congruence calculations with different moduli on the same block of content to be calculated; and the rows of the prime matrix are transverse congruence calculation segments for performing calculations on different contents to be calculated.
9. A log classification system, comprising:
the log acquisition node is used for generating a prime number set and forming the prime number set into a prime number matrix; performing congruence operation on the logs to be classified by using the prime number matrix; transmitting the logs to be classified, the prime matrix and the congruence operation result to a storage node for storage;
the storage node is used for initiating a classification request to the cloud service node, and uploading the log to be classified, the prime matrix and the congruence operation result to the cloud service node by using an encryption algorithm;
the cloud service node is used for dividing logs to be classified according to the prime matrix and the congruence operation result; and clustering the division results by using a clustering algorithm.
10. The log classification system of claim 9, wherein:
the log acquisition node is also used for carrying out finite field number theory inverse operation on the congruence operation result to generate check data; transmitting the logs to be classified, the prime number matrix, the congruence operation result and the check data to a storage node for storage;
the storage node is also used for initiating a classification request to the cloud service node, and uploading the log to be classified, the prime number matrix, the congruence operation result and the check data to the cloud service node by using an encryption algorithm;
the cloud service node is further used for carrying out the inverse verification of the number theory of the logs to be classified based on the verification data;
dividing the logs to be classified which pass verification according to the congruence operation result;
and clustering the division results by using a clustering algorithm.
CN202311811547.5A 2023-12-27 2023-12-27 Log classification method and system Active CN117473094B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311811547.5A CN117473094B (en) 2023-12-27 2023-12-27 Log classification method and system

Publications (2)

Publication Number Publication Date
CN117473094A CN117473094A (en) 2024-01-30
CN117473094B true CN117473094B (en) 2024-03-22

Family

ID=89639976

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311811547.5A Active CN117473094B (en) 2023-12-27 2023-12-27 Log classification method and system

Country Status (1)

Country Link
CN (1) CN117473094B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263009A (en) * 2019-06-21 2019-09-20 深圳前海微众银行股份有限公司 Generation method, device, equipment and the readable storage medium storing program for executing of log classifying rules
CN112860648A (en) * 2020-12-30 2021-05-28 苏宁消费金融有限公司 Intelligent analysis method based on log platform
CN113535667A (en) * 2020-04-20 2021-10-22 烽火通信科技股份有限公司 Method, device and system for automatically analyzing system logs

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10528407B2 (en) * 2017-07-20 2020-01-07 Vmware, Inc. Integrated statistical log data mining for mean time auto-resolution

Also Published As

Publication number Publication date
CN117473094A (en) 2024-01-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant