CN115277180B

CN115277180B - Block chain log anomaly detection and tracing system

Info

Publication number: CN115277180B
Application number: CN202210882913.5A
Authority: CN
Inventors: 牛伟纳; 张小松; 廖旭涵; 赵丽睿; 周孝笑; 朱宇坤; 张然
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2022-07-26
Filing date: 2022-07-26
Publication date: 2023-04-28
Anticipated expiration: 2042-07-26
Also published as: CN115277180A

Abstract

The present invention relates to the field of blockchain applications and provides a blockchain log anomaly detection and tracing system. It aims to solve the lack of a data anomaly detection function in current blockchain architectures, and can detect anomalous data safely, reliably and with high accuracy. A template is extracted from the data log and its quantity features are counted. Models are trained on the feature representation of the log, where the features are divided into quantity features and time-sequence features. A log sequence to be detected is first processed by the data processing module and then fed to the models trained by the quantity time-sequence model training module; each model outputs a value between 0 and 1, recorded respectively as the time-sequence deviation degree and the quantity deviation degree, from which the final deviation degree is calculated. Logs exceeding the deviation threshold are written into a table and given threat marks, and the log sequence to which each threat log belongs is given as the tracing output. If a false alarm is found during audit, it is marked so that the system dynamically adjusts the threshold, thereby increasing accuracy.

Description

Block chain log anomaly detection and tracing system

Technical Field

The invention belongs to the field of block chain data security, and provides a block chain log anomaly detection and tracing system.

Background

Blockchain technology is one of the most prevalent technologies today and is widely used in finance, supply chain and many other scenarios. Blockchains generally take one of three forms: public chains, consortium chains (also called alliance chains) and private chains. In the early stage of blockchain application, the public chain was the main form: anyone can participate in supervision, so the authenticity of on-chain information is strongest, but the very large number of participants makes operation inefficient. An enterprise operating at small scale typically chooses a private chain; few parties are involved, but the degree of centralization is too high, so it is generally suitable only for single-center industries. The consortium chain combines the advantages of the two and is the form chosen by most applications at present. It is supervised jointly by several main participating organizations; each organization independently controls which individuals it authorizes to join the blockchain network, and after registration these individuals take part in supervision as members of that organization. Information on the chain is transparent to all participating individuals, and operations such as adding data are supervised collectively, so the chain is traceable and tamper-resistant. At present, blockchain applications at home and abroad mainly aim to guarantee that auditable data is tamper-proof, credible, complete and traceable.

Logs are the most representative auditable data. They record runtime information such as various parameters during system operation, and by auditing logs, either periodically or when abnormal behavior occurs, a system developer can discover and locate problems in time. Existing logging systems, however, have several problems. If the system is deliberately attacked, the attacker may tamper with the log records, so that the developer cannot locate errors from the falsified records, which makes repairing the system and locating the problem harder. In addition, the widely used approach to log anomaly detection is for developers to rely on their domain knowledge and detect anomalies through keyword searches, regular expressions and the like, combined with log severity levels. This approach relies heavily on manual work and becomes more difficult as systems grow larger and more complex.

A log is time-series text data composed of timestamps and text messages that records the running state of a service in real time. Logs also exhibit certain quantitative correspondences: for example, every file that is opened should later be closed. If the execution order of the logs is wrong or these correspondences do not hold, an anomaly may have occurred. However, log specifications are currently not uniform, the log formats printed by different types of equipment differ, and log data is unstructured, so logs are difficult to process automatically in batches. These problems make log analysis very difficult.

A log parser based on a fixed-depth tree first preprocesses the raw log message with simple regular expressions derived from domain knowledge, and then searches for a log group according to specially designed rules encoded in the internal nodes of the tree. If a matching log group is found, the log message is matched to the log event stored in that group; if not, a new log group is created. In the end every log is assigned to a log group, which amounts to classifying the logs, and the common pattern extracted from logs of the same form becomes the template.

Disclosure of Invention

The invention discloses an automatic blockchain log anomaly detection and tracing scheme based on a consortium chain. In conventional blockchain applications, anomaly detection of data on the consortium chain is not automated but is performed manually, based on experience. As the variety of data on the consortium chain increases, the data becomes increasingly complex and relying on manual detection alone is no longer feasible. A reasonable, high-accuracy log anomaly detection technique is therefore needed to improve the automation capability of the system. The log anomaly detection and tracing scheme solves the lack of a data anomaly detection function in current blockchain architectures, and can detect anomalous data safely, reliably and with high accuracy.

The invention adopts the following technical scheme to solve the technical problems:

A blockchain log anomaly detection and tracing system, comprising:

a data processing module: extracts a template from the data log, where the log template comprises a constant part and a variable part; structures unstructured log data into template logs that are easy to analyze; and counts the quantity features of each template, where the quantity features are the number of occurrences of words in the template and the number of occurrences of word combinations;

a quantity time-sequence model training module: trains models on the feature representation of the log, where the features are divided into quantity features and time-sequence features; the quantity features are the number of occurrences of words and of word combinations in the template, and the time-sequence feature is the order of the logs;

a deviation degree calculation module: a log sequence to be detected is first processed by the data processing module and then fed to the models trained by the quantity time-sequence model training module; each model outputs a value between 0 and 1, recorded respectively as the time-sequence deviation degree and the quantity deviation degree, and the final deviation degree is calculated from the two;

an anomaly tracing module: writes logs exceeding the deviation threshold into a table and gives them threat marks, and gives the log sequence to which each threat log belongs as the tracing output; if a false alarm is found during audit, it is marked so that the system dynamically adjusts the threshold, thereby increasing accuracy.

In the above technical solution, the data processing module uses the Drain log template extractor combined with multidimensional feature combination to output statistical features, specifically:

1) The Drain log template extractor extracts templates from the existing logs of the blockchain network;

2) For each template extracted by Drain, the number of occurrences of words and the number of occurrences of word combinations in the template are counted;

3) When a node uploads a log to the blockchain network, Drain is used to classify the log into the corresponding template and its quantity features are counted.
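The counting in steps 2) and 3) can be illustrated with a minimal sketch. It assumes that variables appear in the template as bracketed placeholders and that a "combined word" is an unordered pair of distinct constant words in the same template; the function name `quantity_features` and both conventions are illustrative assumptions, not details given in the source.

```python
import re
from collections import Counter
from itertools import combinations


def quantity_features(template: str) -> Counter:
    """Count constant words and co-occurring word pairs in one log template."""
    # Drop bracketed placeholders such as [ID]; they are variables, not words.
    constant_text = re.sub(r"\[[^\]]*\]", " ", template)
    words = re.findall(r"[A-Za-z]+", constant_text)
    counts = Counter(words)
    # Assumption: a "combined word" is an unordered pair of distinct words
    # appearing in the same template, counted once per template.
    for first, second in combinations(sorted(set(words)), 2):
        counts[f"{first}-{second}"] += 1
    return counts


features = quantity_features("Receiving block [ID] src: [*] dest:/[IPANDPORT]")
```

For this template the counter holds the four single words plus their six pairs, including the `Receiving-block` and `Receiving-src` combinations.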

In the above technical solution, in the quantity time-sequence model training module:

the time-sequence features and quantity features of the log sequence are acquired respectively;

the time-sequence features are used to train an attention-based GRU model to obtain the time-sequence model;

the quantity features are used to train a gradient boosting decision tree to obtain the quantity model;

the time-sequence model and quantity model with the highest accuracy during training are saved.

Training the attention-based GRU model comprises the following steps:

A. The log is text data and the extracted template is also text, so semantic conversion is needed before it is input into the model: the input log template text is converted into a log template vector using semantic vectors trained with GloVe. A program cannot process raw text directly but can process numeric vectors. The GloVe vocabulary is a one-to-one correspondence between words and numeric vectors, so words can be converted by table lookup.

B. Using a sliding window, the batch of log template vectors is converted into log template sequence vectors;

C. The log template sequence vectors are input into the model so that it learns the time-sequence features;

D. The training result is saved to obtain the time-sequence model.
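Step B can be sketched as follows. The window size, the stride of one, and the two-dimensional toy vectors are assumptions chosen for readability; the source does not state them.

```python
from typing import List, Sequence

Vector = List[float]


def sliding_windows(vectors: Sequence[Vector], window: int) -> List[List[Vector]]:
    """Turn a batch of template vectors into overlapping fixed-length sequences."""
    if window <= 0:
        raise ValueError("window must be positive")
    # A stride of 1 is assumed; fewer vectors than the window yields no sequence.
    return [list(vectors[i:i + window]) for i in range(len(vectors) - window + 1)]


# Toy 2-dimensional vectors stand in for real 300-dimensional template vectors.
template_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
sequences = sliding_windows(template_vectors, window=3)
```

Each element of `sequences` is one input sample for the time-sequence model, so four template vectors with a window of three give two overlapping samples.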

In the above technical solution, in the anomaly tracing module:

1) A threshold is set and the deviation value is judged against it; a deviation value exceeding the threshold is marked as abnormal (marked 1);

2) Falsely reported data can be marked as a false alarm (marked 0). Whether to adjust the threshold is decided by whether, within a certain period, the number of true anomalies at or below the deviation value of the false alarm is particularly small; if so, the threshold is too low and needs to be raised;

3) The log data to be detected is processed by the data processing module and then input into the models trained in the quantity time-sequence model training module to obtain a deviation value;

4) Whether to mark the log as abnormal is decided according to the threshold;

5) If it is abnormal, the log sequence related to the anomaly is output as the tracing result.

In the above technical solution, the process of tracing and outputting an anomaly is:

1) The one-to-one correspondence between the original text of each log under test and its vector is cached in memory;

2) If the log is marked abnormal, the original log text of the log vector and the related log sequence information are obtained by table lookup;

3) Otherwise, the cache is emptied and the next log is detected.

By adopting the technical scheme, the invention has the following beneficial effects:

1. The log data is stored on the consortium chain, which ensures the credibility and reliability of log data audits.

2. By combining the advantages of deep learning and machine learning, log anomalies are detected automatically using the time-sequence features and quantity features of the logs.

3. Through the attention mechanism applied to the time-sequence features, the screening of quantity features, and the final weighted calculation of the comprehensive deviation degree, irrelevant features and influences are effectively removed and the accuracy of the results is improved.

Drawings

FIG. 1 shows the basic flow of blockchain log anomaly detection and tracing;

FIG. 2 shows the process of log anomaly detection;

FIG. 3 shows the architecture of the blockchain log anomaly detection and tracing system.

Detailed Description

The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the particular embodiments described herein are illustrative only and are not intended to limit the invention, i.e., the embodiments described are merely some, but not all, of the embodiments of the invention.

The detailed description of the embodiments of the invention is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.

It is noted that relational terms such as "first" and "second", and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The blockchain log anomaly detection and tracing system is implemented with the following modules:

Data processing module: extracts log templates from unstructured logs, and counts the number of occurrences of fixed words and of fixed-word combinations in each template.

Quantity time-sequence model training module: converts the log sequence into numeric vectors and inputs them into the time-sequence model for training. At the same time, the quantity features counted by the data processing module are input into the quantity model for training, and the training results are saved.

Deviation degree calculation module: inputs the log sequence under test into the models trained in the quantity time-sequence model training module, calculates the time-sequence deviation degree and the quantity deviation degree respectively, and then calculates the comprehensive deviation degree.

Anomaly tracing module: marks whether a log is abnormal according to its deviation value and, if so, outputs the related log sequence as the tracing result; an operator can dynamically adjust the deviation threshold through feedback.

The main flow of the scheme across the four modules comprises the following steps:

A. The data processing module extracts log templates from the large amount of unstructured log data existing on the blockchain network, and counts the number of occurrences of fixed words and of fixed-word combinations for each template.

B. The log text sequence is converted into numeric vectors through the GloVe vocabulary and temporarily cached. Sequences containing time-sequence information are input into the time-sequence model, the quantity features counted for the templates corresponding to those sequences are input into the quantity model, the two models are trained separately, and the training result with the highest accuracy is saved.

C. The log sequence under test is processed by the data processing module and then input into the trained time-sequence and quantity models, which calculate the time-sequence deviation degree t and the quantity deviation degree n respectively. Weights w1 and w2 are assigned to the influence of the two on the final deviation degree, and the comprehensive deviation degree y = w1·t + w2·n is calculated.

D. Whether to mark the log as abnormal is determined by whether the comprehensive deviation degree y is greater than a set threshold m. If so, the abnormal log sequence is output using the log association information cached in memory; otherwise the cache is cleared.
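Steps C and D reduce to a weighted sum followed by a threshold comparison. The sketch below assumes equal weights w1 = w2 = 0.5 and a threshold m = 0.5 purely for illustration; the source does not give concrete values, and the function names are hypothetical.

```python
def comprehensive_deviation(t: float, n: float, w1: float = 0.5, w2: float = 0.5) -> float:
    """Combine time-sequence deviation t and quantity deviation n as y = w1*t + w2*n."""
    for name, value in (("t", t), ("n", n)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    # The default weights are an assumption; they sum to 1 so y stays in [0, 1].
    return w1 * t + w2 * n


def is_abnormal(y: float, m: float) -> bool:
    """Step D: a log is marked abnormal when y is greater than the threshold m."""
    return y > m


y = comprehensive_deviation(t=0.9, n=0.3)
flag = is_abnormal(y, m=0.5)
```

With weights that sum to 1, the comprehensive deviation stays in the same 0 to 1 range as the two model outputs, which keeps a single threshold meaningful.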

Further, in the processing in step B, the log template is parsed first, for which we use the Drain log parser. Conventional variable information is first replaced with masks, and the logs are then grouped by log length and aggregated by prefix similarity to finally obtain the log templates.
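The mask replacement can be sketched with two regular expressions. The patterns below are hypothetical and cover only block IDs and IP:port pairs; the mask names [ID] and [IPANDPORT] are borrowed from the template example in this document, and a real deployment would need patterns written from its own domain knowledge.

```python
import re

# Hypothetical masking rules: (compiled pattern, replacement mask).
MASKS = [
    (re.compile(r"blk_-?\d+"), "[ID]"),
    (re.compile(r"\d{1,3}(?:\.\d{1,3}){3}:\d+"), "[IPANDPORT]"),
]


def mask_variables(line: str) -> str:
    """Replace obvious variable fields in a raw log line with masks."""
    for pattern, mask in MASKS:
        line = pattern.sub(mask, line)
    return line


masked = mask_variables(
    "Receiving block blk_-354458 src:/10.250.19.102:39325 dest:/10.250.19.102:50010"
)
```

Masking only removes the obvious variables; any remaining variable positions are generalized later, when the parser merges similar logs into one template.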

Specifically, our system combines both the time-sequence features and the quantity features of the log sequence. The time-sequence feature is the order in which logs are executed, for example that a new file is written before it is deleted. The quantity feature is a numerical correspondence, for example that a file is opened as many times as it is closed.

On this basis, we use classical, high-accuracy deep learning and machine learning methods:

1) For time-sequence training, a GRU (Gated Recurrent Unit) algorithm based on an attention mechanism is used. It can focus on particular subsequences according to their contribution within the input sequence and ignore irrelevant sequence noise, so a good time-sequence model is obtained through training.

2) For quantity training, a gradient boosting decision tree is used to screen the one-dimensional and two-dimensional quantity features and remove irrelevant features, finally yielding an effective quantity model for different anomalies.

In step D, the threshold is initially set manually to a low value so that as much anomaly information as possible is captured. The threshold is adjusted dynamically from personnel feedback by observing the false alarm rate: among the logs whose deviation value is greater than or equal to the threshold and less than or equal to the deviation value of a reported false alarm, if the number of false alarms far exceeds the number of true anomalies, the threshold is updated to the deviation value of that false alarm.
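One possible reading of this feedback rule is sketched below. The source does not quantify "far more", so the `ratio` parameter (default 5) is an assumption, as are the function name and the representation of audited logs as (deviation, is_false_alarm) pairs.

```python
from typing import Iterable, Tuple


def adjust_threshold(
    threshold: float,
    false_alarm_deviation: float,
    audited: Iterable[Tuple[float, bool]],
    ratio: float = 5.0,
) -> float:
    """Raise the threshold to a false alarm's deviation when false alarms dominate.

    `audited` holds (deviation, is_false_alarm) pairs for logs flagged in the
    observation period. `ratio` is a hypothetical stand-in for "far more".
    """
    # Only logs between the current threshold and the false alarm's deviation count.
    in_band = [fa for dev, fa in audited if threshold <= dev <= false_alarm_deviation]
    false_alarms = sum(1 for fa in in_band if fa)
    anomalies = len(in_band) - false_alarms
    if false_alarms > ratio * anomalies:
        return false_alarm_deviation
    return threshold


audited_logs = [
    (0.32, True), (0.35, True), (0.41, True), (0.44, True),
    (0.47, True), (0.50, True), (0.49, False), (0.80, False),
]
new_threshold = adjust_threshold(0.3, 0.5, audited_logs)
```

In this example six false alarms and one true anomaly fall between 0.3 and 0.5, so the threshold rises to 0.5; the true anomaly at 0.80 is still caught afterwards.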

The technical scheme of the invention is further described as follows:

1. Extraction of log templates

Current log specifications are not uniform, the log formats printed by different types of equipment differ, and log data is unstructured. To solve these problems and make logs convenient for personnel to analyze, the log parser extracts logs into log templates, which makes the types of logs being processed clearer and reduces the processing difficulty. The main steps of the log parser are as follows.

1. Preprocessing: obvious variable parts are replaced with masks using regular expressions.

2. Log length classification: the logs are classified according to the number of tokens in the original log.

3. Depth classification: the logs are classified down to a preset tree depth, generally set to 4 and fine-tuned for the actual scenario; the depth affects the number of nodes traversed during the search and the accuracy.

4. Similarity classification: after the above classification, the similarity between two sequences is computed as

$$simSeq = \frac{\sum_{i=1}^{n} equ\big(seq_1(i),\, seq_2(i)\big)}{n}$$

where $seq_1(i)$ is the i-th token of log sequence 1, $seq_2(i)$ is the i-th token of log sequence 2, and

$$equ(t_1, t_2) = \begin{cases} 1, & t_1 = t_2 \\ 0, & \text{otherwise} \end{cases}$$

The similarity is used to judge whether the sequence belongs to an existing class; if not, a new class is added. Here $t_1$ and $t_2$ are the tokens at the same position in the two sequences, and n is the length of the longer of the two sequences being compared.
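The similarity test, matching positions divided by the length of the longer sequence, can be sketched as below. The similarity threshold `st = 0.5` and the helper `assign_group` are illustrative assumptions; the source does not state the value used to decide class membership.

```python
from typing import List, Sequence


def sim_seq(seq1: Sequence[str], seq2: Sequence[str]) -> float:
    """Fraction of positions holding equal tokens, over the longer sequence length."""
    n = max(len(seq1), len(seq2))
    if n == 0:
        return 1.0  # two empty sequences are treated as identical
    matches = sum(1 for t1, t2 in zip(seq1, seq2) if t1 == t2)
    return matches / n


def assign_group(tokens: Sequence[str], groups: List[List[str]], st: float = 0.5) -> int:
    """Return the index of the most similar group, adding a new group if none fits."""
    best_index, best_score = -1, 0.0
    for index, group in enumerate(groups):
        score = sim_seq(tokens, group)
        if score > best_score:
            best_index, best_score = index, score
    if best_index >= 0 and best_score >= st:
        return best_index
    groups.append(list(tokens))
    return len(groups) - 1


groups = ["Receiving block [ID] src: [*]".split()]
similarity = sim_seq(groups[0], "Receiving block [ID] dest: [*]".split())
same_group = assign_group("Receiving block [ID] dest: [*]".split(), groups)
new_group = assign_group("Deleting file".split(), groups)
```

A log that differs from an existing group in one of five positions scores 0.8 and joins that group, while an unrelated log opens a new one.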

2. Lifetime of log data

The purpose of log anomaly detection is to trace abnormal logs back to their source, locate the threat and investigate it in time. However, during anomaly detection the logs are converted into numeric vectors, which are not human-readable, and because of the large number of logs processed, not all log sequence relations can be saved. How long data is kept and which content is cached are therefore of great significance for log anomaly detection. The specific caching process is as follows:

Step one: in the data processing module, each original log is given a sequence number and a cache table is established whose entries are log sequence number - log template. In subsequent intermediate processing the original log is replaced by its sequence number, which improves time and space utilization.

Step two: in the quantity time-sequence model training module, the original log text is converted into a numeric vector through the GloVe vocabulary, and a cache table is established whose entries are log sequence number - numeric vector.

Step three: in the deviation degree calculation module, the deviation degree of the log under test is calculated, and a cache table is established whose entries are log sequence number - comprehensive deviation degree.

Step four: in the anomaly tracing module, if the deviation exceeds the threshold, the log sequence is returned according to the log sequence number and the window size originally set by the system, and the cache tables are emptied.
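The cache tables and the tracing step can be sketched with plain dictionaries keyed by log sequence number. The class name `TraceCache`, the choice to return the window ending at the abnormal log, and the storage of raw text instead of templates are assumptions made for brevity.

```python
from typing import Dict, List


class TraceCache:
    """In-memory tables keyed by log sequence number (a simplified sketch)."""

    def __init__(self, window: int) -> None:
        self.window = window
        self.raw_logs: Dict[int, str] = {}     # sequence number -> original log text
        self.deviation: Dict[int, float] = {}  # sequence number -> comprehensive deviation

    def add(self, seq_no: int, raw_log: str, deviation: float) -> None:
        self.raw_logs[seq_no] = raw_log
        self.deviation[seq_no] = deviation

    def trace(self, seq_no: int, threshold: float) -> List[str]:
        """Return the window of logs ending at seq_no if it is abnormal, then clear."""
        result: List[str] = []
        if self.deviation.get(seq_no, 0.0) > threshold:
            start = seq_no - self.window + 1
            result = [self.raw_logs[i] for i in range(start, seq_no + 1) if i in self.raw_logs]
        # The tables are emptied after every decision, abnormal or not.
        self.raw_logs.clear()
        self.deviation.clear()
        return result


cache = TraceCache(window=3)
for number, (text, dev) in enumerate(
    [("open file a", 0.1), ("open file b", 0.2), ("write file b", 0.1), ("delete file a", 0.9)],
    start=1,
):
    cache.add(number, text, dev)
traced = cache.trace(seq_no=4, threshold=0.5)
```

Keeping only sequence numbers in the intermediate tables means the readable text is looked up once, at tracing time, instead of being carried through every module.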

3. Format of log data

During log anomaly processing there are four basic data formats:

1) Raw log data: Receiving block blk_-354458 src:/10.250.19.102:39325 dest:/10.250.19.102:50010.

2) Log template: Receiving block [ID] src: [*] dest:/[IPANDPORT].

3) GloVe vocabulary: Receiving: [300-dimensional numeric vector], block: [300-dimensional numeric vector], src: [300-dimensional numeric vector], dest: [300-dimensional numeric vector]; the vector for a log line is formed by adding the vectors of its words.

4) Quantity features: Receiving: 1, block: 1, src: 1, dest: 1, Receiving-block: 1, Receiving-src: 1, …

These are expressed as the vector [1, 1, 1, …] and normalized by dividing by n, the sum of all occurrence counts, giving the final vector [1/n, 1/n, 1/n, …]. Here Receiving-block is the combination of the words Receiving and block, and Receiving-src is the combination of the words Receiving and src.
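Formats 3) and 4) can be made concrete with a small sketch. The three-dimensional `toy_vocab` is a made-up stand-in for a real 300-dimensional GloVe table, and mapping unknown words to a zero vector is an assumption.

```python
from typing import Dict, List, Sequence


def normalize_counts(counts: Dict[str, int]) -> List[float]:
    """Divide every occurrence count by n, the sum of all counts."""
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(counts)
    return [count / total for count in counts.values()]


def log_vector(words: Sequence[str], vocab: Dict[str, List[float]], dim: int) -> List[float]:
    """Build one log vector by adding the vectors of its words."""
    vector = [0.0] * dim
    for word in words:
        # Assumption: words missing from the vocabulary contribute a zero vector.
        for i, value in enumerate(vocab.get(word, [0.0] * dim)):
            vector[i] += value
    return vector


counts = {"Receiving": 1, "block": 1, "src": 1, "dest": 1,
          "Receiving-block": 1, "Receiving-src": 1}
normalized = normalize_counts(counts)

# Toy 3-dimensional vocabulary; a real GloVe table would use 300 dimensions.
toy_vocab = {"Receiving": [1.0, 0.0, 2.0], "block": [0.5, 1.0, 0.0]}
summed = log_vector(["Receiving", "block"], toy_vocab, dim=3)
```

With six features that each occur once, n is 6 and every entry of the normalized vector equals 1/6.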

Examples

The specific data execution process is as follows:

Step one: the original log Receiving block blk_-354458 src:/10.250.19.102:39325 dest:/10.250.19.102:50010 is input into the data processing module and classified to obtain the log template Receiving block [ID] src: [*] dest:/[IPANDPORT]. The quantity features are counted as Receiving: 1, block: 1, src: 1, dest: 1, Receiving-block: 1, Receiving-src: 1, … and normalized to obtain the vector [1/n, 1/n, 1/n, 1/n, 1/n, 1/n, …].

Step two: the original log Receiving block blk_-354458 src:/10.250.19.102:39325 dest:/10.250.19.102:50010 is converted into a 300-dimensional numeric vector by table lookup (GloVe vocabulary) to obtain a log sequence. The log sequence is input into the time-sequence model and the quantity vector obtained in step one is input into the quantity model, giving the time-sequence deviation degree (0-1) and the quantity deviation degree (0-1) respectively, from which the comprehensive deviation degree (0-1) is calculated.

During the training phase the model fits a very complex function from the input log template vectors and their corresponding labels, e.g. label 0 for normal logs and label 1 for abnormal logs. After training, inputting a log template vector amounts to evaluating the trained function, whose output value lies between 0 and 1.

Step three: whether the comprehensive deviation degree reaches the threshold is judged; if so, the log is marked as abnormal and the other original logs related to the input log instance are output.

Claims

1. A blockchain log anomaly detection and tracing system, characterized by comprising:

a data processing module: extracting a template from the data log, wherein the log template comprises a constant part and a variable part, unstructured log data is structured into template logs that are easy to analyze, and the quantity features are counted according to the template, wherein the quantity features are the number of occurrences of words in the template and the number of occurrences of word combinations;

an anomaly tracing module: writing logs exceeding the deviation threshold into a table and giving them threat marks, giving the log sequence to which each threat log belongs as the tracing output, and, if a false alarm is found during audit, marking it so that the system dynamically adjusts the threshold, thereby increasing accuracy;

in the quantity time-sequence model training module:

the time-sequence model and quantity model with the highest accuracy during training are saved;

training the attention-based GRU model comprises the following steps:

A. the log is text data and the extracted template is also text, so semantic conversion is needed before it is input into the model, and the input log template text is converted into a log template vector using semantic vectors trained with GloVe;

D. the training result is saved to obtain the time-sequence model.

2. The blockchain log anomaly detection and tracing system of claim 1, wherein: the data processing module uses the Drain log template extractor combined with multidimensional feature combination to output statistical features, specifically:

3) When a node uploads a log to the blockchain network, Drain is used to classify the log into the corresponding template and its quantity features are counted.

3. The blockchain log anomaly detection and tracing system of claim 1, wherein, in the anomaly tracing module:

1) A threshold is set and the deviation value is judged against it; a deviation value exceeding the threshold is marked as abnormal;

2) Falsely reported data can be marked as a false alarm; whether to adjust the threshold is decided by whether, within a certain period, the number of true anomalies at or below the deviation value of the false alarm is particularly small, and if so, the threshold is too low and needs to be raised;

4) Whether to mark the log as abnormal is decided according to the threshold;

4. The blockchain log anomaly detection and tracing system of claim 3, wherein, in the process of tracing and outputting an anomaly:

3) Otherwise, the cache is emptied and the next log is detected.