CN111143840B

CN111143840B - Method and system for identifying abnormity of host operation instruction

Info

Publication number: CN111143840B
Application number: CN201911406512.7A
Authority: CN
Inventors: 殷钱安; 梁淑云; 刘胜; 马影; 陶景龙; 王启凡; 魏国富; 徐�明; 余贤喆; 周晓勇
Original assignee: Information and Data Security Solutions Co Ltd
Current assignee: Information and Data Security Solutions Co Ltd
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2022-01-25
Anticipated expiration: 2039-12-31
Also published as: CN111143840A

Abstract

The invention provides a method and a system for identifying abnormal host operation instructions. The method comprises the following steps: S1, extracting sample data; S2, processing the data to obtain behavior sequence records and the usage frequency of each host operation instruction; S3, screening uncommon instructions to obtain a target operation instruction sequence; S4, training a compact prediction tree to obtain a target compact prediction tree; S5, predicting with the compact prediction tree to obtain a labeled training data set; S6, training operation instruction vectors with word2vec to form pre-training vectors; S7, establishing a classification recognition model with Bi-LSTM; and S8, predicting with the classification model. The invention uses the compact prediction tree to analyze the operation instruction sequences of user hosts and studies the behavioral relations among the instruction sequences, thereby judging whether a user's host operation instructions are abnormal. On this basis, the internal relations among user operation instructions are fully considered, the logical relations of the instructions in the time dimension are studied, and the accuracy of identifying objects with abnormal host operation instructions is improved.

Description

Method and system for identifying abnormity of host operation instruction

Technical Field

The invention relates to computer data security, in particular to a method and a system for identifying abnormal operation instructions of a host.

Background

Computer system security is a key part of information security. It has become a core technology of computer information systems and is an important foundation for, and complement to, network security.

Modern information technology continues to develop, and computers are now used across all industries. A national computer network information security mechanism has been established, and protective measures are in place in information security and related fields. Heavily used computers are nevertheless difficult to manage, and because the current protection mechanisms are incomplete, computer systems remain exposed to threats such as information leakage and information tampering, which endanger the security of data. How to effectively identify dangerous operation behaviors in a computer system, and so protect the system, remains an open problem in network information security.

Existing abnormal data mining methods are mainly distance-based, statistics-based, density-based, or clustering-based. Scholars at home and abroad have studied the theory of these methods in depth and obtained substantial results, but shortcomings remain. In distance-based methods, choosing the distance function and its parameters is difficult; statistics-based methods require the data distribution to be known in advance, yet the distribution function is hard to obtain beforehand; density-based methods have high time complexity; and clustering-based methods focus mainly on the clustering problem itself. These problems limit the application of abnormal data mining. The methods mainly handle deterministic data, lack effective theoretical models and methods for processing uncertain information, and, for discrete sequence data, cannot take into account the inherent logical relations between behaviors in a sequence. Among sequence anomaly detection methods, the commonly used Markov models and directed graph models are inefficient on large data sets.

The classification recognition models in the prior art are built on the features of individual instructions. They do not fully consider the relations among instructions and cannot make full use of the internal logical relations, in the time dimension, between the instructions that precede and follow one another in host operation.

Disclosure of Invention

The invention aims to provide a method, based on temporal relations, for identifying abnormal host operation instructions with a high identification rate.

The invention solves the technical problems through the following technical means:

A method for identifying abnormal host operation instructions comprises the following steps:

S1: sample data extraction

Extracting system operation instruction log data in a specified time period as original sample data;

S2: data processing

Based on the sample data extracted in S1, dividing the data by a set period; taking the user host account as an ID, so that the set period and the ID together form a unique index; arranging the instructions under each index in time order; and combining the instruction behaviors to form a behavior sequence record;

counting the usage frequency of each host operation instruction from the sample data obtained in S1;

S3: screening of uncommon instructions

Arranging the operation instruction frequencies obtained in S2 in ascending order, and selecting, from the sorted frequency sequence, the operation instructions whose frequency is below a set threshold, to obtain a target operation instruction sequence;

S4: compact prediction tree training

Converting the behavior sequence records obtained in S2 into an array as input, and performing model training with a compact prediction tree to obtain a target compact prediction tree;

S5: compact prediction tree prediction

According to the target operation instruction sequence screened out in S3, selecting the user accounts whose records contain those instructions, together with the corresponding behavior sequence records;

based on the target compact prediction tree trained in S4, predicting a plurality of instructions that may appear, using the target operation instruction sequence obtained by the uncommon instruction screening of S3; if the predicted instruction set does not contain the uncommon instruction found in the actual data, judging the user's operation instruction behavior to be abnormal; and finally obtaining a labeled training data set;

S6: training operation instruction vectors using word2vec

Taking the host operation instruction sequences obtained in S2 as input, and performing pre-training with the word2vec algorithm to form pre-training vectors;

S7: establishing a classification recognition model using Bi-LSTM

Based on the pre-training vectors obtained in S6, inputting the training data set obtained in S5 into a Bi-LSTM algorithm, and training a classification model for predicting whether a target is abnormal;

S8: predicting with the classification model.

Preferably, in step S3, quantiles are used to select, from the sorted frequency sequence, the operation instructions whose frequency is below the set threshold.

Preferably, in step S5, the compact prediction tree prediction step includes:

the first step: finding the sequences similar to the target operation instruction sequence, through the following steps: finding the unique items of the target operation instruction sequence; for each unique item, finding the set of IDs of the sequences that contain it; then taking the intersection of these ID sets;

the second step: finding the subsequent sequence of each sequence similar to the target operation instruction sequence, specifically: for each similar sequence, the subsequent sequence is defined as the longest subsequence that follows the last occurrence, in the similar sequence, of the final item of the target operation instruction sequence, minus the items already present in the target operation instruction sequence;

the third step: adding the elements of the subsequent sequences, with their scores, to a score dictionary; sorting the score dictionary obtained from the compact prediction tree in descending order of score and selecting the highest-scoring predicted operation instructions; if the actual operation instruction is not among the predicted instructions, marking the account as exhibiting abnormal operation behavior; and judging the operation instruction behavior of all user accounts in this way, with 1 denoting abnormal and 0 denoting normal, to form a labeled training data set.

Preferably, the step of pre-training by using the word2vec algorithm in step S6 is as follows:

the first step: treating the operation instruction sequence as text, with each operation instruction corresponding to a word in the text; generating a vocabulary for the input text by counting the frequency of each word, sorting the words from high to low frequency, and taking the V most frequent words to form the vocabulary; each word has a one-hot vector of dimension V: if the word appears in the vocabulary, the position in the vector corresponding to its position in the vocabulary is 1 and all other positions are 0; if the word does not appear in the vocabulary, the vector is all 0;

the second step: generating a one-hot vector for each word of the input text, while retaining the original position of each word;

the third step: determining the dimension N of the word vectors;

the fourth step: determining the window size and the batch size in the bag-of-words model, and iteratively training a neural network with softmax for a certain number of iterations to obtain the parameter matrix from the input layer to the hidden layer, wherein the transpose of each row of the matrix is the word vector of the corresponding word, that is, the vector of the corresponding instruction.

Preferably, the identification process in step S8 is to process the data to be identified by the method in step S2 to obtain a behavior sequence record, and then convert the behavior sequence record into a vector by the method in step S6, and input the vector into the classification model for identification.

The invention also provides a system for identifying abnormal host operation instructions, which comprises:

a sample data extraction module, which extracts system operation instruction log data in a specified time period as original sample data;

a data processing module, which, based on the sample data, divides the data by a set period, takes the user host account as an ID so that the set period and the ID together form a unique index, arranges the instructions in time order, and combines the instruction behaviors to form a behavior sequence record,

and which counts the usage frequency of each host operation instruction from the sample data;

an uncommon instruction screening module, which arranges the operation instruction frequencies in ascending order and selects, from the sorted frequency sequence, the operation instructions whose frequency is below a set threshold, to obtain a target operation instruction sequence;

a compact prediction tree training module, which converts the behavior sequence records into an array as input and performs model training with a compact prediction tree to obtain a target compact prediction tree;

a compact prediction tree prediction module, which, according to the screened target operation instruction sequence, selects the user accounts whose records contain those instructions, together with the corresponding behavior sequence records,

and which, based on the trained target compact prediction tree, predicts a plurality of instructions that may appear, using the target operation instruction sequence obtained by the uncommon instruction screening; if the predicted instruction set does not contain the uncommon instruction found in the actual data, the user's operation instruction behavior is judged to be abnormal; a labeled training data set is finally obtained;

a training operation instruction vector module, which takes the obtained host operation instruction sequences as input and performs pre-training with the word2vec algorithm to form pre-training vectors;

a classification recognition model establishing module, which, based on the obtained pre-training vectors, inputs the obtained training data set into a Bi-LSTM algorithm and trains a classification model for predicting whether a target is abnormal;

and a prediction module, which performs prediction with the classification model.

Preferably, the uncommon instruction screening module uses quantiles to select, from the sorted frequency sequence, the operation instructions whose frequency is below the set threshold.

Preferably, in the compact prediction tree prediction module, the compact prediction tree prediction step is as follows:

Preferably, the pre-training by using word2vec algorithm in the training operation instruction vector module comprises the following steps:

the third step: determining the dimension N of the word vector;

Preferably, the identification process of the prediction module is specifically as follows: the data to be identified are processed by the uncommon instruction screening module to obtain a behavior sequence record; the behavior sequence record is then converted into a vector by the training operation instruction vector module and input into the classification model for identification.

The invention has the advantages that:

The invention uses the compact prediction tree to analyze the operation instruction sequences of user hosts and studies the behavioral relations among the instruction sequences, thereby judging whether a user's host operation instructions are abnormal. In addition, the compact prediction tree is computationally efficient, exceeding other sequence analysis algorithms in this respect.

Building on the compact-prediction-tree identification of objects with abnormal host operation instructions, the method combines the word2vec algorithm with the Bi-LSTM algorithm to construct a model, forming an adaptive, iterative and stable anomaly identification model. Applying word2vec together with Bi-LSTM to the identification of abnormal host operation instructions takes full account of the internal relations among user operation instructions, studies the logical relations of the instructions in the time dimension, and improves the accuracy of identifying objects with abnormal host operation instructions.

Drawings

Fig. 1 is a flowchart of the method for identifying abnormal host operation instructions according to embodiment 1 of the present invention;

Fig. 2 shows example data for prediction tree training in the method for identifying abnormal host operation instructions according to embodiment 1 of the present invention;

Fig. 3 is a block diagram of the system for identifying abnormal host operation instructions according to embodiment 2 of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

As shown in fig. 1, the present embodiment provides a method for identifying an exception of a host operating instruction, including the following steps:

S1: sample data extraction

System operation instruction log data for a given quarter (or another period, such as a month or a year) are extracted as original sample data.

S2: data processing

Based on the sample data extracted in S1, the data are divided by month. With the user host account as the ID, the month and the ID together form a unique index; the instructions under each index are arranged in time order, and the instruction behaviors are combined to form a behavior sequence record, for example: June; root; cd, mv, cp, ls, ls, rm, …, reboot.

The usage frequency (count) of each host operation instruction is then computed from the data obtained in step S1.
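The data processing of S2 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the log layout (timestamp, account, instruction), the sample records, and the function names `build_sequences` and `count_frequencies` are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical log records: (timestamp, account, instruction).
LOG = [
    ("2019-06-01 09:00:03", "root", "cd"),
    ("2019-06-01 09:00:01", "root", "ls"),
    ("2019-06-02 10:15:00", "alice", "ls"),
    ("2019-06-01 09:00:07", "root", "mv"),
    ("2019-07-03 08:30:00", "root", "reboot"),
]

def build_sequences(log):
    """Group instructions under the unique index (month, account), ordered by time."""
    grouped = defaultdict(list)
    for ts, account, instruction in log:
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        grouped[(when.strftime("%Y-%m"), account)].append((when, instruction))
    # Sorting the (time, instruction) pairs puts each record in chronological order.
    return {key: [ins for _, ins in sorted(rows)] for key, rows in grouped.items()}

def count_frequencies(log):
    """Count how often each instruction occurs in the whole sample."""
    return Counter(instruction for _, _, instruction in log)

sequences = build_sequences(LOG)
frequencies = count_frequencies(LOG)
```

Here `sequences` maps each (month, account) index to one behavior sequence record, and `frequencies` supplies the counts that the screening step consumes.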

S3: screening of uncommon instructions

The operation instruction frequencies obtained in step S2 are arranged in ascending order, and quantiles are used to select, from the sorted frequency sequence, the operation instructions whose frequency falls below the first quartile (another threshold may be set according to actual conditions).
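One possible reading of the quantile screening is sketched below. The frequency table is invented for illustration, and the choice of the "inclusive" interpolation method in `statistics.quantiles` is an assumption; the text does not say how the quartile is computed.

```python
from statistics import quantiles

# Hypothetical instruction frequencies, as produced by the counting in S2.
FREQUENCIES = {
    "ls": 120, "cd": 95, "cat": 60, "cp": 40,
    "mv": 30, "rm": 12, "chmod": 3, "reboot": 1,
}

def screen_uncommon(frequencies, n_quantiles=4):
    """Return (uncommon instructions, threshold), rarest instruction first."""
    # Ascending arrangement by frequency, as described in S3.
    ordered = sorted(frequencies.items(), key=lambda kv: kv[1])
    counts = [count for _, count in ordered]
    # The first cut point of 4 quantiles is the first quartile.
    threshold = quantiles(counts, n=n_quantiles, method="inclusive")[0]
    uncommon = [ins for ins, count in ordered if count < threshold]
    return uncommon, threshold

uncommon, threshold = screen_uncommon(FREQUENCIES)
```

With these sample counts the first quartile is 9.75, so only `reboot` and `chmod` are treated as uncommon. Passing a different `n_quantiles` corresponds to the "other set thresholds" the text allows.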

S4: compact prediction tree training

The behavior sequence records obtained in S2 are converted into an array as input, and model training is performed with a compact prediction tree.

The compact prediction tree training steps are as follows:

for example, there are four sequences, represented in dictionary form: {'ID1': ['A', 'B', 'C'], 'ID2': ['A', 'B'], 'ID3': ['A', 'B', 'D', 'C'], 'ID4': ['B', 'C']};

building a prediction tree, wherein the prediction tree is a tree consisting of nodes, and each node has 3 elements:

data item (item): actual data items stored in the nodes;

child node (children): a list of all child nodes of the node;

parent node (parent): a link or reference to the parent of this node;

The prediction tree is essentially a trie (dictionary tree) data structure that compresses the entire training data into tree form.

As shown in FIG. 2, training starts with ID1. Beginning with A, it is checked whether A is a child of the root node; if not, A is added to the root node's list of children. The remaining items of ID1 are then added in order, each as a child of the previous item, until the last element of ID1, node C, has been added. ID2, ID3 and ID4 are added in the same way, and the result is the trained data structure.
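The tree construction described above can be sketched with the four example sequences. The class and function names (`Node`, `train_prediction_tree`) are illustrative, and only the prediction tree itself is built here; the inverted index used during prediction is not part of this sketch.

```python
class Node:
    """One prediction-tree node: a data item, its children, and a link to its parent."""

    def __init__(self, item=None, parent=None):
        self.item = item
        self.children = []
        self.parent = parent

    def child(self, item):
        """Return the child holding `item`, or None if there is no such child."""
        for node in self.children:
            if node.item == item:
                return node
        return None

def train_prediction_tree(sequences):
    """Insert every sequence into the tree, reusing nodes for shared prefixes."""
    root = Node()
    for sequence in sequences.values():
        current = root
        for item in sequence:
            nxt = current.child(item)
            if nxt is None:  # not yet a child: add it to the child list
                nxt = Node(item, current)
                current.children.append(nxt)
            current = nxt
    return root

TRAINING = {
    "ID1": ["A", "B", "C"],
    "ID2": ["A", "B"],
    "ID3": ["A", "B", "D", "C"],
    "ID4": ["B", "C"],
}
root = train_prediction_tree(TRAINING)
```

ID1, ID2 and ID3 share the prefix A, B, so they occupy a single branch that forks only after B, while ID4 starts a second branch under the root. This sharing is what compresses the training data.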

S5: compact prediction tree prediction

According to the instructions screened out in S3, the user accounts whose records contain those instructions are selected, together with the corresponding sequence records.

Based on the prediction tree trained in S4, the instruction sequence preceding the uncommon instruction is used to predict the 10 instructions most likely to appear; if the predicted instruction set does not contain the uncommon instruction found in the actual data, the user's operation instruction behavior is judged to be abnormal.

The compact prediction tree prediction steps are as follows:

in the prediction phase, a prediction is made for each sequence of data in the test set in an iterative manner. For a single row, we find a sequence similar to the row using the inverted index. Then we find a subsequent sequence of similar sequences, add terms in the subsequent sequence to the count dictionary, and give a score.

The first step is as follows: finding a sequence similar to the target sequence by the following steps:

finding unique terms for a target sequence

Finding a set of sequence IDs for the presence of a particular unique item

Then take the intersection of all unique item sets

The second step is that: finding the subsequent sequence of each sequence similar to the target sequence

For each similar sequence, the subsequent sequence is defined as the longest subsequence that follows the last occurrence, in the similar sequence, of the final item of the target sequence, minus the items already present in the target sequence.

The third step: adding elements in subsequent sequences and their scores to a counting dictionary

The count dictionary is initially empty ({}). Let n be the number of similar sequences and m the number of items currently in the count dictionary. If an item is not yet in the dictionary, its score is

score = 1 + 1/n + (1/(m + 1)) × 0.001;

otherwise, its score is

score = (1 + 1/n + (1/(m + 1)) × 0.001) × (original score).

The score dictionary obtained from the compact prediction tree is sorted in descending order of score, and the 10 highest-scoring predicted operation instructions are selected. If the actual operation instruction is not among the predicted instructions, the account is marked as exhibiting abnormal operation behavior. The operation instruction behavior of all user accounts within the quarter is judged in this way, with 1 denoting abnormal and 0 denoting normal, to form a labeled training data set.
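The three prediction steps and the labeling rule can be sketched together as follows. This is an illustration under stated assumptions, not a definitive implementation: the function names are invented, the similar sequences are visited in sorted ID order, and the distance term is read as 1/(m + 1), where m is the current size of the count dictionary.

```python
TRAINING = {
    "ID1": ["A", "B", "C"],
    "ID2": ["A", "B"],
    "ID3": ["A", "B", "D", "C"],
    "ID4": ["B", "C"],
}

def build_inverted_index(sequences):
    """Map each item to the set of sequence IDs that contain it."""
    index = {}
    for seq_id, sequence in sequences.items():
        for item in sequence:
            index.setdefault(item, set()).add(seq_id)
    return index

def similar_ids(index, target):
    """Step 1: intersect the ID sets of every unique item of the target."""
    unique = set(target)
    if not unique or any(item not in index for item in unique):
        return set()
    return set.intersection(*(index[item] for item in unique))

def consequent(sequence, target):
    """Step 2: items after the last occurrence of the target's final item, minus target items."""
    last = len(sequence) - 1 - sequence[::-1].index(target[-1])
    return [item for item in sequence[last + 1:] if item not in target]

def predict(sequences, target, k=10):
    """Step 3: score the consequent items and return the k highest-scoring ones."""
    ids = sorted(similar_ids(build_inverted_index(sequences), target))
    scores = {}
    for seq_id in ids:
        for item in consequent(sequences[seq_id], target):
            base = 1 + 1 / len(ids) + 0.001 / (len(scores) + 1)
            scores[item] = base * scores[item] if item in scores else base
    return sorted(scores, key=scores.get, reverse=True)[:k]

def label(sequences, preceding, actual, k=10):
    """Return 1 (abnormal) if the actual instruction is not predicted, else 0 (normal)."""
    return 0 if actual in predict(sequences, preceding, k) else 1
```

For the target sequence A, B the similar sequences are ID1, ID2 and ID3, and the predictions are C followed by D. An account whose next instruction is C is therefore labeled 0, while one that issues an unpredicted instruction is labeled 1.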

S6: training operational instruction vectors using word2vec

To account for the logical structure among the operation instructions, the host operation instruction sequences obtained in S2 are taken as input and pre-trained with the word2vec algorithm to form operation instruction vectors. The main steps are as follows:

The first step: the operation instruction sequences are treated as text, with each operation instruction corresponding to a word in the text. A vocabulary is generated for the input text by counting the frequency of each word, sorting the words from high to low frequency, and taking the V most frequent words. Each word has a one-hot vector of dimension V: if the word appears in the vocabulary, the position in the vector corresponding to its position in the vocabulary is 1 and all other positions are 0; if the word does not appear in the vocabulary, the vector is all 0.

The second step: a one-hot vector is generated for each word of the input text, and the original position of each word is retained, because words depend on their context.

The third step: the dimension N of the word vectors is determined.

The fourth step: the window size and the batch size in the bag-of-words model are determined, and a neural network with softmax is trained iteratively for a certain number of iterations to obtain the parameter matrix from the input layer to the hidden layer. The transpose of each row of this matrix is the word vector of the corresponding word, that is, the vector of the corresponding instruction.
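The vocabulary, the one-hot encoding, and the relation between the input-to-hidden matrix and the word vectors can be illustrated as follows. The training loop itself is not shown; the matrix `W` holds made-up values standing in for a trained V × N parameter matrix, and the sequences and function names are assumptions.

```python
from collections import Counter

# Hypothetical behavior sequence records.
SEQUENCES = [["cd", "ls", "ls", "rm"], ["ls", "cd", "cp"], ["ls", "rm", "cd"]]

def build_vocabulary(sequences, size):
    """Keep the `size` most frequent instructions, most frequent first."""
    counts = Counter(ins for sequence in sequences for ins in sequence)
    return [ins for ins, _ in counts.most_common(size)]

def one_hot(instruction, vocabulary):
    """V-dimensional vector: 1 at the instruction's position, all 0 if it is not in the vocabulary."""
    return [1 if word == instruction else 0 for word in vocabulary]

def project(vector, matrix):
    """Multiply a 1 x V one-hot vector by a V x N input-to-hidden matrix."""
    rows, cols = len(matrix), len(matrix[0])
    return [sum(vector[i] * matrix[i][j] for i in range(rows)) for j in range(cols)]

vocabulary = build_vocabulary(SEQUENCES, size=3)   # V = 3
W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]          # stand-in for trained weights, N = 2
cd_vector = project(one_hot("cd", vocabulary), W)
cp_vector = project(one_hot("cp", vocabulary), W)
```

Because a one-hot vector selects exactly one row of `W`, `cd_vector` is simply the row of `W` at the position of `cd` in the vocabulary, which is why the rows of the trained matrix serve as instruction vectors. The instruction `cp` falls outside the top three, so its one-hot vector and its projection are all zeros.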

S7: establishing classification recognition model by utilizing Bi-LSTM

Based on the pre-training vectors obtained in step S6, the training data set obtained in step S5 is input into a Bi-LSTM algorithm, and a classification model is trained to predict whether a target is abnormal.

S8: anomaly identification

The data to be identified are processed by the method of step S2 to obtain a behavior sequence record; the behavior sequence record is then converted into a vector by the method of step S6 and input into the classification model for identification.
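The conversion of a behavior sequence record into model input might look like the sketch below. The embedding table, the fixed length `max_len`, and the zero-padding are assumptions introduced for illustration; the text states only that instructions absent from the vocabulary have all-zero vectors, and the classification model itself is not shown.

```python
# Hypothetical pre-trained instruction vectors (N = 2), as produced in S6.
EMBEDDINGS = {"ls": [0.1, 0.2], "cd": [0.3, 0.4]}

def vectorize(sequence, embeddings, dim, max_len):
    """Turn an instruction sequence into max_len rows of dim-dimensional vectors."""
    zero = [0.0] * dim
    # Unknown instructions map to the all-zero vector.
    rows = [list(embeddings.get(ins, zero)) for ins in sequence[:max_len]]
    # Assumed convention: pad short sequences with zero rows to a fixed length.
    rows.extend([0.0] * dim for _ in range(max_len - len(rows)))
    return rows

matrix = vectorize(["cd", "ls", "nc"], EMBEDDINGS, dim=2, max_len=4)
```

The resulting matrix has one row per position in the sequence, which is the shape a sequence classifier such as the Bi-LSTM model of S7 would consume.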

In this embodiment, the compact prediction tree is used to analyze the operation instruction sequences of user hosts and to study the behavioral relations among the instruction sequences, thereby judging whether a user's host operation instructions are abnormal. In addition, the compact prediction tree is computationally efficient, exceeding other sequence analysis algorithms in this respect.

In this embodiment, building on the compact-prediction-tree identification of objects with abnormal host operation instructions, the word2vec algorithm and the Bi-LSTM algorithm are combined to construct a model, forming an adaptive, iterative and stable anomaly identification model. Applying word2vec together with Bi-LSTM to the identification of abnormal host operation instructions takes full account of the internal relations among user operation instructions, studies the logical relations of the instructions in the time dimension, and improves the accuracy of identifying objects with abnormal host operation instructions.

Example 2

As shown in fig. 3, corresponding to embodiment 1, this embodiment further provides a system for recognizing an exception of a host operation command, which includes

the uncommon instruction screening module, which arranges the operation instruction frequencies in ascending order and uses quantiles to select, from the sorted frequency sequence, the operation instructions whose frequency is below a set threshold, to obtain a target operation instruction sequence;

the compact prediction tree training module is used for converting the behavior sequence record input into an array and performing model training by using a compact prediction tree; and obtaining the target compact prediction tree. The compact prediction tree prediction steps are as follows:

the training operation instruction vector module is used for inputting the obtained host operation instruction sequence and pre-training by utilizing a word2vec algorithm to form a pre-training vector; the pre-training step by using the word2vec algorithm is as follows:

the third step: determining the dimension N of the word vector;

and the prediction module, in which the data to be identified are processed by the uncommon instruction screening module to obtain a behavior sequence record, and the behavior sequence record is then converted into a vector by the training operation instruction vector module and input into the classification model for identification.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for identifying abnormal host operation instructions, characterized in that the method comprises the following steps:

S1: sample data extraction

S2: data processing

S3: screening of uncommon instructions

S4: compact prediction tree training

Converting the behavior sequence records obtained in S2 into an array as input, and performing model training with a compact prediction tree to obtain a target compact prediction tree;

S5: compact prediction tree prediction

According to the target operation instruction sequence screened out in S3, selecting the user accounts whose records contain those instructions, together with the corresponding behavior sequence records, wherein the compact prediction tree prediction steps are as follows:

the third step: adding the elements of the subsequent sequences, with their scores, to a score dictionary; sorting the score dictionary obtained from the compact prediction tree in descending order of score and selecting the highest-scoring predicted operation instructions; if the actual operation instruction is not among the predicted instructions, marking the account as exhibiting abnormal operation behavior; judging the operation instruction behavior of all user accounts in this way, with 1 denoting abnormal and 0 denoting normal, to form a labeled training data set; based on the target compact prediction tree trained in S4, predicting a plurality of instructions that may appear, using the target operation instruction sequence obtained by the uncommon instruction screening of S3; if the predicted instruction set does not contain the uncommon instruction found in the actual data, judging the user's operation instruction behavior to be abnormal; and finally obtaining a labeled training data set;

S6: training operation instruction vectors using word2vec

S7: establishing a classification recognition model using Bi-LSTM

S8: predicting with the classification model.

2. The method for identifying abnormal host operation instructions according to claim 1, characterized in that: in step S3, quantiles are used to select, from the sorted frequency sequence, the operation instructions whose frequency is below the set threshold.

3. The method for identifying abnormal host operation instructions according to claim 1 or 2, characterized in that: the pre-training steps using the word2vec algorithm in step S6 are as follows:

the third step: determining the dimension N of the word vector;

4. The method for identifying abnormal host operation instructions according to claim 1 or 2, characterized in that: the identification process of step S8 is specifically as follows: the data to be identified are processed by the method of step S2 to obtain a behavior sequence record; the behavior sequence record is then converted into a vector by the method of step S6 and input into the classification model for identification.

5. A system for identifying abnormal host operation instructions, characterized in that the system comprises:

the compact prediction tree training module, which converts the behavior sequence records into an array as input and performs model training with a compact prediction tree to obtain a target compact prediction tree; the compact prediction tree prediction steps are as follows:

the third step: adding the elements of the subsequent sequences, with their scores, to a score dictionary; sorting the score dictionary obtained from the compact prediction tree in descending order of score and selecting the highest-scoring predicted operation instructions; if the actual operation instruction is not among the predicted instructions, marking the account as exhibiting abnormal operation behavior; and judging the operation instruction behavior of all user accounts in this way, with 1 denoting abnormal and 0 denoting normal, to form a labeled training data set;

6. The system for identifying abnormal host operation instructions according to claim 5, characterized in that: the uncommon instruction screening module uses quantiles to select, from the sorted frequency sequence, the operation instructions whose frequency is below the set threshold.

7. The system of claim 5 or 6, wherein: the pre-training step by using the word2vec algorithm in the training operation instruction vector module is as follows:

the third step: determining the dimension N of the word vector;

8. The system of claim 5 or 6, wherein: the identification process of the prediction module is specifically as follows: the data to be identified are processed by the uncommon instruction screening module to obtain a behavior sequence record; the behavior sequence record is then converted into a vector by the training operation instruction vector module and input into the classification model for identification.