CN111143840A - Method and system for identifying abnormity of host operation instruction - Google Patents

Publication number
CN111143840A
Authority
CN
China
Prior art keywords
sequence
operation instruction
vector
training
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911406512.7A
Other languages
Chinese (zh)
Other versions
CN111143840B (en)
Inventor
殷钱安
梁淑云
刘胜
马影
陶景龙
王启凡
魏国富
徐�明
余贤喆
周晓勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Data Security Solutions Co Ltd
Original Assignee
Information and Data Security Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Data Security Solutions Co Ltd
Priority to CN201911406512.7A
Publication of CN111143840A
Application granted
Publication of CN111143840B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/554: Detecting local intrusion or implementing counter-measures involving event detection and direct action

Abstract

The invention provides a method and a system for identifying abnormal host operation instructions. The method comprises the following steps: S1, extracting sample data; S2, processing the data to obtain behavior sequence records and the usage frequency of each host operation instruction; S3, screening for uncommon instructions to obtain a target operation instruction sequence; S4, training a compact prediction tree to obtain a target compact prediction tree; S5, predicting with the compact prediction tree to obtain a labeled training data set; S6, training operation instruction vectors with word2vec to form pre-trained vectors; S7, establishing a classification model with Bi-LSTM; S8, predicting with the classification model. The invention uses the compact prediction tree to analyze the operation instruction sequences of user hosts and studies the behavioral relationships among instruction sequences in order to judge whether a user host's operation instructions are abnormal. In this way the internal relationships among user operation instructions are fully considered, the logical relationship of the instructions in the time dimension is studied, and the accuracy of identifying objects with abnormal host operation instructions is improved.

Description

Method and system for identifying abnormity of host operation instruction
Technical Field
The invention relates to computer data security, and in particular to a method and a system for identifying abnormal host operation instructions.
Background
Computer system security is a key part of information security. It has become a core technology of computer information systems and is an important foundation of, and complement to, network security.
As modern information technology continues to develop, computers are applied across all industries. A national computer network information security mechanism has been established, and security protection is carried out in information security and related fields. However, heavily used computers remain difficult to manage, and because the current network information security protection mechanism is still incomplete, computer systems remain exposed to threats such as information leakage and information tampering, which endanger the security of data. How to effectively identify dangerous operation behaviors in a computer system and protect the system's internal security remains one of the open problems of network information security.
Existing anomaly mining methods are mainly distance-based, statistics-based, density-based, or clustering-based. Scholars in China and abroad have studied the theory of these methods in depth and achieved significant results, but shortcomings remain. In distance-based methods, choosing the distance function and parameters is difficult; statistics-based methods require the data distribution to be known in advance, yet the distribution function is hard to obtain beforehand; density-based methods have high time complexity; clustering-based methods focus mainly on the clustering problem itself. These problems limit the application of anomaly mining methods. The methods mainly handle deterministic data and lack effective theoretical models and methods for uncertain information, and for discrete sequence data they cannot take into account the inherent logical relationships between sequence behaviors. As for sequence anomaly detection, the commonly used Markov models and directed graph models are inefficient on large data sets.
Classification models in the prior art are built on the features of individual instructions. They do not fully consider the relationships among instructions and cannot make full use of the internal logical relationship, in the time dimension, between earlier and later host operation instructions.
Disclosure of Invention
The invention aims to provide a method, based on temporal relations, that identifies abnormal host operation instructions with a high recognition rate.
The invention solves the technical problems through the following technical means:
A method for identifying abnormal host operation instructions, comprising the following steps:
S1: sample data extraction
Extracting system operation instruction log data within a specified time period as original sample data;
S2: data processing
Based on the sample data extracted in S1, partitioning the data by a set period; taking the user host account as the ID, so that the set period and the ID together form a unique index; arranging the instructions in chronological order and concatenating the instruction behaviors to form a behavior sequence record;
counting the usage frequency of each host operation instruction from the sample data obtained in S1;
S3: screening of uncommon instructions
Sorting the operation instruction frequencies obtained in S2 in ascending order, and screening out of the sorted frequency sequence the operation instructions whose frequency is below a set threshold, to obtain a target operation instruction sequence;
S4: compact prediction tree training
Converting the behavior sequence records obtained in S2 into arrays as input, and performing model training with a compact prediction tree to obtain a target compact prediction tree;
S5: compact prediction tree prediction
Selecting, according to the target operation instruction sequence screened out in S3, the user accounts whose records contain those instructions and the corresponding behavior sequence records;
based on the target compact prediction tree trained in S4, using the target operation instruction sequence screened for uncommon instructions in S3 to predict a plurality of instructions that may appear; if the predicted instruction set does not contain the uncommon instruction found in the actual data, judging the user's operation instruction behavior to be abnormal; and finally obtaining a labeled training data set;
S6: training operation instruction vectors using word2vec
Using the host operation instruction sequences obtained in S2 as input, pre-training with the word2vec algorithm to form pre-trained vectors;
S7: establishing a classification model using Bi-LSTM
Based on the pre-trained vectors obtained in S6, inputting the training data set obtained in S5 into a Bi-LSTM algorithm and training a classification model that predicts whether a target is abnormal;
S8: predicting with the classification model.
Preferably, in step S3, the operation instructions below the set threshold are screened out of the sorted frequency sequence using quantiles.
Preferably, in step S5, the compact prediction tree prediction steps are:
First step: finding the sequences similar to the target operation instruction sequence, by: finding the unique items of the target operation instruction sequence; finding the set of sequence IDs in which each unique item is present; then taking the intersection of the sets for all unique items;
Second step: finding the subsequent sequence of each sequence similar to the target operation instruction sequence, specifically: for each similar sequence, the subsequent sequence is defined as the longest subsequence of the similar sequence that follows the occurrence of the last item of the target operation instruction sequence, minus the items present in the target operation instruction sequence;
Third step: adding the elements of the subsequent sequences and their scores to a score dictionary; sorting the score dictionary obtained from the compact prediction tree in descending order of score and selecting the highest-scoring predicted operation instructions; if the actual operation instruction is not among the predicted instructions, treating the account's operation as abnormal operation behavior; judging the operation instruction behavior of all user accounts in this way, with 1 denoting abnormal and 0 denoting normal, to form a labeled training data set.
Preferably, the pre-training with the word2vec algorithm in step S6 comprises:
First step: treating the operation instruction sequence as a text structure in which each operation instruction corresponds to a word; generating a vocabulary for the input text by counting the frequency of each word, sorting from high to low frequency, and taking the V most frequent words as the vocabulary; each word has a one-hot vector of dimension V; if the word appears in the vocabulary, the position corresponding to the word in the vocabulary is 1 and all other positions are 0; if the word does not appear in the vocabulary, the vector is all zeros;
Second step: generating a one-hot vector for each word of the input text while retaining the original position of each word;
Third step: determining the dimension N of the word vector;
Fourth step: determining the window size and the batch size in the bag-of-words model, using softmax, and iteratively training a neural network a certain number of times to obtain the parameter matrix from the input layer to the hidden layer, wherein the transpose of each row of the matrix is the word vector of the corresponding word, that is, the vector of the corresponding instruction.
Preferably, the identification process in step S8 is as follows: the data to be identified is processed by the method of step S2 to obtain a behavior sequence record, which is then converted into a vector by the method of step S6 and input into the classification model for identification.
The invention also provides a system for identifying abnormal host operation instructions, comprising:
a sample data extraction module, which extracts system operation instruction log data within a specified time period as original sample data;
a data processing module, which partitions the sample data by a set period, takes the user host account as the ID so that the set period and the ID together form a unique index, arranges the instructions in chronological order, and concatenates the instruction behaviors to form a behavior sequence record,
and which counts the usage frequency of each host operation instruction from the sample data;
an uncommon instruction screening module, which sorts the operation instruction frequencies in ascending order and screens out of the sorted frequency sequence the operation instructions below a set threshold, to obtain a target operation instruction sequence;
a compact prediction tree training module, which converts the behavior sequence records into arrays as input and performs model training with a compact prediction tree to obtain a target compact prediction tree;
a compact prediction tree prediction module, which selects, according to the screened target operation instruction sequence, the user accounts whose records contain those instructions and the corresponding behavior sequence records,
and which, based on the trained target compact prediction tree, uses the target operation instruction sequence screened for uncommon instructions to predict a plurality of instructions that may appear; if the predicted instruction set does not contain the uncommon instruction found in the actual data, the user's operation instruction behavior is judged abnormal, and a labeled training data set is finally obtained;
a training operation instruction vector module, which takes the obtained host operation instruction sequences as input and pre-trains with the word2vec algorithm to form pre-trained vectors;
a classification model establishing module, which, based on the obtained pre-trained vectors, inputs the obtained training data set into a Bi-LSTM algorithm and trains a classification model that predicts whether a target is abnormal;
and a prediction module, which predicts with the classification model.
Preferably, the uncommon instruction screening module screens the operation instructions below the set threshold out of the sorted frequency sequence using quantiles.
Preferably, in the compact prediction tree prediction module, the compact prediction tree prediction steps are:
First step: finding the sequences similar to the target operation instruction sequence, by: finding the unique items of the target operation instruction sequence; finding the set of sequence IDs in which each unique item is present; then taking the intersection of the sets for all unique items;
Second step: finding the subsequent sequence of each sequence similar to the target operation instruction sequence, specifically: for each similar sequence, the subsequent sequence is defined as the longest subsequence of the similar sequence that follows the occurrence of the last item of the target operation instruction sequence, minus the items present in the target operation instruction sequence;
Third step: adding the elements of the subsequent sequences and their scores to a score dictionary; sorting the score dictionary obtained from the compact prediction tree in descending order of score and selecting the highest-scoring predicted operation instructions; if the actual operation instruction is not among the predicted instructions, treating the account's operation as abnormal operation behavior; judging the operation instruction behavior of all user accounts in this way, with 1 denoting abnormal and 0 denoting normal, to form a labeled training data set.
Preferably, the pre-training with the word2vec algorithm in the training operation instruction vector module comprises:
First step: treating the operation instruction sequence as a text structure in which each operation instruction corresponds to a word; generating a vocabulary for the input text by counting the frequency of each word, sorting from high to low frequency, and taking the V most frequent words as the vocabulary; each word has a one-hot vector of dimension V; if the word appears in the vocabulary, the position corresponding to the word in the vocabulary is 1 and all other positions are 0; if the word does not appear in the vocabulary, the vector is all zeros;
Second step: generating a one-hot vector for each word of the input text while retaining the original position of each word;
Third step: determining the dimension N of the word vector;
Fourth step: determining the window size and the batch size in the bag-of-words model, using softmax, and iteratively training a neural network a certain number of times to obtain the parameter matrix from the input layer to the hidden layer, wherein the transpose of each row of the matrix is the word vector of the corresponding word, that is, the vector of the corresponding instruction.
Preferably, the identification process of the prediction module is as follows: the data to be identified is processed by the uncommon instruction screening module to obtain a behavior sequence record, which is then converted into a vector by the training operation instruction vector module and input into the classification model for identification.
The invention has the advantages that:
The invention uses the compact prediction tree to analyze the operation instruction sequences of user hosts and studies the behavioral relationships among instruction sequences in order to judge whether a user host's operation instructions are abnormal. The compact prediction tree is also computationally efficient compared with other sequence analysis algorithms.
On the basis of identifying objects with abnormal host operation instructions using the compact prediction tree, the method combines the word2vec algorithm and the Bi-LSTM algorithm to build a model, forming an adaptive, iterative, and stable anomaly identification model. Applying the combined word2vec and Bi-LSTM algorithms to the identification of abnormal host operation instructions fully considers the internal relationships among user operation instructions, studies the logical relationship of the instructions in the time dimension, and improves the accuracy of identifying objects with abnormal host operation instructions.
Drawings
Fig. 1 is a flowchart of the method for identifying abnormal host operation instructions according to embodiment 1 of the present invention;
Fig. 2 shows example data for prediction tree training in the method for identifying abnormal host operation instructions according to embodiment 1 of the present invention;
Fig. 3 is a block diagram of the system for identifying abnormal host operation instructions according to embodiment 2 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in Fig. 1, this embodiment provides a method for identifying abnormal host operation instructions, comprising the following steps:
S1: sample data extraction
System operation instruction log data for a certain quarter (or another time period, such as a month or a year) is extracted as original sample data.
S2: data processing
Based on the sample data extracted in S1, the data is partitioned by month. With the user host account as the ID, the month and the ID together form a unique index; the instructions are arranged in chronological order and the instruction behaviors are concatenated to form a behavior sequence record, for example: month 6; root; cd, mv, cp, ls, ls, rm, …, reboot;
The usage frequency of each host operation instruction is counted from the data obtained in step S1.
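The data processing step can be illustrated with a minimal Python sketch. The record layout (timestamp, account, instruction) and the sample values are hypothetical, since the patent does not specify a log format; the month is taken from an ISO-style timestamp string.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (timestamp, account, instruction).
log = [
    ("2019-06-01 09:00:00", "root", "cd"),
    ("2019-06-01 09:00:05", "root", "mv"),
    ("2019-06-01 09:01:00", "root", "ls"),
    ("2019-07-02 10:00:00", "root", "ls"),
    ("2019-06-03 08:00:00", "alice", "ls"),
    ("2019-06-03 08:00:10", "alice", "reboot"),
]

def build_sequences(records):
    """Group instructions by the unique index (month, account), in time order."""
    grouped = defaultdict(list)
    # ISO-style timestamps sort chronologically when compared as strings.
    for timestamp, account, instruction in sorted(records):
        month = timestamp[:7]  # "YYYY-MM" is the period part of the index
        grouped[(month, account)].append(instruction)
    return dict(grouped)

def instruction_frequency(records):
    """Count how often each instruction occurs in the sample data."""
    return Counter(instruction for _, _, instruction in records)

sequences = build_sequences(log)
frequency = instruction_frequency(log)
```

Each key of `sequences` corresponds to one behavior sequence record, and `frequency` is the input to the screening step.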
S3: screening of uncommon instructions
The operation instruction frequencies obtained in step S2 are sorted in ascending order, and the operation instructions whose frequency falls below the one-quarter quantile (another threshold may be set according to actual conditions) are screened out of the sorted frequency sequence using quantiles.
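As a rough illustration of the quantile screening, the sketch below uses Python's `statistics.quantiles` to compute the first quartile of the frequency counts and keeps the instructions strictly below it. The frequency values are invented for the example, and the interpolation method used for the quartile is an assumption, since the patent does not say how the quantile is computed.

```python
import statistics

def screen_uncommon(frequency, n_quantiles=4):
    """Return the instructions whose frequency is below the first quantile cut."""
    # Ascending sort by frequency, as described in S3.
    ordered = sorted(frequency.items(), key=lambda pair: pair[1])
    counts = [count for _, count in ordered]
    # The first cut point of a 4-quantile split is the one-quarter quantile.
    threshold = statistics.quantiles(counts, n=n_quantiles)[0]
    return [name for name, count in ordered if count < threshold]

# Hypothetical frequency table.
frequency = {
    "ls": 120, "cd": 95, "cp": 40, "mv": 30,
    "rm": 12, "reboot": 2, "iptables": 1, "dd": 1,
}
uncommon = screen_uncommon(frequency)
```

Passing a different `n_quantiles` changes the threshold, which corresponds to the "other set thresholds" the text allows.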
S4: compact predictive tree training
The behavior sequence records obtained in S2 are converted into arrays as input, and model training is performed with a compact prediction tree.
The compact prediction tree training steps are as follows:
For example, there are four sequences, represented in dictionary form: {'ID1': ['A', 'B', 'C'], 'ID2': ['A', 'B'], 'ID3': ['A', 'B', 'D', 'C'], 'ID4': ['B', 'C']};
A prediction tree is built. The prediction tree is a tree of nodes, and each node has 3 elements:
data item (item): the actual data item stored in the node;
child nodes (children): a list of all child nodes of the node;
parent node (parent): a link or reference to the parent of the node.
The prediction tree is essentially a trie (dictionary tree) data structure that compresses the entire training data into tree form.
As shown in Fig. 2, starting with ID1 and its first item A, it is checked whether A is a child of the root node. If not, A is added to the root node's list of child nodes, and the remaining items are added as child nodes in the order of ID1 until its last element, node C, has been added. ID2, ID3, and ID4 are added in the same way, and the trained data structure is finally formed.
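The tree construction can be sketched in Python as follows. The node fields follow the three elements listed above. The inverted index (item to sequence IDs) is mentioned only in the prediction step, and the lookup table (sequence ID to last node) is not mentioned in the text at all; both are built here as assumptions about how the auxiliary structures might be kept.

```python
class Node:
    """One prediction tree node: item, children, and a link to the parent."""
    def __init__(self, item=None, parent=None):
        self.item = item
        self.children = []
        self.parent = parent

    def child(self, item):
        """Return the child holding `item`, or None if there is none."""
        for node in self.children:
            if node.item == item:
                return node
        return None

def train(sequences):
    """Insert every sequence into the tree, reusing shared prefixes."""
    root = Node()
    inverted_index = {}  # item -> set of sequence IDs containing it
    lookup = {}          # assumed helper: sequence ID -> last node of the sequence
    for seq_id, items in sequences.items():
        current = root
        for item in items:
            nxt = current.child(item)
            if nxt is None:
                nxt = Node(item, current)
                current.children.append(nxt)
            current = nxt
            inverted_index.setdefault(item, set()).add(seq_id)
        lookup[seq_id] = current
    return root, inverted_index, lookup

def path_to(node):
    """Rebuild a sequence by walking parent links back to the root."""
    items = []
    while node.parent is not None:
        items.append(node.item)
        node = node.parent
    return items[::-1]

sequences = {
    "ID1": ["A", "B", "C"],
    "ID2": ["A", "B"],
    "ID3": ["A", "B", "D", "C"],
    "ID4": ["B", "C"],
}
root, inverted_index, lookup = train(sequences)
```

Because ID1, ID2, and ID3 share the prefix A, B, the root ends up with only two children (A and B), which is the compression the trie provides.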
S5: compact prediction tree prediction
According to the instructions screened out in S3, the user accounts whose records contain those instructions and the corresponding sequence records are selected.
Based on the prediction tree trained in S4, the instruction sequence preceding the uncommon instruction is used to predict the 10 instructions most likely to appear; if the predicted instruction set does not contain the uncommon instruction found in the actual data, the user's operation instruction behavior is judged abnormal.
The compact prediction tree prediction steps are as follows:
In the prediction phase, a prediction is made iteratively for each data sequence in the test set. For a single sequence, the sequences similar to it are found using the inverted index. The subsequent sequences of the similar sequences are then found, and the items in the subsequent sequences are added to a count dictionary and given scores.
First step: find the sequences similar to the target sequence, as follows:
find the unique items of the target sequence;
find the set of sequence IDs in which each unique item is present;
then take the intersection of the sets for all unique items.
Second step: find the subsequent sequence of each sequence similar to the target sequence.
For each similar sequence, the subsequent sequence is defined as the longest subsequence of the similar sequence that follows the occurrence of the last item of the target sequence, minus the items present in the target sequence.
Third step: add the elements of the subsequent sequences and their scores to the count dictionary.
The count dictionary starts as an empty dictionary {}. If an item is not yet in the dictionary, its score is 1 + (1 / number of similar sequences) + (1 / (number of items currently in the count dictionary + 1)) × 0.001; otherwise, its score is (1 + (1 / number of similar sequences) + (1 / number of items currently in the count dictionary) × 0.001) × the original score.
The score dictionary obtained from the compact prediction tree is sorted in descending order of score, and the 10 highest-scoring predicted operation instructions are selected; if the actual operation instruction is not among the predicted instructions, the account's operation is treated as abnormal operation behavior. The operation instruction behavior of all user accounts is judged in this way, identifying abnormal operation instructions for each user account within one quarter, with 1 denoting abnormal and 0 denoting normal, to form a labeled training data set.
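The three prediction steps and the labeling rule can be sketched in Python as below. For brevity the sketch reads sequences from a plain dictionary instead of walking the prediction tree, which is a simplification. It also assumes that the subsequent sequence starts after the last occurrence of the target's final item, and that the scoring rule groups as 1 / (n + 1) for a new item and 1 / n for an existing one; the function names are hypothetical.

```python
def predict(training, target, k=10):
    """Return (top-k predicted items, score dictionary) for a target sequence."""
    if not target:
        return [], {}
    # Inverted index: item -> IDs of the training sequences that contain it.
    index = {}
    for seq_id, items in training.items():
        for item in items:
            index.setdefault(item, set()).add(seq_id)

    # Step 1: similar sequences contain every unique item of the target.
    unique_items = set(target)
    similar = set.intersection(*(index.get(item, set()) for item in unique_items))

    scores = {}
    for seq_id in sorted(similar):
        seq = training[seq_id]
        # Step 2: take what follows the last occurrence of the target's final
        # item (assumption), minus items already present in the target.
        last = len(seq) - 1 - seq[::-1].index(target[-1])
        consequent = [item for item in seq[last + 1:] if item not in unique_items]
        # Step 3: score each item of the subsequent sequence.
        for item in consequent:
            base = 1 + 1 / len(similar)
            if item not in scores:
                scores[item] = base + 0.001 / (len(scores) + 1)
            else:
                scores[item] *= base + 0.001 / len(scores)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

def label(training, sequence, uncommon, k=10):
    """1 = abnormal, 0 = normal, judged from the prefix before `uncommon`."""
    prefix = sequence[:sequence.index(uncommon)]
    predicted, _ = predict(training, prefix, k)
    return 0 if uncommon in predicted else 1

training = {
    "ID1": ["A", "B", "C"],
    "ID2": ["A", "B"],
    "ID3": ["A", "B", "D", "C"],
    "ID4": ["B", "C"],
}
predicted, scores = predict(training, ["A", "B"])
normal = label(training, ["A", "B", "C"], "C")
abnormal = label(training, ["A", "B", "X"], "X")
```

With the target A, B the similar sequences are ID1, ID2, and ID3, so C (seen after the target in two of them) outranks D. An instruction such as the made-up X, which never follows A, B in the training data, is labeled 1.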
S6: training operational instruction vectors using word2vec
To take the logical structure among operation instructions into account, the host operation instruction sequences obtained in S2 are used as input and pre-trained with the word2vec algorithm to form operation instruction vectors. The main steps are as follows:
First step: treat the operation instructions as a text structure in which each operation instruction corresponds to a word. Generate a vocabulary for the input text by counting the frequency of each word, sorting from high to low frequency, and taking the V most frequent words as the vocabulary. Each word has a one-hot vector of dimension V; if the word appears in the vocabulary, the position corresponding to the word in the vocabulary is 1 and all other positions are 0. If the word does not appear in the vocabulary, the vector is all zeros;
Second step: generate a one-hot vector for each word of the input text, retaining the original position of each word because words are context-dependent;
Third step: determine the dimension N of the word vector;
Fourth step: determine the window size and the batch size in the bag-of-words model, and iteratively train a neural network with softmax a certain number of times to obtain the parameter matrix from the input layer to the hidden layer, wherein the transpose of each row of the matrix is the word vector of the corresponding word, that is, the vector of the corresponding instruction.
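The four steps can be illustrated with a small pure-Python sketch. It assumes that the "bag-of-words model" refers to word2vec's continuous bag-of-words (CBOW) variant, uses a full softmax, and updates the weights after every sample (a batch size of 1). The hyperparameter values, the sample sequences, and the function names are illustrative only, and a production system would more likely use an existing word2vec library.

```python
import math
import random
from collections import Counter

def build_vocab(sequences, V):
    """Step 1: keep the V most frequent instructions as the vocabulary."""
    counts = Counter(tok for seq in sequences for tok in seq)
    return [tok for tok, _ in counts.most_common(V)]

def one_hot(token, vocab):
    """Steps 1-2: V-dimensional one-hot vector, all zeros if out of vocabulary."""
    vec = [0] * len(vocab)
    if token in vocab:
        vec[vocab.index(token)] = 1
    return vec

def train_cbow(sequences, vocab, N=8, window=2, epochs=50, lr=0.05, seed=0):
    """Steps 3-4: train a CBOW network and return the input-to-hidden matrix."""
    rng = random.Random(seed)
    V = len(vocab)
    pos = {tok: i for i, tok in enumerate(vocab)}
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]
    for _ in range(epochs):
        for seq in sequences:
            ids = [pos[t] for t in seq if t in pos]
            for i, target in enumerate(ids):
                context = ids[max(0, i - window):i] + ids[i + 1:i + 1 + window]
                if not context:
                    continue
                # Hidden layer: mean of the context word vectors.
                h = [sum(w_in[c][k] for c in context) / len(context) for k in range(N)]
                u = [sum(h[k] * w_out[k][j] for k in range(N)) for j in range(V)]
                peak = max(u)
                exp_u = [math.exp(x - peak) for x in u]
                total = sum(exp_u)
                # Softmax output minus the one-hot target is the output error.
                err = [e / total for e in exp_u]
                err[target] -= 1.0
                grad_h = [sum(w_out[k][j] * err[j] for j in range(V)) for k in range(N)]
                for k in range(N):
                    for j in range(V):
                        w_out[k][j] -= lr * h[k] * err[j]
                for c in context:
                    for k in range(N):
                        w_in[c][k] -= lr * grad_h[k] / len(context)
    return w_in  # row i is the vector of vocab[i]

sequences = [
    ["cd", "ls", "cp", "ls", "rm"],
    ["cd", "ls", "mv", "ls"],
    ["ls", "cd", "ls", "reboot"],
]
vocab = build_vocab(sequences, V=4)
vectors = train_cbow(sequences, vocab)
ls_vector = vectors[vocab.index("ls")]
```

Multiplying an instruction's one-hot vector by the returned matrix selects exactly one row, which is why each row can be read off directly as that instruction's vector.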
S7: establishing classification recognition model by utilizing Bi-LSTM
Based on the pre-trained vectors obtained in step S6, the training data set obtained in step S5 is input into a Bi-LSTM algorithm to train a classification model that predicts whether a target is abnormal.
S8: anomaly identification
The data to be identified is processed by the method of step S2 to obtain a behavior sequence record, which is then converted into a vector by the method of step S6 and input into the classification model for identification.
In this embodiment, the compact prediction tree is used to analyze the operation instruction sequences of user hosts, and the behavioral relationships among instruction sequences are studied in order to judge whether a user host's operation instructions are abnormal. The compact prediction tree is also computationally efficient compared with other sequence analysis algorithms.
In this embodiment, on the basis of identifying objects with abnormal host operation instructions using the compact prediction tree, the word2vec algorithm and the Bi-LSTM algorithm are combined to build a model, forming an adaptive, iterative, and stable anomaly identification model. Applying the combined word2vec and Bi-LSTM algorithms to the identification of abnormal host operation instructions fully considers the internal relationships among user operation instructions, studies the logical relationship of the instructions in the time dimension, and improves the accuracy of identifying objects with abnormal host operation instructions.
Example 2
As shown in fig. 3, corresponding to embodiment 1, this embodiment further provides a system for recognizing an exception of a host operation command, which includes
The sample data extraction module extracts the system operation instruction log data in a specified time period as original sample data;
the data processing module is used for distinguishing in a set period based on sample data, processing the sample data into a unique index formed by taking a user host account as an ID and the set period and the ID, arranging the instructions according to a time sequence, combining the instruction behaviors to form a behavior sequence record,
according to the sample data, counting the use frequency of each host operating instruction;
the abnormal instruction screening module is used for carrying out ascending arrangement on the operation instruction frequency, and screening out the operation instructions smaller than a set threshold value from the sorted frequency number sequence by utilizing the characteristic of quantile to obtain a target operation instruction sequence;
the compact prediction tree training module is used for converting the behavior sequence record input into an array and performing model training by using a compact prediction tree; and obtaining the target compact prediction tree. The compact prediction tree prediction steps are as follows:
the first step is as follows: finding a sequence similar to the target operation instruction sequence, and searching through the following steps: finding a unique item of the target operation instruction sequence; searching a sequence ID set with a specific unique item; then, the intersection of all the unique item sets is taken;
the second step is that: finding the subsequent sequence of each sequence similar to the target operation instruction sequence specifically comprises: for each similar sequence, the subsequent sequence is defined as the longest subsequence of the target operation instruction sequence minus the items present in the target operation instruction sequence after the last item of the target operation instruction sequence in the similar sequence occurs;
the third step: adding elements in the subsequent sequence and the scores thereof into a score dictionary, performing descending sorting according to the scores according to the score dictionary obtained by the compact prediction tree, selecting a prediction operation instruction with a high score, and if the actual operation instruction is not in the prediction instruction, taking the account as an abnormal operation behavior; and (3) carrying out abnormality judgment on the operation instruction behaviors of all the user accounts, wherein 1 represents abnormal mark, and 0 represents normal mark, so as to form a labeled training data set.
The compact prediction tree prediction module selects, according to the screened target operation instruction sequence, the user accounts whose records contain these instructions, together with the corresponding behavior sequence records;
based on the trained target compact prediction tree, it predicts a plurality of instructions that may appear, using the target operation instruction sequence obtained by the uncommon-instruction screening; if the predicted instruction set does not contain the uncommon instruction actually present in the data, the user's operation instruction behavior is judged to be abnormal, finally yielding a labeled training data set;
the training operation instruction vector module takes the obtained host operation instruction sequences as input and performs pre-training with the word2vec algorithm to form pre-training vectors; the pre-training steps using the word2vec algorithm are as follows:
the first step: treat the operation instruction sequence as a text, in which each operation instruction corresponds to a word; generate a vocabulary for the input text by counting the frequency of each word, sorting the words from high to low frequency, and taking the V most frequent words to form the vocabulary; each word has a one-hot vector of dimension V: if the word appears in the vocabulary, the position in the vector corresponding to the word's position in the vocabulary is 1 and all other positions are 0; if the word does not appear in the vocabulary, the vector is all 0;
the second step: generate a one-hot vector for each word of the input text, preserving the original position of each word;
the third step: determine the dimension N of the word vectors;
the fourth step: determine the window size and the batch size of the bag-of-words model, adopt softmax, and iteratively train the neural network for a certain number of iterations to obtain the parameter matrix from the input layer to the hidden layer; the transpose of each row of this matrix is the word vector of the corresponding word, that is, the vector of the corresponding instruction.
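As a non-limiting illustration, the vocabulary construction, the one-hot encoding, and the role of the input-to-hidden parameter matrix described in the steps above can be sketched in Python. The training itself (window size, batch size, softmax) is omitted: the matrix `w_in` below is randomly initialized and merely stands in for the trained matrix. All names (`build_vocab`, `one_hot`, `embed`, `w_in`), the sample sequences, and the alphabetical tie-breaking are assumptions made for illustration.

```python
from collections import Counter
import random

def build_vocab(sequences, v):
    """Keep the v most frequent instructions, ordered from high to low frequency."""
    counts = Counter(cmd for seq in sequences for cmd in seq)
    # Ties are broken alphabetically so the result is deterministic (an assumption).
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {cmd: i for i, cmd in enumerate(ordered[:v])}

def one_hot(cmd, vocab):
    """V-dimensional one-hot vector; all zeros for an out-of-vocabulary instruction."""
    vec = [0] * len(vocab)
    if cmd in vocab:
        vec[vocab[cmd]] = 1
    return vec

def embed(cmd, vocab, w_in):
    """Multiply the one-hot vector by the V x N input-to-hidden matrix."""
    x = one_hot(cmd, vocab)
    n = len(w_in[0])
    return [sum(x[i] * w_in[i][j] for i in range(len(x))) for j in range(n)]

# Hypothetical operation instruction sequences.
sequences = [["ls", "cd", "ls", "cat"], ["ls", "cd", "vim"], ["ls", "nc"]]
vocab = build_vocab(sequences, v=3)

N = 4  # word-vector dimension
rng = random.Random(0)
# Untrained stand-in for the parameter matrix obtained after training.
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(len(vocab))]

cd_vector = embed("cd", vocab, w_in)
```

Because the input is one-hot, the product simply selects the row of `w_in` at the instruction's vocabulary position, which is why each row of the trained matrix can be read off directly as the vector of the corresponding instruction.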
The classification recognition model establishing module, based on the obtained pre-training vectors, inputs the obtained training data set into a Bi-LSTM algorithm and trains a classification model for predicting whether a target is abnormal;
and the prediction module is used for processing the data to be recognized with the uncommon-instruction screening module to obtain behavior sequence records, converting the behavior sequence records into vectors through the training operation instruction vector module, and inputting the vectors into the classification model for recognition.
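As a non-limiting illustration, the frequency counting and quantile-based screening performed by the screening module described above can be sketched in Python. The record layout `(period, account, instruction)`, the function name `screen_uncommon`, the quantile level `q`, and the rule used to pick the quantile value (the element at index `int(q * (n - 1))` of the ascending frequency sequence) are assumptions made for illustration; the source does not fix a particular quantile or interpolation rule.

```python
from collections import Counter

def screen_uncommon(log_records, q=0.25):
    """Return the instructions whose frequency is below the q-quantile threshold."""
    freq = Counter(cmd for _, _, cmd in log_records)
    ordered = sorted(freq.values())  # frequencies in ascending order
    # Quantile taken as a lower order statistic (an illustrative assumption).
    threshold = ordered[int(q * (len(ordered) - 1))]
    uncommon = sorted(cmd for cmd, n in freq.items() if n < threshold)
    return uncommon, threshold

# Hypothetical log records: (period, account, instruction).
records = (
    [("d1", "alice", "ls")] * 6
    + [("d1", "alice", "cd")] * 5
    + [("d1", "bob", "cat")] * 4
    + [("d1", "bob", "vim")] * 3
    + [("d1", "bob", "nc")]
    + [("d1", "alice", "chmod")]
)
uncommon, threshold = screen_uncommon(records, q=0.5)
```

In this sample the ascending frequency sequence is 1, 1, 3, 4, 5, 6, so the threshold at `q=0.5` is 3 and the two instructions used only once are screened out as the target operation instructions.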
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for identifying abnormal host operation instructions, characterized in that the method comprises the following steps:
S1: sample data extraction
extracting system operation instruction log data within a specified time period as original sample data;
S2: data processing
based on the sample data extracted in S1, partitioning the data by a set period; taking the user's host account as an ID, so that the set period and the ID together form a unique index; arranging the instructions in chronological order and combining the instruction behaviors to form behavior sequence records;
counting, from the sample data obtained in S1, the usage frequency of each host operation instruction;
S3: screening of uncommon instructions
sorting the operation instruction frequencies obtained in step S2 in ascending order, and screening out, from the sorted frequency sequence, the operation instructions whose frequency is below a set threshold, to obtain a target operation instruction sequence;
S4: compact prediction tree training
converting the behavior sequence records obtained in S2 into an array as input, and performing model training with a compact prediction tree to obtain a target compact prediction tree;
S5: compact prediction tree prediction
selecting, according to the target operation instruction sequence screened out in S3, the user accounts whose records contain these instructions, together with the corresponding behavior sequence records;
based on the target compact prediction tree trained in S4, predicting a plurality of instructions that may appear, using the target operation instruction sequence obtained by the uncommon-instruction screening of S3; if the predicted instruction set does not contain the uncommon instruction actually present in the data, judging the user's operation instruction behavior to be abnormal, and finally obtaining a labeled training data set;
S6: training operation instruction vectors using word2vec
taking the host operation instruction sequences obtained in step S2 as input, and performing pre-training with the word2vec algorithm to form pre-training vectors;
S7: establishing a classification recognition model using Bi-LSTM
based on the pre-training vectors obtained in S6, inputting the training data set obtained in S5 into a Bi-LSTM algorithm, and training a classification model for predicting whether a target is abnormal;
S8: predicting with the classification model.
2. The method for identifying abnormal host operation instructions according to claim 1, characterized in that: in step S3, quantiles are used to screen out, from the sorted frequency sequence, the operation instructions whose frequency is below the set threshold.
3. The method for identifying abnormal host operation instructions according to claim 1, characterized in that: in step S5, the compact prediction tree prediction steps are as follows:
the first step: find the sequences similar to the target operation instruction sequence, as follows: find the unique items of the target operation instruction sequence; for each unique item, look up the set of IDs of the sequences containing that item; then take the intersection of these sets over all unique items;
the second step: find the subsequent sequence of each sequence similar to the target operation instruction sequence, specifically: for each similar sequence, the subsequent sequence is defined as the longest subsequence following the last occurrence, in the similar sequence, of the last item of the target operation instruction sequence, with the items that are present in the target operation instruction sequence removed;
the third step: add the elements of the subsequent sequences and their scores to a score dictionary; sort the score dictionary obtained from the compact prediction tree in descending order of score and select the highest-scoring predicted operation instructions; if the actual operation instruction is not among the predicted instructions, the account is regarded as exhibiting abnormal operation behavior; abnormality judgment is carried out on the operation instruction behaviors of all user accounts, with 1 marking abnormal and 0 marking normal, so as to form a labeled training data set.
4. The method according to any one of claims 1 to 3, characterized in that: the pre-training steps using the word2vec algorithm in step S6 are as follows:
the first step: treat the operation instruction sequence as a text, in which each operation instruction corresponds to a word; generate a vocabulary for the input text by counting the frequency of each word, sorting the words from high to low frequency, and taking the V most frequent words to form the vocabulary; each word has a one-hot vector of dimension V: if the word appears in the vocabulary, the position in the vector corresponding to the word's position in the vocabulary is 1 and all other positions are 0; if the word does not appear in the vocabulary, the vector is all 0;
the second step: generate a one-hot vector for each word of the input text, preserving the original position of each word;
the third step: determine the dimension N of the word vectors;
the fourth step: determine the window size and the batch size of the bag-of-words model, adopt softmax, and iteratively train the neural network for a certain number of iterations to obtain the parameter matrix from the input layer to the hidden layer; the transpose of each row of this matrix is the word vector of the corresponding word, that is, the vector of the corresponding instruction.
5. The method according to any one of claims 1 to 3, characterized in that: the identification process of step S8 is specifically: processing the data to be identified by the method of step S2 to obtain behavior sequence records, converting the behavior sequence records into vectors by the method of step S6, and inputting the vectors into the classification model for identification.
6. A system for identifying abnormal host operation instructions, characterized by comprising:
a sample data extraction module, which extracts system operation instruction log data within a specified time period as original sample data;
a data processing module, which, based on the sample data, partitions the data by a set period; takes the user's host account as an ID, so that the set period and the ID together form a unique index; arranges the instructions in chronological order and combines the instruction behaviors to form behavior sequence records;
and counts, from the sample data, the usage frequency of each host operation instruction;
an uncommon-instruction screening module, which sorts the operation instruction frequencies in ascending order and screens out, from the sorted frequency sequence, the operation instructions whose frequency is below a set threshold, to obtain a target operation instruction sequence;
a compact prediction tree training module, which converts the behavior sequence records into an array as input and performs model training with a compact prediction tree to obtain a target compact prediction tree;
a compact prediction tree prediction module, which selects, according to the screened target operation instruction sequence, the user accounts whose records contain these instructions, together with the corresponding behavior sequence records;
and which, based on the trained target compact prediction tree, predicts a plurality of instructions that may appear, using the target operation instruction sequence obtained by the uncommon-instruction screening; if the predicted instruction set does not contain the uncommon instruction actually present in the data, the user's operation instruction behavior is judged to be abnormal, finally yielding a labeled training data set;
a training operation instruction vector module, which takes the obtained host operation instruction sequences as input and performs pre-training with the word2vec algorithm to form pre-training vectors;
a classification recognition model establishing module, which, based on the obtained pre-training vectors, inputs the obtained training data set into a Bi-LSTM algorithm and trains a classification model for predicting whether a target is abnormal;
and a prediction module, which performs prediction with the classification model.
7. The system for identifying abnormal host operation instructions according to claim 6, characterized in that: the uncommon-instruction screening module uses quantiles to screen out, from the sorted frequency sequence, the operation instructions whose frequency is below the set threshold.
8. The system for identifying abnormal host operation instructions according to claim 5, characterized in that: in the compact prediction tree prediction module, the compact prediction tree prediction steps are as follows:
the first step: find the sequences similar to the target operation instruction sequence, as follows: find the unique items of the target operation instruction sequence; for each unique item, look up the set of IDs of the sequences containing that item; then take the intersection of these sets over all unique items;
the second step: find the subsequent sequence of each sequence similar to the target operation instruction sequence, specifically: for each similar sequence, the subsequent sequence is defined as the longest subsequence following the last occurrence, in the similar sequence, of the last item of the target operation instruction sequence, with the items that are present in the target operation instruction sequence removed;
the third step: add the elements of the subsequent sequences and their scores to a score dictionary; sort the score dictionary obtained from the compact prediction tree in descending order of score and select the highest-scoring predicted operation instructions; if the actual operation instruction is not among the predicted instructions, the account is regarded as exhibiting abnormal operation behavior; abnormality judgment is carried out on the operation instruction behaviors of all user accounts, with 1 marking abnormal and 0 marking normal, so as to form a labeled training data set.
9. The system according to any one of claims 5 to 8, characterized in that: the pre-training steps using the word2vec algorithm in the training operation instruction vector module are as follows:
the first step: treat the operation instruction sequence as a text, in which each operation instruction corresponds to a word; generate a vocabulary for the input text by counting the frequency of each word, sorting the words from high to low frequency, and taking the V most frequent words to form the vocabulary; each word has a one-hot vector of dimension V: if the word appears in the vocabulary, the position in the vector corresponding to the word's position in the vocabulary is 1 and all other positions are 0; if the word does not appear in the vocabulary, the vector is all 0;
the second step: generate a one-hot vector for each word of the input text, preserving the original position of each word;
the third step: determine the dimension N of the word vectors;
the fourth step: determine the window size and the batch size of the bag-of-words model, adopt softmax, and iteratively train the neural network for a certain number of iterations to obtain the parameter matrix from the input layer to the hidden layer; the transpose of each row of this matrix is the word vector of the corresponding word, that is, the vector of the corresponding instruction.
10. The system according to any one of claims 5 to 8, characterized in that: the identification process of the prediction module is specifically: the data to be identified are processed by the uncommon-instruction screening module to obtain behavior sequence records, which are then converted into vectors by the training operation instruction vector module and input into the classification model for identification.
CN201911406512.7A 2019-12-31 2019-12-31 Method and system for identifying abnormity of host operation instruction Active CN111143840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911406512.7A CN111143840B (en) 2019-12-31 2019-12-31 Method and system for identifying abnormity of host operation instruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911406512.7A CN111143840B (en) 2019-12-31 2019-12-31 Method and system for identifying abnormity of host operation instruction

Publications (2)

Publication Number Publication Date
CN111143840A (en) 2020-05-12
CN111143840B (en) 2022-01-25

Family

ID=70522692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911406512.7A Active CN111143840B (en) 2019-12-31 2019-12-31 Method and system for identifying abnormity of host operation instruction

Country Status (1)

Country Link
CN (1) CN111143840B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111815425A (en) * 2020-07-27 2020-10-23 上海观安信息技术股份有限公司 User credit risk grade judgment method and system based on entity embedding
CN113360305A (en) * 2021-05-13 2021-09-07 杭州明实科技有限公司 Computer equipment and abnormal operation detection method, device and storage medium thereof
CN115826627A (en) * 2023-02-21 2023-03-21 白杨时代(北京)科技有限公司 Method, system, equipment and storage medium for determining formation instruction
CN117540153A (en) * 2024-01-09 2024-02-09 南昌工程学院 Tunnel monitoring data prediction method and system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106663037A (en) * 2014-06-30 2017-05-10 亚马逊科技公司 Feature processing tradeoff management
CN106657410A (en) * 2017-02-28 2017-05-10 国家电网公司 Detection method for abnormal behaviors based on user access sequence
CN108399201A (en) * 2018-01-30 2018-08-14 武汉大学 A kind of Web user access path prediction technique based on Recognition with Recurrent Neural Network
CN108417272A (en) * 2018-02-08 2018-08-17 合肥工业大学 Similar case with temporal constraint recommends method and device
CN108664375A (en) * 2017-03-28 2018-10-16 瀚思安信(北京)软件技术有限公司 Method for the abnormal behaviour for detecting computer network system user
CN109615312A (en) * 2018-10-23 2019-04-12 平安科技(深圳)有限公司 Business abnormal investigation method, apparatus, electronic equipment and storage medium in execution
CN110334508A (en) * 2019-07-03 2019-10-15 广东省信息安全测评中心 A kind of host sequence intrusion detection method
US20190319868A1 (en) * 2019-06-25 2019-10-17 Intel Corporation Link performance prediction technologies
CN110456765A (en) * 2019-07-29 2019-11-15 北京威努特技术有限公司 Temporal model generation method, device and its detection method of industry control instruction, device

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106663037A (en) * 2014-06-30 2017-05-10 亚马逊科技公司 Feature processing tradeoff management
CN106657410A (en) * 2017-02-28 2017-05-10 国家电网公司 Detection method for abnormal behaviors based on user access sequence
CN108664375A (en) * 2017-03-28 2018-10-16 瀚思安信(北京)软件技术有限公司 Method for the abnormal behaviour for detecting computer network system user
CN108399201A (en) * 2018-01-30 2018-08-14 武汉大学 A kind of Web user access path prediction technique based on Recognition with Recurrent Neural Network
CN108417272A (en) * 2018-02-08 2018-08-17 合肥工业大学 Similar case with temporal constraint recommends method and device
CN109615312A (en) * 2018-10-23 2019-04-12 平安科技(深圳)有限公司 Business abnormal investigation method, apparatus, electronic equipment and storage medium in execution
US20190319868A1 (en) * 2019-06-25 2019-10-17 Intel Corporation Link performance prediction technologies
CN110334508A (en) * 2019-07-03 2019-10-15 广东省信息安全测评中心 A kind of host sequence intrusion detection method
CN110456765A (en) * 2019-07-29 2019-11-15 北京威努特技术有限公司 Temporal model generation method, device and its detection method of industry control instruction, device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐明 等: ""自然语言处理与图分析相融合的网络舆论安全分析"", 《信息安全与通信保密》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111815425A (en) * 2020-07-27 2020-10-23 上海观安信息技术股份有限公司 User credit risk grade judgment method and system based on entity embedding
CN113360305A (en) * 2021-05-13 2021-09-07 杭州明实科技有限公司 Computer equipment and abnormal operation detection method, device and storage medium thereof
CN115826627A (en) * 2023-02-21 2023-03-21 白杨时代(北京)科技有限公司 Method, system, equipment and storage medium for determining formation instruction
CN117540153A (en) * 2024-01-09 2024-02-09 南昌工程学院 Tunnel monitoring data prediction method and system
CN117540153B (en) * 2024-01-09 2024-03-29 南昌工程学院 Tunnel monitoring data prediction method and system

Also Published As

Publication number Publication date
CN111143840B (en) 2022-01-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant