CN114003908A - API labeling method and system for Windows PE virus sample - Google Patents
- Publication number
- CN114003908A CN202111312492.4A
- Authority
- CN
- China
- Prior art keywords
- api
- virus sample
- virus
- functional component
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/561—Virus type analysis
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Health & Medical Sciences (AREA)
- Virology (AREA)
- Computer Hardware Design (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention relates to an API labeling method and system for Windows PE virus samples, belongs to the field of cyberspace security, and addresses the low precision and low efficiency of API labeling of Windows PE virus samples in the prior art. The method comprises the following steps: dynamically analyzing each virus sample in an acquired virus sample set to obtain a corresponding virus sample API, wherein the virus sample API comprises a dynamic API calling sequence; initially labeling the virus sample API according to API information defined by the Windows operating system and the dynamic API calling sequence to obtain an initial labeling result; and automatically perceiving and labeling the virus sample API, according to the initial labeling result and the dynamic API calling sequence of the virus sample API, by using a trained hidden Markov model. The method can label virus sample APIs rapidly, efficiently and accurately.
Description
Technical Field
The invention relates to the technical field of cyberspace security, and in particular to an API labeling method and system for Windows PE virus samples.
Background
Windows PE (Portable Executable) virus samples are currently one of the most serious security threats in cyberspace. Such a virus starts without being noticed by the user and damages the security and privacy of software and data. The growing number of viruses has promoted the application of machine learning to virus detection. The key problem for machine-learning-based virus detection is obtaining sample labels: because of the particular nature of viruses, identifying their labels is very difficult and requires extensive expertise. Manual labeling consumes manpower and material resources. In particular, as new virus programs and variants increase rapidly, the lag of manual analysis seriously hinders emergency response, so that virus programs are difficult to control quickly and effectively, and purely manual analysis is no longer efficient or feasible. Automating the analysis process is therefore imperative, and studying virus sample behavior in depth with automated analysis techniques has important practical significance.
In most existing studies, the granularity of analysis stays at the level of the whole virus sample. That approach is effective for early viruses, Trojan horses and other viruses with relatively simple structures and functions. The code size of newer virus samples has increased markedly, and their code structure and functions are increasingly complex, so analyzing the complete function of a complex virus sample is almost impossible. Finer-grained analysis of virus samples is therefore needed. API stands for Application Programming Interface. Essentially all functions in the Windows operating system are implemented by calling APIs, so the functions of a virus program can be analyzed by analyzing the system APIs it calls. When an attacker builds a virus sample, the attacker calls APIs to achieve one or more attack purposes, so the finer-grained API level needs to be studied when analyzing binary virus samples.
In addition, effective labeling of PE virus samples is the basis for defending against the threat they pose. In the prior art, labeling methods for virus sample APIs mostly remain at the stage of simple semantic feature analysis. First, the TF-IDF (Term Frequency-Inverse Document Frequency) feature extraction method for a single API classifies functions using features such as the word frequency of a single API; it is mainly used for information retrieval and data mining. Second, N-gram-based API feature extraction comes from the field of NLP (Natural Language Processing) and works well for extracting semantic information from text. Third, the API sequence feature vector method based on GloVe (Global Vectors for Word Representation) performs well on strongly associated vocabulary.
The prior art has at least the following defects. First, the TF-IDF feature extraction method for a single API labels poorly when a virus sample uses few types of API. Second, in the N-gram-based API feature extraction method, varying the value of N makes the state space of the API sequence huge and the labeling efficiency low. Third, in the field of virus samples, API calling sequences are long and highly variable, while the GloVe method can only generate word vectors statically from the result of corpus training, so the GloVe-based API sequence feature vector method cannot analyze virus sample APIs well. Fourth, none of the existing methods can effectively analyze the APIs of complex PE virus samples, so more complex PE virus samples cannot be detected successfully.
Disclosure of Invention
In view of the foregoing analysis, embodiments of the present invention provide a method and a system for API annotation of Windows PE virus samples, so as to solve the problems of low accuracy and low efficiency of API annotation of Windows PE virus samples in the prior art.
In one aspect, the invention provides an API labeling method for a Windows PE virus sample, comprising the following steps:
dynamically analyzing each virus sample in the acquired virus sample set to obtain a corresponding virus sample API, wherein the virus sample API comprises a dynamic API calling sequence;
according to API information defined by the Windows operating system and the dynamic API calling sequence, initially marking the virus sample API to obtain an initial marking result;
and carrying out automatic perception labeling on the virus sample API according to the initial labeling result of the virus sample API and the dynamic API calling sequence by utilizing the trained hidden Markov model.
Further, the API information includes API functional feature descriptions; specifically, the virus sample API is initially labeled in the following way to obtain an initial labeling result:
for one API in the dynamic API calling sequence, determining, based on the API functional feature description defined by the Windows operating system, the functional component category to which the API belongs and the probability of belonging to that category, thereby obtaining the first functional component probability distribution of the API;
matching the one API in the dynamic API calling sequence against the API type matching sets of the Windows operating system to determine the type of the API, and from it the functional component category to which the API belongs and the probability of belonging to that category, thereby obtaining the second functional component probability distribution of the API;
if the maximum probability in the first functional component probability distribution is greater than the maximum probability in the second functional component probability distribution, taking the functional component category corresponding to the maximum probability in the first functional component probability distribution as the initial labeling result of the one API; if the maximum probability in the first functional component probability distribution is less than the maximum probability in the second functional component probability distribution, taking the functional component category corresponding to the maximum probability in the second functional component probability distribution as the initial labeling result of the one API; and if the two maximum probabilities are equal, selecting either the functional component category corresponding to the maximum probability in the first functional component probability distribution or the functional component category corresponding to the maximum probability in the second functional component probability distribution as the initial labeling result of the one API;
and traversing each API in the dynamic API calling sequence to obtain an initial labeling result of each API, and further obtaining an initial labeling result of the virus sample API.
Further, the hidden markov model is obtained by training specifically as follows:
normalizing the length of a dynamic API call sequence corresponding to each virus sample API, and selecting the dynamic API call sequence with a preset length as the input of a hidden Markov model for each virus sample API; the preset length represents the number of APIs included in the dynamic API calling sequence;
and acquiring a state transition probability matrix, an observation probability matrix and an initial state distribution vector in the hidden Markov model based on the selected input, thereby acquiring the trained hidden Markov model.
Further, hidden state types in the hidden markov model correspond to functional component categories of the virus sample API, and the hidden state types include a management and control functional component, a detection functional component, an infection functional component, and a destruction functional component; the observed state class in the hidden Markov model corresponds to the actual name of the virus sample API.
Further, the state transition probability matrix in the hidden markov model is obtained specifically by:
numbering each hidden state type;
defining a first two-dimensional matrix corresponding to a virus sample API, wherein a first dimension and a second dimension of the first two-dimensional matrix correspond to hidden state type numbers, and elements in the first two-dimensional matrix represent the probability that a hidden state corresponding to the first dimension is transferred to a hidden state corresponding to the second dimension;
marking each API in a dynamic API calling sequence with a preset length corresponding to the virus sample API according to the initial marking result of the virus sample API, and determining a functional component type set to which the virus sample API belongs according to the API marking result;
setting a second two-dimensional matrix corresponding to the virus sample API, wherein the first dimension and the second dimension of the second two-dimensional matrix both correspond to hidden state type numbers, and elements in the second two-dimensional matrix represent the times of transferring the hidden state corresponding to the first dimension to the hidden state corresponding to the second dimension; determining element values in the second two-dimensional matrix according to the API labeling result;
calculating and obtaining a first two-dimensional matrix corresponding to the virus sample API based on the determined element values in the second two-dimensional matrix;
and traversing each virus sample API to determine a first two-dimensional matrix corresponding to each virus sample API, and calculating to obtain a state transition probability matrix in the hidden Markov model based on the first two-dimensional matrices.
Further, calculating and obtaining a first two-dimensional matrix corresponding to the virus sample API based on the determined element values in the second two-dimensional matrix, including:
$$\mathrm{M\_sample\_n}_{ij} = \frac{\mathrm{Lable\_sample\_n}_{ij}}{\sum_{k=1}^{4} \mathrm{Lable\_sample\_n}_{ik}}$$
wherein $\mathrm{M\_sample\_n}_{ij}$ represents the element value corresponding to first dimension i and second dimension j in the first two-dimensional matrix, $\mathrm{Lable\_sample\_n}_{ij}$ represents the element value corresponding to first dimension i and second dimension j in the second two-dimensional matrix, and $\mathrm{Lable\_sample\_n}_{ik}$ represents the element value corresponding to first dimension i and second dimension k in the second two-dimensional matrix, where i, j, k ∈ {1, 2, 3, 4}.
Further, the obtaining a state transition probability matrix in the hidden markov model based on the plurality of first two-dimensional matrix calculations comprises:
$$M_{ij} = \frac{1}{total} \sum_{n=1}^{total} \mathrm{M\_sample\_n}_{ij}$$
wherein $M_{ij}$ represents the element value corresponding to first dimension i and second dimension j in the state transition probability matrix, and total represents the total number of virus samples.
Further, the observation probability matrix in the hidden markov model is obtained specifically by:
$$N_{tj} = U_t[j],$$
and
$$N_{tj} = V_t[j],$$
wherein $N_{tj}$ represents the probability that the virus sample API with actual name number t, among all virus samples, belongs to hidden state j, j ∈ {1, 2, 3, 4}; $U_t[j]$ represents the first functional component probability distribution of that virus sample API, and $V_t[j]$ represents its second functional component probability distribution.
Further, the initial state distribution vector in the hidden markov model is obtained specifically by:
determining the times that the API contained in a dynamic API calling sequence with a preset length corresponding to the virus sample API is marked as a hidden state j, wherein the j belongs to {1,2,3,4 };
traversing APIs contained in a dynamic API calling sequence with a preset length corresponding to each virus sample API, so as to obtain the labeling times corresponding to each hidden state;
based on the labeling times corresponding to each hidden state, calculating and obtaining an initial state distribution vector through the following formula:
$$\Pi_j = \frac{Label[j]}{\sum_{k=1}^{4} Label[k]}$$
wherein $\Pi_j$ represents the j-th element of the initial state distribution vector, Label[j] represents the number of labels corresponding to hidden state j, and Label[k] represents the number of labels corresponding to hidden state k, where k ∈ {1, 2, 3, 4}.
In another aspect, the present invention provides an API tagging system for Windows PE virus samples, comprising:
a virus sample API acquisition module, used for dynamically analyzing each virus sample in an acquired virus sample set to obtain a corresponding virus sample API, wherein the virus sample API comprises a dynamic API calling sequence;
the initial labeling module is used for carrying out initial labeling on the virus sample API according to the API information defined by the Windows operating system and the dynamic API calling sequence to obtain an initial labeling result;
and the marking module is used for carrying out automatic perception marking on the virus sample API according to the initial marking result and the dynamic API calling sequence of the virus sample API by utilizing the trained hidden Markov model.
Compared with the prior art, the invention can realize at least one of the following beneficial effects:
1. The API labeling method and system for Windows PE virus samples take into account the four common operation stages of a virus sample API and the corresponding functional components, and assign each virus sample API to a functional component, so that the characteristics of the virus sample API are analyzed more effectively and labeling based on those characteristics is efficient and accurate.
2. According to the method and the system for marking the API of the Windows PE virus sample, the API of the virus sample is initially marked according to the functional characteristic information and the dynamic API sequence information of the API of the virus sample, and the API of the virus sample is automatically sensed and marked by using a trained hidden Markov model according to the initial marking result and the dynamic API calling sequence corresponding to the API of the virus sample, so that the marking efficiency of the API of the virus sample is improved.
In the invention, the technical schemes can be combined with each other to realize more preferable combination schemes. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a flowchart of an API labeling method for a Windows PE virus sample according to an embodiment of the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
Method embodiment
The invention discloses a specific embodiment of a Windows PE virus sample API labeling method. As shown in fig. 1, the method includes:
S110: dynamically analyze each virus sample in the acquired virus sample set to obtain a corresponding virus sample API, the virus sample API comprising a dynamic API calling sequence. Specifically, the virus sample APIs obtained by analysis are APIs defined in the Windows operating system platform.
S120: initially label the virus sample API according to the API information defined by the Windows operating system and the dynamic API calling sequence to obtain an initial labeling result.
S130: automatically perceive and label the virus sample API, according to the initial labeling result and the dynamic API calling sequence of the virus sample API, by using the trained hidden Markov model.
Preferably, in step S110, the virus samples in the virus sample set are dynamically analyzed with the existing sandbox tool Cuckoo (a virtual execution environment) to obtain the dynamic API call sequence corresponding to each virus sample. Each virus sample is run in the Cuckoo virtual execution environment, the APIs called by the sample are recorded in order, and an API sequence with dynamic characteristics is constructed. The resulting dynamic API call sequence is then extracted and filtered, that is, redundant APIs (adjacent, repeated APIs in the sequence) and noise (a segment of consecutive APIs in which one or more APIs repeat, probably added by the virus author) are removed, to obtain the processed dynamic API call sequence.
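The redundancy filter described above (dropping adjacent, repeated APIs) can be sketched in Python as follows. This is only an illustration: the function name `collapse_adjacent_repeats` and the sample API names are assumptions, and the sketch does not attempt the separate noise-removal step, whose exact rule is not spelled out in the text.

```python
from itertools import groupby


def collapse_adjacent_repeats(api_calls):
    """Keep one API name from each run of identical adjacent names."""
    # groupby yields one (key, run) pair per run of equal consecutive items.
    return [name for name, _run in groupby(api_calls)]


# Hypothetical recorded call sequence, for illustration only.
raw_sequence = ["NtOpenFile", "NtReadFile", "NtReadFile", "NtReadFile",
                "NtClose", "NtClose"]
cleaned = collapse_adjacent_repeats(raw_sequence)
print(cleaned)
```

Non-adjacent repeats are kept, since the order of calls is what the later sequence model relies on.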
Preferably, in step S120, the initial labeling is performed on the virus sample API specifically in the following manner to obtain an initial labeling result:
S1201: for one API in the dynamic API call sequence (also called the first API), the API is manually checked against the Windows operating system API manual, based on the API functional feature description, parameter settings and the like defined by the Windows operating system, to determine the functional component category to which the first API belongs; the probability that the first API belongs to that functional component category is determined from experience, thereby obtaining the first functional component probability distribution of the first API. The first functional component probability distribution has the following form:
$U = [A: w_A, B: w_B, C: w_C, D: w_D]$
wherein U represents the first functional component probability distribution; A, B, C and D represent the four functional components (management and control, detection, infection and destruction, respectively); $w_A$, $w_B$, $w_C$ and $w_D$ represent the probability that the first API belongs to the corresponding functional component; and the probability parameters satisfy $w_A + w_B + w_C + w_D = 1$.
S1202, matching one API (first API) in the dynamic API calling sequence with an API type matching set in a Windows operating system to determine the type of the API, and further determining the type of the functional component to which the API belongs and the probability of the functional component to which the API belongs, so that the probability distribution of a second functional component of the API is obtained. Specifically, the method comprises the following steps:
Ten types of APIs are defined for the Windows operating system platform: memory, process, kernel, device, file, system, text, registry, window, and network. A matching set is defined for each type of API, denoted $M_{memory}$, $M_{process}$, $M_{kernel}$, $M_{device}$, $M_{file}$, $M_{system}$, $M_{text}$, $M_{registry}$, $M_{window}$ and $M_{network}$; the contents of each matching set are shown in Table 1. The memory and process types are defined as belonging to the management and control component; kernel and device to the detection component; file, system and text to the infection component; and registry, window and network to the destruction component. One API (the first API) in the dynamic API call sequence is matched against each item in the 10 matching sets. If the API contains an item, the first API is related to that type and therefore to the functional component corresponding to that type. The matching result is recorded in the following form:
$[A: N_A, B: N_B, C: N_C, D: N_D]$
wherein A, B, C and D represent the four functional components (management and control, detection, infection and destruction, respectively), and $N_A$, $N_B$, $N_C$ and $N_D$ represent the total number of substrings of the first API that match the corresponding functional component; that is, the functional component to which each matched string in the first API belongs is determined, giving the total number of strings in the first API corresponding to each functional component. A matching record is then constructed for the first API, namely the second functional component probability distribution:
$V = [A: v_A, B: v_B, C: v_C, D: v_D]$
wherein $v_x$ (x = A, B, C, D) represents the ratio of the number of character strings matched by functional component x to the total number of matched character strings. It is calculated as follows:
$$v_x = \frac{N_x}{N_A + N_B + N_C + N_D}$$
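A minimal Python sketch of this substring-matching step is given below. The keyword lists in `MATCH_SETS` are placeholders invented for illustration (the real contents are those of Table 1), and returning an all-zero distribution when nothing matches is likewise an assumption, since the text does not say how that case is handled.

```python
# Hypothetical matching sets; the actual contents are those of Table 1.
MATCH_SETS = {
    "memory": ["memory", "heap", "virtual"],
    "process": ["process", "thread"],
    "kernel": ["kernel"],
    "device": ["device"],
    "file": ["file"],
    "system": ["system"],
    "text": ["text"],
    "registry": ["reg"],
    "window": ["window"],
    "network": ["socket", "internet"],
}

# Mapping of the 10 API types to the four functional components, as stated above:
# A = management and control, B = detection, C = infection, D = destruction.
TYPE_TO_COMPONENT = {
    "memory": "A", "process": "A",
    "kernel": "B", "device": "B",
    "file": "C", "system": "C", "text": "C",
    "registry": "D", "window": "D", "network": "D",
}


def second_distribution(api_name):
    """Return V = {component: v_x} for one API name via substring matching."""
    counts = {"A": 0, "B": 0, "C": 0, "D": 0}
    lowered = api_name.lower()
    for api_type, keywords in MATCH_SETS.items():
        for keyword in keywords:
            if keyword in lowered:
                counts[TYPE_TO_COMPONENT[api_type]] += 1
    total = sum(counts.values())
    if total == 0:
        # Assumption: no match yields an all-zero distribution.
        return {component: 0.0 for component in counts}
    return {component: n / total for component, n in counts.items()}


print(second_distribution("WriteProcessMemory"))
```

Matching is done case-insensitively here as a convenience; the source does not state whether case matters.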
S1203: if the maximum probability in the first functional component probability distribution is greater than the maximum probability in the second functional component probability distribution, the functional component category corresponding to the maximum probability in the first functional component probability distribution is taken as the initial labeling result of the first API; if the maximum probability in the first functional component probability distribution is smaller than the maximum probability in the second functional component probability distribution, the functional component category corresponding to the maximum probability in the second functional component probability distribution is taken as the initial labeling result of the first API; and if the two maximum probabilities are equal, either of the two corresponding functional component categories is selected as the initial labeling result of the first API.
S1204, traversing each API in the dynamic API call sequence to obtain an initial labeling result of each API, and further obtaining an initial labeling result of the virus sample API.
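The selection rule of S1203 and the traversal of S1204 can be sketched in Python as follows. The function names and the sample distributions are illustrative assumptions; on a tie the sketch picks the category from the first distribution, which is one of the two choices the text allows.

```python
def initial_label(u, v):
    """Pick the component whose maximum probability is larger across U and V."""
    best_u = max(u, key=u.get)
    best_v = max(v, key=v.get)
    # Tie-break assumption: prefer U when the two maxima are equal.
    if u[best_u] >= v[best_v]:
        return best_u
    return best_v


def label_sequence(distributions):
    """Apply initial_label to each (U, V) pair of a dynamic API call sequence."""
    return [initial_label(u, v) for u, v in distributions]


# Hypothetical distributions for two APIs, for illustration only.
pairs = [
    ({"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}, {"A": 0.5, "B": 0.0, "C": 0.5, "D": 0.0}),
    ({"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}, {"A": 0.0, "B": 0.0, "C": 1.0, "D": 0.0}),
]
print(label_sequence(pairs))
```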
TABLE 1
First, the relevant parameters of the hidden Markov model are explained:
Γ represents the hidden state types;
Λ represents the observed state types;
M represents the state transition probability matrix (two-dimensional);
N represents the observation probability matrix (two-dimensional);
Π represents the initial state distribution vector.
The hidden state types in the hidden Markov model correspond to the functional component categories of the virus sample API and comprise a management and control functional component, a detection functional component, an infection functional component and a destruction functional component; the observed state types in the hidden Markov model correspond to the actual names of the virus sample APIs, so the dynamic API calling sequence can be labeled automatically by analogy with part-of-speech tagging.
Specifically, Γ = [A, B, C, D], the set of symbols of the four hidden states (functional component types);
Λ = [1, 2, 3, 4, …, Num], i.e., the set of numbers of the Num APIs with different actual names;
$M_{ij}$ represents the probability of hidden state i transitioning to hidden state j. "Transition" means: if the m-th element of Lable_sample_n is hidden state i and the (m+1)-th element is hidden state j, then hidden state i transitions to hidden state j.
Preferably, the hidden markov model is obtained by training in particular:
Step 1: normalize the length of the dynamic API call sequence corresponding to each virus sample API, and for each virus sample API select the dynamic API call sequence with a preset length as the input of the hidden Markov model. The preset length represents the number of APIs included in the dynamic API calling sequence; illustratively, the preset length is 100.
Step 2: obtain the state transition probability matrix, the observation probability matrix and the initial state distribution vector of the hidden Markov model based on the selected input, thereby obtaining the trained hidden Markov model.
Preferably, step 2 specifically comprises:
Step 201: the state transition probability matrix in the hidden Markov model is obtained as follows:
Each hidden state type is numbered, illustratively: [A: 1, B: 2, C: 3, D: 4].
A first two-dimensional matrix M_sample_n[4][4] corresponding to the virus sample API is defined. The first dimension and the second dimension of the first two-dimensional matrix both correspond to the hidden state type numbers, and the element $\mathrm{M\_sample\_n}_{ij}$ represents the probability that the hidden state corresponding to the first dimension transitions to the hidden state corresponding to the second dimension; that is, $\mathrm{M\_sample\_n}_{ij}$ represents the probability of hidden state i transitioning to hidden state j in the n-th (n = 1, 2, 3, ..., total) virus sample API (i.e., dynamic API call sequence), where total represents the total number of virus samples.
Labeling each API in a dynamic API calling sequence with a preset length corresponding to the virus sample API according to an initial labeling result of the virus sample API, and determining a functional component type set Lable _ sample _ n [100] to which the virus sample API belongs according to the API labeling result, wherein Lable _ sample _ n [ f ] is a functional component to which the f-th API belongs.
A second two-dimensional matrix Label_sample_n[4][4] corresponding to the virus sample API is set. The first dimension and the second dimension of the second two-dimensional matrix both correspond to the hidden state type numbers, and the elements of the second two-dimensional matrix represent the number of times the hidden state corresponding to the first dimension transitions to the hidden state corresponding to the second dimension. The element values of the second two-dimensional matrix are determined from the API labeling result. Specifically, $\mathrm{Label\_sample\_n}_{ij}$ represents the number of transitions from hidden state i to hidden state j in the n-th virus sample API. The labels of the n-th virus sample are traversed, the transitions are counted, and the counts are recorded in Label_sample_n.
Calculating and obtaining a first two-dimensional matrix corresponding to the virus sample API based on the determined element values in the second two-dimensional matrix:
$$\mathrm{M\_sample\_n}_{ij} = \frac{\mathrm{Lable\_sample\_n}_{ij}}{\sum_{k=1}^{4} \mathrm{Lable\_sample\_n}_{ik}}$$
wherein $\mathrm{M\_sample\_n}_{ij}$ represents the element value corresponding to first dimension i and second dimension j in the first two-dimensional matrix, $\mathrm{Lable\_sample\_n}_{ij}$ represents the element value corresponding to first dimension i and second dimension j in the second two-dimensional matrix, and $\mathrm{Lable\_sample\_n}_{ik}$ represents the element value corresponding to first dimension i and second dimension k in the second two-dimensional matrix, where i, j, k ∈ {1, 2, 3, 4} are the numbers corresponding to the functional component categories.
Traversing each virus sample API to determine a first two-dimensional matrix corresponding to each virus sample API, and calculating and obtaining a state transition probability matrix in the hidden Markov model based on a plurality of first two-dimensional matrices, which is specifically represented as follows:
$$M_{ij} = \frac{1}{total} \sum_{n=1}^{total} \mathrm{M\_sample\_n}_{ij}$$
wherein $M_{ij}$ represents the element value corresponding to first dimension i and second dimension j in the state transition probability matrix, and total represents the total number of virus samples.
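The counting and normalization in step 201 can be sketched in Python as below. The sketch assumes that each per-sample matrix is the row-normalized transition count matrix and that the model's matrix is the element-wise average over all samples; both are readings of the surrounding definitions rather than formulas quoted from the source. Leaving a row at zero when a state never occurs as a source is a further assumption, and all function names are illustrative.

```python
STATE_INDEX = {"A": 0, "B": 1, "C": 2, "D": 3}


def count_transitions(labels):
    """Count transitions between consecutive labels into a 4x4 matrix."""
    counts = [[0] * 4 for _ in range(4)]
    for current, following in zip(labels, labels[1:]):
        counts[STATE_INDEX[current]][STATE_INDEX[following]] += 1
    return counts


def row_normalize(counts):
    """Turn each row of counts into probabilities; all-zero rows stay zero."""
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def average_matrices(matrices):
    """Element-wise mean of the per-sample 4x4 matrices."""
    total = len(matrices)
    return [[sum(m[i][j] for m in matrices) / total for j in range(4)]
            for i in range(4)]


# Two short hypothetical label sequences (real ones would have the preset length).
samples = [["A", "A", "B", "C"], ["A", "B"]]
per_sample = [row_normalize(count_transitions(labels)) for labels in samples]
transition_matrix = average_matrices(per_sample)
print(transition_matrix[0])
```

Because rows with no outgoing transitions are left at zero, the averaged rows need not sum to one; a production implementation would probably smooth or renormalize them.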
Step 202, specifically, obtaining an observation probability matrix in the hidden markov model by the following method:
$$N_{tj} = U_t[j],$$
and
$$N_{tj} = V_t[j],$$
wherein $N_{tj}$ represents the probability that the virus sample API with actual name number t, among all virus samples, belongs to hidden state j, j ∈ {1, 2, 3, 4}; $U_t[j]$ represents the first functional component probability distribution of that virus sample API, and $V_t[j]$ represents its second functional component probability distribution.
Step 203: specifically, the initial state distribution vector in the hidden Markov model is obtained as follows:
The number of times the APIs contained in the dynamic API calling sequence of the preset length corresponding to a virus sample API are labeled as hidden state j is determined, where j ∈ {1,2,3,4}.
The APIs contained in the dynamic API calling sequence of the preset length corresponding to each virus sample API are traversed, thereby obtaining the labeling count corresponding to each hidden state.
Based on the labeling count corresponding to each hidden state, the initial state distribution vector is calculated by the following formula:

$$\Pi_j = \frac{Label[j]}{\sum_{k=1}^{4} Label[k]}$$

where Π_j represents the initial state distribution vector, Label[j] represents the labeling count corresponding to hidden state j, Label[k] represents the labeling count corresponding to hidden state k, and k ∈ {1,2,3,4} is a functional component class number.
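A short Python sketch of this step, assuming the initial state distribution is the share of each hidden state among all labeled APIs across the samples. The function name and the error raised for empty input are illustrative assumptions.

```python
from collections import Counter
from typing import List

NUM_STATES = 4  # hidden states numbered 1..4


def initial_distribution(all_labels: List[List[int]]) -> List[float]:
    """Return [Pi_1, ..., Pi_4] from the labeled API sequences of all samples."""
    # Label[j]: how many APIs, over every sample, were labeled hidden state j.
    counts = Counter(label for sample in all_labels for label in sample)
    total = sum(counts[j] for j in range(1, NUM_STATES + 1))
    if total == 0:
        raise ValueError("no labeled APIs to estimate from")
    return [counts[j] / total for j in range(1, NUM_STATES + 1)]
```

With two labeled samples `[[1, 1, 2, 3], [1, 2, 4, 4]]`, hidden state 1 accounts for 3 of the 8 labels, so its entry is 3/8.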
Preferably, the virus sample API is automatically perceptually labeled using the hidden Markov model in the following way:
A dynamic API calling sequence is input (that is, the sequence of actual-name numbers of the APIs, where the API numbers refer to the numbering of the Num APIs in the API library and an API's number is its position in that library), and the automatic perception labeling result of each virus sample API is output. The Viterbi algorithm is applied to the hidden Markov model to find the hidden state sequence with the highest probability for the observation sequence by finding the maximum likelihood path.
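For reference, a textbook Viterbi decoder can be sketched in Python as below. This is a generic illustration, not the patent's own code: states are 0-based, `emit[t][j]` follows the layout of the observation matrix (API number t, hidden state j), and all names are assumptions made for the example.

```python
from typing import List, Sequence


def viterbi(obs: Sequence[int], pi: Sequence[float],
            trans: Sequence[Sequence[float]],
            emit: Sequence[Sequence[float]]) -> List[int]:
    """Return the most probable hidden-state path (0-based) for `obs`."""
    n_states = len(pi)
    # score[j]: probability of the best path so far that ends in state j.
    score = [pi[j] * emit[obs[0]][j] for j in range(n_states)]
    back = []  # back[t][j]: best predecessor of state j at step t + 1
    for o in obs[1:]:
        prev_best, new_score = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: score[i] * trans[i][j])
            prev_best.append(best_i)
            new_score.append(score[best_i] * trans[best_i][j] * emit[o][j])
        back.append(prev_best)
        score = new_score
    # Trace the best final state back to the start of the sequence.
    state = max(range(n_states), key=lambda j: score[j])
    path = [state]
    for prev_best in reversed(back):
        state = prev_best[state]
        path.append(state)
    path.reverse()
    return path
```

For long sequences such as 100 API calls, products of probabilities can underflow; summing log-probabilities instead is a common variation.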
Exemplarily, step 1: for the API number sequence to be labeled of virus sample N, API_SEQ_N[100] (obtained according to Λ), the first state does not need to consider the state transition probability but only the observation probability, so the labeling result Label_result is calculated as follows:
Label_result[1] = Max(N_i1, N_i2, N_i3, N_i4)
where i represents the API number corresponding to API_SEQ_N[1], which is why the sequence of API numbers is used as input; Max returns the number of the hidden state with the highest probability, i.e., if N_i1 is the largest, Max returns 1.
Step 2: for API_SEQ_N[n] (1 < n < 100), the transition probability needs to be considered, so the labeling result is calculated as follows:
Label_result[n] = Max(M_{Label_result[n-1],k} * N_ik) (k = 1,2,3,4)
where M_{Label_result[n-1],k} represents the probability that the hidden state labeled for API_SEQ_N[n-1] transitions to hidden state k, i represents the API number corresponding to API_SEQ_N[n], and Max returns the number k of the hidden state with the largest product.
Step 3: the labeling of sample n is complete.
Step 4: steps 1 to 3 are repeated for all total samples, thereby completing the automatic perception labeling of the virus sample APIs of all virus samples.
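Steps 1 to 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `obs_prob[i]` holds the four values N_i1..N_i4 for API number i, `trans[p - 1][k - 1]` holds the transition probability from hidden state p to hidden state k, ties are resolved in favor of the lowest state number, and the function names are hypothetical. The per-step choice here depends only on the previously chosen label, whereas a full Viterbi search would also keep a best-path score for every state.

```python
from typing import List, Mapping, Sequence


def _argmax_state(scores: Sequence[float]) -> int:
    """Return the 1-based number of the highest-scoring hidden state."""
    return max(range(len(scores)), key=lambda k: scores[k]) + 1


def label_sample(api_seq: Sequence[int],
                 trans: Sequence[Sequence[float]],
                 obs_prob: Mapping[int, Sequence[float]]) -> List[int]:
    """Label one sample's API number sequence following steps 1 and 2."""
    # Step 1: the first API uses the observation probabilities only.
    labels = [_argmax_state(obs_prob[api_seq[0]])]
    # Step 2: later APIs weight each state by the transition from the previous label.
    for api in api_seq[1:]:
        prev = labels[-1]
        scores = [trans[prev - 1][k] * obs_prob[api][k] for k in range(len(trans))]
        labels.append(_argmax_state(scores))
    return labels  # Step 3: this sample is fully labeled


def label_all(samples: Sequence[Sequence[int]],
              trans: Sequence[Sequence[float]],
              obs_prob: Mapping[int, Sequence[float]]) -> List[List[int]]:
    """Step 4: repeat the procedure for every sample."""
    return [label_sample(seq, trans, obs_prob) for seq in samples]
```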
System embodiment
The invention further discloses an API marking system for the Windows PE virus sample.
Since the system embodiment and the method embodiment are based on the same working principle, the method embodiment may be referred to for the repeated points, and will not be described herein again.
Specifically, the system comprises:
a virus sample API acquisition module, used for dynamically analyzing each virus sample in an acquired virus sample set to obtain a corresponding virus sample API, wherein the virus sample API comprises a dynamic API calling sequence;
an initial labeling module, used for initially labeling the virus sample API according to the API information defined by the Windows operating system and the dynamic API calling sequence to obtain an initial labeling result;
and a labeling module, used for automatically perceptually labeling the virus sample API according to the initial labeling result and the dynamic API calling sequence of the virus sample API by using the trained hidden Markov model.
According to the method and system for API labeling of Windows PE virus samples, first, the four operation stages of a virus sample API and their corresponding functional components are taken into account, and each virus sample API is assigned to a functional component, so that the characteristics of the virus sample API are analyzed more effectively and labeling based on those characteristics is efficient and accurate. Second, the virus sample API is initially labeled according to its functional feature information and dynamic API sequence information; the initial labeling result is then combined with the corresponding dynamic API calling sequence, and the trained hidden Markov model performs automatic perception labeling, which improves the labeling efficiency of the virus sample API.
Those skilled in the art will appreciate that all or part of the flow of the method implementing the above embodiments may be implemented by a computer program, which is stored in a computer readable storage medium, to instruct related hardware. The computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.
Claims (10)
1. A Windows PE virus sample API labeling method is characterized by comprising the following steps:
dynamically analyzing each virus sample in the acquired virus sample set to obtain a corresponding virus sample API, wherein the virus sample API comprises a dynamic API calling sequence;
according to API information defined by the Windows operating system and the dynamic API calling sequence, initially marking the virus sample API to obtain an initial marking result;
and carrying out automatic perception labeling on the virus sample API according to the initial labeling result of the virus sample API and the dynamic API calling sequence by utilizing the trained hidden Markov model.
2. The API labeling method of Windows PE virus samples according to claim 1, wherein the API information includes API functional feature descriptions;
specifically, the virus sample API is initially labeled in the following way to obtain an initial labeling result:
aiming at one API in the dynamic API calling sequence, determining the functional component class to which the API belongs and the probability of belonging to the functional component class based on the API functional feature description defined by the Windows operating system, thereby obtaining the first functional component probability distribution of the API;
matching one API in the dynamic API calling sequence with an API type matching set in a Windows operating system to determine the type of the API, and further determining the type of a functional component to which the API belongs and the probability of the functional component to which the API belongs, so as to obtain the probability distribution of a second functional component of the API;
if the maximum probability in the first functional component probability distribution is greater than the maximum probability in the second functional component probability distribution, taking the functional component category corresponding to the maximum probability in the first functional component probability distribution as the initial labeling result of the one API; if the maximum probability in the first functional component probability distribution is less than the maximum probability in the second functional component probability distribution, taking the functional component category corresponding to the maximum probability in the second functional component probability distribution as the initial labeling result of the one API; and if the maximum probability in the first functional component probability distribution is equal to the maximum probability in the second functional component probability distribution, selecting either the functional component category corresponding to the maximum probability in the first functional component probability distribution or the functional component category corresponding to the maximum probability in the second functional component probability distribution as the initial labeling result of the one API;
and traversing each API in the dynamic API calling sequence to obtain an initial labeling result of each API, and further obtaining an initial labeling result of the virus sample API.
3. The API labeling method for Windows PE virus samples according to claim 2, wherein the hidden Markov model is obtained by training in the following way:
normalizing the length of a dynamic API call sequence corresponding to each virus sample API, and selecting the dynamic API call sequence with a preset length as the input of a hidden Markov model for each virus sample API; the preset length represents the number of APIs included in the dynamic API calling sequence;
and acquiring a state transition probability matrix, an observation probability matrix and an initial state distribution vector in the hidden Markov model based on the selected input, thereby acquiring the trained hidden Markov model.
4. The Windows PE virus sample API labeling method of claim 3, wherein hidden state types in the hidden Markov model correspond to functional component classes of the virus sample API, the hidden state types include a management and control functional component, a detection functional component, an infection functional component, and a destruction functional component; the observed state class in the hidden Markov model corresponds to the actual name of the virus sample API.
5. The API labeling method for Windows PE virus samples according to claim 4, wherein the state transition probability matrix in the hidden Markov model is obtained by:
numbering each hidden state type;
defining a first two-dimensional matrix corresponding to a virus sample API, wherein a first dimension and a second dimension of the first two-dimensional matrix correspond to hidden state type numbers, and elements in the first two-dimensional matrix represent the probability that a hidden state corresponding to the first dimension is transferred to a hidden state corresponding to the second dimension;
marking each API in a dynamic API calling sequence with a preset length corresponding to the virus sample API according to the initial marking result of the virus sample API, and determining a functional component type set to which the virus sample API belongs according to the API marking result;
setting a second two-dimensional matrix corresponding to the virus sample API, wherein the first dimension and the second dimension of the second two-dimensional matrix both correspond to hidden state type numbers, and elements in the second two-dimensional matrix represent the times of transferring the hidden state corresponding to the first dimension to the hidden state corresponding to the second dimension; determining element values in the second two-dimensional matrix according to the API labeling result;
calculating and obtaining a first two-dimensional matrix corresponding to the virus sample API based on the determined element values in the second two-dimensional matrix;
and traversing each virus sample API to determine a first two-dimensional matrix corresponding to each virus sample API, and calculating to obtain a state transition probability matrix in the hidden Markov model based on the first two-dimensional matrices.
6. The Windows PE virus sample API labeling method of claim 5, wherein computing the first two-dimensional matrix corresponding to the virus sample API based on the determined element values in the second two-dimensional matrix comprises:

$$\mathrm{M\_sample\_n}_{ij} = \frac{\mathrm{Label\_sample\_n}_{ij}}{\sum_{k=1}^{4} \mathrm{Label\_sample\_n}_{ik}}$$

where M_sample_n_ij denotes the element value at first dimension i and second dimension j in the first two-dimensional matrix, Label_sample_n_ij denotes the element value at first dimension i and second dimension j in the second two-dimensional matrix, and Label_sample_n_ik denotes the element value at first dimension i and second dimension k in the second two-dimensional matrix, with i, j, k ∈ {1,2,3,4}.
7. The API labeling method of Windows PE virus samples according to claim 6, wherein the obtaining a state transition probability matrix in a hidden Markov model based on the plurality of first two-dimensional matrix calculations comprises:

$$M_{ij} = \frac{1}{total} \sum_{n=1}^{total} \mathrm{M\_sample\_n}_{ij}$$

where M_ij denotes the element value at first dimension i and second dimension j in the state transition probability matrix, and total denotes the total number of virus samples.
8. The API labeling method for Windows PE virus samples according to claim 4, wherein the observation probability matrix in the hidden Markov model is obtained by:
N_tj = U_t[j],
and
N_tj = V_t[j],
where N_tj represents the probability that the virus sample API whose actual name is numbered t, across all virus samples, belongs to hidden state j, j ∈ {1,2,3,4}; U_t[j] represents the first functional component probability distribution of the virus sample API, and V_t[j] represents the second functional component probability distribution of the virus sample API.
9. The API labeling method for Windows PE virus samples according to claim 4, wherein the initial state distribution vector in the hidden Markov model is obtained by:
determining the times that the API contained in a dynamic API calling sequence with a preset length corresponding to the virus sample API is marked as a hidden state j, wherein the j belongs to {1,2,3,4 };
traversing APIs contained in a dynamic API calling sequence with a preset length corresponding to each virus sample API, so as to obtain the labeling times corresponding to each hidden state;
based on the labeling times corresponding to each hidden state, calculating and obtaining an initial state distribution vector through the following formula:

$$\Pi_j = \frac{Label[j]}{\sum_{k=1}^{4} Label[k]}$$

where Π_j represents the initial state distribution vector, Label[j] represents the labeling count corresponding to hidden state j, and Label[k] represents the labeling count corresponding to hidden state k, with k ∈ {1,2,3,4}.
10. An API labeling system for a Windows PE virus sample, comprising:
a virus sample API acquisition module, used for dynamically analyzing each virus sample in an acquired virus sample set to obtain a corresponding virus sample API, wherein the virus sample API comprises a dynamic API calling sequence;
an initial labeling module, used for initially labeling the virus sample API according to the API information defined by the Windows operating system and the dynamic API calling sequence to obtain an initial labeling result;
and a labeling module, used for automatically perceptually labeling the virus sample API according to the initial labeling result and the dynamic API calling sequence of the virus sample API by using the trained hidden Markov model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111312492.4A CN114003908A (en) | 2021-11-08 | 2021-11-08 | API labeling method and system for Windows PE virus sample |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111312492.4A CN114003908A (en) | 2021-11-08 | 2021-11-08 | API labeling method and system for Windows PE virus sample |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114003908A (en) | 2022-02-01 |
Family
ID=79928004
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111312492.4A Pending CN114003908A (en) | 2021-11-08 | 2021-11-08 | API labeling method and system for Windows PE virus sample |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114003908A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114969738A (en) * | 2022-05-27 | 2022-08-30 | 天翼爱音乐文化科技有限公司 | Interface abnormal behavior monitoring method, system, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||