CN114003908A - API labeling method and system for Windows PE virus sample - Google Patents

API labeling method and system for Windows PE virus sample

Info

Publication number
CN114003908A
Authority
CN
China
Prior art keywords
api
virus sample
virus
functional component
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111312492.4A
Other languages
Chinese (zh)
Inventor
李伟
张永静
邢建华
石春刚
李景田
巩艳伟
常晓林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jinghang Computing Communication Research Institute
Original Assignee
Beijing Jinghang Computing Communication Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jinghang Computing Communication Research Institute
Priority to CN202111312492.4A
Publication of CN114003908A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/561 Virus type analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to an API labeling method and system for Windows PE virus samples. It belongs to the field of cyberspace security and addresses the low accuracy and low efficiency of API labeling of Windows PE virus samples in the prior art. The method comprises the following steps: dynamically analyzing each virus sample in an acquired virus sample set to obtain a corresponding virus sample API, wherein the virus sample API comprises a dynamic API call sequence; initially labeling the virus sample API according to API information defined by the Windows operating system and the dynamic API call sequence, to obtain an initial labeling result; and carrying out automatic perception labeling of the virus sample API with a trained hidden Markov model, according to the initial labeling result and the dynamic API call sequence of the virus sample API. The method enables fast, efficient, and accurate labeling of virus sample APIs.

Description

API labeling method and system for Windows PE virus sample
Technical Field
The invention relates to the technical field of cyberspace security, and in particular to an API labeling method and system for Windows PE virus samples.
Background
Windows PE (Portable Executable) virus samples are currently among the most serious security threats in cyberspace. Such a virus starts without the user noticing and compromises the security and privacy of software and data. The growing number of viruses has driven the adoption of machine learning in virus detection. The key obstacle to applying machine-learning-based virus detection effectively is obtaining sample labels: because viruses are so specialized, identifying their labels is difficult and requires extensive expertise. Manual labeling consumes manpower and material resources. In particular, as new virus programs and variants multiply rapidly, the lag inherent in manual analysis seriously hinders emergency response and makes it difficult to contain virus programs quickly and effectively, so fully manual analysis is no longer efficient or feasible. Automating the analysis process is therefore imperative, and in-depth study of virus sample behavior using automated analysis techniques has significant practical value.
Most existing studies analyze virus samples at the granularity of the whole sample. This approach remains effective for early viruses, Trojan horses, and other malware with relatively simple structures and functions. In newer virus samples, however, the amount of code has grown markedly, and both the code structure and the implemented functions are increasingly complex, so analyzing the complete functionality of a complex virus sample is almost impossible. Finer-grained analysis of virus samples is therefore needed. An API is an Application Programming Interface. Essentially all functions in the Windows operating system are implemented by calling APIs, so the functions of a virus program can be analyzed by analyzing the system APIs it calls. When creating a virus sample, an attacker calls APIs to achieve one or more attack purposes, so the analysis of binary virus samples needs to examine the finer-grained API level.
In addition, effective labeling of PE virus samples is the basis for defending against the threat they pose. In the prior art, labeling methods for virus sample APIs mostly remain at the stage of simple semantic feature analysis. First, the single-API TF-IDF (Term Frequency-Inverse Document Frequency) feature extraction method classifies functions using features such as the term frequency of a single API; TF-IDF is mainly used in information retrieval and data mining. Second, N-gram-based API feature extraction comes from the field of NLP (Natural Language Processing) and works well for extracting semantic information from text. Third, the API sequence feature vector method based on GloVe (Global Vectors for Word Representation) performs well at capturing associations between words.
The prior art has at least the following defects. First, the single-API TF-IDF feature extraction method labels poorly when the virus sample has few API types. Second, in the N-gram-based API feature extraction method, varying the value of N makes the state space of the API sequence huge and lowers labeling efficiency. Third, regarding the GloVe-based API sequence feature vector method: in the virus sample domain, API call sequences are long and highly variable, and GloVe can only generate static word vectors from the result of training on a corpus, so it cannot analyze virus sample APIs well. Fourth, the existing methods cannot effectively analyze the APIs of complex PE virus samples, so more complex PE virus samples cannot be successfully detected.
Disclosure of Invention
In view of the foregoing analysis, embodiments of the present invention provide a method and a system for API annotation of Windows PE virus samples, so as to solve the problems of low accuracy and low efficiency of API annotation of Windows PE virus samples in the prior art.
In one aspect, the invention provides an API labeling method for a Windows PE virus sample, comprising the following steps:
dynamically analyzing each virus sample in the acquired virus sample set to obtain a corresponding virus sample API, wherein the virus sample API comprises a dynamic API calling sequence;
according to API information defined by the Windows operating system and the dynamic API calling sequence, initially marking the virus sample API to obtain an initial marking result;
and carrying out automatic perception labeling on the virus sample API according to the initial labeling result of the virus sample API and the dynamic API calling sequence by utilizing the trained hidden Markov model.
Further, the API information includes API functional feature descriptions; specifically, the virus sample API is initially labeled in the following way to obtain an initial labeling result:
aiming at one API in the dynamic API calling sequence, determining the functional component class to which the API belongs and the probability of belonging to the functional component class based on the API functional feature description defined by the Windows operating system, thereby obtaining the first functional component probability distribution of the API;
matching one API in the dynamic API calling sequence with an API type matching set in a Windows operating system to determine the type of the API, and further determining the type of a functional component to which the API belongs and the probability of the functional component to which the API belongs, so as to obtain the probability distribution of a second functional component of the API;
if the maximum probability in the first functional component probability distribution is greater than the maximum probability in the second functional component probability distribution, taking the functional component category corresponding to the maximum probability in the first functional component probability distribution as the initial labeling result of the one API; if the maximum probability in the first functional component probability distribution is less than the maximum probability in the second functional component probability distribution, taking the functional component category corresponding to the maximum probability in the second functional component probability distribution as the initial labeling result of the one API; and if the two maximum probabilities are equal, selecting either of the two corresponding functional component categories as the initial labeling result of the one API;
and traversing each API in the dynamic API calling sequence to obtain an initial labeling result of each API, and further obtaining an initial labeling result of the virus sample API.
Further, the hidden markov model is obtained by training specifically as follows:
normalizing the length of a dynamic API call sequence corresponding to each virus sample API, and selecting the dynamic API call sequence with a preset length as the input of a hidden Markov model for each virus sample API; the preset length represents the number of APIs included in the dynamic API calling sequence;
and acquiring a state transition probability matrix, an observation probability matrix and an initial state distribution vector in the hidden Markov model based on the selected input, thereby acquiring the trained hidden Markov model.
Further, hidden state types in the hidden markov model correspond to functional component categories of the virus sample API, and the hidden state types include a management and control functional component, a detection functional component, an infection functional component, and a destruction functional component; the observed state class in the hidden Markov model corresponds to the actual name of the virus sample API.
Further, the state transition probability matrix in the hidden markov model is obtained specifically by:
numbering each hidden state type;
defining a first two-dimensional matrix corresponding to a virus sample API, wherein a first dimension and a second dimension of the first two-dimensional matrix correspond to hidden state type numbers, and elements in the first two-dimensional matrix represent the probability that a hidden state corresponding to the first dimension is transferred to a hidden state corresponding to the second dimension;
marking each API in a dynamic API calling sequence with a preset length corresponding to the virus sample API according to the initial marking result of the virus sample API, and determining a functional component type set to which the virus sample API belongs according to the API marking result;
setting a second two-dimensional matrix corresponding to the virus sample API, wherein the first dimension and the second dimension of the second two-dimensional matrix both correspond to hidden state type numbers, and elements in the second two-dimensional matrix represent the times of transferring the hidden state corresponding to the first dimension to the hidden state corresponding to the second dimension; determining element values in the second two-dimensional matrix according to the API labeling result;
calculating and obtaining a first two-dimensional matrix corresponding to the virus sample API based on the determined element values in the second two-dimensional matrix;
and traversing each virus sample API to determine a first two-dimensional matrix corresponding to each virus sample API, and calculating to obtain a state transition probability matrix in the hidden Markov model based on the first two-dimensional matrices.
Further, calculating and obtaining a first two-dimensional matrix corresponding to the virus sample API based on the determined element values in the second two-dimensional matrix, including:
$$\mathrm{M\_sample\_n}_{ij} = \frac{\mathrm{Lable\_sample\_n}_{ij}}{\sum_{k=1}^{4} \mathrm{Lable\_sample\_n}_{ik}}$$
wherein $\mathrm{M\_sample\_n}_{ij}$ denotes the element of the first two-dimensional matrix at first-dimension index i and second-dimension index j, $\mathrm{Lable\_sample\_n}_{ij}$ denotes the element of the second two-dimensional matrix at first-dimension index i and second-dimension index j, $\mathrm{Lable\_sample\_n}_{ik}$ denotes the element of the second two-dimensional matrix at first-dimension index i and second-dimension index k, and i, j, k ∈ {1, 2, 3, 4}.
Further, the obtaining a state transition probability matrix in the hidden markov model based on the plurality of first two-dimensional matrix calculations comprises:
$$M_{ij} = \frac{1}{\mathrm{total}} \sum_{n=1}^{\mathrm{total}} \mathrm{M\_sample\_n}_{ij}, \quad i, j \in \{1, 2, 3, 4\},$$
wherein $M_{ij}$ denotes the element of the state transition probability matrix at first-dimension index i and second-dimension index j, and total denotes the total number of virus samples.
Further, the observation probability matrix in the hidden markov model is obtained specifically by:
$$N_{tj} = U_t[j],$$
and
$$N_{tj} = V_t[j],$$
wherein $N_{tj}$ denotes the probability that the virus sample API with actual-name number t, among all virus samples, belongs to hidden state j, with j ∈ {1, 2, 3, 4}; $U_t[j]$ denotes the first functional component probability distribution of that virus sample API; and $V_t[j]$ denotes the second functional component probability distribution of that virus sample API.
Further, the initial state distribution vector in the hidden markov model is obtained specifically by:
determining the times that the API contained in a dynamic API calling sequence with a preset length corresponding to the virus sample API is marked as a hidden state j, wherein the j belongs to {1,2,3,4 };
traversing APIs contained in a dynamic API calling sequence with a preset length corresponding to each virus sample API, so as to obtain the labeling times corresponding to each hidden state;
based on the labeling times corresponding to each hidden state, calculating and obtaining an initial state distribution vector through the following formula:
$$\Pi_j = \frac{\mathrm{Label}[j]}{\sum_{k=1}^{4} \mathrm{Label}[k]}$$
wherein $\Pi_j$ denotes the j-th component of the initial state distribution vector, Label[j] denotes the number of labels corresponding to hidden state j, Label[k] denotes the number of labels corresponding to hidden state k, and k ∈ {1, 2, 3, 4}.
In another aspect, the present invention provides an API tagging system for Windows PE virus samples, comprising:
a virus sample API acquisition module, configured to dynamically analyze each virus sample in an acquired virus sample set to obtain a corresponding virus sample API, wherein the virus sample API comprises a dynamic API calling sequence;
the initial labeling module is used for carrying out initial labeling on the virus sample API according to the API information defined by the Windows operating system and the dynamic API calling sequence to obtain an initial labeling result;
and the marking module is used for carrying out automatic perception marking on the virus sample API according to the initial marking result and the dynamic API calling sequence of the virus sample API by utilizing the trained hidden Markov model.
Compared with the prior art, the invention can realize at least one of the following beneficial effects:
1. The API labeling method and system for Windows PE virus samples take into account the four common operating stages of a virus sample's APIs and their corresponding functional components, and assign each virus sample API to a functional component. This allows the characteristics of virus sample APIs to be analyzed more effectively and labeled efficiently and accurately on that basis.
2. The API labeling method and system for Windows PE virus samples first label the virus sample API initially according to its functional feature information and dynamic API sequence information, and then use a trained hidden Markov model to carry out automatic perception labeling according to the initial labeling result and the corresponding dynamic API call sequence, which improves the labeling efficiency of virus sample APIs.
In the invention, the technical schemes can be combined with each other to realize more preferable combination schemes. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a flowchart of an API labeling method for a Windows PE virus sample according to an embodiment of the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
Method embodiment
The invention discloses a specific embodiment of a Windows PE virus sample API labeling method. As shown in fig. 1, the method includes:
S110: Dynamically analyze each virus sample in the acquired virus sample set to obtain a corresponding virus sample API, wherein the virus sample API comprises a dynamic API call sequence. Specifically, the virus sample APIs obtained by the analysis are all APIs defined on the Windows operating system platform.
S120: Initially label the virus sample API according to the API information defined by the Windows operating system and the dynamic API call sequence, to obtain an initial labeling result.
S130: Using the trained hidden Markov model, carry out automatic perception labeling of the virus sample API according to its initial labeling result and dynamic API call sequence.
Preferably, in step S110, the virus samples in the virus sample set are dynamically analyzed with Cuckoo, an existing sandbox (virtual execution environment) tool, to obtain the dynamic API call sequence corresponding to each virus sample. Each virus sample is run in the Cuckoo virtual execution environment, and the APIs it calls are recorded in order to construct an API sequence with dynamic characteristics. The resulting dynamic API call sequence is then extracted and filtered, that is, redundant APIs (adjacent repeated APIs in the sequence) and noise (runs of consecutive APIs in which one or more repeated APIs were probably added by the virus author) are removed, to obtain the processed dynamic API call sequence.
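The redundancy-removal part of this filtering can be illustrated with a minimal Python sketch. It assumes the recorded trace is already available as a list of API name strings; the function name `collapse_adjacent_repeats` and the sample trace are hypothetical, and the noise-removal rule is left out because the text does not specify it precisely.

```python
from itertools import groupby


def collapse_adjacent_repeats(calls):
    """Keep a single copy of each run of identical, adjacent API names."""
    return [name for name, _run in groupby(calls)]


# Hypothetical trace; the API names are only illustrative.
raw_trace = ["NtOpenFile", "NtOpenFile", "NtReadFile",
             "NtClose", "NtClose", "NtOpenFile"]
cleaned_trace = collapse_adjacent_repeats(raw_trace)
```

Only adjacent repeats are collapsed, so a later, non-adjacent call to the same API is preserved and the call order stays intact.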
Preferably, in step S120, the initial labeling is performed on the virus sample API specifically in the following manner to obtain an initial labeling result:
S1201: For one API (referred to as the first API) in the dynamic API call sequence, the API is manually compared against the Windows operating system API manual, based on the API functional feature descriptions, parameter settings, and the like defined by the Windows operating system, to determine the functional component category to which the first API belongs. The probability that the first API belongs to each functional component category is determined empirically, which yields the first functional component probability distribution of the first API. Specifically, the first functional component probability distribution takes the following form:
$$U = [A: w_A,\ B: w_B,\ C: w_C,\ D: w_D]$$
where $U$ denotes the first functional component probability distribution; A, B, C, and D denote the four functional components of management and control, detection, infection, and destruction, respectively; $w_A$, $w_B$, $w_C$, and $w_D$ denote the probabilities that the first API belongs to the corresponding functional component; and the probabilities satisfy $w_A + w_B + w_C + w_D = 1$.
S1202: Match one API (the first API) in the dynamic API call sequence against the API type matching sets of the Windows operating system to determine the type of the API, and from that determine the functional component category to which the API belongs and the probability of its belonging to that category, thereby obtaining the second functional component probability distribution of the API. Specifically:
The APIs of the Windows operating system platform are divided into 10 types: memory, process, kernel, device, file, system, text, registry, window, and network. A matching set is defined for each API type, denoted $M_{memory}$, $M_{process}$, $M_{kernel}$, $M_{device}$, $M_{file}$, $M_{system}$, $M_{text}$, $M_{registry}$, $M_{window}$, and $M_{network}$. The contents of each matching set are shown in Table 1. The memory and process types are defined as belonging to the management and control component; kernel and device to the detection component; file, system, and text to the infection component; and registry, window, and network to the destruction component. One API (the first API) in the dynamic API call sequence is matched against every item in the 10 matching sets. If the API contains an item, the first API is related to that API type and therefore to the functional component corresponding to that type. The matching result is recorded in the following form:
$$[A: N_A,\ B: N_B,\ C: N_C,\ D: N_D]$$
where A, B, C, and D denote the four functional components of management and control, detection, infection, and destruction, respectively, and $N_A$, $N_B$, $N_C$, and $N_D$ denote the total number of substrings of the first API matched to the corresponding functional component. That is, the functional component to which each character string in the first API belongs is determined, giving the total number of character strings in the first API that correspond to each functional component. A matching record is then constructed for the first API, namely the second functional component probability distribution:
$$V = [A: v_A,\ B: v_B,\ C: v_C,\ D: v_D]$$
where $v_x$ ($x = A, B, C, D$) denotes the ratio of the number of character strings matched to functional component $x$ to the total number of character strings. It is calculated as follows:
$$v_x = \frac{N_x}{N_A + N_B + N_C + N_D}$$
S1203: If the maximum probability in the first functional component probability distribution is greater than the maximum probability in the second functional component probability distribution, the functional component category corresponding to the maximum probability in the first distribution is taken as the initial labeling result of the API (the first API). If it is smaller, the functional component category corresponding to the maximum probability in the second distribution is taken as the initial labeling result of the API. If the two maximum probabilities are equal, either of the two corresponding functional component categories is selected as the initial labeling result of the API.
S1204, traversing each API in the dynamic API call sequence to obtain an initial labeling result of each API, and further obtaining an initial labeling result of the virus sample API.
TABLE 1. Contents of the ten API type matching sets (table images not reproduced)
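Steps S1202 and S1203 can be sketched in Python as follows. The keyword sets in `MATCH_SETS` are hypothetical placeholders, since the actual contents of Table 1 are not available here. Case-insensitive substring matching, the all-zero result when nothing matches, and resolving ties in favor of the first distribution are all assumptions of this sketch rather than details stated in the text.

```python
# Hypothetical keyword sets standing in for Table 1 (NOT the patent's actual sets).
MATCH_SETS = {
    "memory": ("Memory", "Heap"),
    "process": ("Process", "Thread"),
    "kernel": ("Kernel",),
    "device": ("Device",),
    "file": ("File", "Directory"),
    "system": ("System",),
    "text": ("Text", "String"),
    "registry": ("Reg", "Key"),
    "window": ("Window",),
    "network": ("Socket", "Internet"),
}

# Type-to-component mapping as stated in the description:
# A = management and control, B = detection, C = infection, D = destruction.
TYPE_TO_COMPONENT = {
    "memory": "A", "process": "A",
    "kernel": "B", "device": "B",
    "file": "C", "system": "C", "text": "C",
    "registry": "D", "window": "D", "network": "D",
}
COMPONENTS = ("A", "B", "C", "D")


def second_distribution(api_name):
    """Compute V: the share of matched keywords per functional component."""
    name = api_name.lower()
    counts = dict.fromkeys(COMPONENTS, 0)
    for api_type, keywords in MATCH_SETS.items():
        hits = sum(1 for kw in keywords if kw.lower() in name)
        counts[TYPE_TO_COMPONENT[api_type]] += hits
    total = sum(counts.values())
    if total == 0:
        # Assumption: the text does not say what to do when nothing matches.
        return dict.fromkeys(COMPONENTS, 0.0)
    return {c: n / total for c, n in counts.items()}


def initial_label(u, v):
    """Pick the component whose maximum probability is larger (ties go to U)."""
    best_u = max(u, key=u.get)
    best_v = max(v, key=v.get)
    return best_u if u[best_u] >= v[best_v] else best_v


# Hypothetical expert-assigned first distribution U for one API.
u_example = {"A": 0.1, "B": 0.3, "C": 0.4, "D": 0.2}
v_example = second_distribution("NtDeviceIoControlFile")
label_example = initial_label(u_example, v_example)
```

With these placeholder sets, the example API name matches one device keyword and one file keyword, so V splits evenly between components B and C and outweighs the largest entry of the hypothetical U.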
First, the relevant parameters of the hidden Markov model are explained as follows:
R denotes the set of hidden state types;
Λ denotes the set of observed state types;
M denotes the state transition probability matrix (two-dimensional);
N denotes the observation probability matrix (two-dimensional);
Π denotes the initial state distribution vector.
The hidden state types in the hidden Markov model correspond to the functional component types of the virus sample API and comprise the management and control functional component, the detection functional component, the infection functional component, and the destruction functional component. The observed state types in the hidden Markov model correspond to the actual names of the virus sample APIs, so the dynamic API call sequence can be labeled automatically by analogy with part-of-speech tagging.
Specifically, R = [A, B, C, D], the set of symbols for the four hidden states (functional component types);
Λ = [1, 2, 3, 4, …, Num], i.e., the set of numbers assigned to the Num virus sample APIs with different actual names;
$M_{ij}$ denotes the probability of hidden state i transitioning to hidden state j. A "transition" means: if the m-th element of Lable_sample_n is hidden state i and the (m+1)-th element is hidden state j, then hidden state i transitions to hidden state j.
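Because the hidden states are functional component types and the observations are API names, labeling a call sequence amounts to decoding the most likely hidden-state path, much as in part-of-speech tagging. The text does not name a decoding algorithm. The sketch below assumes the Viterbi algorithm, a standard choice for hidden Markov models, and assumes an emission table `emit[state][api]`; all numeric values and API names are hypothetical toy data.

```python
import math

STATES = ["A", "B", "C", "D"]


def _log(p):
    """Natural log that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations, start, trans, emit):
    """Return the most likely hidden-state path for a list of API names."""
    score = {s: _log(start[s]) + _log(emit[s].get(observations[0], 0.0))
             for s in STATES}
    back = []
    for api in observations[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + _log(trans[p][s]))
            score[s] = (prev[best] + _log(trans[best][s])
                        + _log(emit[s].get(api, 0.0)))
            pointers[s] = best
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(STATES, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy parameters (hypothetical): sticky transitions, two informative states.
toy_start = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}
toy_trans = {s: {t: (0.7 if s == t else 0.1) for t in STATES} for s in STATES}
toy_emit = {
    "A": {"OpenProcess": 0.9, "WriteFile": 0.1},
    "B": {},
    "C": {"OpenProcess": 0.1, "WriteFile": 0.9},
    "D": {},
}
toy_path = viterbi(["OpenProcess", "OpenProcess", "WriteFile", "WriteFile"],
                   toy_start, toy_trans, toy_emit)
```

Working in log space avoids numeric underflow on sequences of 100 calls, and the result is one functional component label per API in the input sequence.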
Preferably, the hidden Markov model is obtained by training as follows:
Step 1: Normalize the length of the dynamic API call sequence corresponding to each virus sample API, and for each virus sample API select a dynamic API call sequence of a preset length as the input of the hidden Markov model. The preset length is the number of APIs included in the dynamic API call sequence; illustratively, the preset length is 100.
Step 2: Based on the selected input, obtain the state transition probability matrix, the observation probability matrix, and the initial state distribution vector of the hidden Markov model, thereby obtaining the trained hidden Markov model.
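The length normalization in step 1 might look like the following sketch. The text only states that a sequence of the preset length (illustratively 100) is selected. Taking the first 100 calls and discarding shorter traces are assumptions made here, and `normalize_length` is a hypothetical helper name.

```python
PRESET_LENGTH = 100  # illustrative value given in the description


def normalize_length(calls, length=PRESET_LENGTH):
    """Return the first `length` API calls, or None if the trace is too short.

    Assumption: the text does not say which window to select or how to
    handle short traces; this sketch truncates and skips short ones.
    """
    if len(calls) < length:
        return None
    return list(calls[:length])


long_trace = [f"Api{i}" for i in range(150)]  # hypothetical trace
fixed_trace = normalize_length(long_trace)
```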
Preferably, step 2 specifically comprises:
Step 201: The state transition probability matrix in the hidden Markov model is obtained as follows:
Each hidden state type is numbered, illustratively: [A:1, B:2, C:3, D:4].
A first two-dimensional matrix M_sample_n[4][4] corresponding to the virus sample API is defined. The first dimension and the second dimension of the first two-dimensional matrix both correspond to hidden state type numbers, and the element $\mathrm{M\_sample\_n}_{ij}$ of the first two-dimensional matrix represents the probability that the hidden state corresponding to the first dimension transitions to the hidden state corresponding to the second dimension. That is, $\mathrm{M\_sample\_n}_{ij}$ is the probability of hidden state i transitioning to hidden state j in the n-th (n = 1, 2, 3, …, total) virus sample API (i.e., dynamic API call sequence), where total is the total number of virus samples.
Each API in the dynamic API call sequence of the preset length corresponding to the virus sample API is labeled according to the initial labeling result of the virus sample API, and the functional component type set Lable_sample_n[100] to which the virus sample API belongs is determined from the API labeling result, where Lable_sample_n[f] is the functional component to which the f-th API belongs.
A second two-dimensional matrix Label_sample_n[4][4] corresponding to the virus sample API is set. The first dimension and the second dimension of the second two-dimensional matrix both correspond to hidden state type numbers, and the elements of the second two-dimensional matrix represent the number of times the hidden state corresponding to the first dimension transitions to the hidden state corresponding to the second dimension. The element values of the second two-dimensional matrix are determined from the API labeling result. Specifically, $\mathrm{Label\_sample\_n}_{ij}$ is the number of transitions from hidden state i to hidden state j in the n-th virus sample API. The label sequence Lable_sample_n of the n-th virus sample is traversed, and the transition counts are tallied and recorded in Label_sample_n.
Calculating and obtaining a first two-dimensional matrix corresponding to the virus sample API based on the determined element values in the second two-dimensional matrix:
$$\mathrm{M\_sample\_n}_{ij} = \frac{\mathrm{Lable\_sample\_n}_{ij}}{\sum_{k=1}^{4} \mathrm{Lable\_sample\_n}_{ik}}$$
where $\mathrm{M\_sample\_n}_{ij}$ denotes the element of the first two-dimensional matrix at first-dimension index i and second-dimension index j, $\mathrm{Lable\_sample\_n}_{ij}$ denotes the element of the second two-dimensional matrix at first-dimension index i and second-dimension index j, $\mathrm{Lable\_sample\_n}_{ik}$ denotes the element of the second two-dimensional matrix at first-dimension index i and second-dimension index k, and i, j, k ∈ {1, 2, 3, 4} are the numbers corresponding to the functional component categories.
Each virus sample API is traversed to determine its first two-dimensional matrix, and the state transition probability matrix in the hidden Markov model is calculated from the plurality of first two-dimensional matrices, specifically as follows:
$$M_{ij} = \frac{1}{total} \sum_{n=1}^{total} \mathrm{M\_sample\_n}_{ij} \quad (i, j \in \{1,2,3,4\})$$
where M_{ij} represents the element value at first dimension i and second dimension j in the state transition probability matrix, and total represents the total number of virus samples.
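The combination of per-sample matrices can be sketched as follows, assuming the state transition probability matrix is the element-wise mean of the first two-dimensional matrices over the total samples. The function name `average_matrices` and the toy inputs are hypothetical.

```python
def average_matrices(matrices):
    """Element-wise mean of equally sized square matrices."""
    total = len(matrices)  # total number of virus samples
    if total == 0:
        raise ValueError("at least one per-sample matrix is required")
    size = len(matrices[0])
    return [
        [sum(m[i][j] for m in matrices) / total for j in range(size)]
        for i in range(size)
    ]

# Two toy per-sample matrices: an identity matrix and a uniform matrix.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
uniform = [[0.25] * 4 for _ in range(4)]
M = average_matrices([identity, uniform])
```

Because every input row sums to 1, every row of the averaged matrix also sums to 1.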
Step 202, specifically, the observation probability matrix in the hidden Markov model is obtained as follows:
N_{tj} = U_t[j],
and
N_{tj} = V_t[j],
where N_{tj} represents the probability that the virus sample API whose actual name number is t, among all virus samples, belongs to hidden state j, j ∈ {1,2,3,4}; U_t[j] represents the first functional component probability distribution of the virus sample API, and V_t[j] represents the second functional component probability distribution of the virus sample API.
Step 203, specifically, the initial state distribution vector in the hidden Markov model is obtained as follows:
The number of times the APIs contained in the preset-length dynamic API call sequence corresponding to the virus sample API are labeled as hidden state j is determined, where j ∈ {1,2,3,4}.
The APIs contained in the preset-length dynamic API call sequence corresponding to each virus sample API are traversed, thereby obtaining the number of labels corresponding to each hidden state.
Based on the number of labels corresponding to each hidden state, the initial state distribution vector is calculated by the following formula:
$$\Pi_j = \frac{Label[j]}{\sum_{k=1}^{4} Label[k]}$$
where Π_j represents the initial state distribution vector, Label[j] represents the number of labels corresponding to hidden state j, Label[k] represents the number of labels corresponding to hidden state k, and k ∈ {1,2,3,4} is the number of the functional component class.
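A minimal sketch of this step, assuming the initial state distribution is each hidden state's share of all labels across the samples. The function name `initial_distribution` and the toy label sequences are hypothetical.

```python
from collections import Counter

def initial_distribution(label_sequences, num_states=4):
    """Share of each hidden state among all labels in all sequences.

    `label_sequences` is a list of per-sample lists of 1-based state numbers.
    """
    counts = Counter(label for seq in label_sequences for label in seq)
    total = sum(counts[j] for j in range(1, num_states + 1))
    if total == 0:
        raise ValueError("no labels to count")
    return [counts[j] / total for j in range(1, num_states + 1)]

# Two short illustrative sequences (the embodiment uses length 100 each).
pi = initial_distribution([[1, 1, 2, 3], [1, 4, 4, 2]])
```

Here state 1 accounts for 3 of the 8 labels, so the first entry of `pi` is 0.375.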
Preferably, the virus sample is automatically perceptually labeled using the hidden Markov model as follows:
A dynamic API call sequence is input (as the sequence of actual name numbers of its APIs, where an API's number refers to the numbering of the Num API library, that is, the position of the API within that library), and the automatic perception labeling result of each virus sample API is output. The Viterbi algorithm is applied to the hidden Markov model to find the maximum-likelihood path, that is, the hidden state sequence with the highest probability for the observation sequence.
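The text names the Viterbi algorithm without listing an implementation. As a minimal sketch, a textbook Viterbi decoder in Python might look like the following. The function name `viterbi`, the 0-based API numbers, the log-space scoring, and the use of N[t][j] as the per-observation score are assumptions made for illustration; `pi`, `M` and `N` stand for the initial state distribution vector, the state transition probability matrix and the observation probability matrix.

```python
import math

def _log(p):
    """Logarithm that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(obs, pi, M, N):
    """Most likely hidden-state path for the API numbers in `obs`.

    pi[j]: initial probability of state j; M[i][j]: transition i -> j;
    N[t][j]: score of state j for API number t. Returns 1-based states.
    """
    num_states = len(pi)
    score = [_log(pi[j]) + _log(N[obs[0]][j]) for j in range(num_states)]
    backpointers = []
    for t in obs[1:]:
        prev, score, ptr = score, [], []
        for j in range(num_states):
            # Best predecessor state for reaching state j at this step.
            best = max(range(num_states), key=lambda i: prev[i] + _log(M[i][j]))
            score.append(prev[best] + _log(M[best][j]) + _log(N[t][j]))
            ptr.append(best)
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(range(num_states), key=lambda j: score[j])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return [s + 1 for s in path]

# Toy two-state model for illustration only (the embodiment has four states).
toy_path = viterbi(
    obs=[0, 1, 0],
    pi=[0.5, 0.5],
    M=[[0.9, 0.1], [0.1, 0.9]],
    N=[[0.8, 0.2], [0.4, 0.6]],
)
```

Working in log space avoids numerical underflow on long sequences such as the length-100 sequences used in this embodiment.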
Exemplarily, step 1: for the API number sequence to be labeled, API_SEQ_N[100], of virus sample N (obtained according to Λ), the first state does not need to consider the state transition probability, only the observation probability, so the labeling result Label_result is calculated as follows:
Label_result[1] = Max(N_{i1}, N_{i2}, N_{i3}, N_{i4})
where i represents the API number corresponding to API_SEQ_N[1], which is why the API number sequence is used as input. Max returns the number of the hidden state with the highest probability; that is, if N_{i1} is the largest, Max returns 1.
Step 2: for API_SEQ_N[n] (1 < n ≤ 100), the transition probability needs to be considered, so the labeling result is calculated as follows:
Label_result[n] = Max(M_{Label_result[n-1],k} * N_{ik}) (k = 1,2,3,4)
where M_{Label_result[n-1],k} represents the probability that the hidden state labeled for API_SEQ_N[n-1] transitions to hidden state k, k being a functional component class number, and i represents the API number corresponding to API_SEQ_N[n].
Step 3: the labeling of virus sample N is complete.
Step 4: steps 1 to 3 are repeated for all total virus samples, thereby completing the automatic perception labeling of the virus sample APIs of all virus samples.
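Steps 1 to 3 can be sketched in Python as a step-by-step rule that labels each position from the previously chosen label. The function name `greedy_label`, the 0-based API numbers, and the toy matrices are hypothetical. Note that this rule fixes each label as it goes, which is simpler than a full Viterbi search over all paths.

```python
def greedy_label(api_seq, M, N):
    """Label each API with the best state given the previous label.

    api_seq: API numbers (0-based here); M[i][k]: transition i -> k;
    N[t][k]: probability that API t belongs to state k.
    Returns 1-based hidden-state numbers.
    """
    states = range(len(M))
    # Step 1: the first label uses the observation probability only.
    first_api = api_seq[0]
    labels = [max(states, key=lambda k: N[first_api][k])]
    # Step 2: later labels combine transition and observation probability.
    for api in api_seq[1:]:
        prev = labels[-1]
        labels.append(max(states, key=lambda k: M[prev][k] * N[api][k]))
    return [k + 1 for k in labels]

# Toy model with four hidden states and three distinct APIs.
M = [[0.7, 0.1, 0.1, 0.1],
     [0.1, 0.7, 0.1, 0.1],
     [0.1, 0.1, 0.7, 0.1],
     [0.1, 0.1, 0.1, 0.7]]
N = [[0.6, 0.2, 0.1, 0.1],
     [0.3, 0.4, 0.2, 0.1],
     [0.05, 0.05, 0.1, 0.8]]
result = greedy_label([0, 1, 2], M, N)
```

In the toy run, the second API on its own would favor state 2, but the strong self-transition of state 1 keeps the label at 1 until the third API clearly favors state 4.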
System embodiment
The invention further discloses an API marking system for the Windows PE virus sample.
Since the system embodiment and the method embodiment are based on the same working principle, the method embodiment may be referred to for the repeated points, and will not be described herein again.
Specifically, the system comprises:
a virus sample API acquisition module, configured to dynamically analyze each virus sample in the acquired virus sample set to obtain a corresponding virus sample API, wherein the virus sample API comprises a dynamic API call sequence;
an initial labeling module, configured to initially label the virus sample API according to the API information defined by the Windows operating system and the dynamic API call sequence to obtain an initial labeling result;
and a labeling module, configured to automatically perceptually label the virus sample API according to the initial labeling result of the virus sample API and the dynamic API call sequence by using the trained hidden Markov model.
The method and system for labeling the API of a Windows PE virus sample provided by the invention have the following advantages. First, the four operation stages of a virus sample API and their corresponding functional components are taken into account, and each virus sample API is assigned to a functional component, so that the characteristics of the virus sample API are analyzed more effectively and labeling based on those characteristics is efficient and accurate. Second, the virus sample API is initially labeled according to its functional feature information and dynamic API sequence information; then, using the initial labeling result together with the corresponding dynamic API call sequence, the trained hidden Markov model automatically perceptually labels the virus sample API, which improves the labeling efficiency of the virus sample API.
Those skilled in the art will appreciate that all or part of the flow of the method implementing the above embodiments may be implemented by a computer program, which is stored in a computer readable storage medium, to instruct related hardware. The computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (10)

1. A Windows PE virus sample API labeling method is characterized by comprising the following steps:
dynamically analyzing each virus sample in the acquired virus sample set to obtain a corresponding virus sample API, wherein the virus sample API comprises a dynamic API calling sequence;
according to API information defined by the Windows operating system and the dynamic API calling sequence, initially marking the virus sample API to obtain an initial marking result;
and carrying out automatic perception labeling on the virus sample API according to the initial labeling result of the virus sample API and the dynamic API calling sequence by utilizing the trained hidden Markov model.
2. The API labeling method of Windows PE virus samples according to claim 1, wherein the API information includes API functional feature descriptions;
specifically, the virus sample API is initially labeled in the following way to obtain an initial labeling result:
aiming at one API in the dynamic API calling sequence, determining the functional component class to which the API belongs and the probability of belonging to the functional component class based on the API functional feature description defined by the Windows operating system, thereby obtaining the first functional component probability distribution of the API;
matching one API in the dynamic API calling sequence with an API type matching set in a Windows operating system to determine the type of the API, and further determining the type of a functional component to which the API belongs and the probability of the functional component to which the API belongs, so as to obtain the probability distribution of a second functional component of the API;
if the maximum probability in the first functional component probability distribution is greater than the maximum probability in the second functional component probability distribution, taking the functional component category corresponding to the maximum probability in the first functional component probability distribution as the initial labeling result of the one API; if the maximum probability in the first functional component probability distribution is less than the maximum probability in the second functional component probability distribution, taking the functional component category corresponding to the maximum probability in the second functional component probability distribution as the initial labeling result of the one API; and if the two maximum probabilities are equal, selecting either the functional component category corresponding to the maximum probability in the first functional component probability distribution or the functional component category corresponding to the maximum probability in the second functional component probability distribution as the initial labeling result of the one API;
and traversing each API in the dynamic API calling sequence to obtain an initial labeling result of each API, and further obtaining an initial labeling result of the virus sample API.
3. The API labeling method for Windows PE virus samples according to claim 2, wherein the hidden Markov model is obtained by training in the following way:
normalizing the length of a dynamic API call sequence corresponding to each virus sample API, and selecting the dynamic API call sequence with a preset length as the input of a hidden Markov model for each virus sample API; the preset length represents the number of APIs included in the dynamic API calling sequence;
and acquiring a state transition probability matrix, an observation probability matrix and an initial state distribution vector in the hidden Markov model based on the selected input, thereby acquiring the trained hidden Markov model.
4. The Windows PE virus sample API labeling method of claim 3, wherein hidden state types in the hidden Markov model correspond to functional component classes of the virus sample API, the hidden state types include a management and control functional component, a detection functional component, an infection functional component, and a destruction functional component; the observed state class in the hidden Markov model corresponds to the actual name of the virus sample API.
5. The API labeling method for Windows PE virus samples according to claim 4, wherein the state transition probability matrix in the hidden Markov model is obtained by:
numbering each hidden state type;
defining a first two-dimensional matrix corresponding to a virus sample API, wherein a first dimension and a second dimension of the first two-dimensional matrix correspond to hidden state type numbers, and elements in the first two-dimensional matrix represent the probability that a hidden state corresponding to the first dimension is transferred to a hidden state corresponding to the second dimension;
marking each API in a dynamic API calling sequence with a preset length corresponding to the virus sample API according to the initial marking result of the virus sample API, and determining a functional component type set to which the virus sample API belongs according to the API marking result;
setting a second two-dimensional matrix corresponding to the virus sample API, wherein the first dimension and the second dimension of the second two-dimensional matrix both correspond to hidden state type numbers, and elements in the second two-dimensional matrix represent the times of transferring the hidden state corresponding to the first dimension to the hidden state corresponding to the second dimension; determining element values in the second two-dimensional matrix according to the API labeling result;
calculating and obtaining a first two-dimensional matrix corresponding to the virus sample API based on the determined element values in the second two-dimensional matrix;
and traversing each virus sample API to determine a first two-dimensional matrix corresponding to each virus sample API, and calculating to obtain a state transition probability matrix in the hidden Markov model based on the first two-dimensional matrices.
6. The Windows PE virus sample API labeling method of claim 5, wherein computing the first two-dimensional matrix corresponding to the virus sample API based on the determined element values in the second two-dimensional matrix comprises:
$$\mathrm{M\_sample\_n}_{ij} = \frac{\mathrm{Lable\_sample\_n}_{ij}}{\sum_{k=1}^{4} \mathrm{Lable\_sample\_n}_{ik}}$$
wherein M_sample_n_{ij} represents the element value at first dimension i and second dimension j in the first two-dimensional matrix, Lable_sample_n_{ij} represents the element value at first dimension i and second dimension j in the second two-dimensional matrix, Lable_sample_n_{ik} represents the element value at first dimension i and second dimension k in the second two-dimensional matrix, and i, j, k ∈ {1,2,3,4}.
7. The API labeling method of Windows PE virus samples according to claim 6, wherein the obtaining a state transition probability matrix in a hidden Markov model based on the plurality of first two-dimensional matrix calculations comprises:
$$M_{ij} = \frac{1}{total} \sum_{n=1}^{total} \mathrm{M\_sample\_n}_{ij}$$
wherein M_{ij} represents the element value at first dimension i and second dimension j in the state transition probability matrix, and total represents the total number of virus samples.
8. The API labeling method for Windows PE virus samples according to claim 4, wherein the observation probability matrix in the hidden Markov model is obtained by:
N_{tj} = U_t[j],
and
N_{tj} = V_t[j],
wherein N_{tj} represents the probability that the virus sample API whose actual name number is t, among all virus samples, belongs to hidden state j, j ∈ {1,2,3,4}; U_t[j] represents the first functional component probability distribution of the virus sample API, and V_t[j] represents the second functional component probability distribution of the virus sample API.
9. The API labeling method for Windows PE virus samples according to claim 4, wherein the initial state distribution vector in the hidden Markov model is obtained by:
determining the times that the API contained in a dynamic API calling sequence with a preset length corresponding to the virus sample API is marked as a hidden state j, wherein the j belongs to {1,2,3,4 };
traversing APIs contained in a dynamic API calling sequence with a preset length corresponding to each virus sample API, so as to obtain the labeling times corresponding to each hidden state;
based on the labeling times corresponding to each hidden state, calculating and obtaining an initial state distribution vector through the following formula:
$$\Pi_j = \frac{Label[j]}{\sum_{k=1}^{4} Label[k]}$$
wherein Π_j represents the initial state distribution vector, Label[j] represents the number of labels corresponding to hidden state j, Label[k] represents the number of labels corresponding to hidden state k, and k ∈ {1,2,3,4}.
10. An API labeling system for a Windows PE virus sample, comprising:
a virus sample API acquisition module, configured to dynamically analyze each virus sample in the acquired virus sample set to obtain a corresponding virus sample API, wherein the virus sample API comprises a dynamic API call sequence;
the initial labeling module is used for carrying out initial labeling on the virus sample API according to the API information defined by the Windows operating system and the dynamic API calling sequence to obtain an initial labeling result;
and the marking module is used for carrying out automatic perception marking on the virus sample API according to the initial marking result and the dynamic API calling sequence of the virus sample API by utilizing the trained hidden Markov model.
CN202111312492.4A 2021-11-08 2021-11-08 API labeling method and system for Windows PE virus sample Pending CN114003908A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111312492.4A CN114003908A (en) 2021-11-08 2021-11-08 API labeling method and system for Windows PE virus sample

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111312492.4A CN114003908A (en) 2021-11-08 2021-11-08 API labeling method and system for Windows PE virus sample

Publications (1)

Publication Number Publication Date
CN114003908A true CN114003908A (en) 2022-02-01

Family

ID=79928004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111312492.4A Pending CN114003908A (en) 2021-11-08 2021-11-08 API labeling method and system for Windows PE virus sample

Country Status (1)

Country Link
CN (1) CN114003908A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114969738A (en) * 2022-05-27 2022-08-30 天翼爱音乐文化科技有限公司 Interface abnormal behavior monitoring method, system, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination