CN110868405B

CN110868405B - Malicious code detection method and device, computer equipment and storage medium

Info

Publication number: CN110868405B
Application number: CN201911071345.5A
Authority: CN
Inventors: 梁志宏; 胡朝辉; 陈佳捷; 罗强; 高健; 伍思廉; 郑伟文; 吴佩泽; 彭伯庄; 王金贺; 陈鹏
Original assignee: Southern Power Grid Digital Grid Research Institute Co Ltd
Current assignee: China Southern Power Grid Digital Platform Technology Guangdong Co ltd
Priority date: 2019-11-05
Filing date: 2019-11-05
Publication date: 2022-03-04
Anticipated expiration: 2039-11-05
Also published as: CN110868405A

Abstract

The application discloses a malicious code detection method and device, computer equipment and a storage medium, and relates to the technical field of information security. In the method, a server of the network equipment can select a first calling sequence and a second calling sequence from an Application Program Interface (API) sequence set, respectively calculate a normal risk value and a malicious risk value of the first calling sequence, and a normal risk value and a malicious risk value of the second calling sequence, and label a target sequence according to the normal risk value and the malicious risk value of the first calling sequence and the normal risk value and the malicious risk value of the second calling sequence to obtain a labeling result; and removing the target sequence from the API sequence set, taking the API sequence set without the target sequence as the next circulating API sequence set, and circularly detecting and labeling the calling sequences to be detected until all the calling sequences in the API sequence set are labeled. According to the technical scheme, the malicious code processing efficiency can be improved.

Description

Malicious code detection method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of information security technologies, and in particular, to a malicious code detection method and apparatus, a computer device, and a storage medium.

Background

A network attack exploits vulnerabilities and security defects in a network information system to attack the information system and data resources of a network device. Specifically, a network attack can tamper with the permissions of the attacked network device in order to steal files, or can cause the attacked network device to deny service so that the user cannot use it normally, bringing huge losses to the user.

The prior art provides a method for detecting whether malicious code exists in a target file received by a network device. The method includes: obtaining dynamic action information of the target file to be detected, where the dynamic action information comprises action information and access information generated after the target file runs; and determining that malicious code exists in the target file when the action information fails a security baseline check or the access information points to a core unit of the network device.

However, when malicious code is found in the target file, the source code of the target file must be examined again to determine the location of the malicious code, which requires additional time and labor and results in low efficiency in processing the malicious code.

Disclosure of Invention

Based on this, it is necessary to provide a malicious code detection method, apparatus, computer device and storage medium to solve the above-mentioned problem that the specific location of malicious code in the target file cannot be determined.

In a first aspect, an embodiment of the present application provides a malicious code detection method, where the method includes:

selecting a first calling sequence and a second calling sequence from an Application Program Interface (API) sequence set, wherein the API sequence set comprises a plurality of calling sequences to be detected;

respectively calculating a normal risk value and a malicious risk value of the first calling sequence and a normal risk value and a malicious risk value of the second calling sequence, wherein the normal risk value represents the probability of malicious codes existing in the calling sequences; the malicious risk value represents the probability of no malicious code existing in the calling sequence;

labeling a target sequence according to a normal risk value of the first calling sequence, a malicious risk value of the first calling sequence, a normal risk value of the second calling sequence and a malicious risk value of the second calling sequence to obtain a labeling result, wherein the target sequence is the first calling sequence or the second calling sequence, and the labeling result comprises the existence of malicious codes or the absence of malicious codes;

and removing the target sequence from the API sequence set, taking the API sequence set without the target sequence as the next circulating API sequence set, and circularly detecting and labeling the calling sequences to be detected until all the calling sequences in the API sequence set are labeled.

In one embodiment, calculating the normal risk value and the malicious risk value for the first call sequence comprises:

acquiring a malicious sample set and a normal sample set, wherein the malicious sample set comprises a plurality of known malicious calling sequences containing malicious codes; the normal sample set comprises a plurality of normal calling sequences which are known not to contain malicious code;

respectively calculating the malicious similarity between the first calling sequence and the malicious calling sequence aiming at each malicious calling sequence in the malicious sample set; respectively calculating the normal similarity of the first calling sequence and the normal calling sequence aiming at each normal calling sequence in the normal sample set;

acquiring a malicious predicted value corresponding to each malicious calling sequence, a normal predicted value corresponding to each normal calling sequence and a predicted value of the first calling sequence;

calculating a malicious risk value of the first calling sequence according to the malicious predicted value, the malicious similarity and the predicted value of the first calling sequence; and calculating the normal risk value of the first calling sequence according to the normal predicted value, the normal similarity and the predicted value of the first calling sequence.

In one embodiment, the labeling the target sequence according to the normal risk value and the malicious risk value of the first call sequence and the normal risk value and the malicious risk value of the second call sequence to obtain a labeling result includes:

selecting a minimum risk value from the normal risk value of the first calling sequence, the malicious risk value of the first calling sequence, the normal risk value of the second calling sequence and the malicious risk value of the second calling sequence;

determining the calling sequence corresponding to the minimum risk value as a target sequence;

and labeling the target sequence according to the minimum risk value to obtain a labeling result.

In one embodiment, labeling the target sequence according to the minimum risk value to obtain a labeling result includes:

when the minimum risk value is the normal risk value of the first calling sequence or the normal risk value of the second calling sequence, marking that the target sequence has no malicious code;

and when the minimum risk value is the malicious risk value of the first calling sequence or the malicious risk value of the second calling sequence, marking that the target sequence has malicious codes.
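The minimum-risk labeling rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `RiskPair` and `label_target` are hypothetical, and the risk values are assumed to have been computed already.

```python
from typing import NamedTuple, Tuple


class RiskPair(NamedTuple):
    """Hypothetical container for the two risk values of one call sequence."""
    normal: float     # risk of labeling the sequence as normal
    malicious: float  # risk of labeling the sequence as malicious


def label_target(first: RiskPair, second: RiskPair) -> Tuple[str, str]:
    """Return (target sequence, label) for the smallest of the four risk values."""
    candidates = [
        (first.normal, "first", "no malicious code"),
        (first.malicious, "first", "malicious code"),
        (second.normal, "second", "no malicious code"),
        (second.malicious, "second", "malicious code"),
    ]
    # On ties, min() keeps the earliest candidate in the list above;
    # the source text does not say how ties should be broken.
    _, target, label = min(candidates, key=lambda c: c[0])
    return target, label
```

For example, `label_target(RiskPair(0.4, 0.6), RiskPair(0.7, 0.1))` picks the second sequence, because its malicious risk value 0.1 is the smallest of the four, and labels it as containing malicious code.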

In one embodiment, selecting the first call sequence and the second call sequence from the API sequence set includes:

respectively calculating the Hamming distance of each two call sequences to be detected in the API sequence set;

and selecting two calling sequences with the maximum Hamming distance as a first calling sequence and a second calling sequence.

In one embodiment, before selecting the first call sequence and the second call sequence from the API sequence set, the method further comprises:

running the received target file to be detected in the virtual sandbox, and acquiring a calling sequence corresponding to an API (application program interface) function of the target file;

and for each calling sequence, obtaining a feature vector of the calling sequence to form an API sequence set.

In one embodiment, the step of taking the API sequence set excluding the target sequence as the API sequence set of the next cycle includes:

stopping detection when the API sequence set of the next cycle comprises only one calling sequence to be detected;

and sending a detection instruction to the manual detection terminal, wherein the detection instruction is used for indicating manual detection of the call sequence to be detected in the API sequence set of the next cycle.
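Taken together, the select, label and remove cycle of the first aspect might be sketched as below. This is a sketch under stated assumptions: `label_all`, `select_pair` and `risk_of` are hypothetical names, the pair selection and risk calculation are passed in as callables because their details are described elsewhere, and call sequences are assumed to be distinct strings. The loop stops when a single sequence remains and returns it for manual detection, following the embodiment above.

```python
from typing import Callable, Dict, List, Optional, Tuple

Risk = Tuple[float, float]  # (normal risk value, malicious risk value)


def label_all(
    sequences: List[str],
    select_pair: Callable[[List[str]], Tuple[str, str]],
    risk_of: Callable[[str], Risk],
) -> Tuple[Dict[str, str], Optional[str]]:
    """Label sequences one per cycle; return (labels, sequence left for manual review)."""
    remaining = list(sequences)
    labels: Dict[str, str] = {}
    while len(remaining) > 1:
        first, second = select_pair(remaining)
        candidates = []
        for seq in (first, second):
            normal_risk, malicious_risk = risk_of(seq)
            candidates.append((normal_risk, seq, "no malicious code"))
            candidates.append((malicious_risk, seq, "malicious code"))
        # The sequence owning the minimum risk value becomes the target sequence.
        _, target, label = min(candidates, key=lambda c: c[0])
        labels[target] = label
        # The set without the target sequence is the set for the next cycle.
        remaining.remove(target)
    leftover = remaining[0] if remaining else None
    return labels, leftover
```

Each cycle labels exactly one sequence, so a set of n sequences is processed in n - 1 cycles before the last sequence is handed to manual detection.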

In a second aspect, an embodiment of the present application provides a malicious code detection apparatus, where the apparatus includes:

the calling sequence selection module is used for selecting a first calling sequence and a second calling sequence from an Application Program Interface (API) sequence set, wherein the API sequence set comprises a plurality of calling sequences to be detected;

the risk calculation module is used for calculating a normal risk value and a malicious risk value of the first calling sequence and a normal risk value and a malicious risk value of the second calling sequence respectively, wherein the normal risk value represents the probability of malicious codes existing in the calling sequences; the malicious risk value represents the probability of no malicious code existing in the calling sequence;

the marking module is used for marking the target sequence according to the normal risk value of the first calling sequence, the malicious risk value of the first calling sequence, the normal risk value of the second calling sequence and the malicious risk value of the second calling sequence to obtain a marking result, the target sequence is the first calling sequence or the second calling sequence, and the marking result comprises the existence of malicious codes or the absence of malicious codes;

and the cyclic processing module is used for removing the target sequence from the API sequence set, taking the API sequence set without the target sequence as the next cyclic API sequence set, and carrying out cyclic detection and labeling on the plurality of calling sequences to be detected until all calling sequences in the API sequence set are labeled.

In a third aspect, there is provided a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, performs the steps of the method of the first aspect described above.

In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of the first aspect described above.

The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:

a server (hereinafter, referred to as a server) of the network device may select a first call sequence and a second call sequence from an Application Programming Interface (API) sequence set, where the API sequence set includes a plurality of call sequences to be detected. The server can respectively calculate a normal risk value and a malicious risk value of the first calling sequence and a normal risk value and a malicious risk value of the second calling sequence, wherein the normal risk value represents the probability of malicious codes existing in the calling sequences; the malicious risk value represents the probability that malicious code is not present in the invocation sequence. The server can label the target sequence according to the normal risk value of the first calling sequence, the malicious risk value of the first calling sequence, the normal risk value of the second calling sequence and the malicious risk value of the second calling sequence to obtain a labeling result, wherein the target sequence is the first calling sequence or the second calling sequence, and the labeling result comprises the existence of malicious codes or the absence of malicious codes. The server can remove the target sequence from the API sequence set, the API sequence set without the target sequence is used as the next circulating API sequence set, and the cyclic detection and labeling are carried out on the plurality of calling sequences to be detected until all the calling sequences in the API sequence set are labeled. 
Therefore, in the embodiment of the application, the server of the network device marks all the calling sequences, so that whether malicious codes exist in each calling sequence of the multiple calling sequences to be detected or not can be determined, and in the process of determining whether the malicious codes exist in the target file, the calling sequence in which the malicious codes exist can be directly determined, and a user can directly process the calling sequences in which the malicious codes exist.

Drawings

Fig. 1 is a schematic diagram of an implementation environment of a malicious code detection method according to an embodiment of the present application;

Fig. 2 is a schematic diagram of another implementation environment of a malicious code detection method according to an embodiment of the present disclosure;

Fig. 3 is a flowchart of a malicious code detection method according to an embodiment of the present disclosure;

Fig. 4 is a flowchart of another malicious code detection method according to an embodiment of the present disclosure;

Fig. 5 is a flowchart of another malicious code detection method according to an embodiment of the present disclosure;

Fig. 6 is a flowchart of another malicious code detection method according to an embodiment of the present disclosure;

Fig. 7 is a block diagram of a malicious code detection apparatus according to an embodiment of the present disclosure.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

With the development of computer technology, network devices are more and more widely used, and the number of files transmitted between network devices has increased dramatically. Some of these files may be embedded with malicious code, which refers to computer code that is intentionally written or configured to pose a threat or potential threat to a network or system, such as computer viruses and Trojan horses. Malicious code can perform actions such as anonymous advertisement pushing, silent software downloading, and even fee theft; when a network device opens a file carrying malicious code, the network device may come under network attack. A network attack exploits vulnerabilities and security defects in a network information system to attack the information system and data resources of a network device. Specifically, a network attack can tamper with the permissions of the attacked network device in order to steal files, or can cause the attacked network device to deny service so that the user cannot use it normally, bringing huge losses to the user.

The prior art provides a malicious code detection method that examines a target file received by a network device. The method obtains dynamic action information of the target file to be detected, where the dynamic action information includes action information and access information generated after the target file runs, and determines that malicious code exists in the target file when the action information fails a security baseline check or the access information points to a core unit of the network device. However, this method cannot directly determine the location of the malicious code once the target file is found to contain it. The source code of the target file must therefore be examined again to locate the malicious code before it can be processed.

With that method, re-examining the source code of the target file to locate the malicious code requires extra time and labor, so the processing efficiency of the malicious code is low.

The malicious code detection method, the malicious code detection device, the computer equipment and the storage medium can improve the processing efficiency of the malicious code. In the method, a server (hereinafter referred to as a server) of the network device may select a first call sequence and a second call sequence from an Application Program Interface (API) sequence set, where the API sequence set includes a plurality of call sequences to be detected. The server can respectively calculate a normal risk value and a malicious risk value of the first calling sequence and a normal risk value and a malicious risk value of the second calling sequence, wherein the normal risk value represents the probability of malicious codes existing in the calling sequences; the malicious risk value represents the probability that malicious code is not present in the invocation sequence. The server can label the target sequence according to the normal risk value of the first calling sequence, the malicious risk value of the first calling sequence, the normal risk value of the second calling sequence and the malicious risk value of the second calling sequence to obtain a labeling result, wherein the target sequence is the first calling sequence or the second calling sequence, and the labeling result comprises the existence of malicious codes or the absence of malicious codes. The server can remove the target sequence from the API sequence set, the API sequence set without the target sequence is used as the next circulating API sequence set, and the cyclic detection and labeling are carried out on the plurality of calling sequences to be detected until all the calling sequences in the API sequence set are labeled. 
Therefore, in the embodiment of the application, the server of the network device marks all the calling sequences, so that whether malicious codes exist in each calling sequence of the multiple calling sequences to be detected or not can be determined, and in the process of determining whether the malicious codes exist in the target file, the calling sequence in which the malicious codes exist can be directly determined, and a user can directly process the calling sequences in which the malicious codes exist.

In the following, a brief description will be given of an implementation environment related to the malicious code detection method provided in the embodiment of the present application.

Referring to fig. 1, which is a schematic diagram of an implementation environment related to the malicious code detection method provided in an embodiment of the present application, the implementation environment includes a network device (shown as a computer in fig. 1). A malicious code detection program is installed on the server of the network device, and the server can call this program to implement the malicious code detection method provided in the embodiment of the present application.

Optionally, in this embodiment of the present application, the network device may be a router, a computer, a switch, and the like.

Referring to fig. 2, a server of a network device (hereinafter, referred to as a server) is provided, an internal structure of the server may be as shown in fig. 2, and the server includes a processor, a memory, a network interface, and a database connected through a system bus. Wherein the processor of the server is configured to provide computing and control capabilities. The memory of the server comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the server is used for storing a malicious sample set and a normal sample set, wherein the malicious sample set comprises a plurality of known malicious calling sequences containing malicious codes; the normal sample set includes a plurality of normal call sequences that are known to contain no malicious code. The network interface of the server is used for communicating with an external terminal through network connection. The computer program is executed by a processor to implement a malicious code detection method.

The structure shown in fig. 2 is a block diagram of only the portion of the structure associated with the present application and does not limit the network devices to which the present application is applied. A particular network device may include more or fewer components than those shown in fig. 2, combine certain components, or have a different arrangement of components.

Referring to fig. 3, a flowchart of a malicious code detection method provided by an embodiment of the present application is shown, where the malicious code detection method may be applied to the server shown in fig. 2. As shown in fig. 3, the malicious code detection method may include the steps of:

step 301, the server selects a first calling sequence and a second calling sequence from the API sequence set.

In this embodiment of the application, the API sequence set includes a plurality of call sequences to be detected, and the first call sequence and the second call sequence may be two call sequences of the plurality of call sequences to be detected.

In an alternative implementation, as shown in fig. 4, before the server selects the first call sequence and the second call sequence from the API sequence set, the following steps 401 to 402 are further included:

Step 401, after the network device receives the target file, the server may run the received target file to be detected in the virtual sandbox to obtain a call sequence corresponding to the API function of the target file.

The virtual sandbox is a virtual system program that runs the target file in a virtual environment and can delete the changes produced by running it. Through a redirection technique, the virtual sandbox directs the files generated and modified by running the target file to a folder of the virtual sandbox, which prevents the target file from modifying local system files. In this way, an attack on the local system by malicious code that may be present in the target file can be avoided.

An API call sequence obtained by the server is a combination of API calls, formed by ordering a plurality of API calls according to their predecessor and successor dependency relationships.

In the embodiment of the application, when an external party sends the target file to the network device over a network protocol, the server places the received target file in the virtual sandbox and runs it. During the run, the server can obtain the static action information of the target file, obtain the source code of the target file according to the static action information, and extract the API call sequence from the source code of the target file.

In this embodiment, the static action information includes the MD5 (MD5 Message-Digest Algorithm) value of the target file. The process by which the server obtains the source code of the target file according to the static action information may be as follows: the server decides whether to call an unpacking tool according to the MD5 value of the target file. When the MD5 value of the target file is greater than a threshold, the unpacking tool is used to obtain the source code of the target file; when the MD5 value is less than or equal to the threshold, no unpacking tool is needed. Note that unpacking (removing a shell) is the inverse operation of packing (adding a shell to software). Packing refers to wrapping written software in a program that is specifically responsible for protecting the software from illegal modification or decompilation.

Step 402, for each calling sequence, the server obtains the feature vector of the calling sequence to form an API sequence set.

In the embodiment of the application, the call sequence corresponding to the API function can be processed through the locality-sensitive hashing algorithm sim-hash to obtain the feature vector H_i of the binary API call sequence. The feature vectors of multiple call sequences may constitute the API sequence set.
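A sim-hash style fingerprint of an API call sequence could be sketched as follows. The source does not specify the tokenization, the per-token hash function, the weighting or the vector length, so everything here is an assumption: each API name is one token with weight 1, SHA-256 is the per-token hash, and the result is a 64-bit string. The function name `simhash` is hypothetical.

```python
import hashlib
from typing import Iterable


def simhash(api_calls: Iterable[str], bits: int = 64) -> str:
    """Return a binary feature vector (string of '0'/'1') for a sequence of API names."""
    weights = [0] * bits
    for name in api_calls:
        digest = hashlib.sha256(name.encode("utf-8")).digest()
        # Keep only the lowest `bits` bits of the token hash (assumed width).
        value = int.from_bytes(digest, "big") & ((1 << bits) - 1)
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (value >> i) & 1 else -1
    # A bit of the fingerprint is 1 when the positive votes win.
    return "".join("1" if w > 0 else "0" for w in weights)


fingerprint = simhash(["CreateFileW", "WriteFile", "CloseHandle"])
```

Because similar token sets produce fingerprints that differ in few bit positions, the resulting vectors are suitable for the Hamming-distance comparisons used in the later steps.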

In an alternative implementation manner, in the embodiment of the present application, two call sequences may be arbitrarily selected from the API sequence set as the first call sequence and the second call sequence.

In an alternative implementation, in order to increase the difference between the first call sequence and the second call sequence, so as to distinguish the first call sequence from the second call sequence, the process of the server selecting the first call sequence and the second call sequence from the API sequence set may include the following steps B1-B2:

Step B1, the server calculates the Hamming distance between every two call sequences to be detected in the API sequence set.

The Hamming distance is the number of positions at which two codewords of the same length differ. For example, if codeword A is 10001001 and codeword B is 10110001, the two codewords differ in 3 positions, so the Hamming distance between codeword A and codeword B is 3.
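The codeword example above can be checked with a few lines of code. This is a minimal sketch that assumes codewords are represented as strings of '0' and '1'; the function name `hamming_distance` is chosen for illustration.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length codewords differ."""
    if len(a) != len(b):
        raise ValueError("codewords must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Codeword A and codeword B from the example differ in positions 3, 4 and 5.
print(hamming_distance("10001001", "10110001"))  # 3
```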

The server can calculate the hamming distance of any two call sequences in the API sequence set.

Optionally, formula (1) may be used to calculate the hamming distance between every two call sequences in the API sequence set:

In formula (1), y_r is the bit value corresponding to one call sequence in the API sequence set, z_r is the bit value corresponding to another call sequence in the API sequence set, D_ham(y, z) is the Hamming distance, r is the number of groups in the API sequence set, grouped two by two, and m is the sample capacity.
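The expression of formula (1) is not reproduced in this text. Under the usual definition of the Hamming distance, and assuming that r indexes bit positions from 1 to the vector length m (an assumption, since the symbol descriptions above word r and m differently), one reading consistent with the symbols y_r, z_r and D_ham(y, z) would be:

```latex
D_{ham}(y, z) = \sum_{r=1}^{m} \left( y_r \oplus z_r \right)
```

Here \oplus denotes exclusive OR, so each term contributes 1 exactly when the two bit values differ.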

Step B2, the server selects the two call sequences with the maximum Hamming distance as the first call sequence and the second call sequence.

The larger the hamming distance is, the lower the similarity between two code words is, and the smaller the hamming distance is, the higher the similarity between two code words is.

In the embodiment of the application, two calling sequences with the largest Hamming distance are selected, namely two calling sequences with the lowest similarity are selected as a first calling sequence and a second calling sequence respectively.
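Steps B1 and B2 can be sketched as an exhaustive search over all pairs. This is a minimal illustration that assumes the feature vectors are equal-length bit strings; `select_most_different` is a hypothetical name.

```python
from itertools import combinations
from typing import List, Tuple


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def select_most_different(vectors: List[str]) -> Tuple[str, str]:
    """Step B1: compare every pair. Step B2: keep the pair with the largest distance."""
    if len(vectors) < 2:
        raise ValueError("at least two call sequences are required")
    return max(combinations(vectors, 2), key=lambda pair: hamming_distance(*pair))


first, second = select_most_different(["0000", "0011", "1111", "0111"])
print(first, second)  # 0000 1111
```

The search compares n(n - 1)/2 pairs for n sequences, which is acceptable for a sketch; the source does not state how the comparison is organized in practice.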

Step 302, the server calculates a normal risk value and a malicious risk value of the first calling sequence and a normal risk value and a malicious risk value of the second calling sequence respectively.

The normal risk value represents the probability that malicious code exists in the call sequence, and the malicious risk value represents the probability that malicious code is not present in the call sequence. In other words, each value measures the risk of assigning the corresponding label. The larger the normal risk value of a call sequence, the less likely the call sequence is a normal sequence; the smaller the normal risk value, the more likely the call sequence is a normal sequence. Likewise, the larger the malicious risk value of a call sequence, the less likely the call sequence is a malicious sequence; the smaller the malicious risk value, the more likely the call sequence is a malicious sequence.

In the embodiment of the application, the server may calculate the respective normal risk value and the malicious risk value for the first call sequence and the second call sequence respectively. In an alternative implementation manner, taking the first call sequence as an example, in this embodiment of the application, as shown in fig. 5, a process of the server calculating a normal risk value and a malicious risk value of the first call sequence may include the following steps:

step 501, the server obtains a malicious sample set and a normal sample set.

In the embodiment of the application, Advanced Persistent Threat (APT) groups can be tracked, and the main types of malicious code, such as Backdoor, Trojan, Virus and Worm, can be collected through malicious code sharing websites such as VXHeavens and Malshare. The malicious call sequences corresponding to the API functions of the malicious code are obtained, and the known malicious call sequences containing malicious code are processed through the sim-hash algorithm to obtain the feature vectors H_i-' of the binary malicious call sequences. The feature vectors H_i-' of multiple malicious call sequences are combined to form the malicious sample set.

Meanwhile, in the embodiment of the application, the server can acquire the normal call sequences corresponding to the API functions of known normal code that contains no malicious code, and process these known normal call sequences through the sim-hash algorithm to obtain the feature vectors H_i+' of the binary normal call sequences. The feature vectors H_i+' of multiple normal call sequences are combined to form the normal sample set.

Step 502, aiming at each malicious calling sequence in the malicious sample set, the server respectively calculates the malicious similarity between the first calling sequence and the malicious calling sequence; and aiming at each normal calling sequence in the normal sample set, the server respectively calculates the normal similarity of the first calling sequence and the normal calling sequence.

In the embodiment of the present application, formula (2) may be adopted to calculate the similarity between the first call sequence and each malicious call sequence and each normal call sequence. In the embodiment of the present application, for convenience of distinguishing, a similarity between the first call sequence and the malicious call sequence is referred to as a malicious similarity, and a similarity between the first call sequence and the normal sequence is referred to as a normal similarity.

Wherein, sim (H)_i,H_i') is a similarity measure, y_rA bit value, z, corresponding to a malicious (or normal) calling sequence in the malicious sample set (or normal sample set)_rAnd the bit value corresponding to the first calling sequence is r is the number of groups grouped in pairs in the API sequence set, and m is the sample capacity.

Optionally, in order to distinguish the malicious sample set from the normal sample set, in this embodiment of the present application, H may be used_i+' denotes a normal call sequence corresponding to a normal sample set, H_i-' denotes a malicious call sequence corresponding to a malicious sample set. Then, the malicious similarity of the first call sequence to the malicious call sequence can be expressed as: sim (H)_i,H_i-'), the normal similarity of the first call sequence to the normal call sequence can be expressed as: sim (H)_i,H_i+')。

For example, in the embodiment of the present application, assume that the normal sample set includes 5 feature vectors H_i+', denoted L1, L2, L3, L4 and L5, respectively, and that the malicious sample set includes 5 feature vectors H_i-', denoted L6, L7, L8, L9 and L10, respectively. If the first call sequence is denoted A1, the server can calculate the normal similarity between A1 and each of L1, L2, L3, L4 and L5, hereinafter denoted A1L1, A1L2, A1L3, A1L4 and A1L5. Accordingly, the malicious similarity between the first call sequence and each malicious call sequence may be denoted A1L6, A1L7, A1L8, A1L9 and A1L10.
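Formula (2) is not reproduced in this text. As a hedged illustration only, the sketch below assumes that the similarity is the fraction of bit positions at which two binary feature vectors agree; the actual formula may differ. The names `bit_similarity`, `similarities`, `a1` and `normal_samples` are hypothetical.

```python
def bit_similarity(z, y):
    """Fraction of positions where two equal-length bit vectors agree (assumed measure)."""
    if len(z) != len(y) or not z:
        raise ValueError("vectors must be non-empty and of equal length")
    matches = sum(1 for z_r, y_r in zip(z, y) if z_r == y_r)
    return matches / len(z)

def similarities(first, sample_set):
    """Similarity of the first call sequence to every sequence in a sample set."""
    return [bit_similarity(first, sample) for sample in sample_set]

# Toy 8-bit vectors standing in for A1 and three normal samples.
a1 = [1, 0, 1, 1, 0, 0, 1, 0]
normal_samples = [
    [1, 0, 1, 1, 0, 0, 1, 0],  # identical to a1
    [1, 0, 1, 1, 0, 0, 0, 1],  # differs from a1 in the last two bits
    [0, 1, 0, 0, 1, 1, 0, 1],  # bitwise complement of a1
]
normal_sims = similarities(a1, normal_samples)
```

The same helper would be applied to the malicious sample set to obtain the malicious similarities.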

Step 503, the server obtains a malicious predicted value corresponding to each malicious calling sequence, a normal predicted value corresponding to each normal calling sequence, and a predicted value of the first calling sequence.

In the embodiment of the present application, a classifier C using a random forest algorithm is established, and the malicious sample set and the normal sample set are respectively input into the classifier C for training to obtain the prediction result C_i of the classifier C. The result of the classifier classifying a malicious call sequence in the malicious sample set may be denoted C_i-, and the result of the classifier classifying a normal call sequence in the normal sample set may be denoted C_i+. The prediction result represents the probability that no malicious code exists in the call sequence or the probability that malicious code exists in the call sequence.

Continuing the above example, classifying L1, L2, L3, L4 and L5 in the normal sample set yields five prediction results, denoted L1C_i+, L2C_i+, L3C_i+, L4C_i+ and L5C_i+, respectively.

Classifying L6, L7, L8, L9 and L10 in the malicious sample set likewise yields five prediction results, denoted L6C_i-, L7C_i-, L8C_i-, L9C_i- and L10C_i-, respectively.

Meanwhile, the server may also input the first call sequence A1 into the classifier to obtain the prediction result of A1, denoted A1C_i'.

Step 504, the server calculates a malicious risk value of the first calling sequence according to the malicious predicted value, the malicious similarity and the predicted value of the first calling sequence; and calculating the normal risk value of the first calling sequence according to the normal predicted value, the normal similarity and the predicted value of the first calling sequence.

In the embodiment of the application, the server may calculate the normal risk value of the first call sequence according to formula (3), and calculate the malicious risk value of the first call sequence according to formula (4).

R_S+ = ∑ (C_i+ - C_i)² / sim(H_i, H_i')    formula (3)

R_S- = ∑ (C_i- - C_i)² / sim(H_i, H_i')    formula (4)

Continuing the above example, the normal risk value of the first call sequence is denoted A1R_S+ and its malicious risk value is denoted A1R_S-; each is obtained by substituting the similarities and predicted values of the example into formula (3) and formula (4), respectively.
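The expanded expressions are not reproduced in this text. Assuming that the sums in formulas (3) and (4) run over the five samples of each set, and using the notation of the example, a plausible expansion would be:

```latex
\begin{aligned}
\mathrm{A1R}_{S+} &= \frac{(\mathrm{L1C}_{i+} - \mathrm{A1C}_i')^2}{\mathrm{A1L1}}
  + \frac{(\mathrm{L2C}_{i+} - \mathrm{A1C}_i')^2}{\mathrm{A1L2}}
  + \cdots
  + \frac{(\mathrm{L5C}_{i+} - \mathrm{A1C}_i')^2}{\mathrm{A1L5}}, \\
\mathrm{A1R}_{S-} &= \frac{(\mathrm{L6C}_{i-} - \mathrm{A1C}_i')^2}{\mathrm{A1L6}}
  + \frac{(\mathrm{L7C}_{i-} - \mathrm{A1C}_i')^2}{\mathrm{A1L7}}
  + \cdots
  + \frac{(\mathrm{L10C}_{i-} - \mathrm{A1C}_i')^2}{\mathrm{A1L10}}.
\end{aligned}
```

Here each term pairs the predicted value of one sample with the similarity between A1 and that same sample.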

Based on the same principle as steps 501 to 504, in the embodiment of the present application, the server may calculate the normal risk value and the malicious risk value of the second call sequence, denoted A2R_S+ and A2R_S-, respectively.
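The calculation in formulas (3) and (4) can be sketched as follows. This is a minimal illustration that reads each formula as a sum of per-sample terms; the function name `risk_value`, the `eps` guard against a zero similarity, and all numeric inputs are assumptions and do not come from the application.

```python
def risk_value(sample_predictions, similarities, first_prediction, eps=1e-9):
    """Sum of squared prediction differences, each divided by the matching similarity."""
    if len(sample_predictions) != len(similarities):
        raise ValueError("one similarity is required per sample prediction")
    total = 0.0
    for c_sample, sim in zip(sample_predictions, similarities):
        # eps avoids division by zero when a similarity is 0 (an added assumption).
        total += (c_sample - first_prediction) ** 2 / max(sim, eps)
    return total

# Hypothetical classifier outputs and similarities for one call sequence.
normal_risk = risk_value([0.9, 0.8, 0.7], [0.5, 0.5, 0.25], first_prediction=0.8)
malicious_risk = risk_value([0.2, 0.1, 0.3], [0.5, 0.25, 0.5], first_prediction=0.8)
```

With these made-up inputs the predicted value of the sequence lies close to the normal samples, so the normal risk value comes out far smaller than the malicious one. The same function would be called again with the inputs of the second call sequence.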

Step 303, the server labels the target sequence according to the normal risk value of the first calling sequence, the malicious risk value of the first calling sequence, the normal risk value of the second calling sequence and the malicious risk value of the second calling sequence, and obtains a labeling result.

The target sequence is a first calling sequence or a second calling sequence, and the labeling result comprises the existence of malicious codes or the absence of the malicious codes.

In an alternative implementation manner, as shown in fig. 6, the process of labeling the target sequence by the server to obtain the labeling result may include the following steps:

Step 601, the server may select a minimum risk value from the normal risk value of the first call sequence, the malicious risk value of the first call sequence, the normal risk value of the second call sequence, and the malicious risk value of the second call sequence.

Continuing the above example, the normal risk value of the first call sequence is A1R_S+, the malicious risk value of the first call sequence is A1R_S-, the normal risk value of the second call sequence is A2R_S+, and the malicious risk value of the second call sequence is A2R_S-. The minimum risk value is selected from A1R_S+, A1R_S-, A2R_S+ and A2R_S-.

For example, A2R_S- is the minimum risk value.

In step 602, the server may determine the call sequence corresponding to the minimum risk value as the target sequence.

The call sequence corresponding to A2R_S- is the second call sequence; that is, the second call sequence is the target sequence.

Step 603, the server labels the target sequence according to the minimum risk value to obtain a labeling result.

In the embodiment of the present application, when the minimum risk value is the normal risk value of the first call sequence or the normal risk value of the second call sequence, the labeling result is that the target sequence does not have malicious code; when the minimum risk value is the malicious risk value of the first call sequence or the malicious risk value of the second call sequence, the labeling result is that the target sequence has malicious code.

In this example, the minimum risk value A2R_S- is the malicious risk value of the second call sequence, so the labeling result is that malicious code exists in the second call sequence.
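Steps 601 to 603 can be sketched as a single selection over the four risk values. The function name `label_target`, the dictionary layout, the label strings and the numeric values are hypothetical and chosen only so that A2R_S- is the minimum, as in the example.

```python
def label_target(risks):
    """risks maps (sequence_id, kind) -> risk value, with kind "normal" or "malicious".

    Returns (target_sequence_id, label) for the entry with the smallest risk value.
    """
    (target, kind), _ = min(risks.items(), key=lambda item: item[1])
    label = "no malicious code" if kind == "normal" else "malicious code present"
    return target, label

risks = {
    ("A1", "normal"): 0.42,
    ("A1", "malicious"): 0.37,
    ("A2", "normal"): 0.55,
    ("A2", "malicious"): 0.08,  # smallest value, mirroring the example
}
result = label_target(risks)
```

If two risk values tie, `min` returns the first one encountered; the application does not state how ties are resolved, so that behaviour is an artifact of this sketch.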

Step 304, the server removes the target sequence from the API sequence set, takes the API sequence set with the target sequence removed as the API sequence set of the next cycle, and cyclically detects and labels the plurality of call sequences to be detected until all call sequences in the API sequence set are labeled.

In the embodiment of the application, the server marks the second calling sequence A2 as the existence of malicious code and removes the second calling sequence A2 from the API sequence set.

For example, the API sequence set includes the call sequences A1 to A10 to be detected; after A2 is removed, the API sequence set with the target sequence removed includes A1 and A3 to A10. This reduced set is taken as the API sequence set of the next cycle, the server selects a new first call sequence and a new second call sequence from A1 and A3 to A10, and the above steps are performed cyclically so that each call sequence in the API sequence set is labeled.

Taking the API sequence set with the target sequence removed as the API sequence set of the next cycle further includes:

stopping detection when the API sequence set of the next cycle includes only one call sequence to be detected.

When only one call sequence to be detected is left in the API sequence set after the target sequence is removed, that set cannot satisfy the condition of selecting a first call sequence and a second call sequence in the next cycle, so the cyclic detection process cannot continue. Therefore, when the server detects that the API sequence set with the target sequence removed includes only one call sequence to be detected, it stops detection.

The server may then send a detection instruction to the manual detection terminal.

The detection instruction is used to instruct manual detection of the call sequence to be detected that is included in the API sequence set of the next cycle.

That is, the server may send the detection instruction, together with the code related to the last call sequence to be detected, to the manual detection terminal. A staff member can then manually detect and label the last call sequence to be detected to obtain a labeling result, which is either that malicious code exists or that malicious code does not exist. The manual detection terminal may feed the labeling result back to the server.
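The cyclic processing of step 304, including the stop condition when a single call sequence remains, can be sketched as a loop. The function name `label_all` and the two callbacks `select_pair` and `label_pair` are hypothetical placeholders for the selection and labeling steps; the stub callbacks in the usage lines exist only to exercise the loop.

```python
def label_all(sequence_ids, select_pair, label_pair):
    """Repeatedly pick a pair, label one member, and remove it from the pool.

    select_pair(pool) -> (first_id, second_id)
    label_pair(first_id, second_id) -> (target_id, label), target_id being one of the two.
    Returns (labels, leftover); leftover is the single id left for manual detection, or None.
    """
    pool = list(sequence_ids)
    labels = {}
    while len(pool) >= 2:
        first, second = select_pair(pool)
        target, label = label_pair(first, second)
        labels[target] = label
        pool.remove(target)  # the reduced pool becomes the set for the next cycle
    leftover = pool[0] if pool else None
    return labels, leftover

# Stub callbacks for demonstration only.
labels, leftover = label_all(
    ["A1", "A2", "A3"],
    select_pair=lambda pool: (pool[0], pool[1]),
    label_pair=lambda a, b: (b, "malicious code present"),
)
```

Each cycle removes exactly one call sequence, so a set of n sequences is processed in n - 1 cycles, and the remaining sequence is the one that would be handed to the manual detection terminal.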

In the malicious code detection method provided by the embodiment of the present application, a server of the network device (hereinafter referred to as the server) may select a first call sequence and a second call sequence from an Application Programming Interface (API) sequence set, where the API sequence set includes a plurality of call sequences to be detected. The server may respectively calculate a normal risk value and a malicious risk value of the first call sequence and a normal risk value and a malicious risk value of the second call sequence, where the normal risk value represents the probability that malicious code exists in the call sequence, and the malicious risk value represents the probability that malicious code does not exist in the call sequence. The server may label a target sequence according to the normal risk value of the first call sequence, the malicious risk value of the first call sequence, the normal risk value of the second call sequence and the malicious risk value of the second call sequence to obtain a labeling result, where the target sequence is the first call sequence or the second call sequence, and the labeling result is either that malicious code exists or that malicious code does not exist. The server may remove the target sequence from the API sequence set, take the API sequence set with the target sequence removed as the API sequence set of the next cycle, and cyclically detect and label the plurality of call sequences to be detected until all call sequences in the API sequence set are labeled.
Therefore, in the embodiment of the application, the server of the network device marks all the calling sequences, so that whether malicious codes exist in each calling sequence of the multiple calling sequences to be detected or not can be determined, and in the process of determining whether the malicious codes exist in the target file, the calling sequence in which the malicious codes exist can be directly determined, and a user can directly process the calling sequences in which the malicious codes exist.

Furthermore, in the embodiment of the application, whether the target file has the malicious codes or not can be accurately and efficiently detected, and the calling sequence with the malicious codes is determined, so that the analysis, tracking and positioning capabilities of a user on the target file with the malicious codes are greatly improved. The method is greatly helpful for the user to track the identity of the APT attacker.

Referring to fig. 7, a block diagram of a malicious code detection apparatus provided in an embodiment of the present application is shown, where the malicious code detection apparatus may be configured in a server in the implementation environment shown in fig. 2. As shown in fig. 7, the malicious code detection apparatus may include a call sequence selection module 701, a risk calculation module 702, a labeling module 703, and a loop processing module 704, where:

a calling sequence selection module 701, configured to select a first calling sequence and a second calling sequence from an application program interface API sequence set, where the API sequence set includes multiple calling sequences to be detected;

a risk calculation module 702, configured to calculate a normal risk value and a malicious risk value of the first call sequence, and a normal risk value and a malicious risk value of the second call sequence, respectively, where the normal risk value indicates a probability that a malicious code exists in the call sequence; the malicious risk value represents the probability of no malicious code existing in the calling sequence;

the labeling module 703 is configured to label a target sequence according to a normal risk value of the first call sequence, a malicious risk value of the first call sequence, a normal risk value of the second call sequence, and a malicious risk value of the second call sequence, to obtain a labeling result, where the target sequence is the first call sequence or the second call sequence, and the labeling result includes existence of a malicious code or absence of a malicious code;

and the cyclic processing module 704 is configured to remove the target sequence from the API sequence set, use the API sequence set from which the target sequence is removed as an API sequence set of the next cycle, and cyclically detect and label the multiple call sequences to be detected until all call sequences in the API sequence set are labeled.

In an embodiment of the present application, the risk calculation module 702 is further configured to obtain a malicious sample set and a normal sample set, where the malicious sample set includes a plurality of malicious call sequences known to contain malicious code; the normal sample set comprises a plurality of normal calling sequences which are known not to contain malicious code; respectively calculating the malicious similarity between the first calling sequence and the malicious calling sequence aiming at each malicious calling sequence in the malicious sample set; respectively calculating the normal similarity of the first calling sequence and the normal calling sequence aiming at each normal calling sequence in the normal sample set; acquiring a malicious predicted value corresponding to each malicious calling sequence, a normal predicted value corresponding to each normal calling sequence and a predicted value of the first calling sequence; calculating a malicious risk value of the first calling sequence according to the malicious predicted value, the malicious similarity and the predicted value of the first calling sequence; and calculating the normal risk value of the first calling sequence according to the normal predicted value, the normal similarity and the predicted value of the first calling sequence.

In an embodiment of the present application, the labeling module 703 is further configured to select a minimum risk value from the normal risk value of the first call sequence, the malicious risk value of the first call sequence, the normal risk value of the second call sequence, and the malicious risk value of the second call sequence; determining the calling sequence corresponding to the minimum risk value as a target sequence; and labeling the target sequence according to the minimum risk value to obtain a labeling result.

In an embodiment of the present application, the labeling module 703 is further configured to label that the target sequence does not have a malicious code when the minimum risk value is a normal risk value of the first call sequence or a normal risk value of the second call sequence; and when the minimum risk value is the malicious risk value of the first calling sequence or the malicious risk value of the second calling sequence, marking that the target sequence has malicious codes.

In an embodiment of the present application, the calling sequence selecting module 701 is further configured to calculate hamming distances of every two calling sequences to be detected in the API sequence set respectively; and selecting two calling sequences with the maximum Hamming distance as a first calling sequence and a second calling sequence.
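The pair selection performed by the calling sequence selection module can be sketched as follows, assuming that each call sequence is represented by an equal-length bit vector. The names `hamming_distance`, `select_farthest_pair` and the toy vectors are hypothetical.

```python
from itertools import combinations

def hamming_distance(u, v):
    """Number of positions at which two equal-length bit vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have equal length")
    return sum(1 for a, b in zip(u, v) if a != b)

def select_farthest_pair(vectors):
    """vectors maps sequence id -> bit vector; returns the id pair with the largest distance."""
    if len(vectors) < 2:
        raise ValueError("at least two call sequences are required")
    return max(
        combinations(vectors, 2),
        key=lambda pair: hamming_distance(vectors[pair[0]], vectors[pair[1]]),
    )

vectors = {
    "A1": [0, 0, 0, 0],
    "A2": [1, 1, 1, 1],
    "A3": [0, 0, 1, 1],
}
pair = select_farthest_pair(vectors)
```

Comparing every pair costs on the order of n² distance computations per cycle, which is acceptable for a sketch but may matter for large API sequence sets.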

In an embodiment of the present application, the calling sequence selecting module 701 is further configured to run the received target file to be detected in the virtual sandbox, and obtain a calling sequence corresponding to an API function of the target file; and for each calling sequence, obtaining a feature vector of the calling sequence to form an API sequence set.

In an embodiment of the present application, the loop processing module 704 is further configured to stop detecting when the API sequence set of the next loop includes only one call sequence to be detected; and sending a detection instruction to the manual detection terminal, wherein the detection instruction is used for indicating manual detection of the call sequence to be detected in the API sequence set of the next cycle.

In one embodiment of the present application, there is provided a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:

selecting a first calling sequence and a second calling sequence from an Application Program Interface (API) sequence set, wherein the API sequence set comprises a plurality of calling sequences to be detected; respectively calculating a normal risk value and a malicious risk value of the first calling sequence and a normal risk value and a malicious risk value of the second calling sequence, wherein the normal risk value represents the probability of malicious codes existing in the calling sequences; the malicious risk value represents the probability of no malicious code existing in the calling sequence; labeling a target sequence according to a normal risk value of the first calling sequence, a malicious risk value of the first calling sequence, a normal risk value of the second calling sequence and a malicious risk value of the second calling sequence to obtain a labeling result, wherein the target sequence is the first calling sequence or the second calling sequence, and the labeling result comprises the existence of malicious codes or the absence of malicious codes; and removing the target sequence from the API sequence set, taking the API sequence set without the target sequence as the next circulating API sequence set, and circularly detecting and labeling the calling sequences to be detected until all the calling sequences in the API sequence set are labeled.

In one embodiment of the application, the processor when executing the computer program may further implement the steps of: acquiring a malicious sample set and a normal sample set, wherein the malicious sample set comprises a plurality of known malicious calling sequences containing malicious codes; the normal sample set comprises a plurality of normal calling sequences which are known not to contain malicious code; respectively calculating the malicious similarity between the first calling sequence and the malicious calling sequence aiming at each malicious calling sequence in the malicious sample set; respectively calculating the normal similarity of the first calling sequence and the normal calling sequence aiming at each normal calling sequence in the normal sample set; acquiring a malicious predicted value corresponding to each malicious calling sequence, a normal predicted value corresponding to each normal calling sequence and a predicted value of the first calling sequence; calculating a malicious risk value of the first calling sequence according to the malicious predicted value, the malicious similarity and the predicted value of the first calling sequence; and calculating the normal risk value of the first calling sequence according to the normal predicted value, the normal similarity and the predicted value of the first calling sequence.

In one embodiment of the application, the processor when executing the computer program may further implement the steps of: selecting a minimum risk value from the normal risk value of the first calling sequence, the malicious risk value of the first calling sequence, the normal risk value of the second calling sequence and the malicious risk value of the second calling sequence; determining the calling sequence corresponding to the minimum risk value as a target sequence; and labeling the target sequence according to the minimum risk value to obtain a labeling result.

In one embodiment of the application, the processor when executing the computer program may further implement the steps of: when the minimum risk value is the normal risk value of the first calling sequence or the normal risk value of the second calling sequence, marking that the target sequence has no malicious code; and when the minimum risk value is the malicious risk value of the first calling sequence or the malicious risk value of the second calling sequence, marking that the target sequence has malicious codes.

In one embodiment of the application, the processor when executing the computer program may further implement the steps of: respectively calculating the Hamming distance of each two call sequences to be detected in the API sequence set; and selecting two calling sequences with the maximum Hamming distance as a first calling sequence and a second calling sequence.

In one embodiment of the application, the processor when executing the computer program may further implement the steps of: running the received target file to be detected in the virtual sandbox, and acquiring a calling sequence corresponding to an API (application program interface) function of the target file; and for each calling sequence, obtaining a feature vector of the calling sequence to form an API sequence set.

In one embodiment of the application, the processor when executing the computer program may further implement the steps of: stopping detection when the API sequence set circulated next time only comprises a calling sequence to be detected; and sending a detection instruction to the manual detection terminal, wherein the detection instruction is used for indicating manual detection of the call sequence to be detected in the API sequence set of the next cycle.

The implementation principle and technical effect of the computer device provided by the embodiment of the present application are similar to those of the method embodiment described above, and are not described herein again.

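In an embodiment of the present application, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the following steps:

selecting a first calling sequence and a second calling sequence from an Application Program Interface (API) sequence set, wherein the API sequence set comprises a plurality of calling sequences to be detected; respectively calculating a normal risk value and a malicious risk value of the first calling sequence and a normal risk value and a malicious risk value of the second calling sequence, wherein the normal risk value represents the probability of malicious codes existing in the calling sequences; the malicious risk value represents the probability of no malicious code existing in the calling sequence; labeling a target sequence according to a normal risk value of the first calling sequence, a malicious risk value of the first calling sequence, a normal risk value of the second calling sequence and a malicious risk value of the second calling sequence to obtain a labeling result, wherein the target sequence is the first calling sequence or the second calling sequence, and the labeling result comprises the existence of malicious codes or the absence of malicious codes; and removing the target sequence from the API sequence set, taking the API sequence set without the target sequence as the next circulating API sequence set, and circularly detecting and labeling the calling sequences to be detected until all the calling sequences in the API sequence set are labeled.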

In one embodiment of the application, the computer program, when executed by the processor, may further implement the steps of: acquiring a malicious sample set and a normal sample set, wherein the malicious sample set comprises a plurality of known malicious calling sequences containing malicious codes; the normal sample set comprises a plurality of normal calling sequences which are known not to contain malicious code; respectively calculating the malicious similarity between the first calling sequence and the malicious calling sequence aiming at each malicious calling sequence in the malicious sample set; respectively calculating the normal similarity of the first calling sequence and the normal calling sequence aiming at each normal calling sequence in the normal sample set; acquiring a malicious predicted value corresponding to each malicious calling sequence, a normal predicted value corresponding to each normal calling sequence and a predicted value of the first calling sequence; calculating a malicious risk value of the first calling sequence according to the malicious predicted value, the malicious similarity and the predicted value of the first calling sequence; and calculating the normal risk value of the first calling sequence according to the normal predicted value, the normal similarity and the predicted value of the first calling sequence.

In one embodiment of the application, the computer program, when executed by the processor, may further implement the steps of: selecting a minimum risk value from the normal risk value of the first calling sequence, the malicious risk value of the first calling sequence, the normal risk value of the second calling sequence and the malicious risk value of the second calling sequence; determining the calling sequence corresponding to the minimum risk value as a target sequence; and labeling the target sequence according to the minimum risk value to obtain a labeling result.

In one embodiment of the application, the computer program, when executed by the processor, may further implement the steps of: when the minimum risk value is the normal risk value of the first calling sequence or the normal risk value of the second calling sequence, marking that the target sequence has no malicious code; and when the minimum risk value is the malicious risk value of the first calling sequence or the malicious risk value of the second calling sequence, marking that the target sequence has malicious codes.

In one embodiment of the application, the computer program, when executed by the processor, may further implement the steps of: respectively calculating the Hamming distance of each two call sequences to be detected in the API sequence set; and selecting two calling sequences with the maximum Hamming distance as a first calling sequence and a second calling sequence.

In one embodiment of the application, the computer program, when executed by the processor, may further implement the steps of: running the received target file to be detected in the virtual sandbox, and acquiring a calling sequence corresponding to an API (application program interface) function of the target file; and for each calling sequence, obtaining a feature vector of the calling sequence to form an API sequence set.

In one embodiment of the application, the computer program, when executed by the processor, may further implement the steps of: stopping detection when the API sequence set circulated next time only comprises a calling sequence to be detected; and sending a detection instruction to the manual detection terminal, wherein the detection instruction is used for indicating manual detection of the call sequence to be detected in the API sequence set of the next cycle.

The implementation principle and technical effect of the computer-readable storage medium provided in the embodiment of the present application are similar to those of the method embodiment described above, and are not described herein again.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the claims. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for malicious code detection, the method comprising:

acquiring static action information of a target file, wherein the static action information comprises an MD5 value of the target file, judging whether the MD5 value of the target file is greater than a threshold value, and if the MD5 value of the target file is greater than the threshold value, acquiring a source code of the target file by adopting a shelling tool; if the MD5 value of the target file is less than or equal to the threshold value, directly acquiring a source code of the target file;

extracting an API calling sequence from the source code of the target file according to the source code of the target file to obtain a plurality of API calling sequences;

acquiring a characteristic vector of each API calling sequence, and acquiring an API sequence set according to the characteristic vector of each API calling sequence; the API sequence set comprises a plurality of call sequences to be detected;

selecting a first calling sequence and a second calling sequence from the API sequence set;

respectively calculating a normal risk value and a malicious risk value of the first calling sequence, and a normal risk value and a malicious risk value of the second calling sequence, wherein the normal risk value represents the probability of malicious codes existing in the calling sequences; the malicious risk value represents the probability of no malicious code existing in the calling sequence;

labeling a target sequence according to the normal risk value of the first calling sequence, the malicious risk value of the first calling sequence, the normal risk value of the second calling sequence and the malicious risk value of the second calling sequence to obtain a labeling result, wherein the target sequence is the first calling sequence or the second calling sequence, and the labeling result comprises the existence of malicious codes or the absence of malicious codes;

and removing the target sequence from the API sequence set, taking the API sequence set without the target sequence as an API sequence set of the next cycle, and performing cyclic detection and labeling on the plurality of calling sequences to be detected until all calling sequences in the API sequence set are labeled.

2. The method of claim 1, wherein the calculating the normal risk value and the malicious risk value for the first sequence of calls comprises:

acquiring a malicious sample set and a normal sample set, wherein the malicious sample set comprises a plurality of known malicious calling sequences containing malicious codes; the normal sample set comprises a plurality of normal call sequences known to be free of malicious code;

respectively calculating the malicious similarity of the first calling sequence and the malicious calling sequence aiming at each malicious calling sequence in the malicious sample set; respectively calculating the normal similarity of the first calling sequence and the normal calling sequence aiming at each normal calling sequence in the normal sample set;

calculating the malicious risk value of the first calling sequence according to the malicious predicted value, the malicious similarity and the predicted value of the first calling sequence; and calculating the normal risk value of the first calling sequence according to the normal predicted value, the normal similarity and the predicted value of the first calling sequence.

3. The method according to claim 1, wherein the labeling a target sequence according to the normal risk value and the malicious risk value of the first call sequence and the normal risk value and the malicious risk value of the second call sequence to obtain a labeling result comprises:

selecting a minimum risk value from the normal risk value of the first call sequence, the malicious risk value of the first call sequence, the normal risk value of the second call sequence and the malicious risk value of the second call sequence;

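determining the calling sequence corresponding to the minimum risk value as the target sequence; and

labeling the target sequence according to the minimum risk value to obtain the labeling result.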

4. The method according to claim 3, wherein the labeling of the target sequence according to the minimum risk value to obtain a labeling result comprises:

when the minimum risk value is the normal risk value of the first calling sequence or the normal risk value of the second calling sequence, the labeling result indicates that no malicious code exists in the target sequence;

and when the minimum risk value is the malicious risk value of the first calling sequence or the malicious risk value of the second calling sequence, the labeling result indicates that malicious code exists in the target sequence.
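The selection of the minimum risk value and the resulting label can be sketched in Python as follows. The Risks type, the function name label_target and the label strings are hypothetical names chosen for illustration; when two risk values tie, this sketch simply keeps the first candidate, which the claims do not specify.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Risks:
    normal: float      # normal risk value of a calling sequence
    malicious: float   # malicious risk value of a calling sequence


def label_target(first: Risks, second: Risks) -> Tuple[str, str]:
    """Return (target sequence, labeling result) for the smallest of the four risk values."""
    candidates = [
        (first.normal, "first", "no malicious code"),
        (first.malicious, "first", "malicious code"),
        (second.normal, "second", "no malicious code"),
        (second.malicious, "second", "malicious code"),
    ]
    # The sequence that owns the minimum risk value becomes the target sequence,
    # and the kind of risk value (normal or malicious) decides the label.
    _, target, label = min(candidates, key=lambda c: c[0])
    return target, label


result = label_target(Risks(normal=0.4, malicious=0.7),
                      Risks(normal=0.9, malicious=0.1))
```

Here the smallest value is the malicious risk value of the second calling sequence, so the second calling sequence is the target sequence and is labeled as containing malicious code.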

5. The method of claim 1, wherein said selecting a first call sequence and a second call sequence from said API sequence set comprises:

selecting, from the API sequence set, the two calling sequences having the maximum Hamming distance between them as the first calling sequence and the second calling sequence.
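A minimal Python sketch of this selection, assuming each calling sequence is represented by an equal-length feature vector, is given below. The function names hamming and most_distant_pair are hypothetical, and the exhaustive pairwise search is only one straightforward way to find the pair.

```python
from itertools import combinations
from typing import Sequence, Tuple


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    return sum(x != y for x, y in zip(a, b))


def most_distant_pair(vectors: Sequence[Sequence[int]]) -> Tuple[int, int]:
    """Indices of the two vectors with the maximum Hamming distance between them."""
    if len(vectors) < 2:
        raise ValueError("at least two calling sequences are required")
    # Compare every unordered pair; on ties the earliest pair is kept.
    return max(combinations(range(len(vectors)), 2),
               key=lambda p: hamming(vectors[p[0]], vectors[p[1]]))


sequence_set = [[0, 0, 0, 0], [0, 0, 1, 1], [1, 1, 1, 1], [0, 1, 0, 0]]
first_idx, second_idx = most_distant_pair(sequence_set)
```

The exhaustive search costs a number of comparisons quadratic in the size of the set, which is acceptable for a sketch but may need replacing for large API sequence sets.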

6. The method of claim 1, wherein prior to said selecting the first call sequence and the second call sequence from the API sequence set, the method further comprises:

running a received target file to be detected in a virtual sandbox, and acquiring a calling sequence corresponding to an API (application program interface) function of the target file;

and for each calling sequence, obtaining a feature vector of the calling sequence to form the API sequence set.
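The claim does not fix how a feature vector is derived from a calling sequence. One simple possibility, shown in the Python sketch below, is a binary presence vector over a fixed API vocabulary. The vocabulary contents, the names feature_vector and build_sequence_set, and the assumption that the sandbox yields each calling sequence as a list of API names are all illustrative.

```python
from typing import Iterable, List, Sequence

# Hypothetical vocabulary of monitored API functions (illustrative only).
API_VOCABULARY = (
    "CreateFileW",
    "WriteFile",
    "RegSetValueExW",
    "VirtualAllocEx",
    "CreateRemoteThread",
)


def feature_vector(trace: Iterable[str],
                   vocabulary: Sequence[str] = API_VOCABULARY) -> List[int]:
    """Binary vector: 1 if the API name occurs in the recorded trace, else 0."""
    seen = set(trace)
    return [1 if api in seen else 0 for api in vocabulary]


def build_sequence_set(traces: Iterable[Iterable[str]]) -> List[List[int]]:
    """Form the API sequence set from the calling sequences recorded in the sandbox."""
    return [feature_vector(t) for t in traces]


# Calling sequences assumed to have been recorded by the sandbox run.
recorded = [
    ["CreateFileW", "WriteFile", "WriteFile"],
    ["VirtualAllocEx", "CreateRemoteThread"],
]
api_sequence_set = build_sequence_set(recorded)
```

A presence vector discards call order and frequency; it is used here only because equal-length binary vectors make Hamming distance directly applicable.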

7. The method of claim 1, wherein the taking of the API sequence set without the target sequence as the API sequence set of the next cycle comprises:

stopping detection when the API sequence set of the next cycle comprises only one calling sequence to be detected;

and sending a detection instruction to a manual detection terminal, wherein the detection instruction is used for indicating manual detection of the calling sequence to be detected in the API sequence set of the next cycle.
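The cycle of claim 1 together with the stop condition of this claim can be sketched in Python as below. The function label_all and its parameters select_pair and score are hypothetical: select_pair is assumed to return the indices of the first and second calling sequences, and score is assumed to return the pair (normal risk value, malicious risk value) of one sequence. Sending the detection instruction to the manual detection terminal is not modeled; the sketch only returns the sequence that would be handed over.

```python
from typing import Callable, List, Optional, Tuple

Vector = Tuple[int, ...]
RiskPair = Tuple[float, float]  # (normal risk value, malicious risk value)


def label_all(
    pending: List[Vector],
    select_pair: Callable[[List[Vector]], Tuple[int, int]],
    score: Callable[[Vector], RiskPair],
) -> Tuple[List[Tuple[Vector, str]], Optional[Vector]]:
    """Label one target sequence per cycle until at most one sequence remains."""
    remaining = list(pending)
    labeled: List[Tuple[Vector, str]] = []
    while len(remaining) > 1:
        candidates = []
        for idx in select_pair(remaining):
            normal_risk, malicious_risk = score(remaining[idx])
            candidates.append((normal_risk, idx, "no malicious code"))
            candidates.append((malicious_risk, idx, "malicious code"))
        # The minimum of the four risk values picks the target and its label.
        _, target_idx, label = min(candidates, key=lambda c: c[0])
        labeled.append((remaining[target_idx], label))
        # The set without the target becomes the set of the next cycle.
        del remaining[target_idx]
    # A single leftover sequence is left for manual detection.
    leftover = remaining[0] if remaining else None
    return labeled, leftover


# Toy stand-ins for the selection and risk calculation steps (assumptions).
pick_ends = lambda seqs: (0, len(seqs) - 1)
toy_score = lambda v: (sum(v) / len(v), 1 - sum(v) / len(v))

labeled, for_manual_review = label_all([(0, 0), (1, 1), (0, 1)], pick_ends, toy_score)
```

With the toy inputs, two cycles label two sequences and the third is returned for manual review, since a pair can no longer be selected from a set of one.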

8. An apparatus for malicious code detection, the apparatus comprising:

a calling sequence selection module, configured to: obtain static action information of a target file, wherein the static action information comprises an MD5 value of the target file; determine whether the MD5 value of the target file is greater than a threshold value; if the MD5 value of the target file is greater than the threshold value, obtain source code of the target file by using a shelling tool; if the MD5 value of the target file is less than or equal to the threshold value, obtain the source code of the target file directly; extract a plurality of API calling sequences from the source code of the target file; obtain a feature vector of each API calling sequence, and obtain an API sequence set according to the feature vector of each API calling sequence, wherein the API sequence set comprises a plurality of calling sequences to be detected; and select a first calling sequence and a second calling sequence from the API sequence set;

a risk calculation module, configured to calculate a normal risk value and a malicious risk value of the first calling sequence and a normal risk value and a malicious risk value of the second calling sequence, respectively, wherein the normal risk value represents the probability that malicious code exists in the calling sequence, and the malicious risk value represents the probability that no malicious code exists in the calling sequence;

a labeling module, configured to label a target sequence according to the normal risk value of the first calling sequence, the malicious risk value of the first calling sequence, the normal risk value of the second calling sequence and the malicious risk value of the second calling sequence to obtain a labeling result, wherein the target sequence is the first calling sequence or the second calling sequence, and the labeling result indicates either that malicious code exists or that no malicious code exists;

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.