CN107092829B

CN107092829B - Malicious code detection method based on image matching

Info

Publication number: CN107092829B
Application number: CN201710265324.1A
Authority: CN
Inventors: 喻波; 刘浏; 杨强; 解炜; 唐勇; 陈曙晖; 方莹
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2017-04-21
Filing date: 2017-04-21
Publication date: 2020-03-17
Anticipated expiration: 2037-04-21
Also published as: CN107092829A

Abstract

The invention discloses a malicious code detection method based on image matching, which comprises the following steps: S1, obtaining training samples corresponding to malicious codes of different family categories, converting each training sample into a grayscale image, and extracting the corresponding image texture features; selecting a first reference sample from the training samples of each family category, selecting a second reference sample according to the difference in image texture features between the first reference sample and the other samples, and forming the first and second reference samples selected for each family category into a corresponding reference sample set; S2, converting the malicious code to be detected into a grayscale image and extracting the corresponding image texture features; and S3, matching the image texture features extracted in step S2 against the reference sample set of each family category, and confirming the family category of the malicious code to be detected according to the matching results. The method is simple to implement, robust, and achieves high detection accuracy and high detection efficiency.

Description

Malicious code detection method based on image matching

Technical Field

The invention relates to the technical field of malicious code detection and analysis, in particular to a malicious code detection method based on image matching.

Background

With the widespread use of automatic malicious code generation tools and of open-source code in malicious code, the number of malicious code variants and new malicious code families is increasing rapidly; according to statistics, the number of malicious code variants detected in a year has reached 430 million, and malicious code has become a major challenge for cyberspace security. Traditional malicious code detection methods fall into two main types. The first is detection based on a signature mechanism, which can quickly detect known malicious code samples but requires a great deal of expert experience and manual analysis and copes poorly with metamorphic and obfuscated samples. The second is detection based on abnormal behavior, which can detect zero-day vulnerabilities and malicious code samples from novel families but has a high false alarm rate.

Malicious code detection based on automatic analysis mainly uses machine learning methods to analyze malicious code, and generally consists of three steps: ① extracting features of the malicious code, ② selecting a suitable model, and ③ obtaining a classification result.

The malicious code detection method based on automatic analysis has the following defects:

(1) Poor robustness and low detection accuracy. These methods classify on the basis of features extracted from the malicious code. Different features may yield different detection accuracy, and both the precision of feature extraction and the choice of features directly affect the accuracy of the final analysis result, so in practice robustness is poor and detection accuracy is low;

(2) Low detection efficiency. These methods are usually complex to implement and usually require lengthy model training, so detection efficiency is low.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: in view of the technical problems in the prior art, the invention provides a malicious code detection method based on image matching that is simple to implement, robust, and achieves high detection accuracy and high detection efficiency.

In order to solve the technical problems, the technical scheme provided by the invention is as follows:

A malicious code detection method based on image matching comprises the following steps:

S1, selecting reference samples: acquiring training samples corresponding to malicious codes of different family categories, converting each training sample into a grayscale image, and extracting the corresponding image texture features; selecting a first reference sample from the training samples of each family category, selecting a second reference sample according to the difference in image texture features between the first reference sample and the other samples, and forming the first and second reference samples selected for each family category into a corresponding reference sample set;

S2, image feature extraction: converting the malicious code to be detected into a grayscale image and extracting the corresponding image texture features;

S3, test code classification: matching the image texture features extracted in step S2 against the reference sample set of each family category, and confirming the family category of the malicious code to be detected according to the matching results.

As a further improvement of the present invention, the specific steps of selecting the second reference sample for each family category in step S1 are as follows:

S11, obtaining candidate reference samples: matching each selected first reference sample against the remaining training samples, and, according to the matching results, finding the training samples that are incorrectly assigned in each family category and taking them as candidate reference samples;

S12, determining the second reference sample: calculating the difference value between each candidate reference sample and the other candidate reference samples in each family category, and, if the calculated difference value is greater than a specified threshold, taking the corresponding candidate reference sample as a second reference sample of the corresponding family category.

As a further improvement of the present invention, in step S12, specifically, a Gabor function value of each candidate reference sample and a distance value between each candidate reference sample and another training sample are calculated, and a difference value between the candidate reference sample and another candidate reference sample is calculated according to the Gabor function value and the distance value.

As a further improvement of the invention, the difference value between one candidate reference sample and the other candidate reference samples is calculated according to the following formula:

$p_d(es_{id}) = \sum_{j=0}^{N} D(es_{id}, es_{hj})$

where $es_{id}$ is a candidate reference sample of the i-th class, $es_{hj}$ is the j-th candidate reference sample of the h-th class, $D(es_{id}, es_{hj})$ is the distance between samples $es_{id}$ and $es_{hj}$, h is the family class to which $es_{id}$ was incorrectly assigned, μ is a weighting coefficient, N is the number of reference samples contained in family class h, M is the number of reference samples, and l is the vector length of the image texture feature.

As a further improvement of the invention: the image texture features are signal type static texture features.

As a further improvement of the invention: the image texture features are obtained by extracting through a Gabor filter.

As a further improvement of the present invention, the specific steps of confirming the family category of the malicious code to be detected in step S3 are:

S31, obtaining the matching results between the malicious code to be detected and every reference sample in the reference sample set of each family category;

S32, obtaining a comprehensive matching value for each family category from all the matching results of that family category, and judging from the comprehensive matching value of each family category whether the malicious code to be detected belongs to that family category.

As a further improvement of the invention: the comprehensive matching value is obtained by calculation according to the following formula;

wherein $es_{test}$ is the malicious code to be detected, $es_{ij}$ is the j-th reference sample of the i-th class, and N is the number of reference samples contained in family class i.

As a further improvement of the invention: when the comprehensive matching value R corresponding to a target family category satisfies R > 0, the malicious code to be detected is judged to belong to the target family category; otherwise, it is judged not to belong to the target family category.

Compared with the prior art, the invention has the advantages that:

1) the malicious code detection method based on image matching of the invention automatically extracts image texture features, performs image matching by feature similarity analysis, and decides the family category from the image matching results, so detection can be automated and large-scale malicious code family detection and analysis can be carried out conveniently and efficiently;

2) in the image matching process, a first reference sample is selected for each family category, a second reference sample is selected according to the difference in image texture features between the first reference sample and the other samples, and the two together form the reference sample set against which the malicious code to be detected is image-matched to confirm its family category. No lengthy model training is needed, the selected reference samples are highly reliable, the influence of reference sample selection on the detection result is greatly reduced, and detection accuracy is improved;

3) when selecting the second reference sample on the basis of the first, the method finds the mismatched samples from the matching results against the first reference sample and takes them as candidate reference samples, calculates the difference value between each candidate reference sample and the training samples of the current family category, and decides from that difference value whether the candidate becomes a new reference sample, so that samples that are incorrectly assigned and differ substantially from the other samples are also used as reference samples.

Drawings

Fig. 1 is a schematic flow chart of an implementation of the malicious code detection method based on image matching according to the embodiment.

Fig. 2 is a schematic diagram illustrating an implementation principle of the malicious code detection method based on image matching according to the embodiment.

Fig. 3 is a schematic flow chart of a specific implementation of the grayscale image conversion in this embodiment.

Fig. 4 is a gray scale map obtained in an embodiment of the present invention.

Detailed Description

The invention is further described below with reference to the drawings and specific preferred embodiments of the description, without thereby limiting the scope of protection of the invention.

As shown in fig. 1 and 2, the malicious code detection method based on image matching in the present embodiment includes the steps of:

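S1, selecting reference samples: acquiring training samples corresponding to malicious codes of different family categories, converting each training sample into a grayscale image, and extracting the corresponding image texture features; selecting a first reference sample from the training samples of each family category, selecting a second reference sample according to the difference in image texture features between the first reference sample and the other samples, and forming the first and second reference samples selected for each family category into a corresponding reference sample set;

S2, image feature extraction: converting the malicious code to be detected into a grayscale image and extracting the corresponding image texture features;

S3, test code classification: matching the image texture features extracted in step S2 against the reference sample set of each family category, and confirming the family category of the malicious code to be detected according to the matching results.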

Image texture reflects the visual characteristics of homogeneous phenomena in an image and the slowly or periodically varying structural arrangement of an object's surface. This embodiment exploits these texture characteristics to detect malicious code by image matching: image texture features are extracted automatically, image matching is performed by feature similarity analysis, and the family category is then decided from the image matching results. Detection can thus be automated, and the method can readily be applied to back-end classification systems for PC and mobile clients, or to large-scale homology analysis of malicious code families, to mine malicious code efficiently from massive numbers of samples to be detected, online or offline.

In the image matching process, a first reference sample is selected for each family category, and a second reference sample is then selected according to the difference in image texture features between the first reference sample and the other samples. The first and second reference samples form a reference sample set, against which the malicious code to be detected is image-matched to confirm its family category.

In this embodiment, the specific step of selecting the second reference sample for each family category in step S1 is as follows:

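S11, obtaining candidate reference samples: matching each selected first reference sample against the remaining training samples, and, according to the matching results, finding the training samples that are incorrectly assigned in each family category and taking them as candidate reference samples;

S12, determining the second reference sample: calculating the difference value between each candidate reference sample and the other candidate reference samples in each family category, and, if the calculated difference value is greater than a specified threshold, taking the corresponding candidate reference sample as a second reference sample of the corresponding family category.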

In traditional image matching, one sample is selected directly from a set of samples of known class as the reference sample of that class; if a sample to be detected matches the image of the reference sample, it is judged to belong to the class of that reference sample. Different choices of reference image may give different matching results, so the choice of reference sample directly affects matching accuracy. In this embodiment, the second reference sample is selected on the basis of the first reference sample: samples that are mismatched against the first reference samples are first taken as candidate reference samples; the difference value between each candidate reference sample and the other candidate reference samples is then calculated; and whether the candidate becomes a new reference sample is decided from that difference value. In this way, samples that are incorrectly assigned and differ substantially from the other candidate reference samples are also used as reference samples.

In step S1, this embodiment first applies an image transformation to each training sample, converting the binary malicious code into a grayscale image. As shown in fig. 3, each pixel of a grayscale image is represented by an unsigned integer in the range [0, 255], so the malicious code in binary form is first converted into a matrix of unsigned integers. Because 8 bits of binary data convert to an integer of at least 0 and less than 256, the binary file is cut and converted in units of 8 consecutive bits, and the image width can be adjusted as required, giving the grayscale image of each training sample.
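The conversion described above can be sketched in a few lines of Python. This is a minimal illustration rather than the embodiment's implementation: the function name `bytes_to_gray_image`, the default width of 256 pixels, and zero-padding of the last row are all assumptions made for the example.

```python
import math

def bytes_to_gray_image(data, width=256):
    """Cut a binary file into 8-bit units and arrange them as rows of pixels.

    The default width and the zero-padding of the last row are assumptions;
    the text only says the image width can be adjusted as required.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = list(data)  # each byte is already an unsigned integer in [0, 255]
    height = math.ceil(len(pixels) / width)
    pixels += [0] * (height * width - len(pixels))  # pad the final row
    return [pixels[r * width:(r + 1) * width] for r in range(height)]

# Five bytes laid out two pixels per row; the last row is padded with 0.
image = bytes_to_gray_image(bytes([0, 127, 255, 16, 32]), width=2)
```

For a real sample the input would be the raw file contents, for example `open(path, "rb").read()`, where `path` is a hypothetical file name.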

Image texture features fall into four main types: statistical, model-based, signal-processing-based, and structural texture features. In this embodiment the image texture features are signal-type texture features, extracted with a signal-processing texture method, specifically a Gabor filter. Because the embodiment relies on static features, the malicious code does not need to be run, and the method is simple to implement.

The Gabor filter is a linear filter used for image edge feature extraction. Its impulse response is defined as a sinusoidal wave multiplied by a Gaussian function; for a two-dimensional Gabor filter, the sinusoid is a sinusoidal plane wave. By the convolution theorem (multiplication in the spatial domain corresponds to convolution in the frequency domain), the Fourier transform of the impulse response of a Gabor filter is the convolution of the Fourier transform of the harmonic function with the Fourier transform of the Gaussian function. The filter consists of a real part and an imaginary part, which are orthogonal to each other. The Gabor filter used in this embodiment is as follows, where the complex expression is:

the real part is:

the imaginary part is:

where x' = x cos θ + y sin θ and y' = -x sin θ + y cos θ; λ is the wavelength, in pixels; θ is the orientation, with values between 0° and 360°; ψ is the phase offset, in the range [-180°, 180°]; γ is the aspect ratio, which determines the ellipticity of the Gabor function; and σ is the standard deviation of the Gaussian factor of the Gabor function, which varies with the bandwidth.
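For reference, the standard two-dimensional Gabor filter that is consistent with the parameters λ, θ, ψ, σ and γ defined above can be written as follows. It is assumed here that formulas (1) to (3) of this embodiment take this standard form.

```latex
\begin{aligned}
g(x, y; \lambda, \theta, \psi, \sigma, \gamma)
  &= \exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)
     \exp\!\left(i\left(2\pi\frac{x'}{\lambda} + \psi\right)\right) \\
\operatorname{Re}(g)
  &= \exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)
     \cos\!\left(2\pi\frac{x'}{\lambda} + \psi\right) \\
\operatorname{Im}(g)
  &= \exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)
     \sin\!\left(2\pi\frac{x'}{\lambda} + \psi\right) \\
x' &= x\cos\theta + y\sin\theta, \qquad y' = -x\sin\theta + y\cos\theta
\end{aligned}
```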

When image features are extracted, a bank of Gabor functions with different frequencies and orientations is obtained. When texture features are computed with Gabor filters, the image texture feature of each sample can be expressed as T = ([a1, a2], [b1, b2], [c1, c2], [d1, d2]); that is, it consists of four feature values a, b, c and d, each composed of a real part (subscript 1) and an imaginary part (subscript 2).
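One way to arrive at a feature of this shape is sketched below in plain Python. Everything specific in it is an assumption for illustration: the four orientations, the kernel size, the filter parameters, and the choice of the mean absolute response as the aggregate are not given in the text, and the names `gabor_kernel` and `texture_features` are hypothetical.

```python
import cmath
import math

def gabor_kernel(size, lam, theta, psi=0.0, sigma=2.0, gamma=0.5):
    """Complex Gabor kernel on a (2*half+1)-wide grid centred at 0; size should be odd."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + gamma ** 2 * yr ** 2) / (2 * sigma ** 2))
            row.append(envelope * cmath.exp(1j * (2 * math.pi * xr / lam + psi)))
        kernel.append(row)
    return kernel

def texture_features(image, thetas=(0.0, math.pi / 4, math.pi / 2, 3 * math.pi / 4),
                     size=5, lam=4.0):
    """Return one [real, imag] pair per orientation (assumed: mean absolute response)."""
    h, w = len(image), len(image[0])
    half = size // 2
    if h <= 2 * half or w <= 2 * half:
        raise ValueError("image is smaller than the kernel")
    feats = []
    for theta in thetas:
        k = gabor_kernel(size, lam, theta)
        re_sum = im_sum = 0.0
        count = 0
        # Slide the kernel over every position where it fits entirely.
        for cy in range(half, h - half):
            for cx in range(half, w - half):
                resp = sum(
                    image[cy + dy][cx + dx] / 255.0 * k[dy + half][dx + half]
                    for dy in range(-half, half + 1)
                    for dx in range(-half, half + 1)
                )
                re_sum += abs(resp.real)
                im_sum += abs(resp.imag)
                count += 1
        feats.append([re_sum / count, im_sum / count])
    return feats

# Small synthetic 8x8 "image" used only to exercise the functions.
demo = [[(r * 16 + c * 8) % 256 for c in range(8)] for r in range(8)]
features = texture_features(demo)
```

With four orientations the result has the same layout as T above: four pairs, each holding a real-part value and an imaginary-part value.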

When samples are matched, the difference value of the image texture features between the samples is calculated, and the match is judged from that difference value: the smaller the difference value between two samples' image texture features, the better they match. The difference between two individual samples $s_i$ and $s_j$ is calculated by the formula:

In this embodiment, in step S12, the Gabor function value of each candidate reference sample and the distance value between each candidate reference sample and the other candidate reference samples are calculated, and the difference value between each candidate reference sample and the other candidate reference samples is obtained from the Gabor function value and the distance value.

In this embodiment, the difference value between one candidate reference sample $es_{id}$ and the other candidate reference samples is calculated according to the following formula:

$p_d(es_{id}) = \sum_{j=0}^{N} D(es_{id}, es_{hj})$

where $es_{id}$ is a candidate reference sample of the i-th class, $es_{hj}$ is the j-th candidate reference sample of the h-th class, $D(es_{id}, es_{hj})$ is the distance between samples $es_{id}$ and $es_{hj}$, h is the family class to which $es_{id}$ was incorrectly assigned, μ is a trade-off factor, N is the number of reference samples contained in family class h, M is the number of reference samples, and l is the length of the image texture feature vector obtained by the Gabor filter. Whether a candidate reference sample is used as a new reference sample is determined from the difference value between each candidate reference sample and the training samples of the current family class, which improves the reliability of the reference samples.

For a sample set C containing n samples in m family classes, denoted $C = \{C_1, C_2, \ldots, C_m\}$, where C contains $n_1$ training samples and $n_2$ unknown samples with $n = n_1 + n_2$, the detailed steps for selecting the reference samples are as follows:

① Randomly select m samples $b_{11}, b_{21}, \ldots, b_{m1}$ from the training samples as reference samples, where $b_{ij}$ denotes the j-th reference sample from family i; that is, one training sample is randomly selected from the training samples of each family class as the initial reference sample (the first reference sample);

② Match the remaining training samples against the initial reference samples, and collect the mismatched samples, i.e. the incorrectly assigned samples. The mismatched samples likewise fall into m classes and are taken as candidate reference samples; suppose the m classes of mismatched samples are expressed as:

$es = \{\{es_{11}, es_{12}, \ldots\}, \{es_{21}, es_{22}, \ldots\}, \ldots, \{es_{m1}, es_{m2}, \ldots\}\}$

③ Perform a secondary matching inside the mismatched sample set of each family class. Specifically, denote the candidate reference sample set of family i as $\{es_{i1}, es_{i2}, \ldots\}$; calculate the Gabor function value of each candidate reference sample, $\{gabor_l(es_{i1}), gabor_l(es_{i2}), gabor_l(es_{i3}), \ldots\}$, and calculate the difference between the different candidate reference samples, specifically the difference value of sample $es_{id}$ with respect to family i according to formula (5). If the difference value of candidate sample $es_{id}$ satisfies $D(es_{id}) > \rho$, then $es_{id}$ is added as a new reference sample (second reference sample) for family i.
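Steps ① to ③ can be sketched as follows. This is a simplified illustration under stated assumptions, not the embodiment's procedure: the Euclidean distance stands in for the sample difference, the difference value of a candidate is taken as the plain sum of its distances to the other candidates of the same family (the Gabor function value term and the factor μ are omitted), the first sample of each family is used instead of a random one so that the result is reproducible, and the names `distance` and `select_references` are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two flat feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_references(training, rho):
    """training maps family -> list of feature vectors; returns family -> references."""
    # Step 1: first reference sample per family (first sample, not random, for repeatability).
    first = {fam: samples[0] for fam, samples in training.items()}

    # Step 2: a remaining sample whose nearest first reference belongs to
    # another family is incorrectly assigned and becomes a candidate.
    candidates = {fam: [] for fam in training}
    for fam, samples in training.items():
        for s in samples[1:]:
            nearest = min(first, key=lambda f: distance(s, first[f]))
            if nearest != fam:
                candidates[fam].append(s)

    # Step 3: secondary matching inside each family's candidate set.
    references = {}
    for fam, cands in candidates.items():
        refs = [first[fam]]
        if len(cands) == 1:
            refs.append(cands[0])  # a lone candidate is added directly
        else:
            for i, c in enumerate(cands):
                diff = sum(distance(c, o) for j, o in enumerate(cands) if j != i)
                if diff > rho:
                    refs.append(c)
        references[fam] = refs
    return references

# Toy two-dimensional features, invented purely for the example.
training = {
    "A": [[0.0, 0.0], [0.1, 0.1], [0.9, 0.9], [0.9, 0.3]],
    "B": [[1.0, 1.0], [0.95, 1.0], [0.2, 0.1]],
}
refs = select_references(training, rho=0.45)
```

In the toy data, family A ends up with two extra reference samples and family B with one, which mirrors the shape of the worked example later in the description.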

In this embodiment, the malicious code to be detected is first subjected to image transformation, converting the binary malicious code into a grayscale image, and the image texture features are then extracted, in the same way as for the training samples.

In this embodiment, the specific steps of determining the family type of the malicious code to be detected based on the extracted image texture features in step S3 are as follows:

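S31, obtaining the matching results between the malicious code to be detected and every reference sample in the reference sample set of each family category;

S32, obtaining a comprehensive matching value for each family category from all the matching results of that family category, and judging from the comprehensive matching value of each family category whether the malicious code to be detected belongs to that family category.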

In this embodiment, the comprehensive matching value is calculated according to the following formula;

wherein,

In this embodiment, when the malicious code to be detected $es_{test}$ is matched against a reference sample $es_{ij}$, the matching result is 1 if they match and -1 if they do not; the matching result values can of course be set according to actual requirements. All matching results obtained for each family category are accumulated to give the final comprehensive matching value, from which the family category of the malicious code to be detected is judged.
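Assuming the per-sample matching results described above are simply summed, the comprehensive matching value of family class i could be written as below. The symbols $R_i$ and $m(\cdot,\cdot)$ are introduced here for illustration only, and the exact form of the formula is an assumption; N is the number of reference samples contained in family class i.

```latex
R_i = \sum_{j=1}^{N} m(es_{test}, es_{ij}), \qquad
m(es_{test}, es_{ij}) =
\begin{cases}
1, & \text{if } es_{test} \text{ matches } es_{ij} \\
-1, & \text{otherwise}
\end{cases}
```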

In this embodiment, when the comprehensive matching value R corresponding to the target family category satisfies R >0, it is determined that the malicious code to be detected belongs to the target family category, and otherwise, it is determined that the malicious code to be detected does not belong to the target family category.

The invention is further illustrated below by taking the detection classification of 10 test samples in two family classes as an example.

The training samples used in this example are shown in table 1.

Table 1: Training sample table.

Step 1: reference sample selection

Step 1.1: training sample image texture feature extraction

Taking two training samples of family (1), S1 (0B06744D7C5822BA585C5992B10ADFA0) and S2 (0BDAFFBA037a4880D31C93C0AADCC1FE), and two training samples of family (2), S3 (2C69C485a46B03C277B5F88DED0BABF0) and S4 (2C9F38EF39CFD73AA52E22869E8ABD90), as examples, the four malicious code training samples are first converted from binary files into grayscale images. For example, the binary code segment "01100111" is converted into the unsigned integer 103, meaning that the corresponding pixel in the grayscale image has the value 103. The resulting grayscale images are shown in fig. 4, where panel (a) is family (2), corresponding to samples S3 and S4, and panel (b) is family (1), corresponding to samples S1 and S2. Texture features of the malicious code are then extracted with the Gabor filter, implemented as in formulas (1) to (3) above; the texture features calculated for the four samples are as follows:

T_{sample S1}＝([3.64589196e-01,1.78531921e-02],[1.11456886e-01,3.62631582e-03],[2.45940133e-01,4.82167451e-03],[3.66851460e-04,1.85390288e-04])；

T_{Sample S2}＝([3.67820753e-01,2.47166142e-02],[1.12444790e-01,5.22362168e-03],[2.48120037e-01,7.30584538e-03],[3.70103068e-04,3.69625304e-04])；

T_{Sample S3}＝([3.82683113e-01,1.65478632e-02],[1.16988294e-01,5.31969120e-03],[2.58145706e-01,3.20882018e-03],[3.85057648e-04,4.53992963e-04])；

T_{Sample S4}＝([3.78114609e-01,2.53183776e-02],[1.15591678e-01,5.70669572e-03],[2.55063941e-01,7.49917029e-03],[3.80460797e-04,3.41053618e-04])。
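As a rough illustration of how such feature vectors can be compared, the sketch below measures the difference between the four samples listed above. The Euclidean distance over the flattened eight values is an assumed stand-in, since formula (4) is not reproduced here, and the helper names `flatten`, `difference` and `nearest` are hypothetical.

```python
import math

# Texture features of samples S1 to S4, copied from the values listed above.
T = {
    "S1": [[3.64589196e-01, 1.78531921e-02], [1.11456886e-01, 3.62631582e-03],
           [2.45940133e-01, 4.82167451e-03], [3.66851460e-04, 1.85390288e-04]],
    "S2": [[3.67820753e-01, 2.47166142e-02], [1.12444790e-01, 5.22362168e-03],
           [2.48120037e-01, 7.30584538e-03], [3.70103068e-04, 3.69625304e-04]],
    "S3": [[3.82683113e-01, 1.65478632e-02], [1.16988294e-01, 5.31969120e-03],
           [2.58145706e-01, 3.20882018e-03], [3.85057648e-04, 4.53992963e-04]],
    "S4": [[3.78114609e-01, 2.53183776e-02], [1.15591678e-01, 5.70669572e-03],
           [2.55063941e-01, 7.49917029e-03], [3.80460797e-04, 3.41053618e-04]],
}

def flatten(feature):
    """Turn ([a1, a2], [b1, b2], ...) into one flat list of values."""
    return [v for pair in feature for v in pair]

def difference(a, b):
    """Euclidean distance between two texture features (assumed, not formula (4))."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(flatten(a), flatten(b))))

def nearest(name, reference_names):
    """Name of the reference sample with the smallest difference to sample `name`."""
    return min(reference_names, key=lambda r: difference(T[name], T[r]))

# Match S2 and S4 against S1 and S3 used as first reference samples.
match_s2 = nearest("S2", ["S1", "S3"])
match_s4 = nearest("S4", ["S1", "S3"])
```

Under this assumed distance, S2 lies closer to S1 and S4 lies closer to S3, so each is matched to the reference sample of its own family.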

Step 1.2: candidate reference sample selection

In this embodiment, families (1) and (2) provide training sample sets 1 and 2 respectively, as shown in table 1, each containing 10 training samples. The difference between training samples is calculated with formula (4) to perform matching. Assuming the initial reference samples are sample S1 and sample S3, the matching result is: in training sample set 1, samples S7 and S10 are incorrectly assigned; in training sample set 2, sample S5 is incorrectly assigned. These three samples are therefore taken as the candidate reference samples {[c₁₁, c₁₂], [c₂₁]}.

Step 1.3: second reference sample determination

Calculating the texture features of the candidate reference samples:

([3.53133564e-01,2.24345224e-02],[1.07954837e-01,4.99747304e-03],[2.38212532e-01,6.49801062e-03],[3.55324746e-04,4.32171770e-04]),([3.54380214e-01,2.24449735e-02],[1.08335945e-01,5.00705347e-03],[2.39053482e-01,6.41765146e-03],[3.56579131e-04,3.85045161e-04])，([3.66485717e-01,2.55705031e-02],[1.12036663e-01,5.15760513e-03],[2.47219465e-01,8.83855001e-03],[3.68759749e-04,3.02423971e-04])。

Calculating the difference value between each candidate reference sample and the other candidate reference samples using formula (5) above gives:

where μ is set to 2.

In this embodiment the threshold ρ is assumed to be 0.45. Since D(c₁₁) > ρ and D(c₁₂) > ρ, the candidate reference samples c₁₁ and c₁₂ are added as new reference samples (second reference samples), so the reference sample set of family (1) is {b₁₁, c₁₁, c₁₂}. Because family (2) has only one candidate reference sample, it is added directly as a new reference sample (second reference sample), giving the reference sample set {b₂₁, c₂₁}.

Step 2: test sample image texture feature extraction

Each test malicious code is converted into a grayscale image and its image texture features are extracted, using the method described above.

Step 3: Detection classification

Each test sample is then matched against the reference sample set {b₁₁, c₁₁, c₁₂} of family (1) and the reference sample set {b₂₁, c₂₁} of family (2), where the test sample set contains ten test samples {S₁, S₂, S₃, S₄, S₅, S₆, S₇, S₈, S₉, S₁₀}. The comprehensive matching results obtained for each test sample using formula (6) are as follows:

Table 2: Test matching results.

Test sample	S₁	S₂	S₃	S₄	S₅	S₆	S₇	S₈	S₉	S₁₀
Family 1	3	3	3	1	3	0	0	3	1	0
Family 2	0	0	0	2	0	2	2	0	2	2

If the comprehensive matching result is greater than 0, the test sample is judged to belong to the corresponding family category; otherwise it is judged not to belong to that family category. From the comprehensive matching results above, the final detection result is that {S₁, S₂, S₃, S₅, S₈} belong to family (1) and {S₄, S₆, S₇, S₉, S₁₀} belong to family (2). The detection results show that the method of the invention can accurately classify malicious code families and has high detection efficiency.
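The decision over Table 2 can be sketched as follows. One point is an assumption rather than something the text states: when more than one family has a comprehensive matching value greater than 0 (as for S₄ and S₉ in Table 2), this sketch picks the family with the highest value. The names `table2` and `classify` are hypothetical.

```python
# Comprehensive matching values from Table 2, one list entry per test sample S1..S10.
table2 = {
    "Family 1": [3, 3, 3, 1, 3, 0, 0, 3, 1, 0],
    "Family 2": [0, 0, 0, 2, 0, 2, 2, 0, 2, 2],
}

def classify(scores_by_family):
    """Map each test sample to a family whose comprehensive matching value is > 0."""
    n = len(next(iter(scores_by_family.values())))
    result = {}
    for k in range(n):
        positive = {fam: s[k] for fam, s in scores_by_family.items() if s[k] > 0}
        # Assumed tie-break: among families with a value > 0, take the highest.
        result["S%d" % (k + 1)] = max(positive, key=positive.get) if positive else None
    return result

labels = classify(table2)
family1 = sorted(s for s, fam in labels.items() if fam == "Family 1")
family2 = sorted(s for s, fam in labels.items() if fam == "Family 2")
```

With this tie-break the grouping agrees with the detection result stated above.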

The foregoing describes only preferred embodiments of the invention and is not to be construed as limiting the invention in any way. Although the invention has been disclosed above with reference to preferred embodiments, they are not intended to limit it. Therefore, any simple modification, equivalent change or variation made to the above embodiments in accordance with the technical essence of the invention, without departing from the content of the technical scheme of the invention, shall fall within the protection scope of the technical scheme of the invention.

Claims

1. A malicious code detection method based on image matching is characterized by comprising the following steps:

S3, testing code classification: matching the image texture features extracted in the step S2 with the reference sample sets corresponding to the family categories respectively, and confirming the family categories of the malicious codes to be detected according to matching results;

the specific step of selecting the second reference sample in step S1 is:

2. The method according to claim 1, wherein in step S12 a Gabor function value of each candidate reference sample and a distance value between each candidate reference sample and the other candidate reference samples are calculated, and the difference value between each candidate reference sample and the other candidate reference samples is calculated from the Gabor function value and the distance value.

3. The image matching-based malicious code detection method according to claim 2, wherein the difference value between one candidate reference sample and the other candidate reference samples is calculated according to the following formula;

$p_d(es_{id}) = \sum_{j=0}^{N} D(es_{id}, es_{hj})$

4. The malicious code detection method based on image matching according to any one of claims 1 to 3, wherein the image texture features are signal type static texture features.

5. The malicious code detection method based on image matching according to any one of claims 1 to 3, characterized in that: the image texture features are obtained by extracting through a Gabor filter.

6. The image matching-based malicious code detection method according to any one of claims 1 to 3, wherein the specific steps of confirming the family category of the malicious code to be detected in the step S3 are as follows:

7. The image matching-based malicious code detection method according to claim 6, wherein the comprehensive matching value is calculated according to the following formula;

wherein,

8. The image matching-based malicious code detection method according to claim 7, wherein: when the comprehensive matching value R corresponding to the target family category satisfies R > 0, the malicious code to be detected is judged to belong to the target family category; otherwise, it is judged not to belong to the target family category.