CN108416213A - A malicious code classification method based on image texture fingerprints - Google Patents
- Publication number
- CN108416213A (application number CN201810207712.9A)
- Authority
- CN
- China
- Prior art keywords
- malicious code
- image
- gray level
- classification method
- code
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
- G06F21/563—Static detection by source code analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Virology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Collating Specific Patterns (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a malicious code classification method based on image texture fingerprints, which combines image analysis with malicious code classification. The operation codes (opcodes) of the malicious code are converted to numerical values and mapped to an uncompressed two-channel grayscale image; the two-channel image is then converted to a single-channel grayscale image by a gray-level transformation; texture features of the image are extracted with a gray-level co-occurrence matrix and used as the essential features of the malicious code; finally, the malicious code is classified with the random forest algorithm. On the one hand, the method reduces the number of features used to represent the malicious code and thereby increases classification speed; on the other hand, it effectively overcomes obfuscation techniques such as opcode reordering and code transformation, and thereby improves classification accuracy.
Description
Technical field
The present invention relates to the field of malicious code classification, and in particular to a malicious code classification method based on image analysis.
Background art
With the rapid development of the Internet, malicious code has become one of the principal threats to Internet security, and its volume is growing rapidly. In the prior art, malicious code is generally analyzed and identified by static analysis or by dynamic analysis. Dynamic analysis examines the code while it runs, so the code analyzed is exactly the code that is executed. However, dynamic analysis observes the behavior of only a single path in each execution, whereas many malicious programs have multiple execution paths, so dynamic analysis has inherent limitations. Static analysis first disassembles the executable program and then extracts feature information from the code for classification. Many researchers have proposed static analysis methods that convert malicious code into an image and extract image features for identification. For example, Nataraj L et al. proposed the SPAM-GIST malicious code classification method (Nataraj L, Manjunath B S. SPAM: Signal Processing to Analyze Malware [Applications Corner][J]. IEEE Signal Processing Magazine, 2016, 33(2): 105-117), which maps the malicious code binary file to an image, extracts the global GIST feature of the image from the multi-scale, multi-orientation responses of Gabor filters, uses this feature to represent the malicious code, and then classifies the malicious code with the nearest neighbor algorithm. However, the features extracted by these static analysis methods have too many dimensions, and their accuracy on obfuscated malicious code is insufficient, which results in low classification accuracy and slow classification speed. How to obtain a malicious code analysis method with high classification accuracy and high classification speed is therefore a problem that those skilled in the art need to solve.
Summary of the invention
The present invention provides a malicious code classification method based on image texture fingerprints. It addresses the problem that prior-art malicious code classification techniques cannot effectively identify obfuscated malicious code and extract a large number of features, which leads to low classification accuracy and slow classification speed.
To solve the above technical problem, one technical solution adopted by the present invention is to provide a malicious code classification method based on image texture fingerprints, comprising the following steps: opcode numeralization, in which the opcodes of the malicious code are converted into a numerical file, and the numerical file is further converted into a binary file; two-channel grayscale image processing, in which the binary file is mapped to two vectors, which are visualized as a two-channel grayscale image; single-channel grayscale image processing, in which the two-channel grayscale image is converted by a gray-level transformation into a grayscale image with a fixed number of gray levels, and a single-channel grayscale image is output; texture feature extraction, in which texture features of the image are extracted from the single-channel grayscale image using a gray-level co-occurrence matrix; and malicious code classification, in which the texture features are taken as the essential features of the malicious code and are classified using the random forest algorithm.
In another embodiment of the malicious code classification method based on image texture fingerprints of the present invention, in the opcode numeralization step, each value in the numerical file is converted to a 16-bit unsigned binary number to form the binary file, and each binary number in the binary file is divided into two parts, the first part comprising the low 8 bits and the second part comprising the high 8 bits.
In another embodiment of the malicious code classification method based on image texture fingerprints of the present invention, in the two-channel grayscale image processing step, two vectors are generated from the binary file, and each element of the vectors lies in the range [0, 255]. The first vector corresponds to the first part of the binary file and the second vector corresponds to the second part. The first vector is mapped to a first-channel grayscale image and the second vector to a second-channel grayscale image, and the two mapped images are saved as two PNG pictures.
In another embodiment of the malicious code classification method based on image texture fingerprints of the present invention, in the single-channel grayscale image processing step, a gray-level layering algorithm is used to apply the gray-level transformation to the first-channel grayscale image and the second-channel grayscale image.
In another embodiment of the malicious code classification method based on image texture fingerprints of the present invention, in the texture feature extraction step, 12 texture features are extracted from the single-channel grayscale image using the gray-level co-occurrence matrix.
In another embodiment of the malicious code classification method based on image texture fingerprints of the present invention, the 12 texture features are: angular second moment, contrast, correlation, variance, inverse difference moment, homogeneity, dissimilarity, entropy, sum average, sum variance, sum entropy, and difference entropy.
In another embodiment of the malicious code classification method based on image texture fingerprints of the present invention, the random forest algorithm classifies the texture features as follows. Step 1: K new bootstrap sample sets are drawn at random, with replacement, from the malicious code training set using the bootstrap method, and K decision trees are built from them; the samples not drawn each time form K sets of out-of-bag data. Step 2: given n features, m (m <= n) candidate features are randomly selected at each node of each decision tree; the Gini index of each candidate feature is computed, and the candidate feature with the smallest Gini index is used to split the node. Step 3: a node in a decision tree stops growing when it contains samples of only one class or when the number of samples in the node is less than the minimum split threshold. Step 4: the K decision trees form the random forest, which is used to classify new malicious code essential feature data; the classification result is determined by the number of votes cast by the decision trees. The Gini index is computed as follows:
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{|y|} p_k^2$$
$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Gini}(D^v)$$
where Gini(D) is the Gini value of the data set D contained in a node of a decision tree before splitting; |y| is the number of classes in D and p_k is the proportion of D that belongs to the k-th class; a is any one of the m candidate features; D^v is the set {a = a fixed attribute value}; V is the total number of subsets into which feature a divides D according to its attribute values; Gini(D^v) is the Gini value of D^v; |D| is the number of samples in D; |D^v| is the number of samples in the set {a = a fixed attribute value}; and Gini_index(D, a) is the Gini index of feature a on the data set D.
In another embodiment of the malicious code classification method based on image texture fingerprints of the present invention, the single-channel grayscale image obtained by the gray-level layering algorithm has 16 gray levels.
In another embodiment of the malicious code classification method based on image texture fingerprints of the present invention, the number of decision trees in the random forest is in the range [100, 150].
In another embodiment of the malicious code classification method based on image texture fingerprints of the present invention, the number of decision trees in the random forest is 100.
The beneficial effects of the present invention are as follows. The invention discloses a malicious code classification method based on image texture fingerprints that combines image analysis with malicious code classification: the opcodes are converted to numerical values and mapped to an uncompressed two-channel grayscale image; the two-channel image is converted to a single-channel grayscale image by a gray-level transformation; texture features of the image are extracted with a gray-level co-occurrence matrix and used as the essential features of the malicious code; and the malicious code is finally classified with the random forest algorithm. Because the method reduces the number of gray levels of the opcode grayscale image to a small value, the gray-level co-occurrence matrix is smaller, which reduces the number of features used to represent the malicious code and increases classification speed. In addition, mapping the numeralized opcodes to a grayscale image effectively overcomes obfuscation techniques such as opcode reordering and code transformation, which improves classification accuracy.
Brief description of the drawings
Fig. 1 is a flow chart of an embodiment of the malicious code classification method based on image texture fingerprints of the present invention;
Fig. 2 is a schematic diagram of an embodiment in which the opcodes of malicious code are mapped to a two-channel grayscale image in the present invention;
Fig. 3 is a schematic diagram of the relationship between the number of gray levels and the classification accuracy of the malicious code classification method based on image texture fingerprints of the present invention;
Fig. 4 is a schematic diagram of the relationship between the number of decision trees and the classification accuracy of the malicious code classification method based on image texture fingerprints of the present invention.
Detailed description
To facilitate understanding of the present invention, it is described in more detail below with reference to the drawings and specific embodiments. Preferred embodiments of the invention are shown in the drawings. The invention may, however, be implemented in many different forms and is not limited to the embodiments described in this specification; rather, these embodiments are provided so that the disclosure will be understood more thoroughly and completely.
It should be noted that, unless otherwise defined, all technical and scientific terms used in this specification have the same meaning as commonly understood by those skilled in the art to which the invention belongs. The terms used in the description of the invention are for the purpose of describing specific embodiments only and are not intended to limit the invention. The term "and/or" as used in this specification includes any and all combinations of one or more of the associated listed items.
Fig. 1 shows an embodiment of the malicious code classification method based on image texture fingerprints of the present invention. The method specifically comprises the following steps:
Step S1, opcode numeralization: the opcodes of the malicious code are converted into a numerical file, and the numerical file is further converted into a binary file;
Disassembling the malicious code yields the corresponding disassembled code, which contains opcodes. All opcodes are collected in order from the disassembled code and stored in a vector; this vector is the opcode file of the malicious code. Further, a numerical value is defined for every opcode. Each opcode in the opcode file is converted to its value to obtain the numerical file of the opcode file, and the numerical file is then binarized to obtain the binary file of the opcode file. Table 1 lists all opcodes and their corresponding values.
Table 1. Opcodes and their corresponding values
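Step S1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `OPCODE_VALUES` is a hypothetical stand-in for Table 1 (whose contents are not reproduced here), the function names are invented for the example, and big-endian byte order is an arbitrary choice.

```python
import struct

# Hypothetical opcode-to-value table; it stands in for Table 1, which is not
# reproduced in this text. Only the maximum value 301 comes from the description.
OPCODE_VALUES = {"mov": 1, "push": 2, "call": 3, "add": 4, "jmp": 301}

def numeralize(opcodes, table=OPCODE_VALUES):
    """Map an ordered opcode sequence to its vector of numerical values."""
    return [table[op] for op in opcodes]

def to_binary_file(values):
    """Pack each value as a 16-bit unsigned integer (big-endian is assumed)."""
    return struct.pack(f">{len(values)}H", *values)

opcodes = ["push", "mov", "call", "jmp"]  # opcodes collected in order
values = numeralize(opcodes)              # numerical file
blob = to_binary_file(values)             # binary file, 2 bytes per opcode
```

Each opcode occupies exactly two bytes in `blob`, so the opcode order of the disassembled code is preserved.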
Step S2, two-channel grayscale image processing: the binary file is mapped to two vectors, which are visualized as a two-channel grayscale image;
Specifically, the binary value of each opcode is divided into two parts. Because the maximum opcode value is 301, at least 9 binary bits are needed to represent an opcode. To simplify the subsequent extraction of image information, each opcode is represented as a 16-bit unsigned binary number, padded with leading zeros when its value occupies fewer than 16 bits. The binary value of each opcode is therefore divided into a first part comprising the low 8 bits and a second part comprising the high 8 bits. Converting the first parts to decimal gives the first vector, and converting the second parts to decimal gives the second vector; the number of elements in each vector equals the number of opcodes in the opcode file. The first vector and the second vector are each mapped to an image, which gives the two-channel grayscale image: the first-channel grayscale image and the second-channel grayscale image.
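The split into low and high bytes can be illustrated as follows. This is a minimal sketch; the function name `split_channels` and the sample values are assumptions made for the example.

```python
def split_channels(values):
    """Split 16-bit opcode values into a low-byte and a high-byte vector."""
    low = [v & 0xFF for v in values]          # first vector: low 8 bits
    high = [(v >> 8) & 0xFF for v in values]  # second vector: high 8 bits
    return low, high

values = [2, 1, 3, 301]  # sample opcode values; 301 is the stated maximum
channel1, channel2 = split_channels(values)
```

Every element of both vectors lies in [0, 255], so each vector can be displayed directly as an 8-bit grayscale row of height 1.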
Step S3, single-channel grayscale image processing: the two-channel grayscale image is converted by a gray-level transformation into a grayscale image with a fixed number of gray levels, and a single-channel grayscale image is output;
To capture the overall characteristics of the malicious code and to reduce the number of image features, the two-channel grayscale image must be converted into a single-channel grayscale image with a fixed number of gray levels. Specifically, the present invention applies a gray-level layering algorithm to the two-channel grayscale image to obtain a grayscale image with a fixed number of gray levels, which is output as the single-channel grayscale image. After the transformation, the gray-level range of the resulting image is smaller.
Step S4, texture feature extraction: texture features of the image are extracted from the single-channel grayscale image using a gray-level co-occurrence matrix;
Texture fingerprint features are among the important features of an image. The present invention extracts the texture features of the image and uses them as the essential features of the malicious code after it has been converted to an image. Specifically, the gray-level co-occurrence matrix from the field of image processing is used to obtain the texture features.
Step S5, malicious code classification: the texture features are taken as the essential features of the malicious code and are classified using the random forest algorithm;
Algorithms commonly used for malicious code classification in the prior art include neural networks, support vector machines, and k-nearest neighbors. The present invention classifies the essential features of the malicious code with the random forest algorithm and thereby obtains the class to which the malicious code belongs.
The malicious code classification method based on image texture fingerprints of the present invention converts the opcodes to a grayscale image and reduces its gray levels to a small number before computing the gray-level co-occurrence matrix. The co-occurrence matrix is therefore smaller, which reduces the number of features used to represent the malicious code and, to some extent, increases classification speed. In addition, mapping the numeralized opcodes to a grayscale image effectively overcomes obfuscation techniques such as opcode reordering and code transformation, which improves classification accuracy.
Fig. 2 is a schematic diagram of an embodiment in which the opcodes of malicious code are mapped to a two-channel grayscale image in the present invention. In Fig. 2, the malicious code is first disassembled to obtain its disassembled code 21; the disassembled code 21 shown is one example. Next, the opcodes contained in the disassembled code 21 are collected in order to form the opcode vector 22. The opcode vector 22 is numeralized according to Table 1 to give the opcode value vector 23. All values in the opcode value vector 23 are then converted to 16-bit unsigned integers to obtain the opcode binary file 24. Each 16-bit integer in the opcode binary file 24 is divided into two parts: the first part comprises the low 8 bits and the second part comprises the high 8 bits. The low 8-bit parts are converted in order into the first vector 251 and the high 8-bit parts into the second vector 252, so that the whole file finally yields two vectors whose elements lie in the range [0, 255] (0 represents black and 255 represents white). The first vector 251 is treated as channel 1 (channel1) and the second vector 252 as channel 2 (channel2); they are visualized as the first-channel grayscale image 261 and the second-channel grayscale image 262 respectively. The height of each image is 1, and its width depends on the number of opcodes in the file. Specifically, the first-channel grayscale image 261 and the second-channel grayscale image 262 are saved as uncompressed PNG pictures.
In this embodiment, the vector generated when the opcodes are numeralized and binarized has height 1. If a matrix of some other height were used to store the opcode values, zero elements would have to be appended whenever the number of opcodes differed from the number of matrix elements. A height of 1 avoids this padding and its effect on the subsequent image feature extraction, so that the features extracted by the present invention represent the malicious code more accurately, which to some extent also ensures the accuracy of malicious code classification.
Preferably, gray-level layering is used to better extract the brightness information of the image: the present invention applies a gray-level layering algorithm to the two-channel image, converting the two grayscale images by shifting into a grayscale image with a fixed number of gray levels. In practice a grayscale image generally has 256 gray levels, but computing texture features from a gray-level co-occurrence matrix requires far fewer than 256 levels, mainly because the cost of computing the co-occurrence matrix is determined by the number of gray levels and the size of the image. To reduce the amount of computation and speed up the analysis, the usual solution is to compress the gray levels of the image. When computing a gray-level co-occurrence matrix, the gray levels of the original image are therefore first compressed to a smaller range, generally 8 or 16 levels, provided that classification accuracy is not affected, which reduces the size of the co-occurrence matrix. The gray-level layering algorithm of the present invention is shown in Table 2.
Table 2. Gray-level layering algorithm
Here the channel1 grayscale image and the channel2 grayscale image are the grayscale images corresponding to the low 8 bits and the high 8 bits of the opcodes, respectively. After conversion by the above gray-level layering algorithm, a single grayscale image is obtained, labeled the channel0 grayscale image; its number of gray levels equals Del_graynum.
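The contents of Table 2 are not reproduced in this text, so the following is not the patent's gray-level layering algorithm. It is only a hedged sketch of one plausible way to merge the two 8-bit channels into a single image with `del_graynum` gray levels, assuming a simple linear rescaling over the known opcode range [0, 301]; the function name and the rescaling rule are assumptions.

```python
def quantize(low, high, del_graynum=16, max_value=301):
    """Merge two 8-bit channels and rescale to del_graynum gray levels.

    Assumed scheme (not taken from Table 2): rebuild the 16-bit value from
    its bytes, then map [0, max_value] linearly onto [0, del_graynum - 1].
    """
    merged = [(h << 8) | l for l, h in zip(low, high)]
    return [v * del_graynum // (max_value + 1) for v in merged]

channel0 = quantize([2, 1, 3, 45], [0, 0, 0, 1])
```

Whatever the exact rule, the point is that every pixel of the resulting channel0 image lies in [0, Del_graynum - 1], which bounds the co-occurrence matrix at Del_graynum x Del_graynum entries.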
Preferably, the texture features of the grayscale image are extracted using a gray-level co-occurrence matrix. Texture is an important attribute of an image and describes the spatial distribution pattern and arrangement of its pixels. The gray-level co-occurrence matrix is a widely used texture feature extraction method. It is defined as the probability that two pixels separated by distance d in direction θ have gray values i and j respectively, denoted P(i, j; d, θ), and it reflects combined information about the direction, magnitude of variation, and local neighborhood of the image gray levels. Because the image f in the present invention has height 1, θ = 0 is selected to compute the texture features of f in the horizontal direction. P(i, j; d, 0) is expressed as follows:
$$P(i,j;d,0) = \#\{((x,y),(m,n)) \in (L_r \times L_c) \times (L_r \times L_c) : |x-m| = 0,\ |y-n| = d,\ f(x,y) = i,\ f(m,n) = j\} \quad (1)$$
where # denotes the number of pixel pairs in image f that satisfy the conditions, L_r and L_c are the row and column dimensions of image f, f(x, y) = i means that the gray value at point (x, y) of f is i, and f(m, n) = j means that the gray value at point (m, n) of f is j. In addition, the normalized co-occurrence matrix is obtained as p(i, j; d, 0) = P(i, j; d, 0)/R, where R = 2 × L_r × (L_c − 1).
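For a height-1 image the horizontal co-occurrence matrix reduces to counting neighboring pairs in one row. The sketch below is an illustration only: the function names are invented, and the normalization constant is generalized to R = 2 × L_r × (L_c − d), which is an assumption that coincides with the R given in the text when d = 1.

```python
def glcm_horizontal(row, levels, d=1):
    """Symmetric co-occurrence counts P(i, j; d, 0) for a height-1 image."""
    P = [[0] * levels for _ in range(levels)]
    for y in range(len(row) - d):
        i, j = row[y], row[y + d]
        P[i][j] += 1  # pair (y, y + d)
        P[j][i] += 1  # pair (y + d, y): |y - n| = d holds in both directions
    return P

def normalize(P, n_cols, d=1, n_rows=1):
    """Divide counts by R = 2 * L_r * (L_c - d) so the entries sum to 1."""
    R = 2 * n_rows * (n_cols - d)
    return [[count / R for count in r] for r in P]

row = [0, 0, 1, 2, 1]                # a toy single-channel image of height 1
P = glcm_horizontal(row, levels=3)   # raw pair counts
p = normalize(P, n_cols=len(row))    # normalized co-occurrence matrix
```

With 16 gray levels the matrix has only 256 entries regardless of how many opcodes the file contains, which is why the gray-level reduction keeps this step cheap.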
Specifically, the present invention extracts 12 commonly used gray-level co-occurrence matrix features: angular second moment (ASM, Angular Second Moment), contrast (Contrast), correlation (Correlation), variance (Variance), inverse difference moment (IDM, Inverse Differential Moment), homogeneity (Homogeneity), dissimilarity (Dissimilarity), entropy (Entropy), sum average (Sum_Average), sum variance (Sum_Variance), sum entropy (Sum_Entropy), and difference entropy (Difference_Entropy). The features are computed as follows:
$$\mathrm{ASM} = \sum_i \sum_j [p(i,j)]^2 \quad (2)$$
$$\mathrm{Contrast} = \sum_i \sum_j (i-j)^2\, p(i,j) \quad (3)$$
$$\mathrm{Variance} = \sum_i \sum_j (i-\mu)^2\, p(i,j) \quad (5)$$
$$\mathrm{Dissimilarity} = \sum_i \sum_j |i-j|\, p(i,j) \quad (8)$$
$$\mathrm{Entropy} = -\sum_i \sum_j p(i,j) \log p(i,j) \quad (9)$$
where L is the number of gray levels, μ is the mean, μ_x, μ_y, σ_x, σ_y are the means and standard deviations of p_x and p_y respectively, and p_x, p_y, p_{x−y}, p_{x+y} are expressed as follows:
$$p_x(i) = \sum_j p(i,j), \qquad p_y(j) = \sum_i p(i,j)$$
$$p_{x+y}(k) = \sum_{i+j=k} p(i,j), \qquad p_{x-y}(k) = \sum_{|i-j|=k} p(i,j)$$
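As an illustration, the five features whose formulas (2), (3), (5), (8) and (9) appear in the text can be computed from a normalized co-occurrence matrix as sketched below. Two points are assumptions: μ is taken to be the mean gray level Σ i·p(i, j), and the entropy is taken with the conventional negative sign; the function name is also invented for the example.

```python
import math

def glcm_features(p):
    """Compute five texture features from a normalized co-occurrence matrix."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    mu = sum(i * v for i, _, v in cells)  # assumed definition of the mean
    return {
        "ASM": sum(v * v for _, _, v in cells),
        "Contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "Variance": sum((i - mu) ** 2 * v for i, _, v in cells),
        "Dissimilarity": sum(abs(i - j) * v for i, j, v in cells),
        # zero entries are skipped because p * log(p) tends to 0 as p tends to 0
        "Entropy": -sum(v * math.log(v) for _, _, v in cells if v > 0),
    }

p = [[0.25, 0.125, 0.0], [0.125, 0.0, 0.25], [0.0, 0.25, 0.0]]
features = glcm_features(p)
```

The remaining seven features would be added to the same dictionary to form the 12-dimensional vector.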
The computed features are combined into a 12-dimensional feature vector T. This vector is used as the essential feature vector of the opcodes and is fed to the classifier, which classifies the malicious code and determines the class to which it belongs.
Preferably, the present invention classifies the malicious code using the random forest algorithm. Random forest (RF) is an ensemble machine learning method that builds multiple decision trees using bootstrap random resampling and random node splitting, and obtains the final classification result by voting. RF can analyze features with complex interactions, is robust to noisy data and to data with missing values, and learns quickly. Its variable importance measure can serve as a feature selection tool for high-dimensional data, and in recent years it has been widely applied to classification, prediction, feature selection, and outlier detection problems.
A random forest is an ensemble classifier consisting of a set of decision tree classifiers {h(X, θ_k), k = 1, ..., K}, where {θ_k} are independent and identically distributed random vectors and K is the number of decision trees in the forest. For a given input X, each decision tree casts one vote, and the class with the most votes is chosen as the classification result for X.
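The two ensemble ingredients named above, bootstrap resampling and voting, can be sketched with the standard library alone. The function names and the toy inputs are assumptions for illustration; training the individual trees is omitted.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement; also return the out-of-bag set."""
    drawn = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = set(range(n_samples)) - set(drawn)
    return drawn, out_of_bag

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)                    # fixed seed, for repeatability only
drawn, oob = bootstrap_indices(10, rng)   # one bootstrap sample of 10 items
label = majority_vote([2, 0, 2, 1, 2])    # votes of five hypothetical trees
```

Each of the K trees would be trained on its own `drawn` sample, and the samples left in `oob` are the out-of-bag data for that tree.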
The random forest classifies malicious code as follows:
Step 1: K new bootstrap sample sets are drawn at random, with replacement, from the malicious code training set using the bootstrap method, and K decision trees are built from them; the samples not drawn each time form K sets of out-of-bag data;
Step 2: given n features, m (m <= n) candidate features are randomly selected at each node of each decision tree; the Gini index Gini_index of each candidate feature is computed, and the candidate feature with the smallest Gini index is used to split the node;
Step 3: a node in a decision tree stops growing when it contains samples of only one class or when the number of samples in the node is less than the minimum split threshold;
Step 4: the K decision trees form the random forest, which is used to classify new malicious code essential feature data; the classification result is determined by the number of votes cast by the decision trees;
The Gini index Gini_index of each candidate feature is computed as follows:
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{|y|} p_k^2$$
$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Gini}(D^v)$$
where Gini(D) is the Gini value of the data set D contained in a node of a decision tree before splitting; |y| is the number of classes in D and p_k is the proportion of D that belongs to the k-th class; a is any one of the m candidate features; D^v is the set {a = a fixed attribute value}; V is the total number of subsets into which feature a divides D according to its attribute values; Gini(D^v) is the Gini value of D^v; |D| is the number of samples in D; |D^v| is the number of samples in the set {a = a fixed attribute value}; and Gini_index(D, a) is the Gini index of feature a on the data set D.
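The split criterion of Step 2 can be illustrated as follows, assuming the standard definitions Gini(D) = 1 − Σ p_k² and Gini_index(D, a) = Σ |D^v|/|D| · Gini(D^v) and a feature with discrete attribute values, as in the symbol definitions above. The function names and toy data are assumptions; a continuous texture feature would first have to be thresholded.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini value of a set of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_index(feature_values, labels):
    """Weighted Gini value of the subsets D^v induced by a discrete feature."""
    subsets = defaultdict(list)
    for a, y in zip(feature_values, labels):
        subsets[a].append(y)
    n = len(labels)
    return sum(len(s) / n * gini(s) for s in subsets.values())

labels = [0, 0, 1, 1]
good = gini_index(["a", "a", "b", "b"], labels)  # separates the classes
poor = gini_index(["a", "b", "a", "b"], labels)  # leaves both subsets mixed
```

The feature with the smaller value (`good` here) would be the one selected to split the node.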
Table 3 gives the recognition results of the malicious code classification method based on image texture fingerprints of the present invention and of the SPAM-GIST malicious code classification method under k-fold cross-validation for different values of k. The malicious code data set used by the present invention comes from Microsoft's Microsoft Malware Classification Challenge project on Kaggle. The present invention selects 9929 malicious code binary files in 7 classes for the experiments; Table 2 gives basic information about the malicious code data set used.
Table 2. Malicious code data set
Malicious code class | Class number | Quantity |
Ramnit | 0 | 1513 |
Lollipop | 1 | 2470 |
Kelihos_ver3 | 2 | 2936 |
Vundo | 3 | 446 |
Kelihos_ver1 | 4 | 387 |
Obfuscator_ACY | 5 | 1166 |
Gatak | 6 | 1011 |
In the experiments of the present invention, the results are evaluated using k-fold cross-validation. In each experiment, the malicious code data set is divided into k equal parts; k−1 parts are used as the training set to train the random forest classifier, and the remaining part is used as the validation set to validate the classifier.
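The k-fold partition described above can be sketched as an index generator. Shuffling before splitting and the fixed seed are assumptions (the text does not say how the parts are formed), and the function name is invented for the example.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # assumed: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]

splits = list(k_fold_splits(n_samples=20, k=5))
```

Each sample appears in the validation set of exactly one fold and in the training set of the other k−1 folds.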
Specifically, the present invention uses four evaluation metrics, accuracy (Accuracy), macro-precision (macro_P), macro-recall (macro_R) and macro-F1 (macro_F1), to evaluate the classification performance of the random forest algorithm on malicious code. For a multi-class problem, each pairwise combination of classes corresponds to one confusion matrix; precision and recall are computed on each confusion matrix and denoted (P1, R1), (P2, R2), ..., (Pn, Rn), and their averages are then used to obtain macro-precision (macro_P), macro-recall (macro_R) and macro-F1 (macro_F1). The formulas for the individual metrics are as follows:
Here TP, FP, FN and TN denote, respectively, positive samples that the classifier identifies as positive, negative samples that the classifier identifies as positive, positive samples that the classifier identifies as negative, and negative samples that the classifier identifies as negative. P and R are the precision and recall of each confusion matrix.
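As a hedged illustration of these metrics, the sketch below computes them from a multi-class confusion matrix of counts. It assumes the common textbook definitions: P = TP / (TP + FP), R = TP / (TP + FN), accuracy as the fraction of correctly classified samples, macro_P and macro_R as the means of the per-class values, and macro_F1 = 2 * macro_P * macro_R / (macro_P + macro_R). It also assumes one class-versus-rest count set per class. The function name `macro_metrics` is hypothetical.

```python
def macro_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    precisions, recalls = [], []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, true class differs
        fn = sum(confusion[c]) - tp                       # true c, predicted otherwise
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    macro_p = sum(precisions) / n
    macro_r = sum(recalls) / n
    denom = macro_p + macro_r
    macro_f1 = 2 * macro_p * macro_r / denom if denom else 0.0
    return {
        "accuracy": correct / total,
        "macro_P": macro_p,
        "macro_R": macro_r,
        "macro_F1": macro_f1,
    }
```

Some libraries instead define macro-F1 as the mean of the per-class F1 scores; the two definitions generally give slightly different values.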
Specifically, in the malicious code classification method based on image texture fingerprints of the present invention, the gray level after transformation is set to 16, the random forest consists of 100 decision trees, and the minimum number of samples required to split a node in each tree is 2; in the SPAM-GIST malicious code classification method, K = 3 for the K-nearest neighbor (K-NearestNeighbor, KNN) classification algorithm. In the present invention, the two groups of experiments are compared by varying the k value of the cross-validation; ten experiments are run for each k value and their average is taken as the final result. Table 3 shows the recognition results of the malicious code classification method based on image texture fingerprints of the present invention and of the SPAM-GIST malicious code classification method under different k-fold cross-validation settings. Table 4 is the confusion matrix of the best of the ten experiments under 10-fold cross-validation for the SPAM-GIST malicious code classification method, and Table 5 is the confusion matrix of the best of the ten experiments under 10-fold cross-validation for the malicious code classification method based on image texture fingerprints of the present invention.
Table 3. Recognition results of the malicious code classification method based on image texture fingerprints and of the SPAM-GIST malicious code classification method
Table 4. Confusion matrix of the SPAM-GIST malicious code classification method
Class | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
0 | 0.894 | 0.019 | 0.001 | 0.011 | 0.003 | 0.028 | 0.044 |
1 | 0.011 | 0.960 | 0.017 | 0.004 | 0 | 0.001 | 0.007 |
2 | 0 | 0 | 0.999 | 0 | 0 | 0 | 0.001 |
3 | 0.009 | 0.002 | 0 | 0.982 | 0 | 0.004 | 0.003 |
4 | 0.008 | 0.008 | 0 | 0 | 0.933 | 0 | 0.051 |
5 | 0.054 | 0.019 | 0 | 0.008 | 0 | 0.910 | 0.009 |
6 | 0.015 | 0.010 | 0.011 | 0.012 | 0 | 0.010 | 0.942 |
Table 5. Confusion matrix of the malicious code classification method based on image texture fingerprints
Class | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
0 | 0.954 | 0.031 | 0.002 | 0.002 | 0.002 | 0.007 | 0.002 |
1 | 0.017 | 0.978 | 0 | 0 | 0 | 0.003 | 0.002 |
2 | 0 | 0 | 0.999 | 0 | 0 | 0 | 0.001 |
3 | 0.011 | 0.004 | 0.002 | 0.978 | 0.005 | 0 | 0 |
4 | 0.002 | 0 | 0 | 0 | 0.998 | 0 | 0 |
5 | 0.042 | 0.010 | 0 | 0 | 0 | 0.943 | 0.005 |
6 | 0.002 | 0.007 | 0.003 | 0 | 0 | 0.004 | 0.984 |
As can be seen from Table 3 above, in the different k-fold cross-validation experiments the recognition performance of the malicious code classification method based on image texture fingerprints of the present invention is higher than that of the SPAM-GIST malicious code classification method. That is, the method of the present invention identifies obfuscated malicious code better, which improves its adaptability and robustness.
Further, the malicious code classification method based on image texture fingerprints of the present invention and the SPAM-GIST malicious code classification method were compared for speed on the same data set. For the method of the present invention, the average feature extraction time is 67 ms, the random forest classification time is 3 ms, and the total time is 2.6 s; for the SPAM-GIST malicious code classification method, the average feature extraction time is 92 ms, the KNN classification time is 92 ms, and the total time is 3.1 s. The method of the present invention is therefore about 1.2 times as fast as the SPAM-GIST method (3.1 s / 2.6 s ≈ 1.2). This advantage in classification speed better meets the timing requirements that practical applications place on a classification algorithm and favors the adoption of the method in practice.
In addition, to verify the influence of different parameters on the experimental results of the malicious code classification method based on image texture fingerprints, the present invention evaluates the method while varying the gray level (GrayLevel) and the number of decision trees in the random forest (Num_Tree). Fig. 3 shows the relationship between the gray level GrayLevel and the classification accuracy of the method, and Fig. 4 shows the relationship between the number of decision trees Num_Tree and the classification accuracy of the method.
Fig. 3 shows the influence of different gray levels GrayLevel on the classification performance of the method of the present invention when the number of decision trees Num_Tree in the random forest is fixed at 100. As Fig. 3 shows, when GrayLevel varies between 8 and 24, the correct recognition rate of the method varies only within a small range around 97%; that is, changes in GrayLevel have little influence on the classification performance of the method.
Fig. 4 shows the influence of different numbers of decision trees Num_Tree on the classification performance of the method of the present invention when the gray level GrayLevel is fixed at 16. As Fig. 4 shows, when Num_Tree increases from 50 to 150, the classification accuracy of the method rises gradually; when Num_Tree increases from 150 to 500, the classification accuracy remains essentially unchanged.
It follows that the malicious code classification method based on image texture fingerprints of the present invention is itself little affected by its parameters, which improves the stability of the method as well as its adaptability and suitability for wider use.
In summary, the invention discloses a malicious code classification method based on image texture fingerprints. By combining image analysis techniques with malicious code classification techniques, the opcodes are converted to numerical values and mapped to an uncompressed two-channel grayscale image; the two-channel image is then converted to a single-channel grayscale image by a gray-level transformation method; the texture features of the image are extracted using the gray-level co-occurrence matrix and used as the essential features of the malicious code; finally, the random forest algorithm classifies the malicious code. The method transforms the gray levels of the opcode grayscale image to a very small numerical range, which keeps the gray-level co-occurrence matrix small, reduces the number of features used to represent the malicious code, and improves the classification speed. In addition, mapping the numericalized opcodes to a grayscale image effectively overcomes malicious code obfuscation such as opcode reordering and code transformation, which improves the classification accuracy.
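The gray-level reduction and co-occurrence step summarized above can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: it works on a single 8-bit channel (how the two channels are merged is not shown here), it uses uniform binning into 16 levels, a single pixel offset of one step to the right, a non-symmetric matrix, and base-2 entropy. The names `quantize`, `glcm` and `texture_features` are hypothetical, and only three of the twelve features are shown.

```python
import math


def quantize(image, levels=16):
    """Map 8-bit pixel values (0-255) onto `levels` uniform gray levels."""
    return [[pixel * levels // 256 for pixel in row] for row in image]


def glcm(image, levels=16, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for pixel pairs offset by (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    height, width = len(image), len(image[0])
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                counts[image[y][x]][image[ny][nx]] += 1
    total = sum(map(sum, counts)) or 1  # avoid dividing by zero for tiny images
    return [[count / total for count in row] for row in counts]


def texture_features(p):
    """Three common co-occurrence features of a normalized matrix p."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    return {
        "angular_second_moment": sum(v * v for _, _, v in cells),
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "entropy": -sum(v * math.log2(v) for _, _, v in cells if v > 0),
    }
```

With 16 levels the matrix has 16 x 16 = 256 cells instead of 256 x 256 = 65536 for an unquantized 8-bit image, which is the size reduction the summary refers to.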
The above are only embodiments of the present invention and do not limit the scope of the patent. Any equivalent structural transformation made using the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the present invention.
Claims (10)
1. A malicious code classification method based on image texture fingerprints, characterized by comprising the steps of:
Opcode numericalization: converting the opcodes of the malicious code into a numericalized file, and further converting the numericalized file into a binary file;
Two-channel grayscale image processing: mapping the binary file to generate two vectors, and visualizing the two vectors as a two-channel grayscale image;
Single-channel grayscale image processing: converting the two-channel grayscale image by gray-level transformation into a grayscale image with a fixed number of gray levels, and outputting a single-channel grayscale image;
Texture feature extraction: extracting the texture features of the image from the single-channel grayscale image using the gray-level co-occurrence matrix;
Malicious code classification: taking the texture features as the essential features of the malicious code, and classifying the texture features using the random forest algorithm.
2. The malicious code classification method based on image texture fingerprints according to claim 1, characterized in that, in the opcode numericalization, the numerical values in the numericalized file are converted into a 16-bit unsigned binary file; in the binary file, each binary value is further divided into two parts, wherein the first part comprises the low 8 bits and the second part comprises the high 8 bits.
3. The malicious code classification method based on image texture fingerprints according to claim 2, characterized in that, in the two-channel grayscale image processing, the binary file correspondingly generates two vectors, each element of which takes a value in the range [0, 255], wherein the first vector corresponds to the first part in the binary file and the second vector corresponds to the second part in the binary file; the first vector is then mapped to a first-channel grayscale image and the second vector to a second-channel grayscale image, and the mapped first-channel and second-channel grayscale images are saved as two PNG pictures.
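The low-byte and high-byte split described in claims 2 and 3 can be sketched as below. This is only an illustration: the function names `split_channels` and `to_rows`, the fixed image width, and the zero padding of the last row are assumptions, and writing the PNG files is left out.

```python
def split_channels(values):
    """Split 16-bit unsigned values into a low-byte vector and a high-byte vector."""
    for value in values:
        if not 0 <= value <= 0xFFFF:
            raise ValueError("expected 16-bit unsigned values")
    low = [value & 0xFF for value in values]            # first part: low 8 bits
    high = [(value >> 8) & 0xFF for value in values]    # second part: high 8 bits
    return low, high


def to_rows(vector, width):
    """Arrange a flat vector as image rows of `width` pixels (assumed zero padding)."""
    padded = vector + [0] * (-len(vector) % width)
    return [padded[i:i + width] for i in range(0, len(padded), width)]
```

Every element of both vectors lies in [0, 255], so each vector can be stored directly as an 8-bit grayscale channel.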
4. The malicious code classification method based on image texture fingerprints according to claim 3, characterized in that, in the single-channel grayscale image processing, a gray-level layering algorithm is used to apply the gray-level transformation to the first-channel grayscale image and the second-channel grayscale image.
5. The malicious code classification method based on image texture fingerprints according to claim 4, characterized in that, in the texture feature extraction, 12 texture features are extracted from the single-channel grayscale image using the gray-level co-occurrence matrix.
6. The malicious code classification method based on image texture fingerprints according to claim 5, wherein the 12 texture features comprise: angular second moment, contrast, correlation, covariance, inverse difference moment, homogeneity, dissimilarity, entropy, sum average, sum variance, sum entropy, and difference entropy.
7. The malicious code classification method based on image texture fingerprints according to claim 6, wherein the method by which the random forest algorithm classifies the texture features comprises:
Step 1: from the malicious code training sample set, randomly drawing K new bootstrap sample sets with replacement using the bootstrap method, and building K decision trees from them; the samples not drawn each time form K sets of out-of-bag data;
Step 2: given n features, randomly selecting m (m <= n) candidate features at each node of each decision tree, calculating the Gini index of each candidate feature, and splitting the node on the feature with the smallest Gini index among the m candidate features;
Step 3: in each decision tree, stopping growth at a node when the node contains only one class or the number of samples in the node is less than the minimum number of samples required for a split;
Step 4: combining the K generated decision trees into a random forest, and classifying new malicious code essential-feature data with the random forest, the classification result being determined by the number of votes cast by the decision trees;
wherein the Gini index is calculated as follows:
Wherein Gini(D) denotes the Gini value of the data set D contained in each node of each decision tree before splitting; |y| and pk are the number of classes in the data set D and the proportion of the total data set accounted for by each class; a is any one of the m candidate features; Dv denotes the set {a = fixed attribute value}; V denotes the total number of classes into which feature a can be divided according to its attribute values; Gini(Dv) denotes the Gini value of Dv; |D| denotes the number of samples in the data set D; |Dv| denotes the number of samples in the set {a = fixed attribute value}; and Gini_index(D, a) denotes the Gini index of the feature attribute a of the data set D.
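The sampling and voting parts of Steps 1, 2 and 4 can be sketched as follows; tree construction itself is omitted. The helper names are hypothetical, and two details are assumptions: each bootstrap set has the same size as the training set, and a tied vote is resolved in favor of the class seen first.

```python
import random
from collections import Counter


def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement; return them and the out-of-bag indices."""
    drawn = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(drawn))
    return drawn, out_of_bag


def candidate_features(n_features, m, rng):
    """Randomly choose m distinct candidate features (m <= n_features) for one node."""
    return rng.sample(range(n_features), m)


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]
```

For example, `majority_vote([1, 1, 2])` returns `1`, since two of the three trees vote for class 1.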
8. The malicious code classification method based on image texture fingerprints according to claim 4, characterized in that the single-channel grayscale image obtained after transformation by the gray-level layering algorithm has 16 gray levels.
9. The malicious code classification method based on image texture fingerprints according to claim 7, characterized in that the number of decision trees in the random forest takes a value in the range [100, 150].
10. The malicious code classification method based on image texture fingerprints according to claim 9, characterized in that the number of decision trees in the random forest is 100.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810207712.9A CN108416213A (en) | 2018-03-14 | 2018-03-14 | A kind of malicious code sorting technique based on image texture fingerprint |
CN201811187768.9A CN109241741B (en) | 2018-03-14 | 2018-10-12 | Malicious code classification method based on image texture fingerprints |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810207712.9A CN108416213A (en) | 2018-03-14 | 2018-03-14 | A kind of malicious code sorting technique based on image texture fingerprint |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108416213A true CN108416213A (en) | 2018-08-17 |
Family
ID=63131434
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810207712.9A Pending CN108416213A (en) | 2018-03-14 | 2018-03-14 | A kind of malicious code sorting technique based on image texture fingerprint |
CN201811187768.9A Active CN109241741B (en) | 2018-03-14 | 2018-10-12 | Malicious code classification method based on image texture fingerprints |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811187768.9A Active CN109241741B (en) | 2018-03-14 | 2018-10-12 | Malicious code classification method based on image texture fingerprints |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN108416213A (en) |
-
2018
- 2018-03-14 CN CN201810207712.9A patent/CN108416213A/en active Pending
- 2018-10-12 CN CN201811187768.9A patent/CN109241741B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN109241741B (en) | 2021-06-22 |
CN109241741A (en) | 2019-01-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180817 |