CN107657175A

CN107657175A - A malware sample homology detection method based on image feature descriptors

Info

Publication number: CN107657175A
Application number: CN201710835366.4A
Authority: CN
Inventors: 赵小林; 薛静锋; 李旭辉; 王勇; 张漪墁
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2017-09-15
Filing date: 2017-09-15
Publication date: 2018-02-02

Abstract

The invention discloses a malware sample homology detection method based on image feature descriptors. The method avoids interference from malware obfuscation and analyzes the homology of malicious files quickly, with high efficiency, high precision, strong robustness, and good scalability. The invention preprocesses data with a file visualization algorithm, which avoids the semantic-level interference introduced by decompiling files or running them in a sandbox. It then applies image feature extraction techniques to homology analysis: feature descriptors are extracted from the image of the malicious program, a family feature description library is built from these image feature descriptors, and the family feature library is used to analyze and compare the homology of unknown malicious programs. The image feature descriptors obtained by the image feature extraction algorithm are robust, and once the sample library is established, analysis is efficient and the library is easy to extend.

Description

A malware sample homology detection method based on image feature descriptors

Technical field

The present invention relates to the technical field of network security, and in particular to a malware sample homology detection method based on image feature descriptors.

Background art

In recent years the volume of malware has grown steadily, the number of malicious code variants has risen sharply, and the cost of producing a variant keeps falling. Malicious code authors put more and more effort into evading detection and into making small modifications. How to analyze the type and homology of malware samples quickly and effectively has long been a focus and a difficulty of network security research.

At present, homology analysis mostly relies on static or dynamic analysis, but efficiency is low. Existing malware homology analysis methods based on dynamic behavior capture can analyze the file operations, network behavior, and so on of malicious code fairly comprehensively, but their main drawbacks are high system overhead, weak scalability, and relatively long analysis cycles. Static disassembly can obtain the API call sequence graph of malicious code and compare the similarity of instruction information and of function calls between samples. To some extent this avoids the high overhead and long analysis cycle of dynamic behavior analysis, and it has produced some results, but the analysis is not accurate enough: the API call graph obtained by static decompilation of a malware sample has thousands of nodes on average. Some studies prune useless nodes to improve running efficiency, but a large number of noise nodes still remain in the API call graph.

Others have used classification or clustering methods to label malware samples automatically. Although some results have been achieved, many limitations remain; for example, classification or clustering methods require a large number of training samples, far more than are available for most samples and their variants.

Qu Wu et al. disclose a method and device for labeling malicious code (Chinese Patent Application No. CN201410142940.4), comprising: processing the portable executable (PE) file of the malicious code, and obtaining the message digest signature, the reference scale, and the texture features of the malicious code through an information entropy digest algorithm; generating, according to the reference scale and the message digest signature, a texture feature set from the texture features that belong to the same malicious code family; generating first clusters from the texture feature set, merging the first clusters to generate second clusters, and deep-labeling the second clusters by combining the message digest signature with the deep name of the malicious code family. By applying reference-scale and deep-scale labeling to malicious code and using the message digest signature and the deep name of the malicious code family, that invention standardizes the labeling of malicious code families and improves the accuracy and generality of malicious code labeling.

Kang Fei et al. provide a malicious code homology analysis method based on behavioral feature similarity (Chinese Patent Application No. CN201510296976.2). It first extracts and quantifies the behavioral features of malicious code on a binary instrumentation platform, then measures the similarity of behavioral features between different malicious code samples, and uses that similarity to decide homology. That invention can perform homology analysis on malicious code collected from the network and provides strong support for subsequent tracing of attack sources. The method correctly reflects homology between malicious code samples while correctly distinguishing samples that are not homologous, and it offers useful guidance and reference for malicious code homology analysis.

Jia Xiaoqi et al. propose a malicious code analysis and detection method based on dynamic semantic features (Chinese Patent No. CN201310682922.0), whose steps comprise: 1) running the code to be analyzed from the malware sample library dynamically in a virtual environment, monitoring its execution, and extracting raw features; 2) filtering out the API name information that represents the semantic features of the code; 3) building the set of API sequence semantic features that represents the semantic features of the code; 4) selecting representative semantic features to build a semantic feature library; 5) checking the similarity between the semantic feature set of the code under test and the semantic feature library to obtain the detection result, i.e., whether the code under test is benign or malicious. That invention can build different semantic features for different samples and so generalizes well; it also proposes a way to select representative features that accurately represent the semantic features of code, making the analysis and detection of malicious code more accurate and less costly.

Among the above approaches, the texture features that the method of Qu Wu et al. extracts with the information entropy digest algorithm are not distinctive enough. That invention focuses on dividing malicious code into families with a clustering algorithm, which requires some manual intervention, and every newly appearing malware sample must be clustered together with all samples in the existing library. It is therefore inefficient and suited only to labeling, not to fast analysis of sample homology. The schemes of Kang Fei et al. and of Jia Xiaoqi et al. are in essence analysis methods based on dynamic behavior capture, and they do not overcome the long analysis cycle and weak scalability inherent in dynamic analysis.

Summary of the invention

In view of this, the invention provides a malware sample homology detection method based on image feature descriptors. It avoids interference from malware obfuscation and analyzes the homology of malicious files quickly, with high efficiency, high precision, strong robustness, and good scalability.

The malware sample homology detection method based on image feature descriptors of the present invention comprises the following steps:

Step 1, each binary malicious file in the sample set is converted into a matrix using the B2M algorithm;

Step 2, each matrix is treated as a picture, and all feature points of each picture are located using an image feature extraction method;

Step 3, a feature descriptor is extracted in the neighborhood of each feature point;

Step 4, the set of feature descriptors of the malicious files in the sample set that belong to the same family is taken as the feature description library of that family;

Step 5, the feature descriptors of the malicious file to be detected are extracted as in steps 1 to 3, and the similarity between each of them and every feature descriptor in each family feature description library is computed. If a computed similarity is greater than or equal to a set threshold, the feature descriptor of the file to be detected is considered to match that library descriptor. The family whose feature description library has the most matches with the feature descriptors of the file to be detected is the family to which the file belongs, which completes the malware sample homology detection.

Further, in step 1, the number of matrix columns equals the length of a PE file section in the malicious file.

Further, in step 2, the picture feature points are located using the SIFT algorithm, the SURF algorithm, or the FAST algorithm.

Further, in step 2, the SIFT algorithm is modified, and the picture feature points are located with the modified SIFT algorithm. The localization method comprises the following sub-steps:

Step 2.1, for each picture, a Gaussian pyramid with 3 groups (octaves) of 6 layers each is built using the construction algorithm of SIFT;

Step 2.2, from the Gaussian pyramid of step 2.1, a difference-of-Gaussians (DoG) pyramid with 3 groups of 5 layers each is built using the construction algorithm of SIFT;

Step 2.3, every point in the DoG pyramid of step 2.2 is scanned, and the points that are the maximum or minimum within the 3 × 3 × 3 neighborhood centered on them are retained, forming an extreme point set;

Step 2.4, from the extreme point set of step 2.3, only the points whose gray value differs from the gray values in their 3 × 3 × 3 neighborhood by more than a threshold T are retained, which completes the localization of the feature points, where T ≥ 0.
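Steps 2.3 and 2.4 can be sketched together as a single scan over one group of the DoG pyramid. This is only an illustration: the function name `locate_feature_points` is hypothetical, and the reading of "differs by more than T" as "exceeds the largest neighbor, or falls below the smallest neighbor, by more than T" is an assumption, since the text does not define the comparison precisely.

```python
def locate_feature_points(dog_group, T=0.0):
    """Scan one DoG group (a list of 2-D layers of equal size).

    Returns (layer, row, col) for every interior point that is a 3x3x3
    maximum or minimum by a margin greater than T (assumed reading of
    step 2.4; T >= 0).
    """
    points = []
    layers = len(dog_group)
    height, width = len(dog_group[0]), len(dog_group[0][0])
    offsets = [(ds, dy, dx)
               for ds in (-1, 0, 1) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (ds, dy, dx) != (0, 0, 0)]
    for s in range(1, layers - 1):
        for y in range(1, height - 1):
            for x in range(1, width - 1):
                centre = dog_group[s][y][x]
                neighbours = [dog_group[s + ds][y + dy][x + dx]
                              for ds, dy, dx in offsets]
                is_max = centre - max(neighbours) > T
                is_min = min(neighbours) - centre > T
                if is_max or is_min:
                    points.append((s, y, x))
    return points


# Hypothetical input: three flat 5x5 layers with a single peak in the middle layer.
stack = [[[0.0] * 5 for _ in range(5)] for _ in range(3)]
stack[1][2][2] = 10.0
strong = locate_feature_points(stack, T=5)
none_left = locate_feature_points(stack, T=10)
```

Raising T keeps only extrema that stand out clearly from their surroundings, which is one plausible way to suppress weak texture responses.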

Further, the neighborhood of a feature point in step 3 is the N × N region centered on that feature point, where N = 5 to 15; feature descriptors are extracted using the following sub-steps:

Step 3.1, for each feature point located in step 2, the gray values of the pixels in the N × N neighborhood centered on the feature point are collected in a fixed order, forming the ordered gray-value set of the neighborhood (G₁, G₂, G₃, ...);

Step 3.2, for each ordered neighborhood gray-value set obtained in step 3.1, gray value G₁ is compared with gray value G₂: if G₁ is greater than G₂, then C₁ = 1, otherwise C₁ = 0. G₁ is then compared with G₃: if G₁ is greater than G₃, then C₂ = 1, otherwise C₂ = 0, and so on, until every pair of gray values in the ordered set has been compared. For a set of m gray values this yields a binary sequence of length $m(m-1)/2$, which is the feature descriptor of the feature point.
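The pairwise comparison of step 3.2 can be sketched in a few lines. The function name `binary_descriptor` and the sample gray values are hypothetical; the comparison order (G₁ against G₂, G₁ against G₃, and so on) follows the text.

```python
from itertools import combinations


def binary_descriptor(gray_values):
    """Compare every ordered pair (G_i, G_j) with i < j.

    Emits 1 when G_i > G_j and 0 otherwise, in the order
    (G1,G2), (G1,G3), ..., (G2,G3), ...
    """
    return [1 if a > b else 0 for a, b in combinations(gray_values, 2)]


# Hypothetical ordered gray-value set with four entries.
bits = binary_descriptor([12, 7, 30, 7])
```

With four gray values the result has 4 × 3 / 2 = 6 bits; equal values produce 0 because the comparison is strict.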

Further, in step 3.1, a 0/1 sampling weight is assigned to each point of the N × N neighborhood of the feature point, namely:

In the row $L_n$ containing the feature point, the N consecutive points centered on the feature point have sampling weight 1;

In the row above, $L_{n-1}$, and the row below, $L_{n+1}$, the N-2 consecutive points centered on the column of the feature point have sampling weight 1;

In the rows two above, $L_{n-2}$, and two below, $L_{n+2}$, the N-4 consecutive points centered on the column of the feature point have sampling weight 1;

All remaining points have sampling weight 0;

The gray values of the pixels with sampling weight 1 in the N × N neighborhood are collected in a fixed order, forming the ordered neighborhood gray-value set of the feature point.
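The 0/1 weight assignment above can be sketched as a mask over the N × N neighborhood. The names `sampling_mask` and `sample_neighbourhood` are hypothetical, and the sketch assumes N is odd so that the feature point sits exactly at the centre; the text does not state this explicitly.

```python
def sampling_mask(N):
    """Build the N x N 0/1 sampling-weight mask (N odd, N >= 5 assumed)."""
    if N < 5 or N % 2 == 0:
        raise ValueError("this sketch assumes an odd N of at least 5")
    centre = N // 2
    mask = [[0] * N for _ in range(N)]
    # (row offset from the feature-point row, number of sampled points)
    for offset, count in ((0, N), (1, N - 2), (2, N - 4)):
        half = count // 2
        for row in {centre - offset, centre + offset}:
            for col in range(centre - half, centre + half + 1):
                mask[row][col] = 1
    return mask


def sample_neighbourhood(neighbourhood, mask):
    """Collect weight-1 gray values in a fixed order (top-left to bottom-right)."""
    return [value
            for row_values, row_mask in zip(neighbourhood, mask)
            for value, keep in zip(row_values, row_mask) if keep]


mask5 = sampling_mask(5)
points_for_9 = sum(map(sum, sampling_mask(9)))
```

For N = 9 the mask selects 9 + 7 + 7 + 5 + 5 = 33 points; rows more than two away from the feature point contribute nothing.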

Alternatively, step 3 extracts feature descriptors using the following sub-steps:

Step 3.1, for each feature point located in step 2, the gray values of the pixels in the N × N neighborhood centered on the feature point are collected, and the gradient magnitude and direction of each pixel in the neighborhood are computed using the gradient histogram statistics of the SIFT algorithm, giving the gradient matrix of the feature point;

Step 3.2, the gradient matrix of each feature point obtained in step 3.1 is weighted with the Gaussian window of the SIFT algorithm, without rotating the coordinate axes, giving the weighted gradient matrix of each feature point;

Step 3.3, for the weighted gradient matrix of each feature point obtained in step 3.2, the N × N region is merged into 4 × 4 sub-regions, the gradient histogram of each sub-region is computed and taken as the vector of that sub-region, and the vectors of all sub-regions of the feature point are concatenated; the concatenated vector is the feature descriptor of the feature point.

Further, in step 4, the feature descriptors in the feature description library of each family are clustered. If the feature descriptors are binary sequences, the similarity between feature descriptors in the family feature description library is computed with the Hamming distance to perform the clustering; if the feature descriptors are vectors, the similarity is computed with the Euclidean distance, the standardized Euclidean distance, or the cosine of the included angle. After clustering, only the clusters containing more than t points are retained; they form the final family feature description library, which is used for the matching in step 5. Here t is greater than or equal to half the number of samples of the family used for feature descriptor extraction.

Further, in step 5, if the feature descriptors are binary sequences, the similarity is computed with the Hamming distance; if the feature descriptors are vectors, the similarity is computed with the Euclidean distance, the standardized Euclidean distance, or the cosine of the included angle.
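The four measures named above have standard definitions, sketched here with the standard library only. The function names are hypothetical, and the per-dimension variances passed to `standardized_euclidean` are assumed to be estimated elsewhere (for example over a family library); the text does not say how they are obtained.

```python
import math


def _check(a, b):
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")


def hamming(a, b):
    """Number of positions at which two binary sequences differ."""
    _check(a, b)
    return sum(x != y for x, y in zip(a, b))


def euclidean(a, b):
    _check(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardized_euclidean(a, b, variances):
    """Euclidean distance with each dimension scaled by its variance."""
    _check(a, b)
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    _check(a, b)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
```

Note that the first three are distances (smaller means more similar) while the cosine is a similarity (larger means more similar), so a threshold comparison has to point in the matching direction.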

Beneficial effects:

(1) The invention establishes a mapping between the field of malware sample homology detection and the field of image recognition, provides the basis for performing homology detection with image recognition algorithms, and demonstrates its feasibility. Data is preprocessed with a file visualization algorithm, which avoids the semantic-level interference introduced by decompiling files or running them in a sandbox. Image feature extraction techniques are then applied to homology analysis: feature descriptors are extracted from the image of the malicious program, a family feature description library is built from the image feature descriptors, and the family feature library is used to analyze and compare the homology of unknown malicious programs. The invention is in essence a static analysis method, but it does not decompile the binary file, so it overcomes the interference from packing and other obfuscation techniques that affects static decompilation analysis, and its accuracy is higher. The image feature descriptors obtained by the image feature extraction algorithm are robust, and once the sample library is established, analysis is efficient and the library is easy to extend.

(2) The SIFT algorithm is modified to suit the characteristics of malicious files: the number of groups and layers of the Gaussian pyramid is reduced, the fitting of feature point positions is dropped, and the main gradient computation and main gradient rotation during feature point extraction are dropped, which improves the computational efficiency and accuracy of the algorithm.

(3) Sampling weights are assigned to the points in the feature point neighborhood: the farther a row is from the feature point, the fewer sampling points it contributes to the feature descriptor, which improves the resistance of the feature point to obfuscation.

(4) The extracted family feature descriptors are screened with the DBSCAN clustering algorithm, which removes outlier feature descriptors and the uncommon feature descriptors in small clusters and builds a more refined family feature description library, improving the accuracy of homology detection.

(5) The invention does not need to extract message digest signatures from malware samples, nor does it need steps such as word frequency statistics or symbol processing after decompilation, so efficiency is effectively improved. It does not run samples in a sandbox, so it is unaffected by malicious programs that detect the sandbox environment or lie dormant, and its analysis efficiency and scalability are far higher than those of dynamic analysis. Unlike other data mining algorithms, it does not need a large number of training samples: each individual malware sample can generate thousands of image feature descriptors.

Brief description of the drawings

Fig. 1 is a schematic flow diagram of the system of the present invention.

Fig. 2 is a schematic diagram of the difference-of-Gaussians pyramid.

Fig. 3 is a schematic diagram of the local comparison points.

Fig. 4 is a schematic diagram of the feature descriptor sampling points used by the present invention.

Fig. 5 is a flow chart of the present invention.

Detailed description of the embodiments

The present invention will now be described in detail with reference to the accompanying drawings and examples.

The invention provides a malware sample homology detection method based on image feature descriptors. Binary malicious programs for which no source code is available are visualized, all feature points of each picture are located with an image feature extraction method, and feature descriptors are extracted to generate image texture fingerprint information, so that the type and family of a malicious program can be determined accurately and efficiently. The invention relates to network security and has important applications in malware homology analysis.

Against the background of existing homology detection techniques, many malware variant techniques can only be countered with complex technical means, and each interference factor requires its own specific means to exclude. The invention studies what malware variant techniques have in common with image recognition and its interference factors, and extracts the mapping between the two shown in Table 1. As Table 1 shows, once these techniques are mapped into the field of image recognition they become common interference factors. Image feature extraction techniques for extracting picture feature descriptors can therefore be used to extract a family feature library for malware, and the family feature library can then be used to analyze and compare unknown malicious programs. The invention takes full advantage of image analysis technology and improves analysis efficiency while avoiding interference from malware obfuscation.

Table 1

To obtain the texture features (i.e., feature descriptors) of malicious code, the malicious file is first mapped to a matrix with the B2M (Binary to Matrix, i.e., mapping a file to a matrix) visualization method. Because the byte values of a malicious file lie in the range [0, 255], the same as the intensity range of a grayscale image, each matrix can be treated as a grayscale image. This avoids the drop in accuracy, the long detection time, and the loss of efficiency caused by malicious files that use anti-detection and APT (Advanced Persistent Threat) behavior against sandbox analysis, or that use packing, API sequence obfuscation, and similar means to evade decompilation analysis.

Then, for each grayscale image, the feature descriptors of the picture, i.e., the context of key code in the malicious file, are extracted with an image feature extraction method and serve as the basis for similarity comparison during homology detection.

Specifically, the method comprises the following steps:

1. Feature descriptor extraction from the malware sample grayscale image

The pixel information of the feature point neighborhood is needed before feature descriptors can be extracted. The feature points of a malicious file are the pixels that reflect the features of the grayscale image produced by mapping the malware sample; a malicious file (grayscale image) usually produces multiple feature points. A feature descriptor is an overall description of a feature point and its neighboring pixels; each feature point produces one feature descriptor, and feature descriptors serve as the basis for distance comparison in malware sample homology analysis. The information around a feature point of a malicious file is the byte-code fragments before and after, or in the file sections above and below, the key byte code, and extracting it requires mapping the malicious file to a continuous texture image. The invention first uses the B2M algorithm to map the malicious file byte by byte to image gray values, using a fixed width such as the 512-byte length of a PE file section as the image width, and maps the whole binary file to an uncompressed PNG image, which removes obfuscation at the semantic level of the malicious code.
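The byte-to-gray-value mapping described above can be sketched as follows. The function name `b2m` is hypothetical, and zero-padding the final row is an assumption: the text fixes the width at 512 but does not say how a trailing partial row is handled.

```python
from typing import List


def b2m(data: bytes, width: int = 512, pad: int = 0) -> List[List[int]]:
    """Map raw bytes to a grayscale matrix, one byte per pixel, row by row."""
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        # Assumption: pad the last row so that every row has the same width.
        row.extend([pad] * (width - len(row)))
        rows.append(row)
    return rows


# Hypothetical five-byte "file" and a narrow width to keep the result readable.
matrix = b2m(bytes([0, 17, 255, 64, 128]), width=4)
```

In practice the input would be the full contents of the sample, for example read with `open(path, "rb").read()`, and the resulting matrix could be written out as an 8-bit grayscale image.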

Image feature extraction methods such as the SIFT, SURF, or FAST algorithms can then be used to locate the picture feature points.

This embodiment uses the SIFT algorithm to locate all feature points of each picture. Specifically, the Gaussian pyramid and the difference-of-Gaussians (DoG) pyramid are first built with the construction algorithm of SIFT. Because the texture of malware sample images is not pronounced, this embodiment simplifies the SIFT algorithm: it uses a Gaussian pyramid of only 3 groups with 6 layers each, which yields a DoG pyramid of 3 groups with 5 layers each, as shown in Fig. 2. As shown in Fig. 3, every point of the DoG pyramid is scanned, and the extreme points (maximum or minimum) within the local range centered on the point (its 3 × 3 × 3 neighborhood) are retained to form the extreme point set. Within this set, the points whose gray value differs from the gray values in the 3 × 3 × 3 neighborhood by more than a threshold T are retained, which completes the localization of the feature points, where T ≥ 0.
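A rough sketch of the 3-group, 6-layer Gaussian pyramid and the resulting 5-layer DoG pyramid is given below. It is not the reference SIFT construction: the base sigma of 1.6 and the scale step of 2^(1/3) are the conventional SIFT defaults and are assumptions here, every layer is blurred directly from the base image of its group rather than incrementally, and the names `blur` and `build_pyramids` are hypothetical.

```python
import math


def gaussian_kernel(sigma):
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(img, sigma):
    """Separable Gaussian blur; pixels outside the image are clamped to the edge."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    h, w = len(img), len(img[0])
    rows = [[sum(k[j + r] * img[y][min(max(x + j, 0), w - 1)]
                 for j in range(-r, r + 1)) for x in range(w)] for y in range(h)]
    return [[sum(k[j + r] * rows[min(max(y + j, 0), h - 1)][x]
                 for j in range(-r, r + 1)) for x in range(w)] for y in range(h)]


def build_pyramids(img, groups=3, layers=6, sigma0=1.6):
    """Return (gaussian, dog): `groups` groups of `layers` and `layers - 1` layers."""
    if layers < 4:
        raise ValueError("this sketch needs at least 4 layers per group")
    step = 2.0 ** (1.0 / (layers - 3))  # sigma doubles after layers - 3 steps
    gaussian, dog = [], []
    base = [[float(v) for v in row] for row in img]
    for _ in range(groups):
        group = [blur(base, sigma0 * step ** i) for i in range(layers)]
        gaussian.append(group)
        dog.append([[[b - a for a, b in zip(row_lo, row_hi)]
                     for row_lo, row_hi in zip(lo, hi)]
                    for lo, hi in zip(group, group[1:])])
        # Next group: halve the resolution of the layer with twice the base sigma.
        base = [row[::2] for row in group[layers - 3][::2]]
    return gaussian, dog


# Hypothetical 16 x 16 test image.
image = [[(x * y) % 256 for x in range(16)] for y in range(16)]
gaussian, dog = build_pyramids(image)
```

Each DoG layer is the difference of two adjacent Gaussian layers, which is why 6 Gaussian layers give 5 DoG layers per group.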

Then, the feature descriptors of the feature points are extracted.

In the feature descriptor extraction of SIFT and similar algorithms, a large amount of computation is spent on fitting refined feature point positions, on computing the main gradient of each feature point and its neighborhood, and on rotating the points of the neighborhood relative to the main gradient; what is finally extracted is the gradient information of the neighborhood points. Images produced from malware samples, however, keep a uniform scale (there is no pixel precision problem caused by object distance), and they cannot undergo rotation (rotating the arrangement of a semantically meaningful piece of binary content is meaningless). The invention therefore simplifies the feature point localization and extraction of SIFT, removing position fitting, main gradient computation, and rotation of the neighborhood toward the main gradient from the feature descriptor extraction process.

Specifically, the invention uses a modified SIFT method to extract feature descriptors in the neighborhood of each feature point, the neighborhood being the N × N region centered on the feature point, where N = 5 to 15. The following sub-steps are used:

(1) For every located feature point, the gray values of the pixels in the N × N neighborhood centered on the feature point are collected, and the gradient magnitude and direction of each pixel in the neighborhood are computed using the gradient histogram statistics of the SIFT algorithm, giving the gradient matrix of the feature point;

(2) Each gradient matrix obtained in (1) is weighted with the Gaussian window of the SIFT algorithm, without rotating the coordinate axes, giving the weighted gradient matrix;

(3) For each weighted gradient matrix obtained in (2), the N × N region is merged into 4 × 4 regions and the gradient histogram of each region is computed, with one histogram bin per 45 degrees, so each region produces 8 bins. The histogram is stored as the vector of that region. A feature point thus produces 4 × 4 = 16 vectors of length 8, which are concatenated into a 16 × 8 = 128-dimensional vector, completing the feature descriptor extraction for the feature point;

(4) Step (3) is repeated for every feature point of the sample to obtain all feature descriptors of the sample, completing the feature descriptor extraction for the sample.
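Sub-steps (1) to (3) can be sketched for a single feature point as below. Several details are assumptions, because the text leaves them open: gradients are taken as central differences with clamping at the patch border, the Gaussian window uses a sigma of half the patch width (the usual SIFT choice), a pixel is assigned to one of the 4 × 4 regions by integer scaling of its coordinates, and no normalization is applied. The function name is hypothetical.

```python
import math


def gradient_histogram_descriptor(patch, cells=4, bins=8):
    """Build a cells*cells*bins vector from an N x N gray-value patch, without rotation."""
    n = len(patch)
    centre = (n - 1) / 2.0
    sigma = n / 2.0  # assumed Gaussian window width
    hist = [[[0.0] * bins for _ in range(cells)] for _ in range(cells)]
    for y in range(n):
        for x in range(n):
            # (1) gradient magnitude and direction from clamped central differences
            dx = patch[y][min(x + 1, n - 1)] - patch[y][max(x - 1, 0)]
            dy = patch[min(y + 1, n - 1)][x] - patch[max(y - 1, 0)][x]
            magnitude = math.hypot(dx, dy)
            angle = math.degrees(math.atan2(dy, dx)) % 360.0
            # (2) Gaussian weighting around the feature point
            weight = math.exp(-((x - centre) ** 2 + (y - centre) ** 2)
                              / (2.0 * sigma * sigma))
            # (3) accumulate into the region's histogram, one bin per 45 degrees
            b = int(angle // (360.0 / bins)) % bins
            hist[y * cells // n][x * cells // n][b] += weight * magnitude
    return [value for row in hist for cell in row for value in cell]


# Hypothetical 8 x 8 patch whose brightness increases from left to right.
ramp = [[10 * x for x in range(8)] for _ in range(8)]
descriptor = gradient_histogram_descriptor(ramp)
```

For the ramp patch every gradient points along the positive x axis, so only the first bin of each of the 16 regions is populated in the 128-dimensional result.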

Alternatively, feature descriptors can be extracted in the neighborhood of each feature point with a modified BRIEF method, the neighborhood being the N × N region centered on the feature point, where N = 5 to 15. The following sub-steps are used:

(1) For every located feature point, the gray values of the pixels in the N × N neighborhood centered on the feature point are collected in a fixed order, for example from top-left to bottom-right or from bottom-left to top-right, forming the ordered gray-value set of the neighborhood;

(2) For each ordered neighborhood gray-value set collected in (1), gray value G₁ is compared with gray value G₂: if G₁ is greater than G₂, then C₁ = 1, otherwise C₁ = 0. G₁ is then compared with G₃: if G₁ is greater than G₃, then C₂ = 1, otherwise C₂ = 0, and so on, until every pair of gray values has been compared. For a set of m gray values this yields a binary sequence of length $m(m-1)/2$, completing the feature descriptor extraction for the point;

(3) Step (2) is repeated for each ordered neighborhood gray-value set from (1), completing the feature descriptor extraction for the sample.

Malicious file variants often change the syntactic features of code through equivalent instruction substitution, packing, junk code insertion, control flow obfuscation, code reordering, register reallocation, and similar means. It can be assumed that a malicious code fragment carrying core features is stable within a certain range, and that the farther a code fragment lies from that range, the more likely it is to be affected by obfuscation; in other words, the closer a code fragment is to the feature point, the more valuable it is for identification. In malicious code homology analysis, feature descriptors can therefore assign sampling weights to the points in the feature point neighborhood: the farther a row is from the feature point, the fewer sampling points it contributes, which improves the resistance of the feature point to obfuscation. The sampling weight of each row is proportional to the number of its sampling points in the region, and sampling follows the sub-descriptions below:

Sub-description 1: the row $L_n$ containing the feature point has N sampling points, namely the N consecutive points centered on the feature point;

Sub-description 2: the row above, $L_{n-1}$, and the row below, $L_{n+1}$, each have N-2 sampling points, namely the N-2 consecutive points centered on the column of the feature point;

Sub-description 3: rows $L_{n-2}$ and $L_{n+2}$ each have N-4 sampling points, namely the N-4 consecutive points centered on the column of the feature point;

The gray values of the above sampling points are collected in a fixed order to form the ordered neighborhood gray-value set of the feature point, from which the feature descriptor is extracted.

Specifically, the feature descriptor samples the feature point itself and the 4 neighboring points on each side of it in its code segment n (9 points), the 7 points at the corresponding positions in each of code segments n-1 and n+1, and the 5 points at the corresponding positions in each of code segments n-2 and n+2, for a total of 9 + 7 + 7 + 5 + 5 = 33 points, as shown in Fig. 4. The gray values of the point pairs are compared one by one, forming a binary string of $33 \times 32 / 2 = 528$ bits.
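Assuming every pair of the 33 sampled points is compared exactly once, as in the pairwise comparison scheme described for the binary descriptor, the point count and the descriptor length work out as:

```latex
\[
  \underbrace{9}_{\text{row } n}
  + \underbrace{7 + 7}_{\text{rows } n \pm 1}
  + \underbrace{5 + 5}_{\text{rows } n \pm 2}
  = 33,
  \qquad
  \binom{33}{2} = \frac{33 \times 32}{2} = 528 .
\]
```

So a descriptor built this way occupies 528 bits, or 66 bytes, per feature point.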

2. Construction of the malware family feature description library

Feature descriptors are extracted from the labeled samples family by family; each sample produces multiple feature descriptors. The set of feature descriptors of all samples in a family serves as the basis for building the feature description library of that family.

To improve the accuracy of homology detection, a feature point that is extracted from several samples of a family at once can be regarded as a shared image feature gene of the family. Other feature points are specific to a single sample, so removing sample-specific feature points is key to improving the accuracy and efficiency of comparison. To improve the efficiency and accuracy of homology analysis for unknown samples, the invention proposes a clustering-based method for building the family feature description library: a clustering algorithm finds the feature descriptor clusters that account for a large share of the descriptors and removes the smaller clusters and outliers. Clusters whose point count exceeds 1/2 of the number of malware samples input for the family are taken to contain highly similar feature descriptors common to the family, and are retained; other clusters and outliers are feature descriptors specific to particular variants and are removed. The result is a family feature description library that retains multiple core feature descriptors.

If the feature descriptors are binary sequences, the similarity between feature descriptors in the family feature description library is computed with the Hamming distance to perform the clustering; if the feature descriptors are vectors, the similarity is computed with the Euclidean distance, the standardized Euclidean distance, or the cosine of the included angle.
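The cluster-and-filter construction can be sketched with a small DBSCAN written against the Hamming distance. The function names, the `eps` and `min_pts` values, and the choice of t as exactly half the sample count are assumptions for illustration; the text only requires t to be at least half the number of samples in the family.

```python
from collections import Counter


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def dbscan(points, eps, min_pts, dist):
    """Plain DBSCAN. Returns one label per point: -1 for noise, else a cluster id."""
    labels = [None] * len(points)

    def region(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region(i)
        if len(neighbours) < min_pts:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = region(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)
    return labels


def build_family_library(descriptors, sample_count, eps, min_pts=2, dist=hamming):
    """Keep descriptors that fall in clusters with more than t = sample_count / 2 points."""
    labels = dbscan(descriptors, eps, min_pts, dist)
    sizes = Counter(label for label in labels if label >= 0)
    t = sample_count / 2.0
    return [d for d, label in zip(descriptors, labels)
            if label >= 0 and sizes[label] > t]


# Hypothetical 8-bit descriptors pooled from a family of 4 samples.
A, A1, A2 = [0] * 8, [0] * 7 + [1], [0] * 6 + [1, 0]
B, B1 = [1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 1]
C = [1, 0, 1, 0, 1, 0, 1, 0]
library = build_family_library([A, B, A1, C, A2, B1], sample_count=4, eps=1)
```

In this toy run the three A-type descriptors form a cluster larger than t = 2 and are kept, the two B-type descriptors form a cluster that is too small, and C is an outlier.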

3. Homology analysis of malicious samples

Once the malicious code family feature libraries are complete, the system has a basis for homology analysis of malicious samples and their variants. The homology analysis is similar to image recognition: feature descriptors must be compared to obtain the matching result between a sample and the feature libraries.

To compare feature points, feature descriptors of the same type as those in the feature library must be extracted, so the feature descriptors of the malicious sample are extracted with the malicious sample feature descriptor extraction method described above. Unlike the foregoing, a single sample does not meet the conditions for extracting features common to a family, so after the feature descriptors are extracted, no further cluster analysis or screening of the descriptors is needed.

The multiple feature descriptors extracted from the unknown sample are compared, by similarity distance, with the family's common descriptors in each family feature library. The threshold corresponds to the cluster radius of the clustering algorithm described above; when the similarity distance is below this threshold, the two points are considered matched. The sample family of the feature library with the largest number of points matched with the unknown sample is taken as the homology analysis result for the sample.
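The matching procedure above can be sketched as follows for binary descriptors. This is an illustrative sketch under stated assumptions: the names `classify`, `family_libraries`, and `threshold` are hypothetical, each descriptor of the unknown sample is counted at most once per family, and ties are resolved in favor of the family encountered first, none of which the text specifies.

```python
from typing import Dict, List, Optional, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(sample_descriptors: List[Sequence[int]],
             family_libraries: Dict[str, List[Sequence[int]]],
             threshold: int) -> Optional[str]:
    """Return the family whose library matches the most sample descriptors.

    A sample descriptor matches a library if its Hamming distance to at
    least one library descriptor is below the threshold. Returns None when
    no descriptor matches any family.
    """
    best_family, best_count = None, 0
    for family, library in family_libraries.items():
        count = sum(
            1 for d in sample_descriptors
            if any(hamming(d, ref) < threshold for ref in library)
        )
        if count > best_count:
            best_family, best_count = family, count
    return best_family
```

Returning `None` when nothing matches leaves room for an "unknown family" outcome, which is a design choice of this sketch rather than something stated in the text.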

The focus of the present invention is to identify the commonality between the techniques used to create malicious sample variants and the interference factors in image recognition, to extract the mapping between the two, and to introduce image feature extraction into the field of homology analysis. That is, image feature descriptors are used for homology analysis, and an image feature extraction algorithm and a clustering algorithm adapted to malicious sample analysis are used to extract the family feature library of the malware, against which unknown malicious programs are compared. By exploiting the advantages of image analysis, analysis efficiency is improved while interference from obfuscation of malicious samples is avoided.

In summary, the above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for detecting the homology of malicious samples based on image feature descriptors, characterized by comprising the following steps:

Step 2: treat each matrix as an image, and locate all feature points of each image using an image feature extraction method;

Step 4: take the set of feature descriptors of the malicious files in the sample set that belong to the same family as the feature description library of that family;

Step 5: extract the feature descriptors of the malicious file to be detected in the manner of steps 1 to 3, and compute the similarity between each feature descriptor of the malicious file to be detected and all feature descriptors in each family feature description library; if a computed similarity is greater than or equal to a set threshold, the feature descriptor of the malicious file to be detected is considered to match that feature descriptor; among the family feature description libraries, the family with the largest number of feature descriptor matches with the malicious file to be detected is taken as the family of the malicious file to be detected, completing the homology detection of the malicious sample.

2. The method for detecting the homology of malicious samples based on image feature descriptors according to claim 1, characterized in that, in step 1, the total number of columns of the matrix is the length of a PE file section in the malicious file.

3. The method for detecting the homology of malicious samples based on image feature descriptors according to claim 1, characterized in that, in step 2, the feature points of the image are located using the SIFT algorithm, the SURF algorithm, or the FAST algorithm.

4. The method for detecting the homology of malicious samples based on image feature descriptors according to claim 1, characterized in that, in step 2, the SIFT algorithm is improved and the improved SIFT algorithm is used to locate the feature points of the image, the locating method comprising the following sub-steps:

Step 2.1: for each image, use the construction algorithm of the SIFT algorithm to build a Gaussian pyramid of 3 groups with 6 layers per group;

Step 2.2: for the Gaussian pyramid of step 2.1, use the construction algorithm of the SIFT algorithm to build a difference-of-Gaussians pyramid of 3 groups with 5 layers per group;

Step 2.3: for the difference-of-Gaussians pyramid of step 2.2, scan every point in the pyramid and retain the points that are the maximum or minimum within the 3 × 3 × 3 neighborhood centered on them, forming an extreme point set;

Step 2.4: for the extreme point set of step 2.3, retain the points whose gray value differs from the gray values of the points in their 3 × 3 × 3 neighborhood by more than a threshold T, where T ≥ 0; this completes the locating of the feature points.
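Sub-steps 2.2 to 2.4 can be illustrated with the following sketch for a single group of the pyramid. It is a hedged illustration only: Gaussian blurring itself is omitted, the function names are hypothetical, and the contrast test of step 2.4 is read here as "the extremum exceeds every neighbor by more than T", which is one possible interpretation of the claim wording.

```python
from typing import List, Tuple

Layer = List[List[float]]  # one pyramid layer as rows of gray values


def difference_of_gaussians(gaussian_layers: List[Layer]) -> List[Layer]:
    """Subtract adjacent Gaussian layers: k layers give k - 1 DoG layers."""
    return [
        [[hi - lo for hi, lo in zip(row_hi, row_lo)]
         for row_hi, row_lo in zip(upper, lower)]
        for lower, upper in zip(gaussian_layers, gaussian_layers[1:])
    ]


def locate_feature_points(dog: List[Layer],
                          t: float = 0.0) -> List[Tuple[int, int, int]]:
    """Return (layer, row, col) of 3x3x3 extrema with a margin above t.

    Assumed reading of step 2.4: a point is kept if it is larger than all
    26 neighbors by more than t, or smaller than all of them by more than t.
    Border points without a full neighborhood are skipped.
    """
    points = []
    for s in range(1, len(dog) - 1):
        for r in range(1, len(dog[s]) - 1):
            for c in range(1, len(dog[s][r]) - 1):
                v = dog[s][r][c]
                neighbours = [
                    dog[s + ds][r + dr][c + dc]
                    for ds in (-1, 0, 1)
                    for dr in (-1, 0, 1)
                    for dc in (-1, 0, 1)
                    if (ds, dr, dc) != (0, 0, 0)
                ]
                if v - max(neighbours) > t or min(neighbours) - v > t:
                    points.append((s, r, c))
    return points
```

Six Gaussian layers per group produce five difference layers, which matches the layer counts in steps 2.1 and 2.2.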

5. The method for detecting the homology of malicious samples based on image feature descriptors according to claim 1, characterized in that the neighborhood of a feature point in step 3 is the N × N region centered on that feature point, where N = 5 to 15; the feature descriptors are extracted by the following sub-steps:

Step 3.1: for each feature point located in step 2, collect, in a fixed order, the pixel gray values within the N × N neighborhood centered on the feature point, forming the ordered neighborhood gray value set (G₁, G₂, G₃, …) of the feature point;

Step 3.2: for the ordered neighborhood gray value set of each feature point obtained in step 3.1, compare gray value G₁ with gray value G₂: if G₁ is greater than G₂, then C₁ = 1, otherwise C₁ = 0; compare G₁ with G₃: if G₁ is greater than G₃, then C₂ = 1, otherwise C₂ = 0; and so on, until all gray values in the ordered set have been compared pairwise, yielding a binary sequence with one bit per pairwise comparison; this binary sequence is the feature descriptor of the feature point.
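The pairwise comparison of sub-steps 3.1 and 3.2 can be sketched as follows. The function names are hypothetical, a row-major scan is assumed as the "fixed order", and the sketch assumes an odd N with the feature point at least N // 2 pixels away from the image border.

```python
from itertools import combinations
from typing import List, Sequence


def neighborhood_values(image: Sequence[Sequence[int]],
                        row: int, col: int, n: int) -> List[int]:
    """Collect the n x n neighborhood around (row, col) in row-major order."""
    if n % 2 == 0:
        raise ValueError("this sketch assumes an odd neighborhood size")
    half = n // 2
    return [image[r][c]
            for r in range(row - half, row + half + 1)
            for c in range(col - half, col + half + 1)]


def binary_descriptor(values: Sequence[int]) -> List[int]:
    """Compare every pair (G_i, G_j) with i < j; emit 1 if G_i > G_j, else 0.

    combinations() yields (G1, G2), (G1, G3), ..., (G2, G3), ..., which is
    the comparison order described in step 3.2.
    """
    return [1 if a > b else 0 for a, b in combinations(values, 2)]
```

For example, `binary_descriptor([3, 1, 2])` compares 3 with 1, 3 with 2, and 1 with 2, giving `[1, 1, 0]`. A set of m gray values produces m(m − 1)/2 bits, so a full 5 × 5 neighborhood yields 300 bits.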

6. The method for detecting the homology of malicious samples based on image feature descriptors according to claim 5, characterized in that, in step 3.1, 0/1 sampling weights are assigned to the N × N neighborhood of the feature point, namely:

in the row Lₙ₋₁ above and the row Lₙ₊₁ below the row containing the feature point, the N − 2 consecutive points centered on the column of the feature point have a sampling weight of 1;

in the row Lₙ₋₂ two rows above and the row Lₙ₊₂ two rows below the row containing the feature point, the N − 4 consecutive points centered on the column of the feature point have a sampling weight of 1;

all remaining points have a sampling weight of 0;

the pixel gray values whose sampling weight is 1 within the N × N neighborhood are collected in a fixed order to form the ordered neighborhood gray value set of the feature point.
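The 0/1 sampling weights listed above can be sketched as a mask. This sketch follows the wording of the claim literally: only the rows one and two above and below the feature point receive weights of 1, and every other point, including the feature point's own row, is treated as a "remaining point" with weight 0. The function names are hypothetical, and an odd N of at least 5 is assumed.

```python
from typing import List, Sequence


def sampling_mask(n: int) -> List[List[int]]:
    """Build the n x n 0/1 sampling weight mask centered on the feature point."""
    if n % 2 == 0 or n < 5:
        raise ValueError("this sketch assumes an odd n of at least 5")
    center = n // 2
    mask = [[0] * n for _ in range(n)]
    # (row offset from the feature point, number of sampled points in that row)
    for offset, width in ((1, n - 2), (2, n - 4)):
        half = width // 2
        for row in (center - offset, center + offset):
            for col in range(center - half, center + half + 1):
                mask[row][col] = 1
    return mask


def sampled_values(patch: Sequence[Sequence[int]]) -> List[int]:
    """Collect, in row-major order, the patch values whose weight is 1."""
    mask = sampling_mask(len(patch))
    return [v for patch_row, mask_row in zip(patch, mask)
            for v, w in zip(patch_row, mask_row) if w == 1]
```

For N = 5 the mask samples 1 + 3 + 3 + 1 = 8 points, so the resulting binary descriptor has 28 bits instead of the 300 bits produced by the full neighborhood.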

7. The method for detecting the homology of malicious samples based on image feature descriptors according to claim 1, characterized in that the neighborhood of a feature point in step 3 is the N × N region centered on that feature point, where N = 5 to 15; the feature descriptors are extracted by the following sub-steps:

Step 3.1: for each feature point located in step 2, collect the pixel gray values within the N × N neighborhood centered on the feature point and, using the gradient histogram statistics of the SIFT algorithm, compute the gradient magnitude and direction of each pixel in the neighborhood to obtain the gradient matrix of the feature point;

Step 3.2: for the gradient matrix of each feature point obtained in step 3.1, without rotating the coordinate axes, apply the Gaussian window of the SIFT algorithm as a weighting to obtain the weighted gradient matrix of each feature point;

Step 3.3: for the weighted gradient matrix of each feature point obtained in step 3.2, merge the N × N region into 4 × 4 subregions, compute the gradient histogram of each of the 4 × 4 subregions, and take each gradient histogram as the vector of that subregion of the feature point; concatenate the vectors of all subregions of the feature point, the resulting combined vector being the feature descriptor of the feature point.
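Sub-steps 3.1 to 3.3 can be sketched as follows for one N × N patch. Several details are assumptions not given in the claim: 8 orientation bins per subregion (the usual SIFT choice), a Gaussian window with standard deviation N / 2, central-difference gradients clamped at the patch border, and hard assignment of pixels to subregions and bins without interpolation. The name `gradient_descriptor` is hypothetical.

```python
import math
from typing import List, Sequence


def gradient_descriptor(patch: Sequence[Sequence[float]],
                        bins: int = 8) -> List[float]:
    """Concatenate Gaussian-weighted gradient histograms of 4 x 4 subregions."""
    n = len(patch)
    sigma = n / 2.0            # assumed width of the Gaussian window
    center = (n - 1) / 2.0
    hist = [[0.0] * bins for _ in range(16)]
    for r in range(n):
        for c in range(n):
            # Central differences, clamped at the patch border.
            dx = patch[r][min(c + 1, n - 1)] - patch[r][max(c - 1, 0)]
            dy = patch[min(r + 1, n - 1)][c] - patch[max(r - 1, 0)][c]
            magnitude = math.hypot(dx, dy)
            # No rotation to a dominant orientation: raw image axes are used.
            angle = math.atan2(dy, dx) % (2 * math.pi)
            dist2 = (r - center) ** 2 + (c - center) ** 2
            weight = math.exp(-dist2 / (2 * sigma ** 2))
            cell = (r * 4 // n) * 4 + (c * 4 // n)   # index of the subregion
            b = int(angle / (2 * math.pi) * bins) % bins
            hist[cell][b] += weight * magnitude
    return [v for cell_hist in hist for v in cell_hist]
```

With 16 subregions and 8 bins the descriptor has 128 components. Skipping the rotation step keeps the descriptor tied to the image axes, which is consistent with step 3.2.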

8. The method for detecting the homology of malicious samples based on image feature descriptors according to claim 1, characterized in that, in step 4, the feature descriptors in the feature description library of each family are clustered, wherein, if the feature descriptors are binary sequences, the Hamming distance is used to compute the similarity between the feature descriptors in the family feature description library to perform the clustering; if the feature descriptors are vectors, the Euclidean distance, the standardized Euclidean distance, or the cosine of the included angle is used to compute the similarity between the feature descriptors in the family feature description library to perform the clustering; after the clustering is complete, only the clusters whose number of points is greater than a threshold t are retained to form the final family feature description library, and the final family feature description library is used for the matching of step 5; where t is greater than or equal to half of the number of samples of the family used for feature descriptor extraction.

9. The method for detecting the homology of malicious samples based on image feature descriptors according to claim 1, characterized in that, in step 5, if the feature descriptors are binary sequences, the similarity is computed using the Hamming distance; if the feature descriptors are vectors, the similarity is computed using the Euclidean distance, the standardized Euclidean distance, or the cosine of the included angle.
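The three measures named for vector descriptors can be sketched as follows. The function names and the `variances` parameter are hypothetical; the per-dimension variances of the standardized Euclidean distance are assumed to be estimated beforehand, for example from the descriptors in a family library, and to be strictly positive.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardized_euclidean(a: Sequence[float], b: Sequence[float],
                           variances: Sequence[float]) -> float:
    """Euclidean distance with each dimension scaled by its variance."""
    return math.sqrt(sum((x - y) ** 2 / v
                         for x, y, v in zip(a, b, variances)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    if norm == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm
```

The two Euclidean variants are distances, so smaller values mean more similar descriptors, whereas a larger cosine means more similar descriptors. Any threshold comparison therefore has to be oriented according to the measure in use.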