CN107657175A - Malware sample homology detection method based on image feature descriptors - Google Patents

Info

Publication number
CN107657175A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710835366.4A
Other languages
Chinese (zh)
Inventor
赵小林
薛静锋
李旭辉
王勇
张漪墁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201710835366.4A priority Critical patent/CN107657175A/en
Publication of CN107657175A publication Critical patent/CN107657175A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a malware sample homology detection method based on image feature descriptors. The method avoids interference from sample obfuscation and analyzes the homology of malicious files quickly, with high efficiency, high precision, strong robustness, and good extensibility. Data preprocessing is performed with a file visualization algorithm, which avoids the semantic-level interference introduced by decompilation or sandbox execution. Image feature extraction techniques are then applied to homology analysis: feature descriptors are extracted from the malicious program image, a family feature description library is built from these descriptors, and the library is used to compare and analyze the homology of unknown malicious programs. The image feature descriptors obtained by the image feature extraction algorithm are robust, and once the sample library has been built, analysis is efficient and easily extended.

Description

Malware sample homology detection method based on image feature descriptors
Technical field
The present invention relates to the technical field of network security, and in particular to a malware sample homology detection method based on image feature descriptors.
Background art
In recent years the amount of malware has grown continuously and the number of malicious code variants has risen sharply, while the cost of producing a variant keeps falling. Malicious code authoring groups put more and more effort into evading detection through slight modifications. How to analyze the type and homology of malicious samples quickly and effectively has long been a focus and a difficulty of network security research.
At present, homology analysis is mostly carried out with static or dynamic analysis, but efficiency is low. Existing homology analysis methods based on dynamic behavior capture can analyze the file operations, network behavior, and so on of malicious code fairly comprehensively, but their main drawbacks are high system overhead, weak scalability, and a relatively long analysis cycle. Static disassembly can obtain the API call sequence graph of malicious code, so that instruction similarity and function call similarity can be compared between samples. To some extent this avoids the high overhead and long analysis cycle of dynamic behavior analysis, and it has produced some results, but its analysis results are not accurate enough: the API call graph obtained by static decompilation of a malicious sample has thousands of nodes on average. Some studies prune useless nodes to improve running efficiency, but a large number of noise nodes still remain in the API call graph.
Others have used classification or clustering methods to label malicious samples automatically. Although this has achieved some results, many limitations remain; for example, classification or clustering methods need a large number of training samples, far exceeding the number of most samples and their variants.
Qu Wu et al. disclose a method and device for annotating malicious code (Chinese Patent Application No. CN201410142940.4), comprising: processing the portable executable (PE) file of the malicious code and obtaining, through an information entropy digest algorithm, the information digest signature, baseline annotation, and texture features of the malicious code; generating, according to the baseline annotation and the information digest signature, a texture feature set from the texture features that belong to the same malicious code family; generating first clusters from the texture feature set, merging the first clusters to generate second clusters, and deep-annotating the second clusters by combining the information digest signature with the deep name of the malicious code family. By applying baseline annotation and deep annotation to malicious code, and by using the information digest signature and the deep name of the malicious code family, that invention standardizes the annotation method for each malicious code family and improves the accuracy and generality of malicious code annotation.
Kang Fei et al. provide a malicious code homology analysis method based on behavioral feature similarity (Chinese Patent Application No. CN201510296976.2). It first extracts and quantitatively represents the behavioral features of malicious code on a binary instrumentation platform, then measures the similarity of behavioral features between different malicious codes, and uses that similarity to reflect the homology judgment. With that invention, homology analysis can be performed on malicious code collected from the network, which strongly supports subsequent tracing of the attack source. The method correctly reflects homology between malicious code samples and correctly distinguishes samples that are not homologous, and it provides important guidance and reference for malicious code homology analysis.
Jia Xiaoqi et al. propose a malicious code analysis and detection method based on dynamic semantic features (Chinese Patent No. CN201310682922.0), whose steps include: 1) running the code to be analyzed from the malicious sample library in a virtual environment, monitoring its execution, and extracting primitive features; 2) filtering out the API name information that represents the semantic features of the code; 3) building the set of API sequence semantic features that represents the semantic features of the code; 4) selecting representative semantic features to build a semantic feature library; 5) performing similarity detection between the semantic feature set of the code under test and the semantic feature library to obtain the detection result, that is, whether the code under test is benign or malicious. That invention can build different semantic features for different samples, has good universality, and proposes a method for selecting representative features that accurately represent the semantic features of the code, so that malicious code analysis and detection is more accurate and cheaper.
Among the above implementations, the texture features that the method of Qu Wu et al. extracts through the information entropy digest algorithm are not distinctive enough. The focus of that invention is dividing malicious code families with a clustering algorithm, which requires some manual intervention, and every newly emerging malicious sample has to be clustered together with all samples already in the library. It is therefore inefficient, suitable only for annotation and not for fast analysis of sample homology. The schemes of Kang Fei et al. and Jia Xiaoqi et al. are in essence analysis methods based on dynamic behavior capture, and they still struggle with the long analysis cycle and weak scalability inherent in dynamic analysis.
Summary of the invention
In view of this, the invention provides a malware sample homology detection method based on image feature descriptors, which avoids interference from sample obfuscation and analyzes the homology of malicious files quickly, with high efficiency, high precision, strong robustness, and good extensibility.
The malware sample homology detection method based on image feature descriptors of the present invention comprises the following steps:
Step 1: convert each binary malicious file in the sample set into a matrix using the B2M algorithm;
Step 2: treat each matrix as an image, and locate all feature points of each image using an image feature extraction method;
Step 3: extract a feature descriptor from the neighborhood of each feature point;
Step 4: take the set of feature descriptors of the malicious files in the sample set that belong to the same family as the feature description library of that family;
Step 5: extract the feature descriptors of the malicious file under test in the manner of steps 1 to 3, and compute the similarity between each of its feature descriptors and every feature descriptor in each family feature description library. If a computed similarity is greater than or equal to a set threshold, the feature descriptor of the file under test is considered to match that library descriptor. Among the family feature description libraries, the family with the largest number of matches against the feature descriptors of the file under test is the family to which the file belongs, which completes the homology detection of the malicious sample.
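The matching and voting of step 5 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the use of bit strings for descriptors, the default threshold of 0.9, and the choice to count each descriptor of the file under test at most once per family are all assumptions made for the example.

```python
from collections import Counter


def hamming_similarity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length bit strings agree."""
    if len(a) != len(b):
        raise ValueError("descriptors must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def detect_family(sample_descriptors, family_libraries, threshold=0.9):
    """Return (best_family, match_counts) for one file under test.

    family_libraries maps a family name to a list of descriptors.
    A descriptor of the file under test counts as one match for a family
    when its similarity to at least one library descriptor reaches the
    threshold (counting rule assumed for this sketch).
    """
    counts = Counter()
    for family, library in family_libraries.items():
        for descriptor in sample_descriptors:
            if any(hamming_similarity(descriptor, ref) >= threshold
                   for ref in library):
                counts[family] += 1
    # The family with the most matches wins; None when nothing matched.
    best = max(counts, key=counts.get) if counts else None
    return best, dict(counts)
```

For vector descriptors, `hamming_similarity` would be swapped for a vector measure such as cosine similarity, with the threshold chosen accordingly.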
Further, in step 1, the number of columns of the matrix is the length of a PE file section in the malicious file.
Further, in step 2, the image feature points are located using the SIFT algorithm, the SURF algorithm, or the FAST algorithm.
Further, in step 2, the SIFT algorithm is improved and the improved SIFT algorithm is used to locate the image feature points. The localization method comprises the following sub-steps:
Step 2.1: for each image, use the construction algorithm of SIFT to build a Gaussian pyramid of 3 groups with 6 layers per group;
Step 2.2: for the Gaussian pyramid of step 2.1, use the construction algorithm of SIFT to build a difference-of-Gaussian pyramid of 3 groups with 5 layers per group;
Step 2.3: for the difference-of-Gaussian pyramid of step 2.2, scan every point in the pyramid and retain the points that are the maximum or minimum within the 3 × 3 × 3 neighborhood centered on the point, forming an extreme point set;
Step 2.4: for the extreme point set of step 2.3, retain the points whose gray value differs from the gray values of the points in the 3 × 3 × 3 neighborhood by more than a threshold T, where T ≥ 0, which completes the localization of the feature points.
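Steps 2.3 and 2.4 can be sketched on one group of a difference-of-Gaussian pyramid as below. The sketch assumes the pyramid has already been built and is passed in as nested lists; the function name is hypothetical, extrema are taken to be strict, and step 2.4 is read as requiring the point to differ from every one of its 26 neighbors by more than T. The source does not pin these details down, so they are assumptions.

```python
def locate_feature_points(dog, T=0.0):
    """Find feature points in one difference-of-Gaussian group.

    dog is a list of layers, each a 2-D list of gray values of equal size.
    Returns (layer, row, col) tuples of points that are strict extrema of
    their 3 x 3 x 3 neighbourhood and differ from all neighbours by more
    than T.
    """
    points = []
    layers, rows, cols = len(dog), len(dog[0]), len(dog[0][0])
    for s in range(1, layers - 1):
        for r in range(1, rows - 1):
            for c in range(1, cols - 1):
                v = dog[s][r][c]
                neighbours = [
                    dog[s + ds][r + dr][c + dc]
                    for ds in (-1, 0, 1)
                    for dr in (-1, 0, 1)
                    for dc in (-1, 0, 1)
                    if (ds, dr, dc) != (0, 0, 0)
                ]
                # Step 2.3: keep only local maxima or minima.
                is_max = all(v > n for n in neighbours)
                is_min = all(v < n for n in neighbours)
                if not (is_max or is_min):
                    continue
                # Step 2.4: contrast check against threshold T.
                if min(abs(v - n) for n in neighbours) > T:
                    points.append((s, r, c))
    return points
```

With 5 layers per group, the scan covers the 3 interior layers of each group; border pixels are skipped because they lack a full neighborhood.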
Further, the neighborhood of a feature point in step 3 is the N × N region centered on that feature point, where N = 5 to 15. Feature descriptor extraction uses the following sub-steps:
Step 3.1: for each feature point located in step 2, collect in a fixed order the gray values of the pixels in the N × N neighborhood centered on the feature point, forming the ordered neighborhood gray value set of the feature point (G1, G2, G3, ...);
Step 3.2: for each ordered neighborhood gray value set obtained in step 3.1, compare gray value G1 with gray value G2: if G1 is greater than G2, then C1 = 1, otherwise C1 = 0. Compare G1 with G3: if G1 is greater than G3, then C2 = 1, otherwise C2 = 0. Continue in the same way until all gray values in the ordered set have been compared pairwise. This yields a binary sequence of length $m(m-1)/2$, where m denotes the number of gray values in the ordered set; this binary sequence is the feature descriptor of the feature point.
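The pairwise comparison of step 3.2 can be illustrated with a short sketch. The function name and the bit-string output format are assumptions; the comparison order (G1 against G2, G1 against G3, and so on, then G2 against G3) follows the order described in the text.

```python
from itertools import combinations


def binary_descriptor(gray_values):
    """Build the binary feature descriptor from an ordered gray value set.

    Every pair (Gi, Gj) with i < j contributes one bit: 1 if Gi > Gj,
    otherwise 0. A set of m values yields m * (m - 1) / 2 bits.
    """
    return "".join(
        "1" if a > b else "0" for a, b in combinations(gray_values, 2)
    )


# Three gray values give 3 comparisons: 5>3, 5>9, 3>9.
print(binary_descriptor([5, 3, 9]))
```

Because only the ordering of gray values matters, the descriptor is unchanged by any transformation of the byte values that preserves their relative order.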
Further, in step 3.1, 0/1 sampling weights are assigned to the N × N neighborhood of the feature point, namely:
In row Ln, where the feature point lies, the N consecutive points centered on the feature point have sampling weight 1;
In the row above, Ln-1, and the row below, Ln+1, the N-2 consecutive points centered on the column of the feature point have sampling weight 1;
In the second row above, Ln-2, and the second row below, Ln+2, the N-4 consecutive points centered on the column of the feature point have sampling weight 1;
All remaining points have sampling weight 0;
The gray values of the pixels with sampling weight 1 in the N × N neighborhood are collected in a fixed order to form the ordered neighborhood gray value set of the feature point.
Further, the neighborhood of a feature point in step 3 is the N × N region centered on that feature point, where N = 5 to 15. Feature descriptor extraction uses the following sub-steps:
Step 3.1: for each feature point located in step 2, collect the gray values of the pixels in the N × N neighborhood centered on the feature point and, using the gradient histogram statistics of the SIFT algorithm, compute the gradient magnitude and direction of each pixel in the neighborhood to obtain the gradient matrix of the feature point;
Step 3.2: for the gradient matrix of each feature point obtained in step 3.1, skip the coordinate axis rotation and apply the Gaussian window weighting of the SIFT algorithm, obtaining the weighted gradient matrix of each feature point;
Step 3.3: for the weighted gradient matrix of each feature point obtained in step 3.2, merge the N × N region into 4 × 4 subregions, compute the gradient histogram of each subregion, take each histogram as the vector of that subregion, and concatenate the vectors of all subregions of the feature point. The concatenated vector is the feature descriptor of the feature point.
Further, in step 4, the feature descriptors in the feature description library of each family are clustered. If the feature descriptors are binary sequences, the similarity between the descriptors in the family library is computed with the Hamming distance to perform the clustering. If the feature descriptors are vectors, the similarity is computed with the Euclidean distance, the standardized Euclidean distance, or the cosine of the included angle. After clustering, only the clusters whose number of points exceeds a threshold t are retained to form the final family feature description library, and the matching of step 5 is carried out against this final library. Here t is greater than or equal to half the number of samples used when extracting feature descriptors for that family.
Further, in step 5, if the feature descriptors are binary sequences, the similarity is computed with the Hamming distance; if the feature descriptors are vectors, the similarity is computed with the Euclidean distance, the standardized Euclidean distance, or the cosine of the included angle.
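The four measures named above are standard and can be written down directly. The sketch below is illustrative only: the function names are hypothetical, binary descriptors are assumed to be bit strings, and the per-dimension variances for the standardized Euclidean distance are assumed to be supplied by the caller (for example, estimated over a family library).

```python
import math


def hamming_distance(a: str, b: str) -> int:
    """Number of differing positions between two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("descriptors must have equal length")
    return sum(x != y for x, y in zip(a, b))


def euclidean_distance(u, v) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def standardized_euclidean_distance(u, v, variances) -> float:
    """Euclidean distance with each dimension scaled by its variance."""
    return math.sqrt(
        sum((x - y) ** 2 / s for x, y, s in zip(u, v, variances))
    )


def cosine_similarity(u, v) -> float:
    """Cosine of the included angle; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
```

The three distances decrease as descriptors become more alike, while the cosine increases, so a threshold test of the form "similarity greater than or equal to the threshold" needs the distances converted first, for instance as one minus the normalized Hamming distance.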
Beneficial effects:
(1) The present invention proposes a mapping between the field of malicious sample homology detection and the field of image recognition, provides the basis for performing homology detection with image recognition algorithms, and demonstrates its feasibility. Data preprocessing is performed with a file visualization algorithm, which avoids the semantic-level interference introduced by decompilation or sandbox execution. Image feature extraction techniques are then applied to homology analysis to extract the feature descriptors of the malicious program image, a family feature description library is built from these descriptors, and the library is used to compare and analyze the homology of unknown malicious programs. The invention is in essence a static analysis method, but it does not decompile the binary file, so it overcomes the interference from packing and other obfuscation techniques that affects static decompilation analysis, and its accuracy is higher. The image feature descriptors obtained by the image feature extraction algorithm are robust, and once the sample library has been built, analysis is efficient and easily extended.
(2) The SIFT algorithm is adapted to the characteristics of malicious files: the number of groups and layers of the Gaussian pyramid is reduced, the fitting of feature point positions is removed, and the main gradient computation and main gradient rotation during feature extraction are removed, which improves the computational efficiency and accuracy of the algorithm.
(3) Sampling weights are assigned to the points in the neighborhood of a feature point, so that rows farther from the feature point contribute fewer sampling points to the feature descriptor, which improves the resistance of the feature point to obfuscation.
(4) The extracted family feature descriptors are screened with the DBSCAN clustering algorithm, which removes outlier descriptors and the uncommon descriptors in small clusters, builds a higher-precision family feature description library, and improves the accuracy of homology detection.
(5) The invention does not need to extract an information digest signature of the malicious sample, and does not need post-decompilation steps such as word frequency statistics or symbol processing, which effectively improves efficiency. It does not need to run the sample in a sandbox, so it avoids sandbox detection and dormant attacks by the malicious program, and its analysis efficiency and scalability are far higher than those of dynamic analysis. Compared with other data mining algorithms, it does not need a large number of training samples, because each individual malicious sample can generate thousands of image feature descriptors.
Brief description of the drawings
Fig. 1 is a schematic flow diagram of the system of the present invention.
Fig. 2 is a schematic diagram of the difference-of-Gaussian pyramid.
Fig. 3 is a schematic diagram of the local comparison points.
Fig. 4 is a schematic diagram of the feature descriptor sampling points used by the present invention.
Fig. 5 is a flow chart of the present invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the accompanying drawings and embodiments.
The invention provides a malware sample homology detection method based on image feature descriptors. Binary malicious programs for which no source code is available are visualized, all feature points of each image are located with an image feature extraction method, and feature descriptors are extracted to generate image texture fingerprint information, so that the type and family of a malicious program can be analyzed accurately and efficiently. The invention relates to network security and has important applications in malware homology analysis.
Under existing homology detection techniques, many of the techniques used to implement malicious sample variants must be countered by complex technical means, and each interference factor needs a specific means to exclude it. The present invention studies the commonality between the techniques used to implement malicious sample variants and the interference factors of image recognition, and extracts the mapping between the two, as shown in Table 1. As Table 1 shows, once these techniques are mapped to the field of image recognition they become common interference factors. Image feature extraction techniques, which extract image feature descriptors, can therefore be used to extract the family feature library of malware, and the family feature library can then be used to compare and analyze unknown malicious programs. The invention makes full use of the advantages of image analysis techniques and improves analysis efficiency while avoiding interference from sample obfuscation.
Table 1
To obtain the texture features (that is, the feature descriptors) of malicious code, the malicious file is first mapped to a matrix with the B2M (Binary to Matrix, that is, mapping a file to a matrix) visual analysis method. Because the byte values of a malicious file lie in the range [0, 255], which is the same as the gray value range of a grayscale image, each matrix can be treated as a grayscale image. This avoids the reduced accuracy, time-consuming detection, and lower efficiency caused by the anti-detection measures that malicious files take against sandbox analysis, by APT (Advanced Persistent Threat) behavior, and by the evasion of decompilation analysis through packing, API sequence obfuscation, and similar means.
Then, for each grayscale image, the feature descriptors of the image are extracted with an image feature extraction method. These descriptors capture the context of the key code of the malicious file and serve as the basis for similarity comparison during homology detection.
The method specifically comprises the following steps:
1. Feature descriptor extraction from the malicious sample grayscale image
Extracting a feature descriptor requires the pixel information of the neighborhood of the feature point. The feature points of a malicious file are the pixels that reflect the features of the grayscale image produced by mapping the malicious sample; a malicious file (grayscale image) usually produces multiple feature points. A feature descriptor is an overall description of a feature point and its neighboring pixels; each feature point produces one feature descriptor, and the feature descriptors serve as the basis for distance comparison during malicious sample homology analysis. The information around a feature point of a malicious file consists of the byte fragments before and after the key byte code, or in the file sections above and below it, and extracting this information requires mapping the malicious file to a continuous texture image. The present invention first uses the B2M algorithm to map the malicious file byte by byte to image gray values, takes a fixed width, such as 512 bytes, the length of a PE file section, as the image width, and maps the whole binary file to an uncompressed PNG image, which removes the semantic-level obfuscation of the malicious code.
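The byte-to-gray-value mapping described above can be sketched as follows. The function name is hypothetical, and padding the final row with zeros is an assumption, since the source does not say how a trailing partial row is handled. Writing the matrix out as a PNG is omitted.

```python
def bytes_to_matrix(data: bytes, width: int = 512):
    """Map a binary file to a gray value matrix, one byte per pixel.

    Each row holds `width` consecutive bytes (values 0..255). The last
    row is padded with zeros when the file length is not a multiple of
    the width (padding policy assumed for this sketch).
    """
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row.extend([0] * (width - len(row)))
        rows.append(row)
    return rows


# Usage: matrix = bytes_to_matrix(open(path, "rb").read())
```

Since each byte maps to exactly one pixel and no parsing of the PE structure is involved, the mapping works the same way on packed or otherwise obfuscated files.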
The image feature points can then be located with an image feature extraction method such as the SIFT algorithm, the SURF algorithm, or the FAST algorithm.
This embodiment locates all feature points of each image with the SIFT algorithm. Specifically, the Gaussian pyramid and the difference-of-Gaussian pyramid are first built with the construction algorithm of SIFT. Because the texture of malicious sample images is not distinctive, this embodiment simplifies the SIFT algorithm: it uses a Gaussian pyramid of only 3 groups with 6 layers per group, which produces a difference-of-Gaussian pyramid of 3 groups with 5 layers per group, as shown in Fig. 2. As shown in Fig. 3, every point in the difference-of-Gaussian pyramid is scanned, and the points that are extrema (maxima or minima) within the local range centered on the point (the 3 × 3 × 3 neighborhood) are retained to form an extreme point set. Within this set, the points whose gray value differs from the gray values of the points in the 3 × 3 × 3 neighborhood by more than a threshold T, where T ≥ 0, are retained, which completes the localization of the feature points.
Next, the feature descriptors of the feature points are extracted.
In the feature descriptor extraction of SIFT and similar algorithms, a large amount of computing resources is spent on fitting the feature point positions, computing the main gradient of each feature point and its neighborhood, and rotating the points in the neighborhood relative to the main gradient; what is finally extracted is the gradient information of the neighborhood points. Images generated from malicious samples, however, keep a uniform scale (there is no pixel precision problem caused by object distance), and they cannot undergo rotation (rotating the arrangement of semantically adjacent binary content is meaningless). The present invention therefore simplifies the feature point localization and extraction of the SIFT algorithm, removing the position fitting, the main gradient computation, and the rotation of the neighborhood toward the main gradient from the descriptor extraction process.
Specifically, the present invention extracts feature descriptors from the neighborhood of a feature point with the improved SIFT method. The neighborhood of a feature point is the N × N region centered on that feature point, where N = 5 to 15. Feature descriptor extraction uses the following sub-steps:
(1) For each located feature point, collect the gray values of the pixels in the N × N neighborhood centered on the feature point and, using the gradient histogram statistics of the SIFT algorithm, compute the gradient magnitude and direction of each pixel in the neighborhood to obtain the gradient matrix of the feature point;
(2) For each gradient matrix obtained in (1), skip the coordinate axis rotation and apply the Gaussian window weighting of the SIFT algorithm to obtain the weighted gradient matrix;
(3) For each weighted gradient matrix obtained in (2), merge the N × N region into 4 × 4 subregions and compute the gradient histogram of each subregion, where every 45 degrees forms one histogram bin, so that each subregion produces 8 bins. Save the histogram as the vector of that subregion. One feature point produces 4 × 4 = 16 vectors, each of length 8; concatenate the 16 vectors of the feature point into a 16 × 8 = 128-dimensional vector, which completes the feature descriptor extraction for that feature point;
(4) Repeat (3) for every feature point of the sample to obtain all feature descriptors of the sample, which completes the feature descriptor extraction for the sample.
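Sub-steps (1) to (3) can be sketched for a single feature point as below. Several details are assumptions made only for this illustration and are not specified in the source: gradients are taken as central differences with clamping at the image border, the Gaussian window uses a standard deviation of N/2, pixel index i of the neighborhood is assigned to subregion i * 4 // N, N is taken to be odd so the neighborhood is centered, and the function name is hypothetical.

```python
import math


def gradient_descriptor(image, row, col, N=9):
    """128-dimensional gradient histogram descriptor of one feature point.

    image is a 2-D list of gray values; (row, col) is the feature point.
    No main-gradient rotation is applied, matching the simplified scheme.
    """
    half = N // 2
    sigma = N / 2.0  # Gaussian window width (assumed)
    rows, cols = len(image), len(image[0])

    def px(r, c):
        # Clamp coordinates so border pixels still have neighbours.
        r = min(max(r, 0), rows - 1)
        c = min(max(c, 0), cols - 1)
        return image[r][c]

    # 4 x 4 subregions, 8 orientation bins of 45 degrees each.
    hist = [[[0.0] * 8 for _ in range(4)] for _ in range(4)]
    for i in range(N):
        for j in range(N):
            r, c = row - half + i, col - half + j
            dx = px(r, c + 1) - px(r, c - 1)
            dy = px(r + 1, c) - px(r - 1, c)
            magnitude = math.hypot(dx, dy)
            angle = math.degrees(math.atan2(dy, dx)) % 360.0
            weight = math.exp(
                -((i - half) ** 2 + (j - half) ** 2) / (2 * sigma ** 2)
            )
            cell_r, cell_c = i * 4 // N, j * 4 // N
            hist[cell_r][cell_c][int(angle // 45) % 8] += weight * magnitude
    # Concatenate the 16 subregion histograms into one vector.
    return [v for cell_row in hist for cell in cell_row for v in cell]
```

Calling this for every located feature point of a sample yields the full descriptor set of that sample, as in sub-step (4).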
Alternatively, feature descriptors can be extracted from the neighborhood of a feature point with an improved BRIEF method. The neighborhood of a feature point is the N × N region centered on that feature point, where N = 5 to 15. Feature descriptor extraction uses the following sub-steps:
(1) For each located feature point, collect in a fixed order the gray values of the pixels in the N × N neighborhood centered on the feature point, where the order may be from top left to bottom right, from bottom left to top right, and so on, forming the ordered neighborhood gray value set of the feature point;
(2) For each ordered neighborhood gray value set collected in (1), compare gray value G1 with gray value G2: if G1 is greater than G2, then C1 = 1, otherwise C1 = 0. Compare G1 with G3: if G1 is greater than G3, then C2 = 1, otherwise C2 = 0. Continue until all pairs of gray values have been compared. This yields a binary sequence of length $m(m-1)/2$, where m denotes the number of gray values in the ordered set, which completes the feature descriptor extraction for that point;
(3) Repeat (2) for every ordered neighborhood gray value set from (1), which completes the feature descriptor extraction for the sample.
Due to malicious file mutation frequently with equivalent instruction replacement, shell adding, rubbish code insertion, controlling stream obscure, code Reset, register the mode such as redistributes and changes code syntax feature, it is believed that there is the malicious snippets of code of core feature It is stable, and more remote apart from the scope within the specific limits, code snippet is higher by the possibility of confounding effect, that is, gets over Close to characteristic point, value of the code snippet in identification is higher.Feature Descriptor in malicious code homology analysis field Sample weight distribution can be carried out to the point in feature vertex neighborhood, the row away from characteristic point, the sampled point of Feature Descriptor is reduced, The antialiasing ability of characteristic point can be improved.Wherein, often row sample weight is directly proportional to its sampled point number in region, using such as Lower sub- description is sampled:
Sub-description 1: in row L_n, where the feature point is located, N points are sampled, namely the N consecutive points centered on the feature point;
Sub-description 2: in the row above (L_{n-1}) and the row below (L_{n+1}) the feature point's row, N-2 points are sampled in each, namely the N-2 consecutive points centered on the column of the feature point;
Sub-description 3: in rows L_{n-2} and L_{n+2}, N-4 points are sampled in each, namely the N-4 consecutive points centered on the column of the feature point;
The pixel gray values of these sampling points are collected in a fixed order to form the ordered neighborhood gray-value set of the feature point, and feature descriptor extraction is then performed on that set.
Specifically, the feature descriptor samples the feature point itself; the 4 neighboring points on each side of it in code segment n, where the feature point is located; 7 neighboring points at the corresponding positions in each of code segments n-1 and n+1; and 5 neighboring points at the corresponding positions in each of code segments n-2 and n+2. This gives 9 + 7 + 7 + 5 + 5 = 33 points in total, as shown in Fig. 4. The gray-value comparison result of every point pair is then computed, forming a binary string of 33 × 32 / 2 = 528 bits.
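The row-dependent sampling pattern can be sketched as follows. The helper `sampling_offsets` is a hypothetical name introduced only for illustration; it assumes an odd N of at least 5 and a top-to-bottom, left-to-right collection order, which is just one of the fixed orders the text allows.

```python
def sampling_offsets(n):
    """Return (row_offset, col_offset) pairs for the weighted neighborhood:
    n points in the feature point's row, n - 2 in the rows directly above
    and below, and n - 4 in the rows two above and two below."""
    offsets = []
    for dr in (-2, -1, 0, 1, 2):
        width = n - 2 * abs(dr)   # number of points sampled in this row
        half = width // 2
        offsets.extend((dr, dc) for dc in range(-half, half + 1))
    return offsets

offsets = sampling_offsets(9)
num_points = len(offsets)                       # 5 + 7 + 9 + 7 + 5 = 33
num_bits = num_points * (num_points - 1) // 2   # 33 * 32 / 2 = 528
print(num_points, num_bits)
```

With N = 9 the pattern reproduces the 33 sampling points and the 528-bit binary string of the example above.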
2. Construction of the malicious sample family feature description library
Feature descriptors are extracted, family by family, from the labeled samples; each sample yields multiple feature descriptors. The set of feature descriptors of all samples in a family serves as the basis for constructing that family's feature description library.
To improve the accuracy of homology detection, a feature point that occurs in several samples of the family at once can be regarded as a shared image-feature gene of the family. Other feature points are specific to a single sample, so removing these sample-specific feature points is the key to improving the accuracy and efficiency of comparison. To improve the efficiency and accuracy of homology analysis of unknown samples, the present invention proposes a clustering-based method for constructing the family feature description library: a clustering algorithm finds the feature descriptor clusters that account for a large share of the descriptors, and the smaller clusters and outliers are removed. The points of any cluster whose size exceeds 1/2 of the number of malicious samples input for the family are regarded as highly similar common feature descriptors of the family and are retained; the other clusters and the outliers are feature descriptors specific to certain variants and are removed. The result is a family feature description library that retains multiple core feature descriptors.
If the feature descriptors are binary sequences, the similarity between the feature descriptors in the family feature description library is computed with the Hamming distance to perform the clustering; if the feature descriptors are vectors, the similarity is computed with the Euclidean distance, the standardized Euclidean distance, or the cosine of the included angle to perform the clustering.
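The library construction can be sketched for binary descriptors as below. The patent does not name a specific clustering algorithm, so the greedy leader clustering with a fixed `radius` used here is an assumption made purely for illustration, as are the function names and the toy data.

```python
def hamming(a, b):
    """Hamming distance between two equal-length binary sequences."""
    return sum(x != y for x, y in zip(a, b))

def leader_cluster(descriptors, radius):
    """Greedy clustering (assumed algorithm): a descriptor joins the first
    cluster whose leader is within `radius`, otherwise it starts a new one."""
    clusters = []
    for d in descriptors:
        for cluster in clusters:
            if hamming(d, cluster[0]) <= radius:
                cluster.append(d)
                break
        else:
            clusters.append([d])
    return clusters

def build_family_library(descriptors, num_samples, radius):
    """Keep only clusters with more than half as many points as the family
    has input samples; smaller clusters and outliers are dropped."""
    clusters = leader_cluster(descriptors, radius)
    kept = [c for c in clusters if len(c) > num_samples / 2]
    return [d for c in kept for d in c]

# Toy descriptors pooled from 4 samples of one family (assumed data).
pooled = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0), (1, 1, 1, 1)]
library = build_family_library(pooled, num_samples=4, radius=1)
print(library)
```

In this toy run the three near-identical descriptors form a cluster of size 3, which exceeds 4 / 2, so they are kept; the isolated descriptor is treated as variant-specific and removed.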
3. malice sample homology analysis
Once the malicious code family feature libraries have been built, the system has the basis for homology analysis of malicious samples and their variants. The homology analysis operation resembles image recognition: feature descriptors are compared to obtain the matching result between a sample and the feature libraries.
Feature point comparison requires extracting feature descriptors of the same type as those in the feature library, so the feature descriptors of a malicious sample are extracted with the malicious sample feature descriptor extraction method described above. The difference is that a single sample does not meet the conditions for extracting family common features, so no further cluster-analysis screening of the descriptors is performed after extraction.
Each of the feature descriptors extracted from the unknown sample is compared, by similarity distance, with the shared family descriptors in each family feature library. The threshold corresponds to the cluster radius of the clustering algorithm described above; when the distance is less than this threshold, the two points are considered to match. The sample family whose feature library has the largest number of points matching the unknown sample is taken as the homology analysis result for that sample.
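The matching step can be sketched as follows for binary descriptors. The function names, the family labels and the toy data are hypothetical, and counting a sample descriptor as matched when it is within the threshold of at least one library descriptor is one possible reading of "number of matched points".

```python
def hamming(a, b):
    """Hamming distance between two equal-length binary sequences."""
    return sum(x != y for x, y in zip(a, b))

def count_matches(sample_descriptors, library, threshold):
    """Count sample descriptors whose distance to at least one library
    descriptor is below the threshold (assumed matching rule)."""
    return sum(
        any(hamming(d, ref) < threshold for ref in library)
        for d in sample_descriptors
    )

def classify(sample_descriptors, libraries, threshold):
    """Return the family with the most matched descriptors and all counts."""
    counts = {family: count_matches(sample_descriptors, lib, threshold)
              for family, lib in libraries.items()}
    best = max(counts, key=counts.get)
    return best, counts

# Toy family libraries and an unknown sample (assumed data).
libraries = {
    "family_a": [(0, 0, 0, 0), (1, 1, 0, 0)],
    "family_b": [(1, 1, 1, 1)],
}
unknown = [(0, 0, 0, 1), (1, 1, 0, 1), (0, 1, 0, 0)]
best, counts = classify(unknown, libraries, threshold=2)
print(best, counts)
```

Here all three descriptors of the unknown sample lie within distance 1 of some descriptor in the library of `family_a`, while only one matches `family_b`, so `family_a` is reported. Ties between families are not addressed by the text; `max` simply returns the first family with the highest count.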
The focus of the present invention is to identify the commonality between the means by which malicious sample variants are produced and the interference factors in image recognition, to extract the mapping between the two, and to introduce image feature extraction techniques into the field of homology analysis; that is, image feature descriptors are used for homology analysis. Image feature extraction and clustering algorithms adapted to malicious sample analysis are used to extract the family feature libraries of malware, which are then used to analyze and compare unknown malicious programs. This exploits the advantages of image analysis techniques to improve analysis efficiency while avoiding interference from malicious sample obfuscation.
In summary, the above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent substitution, improvement or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (9)

1. A malicious sample homology detection method based on image feature descriptors, characterized by comprising the following steps:
Step 1: convert each binary malicious file in the sample set into a matrix using the B2M algorithm;
Step 2: treat each matrix as a picture, and locate all feature points of each picture using an image feature extraction method;
Step 3: extract feature descriptors in the neighborhood of each feature point;
Step 4: take the set of feature descriptors of the malicious files in the sample set that belong to the same family as the feature description library of that family;
Step 5: extract the feature descriptors of the malicious file to be detected in the manner of steps 1 to 3, and compute the similarity between each feature descriptor of the malicious file to be detected and every feature descriptor in each family feature description library; if the computed similarity is greater than or equal to a set threshold, the feature descriptor of the malicious file to be detected is considered to match that feature descriptor; among the family feature description libraries, the family with the largest number of feature descriptor matches with the malicious file to be detected is taken as the family to which the malicious file to be detected belongs, completing the malicious sample homology detection.
2. The malicious sample homology detection method based on image feature descriptors according to claim 1, characterized in that, in step 1, the total number of columns of the matrix is the length of a PE file section in the malicious file.
3. The malicious sample homology detection method based on image feature descriptors according to claim 1, characterized in that, in step 2, the feature points of the picture are located using the SIFT algorithm, the SURF algorithm or the FAST algorithm.
4. The malicious sample homology detection method based on image feature descriptors according to claim 1, characterized in that, in step 2, the SIFT algorithm is improved and the feature points of the picture are located using the improved SIFT algorithm, the locating method comprising the following sub-steps:
Step 2.1: for each picture, use the construction algorithm of the SIFT algorithm to build a Gaussian pyramid of 3 groups with 6 layers per group;
Step 2.2: for the Gaussian pyramid of step 2.1, use the construction algorithm of the SIFT algorithm to build a difference-of-Gaussian pyramid of 3 groups with 5 layers per group;
Step 2.3: for the difference-of-Gaussian pyramid of step 2.2, scan every point in the difference-of-Gaussian pyramid and retain the points that are the maximum or minimum within the 3 × 3 × 3 neighborhood centered on the point, forming an extreme point set;
Step 2.4: for the extreme point set of step 2.3, retain the points of the set whose gray value differs from the gray values of the points in the 3 × 3 × 3 neighborhood by more than a threshold T, completing the locating of the feature points, where T ≥ 0.
5. The malicious sample homology detection method based on image feature descriptors according to claim 1, characterized in that the neighborhood of a feature point in step 3 is the N × N region centered on that feature point, where N = 5 to 15, and feature descriptors are extracted using the following sub-steps:
Step 3.1: for each feature point located in step 2, collect, in a fixed order, the gray values of the pixels in the N × N neighborhood centered on that feature point, forming the ordered neighborhood gray-value set (G1, G2, G3, ...) of the feature point;
Step 3.2: for the ordered neighborhood gray-value set of each feature point obtained in step 3.1, compare gray value G1 with gray value G2: if G1 is greater than G2, then C1 = 1, otherwise C1 = 0; compare G1 with G3: if G1 is greater than G3, then C2 = 1, otherwise C2 = 0; and so on, until every pair of gray values in the ordered neighborhood gray-value set has been compared, yielding a binary sequence of length $N^2(N^2-1)/2$; this binary sequence is the feature descriptor of the feature point.
6. The malicious sample homology detection method based on image feature descriptors according to claim 5, characterized in that, in step 3.1, 0/1 sampling weights are assigned to the N × N neighborhood of the feature point, namely:
in row L_n, where the feature point is located, the sampling weight of the N consecutive points centered on the feature point is 1;
in the row above (L_{n-1}) and the row below (L_{n+1}) the feature point's row, the sampling weight of the N-2 consecutive points centered on the column of the feature point is 1;
in the rows two above (L_{n-2}) and two below (L_{n+2}) the feature point's row, the sampling weight of the N-4 consecutive points centered on the column of the feature point is 1;
the sampling weight of all remaining points is 0;
the gray values of the pixels with sampling weight 1 in the N × N neighborhood are collected in a fixed order, forming the ordered neighborhood gray-value set of the feature point.
7. The malicious sample homology detection method based on image feature descriptors according to claim 1, characterized in that the neighborhood of a feature point in step 3 is the N × N region centered on that feature point, where N = 5 to 15, and feature descriptors are extracted using the following sub-steps:
Step 3.1: for each feature point located in step 2, collect the gray values of the pixels in the N × N neighborhood centered on that feature point and, using the gradient histogram statistics of the SIFT algorithm, compute the gradient magnitude and direction of every pixel in the neighborhood to obtain the gradient matrix of the feature point;
Step 3.2: for the gradient matrix of each feature point obtained in step 3.1, without rotating the coordinate axes, apply the Gaussian window weighting of the SIFT algorithm to obtain the weighted gradient matrix of each feature point;
Step 3.3: for the weighted gradient matrix of each feature point obtained in step 3.2, merge the N × N region into 4 × 4 subregions, compute the gradient histogram of each of the 4 × 4 subregions, take each gradient histogram as the vector of the corresponding subregion of the feature point, and concatenate the vectors of all subregions of the feature point; the concatenated vector is the feature descriptor of the feature point.
8. The malicious sample homology detection method based on image feature descriptors according to claim 1, characterized in that, in step 4, the feature descriptors in the feature description library of each family are clustered, wherein, if the feature descriptors are binary sequences, the similarity between the feature descriptors in the family feature description library is computed with the Hamming distance to perform the clustering; if the feature descriptors are vectors, the similarity between the feature descriptors in the family feature description library is computed with the Euclidean distance, the standardized Euclidean distance or the cosine of the included angle to perform the clustering; after clustering, only the clusters whose number of points is greater than a threshold t are retained and used to construct the final family feature description library, and the matching of step 5 is performed against the final family feature description library; where t is greater than or equal to half the number of samples used when extracting feature descriptors for the family.
9. The malicious sample homology detection method based on image feature descriptors according to claim 1, characterized in that, in step 5, if the feature descriptors are binary sequences, the similarity is computed with the Hamming distance; if the feature descriptors are vectors, the similarity is computed with the Euclidean distance, the standardized Euclidean distance or the cosine of the included angle.
CN201710835366.4A 2017-09-15 2017-09-15 A kind of homologous detection method of malice sample based on image feature descriptor Pending CN107657175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710835366.4A CN107657175A (en) 2017-09-15 2017-09-15 A kind of homologous detection method of malice sample based on image feature descriptor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710835366.4A CN107657175A (en) 2017-09-15 2017-09-15 A kind of homologous detection method of malice sample based on image feature descriptor

Publications (1)

Publication Number Publication Date
CN107657175A true CN107657175A (en) 2018-02-02

Family

ID=61129571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710835366.4A Pending CN107657175A (en) 2017-09-15 2017-09-15 A kind of homologous detection method of malice sample based on image feature descriptor

Country Status (1)

Country Link
CN (1) CN107657175A (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463843A (en) * 2016-06-02 2017-12-12 重庆达特科技有限公司 Malicious code noise reduction big data detecting system
CN109117635A (en) * 2018-09-06 2019-01-01 腾讯科技(深圳)有限公司 Method for detecting virus, device, computer equipment and the storage medium of application program
CN110012000A (en) * 2019-03-29 2019-07-12 深圳市腾讯计算机系统有限公司 Order detection method, device, computer equipment and storage medium
CN110222511A (en) * 2019-06-21 2019-09-10 杭州安恒信息技术股份有限公司 The recognition methods of Malware family, device and electronic equipment
CN110543884A (en) * 2018-05-29 2019-12-06 国际关系学院 network attack organization tracing method based on image
CN110569629A (en) * 2019-09-10 2019-12-13 北京计算机技术及应用研究所 Binary code file tracing method
CN111027065A (en) * 2019-10-28 2020-04-17 哈尔滨安天科技集团股份有限公司 Lesovirus identification method and device, electronic equipment and storage medium
WO2020108760A1 (en) * 2018-11-29 2020-06-04 Huawei Technologies Co., Ltd. Apparatus and method for malware detection
CN111510449A (en) * 2020-04-10 2020-08-07 吴萌萌 Attack behavior mining method based on image big data and big data platform server
CN111770053A (en) * 2020-05-28 2020-10-13 江苏大学 Malicious program detection method based on improved clustering and self-similarity
CN112394935A (en) * 2020-11-30 2021-02-23 上海二三四五网络科技有限公司 Control method and device for realizing gray level setting in page
CN112472026A (en) * 2020-11-03 2021-03-12 黑龙江中医药大学 Novel medical internal medicine clinical diagnosis and treatment equipment and method
CN113362915A (en) * 2021-07-16 2021-09-07 上海大学 Material performance prediction method and system based on multi-modal learning
CN114186231A (en) * 2021-12-10 2022-03-15 中国电信股份有限公司 Method and system for detecting gambling APP and storage medium
CN114254317A (en) * 2021-11-29 2022-03-29 上海戎磐网络科技有限公司 Software processing method and device based on software gene and storage medium
CN115564970A (en) * 2022-09-20 2023-01-03 东华理工大学 Network attack tracing method, system and storage medium
CN115641177A (en) * 2022-10-20 2023-01-24 北京力尊信通科技股份有限公司 Prevent second and kill prejudgement system based on machine learning
CN116467607A (en) * 2023-03-28 2023-07-21 阿里巴巴(中国)有限公司 Information matching method and storage medium
US12124569B2 (en) 2019-03-29 2024-10-22 Tencent Technology (Shenzhen) Company Limited Command inspection method and apparatus, computer device, and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101604364A (en) * 2009-07-10 2009-12-16 珠海金山软件股份有限公司 Computer rogue program categorizing system and sorting technique based on file instruction sequence
CN101631023A (en) * 2009-07-31 2010-01-20 北京飞天诚信科技有限公司 Method for authenticating identity and system thereof
US20120159620A1 (en) * 2010-12-21 2012-06-21 Microsoft Corporation Scareware Detection
CN103077512A (en) * 2012-10-18 2013-05-01 北京工业大学 Feature extraction and matching method and device for digital image based on PCA (principal component analysis)
CN104268602A (en) * 2014-10-14 2015-01-07 大连理工大学 Shielded workpiece identifying method and device based on binary system feature matching
CN104574401A (en) * 2015-01-09 2015-04-29 北京环境特性研究所 Image registration method based on parallel line matching
CN104978522A (en) * 2014-04-10 2015-10-14 北京启明星辰信息安全技术有限公司 Method and device for detecting malicious code
CN105989287A (en) * 2015-12-30 2016-10-05 武汉安天信息技术有限责任公司 Method and system for judging homology of massive malicious samples
CN106203449A (en) * 2016-07-08 2016-12-07 大连大学 The approximation space clustering system of mobile cloud environment
CN106204660A (en) * 2016-07-26 2016-12-07 华中科技大学 A kind of Ground Target Tracking device of feature based coupling
CN107092829A (en) * 2017-04-21 2017-08-25 中国人民解放军国防科学技术大学 A kind of malicious code detecting method based on images match

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463843A (en) * 2016-06-02 2017-12-12 重庆达特科技有限公司 Malicious code noise reduction big data detecting system
CN110543884A (en) * 2018-05-29 2019-12-06 国际关系学院 network attack organization tracing method based on image
CN109117635A (en) * 2018-09-06 2019-01-01 腾讯科技(深圳)有限公司 Method for detecting virus, device, computer equipment and the storage medium of application program
WO2020108760A1 (en) * 2018-11-29 2020-06-04 Huawei Technologies Co., Ltd. Apparatus and method for malware detection
CN113015972A (en) * 2018-11-29 2021-06-22 华为技术有限公司 Malicious software detection device and method
CN110012000A (en) * 2019-03-29 2019-07-12 深圳市腾讯计算机系统有限公司 Order detection method, device, computer equipment and storage medium
US12124569B2 (en) 2019-03-29 2024-10-22 Tencent Technology (Shenzhen) Company Limited Command inspection method and apparatus, computer device, and storage medium
CN110222511A (en) * 2019-06-21 2019-09-10 杭州安恒信息技术股份有限公司 The recognition methods of Malware family, device and electronic equipment
CN110569629A (en) * 2019-09-10 2019-12-13 北京计算机技术及应用研究所 Binary code file tracing method
CN111027065A (en) * 2019-10-28 2020-04-17 哈尔滨安天科技集团股份有限公司 Lesovirus identification method and device, electronic equipment and storage medium
CN111027065B (en) * 2019-10-28 2023-09-08 安天科技集团股份有限公司 Leucavirus identification method and device, electronic equipment and storage medium
CN111510449A (en) * 2020-04-10 2020-08-07 吴萌萌 Attack behavior mining method based on image big data and big data platform server
CN111770053A (en) * 2020-05-28 2020-10-13 江苏大学 Malicious program detection method based on improved clustering and self-similarity
CN112472026A (en) * 2020-11-03 2021-03-12 黑龙江中医药大学 Novel medical internal medicine clinical diagnosis and treatment equipment and method
CN112394935A (en) * 2020-11-30 2021-02-23 上海二三四五网络科技有限公司 Control method and device for realizing gray level setting in page
CN113362915A (en) * 2021-07-16 2021-09-07 上海大学 Material performance prediction method and system based on multi-modal learning
CN114254317A (en) * 2021-11-29 2022-03-29 上海戎磐网络科技有限公司 Software processing method and device based on software gene and storage medium
CN114186231A (en) * 2021-12-10 2022-03-15 中国电信股份有限公司 Method and system for detecting gambling APP and storage medium
CN115564970A (en) * 2022-09-20 2023-01-03 东华理工大学 Network attack tracing method, system and storage medium
CN115641177A (en) * 2022-10-20 2023-01-24 北京力尊信通科技股份有限公司 Prevent second and kill prejudgement system based on machine learning
CN116467607A (en) * 2023-03-28 2023-07-21 阿里巴巴(中国)有限公司 Information matching method and storage medium
CN116467607B (en) * 2023-03-28 2024-03-01 阿里巴巴(中国)有限公司 Information matching method and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180202
