CN113378163A - Android malicious software family classification method based on DEX file partition characteristics - Google Patents
- Publication number
- CN113378163A (application number CN202010162791.3A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/54—Extraction of image or video features relating to texture
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/56—Extraction of image or video features relating to colour
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides an Android malware family classification method based on DEX file section (partition) features. The method automatically extracts the DEX file from an Android malware sample, converts the DEX file into an RGB image and a text, and classifies the malware using the RGB image features and the text features. It mainly comprises the following steps: (1) extracting the DEX file of the Android malware; (2) converting the DEX file into an RGB image; (3) converting the DEX file into a plain text file; (4) extracting texture features of the RGB image; (5) extracting color features of the RGB image; (6) extracting text features of the plain text file; and (7) fusing the texture, color and text features by multi-kernel learning to classify the Android malware family.
Description
Technical Field
The invention provides an Android malware family classification method based on DEX file section (partition) features. The DEX file is visualized and textualized according to the features of its sections, that is, it is converted into an RGB image and a plain text respectively, and the RGB image features and text features are then extracted as the features of the Android malware sample. Finally, the Android malware is classified into families using a multi-feature fusion algorithm based on multi-kernel learning.
Background
Owing to its open-source nature, the Android system holds more than 85% of the mobile phone market. However, the rapid iteration of the Android system and the severe fragmentation caused by its open-source nature have led Android malware to spawn a large number of variants of the many existing malicious families, which poses no small challenge to Android malware family classification. Traditional static analysis methods are susceptible to obfuscation and hardening (packing), while dynamic analysis methods are costly in time and space. Newer visualization methods do not take the characteristics of Android malware into account, which causes serious feature loss.
Many visualization methods and image processing methods have been proposed for malware family classification, but most of them do not target Android malware families. Because files on the Android platform have their own characteristics relative to files on other platforms, many of these methods are unsuitable for Android malware family classification and lose features of the Android malware. In addition, many methods that do target the Android platform have low classification accuracy because of shortcomings in their visualization and image processing methods. To solve these problems, the invention provides a more accurate Android malware family classification method. The method fully analyzes and exploits the characteristics of the DEX file, the Android executable file: it converts the DEX file into an RGB image and a text by means of the features of the DEX file sections, and then extracts image features and text features respectively to classify the Android malware. Compared with dynamic and static analysis methods, the method has higher analysis efficiency and better resistance to interference. Compared with a grayscale image, an RGB image has color features in addition to texture features and can represent Android application software in more dimensions. Furthermore, text features are added alongside the image features, and combining the two makes Android malware family classification more accurate without affecting classification efficiency.
Disclosure of Invention
According to the method, the DEX file, which is the executable code file, is extracted by decompressing the Android installation package, and the header of the DEX file is then parsed to obtain sections with different functions. The DEX file is visualized and textualized using the features within each section and between sections, and is thereby converted into a more intuitive RGB image and text. Image features and text features are then extracted as the features of the DEX file. The advantages of the method are that it operates directly on the DEX file bytecode, which reduces the influence of obfuscation and hardening on the analysis, and that it combines image features and text features to classify Android malware families, which improves classification efficiency while preserving accuracy.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
FIG. 1 is a framework diagram of the Android malware family classification method based on DEX file section features provided by the present invention.
FIG. 2 is a flow chart of the DEX file visualization process provided by the present invention.
FIG. 3 is a framework diagram of the feature fusion algorithm based on multi-kernel learning provided by the present invention.
Detailed Description
In order to make the purpose, technical solution and advantages of the invention clearer, the invention is briefly described below with reference to the accompanying drawings. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The method for classifying Android malware families based on DEX file features comprises four main stages: DEX file extraction, DEX file processing, feature extraction, and learning-based classification. DEX file extraction means decompressing the Android installation package (APK) file and extracting the file with the .dex suffix from the decompressed folder. DEX file processing comprises visualization and textualization of the DEX file: after the DEX file is obtained, it is divided into sections by parsing its header area, and it is converted into an RGB image and a plain text file using the features of the divided sections. Feature extraction comprises extracting the image features of the RGB image and the text features of the plain text. Learning-based classification fuses the obtained image features and text features with a multi-kernel learning feature fusion algorithm and then classifies the sample. The specific implementation comprises the following seven steps.
Step one, extracting the DEX file: the Android installation package (APK) file is decompressed, the file with the .dex suffix is extracted from the decompressed folder, and the header of the DEX file is then parsed to obtain the 8 sections of the DEX file.
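As a non-limiting illustration, step one might be sketched in Python as follows. The description only states that parsing the header yields 8 sections; the sketch assumes these are the header, string_ids, type_ids, proto_ids, field_ids, method_ids, class_defs and data areas located through the standard 112-byte DEX header. The function names `extract_dex_files` and `parse_sections` are hypothetical.

```python
import struct
import zipfile

# Standard DEX header: 8-byte magic, uint32 checksum, 20-byte SHA-1
# signature, then 20 little-endian uint32 fields (112 bytes in total).
HEADER_FMT = "<8sI20s20I"

# Fixed item sizes (bytes) of the index sections in the DEX format.
ITEM_SIZES = {
    "string_ids": 4, "type_ids": 4, "proto_ids": 12,
    "field_ids": 8, "method_ids": 8, "class_defs": 32,
}

def extract_dex_files(apk):
    """Return {name: bytes} for every .dex entry of an APK (a ZIP archive)."""
    with zipfile.ZipFile(apk) as archive:
        return {name: archive.read(name)
                for name in archive.namelist() if name.endswith(".dex")}

def parse_sections(dex):
    """Map section name -> (offset, size in bytes) using the DEX header.

    The choice of these eight sections is an assumption of this sketch.
    """
    fields = struct.unpack_from(HEADER_FMT, dex, 0)
    if not fields[0].startswith(b"dex\n"):
        raise ValueError("not a DEX file")
    u = fields[3:]  # u[0]=file_size, u[1]=header_size, ..., u[19]=data_off
    sections = {"header": (0, u[1])}
    for i, name in enumerate(ITEM_SIZES):
        count, offset = u[6 + 2 * i], u[7 + 2 * i]  # (*_size, *_off) pairs
        sections[name] = (offset, count * ITEM_SIZES[name])
    sections["data"] = (u[19], u[18])  # (data_off, data_size)
    return sections
```

An APK may contain several DEX files (classes.dex, classes2.dex, and so on); the sketch returns all of them and leaves the choice of which to analyze to the caller.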
Step two, visualizing the DEX file: the DEX file bytecode is selected as one channel. Because the strings, variables, methods and classes differ between applications, the length and content of each section differ, so the ratio and the entropy of each section are selected as the other two channels. Combining the three channels gives a sequence of three-dimensional vectors whose length corresponds to the size of the DEX file. To introduce the file size as a feature, the matrix width is determined by the file size when the vector sequence is reshaped into a matrix. Finally, the matrix is converted into an RGB image.
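By way of example only, the three-channel construction of step two could look like the sketch below. Several details are assumptions that the description does not fix: the ratio is taken as section size divided by file size, the entropy is the Shannon entropy of the section bytes, both are scaled to 0..255, and the width thresholds in `WIDTH_TABLE` are illustrative values rather than values from the invention. The `sections` argument maps a section name to an (offset, size) pair.

```python
import math
from collections import Counter

# Hypothetical (file size limit in KiB, image width) pairs.
WIDTH_TABLE = [(10, 32), (30, 64), (60, 128), (100, 256),
               (200, 384), (500, 512), (1000, 768)]

def shannon_entropy(data):
    """Shannon entropy of a byte string, in bits per byte (0..8)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

def image_width(file_size):
    """Pick the matrix width from the file size."""
    kib = file_size / 1024
    for limit, width in WIDTH_TABLE:
        if kib < limit:
            return width
    return 1024

def dex_to_rgb(dex, sections):
    """Return rows of (R, G, B) pixels: byte value, section ratio, section entropy."""
    n = len(dex)
    green = [0] * n
    blue = [0] * n
    for offset, size in sections.values():
        chunk = dex[offset:offset + size]
        ratio = round(255 * len(chunk) / n)
        entropy = round(255 * shannon_entropy(chunk) / 8)
        for i in range(offset, offset + len(chunk)):
            green[i] = ratio
            blue[i] = entropy
    pixels = list(zip(dex, green, blue))
    width = image_width(n)
    pixels += [(0, 0, 0)] * ((-n) % width)  # pad the last row with black
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]
```

The rows can be written out with any image library; the sketch stops at the pixel matrix so that it needs only the standard library.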
Step three, textualizing the DEX file: the data section stores all strings related to the DEX file, including variable names, class names, method names and the like. The DEX file uses LEB128 encoding, in which each encoded value consists of 1 to 5 bytes that together represent a single value. Each byte has its most significant bit set, except for the last byte in the sequence, which has its most significant bit cleared; the remaining 7 bits of each byte are payload. These 7 payload bits correspond exactly to the range of the ASCII code table, so text information can be generated by extracting the low seven bits of each byte. During this process, the separators present in the DEX file format produce a large number of irrelevant symbols that would affect the subsequent text feature extraction. Therefore, a text filter is built according to the encoding range of the ASCII code table; the filter removes the irrelevant symbols and leaves only the plain text information related to the strings.
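For illustration, a minimal sketch of the LEB128 rule and of the seven-bit text filter follows. The printable range 0x20 to 0x7E and the `min_len` cutoff are assumptions of this sketch, since the description only says that the filter follows the ASCII code table; `decode_uleb128` and `dex_data_to_text` are hypothetical names.

```python
def decode_uleb128(buf, pos=0):
    """Decode one unsigned LEB128 value; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits are payload
        if not byte & 0x80:               # MSB cleared: last byte
            return result, pos
        shift += 7

def dex_data_to_text(data, min_len=3):
    """Keep the low 7 bits of each byte and filter to printable ASCII.

    Non-printable codes (separators, length prefixes, terminators) become
    line breaks; fragments shorter than min_len are dropped as noise.
    """
    chars = []
    for byte in data:
        code = byte & 0x7F
        chars.append(chr(code) if 0x20 <= code <= 0x7E else "\n")
    fragments = (part.strip() for part in "".join(chars).split("\n"))
    return [part for part in fragments if len(part) >= min_len]
```

For example, `dex_data_to_text(b"\x06onStop\x00\x0bgetDeviceId\x00")` yields `["onStop", "getDeviceId"]`. A length prefix that happens to fall in the printable range would stay attached to its string, which is a known limitation of so simple a filter.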
Step four, extracting texture features of the RGB image: texture features are extracted using the GIST algorithm. Texture features are global features that describe the surface properties of the scene corresponding to an image or image region. As statistical features, they often have rotation invariance and are robust to noise.
Step five, extracting color features of the RGB image: color moments are used to extract the color features of the RGB image. Color features describe the surface properties of an image or image region and are unaffected by rotation and translation of the image.
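As one possible reading of step five, the sketch below computes the first three color moments (mean, standard deviation, and the cube root of the third central moment) for each channel, giving a 9-dimensional vector. The description names color moments but not their order, so the use of exactly three moments is an assumption; `color_moments` is a hypothetical name.

```python
import math

def color_moments(pixels):
    """Return [mean, std, skew] for R, G and B of a flat list of (R, G, B)."""
    n = len(pixels)
    features = []
    for channel in range(3):
        values = [p[channel] for p in pixels]
        mean = sum(values) / n
        second = sum((v - mean) ** 2 for v in values) / n
        third = sum((v - mean) ** 3 for v in values) / n
        std = math.sqrt(second)
        # Signed cube root keeps the direction of the asymmetry.
        skew = math.copysign(abs(third) ** (1 / 3), third)
        features += [mean, std, skew]
    return features
```

Because the moments depend only on the multiset of pixel values and not on pixel positions, rearranging the pixels leaves the vector unchanged, which is consistent with the invariance to rotation and translation stated above.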
Step six, extracting text features: the text is segmented into words and a weight is calculated for each word; all keywords are sorted by weight, a certain number of keywords are selected, and the hash value of each keyword is computed with the MD5 hash algorithm. Following the flow of the Simhash algorithm, the weight of each keyword is applied to its hash value, with the sign set according to each bit (positive for a 1 bit, negative for a 0 bit). Finally, the resulting arrays of all keywords are accumulated to obtain the text features.
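A minimal sketch of step six, under stated assumptions, is given below. The description does not say how words are segmented or weighted, so the sketch splits on identifier-like tokens and uses the term frequency as the weight; the `top_k` limit and the names `simhash_vector` and `fingerprint` are likewise hypothetical.

```python
import hashlib
import re
from collections import Counter

def simhash_vector(text, top_k=50, bits=128):
    """Accumulate signed keyword weights over the bits of each MD5 hash."""
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)
    weights = Counter(words)              # assumed weight: term frequency
    acc = [0] * bits
    for word, weight in weights.most_common(top_k):
        digest = hashlib.md5(word.encode("utf-8")).digest()
        value = int.from_bytes(digest, "big")
        for i in range(bits):
            # A 1 bit adds the weight, a 0 bit subtracts it.
            acc[i] += weight if (value >> i) & 1 else -weight
    return acc

def fingerprint(acc):
    """Optional classic Simhash step: keep only the sign of each position."""
    return sum(1 << i for i, v in enumerate(acc) if v > 0)
```

The accumulated array itself is returned as the text feature, in line with the last sentence of step six; `fingerprint` shows the binarization used by the classic Simhash algorithm for comparison.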
Step seven, classifying with the multi-kernel learning feature fusion algorithm: an optimal kernel function is selected for each of the texture features, color features and text features; the kernel functions are then linearly combined, and their weights are updated through repeated iteration; the weights under the optimal condition are used to construct the kernel matrix and thereby the classifier; finally, the classifier is used for classification.
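The linear combination of kernels in step seven can be illustrated with the simplified sketch below. It is not the iterative weight update of the invention: in its place, the sketch sets each kernel weight from a one-shot kernel-target alignment score, a swapped-in heuristic chosen only to keep the example short. The RBF kernel, the `gamma` value and all function names are assumptions.

```python
import math

def rbf_kernel(X, gamma=1.0):
    """Gram matrix of the RBF kernel for a list of feature vectors."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
             for y in X] for x in X]

def frobenius(A, B):
    """Frobenius inner product of two equally sized matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def alignment(K, labels):
    """Kernel-target alignment: similarity of K to the ideal label kernel."""
    Y = [[1.0 if a == b else -1.0 for b in labels] for a in labels]
    return frobenius(K, Y) / math.sqrt(frobenius(K, K) * frobenius(Y, Y))

def combine_kernels(kernels, labels):
    """Return (weights, combined kernel) with non-negative weights summing to 1."""
    scores = [max(alignment(K, labels), 0.0) for K in kernels]
    total = sum(scores)
    if total == 0:
        betas = [1.0 / len(kernels)] * len(kernels)  # fall back to uniform
    else:
        betas = [s / total for s in scores]
    n = len(labels)
    combined = [[sum(b * K[i][j] for b, K in zip(betas, kernels))
                 for j in range(n)] for i in range(n)]
    return betas, combined
```

In use, one Gram matrix would be computed per feature type (texture, color, text), and the combined matrix could then be passed to any kernel classifier that accepts a precomputed kernel, such as a support vector machine.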
Claims (8)
1. An Android malware family classification method based on DEX file partition characteristics, characterized by comprising the following steps:
Step one: extracting the DEX file of the Android malware
Step two: converting the DEX file into an RGB image
Step three: converting the DEX file into a plain text file
Step four: extracting texture features of the RGB image
Step five: extracting color features of the RGB image
Step six: extracting text features of the plain text file
Step seven: fusing the texture features, color features and text features by multi-kernel learning to classify the Android malware family
The Android malware family classification method based on DEX file partition characteristics is characterized in that, drawing on an interdisciplinary approach, the Android malware is converted into an RGB image and a text file, so that advanced techniques from related fields can be applied to meet the needs of this field.
2. The Android malware family classification method based on DEX file partition characteristics according to claim 1, characterized in that, when the DEX file is visualized and textualized in step two and step three, the DEX file is operated on directly, so that the API of the application software does not need to be extracted and the influence of hardening and obfuscation on the analysis is reduced.
3. The Android malware family classification method based on DEX file partition characteristics according to claim 1, characterized in that, when the DEX file is visualized in step two, the features of the DEX sections are fully used to convert the DEX file into an RGB image, which carries more features than a conventional grayscale image and helps improve classification accuracy.
4. The Android malware family classification method based on DEX file partition characteristics according to claim 1, characterized in that the DEX file is converted into a plain text file in step three.
5. The Android malware family classification method based on DEX file partition characteristics according to claim 1, characterized in that, when the texture features of the RGB image are extracted in step four, the texture features are global features that describe the surface properties of the scene corresponding to an image or image region and, as statistical features, often have rotation invariance and strong resistance to noise.
6. The Android malware family classification method based on DEX file partition characteristics according to claim 1, characterized in that, when the color features of the RGB image are extracted in step five, the color features describe the surface properties of an image or image region and are not affected by rotation and translation of the image.
7. The Android malware family classification method based on DEX file partition characteristics according to claim 1, characterized in that, when the text features are extracted in step six, the improved Simhash algorithm preserves data similarity and can be used for similarity calculation over massive amounts of text.
8. The Android malware family classification method based on DEX file partition characteristics according to claim 1, characterized in that, in the classification of step seven, the classification accuracy can be improved by using the multi-kernel learning classification method.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310412232.7A CN116383819A (en) | 2020-03-10 | 2020-03-10 | Android malicious software family classification method |
CN202010162791.3A CN113378163A (en) | 2020-03-10 | 2020-03-10 | Android malicious software family classification method based on DEX file partition characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010162791.3A CN113378163A (en) | 2020-03-10 | 2020-03-10 | Android malicious software family classification method based on DEX file partition characteristics |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310412232.7A Division CN116383819A (en) | 2020-03-10 | 2020-03-10 | Android malicious software family classification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113378163A true CN113378163A (en) | 2021-09-10 |
Family
ID=77568841
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310412232.7A Pending CN116383819A (en) | 2020-03-10 | 2020-03-10 | Android malicious software family classification method |
CN202010162791.3A Pending CN113378163A (en) | 2020-03-10 | 2020-03-10 | Android malicious software family classification method based on DEX file partition characteristics |
Family Applications Before (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310412232.7A Pending CN116383819A (en) | 2020-03-10 | 2020-03-10 | Android malicious software family classification method |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN116383819A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114329472A (en) * | 2021-12-31 | 2022-04-12 | 淮阴工学院 | BIOS (basic input output System) malicious program detection method and device based on double embedding and model pruning |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015101096A1 (en) * | 2013-12-30 | 2015-07-09 | 北京奇虎科技有限公司 | Method and device for detecting malicious code in smart terminal |
CN106096411A (en) * | 2016-06-08 | 2016-11-09 | 浙江工业大学 | A kind of Android malicious code family classification method based on bytecode image clustering |
CN107103235A (en) * | 2017-02-27 | 2017-08-29 | 广东工业大学 | A kind of Android malware detection method based on convolutional neural networks |
CN108280348A (en) * | 2018-01-09 | 2018-07-13 | 上海大学 | Android Malware recognition methods based on RGB image mapping |
CN108280350A (en) * | 2018-02-05 | 2018-07-13 | 南京航空航天大学 | A kind of mobile network's terminal Malware multiple features detection method towards Android |
CN108710608A (en) * | 2018-04-28 | 2018-10-26 | 四川大学 | A kind of malice domain name language material library generating method based on context semanteme |
CN109190371A (en) * | 2018-07-09 | 2019-01-11 | 四川大学 | A kind of the Android malware detection method and technology of Behavior-based control figure |
Non-Patent Citations (1)
Title |
---|
YONG FANG et al.: "Android Malware Familial Classification Based on DEX File Section Features", IEEE Access, vol. 8, 10 January 2020 (2020-01-10), pages 10614-10627, XP011767708, DOI: 10.1109/ACCESS.2020.2965646 * |
Also Published As
Publication number | Publication date |
---|---|
CN116383819A (en) | 2023-07-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210910 |