CN110399729B - Binary software analysis method based on component characteristic weight - Google Patents
- Publication number
- CN110399729B (application CN201910669789.2A)
- Authority
- CN
- China
- Prior art keywords
- component
- features
- feature
- value
- characteristic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G06F21/577—Assessing vulnerabilities and evaluating computer system security
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/03—Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
- G06F2221/033—Test or assess software
Landscapes
- Engineering & Computer Science (AREA)
- Computer Hardware Design (AREA)
- Computer Security & Cryptography (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Stored Programmes (AREA)
Abstract
The invention discloses a binary software analysis method based on component feature weights. The method describes a binary software component using multiple types of features and assigns each feature a weight according to its degree of influence on the component. It solves the problems of missed detections and misjudgments in binary software analysis caused by incomplete component feature coverage, and provides a feature-weight-based component fingerprint extraction and determination method that is extensible, widely applicable, and efficient.
Description
Technical Field
The invention belongs to the technical field of static detection of software, and particularly relates to a binary software analysis method based on component characteristic weight.
Background
A component is a software entity within a software system that has relatively independent functionality, interfaces specified by contract, and explicit dependencies on its context; it can be deployed and assembled independently and is a simple encapsulation of data and methods. For a known executable binary, the components it uses and the vulnerabilities associated with those components must be determined quickly in order to clarify the security risks of the binary. For a component known to contain a vulnerability, all binaries that use the component must be determined quickly in order to understand the component's scope of impact. For a known vulnerability, the affected components and binaries must be identified in order to confirm the degree of risk the vulnerability poses.
In the prior art, Universal Extractor is a program that can extract files from almost any type of archive, whether a simple ZIP file, an installer (such as Wise or NSIS), or a Windows Installer (.msi) package, regardless of the archive's source or compression method. It provides a convenient way to extract files from an installation package (e.g., an Inno Setup or Windows Installer package) without using the command line each time. AppCheck is an analysis platform that comprehensively inspects the software composition and risk status of a device to help developers and device users improve the security of their technology. However, these methods extract the feature fingerprint from constant strings; although they are efficient, they suffer from missed detections, misjudgments, and incomplete feature coverage.
Disclosure of Invention
In view of the above, the present invention provides a binary software analysis method based on component feature weights. It solves the problems of missed detections and misjudgments in component identification by assigning different weights to different features, and solves the problem of incomplete feature coverage by adding feature items, thereby providing a feature-weight-based component feature extraction and determination method that is extensible, widely applicable, and efficient.
The invention provides a binary software analysis method based on component feature weights, which comprises: extracting multiple types of features of a binary component, assigning a weight to each feature according to its degree of influence on the component, and constructing a component feature library;
extracting the same types of features from the binary software to be analyzed, matching the extracted features of the binary software against the features of the same type of each component in the component feature library, and, if the matching result is greater than a threshold, determining that the component matches the binary software.
Further, the plurality of types of features of the binary component include a dynamic symbol table, header information, and a constant string.
Further, the specific process of constructing the component feature library is as follows:
Step 3.1, extracting the features of binary component i by the disassembly method corresponding to the file type of component i, wherein each feature is represented by a feature value array and the numbers of feature values in the feature value arrays of component i are $\{n_1, n_2, n_3\}$, where $n_1$ is the number of elements in the dynamic symbol table feature, $n_2$ is the number of elements in the header information feature, and $n_3$ is the number of elements in the constant string feature;
Step 3.2, for each feature value of component i, traversing the feature value arrays of all features of each component in the component feature library and finding the components that have the same feature value, forming a temporary matching result; calculating the intersection of each feature value array of component i with the feature value array of the corresponding feature of each component in the temporary matching result; and recording, for each type of feature, the maximum number of elements over all such intersections, $\{m_1, m_2, m_3\}$, where $m_1$ is the maximum number of elements in an intersection of the dynamic symbol table features, $m_2$ is the maximum number of elements in an intersection of the header information features, and $m_3$ is the maximum number of elements in an intersection of the constant string features;
Step 3.3, calculating the weight of each feature of component i according to the following formulas:
$a_1 = 1 - m_1/n_1$
$a_2 = 1 - m_2/n_2$
$a_3 = 1 - m_3/n_3$
where $a_1$ is the weight of the dynamic symbol table feature of component i, $a_2$ is the weight of the header information feature of component i, and $a_3$ is the weight of the constant string feature of component i;
Step 3.4, storing the features of component i and their corresponding weights in the component feature library;
Step 3.5, selecting the next component and executing step 3.2, until the last component has been processed.
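As a worked illustration of the weight formulas, using hypothetical counts that do not come from the patent: suppose component i has 200 dynamic symbol values, 10 header information values, and 500 constant string values, and the largest overlaps with any single component already in the library are 50, 10, and 25 values respectively.

$$
\begin{aligned}
a_1 &= 1 - \frac{m_1}{n_1} = 1 - \frac{50}{200} = 0.75 \\
a_2 &= 1 - \frac{m_2}{n_2} = 1 - \frac{10}{10} = 0 \\
a_3 &= 1 - \frac{m_3}{n_3} = 1 - \frac{25}{500} = 0.95
\end{aligned}
$$

In this example the header information is entirely shared with another component, so it carries no weight in identifying component i, while the constant strings are almost unique to it and dominate the match.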
Further, the specific process of respectively matching the extracted features of the binary software with the features of the same type of each component in the component feature library includes the following steps:
Step 4.1, extracting the features of the binary software by a disassembly method, in the same way as in step 3.1;
Step 4.2, for each feature value of the binary software, searching the component feature library constructed in step 3.4 for components that have the same feature value, forming a temporary matching result component list;
Step 4.3, calculating the intersection of each feature value array of the binary software with the feature value array of the corresponding feature of each component in the temporary matching result; if the number of elements in the intersection exceeds a set threshold, the feature is a matched feature; normalizing the number of matched feature values, multiplying it by the weight of that feature of the component, and summing over the features to obtain the matching coefficient of the component; if the matching coefficient is greater than the matching threshold, the component is considered a matched component and is output.
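One way to write the matching coefficient of step 4.3 compactly is given below. The text does not spell out the normalization, so dividing by the size of the component's feature value array is an assumption made here for illustration; the symbols $S_j$, $c_{j,k}$, $n_{j,k}$, $a_{j,k}$ and $t$ are likewise notation introduced for this sketch. For a candidate component $j$ and feature type $k$, let $c_{j,k}$ be the number of feature values shared by the binary software and component $j$, $n_{j,k}$ the number of feature values of component $j$, $a_{j,k}$ the stored weight, and $t$ the per-feature threshold:

$$
S_j = \sum_{k=1}^{3} a_{j,k} \cdot \mathbf{1}\left[c_{j,k} > t\right] \cdot \frac{c_{j,k}}{n_{j,k}}
$$

Component $j$ is reported as matched when $S_j$ exceeds the matching threshold.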
Further, the elements of each feature value array are hash values of the feature values.
Further, the matching process may be implemented with an inverted index library. The inverted index library is a set whose elements are the inverted indexes of the individual features; in each inverted index, the key is the hash value of a feature value and the value is a string array of the names of the components that contain that feature value.
Advantageous Effects:
The method describes a binary software component using multiple types of features and assigns each feature a weight according to its degree of influence on the component. It solves the problems of missed detections and misjudgments in binary software analysis caused by incomplete component feature coverage, and provides a feature-weight-based component fingerprint extraction and determination method that is extensible, widely applicable, and efficient.
Drawings
Fig. 1 is a flow chart of component feature library construction of the binary software analysis method based on component feature weights provided by the present invention.
FIG. 2 is a flow chart of homology determination of the binary software analysis method based on component feature weight provided by the present invention.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
The invention provides a binary software analysis method based on component characteristic weight, which has the basic idea that: firstly, extracting a plurality of types of features of binary components, calculating feature values to form a component feature array, setting the weight of the component features according to a feature weight calculation method to construct a component feature library, and analyzing binary software according to the generated component feature library to determine the binary components matched with the binary software.
The binary software analysis method based on the component feature weight comprises the following two aspects of constructing a component feature library and analyzing binary software according to the generated component feature library.
Firstly, the construction of a component feature library, as shown in fig. 1, includes the following steps:
Step 1.1, determining the file type of binary component i and extracting the features of the binary component by a disassembly method. Because no single feature can completely and uniquely identify components across different binary component files, the invention adds feature items to ensure the uniqueness of components and to improve feature coverage. The component features extracted in the invention comprise the dynamic symbol table, header information, and constant strings; more feature items can be added as the actual analysis requires. The numbers of feature values in the feature arrays of component i are $\{n_1, n_2, n_3\}$, where $n_1$ is the number of elements in the dynamic symbol table feature, $n_2$ is the number of elements in the header information feature, and $n_3$ is the number of elements in the constant string feature. Components of different file types require different processing, as follows:
a. for binary components in the PE32 format, the file is read with a module in the Python language, constant strings are extracted from the rdata section, and the function names of the component are read from the dynamic symbol table;
b. for binary components in a Linux format, the component features are extracted with the readelf command on Linux;
c. for a Jar package, the package is first decompressed, the class files are then decompiled with a Java command, and the corresponding features are extracted.
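The constant string feature can be illustrated with a minimal Python sketch. It is not the extractor described in the patent: it scans a raw byte buffer for printable ASCII runs (similar in spirit to the `strings` utility) instead of parsing the rdata section, and the helper names, the minimum run length of 4, and the use of SHA-256 for hashing are all assumptions made for illustration.

```python
import hashlib
import re

# Hypothetical helpers; the patent does not specify an implementation.
# A "constant string" is approximated here as a run of >= 4 printable ASCII bytes.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_constant_strings(data: bytes) -> list[str]:
    """Return the printable ASCII runs found in a raw byte buffer, in order."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]


def hash_feature(value: str) -> str:
    """Hash one feature value (the choice of SHA-256 is an assumption)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def feature_array(values: list[str]) -> set[str]:
    """Build a feature value array as a set of hashes, which removes duplicates."""
    return {hash_feature(v) for v in values}


# Toy buffer standing in for the read-only data of a binary component.
sample = b"\x00\x01libz.so.1\x00\xff\xfeinflate\x00ab\x00deflateInit_\x00"
strings = extract_constant_strings(sample)
hashes = feature_array(strings)
```

A real extractor would restrict the scan to the relevant section of the file format and would handle the dynamic symbol table and header information separately.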
Step 1.2, initializing the component feature library and taking as input the feature set of component i generated in step 1.1. The feature value arrays of all features of each component in the component feature library are traversed (the inverted index feature library can be used for the search) and matched, feature type by feature type, against the feature value arrays of component i to find the components that have the same feature values, forming a temporary matching result. The intersection of each feature value array of component i with the feature value array of the corresponding feature of each component in the temporary matching result is then calculated, and for each feature type the maximum number of elements over all such intersections is recorded as $\{m_1, m_2, m_3\}$, where $m_1$ is the maximum number of elements in an intersection of the dynamic symbol table features, $m_2$ is the maximum number of elements in an intersection of the header information features, and $m_3$ is the maximum number of elements in an intersection of the constant string features.
In the invention, the component feature library is structured as a set of components, where each component is a set of features and each feature is an array of feature values. For example, component 1 includes three features: a dynamic symbol table, header information, and constant strings. The constant string feature is an array of feature values that contains the feature values of a plurality of constant strings.
To save storage, each array element of a feature is the hash value of a feature value. To improve query speed, an inverted index library is built for query matching. The inverted index library is a set whose elements are the inverted indexes of the individual features; in each inverted index, the key is the hash value of a feature value and the value is a string array of the names of the components that contain that feature value.
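The library and inverted index structures described above can be sketched in Python as nested dictionaries. This is only one possible layout: the feature type names, the component names `comp_a` and `comp_b`, the toy feature values, and the choice of SHA-256 are hypothetical and are not specified by the patent.

```python
import hashlib
from collections import defaultdict

# Hypothetical names for the three feature types described in the text.
FEATURE_TYPES = ("dynamic_symbols", "header_info", "constant_strings")


def h(value: str) -> str:
    """Hash a feature value (SHA-256 is an assumption, not the patent's choice)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_inverted_index(library):
    """Build one inverted index per feature type.

    library: {component_name: {feature_type: set of feature value hashes}}
    returns: {feature_type: {feature value hash: [component names]}}
    """
    index = {ft: defaultdict(list) for ft in FEATURE_TYPES}
    for name, features in library.items():
        for ft in FEATURE_TYPES:
            for hv in features.get(ft, set()):
                index[ft][hv].append(name)
    return index


# Toy component feature library with made-up feature values.
library = {
    "comp_a": {
        "dynamic_symbols": {h("inflate"), h("deflate")},
        "header_info": {h("ELF64")},
        "constant_strings": {h("incorrect header check")},
    },
    "comp_b": {
        "dynamic_symbols": {h("read_info"), h("inflate")},
        "header_info": {h("ELF64")},
        "constant_strings": {h("comp_b version")},
    },
}

index = build_inverted_index(library)
# Which components contain the dynamic symbol "inflate"?
candidates = sorted(index["dynamic_symbols"].get(h("inflate"), []))
```

Looking up a hash with `.get(...)` avoids inserting empty entries into the `defaultdict` for feature values that no component contains.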
Step 1.3, calculating the weight of each feature of component i according to the following formulas:
$a_1 = 1 - m_1/n_1$
$a_2 = 1 - m_2/n_2$
$a_3 = 1 - m_3/n_3$
where $a_1$ is the weight of the dynamic symbol table feature of component i, $a_2$ is the weight of the header information feature of component i, and $a_3$ is the weight of the constant string feature of component i;
Step 1.4, storing the features of component i and their corresponding weights in the component feature library. If no component with features matching those of component i is found, the weights of the features of component i can be set manually based on experience.
Step 1.5, selecting the next component and executing step 1.2; the program exits after the last component has been processed.
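Steps 1.2 and 1.3 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it intersects the new component with every library component directly instead of going through an inverted index (components that share no value simply yield an empty intersection, so the maximum is unchanged), it uses plain strings in place of hashes, and the function name, the feature type names, and the handling of an empty feature array are hypothetical.

```python
# Hypothetical names for the three feature types described in the text.
FEATURE_TYPES = ("dynamic_symbols", "header_info", "constant_strings")


def feature_weights(new_features, library):
    """Compute a_k = 1 - m_k / n_k for each feature type of a new component.

    new_features: {feature_type: set of feature values} for component i
    library:      {component_name: {feature_type: set of feature values}}
    """
    weights = {}
    for ft in FEATURE_TYPES:
        own = new_features.get(ft, set())
        n = len(own)
        if n == 0:
            # Assumption: a feature with no values gets weight 0.
            weights[ft] = 0.0
            continue
        # m_k: largest overlap with the same feature of any library component.
        m = max(
            (len(own & other.get(ft, set())) for other in library.values()),
            default=0,
        )
        weights[ft] = 1.0 - m / n
    return weights


# Toy data (plain strings stand in for feature value hashes).
library = {
    "comp_a": {
        "dynamic_symbols": {"a", "x"},
        "header_info": {"h1"},
        "constant_strings": {"t"},
    },
    "comp_b": {
        "dynamic_symbols": {"a", "b", "c"},
        "header_info": {"h2"},
        "constant_strings": set(),
    },
}
new_component = {
    "dynamic_symbols": {"a", "b", "c", "d"},
    "header_info": {"h1"},
    "constant_strings": {"s1", "s2"},
}
weights = feature_weights(new_component, library)
```

Here the dynamic symbols overlap most with `comp_b` (3 of 4 values), giving a weight of 0.25; the header information is fully shared and gets 0; the constant strings are unique and get 1. When nothing in the library overlaps at all, this sketch assigns weight 1 to every non-empty feature, whereas the text allows such weights to be set manually.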
Secondly, analyzing the binary software according to the generated component feature library, as shown in Fig. 2, includes the following steps:
Step 2.1, extracting the features of the binary software by a disassembly method, in the same way as in step 1.1;
Step 2.2, for each feature value in the feature value arrays of the binary software extracted in step 2.1, searching the component feature library constructed in step 1.4 for components that have the same feature value, forming a temporary component list; the inverted index feature library of step 1.2 can be used for the search;
Step 2.3, traversing the temporary component list and matching the feature value arrays of each component against the feature values of the same features of the binary software under test to find the components that have the same feature values, forming a temporary matching result. The intersection of each feature value array of the binary software with the feature value array of the corresponding feature of each component in the temporary matching result is calculated. If the number of elements in the intersection exceeds a set threshold, the feature is a matched feature; the number of matched feature values is normalized according to formula (1), multiplied by the weight of that feature of the component, and summed over the features to obtain the matching coefficient of the component. If the matching coefficient is greater than the matching threshold, the component is considered a matched component and is output. The vulnerability information of the binary software is then analyzed by combining the matched components of the binary software with the vulnerability information of those components.
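The matching in step 2.3 can be sketched in Python as below. Formula (1) is not reproduced in the text, so the normalization used here (shared values divided by the size of the component's feature value array) is an assumption, as are the function name, the feature type names, the threshold defaults, and the toy data; plain strings stand in for hashes.

```python
# Hypothetical names for the three feature types described in the text.
FEATURE_TYPES = ("dynamic_symbols", "header_info", "constant_strings")


def match_components(software, library, weights,
                     feature_threshold=0, match_threshold=0.5):
    """Return {component_name: matching coefficient} for matched components.

    software: {feature_type: set of feature values} of the binary under test
    library:  {component_name: {feature_type: set of feature values}}
    weights:  {component_name: {feature_type: weight}}
    """
    results = {}
    for name, component in library.items():
        score = 0.0
        for ft in FEATURE_TYPES:
            comp_values = component.get(ft, set())
            common = software.get(ft, set()) & comp_values
            # A feature only counts when its overlap exceeds the threshold.
            if not comp_values or len(common) <= feature_threshold:
                continue
            # Assumed normalization: fraction of the component's values found.
            normalized = len(common) / len(comp_values)
            score += weights[name][ft] * normalized
        if score > match_threshold:
            results[name] = score
    return results


library = {
    "comp_a": {
        "dynamic_symbols": {"a", "b", "c", "d"},
        "header_info": {"h1"},
        "constant_strings": {"s1", "s2"},
    },
    "comp_b": {
        "dynamic_symbols": {"x", "y"},
        "header_info": {"h1"},
        "constant_strings": {"t1"},
    },
}
weights = {
    "comp_a": {"dynamic_symbols": 0.5, "header_info": 0.0, "constant_strings": 1.0},
    "comp_b": {"dynamic_symbols": 0.5, "header_info": 0.0, "constant_strings": 1.0},
}
software = {
    "dynamic_symbols": {"a", "b", "z"},
    "header_info": {"h1"},
    "constant_strings": {"s1", "s2", "q"},
}
matched = match_components(software, library, weights)
```

With this toy data, `comp_a` scores 0.5 × 2/4 + 0 × 1/1 + 1.0 × 2/2 = 1.25 and is reported, while `comp_b` shares only the zero-weight header value and is rejected. Because the weights are not rescaled to sum to 1, the coefficient can exceed 1, so any matching threshold has to be chosen with that in mind.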
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (3)
1. A binary software analysis method based on component feature weights, characterized in that multiple types of features of a binary component are extracted, a weight is assigned to each feature according to its degree of influence on the component, and a component feature library is constructed;
the same types of features are extracted from the binary software to be analyzed, the extracted features of the binary software are matched against the features of the same type of each component in the component feature library, and, if the matching result is greater than a threshold, the component is determined to match the binary software;
wherein the multiple types of features of the binary component comprise a dynamic symbol table, header information, and constant strings;
the specific process of constructing the component feature library comprises the following steps:
Step 3.1, extracting the features of binary component i by the disassembly method corresponding to the file type of component i, wherein each feature is represented by a feature value array and the numbers of feature values in the feature value arrays of component i are $\{n_1, n_2, n_3\}$, where $n_1$ is the number of elements in the dynamic symbol table feature, $n_2$ is the number of elements in the header information feature, and $n_3$ is the number of elements in the constant string feature;
Step 3.2, for each feature value of component i, traversing the feature value arrays of all features of each component in the component feature library and finding the components that have the same feature value, forming a temporary matching result; calculating the intersection of each feature value array of component i with the feature value array of the corresponding feature of each component in the temporary matching result; and recording, for each type of feature, the maximum number of elements over all such intersections, $\{m_1, m_2, m_3\}$, where $m_1$ is the maximum number of elements in an intersection of the dynamic symbol table features, $m_2$ is the maximum number of elements in an intersection of the header information features, and $m_3$ is the maximum number of elements in an intersection of the constant string features;
Step 3.3, calculating the weight of each feature of component i according to the following formulas:
$a_1 = 1 - m_1/n_1$
$a_2 = 1 - m_2/n_2$
$a_3 = 1 - m_3/n_3$
where $a_1$ is the weight of the dynamic symbol table feature of component i, $a_2$ is the weight of the header information feature of component i, and $a_3$ is the weight of the constant string feature of component i;
Step 3.4, storing the features of component i and their corresponding weights in the component feature library;
Step 3.5, selecting the next component and executing step 3.2, until the last component has been processed;
the specific process of matching the extracted features of the binary software against the features of the same type of each component in the component feature library comprises the following steps:
Step 4.1, extracting the features of the binary software by a disassembly method, in the same way as in step 3.1;
Step 4.2, for each feature value of the binary software, searching the component feature library constructed in step 3.4 for components that have the same feature value, forming a temporary matching result component list;
Step 4.3, calculating the intersection of each feature value array of the binary software with the feature value array of the corresponding feature of each component in the temporary matching result; if the number of elements in the intersection exceeds a set threshold, the feature is a matched feature; normalizing the number of matched feature values, multiplying it by the weight of that feature of the component, and summing over the features to obtain the matching coefficient of the component; if the matching coefficient is greater than the matching threshold, the component is considered a matched component and is output.
2. The method of claim 1, wherein the elements of each feature value array are hash values of the feature values.
3. The method according to claim 2, wherein the matching process is implemented with an inverted index library, the inverted index library is a set whose elements are the inverted indexes of the individual features, and in each inverted index the key is the hash value of a feature value and the value is a string array of the names of the components that contain that feature value.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2019102868789 | 2019-04-11 | ||
CN201910286878 | 2019-04-11 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110399729A (en) | 2019-11-01 |
CN110399729B (en) | 2021-04-27 |
Family
ID=68325877
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910669789.2A Expired - Fee Related CN110399729B (en) | 2019-04-11 | 2019-07-24 | Binary software analysis method based on component characteristic weight |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110399729B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111046388B (en) * | 2019-12-16 | 2022-09-13 | 北京智游网安科技有限公司 | Method for identifying third-party SDK in application, intelligent terminal and storage medium |
CN116954701B (en) * | 2023-08-09 | 2024-05-14 | 软安科技有限公司 | Binary component detection method and system based on blood relationship |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102779257A (en) * | 2012-06-28 | 2012-11-14 | 奇智软件(北京)有限公司 | Security detection method and system of Android application program |
CN103226583A (en) * | 2013-04-08 | 2013-07-31 | 北京奇虎科技有限公司 | Method and device for recognizing advertisement plugin |
CN104517053A (en) * | 2013-09-29 | 2015-04-15 | 北京金山网络科技有限公司 | Software recognition method and device |
CN106650450A (en) * | 2016-12-29 | 2017-05-10 | 哈尔滨安天科技股份有限公司 | Malicious script heuristic detection method and system based on code fingerprint identification |
CN107844705A (en) * | 2017-11-14 | 2018-03-27 | 苏州棱镜七彩信息科技有限公司 | Third party's component leak detection method based on binary code feature |
CN108763928A (en) * | 2018-05-03 | 2018-11-06 | 北京邮电大学 | A kind of open source software leak analysis method, apparatus and storage medium |
CN109062792A (en) * | 2018-07-21 | 2018-12-21 | 东南大学 | A kind of Open Source Code detection method based on String matching and characteristic matching |
CN109543408A (en) * | 2018-10-29 | 2019-03-29 | 卓望数码技术(深圳)有限公司 | A kind of Malware recognition methods and system |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9323923B2 (en) * | 2012-06-19 | 2016-04-26 | Deja Vu Security, Llc | Code repository intrusion detection |
CN107704501B (en) * | 2017-08-28 | 2020-04-24 | 中国科学院信息工程研究所 | Method and system for identifying homologous binary file |
Non-Patent Citations (1)
Title |
---|
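"Design and Implementation of an Android-Based Malware Detection System"; Zuo Ling; China Master's Theses Full-text Database (Information Science and Technology); 2013-01-15; pp. I136-188 * |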
Also Published As
Publication number | Publication date |
---|---|
CN110399729A (en) | 2019-11-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20210427 |