CN110399729B - Binary software analysis method based on component characteristic weight - Google Patents

Publication number
CN110399729B
Authority
CN
China
Prior art keywords
component
features
feature
value
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910669789.2A
Other languages
Chinese (zh)
Other versions
CN110399729A (en)
Inventor
于渤
付海涛
高卫栋
何清林
刘中金
何跃鹰
袁开国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center
Publication of CN110399729A
Application granted
Publication of CN110399729B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software

Abstract

The invention discloses a binary software analysis method based on component feature weights. The method describes a binary software component by several types of features and assigns a different weight to each feature according to its degree of influence on the component. This addresses the missed detections and misjudgments that arise in binary software analysis when component feature coverage is incomplete, and yields a feature-weight-based method for extracting and judging component fingerprints that is extensible, widely applicable and efficient.

Description

Binary software analysis method based on component characteristic weight
Technical Field
The invention belongs to the technical field of static detection of software, and particularly relates to a binary software analysis method based on component characteristic weight.
Background
A component is a software entity in a software system that has relatively independent functions, interfaces specified by contracts, and explicit dependencies on its context; it can be independently deployed and assembled, and is a simple encapsulation of data and methods. For known executable binary code, there is a need to quickly determine the components it uses, and the vulnerabilities associated with those components, in order to clarify the security risks of the binary code. For a component known to have a vulnerability, it is necessary to quickly determine all binary code that uses the component in order to understand the scope of the component's impact. For a known vulnerability, it is necessary to identify the components and binary code affected by the vulnerability in order to confirm the degree of risk it poses.
In the prior art, Universal Extractor is a program that can extract files from any type of archive, whether simple ZIP files, installers (such as Wise or NSIS), or Windows Installer (.msi) packages. Universal Extractor allows users to extract files from almost any type of archive regardless of its source, compression method, etc. It provides an easy and convenient way to extract files from an installation package (e.g., an Inno Setup or Windows Installer package) without using a command line each time. AppCheck is an analysis platform that comprehensively inspects the software composition and risk status of a device to help developers and device users improve security. However, the above methods extract the feature fingerprint from constant strings only; although they are efficient, they suffer from missed detections, misjudgments and incomplete feature coverage.
Disclosure of Invention
In view of the above, the present invention provides a binary software analysis method based on component feature weights. It addresses missed detections and misjudgments in component identification by assigning different weights to different features, and addresses incomplete feature coverage by adding feature items, thereby providing a feature-weight-based component feature extraction and determination method that is extensible, widely applicable and efficient.
The invention provides a binary software analysis method based on component feature weights, which comprises: extracting a plurality of types of features of a binary component, assigning a weight to each feature according to its degree of influence on the component, and constructing a component feature library;
extracting the same plurality of types of features from the binary software to be analyzed, matching the extracted features of the binary software against the features of the same type of each component in the component feature library, and, if the matching result is greater than a threshold value, determining that the component matches the binary software.
Further, the plurality of types of features of the binary component include a dynamic symbol table, header information, and a constant string.
Further, the specific process of constructing the component feature library is as follows:
Step 3.1, extracting the features of binary component i by a disassembling method corresponding to the file type of component i, wherein each feature is represented by a feature value array, and the numbers of feature values in the feature value arrays of component i are $\{n_1, n_2, n_3\}$, where $n_1$ is the number of elements contained in the dynamic symbol table feature, $n_2$ is the number of elements contained in the header information feature, and $n_3$ is the number of elements contained in the constant string feature;
Step 3.2, for each feature value of component i, traversing the feature value arrays of all features of each component in the component feature library and finding the components that have the same feature value to form a temporary matching result; calculating the intersection of the feature value array of each feature of component i with the feature value array of the corresponding feature of each component in the temporary matching result; and recording, for each type of feature, the maximum number of elements contained in any of the intersections, $\{m_1, m_2, m_3\}$, where $m_1$ is the maximum number of elements contained in an intersection of the dynamic symbol table features, $m_2$ is the maximum number of elements contained in an intersection of the header information features, and $m_3$ is the maximum number of elements contained in an intersection of the constant string features;
Step 3.3, calculating the weight of each feature of component i according to the following formulas:
$a_1 = 1 - m_1/n_1$
$a_2 = 1 - m_2/n_2$
$a_3 = 1 - m_3/n_3$
where $a_1$ is the weight of the dynamic symbol table feature of component i, $a_2$ is the weight of the header information feature of component i, and $a_3$ is the weight of the constant string feature of component i;
Step 3.4, storing the features of component i and the weights corresponding to the features into the component feature library;
Step 3.5, selecting the next component and executing step 3.2, until the last component has been processed.
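The weight calculation of steps 3.2 and 3.3 can be illustrated with a minimal Python sketch. It assumes each component is a dictionary mapping a feature type to a set of feature values; the names `FEATURE_TYPES`, `feature_weights`, `libA` and the sample values are hypothetical and not part of the patent text. Plain strings stand in for the hashed feature values, and a feature with no values is given weight 0 as an assumption, since the source does not cover that case.

```python
from typing import Dict, Set

# Hypothetical labels for the three feature types named in the text.
FEATURE_TYPES = ("dynamic_symbols", "header_info", "constant_strings")

Component = Dict[str, Set[str]]


def feature_weights(component: Component, library: Dict[str, Component]) -> Dict[str, float]:
    """Compute a_k = 1 - m_k / n_k for each feature type of one component."""
    weights: Dict[str, float] = {}
    for ftype in FEATURE_TYPES:
        values = component.get(ftype, set())
        n = len(values)  # n_k: number of feature values of this type
        if n == 0:
            weights[ftype] = 0.0  # assumption: an empty feature carries no weight
            continue
        # m_k: largest intersection with the same feature type of any library component
        m = max(
            (len(values & other.get(ftype, set())) for other in library.values()),
            default=0,
        )
        weights[ftype] = 1.0 - m / n
    return weights


# Illustrative data only.
library = {
    "libA": {
        "dynamic_symbols": {"f1", "f2"},
        "header_info": {"h1"},
        "constant_strings": {"s1", "s2", "s3"},
    }
}
component_i = {
    "dynamic_symbols": {"f1", "f2", "f3", "f4"},
    "header_info": {"h2"},
    "constant_strings": {"s1", "x", "y", "z"},
}
weights = feature_weights(component_i, library)
print(weights)  # {'dynamic_symbols': 0.5, 'header_info': 1.0, 'constant_strings': 0.75}
```

A feature that overlaps heavily with some other component in the library receives a low weight, while a feature that is unique to the component receives a weight close to 1.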
Further, the specific process of matching the extracted features of the binary software against the features of the same type of each component in the component feature library includes the following steps:
Step 4.1, extracting the features of the binary software by a disassembling method, in the same way as in step 3.1;
Step 4.2, for each feature value of the binary software, searching the component feature library constructed in step 3.4 for components that have the same feature value, to form a temporary matching result component list;
Step 4.3, calculating the intersection of the feature value array of each feature of the binary software with the feature value array of the corresponding feature of each component in the temporary matching result; if the number of elements contained in the intersection exceeds a set threshold value, the feature is a matched feature; normalizing the number of matched feature values, multiplying the normalized number by the weight of that feature of the component, and summing over the features to obtain the matching coefficient of the component; if the matching coefficient is greater than the threshold value, the component is considered a matched component and is output.
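The scoring in step 4.3 can be sketched as follows. The text does not spell out the normalization here, so this sketch assumes the matched count is divided by the number of feature values the component has for that feature type; that choice, the function names, the two thresholds and the sample data are all hypothetical.

```python
from typing import Dict, List, Set

Component = Dict[str, Set[str]]


def matching_coefficient(
    software: Component,
    component: Component,
    weights: Dict[str, float],
    feature_threshold: int = 0,
) -> float:
    """Weighted sum of normalized intersection sizes over the feature types."""
    score = 0.0
    for ftype, comp_values in component.items():
        if not comp_values:
            continue
        common = software.get(ftype, set()) & comp_values
        if len(common) > feature_threshold:  # the feature counts as matched
            # Assumed normalization: matched values / values held by the component.
            normalized = len(common) / len(comp_values)
            score += weights.get(ftype, 0.0) * normalized
    return score


def match_components(
    software: Component,
    library: Dict[str, Component],
    weights_by_component: Dict[str, Dict[str, float]],
    feature_threshold: int = 0,
    match_threshold: float = 0.5,
) -> List[str]:
    """Return the names of components whose coefficient exceeds match_threshold."""
    return sorted(
        name
        for name, comp in library.items()
        if matching_coefficient(
            software, comp, weights_by_component[name], feature_threshold
        ) > match_threshold
    )


# Illustrative data only.
library = {
    "libA": {"dynamic_symbols": {"inflate", "deflate"}, "constant_strings": {"s1", "s2"}},
    "libB": {"dynamic_symbols": {"g1", "g2"}, "constant_strings": {"t1"}},
}
weights_by_component = {
    "libA": {"dynamic_symbols": 0.5, "constant_strings": 1.0},
    "libB": {"dynamic_symbols": 1.0, "constant_strings": 1.0},
}
software = {"dynamic_symbols": {"inflate", "deflate", "main"}, "constant_strings": {"s1"}}

matched = match_components(software, library, weights_by_component)
print(matched)  # ['libA']
```

Both thresholds are tunable; the source only states that they are set values.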
Further, the elements in the feature value array are hash values of the feature values.
Further, the matching process may be implemented by using an inverted index library, where the inverted index library is a set whose elements are the inverted indexes of the individual features; in each inverted index, the index key is the hash value of a feature value, and the value is a string array formed by the names of the components that contain the feature.
Advantageous effects:
The method describes a binary software component by several types of features and assigns a different weight to each feature according to its degree of influence on the component. This addresses the missed detections and misjudgments that arise in binary software analysis when component feature coverage is incomplete, and yields a feature-weight-based method for extracting and judging component fingerprints that is extensible, widely applicable and efficient.
Drawings
FIG. 1 is a flow chart of component feature library construction in the binary software analysis method based on component feature weights provided by the present invention.
FIG. 2 is a flow chart of homology determination in the binary software analysis method based on component feature weights provided by the present invention.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
The invention provides a binary software analysis method based on component feature weights, whose basic idea is as follows: first, extract a plurality of types of features of binary components, compute the feature values to form component feature arrays, and set the weights of the component features according to the feature weight calculation method, thereby constructing a component feature library; then analyze binary software against the generated component feature library to determine the binary components that match it.
The binary software analysis method based on component feature weights comprises two aspects: constructing the component feature library, and analyzing binary software according to the generated component feature library.
Firstly, the construction of a component feature library, as shown in fig. 1, includes the following steps:
Step 1.1, judging the file type of binary component i, and extracting the features of the binary component by a disassembling method. Because a single feature cannot completely and uniquely identify a component across different binary component files, the invention ensures the uniqueness of components and improves feature coverage by adding feature items. The component features extracted in the invention comprise the dynamic symbol table, the header information and the constant strings; more feature items can be added according to the requirements of the actual analysis. The numbers of feature values in the feature arrays of component i are $\{n_1, n_2, n_3\}$, where $n_1$ is the number of elements contained in the dynamic symbol table feature, $n_2$ is the number of elements contained in the header information feature, and $n_3$ is the number of elements contained in the constant string feature. Components of different file types require different processing, specifically as follows:
a. for binary components in the PE32 format, the data is read with a module written in the Python language, constant strings are extracted from the .rdata section, and the dynamic symbol table is read to extract the function names in the component;
b. for binary components in the Linux (ELF) format, component features are extracted with the readelf command in Linux;
c. for a Jar package, the package is first decompressed, then the class files are decompiled with a Java-language command, and the corresponding features are extracted.
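The file-type dispatch in step 1.1 can be sketched in Python using well-known magic numbers: `MZ` for PE files, `\x7fELF` for ELF files, and `PK\x03\x04` for ZIP containers such as Jar packages. The function names are hypothetical, and the string extractor is a simplification that scans the whole byte stream for printable ASCII runs instead of parsing the .rdata section or calling readelf as the text describes.

```python
import re
from typing import List


def detect_file_type(data: bytes) -> str:
    """Classify a binary by its leading magic bytes."""
    if data.startswith(b"\x7fELF"):
        return "elf"
    if data.startswith(b"MZ"):
        return "pe"
    if data.startswith(b"PK\x03\x04"):
        return "jar"  # a Jar package is a ZIP archive
    return "unknown"


def extract_constant_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable ASCII characters of at least min_len bytes.

    Simplification: a real extractor would restrict the scan to the
    read-only data section of the file.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.decode("ascii") for m in pattern.findall(data)]


# Illustrative byte string only, not a valid ELF file.
sample = b"\x7fELF\x02\x01\x01\x00" + b"\x00" * 8 + b"hello world\x00ab\x00version 1.2\x00"
print(detect_file_type(sample))          # elf
print(extract_constant_strings(sample))  # ['hello world', 'version 1.2']
```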
Step 1.2, initializing the component feature library and inputting the feature set of component i generated in step 1.1; traversing the feature value arrays of all features of each component in the component feature library, searching by using the inverted index feature library, and matching them by feature type against the feature value arrays of component i to find the components that have the same feature values and form a temporary matching result; then calculating the intersection of the feature value array of each feature of component i with the feature value array of the corresponding feature of each component in the temporary matching result, and recording, for each feature, the maximum number of elements contained in any of the intersections, namely $\{m_1, m_2, m_3\}$, where $m_1$ is the maximum number of elements contained in an intersection of the dynamic symbol table features, $m_2$ is the maximum number of elements contained in an intersection of the header information features, and $m_3$ is the maximum number of elements contained in an intersection of the constant string features.
In the invention, the component feature library is structured as a set of components, wherein each component is a set of several features and each feature is an array of feature values. For example, component 1 includes three features: a dynamic symbol table, header information and constant strings, wherein the constant string feature is an array of feature values containing the feature values of a plurality of constant strings.
Here, in order to save storage, each array element of a feature is the hash value of the feature value. Meanwhile, in order to improve query speed, an inverted index library is established for query matching. The inverted index library is a set whose elements are the inverted indexes of the individual features; in each inverted index, the index key is the hash value of a feature value, and the value is a string array formed by the names of the components that contain the feature.
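The hashed feature arrays and the inverted index library described above might look like the following sketch. The text does not name a hash function, so SHA-256 from Python's `hashlib` is used purely as an assumption; `hash_feature`, `build_inverted_index`, `candidate_components` and the sample library are hypothetical names and data.

```python
import hashlib
from typing import Dict, Iterable, List, Set

# feature type -> {hash of feature value -> names of components containing it}
InvertedIndex = Dict[str, Dict[str, List[str]]]


def hash_feature(value: str) -> str:
    """Hash one feature value (SHA-256 is an assumption, not from the source)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_inverted_index(library: Dict[str, Dict[str, Set[str]]]) -> InvertedIndex:
    """Build one inverted index per feature type from hashed feature arrays."""
    index: InvertedIndex = {}
    for name, features in library.items():
        for ftype, hashes in features.items():
            per_type = index.setdefault(ftype, {})
            for h in hashes:
                per_type.setdefault(h, []).append(name)
    return index


def candidate_components(index: InvertedIndex, ftype: str, hashes: Iterable[str]) -> Set[str]:
    """Look up every component that shares at least one hashed value of this type."""
    found: Set[str] = set()
    for h in hashes:
        found.update(index.get(ftype, {}).get(h, []))
    return found


# Illustrative data only: feature arrays already store hashes.
library = {
    "libA": {"dynamic_symbols": {hash_feature("printf"), hash_feature("inflate")}},
    "libB": {"dynamic_symbols": {hash_feature("printf"), hash_feature("g1")}},
}
index = build_inverted_index(library)
print(sorted(index["dynamic_symbols"][hash_feature("printf")]))  # ['libA', 'libB']
print(candidate_components(index, "dynamic_symbols", [hash_feature("inflate")]))  # {'libA'}
```

With the index in place, forming the temporary matching result costs one dictionary lookup per feature value instead of a scan over every component in the library.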
Step 1.3, calculating the weight of each feature of component i according to the following formulas:
$a_1 = 1 - m_1/n_1$
$a_2 = 1 - m_2/n_2$
$a_3 = 1 - m_3/n_3$
where $a_1$ is the weight of the dynamic symbol table feature of component i, $a_2$ is the weight of the header information feature of component i, and $a_3$ is the weight of the constant string feature of component i;
Step 1.4, storing the features of component i and the weights corresponding to the features into the component feature library; when no features matching those of component i are found, the weights of the features of component i can be set manually according to experience.
Step 1.5, selecting the next component and executing step 1.2; the program exits after the last component has been processed.
Secondly, analyzing the binary software according to the generated component feature library, as shown in fig. 2, includes the following steps:
Step 2.1, extracting the features of the binary software by a disassembling method, in the same way as in step 1.1;
Step 2.2, for the features of the binary software extracted in step 2.1, using the feature values in the feature value array of each feature to search the component feature library constructed in step 1.4 for components that have the same feature values, forming a temporary component list; the inverted index feature library of step 1.2 can be used for this search;
Step 2.3, traversing the temporary component list, matching the feature value array of each feature of each component against the feature values of the same feature of the binary software under test, and finding the components that have the same feature values to form a temporary matching result; calculating the intersection of the feature value array of each feature of the binary software with the feature value array of the corresponding feature of each component in the temporary matching result; if the number of elements contained in the intersection exceeds a set threshold value, the feature is a matched feature; normalizing the number of matched feature values according to formula (1), multiplying the normalized number by the weight of that feature of the component, and summing over the features to obtain the matching coefficient of the component; if the matching coefficient is greater than the threshold value, the component is considered a matched component and is output. The vulnerability information of the binary software is then analyzed by combining the matched components of the binary software with the vulnerability information of those components.
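The final step, combining matched components with their vulnerability information, amounts to a lookup keyed by component name. The sketch below assumes a hypothetical in-memory mapping `vuln_db` from component names to vulnerability identifiers; the identifiers and component names are placeholders, not real records, and the source does not describe how vulnerability information is actually stored.

```python
from typing import Dict, Iterable, List


def vulnerabilities_for(
    matched_components: Iterable[str],
    vuln_db: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Map each matched component to its known vulnerability identifiers."""
    report: Dict[str, List[str]] = {}
    for name in matched_components:
        # Components without an entry are reported with an empty list.
        report[name] = sorted(vuln_db.get(name, []))
    return report


# Placeholder data only.
vuln_db = {"libA": ["VULN-0002", "VULN-0001"], "libB": ["VULN-0003"]}
report = vulnerabilities_for(["libA", "libC"], vuln_db)
print(report)  # {'libA': ['VULN-0001', 'VULN-0002'], 'libC': []}
```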
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (3)

1. A binary software analysis method based on component feature weight is characterized in that a plurality of types of features of a binary component are extracted, weight is given to the component according to the influence degree of each feature, and a component feature library is constructed;
extracting the features of the multiple types in the binary software to be analyzed, respectively matching the extracted features of the binary software with the features of the same type of each component in the component feature library, and if the matching result is greater than a threshold value, determining that the component is matched with the binary software;
wherein the plurality of types of features of the binary component comprise a dynamic symbol table, header information and a constant string;
the specific process for constructing the component feature library comprises the following steps:
step 3.1, extracting the features of binary component i by a disassembling method corresponding to the file type of component i, wherein each feature is represented by a feature value array, and the numbers of feature values in the feature value arrays of component i are $\{n_1, n_2, n_3\}$, where $n_1$ is the number of elements contained in the dynamic symbol table feature, $n_2$ is the number of elements contained in the header information feature, and $n_3$ is the number of elements contained in the constant string feature;
step 3.2, for each feature value of component i, traversing the feature value arrays of all features of each component in the component feature library and finding the components that have the same feature value to form a temporary matching result; calculating the intersection of the feature value array of each feature of component i with the feature value array of the corresponding feature of each component in the temporary matching result; and recording, for each type of feature, the maximum number of elements contained in any of the intersections, $\{m_1, m_2, m_3\}$, where $m_1$ is the maximum number of elements contained in an intersection of the dynamic symbol table features, $m_2$ is the maximum number of elements contained in an intersection of the header information features, and $m_3$ is the maximum number of elements contained in an intersection of the constant string features;
step 3.3, calculating the weight of each feature of component i according to the following formulas:
$a_1 = 1 - m_1/n_1$
$a_2 = 1 - m_2/n_2$
$a_3 = 1 - m_3/n_3$
where $a_1$ is the weight of the dynamic symbol table feature of component i, $a_2$ is the weight of the header information feature of component i, and $a_3$ is the weight of the constant string feature of component i;
step 3.4, storing the features of component i and the weights corresponding to the features into the component feature library;
step 3.5, selecting the next component and executing step 3.2, until the last component has been processed;
the specific process of respectively matching the extracted features of the binary software with the features of the same type of each component in the component feature library comprises the following steps:
step 4.1, extracting the characteristics of the binary software by a disassembling method, wherein the method is the same as the step 3.1;
step 4.2, aiming at each characteristic value of the binary software, searching components with the same characteristic value in the component characteristic library constructed in the step 3.4 to form a temporary matching result component list;
step 4.3, calculating the intersection of the feature value array of each feature of the binary software with the feature value array of the corresponding feature of each component in the temporary matching result; if the number of elements contained in the intersection exceeds a set threshold value, the feature is a matched feature; normalizing the number of matched feature values, multiplying the normalized number by the weight of that feature of the component, and summing over the features to obtain the matching coefficient of the component; if the matching coefficient is greater than the threshold value, the component is considered a matched component and is output.
2. The method of claim 1, wherein the elements in the feature value array are hash values of the feature values.
3. The method according to claim 2, wherein the matching process is implemented by using an inverted index library, the inverted index library is a set whose elements are the inverted indexes of the individual features, and in each inverted index the index key is the hash value of a feature value and the value is a string array formed by the names of the components that contain the feature.
CN201910669789.2A 2019-04-11 2019-07-24 Binary software analysis method based on component characteristic weight Active CN110399729B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910286878 2019-04-11
CN2019102868789 2019-04-11

Publications (2)

Publication Number Publication Date
CN110399729A (en) 2019-11-01
CN110399729B (en) 2021-04-27

Family

ID=68325877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910669789.2A Active CN110399729B (en) 2019-04-11 2019-07-24 Binary software analysis method based on component characteristic weight

Country Status (1)

Country Link
CN (1) CN110399729B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046388B (en) * 2019-12-16 2022-09-13 北京智游网安科技有限公司 Method for identifying third-party SDK in application, intelligent terminal and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102779257A (en) * 2012-06-28 2012-11-14 奇智软件(北京)有限公司 Security detection method and system of Android application program
CN103226583A (en) * 2013-04-08 2013-07-31 北京奇虎科技有限公司 Method and device for recognizing advertisement plugin
CN104517053A (en) * 2013-09-29 2015-04-15 北京金山网络科技有限公司 Software recognition method and device
CN106650450A (en) * 2016-12-29 2017-05-10 哈尔滨安天科技股份有限公司 Malicious script heuristic detection method and system based on code fingerprint identification
CN107844705A (en) * 2017-11-14 2018-03-27 苏州棱镜七彩信息科技有限公司 Third party's component leak detection method based on binary code feature
CN108763928A (en) * 2018-05-03 2018-11-06 北京邮电大学 A kind of open source software leak analysis method, apparatus and storage medium
CN109062792A (en) * 2018-07-21 2018-12-21 东南大学 A kind of Open Source Code detection method based on String matching and characteristic matching
CN109543408A (en) * 2018-10-29 2019-03-29 卓望数码技术(深圳)有限公司 A kind of Malware recognition methods and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9323923B2 (en) * 2012-06-19 2016-04-26 Deja Vu Security, Llc Code repository intrusion detection
CN107704501B (en) * 2017-08-28 2020-04-24 中国科学院信息工程研究所 Method and system for identifying homologous binary file

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
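"Design and Implementation of an Android-Based Malware Detection System"; Zuo Ling; China Master's Theses Full-text Database (Information Science and Technology); 2013-01-15; pp. I136-188 *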

Also Published As

Publication number Publication date
CN110399729A (en) 2019-11-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant