CN112084146A - Firmware homology detection method based on multi-dimensional features - Google Patents

Info

Publication number
CN112084146A
Authority
CN
China
Prior art keywords
firmware
hash
file
similarity
homology
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010932458.6A
Other languages
Chinese (zh)
Inventor
宋岩
何道敬
李亭辉
郭乃网
吴裔
沈泉江
王彬彬
张蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
State Grid Shanghai Electric Power Co Ltd
Original Assignee
East China Normal University
State Grid Shanghai Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-09-08
Filing date
2020-09-08
Publication date
2020-12-15
Application filed by East China Normal University and State Grid Shanghai Electric Power Co Ltd
Priority to CN202010932458.6A
Publication of CN112084146A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/137 Hash-based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/08 Network architectures or network communication protocols for network security for authentication of entities
    • H04L63/0876 Network architectures or network communication protocols for network security for authentication of entities based on the identity of the terminal or configuration, e.g. MAC address, hardware or software configuration or device fingerprint

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Technology Law (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Power Engineering (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a firmware homology detection method based on multi-dimensional features, comprising: identifying the firmware format, unpacking the firmware, and extracting files; extracting multi-dimensional features such as file string features and function control flow graph features; calculating file similarity within each single feature dimension using methods such as fuzzy hashing, constant matching, and graph matching; and computing the overall firmware similarity as a weighted combination of the single-dimension results. The invention addresses the problem that, when firmware is obfuscated or encrypted, a single feature yields large errors and homology is difficult to determine; by extracting, analyzing, and combining multi-dimensional features, it improves the accuracy of homology detection.

Description

Firmware homology detection method based on multi-dimensional features
Technical Field
The invention relates to the field of information security, and in particular to a firmware homology detection method based on multi-dimensional features.
Background
With the advent of the Internet of Things (IoT) era, IoT terminal devices such as network cameras, wearable devices, activity trackers, smart cars, and smart home devices have developed rapidly and are widely used. According to a Gartner report, the number of IoT devices will exceed 20 billion in 2020. At the same time, security attacks against IoT devices continue to rise. The main attack method is to exploit device vulnerabilities to gain control of the device, and then either spread malicious code at scale to control cyberspace, or steal user data and hijack network traffic for underground black-market transactions.
Because IoT device design and development focus mainly on functionality, security is neglected at design time, vulnerabilities are introduced through carelessness during development, and later security checks are lacking. Meanwhile, because components are reused, the firmware of devices from different manufacturers, of different types, and with different CPU architectures contains large amounts of binary code compiled from the same source code, and this binary code potentially shares the same vulnerabilities. Firmware homology detection is a vulnerability discovery technique that performs large-scale homology detection for exactly this situation. The prior art starts from a single angle only, such as string matching or function control flow graphs, which is one-sided; in particular, its detection precision is low for firmware protected by obfuscation, encryption, and similar measures. To address this, a firmware homology detection method based on multi-dimensional features is proposed.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing homology detection methods, which rely on a single dimension and have low accuracy and precision when detecting firmware protected by measures such as obfuscation and encryption. The method uses multi-dimensional features such as strings and function control flow graphs for firmware homology detection, improves detection accuracy, and is capable of cross-platform similarity detection.
The specific technical solution for achieving the purpose of the invention is as follows:
A firmware homology detection method based on multi-dimensional features, characterized by comprising the following specific steps:
Step S1: identify the firmware format, unpack the firmware, and extract identifiable files;
Step S2: generate a file hash feature for each identifiable file using a hash algorithm; extract the strings in the identifiable files and generate string hash features using a hash algorithm; filter the strings, removing those related to the firmware's operating system platform, compiler, kernel, and the like, to generate string constant features; extract the binary files among the identifiable files and generate function control flow graph features;
Step S3: perform hash similarity calculation on the file hash features and the string hash features, and assign different weights to the two results to generate a hash similarity index; match the string constant features to generate a constant matching similarity index; perform graph similarity calculation on the function control flow graph features to generate a graph similarity index;
Step S4: assign different weights to the hash similarity index, the constant matching similarity index, and the graph similarity index, and compute the firmware similarity between the firmware images to be detected.
The identifiable files extracted in step S1 include third-party components in the firmware such as BusyBox, OpenSSL, and JavaScript, and dynamic link library files such as libsctp.
In step S2, the file hash feature is generated for each identifiable file using a hash algorithm, where the hash algorithm includes BKDRHash, APHash, JSHash, ssdeep, sdhash, or CTPH.
In step S2, the strings in the identifiable files are extracted using the strings command and third-party open-source tools.
In step S2, the filtering produces string constants, which include third-party version libraries, stack-space string information, symbol tables, and human-readable strings with real-world meaning.
In step S2, the binary files among the identifiable files are extracted and the function control flow graph features are generated using the third-party open-source tool angr, IDA Pro, or other reverse-engineering tools.
In step S3, the string constants are matched to generate the constant matching similarity index using the Jaro-Winkler similarity algorithm or an edit distance algorithm.
In step S3, the graph similarity calculation on the function control flow graph features uses the K-nearest neighbor, VF2, or Ullmann algorithm.
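As a rough illustration of what a graph similarity index on control flow graphs could look like, the sketch below compares two graphs by their per-node (in-degree, out-degree) signatures. This is a deliberately coarse stand-in chosen for brevity and is not VF2, Ullmann, or K-nearest neighbor; the adjacency-dictionary representation and the function names are assumptions made for the example.

```python
from collections import Counter

def degree_signatures(cfg):
    """Count (in_degree, out_degree) pairs over all nodes of a CFG.

    cfg is assumed to be a dict mapping each basic-block id to a list of
    successor ids (a hypothetical representation, not a tool's output format).
    """
    in_deg = Counter()
    nodes = set(cfg)
    for src, successors in cfg.items():
        for dst in successors:
            in_deg[dst] += 1
            nodes.add(dst)
    return Counter((in_deg[n], len(cfg.get(n, ()))) for n in nodes)

def coarse_graph_similarity(cfg_a, cfg_b):
    """Return a 0-100 score from the overlap of degree signatures.

    Isomorphic graphs always score 100, but the converse does not hold,
    so this is only a cheap pre-filter, not an isomorphism test.
    """
    sig_a, sig_b = degree_signatures(cfg_a), degree_signatures(cfg_b)
    size = max(sum(sig_a.values()), sum(sig_b.values()))
    if size == 0:
        return 0.0
    shared = sum((sig_a & sig_b).values())  # multiset intersection
    return 100.0 * shared / size
```

A score of 100 here only means the two graphs cannot be told apart by node degrees; an exact matcher such as VF2 would still be needed to confirm structural equivalence.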
The beneficial effects of the invention are as follows:
The method addresses the low accuracy and precision of single-dimensional features in existing methods when the firmware uses protective measures such as obfuscation and encryption. Through multi-dimensional feature comparison, firmware homology comparison can be performed quickly and accurately, and cross-platform firmware homology detection is also supported.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a detailed flow chart of an embodiment of the present invention.
Detailed Description
The present invention is described in further detail below with reference to specific examples and the accompanying drawings. Except for the content specifically mentioned below, the procedures, conditions, experimental methods, and the like for carrying out the invention are general and common knowledge in the art, and the invention is not particularly limited in these respects.
As shown in FIG. 1, the present invention comprises the following steps:
Step S1: identify the firmware format, unpack the firmware, and extract identifiable files;
Step S2: generate a file hash feature for each identifiable file using a hash algorithm; extract the strings in the identifiable files, generate string hash features using a hash algorithm, and filter the strings to generate string constant features; generate function control flow graph features for the binary files among the identifiable files;
Step S3: perform hash similarity calculation on the file hash features and the string hash features to generate a hash similarity index; match the string constant features to generate a constant matching similarity index; perform graph similarity calculation on the function control flow graph features to generate a graph similarity index;
Step S4: assign different weights to the hash similarity index, the constant matching similarity index, and the graph similarity index obtained in step S3, and compute the firmware similarity as the overall similarity index. The larger the value, the more similar the firmware being compared; the smaller the value, the less similar.
Examples
Referring to FIG. 2, the present embodiment is described in detail below:
Step S1:
For firmware 1 and firmware 2 to be detected, open-source tools such as binwalk or BAP are used to identify the firmware type, and the whole file is scanned for signatures to extract identifiable files. The extracted firmware files include, but are not limited to, third-party components such as BusyBox, OpenSSL, and JavaScript, and dynamic link library files such as libsctp.
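Once a tool such as binwalk has unpacked the firmware into a directory, the later steps need to know which extracted files are binaries (for control flow graph extraction) and which are not. The sketch below is one possible way to do that split, assuming the firmware has already been unpacked and that binaries are ELF files; both the ELF-only check and the function name are assumptions for illustration, since the patent does not restrict the binary format.

```python
import os

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF file

def classify_extracted_files(root_dir):
    """Walk an unpacked firmware tree and split files into ELF binaries and others.

    Assumes root_dir is the output directory of an unpacking tool.
    Symbolic links and non-regular files are skipped.
    """
    binaries, others = [], []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as fh:
                magic = fh.read(4)
            if magic == ELF_MAGIC:
                binaries.append(path)
            else:
                others.append(path)
    return sorted(binaries), sorted(others)
```

Firmware for bare-metal or RTOS targets may contain raw binary blobs rather than ELF files, in which case a different identification rule would be needed.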
Step S2:
A hash value is generated directly for each extracted identifiable file using a hash algorithm; this hash value is the file hash feature of that file, and every extracted firmware file generates one file hash feature. The hash algorithms used include, but are not limited to, BKDRHash, APHash, JSHash, CTPH, ssdeep, and sdhash. The strings of the identifiable files are extracted using the strings command or a third-party open-source tool. On the one hand, the extracted strings are left unfiltered and hashed directly with a hash algorithm to serve as the string hash feature of the file. On the other hand, the strings are filtered to remove those that reduce accuracy, such as strings related to the SDK, instruction set, operating system, kernel, and compiler; the string constants that remain after filtering include third-party version libraries, stack-space string information, symbol tables, and human-readable strings with practical meaning, and these form the string constant feature of the file. For the binary files among the extracted identifiable files, a function control flow graph generation tool such as angr or IDA Pro is used to extract the function control flow graphs and generate the function control flow graph features.
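The string-related part of step S2 can be sketched as follows: pull printable strings out of a file's bytes, hash them with BKDRHash for the string hash feature, and filter out platform-related strings to obtain the string constants. This is a minimal illustration only. The keyword list, the minimum length of 4, and the function names are assumptions; BKDRHash is an exact hash, whereas fuzzy hashes such as ssdeep or sdhash (also named in the text) would tolerate small changes.

```python
import re

def extract_strings(data: bytes, min_len: int = 4):
    """Return printable ASCII runs of at least min_len bytes (similar in spirit to `strings`)."""
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode("ascii") + rb",}")
    return [m.decode("ascii") for m in pattern.findall(data)]

def bkdr_hash(text: str, seed: int = 131) -> int:
    """BKDRHash: multiply-and-add over the bytes, kept to 31 bits."""
    h = 0
    for byte in text.encode("utf-8"):
        h = (h * seed + byte) & 0x7FFFFFFF
    return h

# Hypothetical keyword list; the patent only says that strings related to the
# SDK, instruction set, operating system, kernel, and compiler are removed.
PLATFORM_KEYWORDS = ("gcc", "glibc", "uclibc", "linux")

def filter_constants(strings, keywords=PLATFORM_KEYWORDS):
    """Drop strings that mention any platform-related keyword (case-insensitive)."""
    kept = []
    for s in strings:
        lowered = s.lower()
        if not any(k in lowered for k in keywords):
            kept.append(s)
    return kept

def string_features(data: bytes):
    """Return (string hash feature, string constant feature) for one file's bytes."""
    strings = extract_strings(data)
    hash_feature = {bkdr_hash(s) for s in strings}   # unfiltered strings, hashed
    constant_feature = filter_constants(strings)     # filtered strings, kept as text
    return hash_feature, constant_feature
```

Keeping the unfiltered hashes and the filtered constants as two separate features mirrors the two paths described above: one feeds the hash similarity index, the other feeds constant matching.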
Step S3:
The file hash features and string hash features of the firmware being compared are each compared, and the results are combined by weighting to generate the hash similarity index. Constant matching similarity index calculation includes, but is not limited to, the Jaro-Winkler similarity algorithm and edit distance. Graph similarity index calculation includes, but is not limited to, the K-nearest neighbor, VF2, and Ullmann algorithms. The similarity index computed for each feature ranges from 0 to 100, where 0 means no similarity and 100 means complete agreement.
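To make the constant matching step concrete, the sketch below implements the Jaro-Winkler similarity and uses it to score two sets of string constants on the 0 to 100 scale. The aggregation rule (average of each constant's best match in the other set) is an assumption made for this example; the text does not specify how per-string scores are combined.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        lo, hi = max(0, i - window), min(i + window + 1, len2)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler: boost the Jaro score for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def constant_match_index(consts_a, consts_b) -> float:
    """Hypothetical aggregation: mean best-match score of A's constants in B, scaled to 0-100."""
    if not consts_a or not consts_b:
        return 0.0
    best = [max(jaro_winkler(a, b) for b in consts_b) for a in consts_a]
    return 100.0 * sum(best) / len(best)
```

Because the score is averaged over the first set only, the index is not symmetric; averaging both directions is one way to make it so.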
Step S4:
The firmware similarity is obtained by assigning different weights to the hash similarity index, the constant matching similarity index, and the graph similarity index from step S3 and computing the weighted result for the firmware being compared. The larger the value, the more similar the two firmware images; the smaller the value, the less similar. The computed value ranges from 0 to 100, where 0 means no similarity and 100 means complete agreement.
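The two levels of weighting described in steps S3 and S4 can be written as a pair of weighted averages. The sketch below assumes normalized weighted means so that the result stays in the 0 to 100 range; the default weight values and function names are placeholders, since the patent does not disclose specific weights.

```python
def weighted_index(scores, weights):
    """Weighted mean of scores; weights need not sum to 1 but must sum to a positive value."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total

def firmware_similarity(file_hash_sim, string_hash_sim, constant_sim, graph_sim,
                        hash_weights=(0.5, 0.5), final_weights=(0.3, 0.3, 0.4)):
    """Combine per-feature scores (each 0-100) into one firmware similarity (0-100).

    hash_weights and final_weights are illustrative placeholders only.
    """
    # Step S3: file hash and string hash similarities form the hash similarity index.
    hash_index = weighted_index((file_hash_sim, string_hash_sim), hash_weights)
    # Step S4: hash, constant matching, and graph indices form the overall score.
    return weighted_index((hash_index, constant_sim, graph_sim), final_weights)
```

Dividing by the sum of the weights keeps the output on the same 0 to 100 scale as the inputs regardless of how the weights are chosen.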
The protection of the present invention is not limited to the above embodiments. Variations and advantages that may occur to those skilled in the art, without departing from the spirit and scope of the inventive concept, are included in the invention, and the scope of protection is defined by the appended claims.

Claims (8)

1. A firmware homology detection method based on multi-dimensional features, characterized by comprising the following specific steps:
step S1: identifying the firmware format, unpacking the firmware, and extracting identifiable files;
step S2: generating a file hash feature for each identifiable file using a hash algorithm; extracting the strings in the identifiable files and generating string hash features using a hash algorithm; filtering the strings, removing those related to the firmware's operating system platform, compiler, kernel, and the like, to generate string constant features; extracting the binary files among the identifiable files and generating function control flow graph features;
step S3: performing hash similarity calculation on the file hash features and the string hash features, and assigning different weights to the two results to generate a hash similarity index; matching the string constant features to generate a constant matching similarity index; performing graph similarity calculation on the function control flow graph features to generate a graph similarity index;
step S4: assigning different weights to the hash similarity index, the constant matching similarity index, and the graph similarity index, and computing the firmware similarity between the firmware images to be detected.
2. The firmware homology detection method according to claim 1, wherein the identifiable files extracted in step S1 include third-party components such as BusyBox, OpenSSL, and JavaScript, and dynamic link library files such as libsctp.
3. The firmware homology detection method according to claim 1, wherein in step S2 the file hash feature is generated for each identifiable file using a hash algorithm, the hash algorithm including BKDRHash, APHash, JSHash, ssdeep, sdhash, or CTPH.
4. The firmware homology detection method according to claim 1, wherein in step S2 the strings in the identifiable files are extracted using the strings command and third-party open-source tools.
5. The firmware homology detection method according to claim 1, wherein the filtering in step S2 generates string constants, the string constants including third-party version libraries, stack-space string information, symbol tables, and human-readable strings with real-world meaning.
6. The firmware homology detection method according to claim 1, wherein in step S2 the binary files among the identifiable files are extracted and the function control flow graph features are generated using the third-party open-source tool angr, IDA Pro, or other reverse-engineering tools.
7. The firmware homology detection method according to claim 1, wherein in step S3 the string constants are matched to generate the constant matching similarity index using the Jaro-Winkler similarity algorithm or an edit distance algorithm.
8. The firmware homology detection method according to claim 1, wherein in step S3 the graph similarity calculation on the function control flow graph features uses the K-nearest neighbor, VF2, or Ullmann algorithm.
CN202010932458.6A 2020-09-08 2020-09-08 Firmware homology detection method based on multi-dimensional features Pending CN112084146A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010932458.6A CN112084146A (en) 2020-09-08 2020-09-08 Firmware homology detection method based on multi-dimensional features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010932458.6A CN112084146A (en) 2020-09-08 2020-09-08 Firmware homology detection method based on multi-dimensional features

Publications (1)

Publication Number Publication Date
CN112084146A (en) 2020-12-15

Family

ID=73732151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010932458.6A Pending CN112084146A (en) 2020-09-08 2020-09-08 Firmware homology detection method based on multi-dimensional features

Country Status (1)

Country Link
CN (1) CN112084146A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868108A (en) * 2016-03-28 2016-08-17 中国科学院信息工程研究所 Instruction-set-irrelevant binary code similarity detection method based on neural network
CN109063055A (en) * 2018-07-19 2018-12-21 中国科学院信息工程研究所 Homologous binary file search method and device
CN109460386A (en) * 2018-10-29 2019-03-12 杭州安恒信息技术股份有限公司 The matched malicious file homology analysis method and device of Hash is obscured based on various dimensions
CN110414238A (en) * 2019-06-18 2019-11-05 中国科学院信息工程研究所 The search method and device of homologous binary code
CN110362966A (en) * 2019-07-11 2019-10-22 华东师范大学 A kind of cross-platform firmware homology safety detection method based on fuzzy Hash
CN111104674A (en) * 2019-11-06 2020-05-05 中国电力科学研究院有限公司 Power firmware homologous binary file association method and system
CN111310178A (en) * 2020-01-20 2020-06-19 武汉理工大学 Firmware vulnerability detection method and system under cross-platform scene

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113704180A (en) * 2021-07-10 2021-11-26 国网浙江省电力有限公司信息通信分公司 Lossless firmware extraction method based on embedded equipment firmware file information feature library
CN113704180B (en) * 2021-07-10 2024-03-15 国网浙江省电力有限公司信息通信分公司 Lossless firmware extraction method based on embedded device firmware file information feature library
CN114489787A (en) * 2022-04-06 2022-05-13 奇安信科技集团股份有限公司 Software component analysis method, device, electronic equipment and storage medium
CN116578979A (en) * 2023-05-15 2023-08-11 软安科技有限公司 Cross-platform binary code matching method and system based on code features
CN116578979B (en) * 2023-05-15 2024-05-31 软安科技有限公司 Cross-platform binary code matching method and system based on code features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination