US20210202031A1 - Methods and systems for an integrated disassembler with a function-queue manager and a disassembly interrupter for rapid, efficient, and scalable code gene extraction and analysis

Publication number
US20210202031A1
Authority
US
United States
Prior art keywords
code
individually
verified
queue
identifiable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US16/731,195
Other versions
US11056212B1 (en
Inventor
Itai Tevet
Roy Halevi
Jonathan Abrahamy
Ari Eitan
David Tufik
Jay Rosenberg
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intezer Labs Ltd
Original Assignee
Intezer Labs Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intezer Labs Ltd
Priority to US16/731,195 (patent US11056212B1)
Assigned to Intezer Labs, Ltd.: assignment of assignors' interest (see document for details). Assignors: ABRAHAMY, JONATHAN; EITAN, ARI; HALEVI, ROY; ROSENBERG, JAY; TEVET, ITAI; TUFIK, DAVID
Publication of US20210202031A1
Application granted
Publication of US11056212B1
Assigned to COMERICA BANK: security interest (see document for details). Assignors: Intezer Labs, Ltd.
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/564 Static detection by virus signature recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/53 Decompilation; Disassembly
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements

Definitions

  • the present invention relates to methods and systems for an integrated disassembler with a function-queue manager and a disassembly interrupter for rapid, efficient, and scalable code gene extraction and analysis.
  • code fragments need to be identified in a source file before such fragments can be classified as code genes and analyzed.
  • files are binary files which need to be disassembled into assembly code containing instructions. This involves parsing the code into blocks from which code genes can be extracted representing at least one logic unit (i.e., from the start-block location/address to the stop-block location/address of a single block).
  • Software Reverse Engineering (SRE) relies on disassembling binary files using a disassembler (such as IDA Pro, RADARE, and GHIDRA).
  • Disassemblers identify the entry points of a binary file as the starting points for an assembly stream signature.
  • the byte sequence of the code is disassembled into functions with their associated arguments. Disassembly is necessary if one wants to analyze a binary file in any meaningful way to determine its inherent functionality from a bitstream of indistinguishable zeroes and ones. By breaking the file into its component functions, each function can be analyzed and understood.
  • The goal of such disassembly can be multifold. Applications include ultimately searching for shared code genes by comparing the genes against a database of known assembly-code fragments, including both malicious and trusted code fragments. While disassemblers differ slightly, all disassemblers require a file to be fully disassembled before such fragments can be further extracted, normalized, and analyzed by a gene-analysis system for detection of code genes, whether trusted code or malware.
  • Using RADARE, one can analyze the code of a function at a known address. However, one is limited to only the code fragments that are known in advance. A series of code fragments in an unknown file would require full disassembly before any one of the extracted code fragments could be inspected, making such undertakings tediously manual and lacking in scalability.
  • “exemplary” is used herein to refer to examples of embodiments and/or implementations, and is not meant to necessarily convey a more-desirable use-case.
  • “alternative” and “alternatively” are used herein to refer to an example out of an assortment of contemplated embodiments and/or implementations, and are not meant to necessarily convey a more-desirable use-case. Therefore, it is understood from the above that “exemplary” and “alternative” may be applied herein to multiple embodiments and/or implementations. Various combinations of such alternative and/or exemplary embodiments are also contemplated herein.
  • Embodiments of the present invention enable disassembly of binary code into assembly code by starting at one or more entry points, and continuing through other analysis points (such as blocks from jump commands and export calls).
  • When a function is detected (i.e., when the end of the last block of a function, or the beginning of the first block of the next function, is found), the disassembled function can be accessed by a code-matching analysis program, without having to wait for all functions in the file to be fully disassembled. This saves substantial time in the overall detection and analysis of shared code genes.
  • Once a code-matching analysis program has detected a requisite amount of gene information that the file contains, the disassembler can terminate the disassembly process during disassembly, saving valuable resources of the disassembler to process other files.
  • Such a disassembler makes the overall process of disassembly and gene analysis streamlined through automation of the extracted gene analysis as each function is separated from the binary file and becomes available. Each function can be addressed during the analysis stage, allowing for scalability to process bulk files. Both aspects provide significant enhancement in the ability to rapidly process binary files to analyze their code fragments for malicious and/or trusted genes in a shared code database.
  • Embodiments of the present invention provide a disassembler with an integrated Function-Queue Manager (FQM) for submitting disassembled functions to be searched within a database of known shared genes, both trusted and malicious.
  • Embodiments of the present invention further provide a disassembly interrupter for determining whether to terminate disassembling of a target binary file during disassembly based on the gene information.
  • the gene information regarding the target binary file is received from the gene-analysis system.
  • Such gene information can include the total number of detected genes, the number of detected genes by category or type, the number of detected genes by gene criticality, severity, and/or importance, and/or the presence of detected genes by gene criticality, severity, and/or importance.
  • the gene information can also include ancillary information about the file such as a current elapsed time for a given disassembly process.
  • a method for an integrated disassembler for code gene analysis including the steps of: (a) upon receiving a target binary file, disassembling the target binary file into assembly code; (b) extracting individually-identifiable code fragments from the assembly code; (c) as each individually-identifiable code fragment is extracted, verifying each individually-identifiable code fragment; (d) upon availability, placing each verified individually-identifiable code fragment in an extractor queue; and (e) upon availability, submitting each individually-identifiable code fragment in the extractor queue to a gene-analysis system having a code genome database.
  • the step of placing includes placing only each individually-identifiable code fragment that has been completely verified to be a valid function.
  • the method further including the steps of: (f) upon determining each individually-identifiable code fragment has not been completely verified, placing each partially-verified individually-identifiable code fragment in a verification queue; (g) performing additional verification on each partially-verified individually-identifiable code fragment; and (h) upon successfully completing the additional verification on each partially-verified individually-identifiable code fragment, transferring each completely-verified individually-identifiable code fragment to the extractor queue.
  • the method further including the step of: (i) upon determining the extractor queue is empty, transferring each partially-verified individually-identifiable code fragment to the extractor queue.
  • the method further including the step of: (i) upon determining resources of the gene-analysis system are underutilized, transferring each partially-verified individually-identifiable code fragment to the extractor queue.
  • the method further including the step of: (f) upon receiving gene information regarding the target binary file from the gene-analysis system during disassembly, determining whether to terminate the step of disassembling based on the gene information.
  • a system for an integrated disassembler for code gene analysis including: (a) a CPU for performing computational operations; (b) a memory module for storing data; (c) a disassembly module configured for, upon receiving a target binary file, disassembling the target binary file into assembly code; (d) an extracting module configured for extracting individually-identifiable code fragments from the assembly code; (e) a verification module configured for, as each individually-identifiable code fragment is extracted, verifying each individually-identifiable code fragment; and (f) a function-queue manager configured for: (i) upon availability, placing each verified individually-identifiable code fragment in an extractor queue; and (ii) upon availability, submitting each individually-identifiable code fragment in the extractor queue to a gene-analysis system having a code genome database.
  • the function-queue manager is further configured for: (iii) placing only each individually-identifiable code fragment that has been completely verified to be a valid function.
  • the function-queue manager is further configured for: (iv) upon the verification module determining each individually-identifiable code fragment has not been completely verified, placing each partially-verified individually-identifiable code fragment in a verification queue; (v) performing additional verification on each partially-verified individually-identifiable code fragment by the verification module; and (vi) upon successfully completing the additional verification on each partially-verified individually-identifiable code fragment, transferring each completely-verified individually-identifiable code fragment to the extractor queue.
  • the function-queue manager is further configured for: (vii) upon determining the extractor queue is empty, transferring each partially-verified individually-identifiable code fragment to the extractor queue.
  • the function-queue manager is further configured for: (vii) upon determining resources of the gene-analysis system are underutilized, transferring each partially-verified individually-identifiable code fragment to the extractor queue.
  • the system further including: (g) a disassembly interrupter configured for, upon receiving gene information regarding the target binary file from the gene-analysis system during disassembly, determining whether to terminate the step of disassembling based on the gene information.
  • a non-transitory computer-readable storage medium having computer-readable code embodied on the non-transitory computer-readable storage medium, for an integrated disassembler for code gene analysis, the computer-readable code including: (a) program code for, upon receiving a target binary file, disassembling the target binary file into assembly code; (b) program code for extracting individually-identifiable code fragments from the assembly code; (c) program code for, as each individually-identifiable code fragment is extracted, verifying each individually-identifiable code fragment; (d) program code for, upon availability, placing each verified individually-identifiable code fragment in an extractor queue; and (e) program code for, upon availability, submitting each individually-identifiable code fragment in the extractor queue to a gene-analysis system having a code genome database.
  • the placing includes placing only each individually-identifiable code fragment that has been completely verified to be a valid function.
  • the computer-readable code further including: (f) program code for, upon determining each individually-identifiable code fragment has not been completely verified, placing each partially-verified individually-identifiable code fragment in a verification queue; (g) program code for performing additional verification on each partially-verified individually-identifiable code fragment; and (h) program code for, upon successfully completing the additional verification on each partially-verified individually-identifiable code fragment, transferring each completely-verified individually-identifiable code fragment to the extractor queue.
  • the computer-readable code further including: (i) program code for, upon determining the extractor queue is empty, transferring each partially-verified individually-identifiable code fragment to the extractor queue.
  • the computer-readable code further including: (i) program code for, upon determining resources of the gene-analysis system are underutilized, transferring each partially-verified individually-identifiable code fragment to the extractor queue.
  • the computer-readable code further including: (f) program code for, upon receiving gene information regarding the target binary file from the gene-analysis system during disassembly, determining whether to terminate the disassembling based on the gene information.
  • FIG. 1 is a simplified flowchart of the major process steps for an integrated disassembler for code gene extraction and analysis, according to embodiments of the present invention.
  • FIG. 2 is a simplified flowchart of the major process steps for the Function-Queue Manager (FQM) and disassembly interrupter, according to embodiments of the present invention.
  • the principles and operation for providing such methods and systems, according to the present invention may be better understood with reference to the accompanying description and the drawings.
  • FIG. 1 is a simplified flowchart of the major process steps for an integrated disassembler with a function-queue manager for code gene extraction and analysis, according to embodiments of the present invention.
  • the process starts with activation of the disassembly process upon accessing a target binary file and finding the entry points (Step 2 ).
  • the binary file is then disassembled into assembly code by finding instructions such as function calls or starts of loops (Step 4 ).
  • Individually-identifiable code fragments are extracted from the assembly code (Step 6 ).
  • the individually-identifiable code fragments are then queued upon availability for gene analysis without requiring the entire binary file to be fully disassembled (Step 8 ).
  • the individually-identifiable code fragments are then submitted to a gene-analysis system for determining whether the code fragments are trusted or malicious (Step 10 ).
  • FIG. 2 is a simplified flowchart of the major process steps for the Function-Queue Manager (FQM) and disassembly interrupter, according to embodiments of the present invention.
  • a function is only placed in the extractor queue if the function has been completely verified (Step 28 ). If a function hasn't been completely verified, the function is placed in a verification queue (Step 30 ). Functions in the verification queue undergo further verification to determine if they are truly valid, unique, and meaningful functions (Step 32 ). Functions in the verification queue are transferred to the extractor queue upon successfully completing verification (Step 34 ).
  • functions in the verification queue can also be transferred to the extractor queue, even without being completely verified, once the extractor queue becomes empty, in order to prevent the gene-analysis system from becoming idle (Step 36 ).
  • the FQM can check if the gene-analysis system is idle or underutilized before transferring functions from the verification queue to the extractor queue (Step 38 ).
  • a disassembly interrupter can determine whether to terminate disassembly based on the gene information (Step 40 ). Finally, the process returns to Step 26 by submitting the verified functions to the gene-analysis system.
  • the disassembly interrupter prevents the disassembler from continuing to disassemble the target binary file unnecessarily once the gene-analysis system has obtained enough gene information regarding the file to categorize the nature of the file (e.g., known shared genes, trusted genes, and/or malicious genes), saving valuable resources of the disassembler to process other files.

Abstract

The present invention discloses methods and systems for an integrated disassembler with a function-queue manager and a disassembly interrupter for rapid, efficient, and scalable code gene extraction and analysis. Methods include the steps of: upon receiving a target binary file, disassembling the target binary file into assembly code; extracting code fragments from the assembly code; as each code fragment is extracted, verifying each code fragment; upon availability, placing each verified code fragment in an extractor queue; and upon availability, submitting each code fragment in the extractor queue to a gene-analysis system having a code genome database. Alternatively, the methods include, upon determining the extractor queue is empty or determining resources of the gene-analysis system are underutilized, transferring partially-verified code fragments to the extractor queue. Alternatively, the methods include, upon receiving gene information regarding the target binary file from the gene-analysis system during disassembly, determining whether to terminate the step of disassembling based on the gene information.

Description

    FIELD AND BACKGROUND OF THE INVENTION
  • The present invention relates to methods and systems for an integrated disassembler with a function-queue manager and a disassembly interrupter for rapid, efficient, and scalable code gene extraction and analysis.
  • Despite the rapid pace of technology in general, few industries today are as dynamic as that of cyber security. Attackers' techniques are constantly evolving, and along with them, the potential threat.
  • For security teams, the challenge is not merely to keep up with attackers, but rather to outpace them. It is a persistent struggle: a never-ending, record-setting marathon run at a constant sprint. Even as security professionals rest, attackers are hard at work. The tools and approaches that security teams use must also adapt in order to stay a step ahead in defending their organizations. Malware classification, which encompasses both the identification and attribution of code, has the power to unlock many clues that aid security teams in achieving this.
  • Whether legitimate or malicious, nearly all software is composed of previously written code; the key to deeply understanding its nature and origins lies in discovering code that has appeared in previously known software. Reports on malware statistics indicate that there are around 350,000 new samples every day.
  • In order to determine if a file is benign/trusted or malicious, code fragments need to be identified in a source file before such fragments can be classified as code genes and analyzed. Typically, such files are binary files which need to be disassembled into assembly code containing instructions. This involves parsing the code into blocks from which code genes can be extracted representing at least one logic unit (i.e., from the start-block location/address to the stop-block location/address of a single block).
  • Software Reverse Engineering (SRE) relies on disassembling binary files using a disassembler (such as IDA Pro, RADARE, and GHIDRA). Such disassemblers identify the entry points of a binary file as the starting points for an assembly stream signature. The byte sequence of the code is disassembled into functions with their associated arguments. Disassembly is necessary if one wants to analyze a binary file in any meaningful way to determine its inherent functionality from a bitstream of indistinguishable zeroes and ones. By breaking the file into its component functions, each function can be analyzed and understood.
  • The goal of such disassembly can be multifold. Applications include ultimately searching for shared code genes by comparing the genes against a database of known assembly-code fragments, including both malicious and trusted code fragments. While disassemblers differ slightly, all disassemblers require a file to be fully disassembled before such fragments can be further extracted, normalized, and analyzed by a gene-analysis system for detection of code genes, whether trusted code or malware.
  • Using RADARE, one can analyze the code of a function at a known address. However, one is limited to only the code fragments that are known in advance. A series of code fragments in an unknown file would require full disassembly before any one of the extracted code fragments could be inspected, making such undertakings tediously manual and lacking in scalability.
  • Given that there can be a very large number of such functions in a binary file, when code matching of genes is the goal, such disassemblers are slow, clumsy, and inefficient in processing a file, requiring manual entry and consuming valuable processing time. To appreciate the significance of efficiency and scalability, typically such code matching currently involves analyzing tens of billions of genes from tens of millions of files. Disassembling each binary file for extracting genes with current disassemblers would take around 16 years (calculated based on 10 seconds per file for 50M files).
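  • The 16-year figure can be checked with a short calculation. The sketch below uses only the two numbers stated above (10 seconds per file, 50M files) and assumes, as the estimate appears to, that files are processed one after another on a single disassembler; the variable names are illustrative.

```python
# Back-of-the-envelope check of the "around 16 years" estimate.
# Assumption: files are disassembled sequentially, one at a time.
SECONDS_PER_FILE = 10
FILE_COUNT = 50_000_000
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 seconds

total_seconds = SECONDS_PER_FILE * FILE_COUNT
total_years = total_seconds / SECONDS_PER_YEAR

print(f"{total_seconds:,} s is about {total_years:.1f} years")
```

  • The result, roughly 15.8 years, rounds to the 16 years quoted above.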
  • It would be desirable to have methods and systems for an integrated disassembler with a function-queue manager and a disassembly interrupter for rapid, efficient, and scalable code gene extraction and analysis. Such methods and systems would, inter alia, overcome the various limitations mentioned above.
    SUMMARY
  • It is the purpose of the present invention to provide methods and systems for an integrated disassembler with a function-queue manager and a disassembly interrupter for rapid, efficient, and scalable code gene extraction and analysis.
  • It is noted that the term “exemplary” is used herein to refer to examples of embodiments and/or implementations, and is not meant to necessarily convey a more-desirable use-case. Similarly, the terms “alternative” and “alternatively” are used herein to refer to an example out of an assortment of contemplated embodiments and/or implementations, and are not meant to necessarily convey a more-desirable use-case. Therefore, it is understood from the above that “exemplary” and “alternative” may be applied herein to multiple embodiments and/or implementations. Various combinations of such alternative and/or exemplary embodiments are also contemplated herein.
  • Embodiments of the present invention enable disassembly of binary code into assembly code by starting at one or more entry points, and continuing through other analysis points (such as blocks from jump commands and export calls). When a function is detected (i.e., when the end of the last block of a function, or the beginning of the first block of the next function, is found), the disassembled function can be accessed by a code-matching analysis program, without having to wait for all functions in the file to be fully disassembled. This saves substantial time in the overall detection and analysis of shared code genes. Moreover, once a code-matching analysis program has detected a requisite amount of gene information that the file contains, the disassembler can terminate the disassembly process during disassembly, saving valuable resources of the disassembler to process other files.
  • Such a disassembler makes the overall process of disassembly and gene analysis streamlined through automation of the extracted gene analysis as each function is separated from the binary file and becomes available. Each function can be addressed during the analysis stage, allowing for scalability to process bulk files. Both aspects provide significant enhancement in the ability to rapidly process binary files to analyze their code fragments for malicious and/or trusted genes in a shared code database.
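  • The idea of making each function available as soon as it is detected, rather than after the whole file is processed, can be illustrated with a generator. This is a minimal sketch and not a real disassembler: the `(address, mnemonic)` instruction format, the use of `ret` as the end-of-function marker, and the names `stream_functions` and `toy_listing` are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical toy format: (address, mnemonic). Real disassembly is far richer.
Instruction = Tuple[int, str]


def stream_functions(instructions: Iterable[Instruction]) -> Iterator[List[Instruction]]:
    """Yield each function as soon as its end is detected.

    Assumption for this sketch: a "ret" mnemonic marks the end of a function.
    Trailing instructions with no end marker are not emitted.
    """
    current: List[Instruction] = []
    for instruction in instructions:
        current.append(instruction)
        if instruction[1] == "ret":
            # The function is complete; hand it over before reading further.
            yield current
            current = []


toy_listing: List[Instruction] = [
    (0x1000, "push"),
    (0x1001, "mov"),
    (0x1003, "ret"),
    (0x1004, "xor"),
    (0x1006, "ret"),
]

for function in stream_functions(toy_listing):
    start_address = function[0][0]
    print(f"function at {start_address:#x} with {len(function)} instructions")
```

  • Because the generator is lazy, a consumer such as a code-matching analysis program could begin work on the first function while the remaining instructions are still unread.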
  • Embodiments of the present invention provide a disassembler with an integrated Function-Queue Manager (FQM) for submitting disassembled functions to be searched within a database of known shared genes, both trusted and malicious. Embodiments of the present invention further provide a disassembly interrupter for determining whether to terminate disassembling of a target binary file during disassembly based on the gene information.
  • The gene information regarding the target binary file is received from the gene-analysis system. Such gene information can include the total number of detected genes, the number of detected genes by category or type, the number of detected genes by gene criticality, severity, and/or importance, and/or the presence of detected genes by gene criticality, severity, and/or importance. Furthermore, the gene information can also include ancillary information about the file such as a current elapsed time for a given disassembly process.
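  • One way to picture how a disassembly interrupter might use such gene information is a small record plus a decision function. The sketch below is illustrative only: the field names, the `"malicious"` category label, and every threshold value are hypothetical, since no specific termination criteria are stated here.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GeneInfo:
    """Hypothetical container for gene information on one target file."""
    total_genes: int = 0
    genes_by_category: Dict[str, int] = field(default_factory=dict)
    max_severity: int = 0          # highest severity among detected genes
    elapsed_seconds: float = 0.0   # ancillary: time spent disassembling so far


def should_terminate(
    info: GeneInfo,
    *,
    malicious_threshold: int = 5,        # hypothetical value
    severity_threshold: int = 9,         # hypothetical value
    time_budget_seconds: float = 30.0,   # hypothetical value
) -> bool:
    """Decide whether disassembly of the target file can stop early."""
    if info.genes_by_category.get("malicious", 0) >= malicious_threshold:
        return True  # enough genes of one category to categorize the file
    if info.max_severity >= severity_threshold:
        return True  # presence of a sufficiently critical gene
    return info.elapsed_seconds >= time_budget_seconds


report = GeneInfo(total_genes=12, genes_by_category={"malicious": 6, "trusted": 6})
print(should_terminate(report))
```

  • Keeping the policy in a single function with keyword thresholds is a design choice of this sketch; it simply makes the criteria easy to vary per deployment.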
  • Therefore, according to the present invention, there is provided for the first time a method for an integrated disassembler for code gene analysis, the method including the steps of: (a) upon receiving a target binary file, disassembling the target binary file into assembly code; (b) extracting individually-identifiable code fragments from the assembly code; (c) as each individually-identifiable code fragment is extracted, verifying each individually-identifiable code fragment; (d) upon availability, placing each verified individually-identifiable code fragment in an extractor queue; and (e) upon availability, submitting each individually-identifiable code fragment in the extractor queue to a gene-analysis system having a code genome database.
  • Alternatively, the step of placing includes placing only each individually-identifiable code fragment that has been completely verified to be a valid function.
  • More alternatively, the method further including the steps of: (f) upon determining each individually-identifiable code fragment has not been completely verified, placing each partially-verified individually-identifiable code fragment in a verification queue; (g) performing additional verification on each partially-verified individually-identifiable code fragment; and (h) upon successfully completing the additional verification on each partially-verified individually-identifiable code fragment, transferring each completely-verified individually-identifiable code fragment to the extractor queue.
  • Most alternatively, the method further including the step of: (i) upon determining the extractor queue is empty, transferring each partially-verified individually-identifiable code fragment to the extractor queue.
  • Most alternatively, the method further including the step of: (i) upon determining resources of the gene-analysis system are underutilized, transferring each partially-verified individually-identifiable code fragment to the extractor queue.
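  • The two-queue behavior described in the preceding steps can be sketched as a small class. This is a simplified illustration under stated assumptions, not the claimed implementation: fragments are represented as plain strings, the `fully_verify` callback stands in for the additional verification, and fragments that fail it are simply kept in the verification queue (what happens to them is not specified above). All class and method names are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Optional


class FunctionQueueManager:
    """Toy function-queue manager with an extractor and a verification queue."""

    def __init__(self, fully_verify: Callable[[str], bool]) -> None:
        self.fully_verify = fully_verify  # hypothetical verification callback
        self.extractor_queue: Deque[str] = deque()
        self.verification_queue: Deque[str] = deque()

    def place(self, fragment: str, completely_verified: bool) -> None:
        """Completely-verified fragments go to the extractor queue;
        partially-verified ones go to the verification queue."""
        if completely_verified:
            self.extractor_queue.append(fragment)
        else:
            self.verification_queue.append(fragment)

    def run_additional_verification(self) -> None:
        """Promote fragments that pass the additional verification."""
        still_pending: Deque[str] = deque()
        while self.verification_queue:
            fragment = self.verification_queue.popleft()
            if self.fully_verify(fragment):
                self.extractor_queue.append(fragment)
            else:
                still_pending.append(fragment)  # assumption: keep for later
        self.verification_queue = still_pending

    def next_for_analysis(self, analysis_underutilized: bool = False) -> Optional[str]:
        """Return the next fragment to submit to the gene-analysis system.

        If the extractor queue is empty, or the analysis system reports spare
        capacity, partially-verified fragments are transferred over first.
        """
        if not self.extractor_queue or analysis_underutilized:
            self.extractor_queue.extend(self.verification_queue)
            self.verification_queue.clear()
        return self.extractor_queue.popleft() if self.extractor_queue else None
```

  • In use, a caller would invoke `place` as each fragment is extracted and poll `next_for_analysis` whenever the gene-analysis system can accept more work, so that the analysis side is not left idle while verification is still pending.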
  • Alternatively, the method further including the step of: (f) upon receiving gene information regarding the target binary file from the gene-analysis system during disassembly, determining whether to terminate the step of disassembling based on the gene information.
  • According to the present invention, there is provided for the first time a system for an integrated disassembler for code gene analysis, the system including: (a) a CPU for performing computational operations; (b) a memory module for storing data; (c) a disassembly module configured for, upon receiving a target binary file, disassembling the target binary file into assembly code; (d) an extracting module configured for extracting individually-identifiable code fragments from the assembly code; (e) a verification module configured for, as each individually-identifiable code fragment is extracted, verifying each individually-identifiable code fragment; and (f) a function-queue manager configured for: (i) upon availability, placing each verified individually-identifiable code fragment in an extractor queue; and (ii) upon availability, submitting each individually-identifiable code fragment in the extractor queue to a gene-analysis system having a code genome database.
  • Alternatively, the function-queue manager is further configured for: (iii) placing only each individually-identifiable code fragment that has been completely verified to be a valid function.
  • More alternatively, the function-queue manager is further configured for: (iv) upon the verification module determining each individually-identifiable code fragment has not been completely verified, placing each partially-verified individually-identifiable code fragment in a verification queue; (v) performing additional verification on each partially-verified individually-identifiable code fragment by the verification module; and (vi) upon successfully completing the additional verification on each partially-verified individually-identifiable code fragment, transferring each completely-verified individually-identifiable code fragment to the extractor queue.
  • Most alternatively, the function-queue manager is further configured for: (vii) upon determining the extractor queue is empty, transferring each partially-verified individually-identifiable code fragment to the extractor queue.
  • Most alternatively, the function-queue manager is further configured for: (vii) upon determining resources of the gene-analysis system are underutilized, transferring each partially-verified individually-identifiable code fragment to the extractor queue.
  • Alternatively, the system further including: (g) a disassembly interrupter configured for, upon receiving gene information regarding the target binary file from the gene-analysis system during disassembly, determining whether to terminate the step of disassembling based on the gene information.
  • According to the present invention, there is provided for the first time a non-transitory computer-readable storage medium, having computer-readable code embodied on the non-transitory computer-readable storage medium, for an integrated disassembler for code gene analysis, the computer-readable code including: (a) program code for, upon receiving a target binary file, disassembling the target binary file into assembly code; (b) program code for extracting individually-identifiable code fragments from the assembly code; (c) program code for, as each individually-identifiable code fragment is extracted, verifying each individually-identifiable code fragment; (d) program code for, upon availability, placing each verified individually-identifiable code fragment in an extractor queue; and (e) program code for, upon availability, submitting each individually-identifiable code fragment in the extractor queue to a gene-analysis system having a code genome database.
  • Alternatively, the placing includes placing only each individually-identifiable code fragment that has been completely verified to be a valid function.
  • More alternatively, the computer-readable code further including: (f) program code for, upon determining each individually-identifiable code fragment has not been completely verified, placing each partially-verified individually-identifiable code fragment in a verification queue; (g) program code for performing additional verification on each partially-verified individually-identifiable code fragment; and (h) program code for, upon successfully completing the additional verification on each partially-verified individually-identifiable code fragment, transferring each completely-verified individually-identifiable code fragment to the extractor queue.
  • Most alternatively, the computer-readable code further including: (i) program code for, upon determining the extractor queue is empty, transferring each partially-verified individually-identifiable code fragment to the extractor queue.
  • Most alternatively, the computer-readable code further including: (i) program code for, upon determining resources of the gene-analysis system are underutilized, transferring each partially-verified individually-identifiable code fragment to the extractor queue.
  • Alternatively, the computer-readable code further including: (f) program code for, upon receiving gene information regarding the target binary file from the gene-analysis system during disassembly, determining whether to terminate the disassembling based on the gene information.
  • These and further embodiments will be apparent from the detailed description and examples that follow.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The present invention is herein described, by way of example only, with reference to the accompanying drawings, wherein:
  • FIG. 1 is a simplified flowchart of the major process steps for an integrated disassembler for code gene extraction and analysis, according to embodiments of the present invention;
  • FIG. 2 is a simplified flowchart of the major process steps for the Function-Queue Manager (FQM) and disassembly interrupter, according to embodiments of the present invention.
  • DESCRIPTION OF THE ILLUSTRATIVE EMBODIMENTS
  • The present invention relates to methods and systems for an integrated disassembler with a function-queue manager and a disassembly interrupter for rapid, efficient, and scalable code gene extraction and analysis. The principles and operation for providing such methods and systems, according to the present invention, may be better understood with reference to the accompanying description and the drawings.
  • Referring to the drawings, FIG. 1 is a simplified flowchart of the major process steps for an integrated disassembler with a function-queue manager for code gene extraction and analysis, according to embodiments of the present invention. The process starts with activation of the disassembly process upon accessing a target binary file and finding the entry points (Step 2). The binary file is then disassembled into assembly code by finding instructions such as function calls or starts of loops (Step 4). Individually-identifiable code fragments are extracted from the assembly code (Step 6). The individually-identifiable code fragments are then queued upon availability for gene analysis without requiring the entire binary file to be fully disassembled (Step 8). The individually-identifiable code fragments are then submitted to a gene-analysis system for determining whether the code fragments are trusted or malicious (Step 10).
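  • By way of non-limiting illustration only, the flow of Steps 4 through 10 can be sketched in Python as a streaming pipeline in which each fragment is handed to the analyzer as soon as it is complete, rather than after the whole file has been disassembled. The names `extract_fragments` and `run_pipeline`, the `submit` callback, and the rule that a fragment ends at a `ret` instruction are assumptions made for this sketch; the specification does not define how fragment boundaries are found.

```python
from typing import Callable, Iterable, Iterator, List


def extract_fragments(instructions: Iterable[str]) -> Iterator[List[str]]:
    """Yield each code fragment as soon as it is complete (Step 6).

    Assumption for illustration: a fragment ends at a 'ret' instruction.
    """
    current: List[str] = []
    for instruction in instructions:
        current.append(instruction)
        if instruction.strip() == "ret":
            yield current
            current = []
    if current:
        # Trailing instructions that never reached a 'ret'.
        yield current


def run_pipeline(
    instructions: Iterable[str],
    submit: Callable[[List[str]], None],
) -> int:
    """Submit every fragment upon availability (Steps 8 and 10).

    Because 'instructions' is consumed lazily, submission is interleaved
    with disassembly instead of waiting for it to finish.
    Returns the number of fragments submitted.
    """
    submitted = 0
    for fragment in extract_fragments(instructions):
        submit(fragment)
        submitted += 1
    return submitted
```

  • In this sketch, passing a generator that decodes instructions on demand as `instructions` causes the first fragment to reach `submit` before later instructions are decoded, which is the behavior described for Step 8.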
  • The queuing of the code fragments for gene analysis upon availability in Step 8 is performed by an integrated function-queue manager of the disassembler. FIG. 2 is a simplified flowchart of the major process steps for the Function-Queue Manager (FQM) and disassembly interrupter, according to embodiments of the present invention. Once a function has been potentially identified during disassembly of the target binary file (Step 20), each function is verified by the FQM (Step 22). Upon verification, the verified functions are placed in an extractor queue before transferring for gene analysis without waiting for the entire binary file to be disassembled (Step 24). Verified functions in the extractor queue are then submitted to the code gene database for code matching and gene analysis to identify trusted and malicious genes (Step 26).
  • In some embodiments, a function is only placed in the extractor queue if the function has been completely verified (Step 28). If a function has not been completely verified, the function is placed in a verification queue (Step 30). Functions in the verification queue undergo further verification to determine if they are truly valid, unique, and meaningful functions (Step 32). Functions in the verification queue are transferred to the extractor queue upon successfully completing verification (Step 34).
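  • As a minimal sketch of the two-queue arrangement of Steps 28 through 34, and not as the actual implementation, the routing can be modeled in Python with two `collections.deque` instances. The `Fragment` type, its `fully_verified` flag, and the `recheck` callback standing in for the additional verification are hypothetical. The sketch also assumes that a fragment failing the additional verification simply remains in the verification queue, a case the description leaves open.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque


@dataclass
class Fragment:
    address: int
    fully_verified: bool  # hypothetical result of the first verification pass


@dataclass
class FunctionQueueManager:
    # Hypothetical callback performing the additional verification (Step 32).
    recheck: Callable[[Fragment], bool]
    extractor_queue: Deque[Fragment] = field(default_factory=deque)
    verification_queue: Deque[Fragment] = field(default_factory=deque)

    def place(self, fragment: Fragment) -> None:
        """Route a fragment to the appropriate queue (Steps 28 and 30)."""
        if fragment.fully_verified:
            self.extractor_queue.append(fragment)
        else:
            self.verification_queue.append(fragment)

    def run_additional_verification(self) -> int:
        """Re-verify pending fragments and promote the ones that pass
        (Steps 32 and 34). Returns the number of promoted fragments."""
        promoted = 0
        still_pending: Deque[Fragment] = deque()
        while self.verification_queue:
            fragment = self.verification_queue.popleft()
            if self.recheck(fragment):
                fragment.fully_verified = True
                self.extractor_queue.append(fragment)
                promoted += 1
            else:
                # Assumption: failed fragments stay pending.
                still_pending.append(fragment)
        self.verification_queue = still_pending
        return promoted
```

  • Keeping the queues separate means the consumer of `extractor_queue` only ever sees completely verified functions in this embodiment, while verification work proceeds independently on the other queue.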
  • Alternatively, functions in the verification queue can also be transferred to the extractor queue, even without being completely verified, upon the extractor queue becoming empty, in order to prevent the gene-analysis system from becoming idle (Step 36). Alternatively, the FQM can check if the gene-analysis system is idle or underutilized before transferring functions from the verification queue to the extractor queue (Step 38). Alternatively, upon receiving gene information of the target binary file from the gene-analysis system during disassembly, a disassembly interrupter can determine whether to terminate disassembly based on the gene information (Step 40). Finally, the process returns to Step 26 by submitting the verified functions to the gene-analysis system.
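  • The fallback transfers of Steps 36 and 38 can be illustrated with the following Python sketch. The function name `transfer_partially_verified`, the `analyzer_load` value (a utilization figure between 0 and 1), and the `load_threshold` default are assumptions for this example; the description does not state how underutilization is measured. The two triggers are presented as alternative embodiments in the description and are combined here with a logical OR only for brevity.

```python
from collections import deque
from typing import Deque, Optional


def transfer_partially_verified(
    extractor_queue: Deque,
    verification_queue: Deque,
    analyzer_load: Optional[float] = None,
    load_threshold: float = 0.5,
) -> int:
    """Move partially verified fragments to the extractor queue when the
    extractor queue is empty (Step 36) or the analyzer reports a load
    below 'load_threshold' (Step 38). Returns the number moved.
    """
    queue_empty = len(extractor_queue) == 0
    underutilized = analyzer_load is not None and analyzer_load < load_threshold
    if not (queue_empty or underutilized):
        return 0

    moved = 0
    while verification_queue:
        # Preserve arrival order when handing fragments over.
        extractor_queue.append(verification_queue.popleft())
        moved += 1
    return moved
```

  • The trade-off in this sketch is that the analyzer may receive fragments that later turn out not to be valid functions, in exchange for never sitting idle while unverified work is waiting.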
  • The disassembly interrupter prevents the disassembler from continuing to disassemble the target binary file unnecessarily once the gene-analysis system has obtained enough gene information regarding the file to categorize the nature of the file (e.g., known shared genes, trusted genes, and/or malicious genes), saving valuable resources of the disassembler to process other files. Examples of such gene information can include the total number of detected genes, the number of detected genes by category or type, the number of detected genes by gene criticality, severity, and/or importance, and/or the presence of detected genes by gene criticality, severity, and/or importance. Furthermore, the gene information can also include ancillary information about the file such as a current elapsed time for a given disassembly process.
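  • One possible form of the interrupter's decision is sketched below in Python, using the kinds of gene information listed above (total detected genes, counts by category, and elapsed time). The `GeneInfo` fields, the function name `should_terminate`, and every threshold value are hypothetical; the description does not specify any particular decision rule or numeric limit.

```python
from dataclasses import dataclass


@dataclass
class GeneInfo:
    total_genes: int        # total number of detected genes so far
    malicious_genes: int    # detected genes categorized as malicious
    trusted_genes: int      # detected genes categorized as trusted
    elapsed_seconds: float  # current elapsed time of the disassembly


def should_terminate(
    info: GeneInfo,
    min_genes: int = 50,
    malicious_ratio: float = 0.8,
    trusted_ratio: float = 0.95,
    max_seconds: float = 300.0,
) -> bool:
    """Decide whether further disassembly is unnecessary (Step 40).

    All thresholds are illustrative assumptions.
    """
    # Stop once the time budget for this file is exhausted.
    if info.elapsed_seconds >= max_seconds:
        return True
    # Too little evidence to categorize the file yet.
    if info.total_genes == 0 or info.total_genes < min_genes:
        return False
    malicious_share = info.malicious_genes / info.total_genes
    trusted_share = info.trusted_genes / info.total_genes
    # Stop when the file is already clearly malicious or clearly trusted.
    return malicious_share >= malicious_ratio or trusted_share >= trusted_ratio
```

  • In such a sketch the function would be called each time the gene-analysis system reports updated gene information, and a `True` result would signal the disassembler to stop work on the current file and move on to the next one.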
  • While the present invention has been described with respect to a limited number of embodiments, it will be appreciated that many variations, modifications, and other applications of the present invention may be made.

Claims (15)

1. A method for an integrated disassembler for code gene analysis, the method comprising the steps of:
(a) upon receiving a target binary file, disassembling said target binary file into assembly code;
(b) extracting individually-identifiable code fragments from said assembly code;
(c) upon each said individually-identifiable code fragment being extracted, verifying said each individually-identifiable code fragment;
(d) upon said each individually-identifiable code fragment being verified, placing each verified said individually-identifiable code fragment in an extractor queue, wherein said placing includes placing said each individually-identifiable code fragment that has been completely verified to be a valid function;
(e) upon said each verified individually-identifiable code fragment being placed in said extractor queue, submitting said each individually-identifiable code fragment in said extractor queue to a gene-analysis system having a code genome database;
(f) placing each partially-verified individually-identifiable code fragment in a verification queue, wherein said each partially-verified individually-identifiable code fragment has not been completely verified;
(g) performing additional verification on said each partially-verified individually-identifiable code fragment; and
(h) transferring each completely-verified said individually-identifiable code fragment to said extractor queue.
2-3. (canceled)
4. The method of claim 1, the method further comprising the step of:
(i) transferring said each partially-verified individually-identifiable code fragment to said extractor queue when said extractor queue is empty.
5. The method of claim 1, the method further comprising the step of:
(i) transferring said each partially-verified individually-identifiable code fragment to said extractor queue when resources of said gene-analysis system are underutilized.
6. The method of claim 1, the method further comprising the step of:
(i) upon receiving gene information regarding said target binary file from said gene-analysis system during disassembly, determining whether to terminate said step of disassembling based on said gene information.
7. A system for an integrated disassembler for code gene analysis, the system comprising:
(a) a CPU for performing computational operations;
(b) a memory for storing data and having computer-readable code embodied therein, wherein said computer-readable code includes:
(i) program code for, upon receiving a target binary file, disassembling said target binary file into assembly code;
(ii) program code for extracting individually-identifiable code fragments from said assembly code;
(iii) program code for, upon each said individually-identifiable code fragment being extracted, verifying said each individually-identifiable code fragment;
(iv) program code for, upon said each individually-identifiable code fragment being verified, placing each verified said individually-identifiable code fragment in an extractor queue, wherein said placing includes placing said each individually-identifiable code fragment that has been completely verified to be a valid function;
(v) program code for, upon said each verified individually-identifiable code fragment being placed in said extractor queue, submitting said each individually-identifiable code fragment in said extractor queue to a gene-analysis system having a code genome database;
(vi) program code for placing each partially-verified said individually-identifiable code fragment in a verification queue, wherein said each partially-verified individually-identifiable code fragment has not been completely verified;
(vii) program code for performing additional verification on said each partially-verified individually-identifiable code fragment by said verification module; and
(viii) program code for transferring each completely-verified said individually-identifiable code fragment to said extractor queue.
8-9. (canceled)
10. The system of claim 7, wherein said computer-readable code further includes:
(ix) program code for transferring said each partially-verified individually-identifiable code fragment to said extractor queue when said extractor queue is empty.
11. The system of claim 7, wherein said computer-readable code further includes:
(ix) program code for transferring said each partially-verified individually-identifiable code fragment to said extractor queue when resources of said gene-analysis system are underutilized.
12. The system of claim 7, said computer-readable code further includes:
(ix) program code for, upon receiving gene information regarding said target binary file from said gene-analysis system during disassembly, determining whether to terminate said step of disassembling based on said gene information.
13. A non-transitory computer-readable storage medium, having computer-readable code embodied on the non-transitory computer-readable storage medium, for an integrated disassembler for code gene analysis, the computer-readable code comprising:
(a) program code for, upon receiving a target binary file, disassembling said target binary file into assembly code;
(b) program code for extracting individually-identifiable code fragments from said assembly code;
(c) program code for, upon each said individually-identifiable code fragment being extracted, verifying said each individually-identifiable code fragment;
(d) program code for, upon said each individually-identifiable code fragment being verified, placing each verified said individually-identifiable code fragment in an extractor queue, wherein said placing includes placing said each individually-identifiable code fragment that has been completely verified to be a valid function;
(e) program code for, upon said each verified individually-identifiable code fragment being placed in said extractor queue, submitting said each individually-identifiable code fragment in said extractor queue to a gene-analysis system having a code genome database;
(f) program code for placing each partially-verified said individually-identifiable code fragment in a verification queue, wherein said each partially-verified individually-identifiable code fragment has not been completely verified;
(g) program code for performing additional verification on said each partially-verified individually-identifiable code fragment; and
(h) program code for transferring each completely-verified said individually-identifiable code fragment to said extractor queue.
14-15. (canceled)
16. The non-transitory computer-readable storage medium of claim 13, the computer-readable code further comprising:
(i) program code for transferring said each partially-verified individually-identifiable code fragment to said extractor queue when said extractor queue is empty.
17. The non-transitory computer-readable storage medium of claim 13, the computer-readable code further comprising:
(i) program code for transferring said each partially-verified individually-identifiable code fragment to said extractor queue when resources of said gene-analysis system are underutilized.
18. The non-transitory computer-readable storage medium of claim 13, the computer-readable code further comprising:
(i) program code for, upon receiving gene information regarding said target binary file from said gene-analysis system during disassembly, determining whether to terminate said disassembling based on said gene information.
US16/731,195 2019-12-31 2019-12-31 Methods and systems for an integrated disassembler with a function-queue manager and a disassembly interrupter for rapid, efficient, and scalable code gene extraction and analysis Active US11056212B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/731,195 US11056212B1 (en) 2019-12-31 2019-12-31 Methods and systems for an integrated disassembler with a function-queue manager and a disassembly interrupter for rapid, efficient, and scalable code gene extraction and analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US16/731,195 US11056212B1 (en) 2019-12-31 2019-12-31 Methods and systems for an integrated disassembler with a function-queue manager and a disassembly interrupter for rapid, efficient, and scalable code gene extraction and analysis

Publications (2)

Publication Number Publication Date
US20210202031A1 US20210202031A1 (en) 2021-07-01
US11056212B1 US11056212B1 (en) 2021-07-06

Family

ID=76545681

Family Applications (1)

Application Number Title Priority Date Filing Date
US16/731,195 Active US11056212B1 (en) 2019-12-31 2019-12-31 Methods and systems for an integrated disassembler with a function-queue manager and a disassembly interrupter for rapid, efficient, and scalable code gene extraction and analysis

Country Status (1)

Country Link
US (1) US11056212B1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8533836B2 (en) * 2012-01-13 2013-09-10 Accessdata Group, Llc Identifying software execution behavior
US9003529B2 (en) * 2012-08-29 2015-04-07 The Johns Hopkins University Apparatus and method for identifying related code variants in binaries
US20150186649A1 (en) * 2013-12-31 2015-07-02 Cincinnati Bell, Inc. Function Fingerprinting

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220414113A1 (en) * 2021-06-29 2022-12-29 International Business Machines Corporation Managing extract, transform and load systems
US11841871B2 (en) * 2021-06-29 2023-12-12 International Business Machines Corporation Managing extract, transform and load systems

Also Published As

Publication number Publication date
US11056212B1 (en) 2021-07-06

Similar Documents

Publication Publication Date Title
CN101593253B (en) Method and device for judging malicious programs
US9043917B2 (en) Automatic signature generation for malicious PDF files
US20070152854A1 (en) Forgery detection using entropy modeling
CN101923617B (en) Cloud-based sample database dynamic maintaining method
US20110154495A1 (en) Malware identification and scanning
US9454658B2 (en) Malware detection using feature analysis
EP3447669B1 (en) Information leakage detection method and device, server, and computer-readable storage medium
US20080201779A1 (en) Automatic extraction of signatures for malware
CN103020521B (en) Wooden horse scan method and system
CN108256329B (en) Fine-grained RAT program detection method and system based on dynamic behavior and corresponding APT attack detection method
CN111869176B (en) System and method for malware signature generation
US10607010B2 (en) System and method using function length statistics to determine file similarity
CN111988341B (en) Data processing method, device, computer system and storage medium
US11056212B1 (en) Methods and systems for an integrated disassembler with a function-queue manager and a disassembly interrupter for rapid, efficient, and scalable code gene extraction and analysis
Vadrevu et al. Maxs: Scaling malware execution with sequential multi-hypothesis testing
Torres et al. Malicious PDF documents detection using machine learning techniques
CN108229168B (en) Heuristic detection method, system and storage medium for nested files
US20130179975A1 (en) Method for Extracting Digital Fingerprints of a Malicious Document File
EP3800570A1 (en) Methods and systems for genetic malware analysis and classification using code reuse patterns
Shekhawat et al. A review of malware classification methods using machine learning
US8918873B1 (en) Systems and methods for exonerating untrusted software components
CN114218561A (en) Weak password detection method, terminal equipment and storage medium
CN114448614A (en) Weak password detection method, device, system and storage medium
CN112651026A (en) Application version mining method and device with business safety problem
CN115622818B (en) Network attack data processing method and device

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

AS Assignment

Owner name: INTEZER LABS, LTD., ISRAEL

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:TEVET, ITAI;HALEVI, ROY;ABRAHAMY, JONATHAN;AND OTHERS;REEL/FRAME:051900/0216

Effective date: 20191225

STCF Information on status: patent grant

Free format text: PATENTED CASE

AS Assignment

Owner name: COMERICA BANK, MICHIGAN

Free format text: SECURITY INTEREST;ASSIGNOR:INTEZER LABS, LTD.;REEL/FRAME:056833/0884

Effective date: 20210712