US20170214704A1 - Method and device for feature extraction - Google Patents

Method and device for feature extraction

Publication number
US20170214704A1
Authority
US
United States
Prior art keywords
file
files
appearing
function
black
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US15/109,343
Other languages
English (en)
Inventor
Kang Yang
Zhuo Chen
Hai Tang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd
Assigned to QIZHI SOFTWARE (BEIJING) COMPANY LIMITED reassignment QIZHI SOFTWARE (BEIJING) COMPANY LIMITED ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: TANG, HAI
Assigned to BEIJING QIHOO TECHNOLOGY COMPANY LIMITED reassignment BEIJING QIHOO TECHNOLOGY COMPANY LIMITED ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: CHEN, Zhou, YANG, KANG
Assigned to BEIJING QIHOO TECHNOLOGY COMPANY LIMITED reassignment BEIJING QIHOO TECHNOLOGY COMPANY LIMITED ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: QIZHI SOFTWARE (BEIJING) COMPANY LIMITED

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S40/00 Systems for electrical power generation, transmission, distribution or end-user application management characterised by the use of communication or information technologies, or communication or information technology specific aspects supporting them
    • Y04S40/20 Information technology specific aspects, e.g. CAD, simulation, modelling, system security

Definitions

  • the present invention relates to the technical field of network security, and more specifically relates to a method and device for feature extraction.
  • smart terminals are provided with more and more functions.
  • mobile phones have turned from traditional GSM and TDMA digital mobile phones into smart phones that have capabilities of processing multimedia resources and providing various kinds of information services such as network browsing, telephone conference, electronic commerce, etc.
  • GSM: Global System for Mobile Communications
  • TDMA: Time Division Multiple Access
  • Smart mobile phone users suffer deeply from more and more mobile phone viruses.
  • Mobile phone malicious code protection technologies perform protection against malicious codes.
  • a variety of mobile phone malicious code protection approaches have been provided, for example, feature value scanning approach, virtual machine technology-based malicious code protection, heuristic scanning and similar samples clustering, etc.
  • an efficient scanning algorithm (also called a matching algorithm)
  • a reasonably organized malicious code feature library is the basis of such protection. Therefore, how to extract features accurately and efficiently is crucial to building a feature library and, indeed, to the entire protection technology.
  • a method and device for feature extraction according to the present invention is provided so as to overcome the above problems or at least partially solve the above problems.
  • a method for feature extraction comprising acquiring a batch of black sample files and white sample files from an application layer of a smart terminal operating system; parsing each file to obtain information structure of all functions contained in each file, and computing a check code of each function; determining whether each file contains functions corresponding to respective check codes so as to count times that each function appears in the black sample files and white sample files; extracting black sample features based on functions only appearing in the black sample files while not appearing in the white sample files, or extracting white sample features based on functions only appearing in the white sample files while not appearing in the black sample files.
  • a device for feature extraction comprising a file acquiring unit configured to acquire a batch of black sample files and white sample files from an application layer of a smart terminal operating system; a parsing unit configured to parse each file to obtain information structure of all functions contained in each file, and a check code computing unit configured to compute a check code of each function; a counting unit configured to determine whether each file contains functions corresponding to respective check codes so as to count times that each function appears in the black sample files and white sample files; an extracting unit configured to extract black sample features based on functions only appearing in the black sample files while not appearing in the white sample files, or extract white sample features based on functions only appearing in the white sample files while not appearing in the black sample files.
  • the embodiments of the present invention only use functions appearing in the black sample files while not appearing in the white sample files as the basis for feature extraction.
  • the fast and accurate feature extraction may guarantee building of an efficient feature library and guarantee implementation of the defending technologies.
  • the features may be optimized so as to detect most files with least features after acquiring a large amount of extractable black sample features.
  • FIG. 1 illustrates a flow diagram of a method for feature extraction according to one embodiment of the present invention
  • FIG. 2 illustrates a flow diagram of optimizing features in a method for feature extraction according to one embodiment of the present invention
  • FIG. 3 illustrates a schematic diagram of a device for feature extraction according to one embodiment of the present invention
  • FIG. 4 illustrates a block diagram of a smart electronic device for executing the method according to the present invention
  • FIG. 5 illustrates a schematic diagram of a storage unit for maintaining or carrying program codes that implement the method according to the present invention.
  • The Android operating system contains an application layer (app layer) and a system framework layer (framework layer); other layers that may be distinguished in terms of functional partitioning are not discussed here.
  • the app layer may be generally understood as an upper layer, in charge of interfaces for interaction with a user, e.g., application maintenance, identifying different kinds of click contents upon clicking onto a page so as to display different context menus, etc.
  • the framework layer is generally used as an intermediate layer, mainly for forwarding a user request (e.g., starting an application, clicking on a link, click to save a picture, and the like) to a lower layer; and distributing contents completely processed by the lower layer to the upper layer either via a message or via an intermediate proxy class, so as to present them to the user.
  • the inventors of the present invention have found in their research that, by counting the number of times the check code of a function contained in a sample file appears across files, it may be determined whether the function belongs to a black sample or a white sample.
  • FIG. 1 a flow diagram of a method for feature extraction according to one embodiment of the present invention is presented.
  • the method for feature extraction comprises steps of
  • the black sample files refer to files preliminarily determined as containing a black sample, e.g., a file containing malicious codes
  • the white sample files refer to files preliminarily determined as not containing a black sample, e.g., a file not containing malicious codes.
  • a feature library needs to be built during matching, detecting, and removing malicious codes, and building of the feature library is based on extracting features from sample files.
  • whether a batch of files are black sample files or white sample files is preliminarily determined manually in advance. More black sample files and white sample files will be beneficial for accurate extraction of sample features.
  • the black sample files or white sample files may be, for example, dex files.
  • Dex files refer to virtual machine executable files directly loaded and running in a Dalvik virtual machine (Dalvik VM) in Android system.
  • Dalvik is a Java virtual machine for an Android platform.
  • An optimized Dalvik allows concurrently running instances of multiple virtual machines in a limited internal memory, and each Dalvik application is executed as an independent Linux process. The independent process can prevent closing of all programs when the virtual machine breaks down.
  • the Dalvik virtual machine may support running of a Java application that has been converted into a dex (Dalvik Executable) format.
  • the dex format is a kind of compressed format specifically designed for Dalvik and is suitable for a system with limited memory and processor speed.
  • Java source codes may be converted into a dex file by ADT (Android Development Tools) through a complex compilation.
  • the dex file is an optimized result for an embedded system.
  • the Dalvik virtual machine does not employ standard Java virtual machine instruction codes, but uses its specific instruction set.
  • the dex file shares many class names and constant strings, so its volume is small and its operating efficiency is relatively high.
  • obtaining a batch of black sample dex files and white sample dex files from a smart terminal may comprise finding an installation package of an application from an application layer of a smart terminal operating system; parsing the installation package to obtain a dex file of the application; and using the dex executable file as a black sample file or a white sample file.
  • it can be obtained by parsing an APK (Android Package).
  • the APK file is actually a compressed package in zip format whose file extension has been changed to apk; a dex file may be obtained after decompressing it with an unzip tool.
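The decompression step described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: it assumes the bytecode entries sit at the archive root under names of the form classes*.dex, and the helper names `extract_dex_files` and `build_demo_apk` are hypothetical. The demo archive is a stand-in built in memory, not a real application package.

```python
import io
import zipfile


def extract_dex_files(apk_bytes):
    """Return {entry name: raw bytes} for every root-level classes*.dex entry."""
    dex_files = {}
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        for name in apk.namelist():
            # Assumption: bytecode is stored as classes.dex, classes2.dex, ...
            # at the root of the archive.
            if "/" not in name and name.startswith("classes") and name.endswith(".dex"):
                dex_files[name] = apk.read(name)
    return dex_files


def build_demo_apk():
    """Build a tiny in-memory zip that stands in for an APK (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("AndroidManifest.xml", b"<manifest/>")
        archive.writestr("classes.dex", b"dex\n035\x00demo")
    return buffer.getvalue()


demo_dex = extract_dex_files(build_demo_apk())
print(sorted(demo_dex))
```

Each returned byte string can then be treated as one black or white sample file, depending on how the package was labelled beforehand.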
  • the Android operating system comprises an application layer (app layer) and a system framework layer (framework layer).
  • the present invention focuses on study and improvement of the app layer.
  • The Dalvik VM monitors all programs (APK files) and frameworks and creates a dependency relationship tree for them. Through this dependency relationship tree, the Dalvik VM optimizes the code of each program and stores the optimized code in a Dalvik cache (dalvik-cache). In this way, all programs use optimized code when running.
  • when a program (or framework) changes, the Dalvik VM re-optimizes the code and stores it in the cache again.
  • the cache/dalvik-cache is for depositing dex files generated by programs on the system, while data/dalvik-cache is for depositing dex files generated by data/app.
  • the present invention focuses on analyzing and processing of dex files generated by data/app.
  • the theory and operation of the present invention is likewise applicable to dex files generated by programs on the system.
  • parsing a file to obtain information structure of all functions contained in the file comprises decompiling the dex file to obtain decompiled information structure of all functions contained in the dex file.
  • the dex file is decompiled in a plurality of manners.
  • Manner 1: parsing the dex file according to the dex file format to obtain a function information structure of each class; determining the position and size of each function of the dex file according to fields in the function information structure, to obtain a decompiled function information structure. Specifically, by parsing the function information structure, a bytecode array field indicating the position of a function in the dex file and a list length field indicating the size of that function are obtained, thereby determining the position and size of the function.
  • the dex file is parsed according to the dex file format to obtain the function information structure of each class.
  • the function information structure contains fields in Table 1.
  • handlers (type encoded_catch_handler_list): bytes representing a list of lists of catch types and their associated handler addresses (optional); each try_item has a byte-wise offset into this structure; the element is only present when tries_size is not 0.
  • the insns_size and insns fields in each function information structure represent the function size and position, respectively.
  • the information structure of the function may be decompiled according to the fields insns_size and insns.
  • the decompiled information structure is comprised of Dalvik VM bytes, which will be detailed later.
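A minimal sketch of how the insns_size and insns fields can be used to locate one function's bytecode is given below. It assumes the code_item layout of the published dex format (four 16-bit fields, then debug_info_off and insns_size as 32-bit fields, then the insns array of 16-bit code units, all little-endian). The function name `read_function_bytecode` and the hand-built demo buffer are hypothetical, and the offset of each code_item is assumed to be already known from parsing the class data.

```python
import struct

# Fixed part of a code_item: registers_size, ins_size, outs_size, tries_size
# (4 x uint16), then debug_info_off and insns_size (2 x uint32), little-endian.
CODE_ITEM_HEADER = struct.Struct("<4H2I")


def read_function_bytecode(dex, code_off):
    """Return the raw bytecode of the function whose code_item starts at code_off."""
    header = CODE_ITEM_HEADER.unpack_from(dex, code_off)
    insns_size = header[5]                      # size in 16-bit code units
    start = code_off + CODE_ITEM_HEADER.size    # position of the insns array
    end = start + insns_size * 2
    if end > len(dex):
        raise ValueError("code_item extends past the end of the file")
    return dex[start:end]


# Hypothetical buffer: 4 bytes of padding followed by one hand-built code_item
# whose only instruction is return-void (opcode 0x0e, one code unit).
demo_dex = b"\x00" * 4 + CODE_ITEM_HEADER.pack(1, 0, 0, 0, 0, 1) + b"\x0e\x00"
demo_bytecode = read_function_bytecode(demo_dex, 4)
print(demo_bytecode.hex())
```

The returned bytes are the input from which a check code of the function can later be computed.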
  • Manner 2: decompiling the dex file into a virtual machine byte code using a dex file decompilation tool.
  • the Dalvik virtual machine runs a Dalvik bytecode, which exists in a dex executable file form.
  • the Dalvik virtual machine executes codes by interpreting the dex file.
  • several tools can decompile a dex file into Dalvik assembly code; such dex file decompiling tools include baksmali, Dedexer 1.26, dexdump, dexinspecto 03-12-12r, IDA Pro, androguard, dex2jar, and 010 Editor, etc.
  • the function information structure comprises function execution codes, which, in the present embodiment, are formed by a virtual machine instruction sequence and a virtual machine mnemonic sequence.
  • the function information structure is formed by an instruction sequence of the Dalvik VM and a mnemonic sequence of the Dalvik VM.
  • a function information structure obtained by decompiling the dex file according to one embodiment of the present invention is specified below:
  • the dex file is decompiled into an instruction sequence of the Dalvik VM and a mnemonic sequence of the Dalvik VM.
  • the first 2 digits of each line in the machine code field denote an instruction sequence (the left circled part in the example above), while the part corresponding to the instruction sequence is a mnemonic (right side of the example, partially circled, not completely selected).
  • the mnemonic mainly serves to facilitate user communication and code writing.
  • the check code of the function may be computed. Later, the check code may be used to represent its corresponding unique function.
  • the check code of the function may be calculated using an existing or future algorithm. For example, a hash algorithm may be used to calculate the hash value of the function, which then serves as the aforementioned check code.
  • the hash algorithm has many kinds, e.g., CRC (Cyclic Redundancy Check), MD5 (Message Digest Algorithm), or SHA (Secure Hash Algorithm), etc.
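As an illustration of computing a check code with the algorithms listed above, the sketch below uses Python's standard hashlib and zlib modules. The function name `check_code` and the choice of a hexadecimal string as the check code format are assumptions made for the example; the text does not prescribe either.

```python
import hashlib
import zlib


def check_code(function_bytes, algorithm="md5"):
    """Return a hexadecimal check code for one function's information structure."""
    if algorithm == "crc32":
        # zlib.crc32 returns an unsigned 32-bit integer; format it as 8 hex digits.
        return format(zlib.crc32(function_bytes) & 0xFFFFFFFF, "08x")
    # Any algorithm name known to hashlib, e.g. "md5", "sha1", "sha256".
    return hashlib.new(algorithm, function_bytes).hexdigest()


sample_function = b"\x0e\x00"  # hypothetical bytecode of a trivial function
print(check_code(sample_function, "crc32"))
print(check_code(sample_function, "md5"))
print(check_code(sample_function, "sha1"))
```

A short code such as CRC32 is cheap but collides more easily than MD5 or SHA, which matters if the check code is meant to stand for a unique function.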
  • This step counts the number of times that each hash value appears in the batch of black sample files and white sample files obtained in step S 101 .
  • a hash value of each function is determined by analyzing and computing the black sample files and white sample files; then, the number of times that each hash value appears in the black sample files and in the white sample files is counted.
  • n sample files including a part of black sample files and a part of white sample files
  • the first file comprises function hash values A, B, C
  • the second file comprises function hash values A, C, D
  • the third file comprises function hash values B, C, E
  • . . . the nth file comprises hash values C, D. In total, after all files are analyzed, suppose 5 function hash values A, B, C, D, and E are determined. Then, the number of times each of the 5 hash values appears in the black sample files and in the white sample files is counted. Suppose the results after counting are as shown in Table 2 below.
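The counting step can be sketched as follows. The hash value sets of the four files are taken from the example above, but the black/white label given to each file is an assumption made only for this illustration, so the resulting counts are not those of Table 2. The function name `count_appearances` is hypothetical.

```python
from collections import Counter


def count_appearances(files):
    """Count, per hash value, how many black files and how many white files contain it.

    files: iterable of (label, hashes), where label is "black" or "white".
    """
    black_counts, white_counts = Counter(), Counter()
    for label, hashes in files:
        target = black_counts if label == "black" else white_counts
        target.update(set(hashes))  # a hash value counts at most once per file
    return black_counts, white_counts


# Hash sets from the example; the labels are hypothetical.
sample_files = [
    ("black", ["A", "B", "C"]),
    ("white", ["A", "C", "D"]),
    ("black", ["B", "C", "E"]),
    ("white", ["C", "D"]),
]
black_counts, white_counts = count_appearances(sample_files)
for value in "ABCDE":
    print(value, black_counts[value], white_counts[value])
```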
  • before counting the number of times that each function appears in the black sample files and the white sample files, the method further comprises de-duplicating the check codes of the functions within each file.
  • de-duplicating the check code of the function within the file means that, for each file, if a plurality of functions have a same check code, one function is extracted from the plurality of functions as the function corresponding to the check code. For example, suppose that for a dex file, the information structures of all functions contained therein are obtained by parsing it. Suppose that three information structures s1, s2, and s3 are parsed out; 3 hash values hash 1, hash 2, and hash 3 of the three information structures s1, s2, and s3 are then obtained through a hash algorithm.
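Intra-file de-duplication can be sketched as keeping one representative function per check code. The example above does not state which of the three hash values coincide, so the sketch assumes, purely for illustration, that s1 and s2 produce the same hash value. The function name `dedupe_functions` and the rule of keeping the first function encountered are likewise assumptions.

```python
def dedupe_functions(functions):
    """Keep one function per check code within a single file.

    functions: list of (function name, check code) pairs in file order.
    Returns {check code: name of the first function that produced it}.
    """
    unique = {}
    for name, code in functions:
        unique.setdefault(code, name)  # later duplicates are ignored
    return unique


# Hypothetical file: s1 and s2 are assumed to share one hash value.
parsed = [("s1", "hash_x"), ("s2", "hash_x"), ("s3", "hash_y")]
deduped = dedupe_functions(parsed)
print(deduped)
```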
  • functions B and E are selected for black sample feature extraction. Specifically, functions B, E are taken as black sample features, or part of codes of the functions B and E are taken as black sample features. Likewise, functions only appearing in the white sample files while not appearing in the black sample files are selected as white sample features.
  • function C is selected for performing white sample feature extraction. Specifically, function C may be used as a white sample feature or part of code of the function C is used as a white sample feature.
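The selection rule can be sketched as a simple filter over the two counts of each function. Table 2 is not reproduced in this text, so the counts below are hypothetical values chosen only to agree with the outcome stated above (B and E appear only in black sample files, C only in white sample files). The function name `extract_features` is an assumption.

```python
def extract_features(black_counts, white_counts):
    """Split functions into black-only and white-only candidates.

    black_counts / white_counts: {check code: number of files containing it}.
    """
    codes = set(black_counts) | set(white_counts)
    black_only = sorted(
        c for c in codes if black_counts.get(c, 0) > 0 and white_counts.get(c, 0) == 0
    )
    white_only = sorted(
        c for c in codes if white_counts.get(c, 0) > 0 and black_counts.get(c, 0) == 0
    )
    return black_only, white_only


# Hypothetical counts standing in for Table 2.
example_black = {"A": 3, "B": 5, "C": 0, "D": 2, "E": 3}
example_white = {"A": 2, "B": 0, "C": 4, "D": 1, "E": 0}
black_features, white_features = extract_features(example_black, example_white)
print(black_features, white_features)
```

Functions such as A and D, which appear in both kinds of files, are left out of both lists because they do not discriminate between black and white samples.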
  • after the black sample feature is extracted in step S 104 , the following steps may further be executed: adding the black sample feature to the black sample feature library; matching a target file against the black sample feature library; and, if the target file comprises a function or a subset of functions corresponding to the black sample feature, determining that malicious code exists in the target file.
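One simple way to picture the matching step is as a set intersection between the check codes found in the target file and the check codes held in the black sample feature library. This is only a sketch under that assumption: the text leaves the matching scheme open, and the names `match_target`, `black_library`, and the sample check codes are hypothetical.

```python
def match_target(target_codes, black_library):
    """Return the black sample features found among a target file's check codes."""
    return set(target_codes) & set(black_library)


black_library = {"hash_B", "hash_E"}                # hypothetical feature library
clean_target = ["hash_A", "hash_C"]                  # hypothetical target files
infected_target = ["hash_A", "hash_E", "hash_C"]

print(bool(match_target(clean_target, black_library)))
print(sorted(match_target(infected_target, black_library)))
```

A non-empty result corresponds to determining that malicious code exists in the target file.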
  • sample feature detecting and removing, virtual machine-based detecting and removing, heuristic detecting and removing, or similar samples clustering may be performed on the target files using the function corresponding to the black sample feature in the black sample feature library.
  • malware code and the malicious code protection schemes will be introduced.
  • the malicious code refers to a program or code that is disseminated via a storage medium or a network, destroys integrity of the operating system and steals undisclosed confidential information in the system without authorization.
  • a mobile phone malicious code refers to a malicious code against a portable device and a PDA.
  • the mobile phone malicious code may be simply divided into a replication-type malicious code and a non-replication-type malicious code, wherein the replication-type malicious code mainly contains a virus and a worm, while the non-replication-type malicious code mainly contains a Trojan horse, rogue software, a malicious mobile code, a rootkit program, etc.
  • a mobile phone malicious code protection technology performs protection against malicious code.
  • a virtual machine technology-based malicious code protection: this kind of protection scheme is mainly directed against polymorphic viruses and metamorphic viruses.
  • the virtual machine refers to a complete computer system simulated through software to have a complete hardware system function and run in a completely isolated environment.
  • This scheme is also referred to as a software simulation method, where a software analyzer simulates and analyzes program running using a software method. It essentially simulates a small closed program execution environment in the inner memory, and all files to be subject to virus detection and removal are executed virtually therein.
  • the feature value scanning technology is also used first, and only when finding that the target has a feature of encrypted malicious code, will the virtual machine module be started to make the encrypted code decoded autonomously.
  • the traditional feature value scanning manner may be employed to detect and remove the code. Another example is the heuristic detection and removal manner.
  • the heuristic detection and removal manner is mainly directed against constant mutation of malicious code for the purpose of enhancing the study on unknown malicious code.
  • the so-called “heuristic” is originated from artificial intelligence, which refers to “a capability of self-discovery” or “a knowledge or technique that exerts a certain manner or method to judge an object.”
  • heuristic detection and removal of malicious code means that the scanning software detects a virus by analyzing the structure of a program and its behavior using rules extracted empirically. This works because, to achieve its objectives of infection and damage, malicious code usually exhibits certain characteristic behaviors, such as reading and writing files in an unconventional manner, terminating itself, or entering ring 0 in an unconventional manner.
  • whether a program is malicious code may therefore be determined by scanning for specific behaviors or a combination of multiple behaviors.
  • similar samples clustering may be performed on a target program, e.g., clustering similar samples determined through analysis using a k-means clustering algorithm.
  • the matching algorithm is generally divided into a single-mode matching algorithm and a multi-mode matching algorithm.
  • the single-mode matching algorithm comprises a BF (Brute-Force) algorithm, a KMP (Knuth-Morris-Pratt) algorithm, a BM (Boyer-Moore) algorithm, and a QS (Quick Search) algorithm, etc.
  • the multi-mode matching algorithm contains a typical multi-mode matching DFSA algorithm and an ordered binary tree-based multi-mode matching algorithm. Additionally, the matching algorithm may be divided into a fuzzy matching algorithm and a similarity matching algorithm.
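To make one of the single-mode matching algorithms named above concrete, the following is a minimal sketch of KMP (Knuth-Morris-Pratt) search over byte strings. It is a textbook formulation offered for illustration only; the text does not tie the invention to this algorithm, and the function name `kmp_find` and the sample byte strings are hypothetical.

```python
def kmp_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    # failure[i]: length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    # Scan the text once, never moving backwards in it.
    k = 0
    for i, byte in enumerate(text):
        while k and byte != pattern[k]:
            k = failure[k - 1]
        if byte == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


print(kmp_find(b"abababca", b"abca"))
print(kmp_find(b"abababca", b"xyz"))
```

Unlike the brute-force (BF) algorithm, the failure table lets the search skip positions that cannot match, so the text is examined in a single pass.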
  • the present invention does not limit which malicious code protection solution is employed to detect a malicious code.
  • the sample feature detection and removal feature value scan
  • the virtual machine-based scan or heuristic detection and removal as introduced above
  • a similar sample clustering may also be performed.
  • the present application makes no limitation to the matching algorithm.
  • the fuzzy matching algorithm or similarity matching algorithm as introduced above may be employed.
  • the set of files in which function A is detected may contain the set of files in which function B is detected.
  • in this scenario, function A is preferably used as a feature, while function B is abandoned as a feature. This is because, once a considerable number of black sample features have been obtained, it is necessary to consider how to detect the most files with the fewest features.
  • the embodiments of the present invention achieve this objective through a feature optimization method.
  • the feature optimization method comprises, for different file sets with different features, if one file set contains all files in another file set, the feature corresponding to a file set with a larger scope will be reserved, while the feature corresponding to the file set with a smaller scope will be abandoned.
  • the files containing the first feature form a first file set
  • the files containing the second feature form a second file set
  • the first feature is reserved, while the second feature is abandoned.
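The containment rule can be sketched by comparing the file sets of the features pairwise. The file identifiers and the function name `prune_subsumed` are hypothetical, and the handling of two features with identical file sets (the earlier one is kept) is an assumption, since the text does not address that case.

```python
def prune_subsumed(feature_files):
    """Drop every feature whose file set is contained in another feature's file set.

    feature_files: {feature: set of files containing that feature}.
    """
    kept = {}
    for feature, files in feature_files.items():
        subsumed = any(
            other != feature
            and (files < other_files or (files == other_files and other in kept))
            for other, other_files in feature_files.items()
        )
        if not subsumed:
            kept[feature] = files
    return kept


# Hypothetical file sets: every file containing the second feature also
# contains the first feature.
file_sets = {
    "first": {"f1", "f2", "f3"},
    "second": {"f1", "f2"},
    "third": {"f4"},
}
print(sorted(prune_subsumed(file_sets)))
```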
  • FIG. 2 illustrates a flow diagram of optimizing features in a method for feature extraction according to one embodiment of the present invention.
  • feature optimization comprises steps of:
  • S 204 determining whether the set contains the compared vector; if the set contains the compared vector, performing S 205 ; if the set does not contain the compared vector, performing S 206 ;
  • An M-dimensional vector is generated for each extractable feature; the i-th component of the vector indicates whether the black sample file indexed by i can be detected with the feature.
  • the vector generated by feature A is 1:1, 2:0, 3:1, 4:1, 5:0, 6:0. This represents that the feature may detect three files indexed by 1, 3, 4.
  • vectors generated by features A, B, C, and D are specified below:
  • ABD 1:1, 2:1, 3:1, 4:1, 5:1, 6:1.
  • in this way, the shortest feature set capable of detecting the M files may be obtained.
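The comparison loop of steps S 204 to S 206 can be sketched with each vector represented as the set of file indices whose component is 1. The vector of feature A is the one given above; the vectors of B, C, and D are not reproduced in this text, so the values below are hypothetical and were chosen only so that the reserved features come out as A, B, and D. The function name `optimize_features` is an assumption.

```python
def optimize_features(vectors):
    """Sequentially merge feature vectors, reserving only features that add coverage.

    vectors: ordered {feature: set of file indices the feature can detect}.
    Returns (reserved features, set of all covered file indices).
    """
    covered = set()      # the set that is compared with each vector in turn
    reserved = []
    for feature, detected in vectors.items():
        if detected <= covered:
            continue                 # set already contains the vector: keep the set
        covered |= detected          # otherwise take the union of set and vector
        reserved.append(feature)
    return reserved, covered


vectors = {
    "A": {1, 3, 4},   # from the example: 1:1, 2:0, 3:1, 4:1, 5:0, 6:0
    "B": {2, 5},      # hypothetical
    "C": {1, 3},      # hypothetical
    "D": {6},         # hypothetical
}
reserved, covered = optimize_features(vectors)
print("".join(reserved), sorted(covered))
```

Because the vectors are compared sequentially, the outcome of this sketch depends on the order in which the features are visited.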
  • the embodiments of the present invention only use the functions appearing in the black sample files while not appearing in the white sample files as the basis for feature extraction.
  • the fast and accurate feature extraction may guarantee building of an efficient feature library and guarantee implementation of the protection technology.
  • the features may be optimized so as to detect most files with least features after acquiring a large amount of extractable black sample features.
  • the embodiments of the present invention further provide a device for feature extraction.
  • the device may be implemented by software, hardware or a combination of software and hardware.
  • the device may be a terminal device or a functional entity inside the device.
  • the device may be a functional module inside the mobile phone.
  • the device is running under Android operating system.
  • the feature extracting device comprises:
  • a file acquiring unit 301 configured to acquire a batch of black sample files and white sample files from an application layer of a smart terminal operating system
  • a parsing unit 302 configured to parse each file to obtain information structure of all functions contained in each file
  • a check code computing unit 303 configured to compute a check code of each function
  • a counting unit 304 configured to determine whether each file contains functions corresponding to respective check codes so as to count times that each function appears in the black sample files and white sample files;
  • an extracting unit 305 configured to extract black sample features based on functions only appearing in the black sample files while not appearing in the white sample files, or extract white sample features based on functions only appearing in the white sample files while not appearing in the black sample files.
  • the device further comprises a feature optimization unit 306 configured to, for different file sets with different features, if one file set contains all files in another file set, reserve the feature corresponding to the file set with the larger scope, while abandoning the feature corresponding to the file set with the smaller scope.
  • the feature optimization unit 306 reserves a first feature corresponding to the first file set, while abandoning a second feature corresponding to the second file set.
  • the device further comprises a feature optimization unit 306 configured to establish a vector for each feature with respect to all files; initialize a set to be compared sequentially with the vector of each feature; if the set contains the compared vector, reserve the set; if the set does not contain the compared vector, get a union of the set and the compared vector; sequentially compare the vectors of all features, and take the features contained in the finally obtained set as the last reserved features.
  • the device further comprises: an inner de-duplicating unit 307 configured to perform intra-file de-duplication to a check code of a function.
  • the inner de-duplicating unit 307 is specifically configured to, for each file, if a plurality of functions have a same check code, extract a function from the plurality of functions as a function corresponding to the check code.
  • the black sample files and the white sample files are all virtual machine executable files; the parsing unit 302 is specifically configured to decompile the virtual machine executable file to obtain a decompiled information structure of all functions contained in the virtual machine executable file.
  • the check code computing unit 303 is specifically configured to compute a hash value of the information structure of the function to use the hash value as the check code of the function.
  • the parsing unit 302 is further configured to parse the virtual machine executable file according to format of the virtual machine executable file to obtain the function information structure of each class; determine a position and size of each function of the virtual machine executable file according to fields in the function information structure, and obtain the decompiled function information structure of each function.
  • the parsing unit 302 is further configured to parse the function information structure to obtain a bytecode array field indicating the function position of the virtual machine executable file and a list length field indicating the function size of the virtual machine executable file; and determine a position and size of the function of the virtual machine executable file based on the bytecode array field and the list length field.
  • the parsing unit 302 is specifically configured to decompile the virtual machine executable file into a virtual machine bytecode using a virtual machine executable file decompilation tool.
  • the extracting unit 305 is configured to take a function that only appears in the black sample file while not appearing in the white sample file as the black sample feature, or take a part of code of the function that only appears in the black sample file while not appearing in the white sample file as the black sample feature; or,
  • the device further comprises: a feature library adding unit 308 configured to add a black sample feature into the black sample feature library, and a matching unit 309 configured to match a target file using the black sample feature library; if the target file contains a function or a subset of functions corresponding to the black sample feature, determine that malicious code exists in the target file.
  • the matching unit may specifically perform sample feature detection and removal, virtual-machine based detection and removal, heuristic detection and removal, and/or similar-sample clustering on the target file using the function corresponding to the black sample feature in the black sample feature library.
  • the black sample file refers to a file preliminarily determined as containing a black sample
  • the white sample file refers to a file preliminarily determined as not containing a black file
  • the file extracting unit 301 is specifically configured to find an installation package of an application from an application layer of a smart terminal operating system; parse the installation package to obtain a virtual machine executable file of the application; and take the virtual machine executable file as a black sample file or a white sample file.
  • modules in a device in an embodiment may be adapted and provided in one or more devices different from the embodiment.
  • Modules or units or components in an embodiment may be combined into one module or unit or assembly; besides, they may also be divided into a plurality of sub-modules or sub-units or sub-assemblies. Except that at least some of such features and/or processes or units are mutually exclusive, any combination may be employed to combine all features disclosed in the specification (including the appended claims, abstract and drawings) and all processes or units of any method or device such disclosed.
  • Various component embodiments of the present invention may be implemented by hardware or by software modules running on one or more processors, or implemented by their combination.
  • a microprocessor or a digital signal processor (DSP) may be used to implement some or all functions of some or all components of the device for feature extraction according to the embodiments of the present invention.
  • the present invention may also be implemented as a device or device program (e.g., a computer program and a computer program product) for implementing a part or all of the method described here.
  • Such a program for implementing the present invention may be stored on a computer readable medium, or may have a form of one or more signals. Such signals may be downloaded from an Internet website, or provided on a carrier signal, or provided in any other form.
  • FIG. 4 illustrates a smart electronic device for executing the method for feature extraction according to the present invention.
  • the smart electronic device conventionally comprises a processor 410 and a computer program product or a computer readable medium in a form of memory 420.
  • the memory 420 may be an electronic storage such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk or a ROM.
  • the memory 420 has a storage space 430 with program codes 431 for executing any method steps in the method.
  • the storage space 430 for program code may contain various program codes 431 for implementing respective steps in the methods above, respectively.
  • These program codes may be read out from one or more computer program products or written into one or more such computer program products.
  • Such computer program products contain program code carriers such as a hard disk, a compact disk (CD), a memory card or a floppy disk and the like.
  • Such computer program product is generally a portable or fixed storage unit as depicted with reference to FIG. 5 .
  • the storage unit may have a storage segment, a storage space and the like, in a similar arrangement to the memory 420 in the smart electronic device of FIG. 4 .
  • the program code may, for example, be compressed in any appropriate form.
  • the storage unit contains a computer readable code 431′, i.e., codes that may be read by a processor such as the processor 410. These codes, when executed by the smart electronic device, cause the smart electronic device to execute various steps of the methods depicted above.
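
The check code computing, inner de-duplicating, counting, extracting, feature optimization and matching units described above can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that each file is already parsed into a list of function bodies given as bytes, that the check code is a SHA-256 digest (the text only says "a hash value"), and that a function is counted once per file after intra-file de-duplication. All function and variable names are hypothetical.

```python
import hashlib
from collections import Counter


def check_code(function_body: bytes) -> str:
    """Check code of one function; SHA-256 is an assumed choice of hash."""
    return hashlib.sha256(function_body).hexdigest()


def file_check_codes(functions: list[bytes]) -> set[str]:
    """Intra-file de-duplication: functions with the same check code collapse to one entry."""
    return {check_code(body) for body in functions}


def count_appearances(files: dict[str, list[bytes]]) -> Counter:
    """For each check code, count how many files of the sample set contain it."""
    counts: Counter = Counter()
    for functions in files.values():
        counts.update(file_check_codes(functions))
    return counts


def extract_black_features(black: dict[str, list[bytes]],
                           white: dict[str, list[bytes]]) -> set[str]:
    """Keep check codes that appear in black sample files and in no white sample file."""
    black_counts = count_appearances(black)
    white_counts = count_appearances(white)
    return {code for code in black_counts if white_counts[code] == 0}


def optimize_features(features: set[str],
                      black: dict[str, list[bytes]]) -> set[str]:
    """Abandon a feature whose file set is contained in the file set of a reserved feature."""
    codes_per_file = {name: file_check_codes(funcs) for name, funcs in black.items()}
    coverage = {
        feature: frozenset(name for name, codes in codes_per_file.items() if feature in codes)
        for feature in features
    }
    kept: list[str] = []
    # Visit features from largest to smallest file set so that any containing
    # file set is considered before the file sets it contains.
    for feature in sorted(features, key=lambda f: (-len(coverage[f]), f)):
        if not any(coverage[feature] <= coverage[other] for other in kept):
            kept.append(feature)
    return set(kept)


def contains_black_feature(target_functions: list[bytes],
                           feature_library: set[str]) -> bool:
    """Match a target file: true if any of its functions has a check code in the library."""
    return not feature_library.isdisjoint(file_check_codes(target_functions))
```

In this sketch two features with identical file sets are treated as redundant and only one is reserved; the text does not state how such ties are handled, so that tie-break is an assumption.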

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Virology (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Storage Device Security (AREA)
  • Stored Programmes (AREA)
US15/109,343 2013-12-30 2014-08-07 Method and device for feature extraction Abandoned US20170214704A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201310746033.6A CN103761476B (zh) 2013-12-30 2013-12-30 Method and device for feature extraction
CN201310746033.6 2013-12-30
PCT/CN2014/083910 WO2015101044A1 (zh) 2013-12-30 2014-08-07 Method and device for feature extraction

Publications (1)

Publication Number Publication Date
US20170214704A1 (en) 2017-07-27

Family

ID=50528712

Family Applications (2)

Application Number Title Priority Date Filing Date
US15/109,343 Abandoned US20170214704A1 (en) 2013-12-30 2014-08-07 Method and device for feature extraction
US15/109,409 Active 2035-04-13 US10277617B2 (en) 2013-12-30 2014-10-31 Method and device for feature extraction

Family Applications After (1)

Application Number Title Priority Date Filing Date
US15/109,409 Active 2035-04-13 US10277617B2 (en) 2013-12-30 2014-10-31 Method and device for feature extraction

Country Status (3)

Country Link
US (2) US20170214704A1 (zh)
CN (1) CN103761476B (zh)
WO (2) WO2015101044A1 (zh)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902910B (zh) * 2013-12-30 2016-07-13 北京奇虎科技有限公司 Method and device for detecting malicious code in a smart terminal
CN103761476B (zh) * 2013-12-30 2016-11-09 北京奇虎科技有限公司 Method and device for feature extraction
CN103761475B (zh) * 2013-12-30 2017-04-26 北京奇虎科技有限公司 Method and device for detecting malicious code in a smart terminal
CN105574408B (zh) * 2014-10-11 2018-04-17 安一恒通(北京)科技有限公司 Feature acquisition method for file virus detection and file virus detection method
EP3242240B1 (en) * 2015-02-04 2018-11-21 Nippon Telegraph and Telephone Corporation Malicious communication pattern extraction device, malicious communication pattern extraction system, malicious communication pattern extraction method and malicious communication pattern extraction program
CN105743877A (zh) * 2015-11-02 2016-07-06 哈尔滨安天科技股份有限公司 Method and system for processing network security threat intelligence
CN106909839B (zh) * 2015-12-22 2020-04-17 北京奇虎科技有限公司 Method and device for extracting sample code features
CN106682507B (zh) 2016-05-19 2019-05-14 腾讯科技(深圳)有限公司 Method and apparatus for acquiring a virus library, device, server, and system
CN105897923B (zh) * 2016-05-31 2019-04-30 中国科学院信息工程研究所 Method for identifying network traffic of an app installation package
CN105975854B (zh) * 2016-06-20 2019-06-28 武汉绿色网络信息服务有限责任公司 Method and device for detecting malicious files
CN106127044A (zh) * 2016-06-20 2016-11-16 武汉绿色网络信息服务有限责任公司 Method and device for detecting the degree of maliciousness of a function
CN106548069B (zh) * 2016-07-18 2020-04-24 北京安天网络安全技术有限公司 Feature extraction system and method based on a sorting algorithm
US10607010B2 (en) * 2016-09-30 2020-03-31 AVAST Software s.r.o. System and method using function length statistics to determine file similarity
CN111368296A (zh) * 2019-06-27 2020-07-03 北京关键科技股份有限公司 Method for analyzing the matching rate of source code files
CN112580026B (zh) * 2019-09-27 2024-02-20 奇安信科技集团股份有限公司 Network system, and terminal virus detection and removal method and device
US11068595B1 (en) * 2019-11-04 2021-07-20 Trend Micro Incorporated Generation of file digests for cybersecurity applications
CN110955895B (zh) * 2019-11-29 2022-03-29 珠海豹趣科技有限公司 Operation interception method, device, and computer-readable storage medium
CN112818348B (zh) * 2021-02-24 2023-09-08 北京安信天行科技有限公司 Method and system for identifying and detecting ransomware files
CN113536310A (zh) * 2021-07-08 2021-10-22 浙江网商银行股份有限公司 Code file processing method, verification method, device, and electronic equipment
US11537463B1 (en) 2021-07-16 2022-12-27 Seagate Technology Llc Data integrity verification optimized at unit level
CN116088888B (zh) * 2022-07-22 2023-10-31 荣耀终端有限公司 Application update method and related device

Citations (3)

Publication number Priority date Publication date Assignee Title
US20070240217A1 (en) * 2006-04-06 2007-10-11 George Tuvell Malware Modeling Detection System And Method for Mobile Platforms
US20110145920A1 (en) * 2008-10-21 2011-06-16 Lookout, Inc System and method for adverse mobile application identification
US20130067577A1 (en) * 2011-09-14 2013-03-14 F-Secure Corporation Malware scanning

Family Cites Families (12)

Publication number Priority date Publication date Assignee Title
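JP4736771B2 (ja) 2005-12-09 2011-07-27 ソニー株式会社 Sound effect generation device, sound effect generation method, and computer program
US8434151B1 (en) * 2008-01-04 2013-04-30 International Business Machines Corporation Detecting malicious software
CN101604364B (zh) * 2009-07-10 2012-08-15 珠海金山软件有限公司 Computer malicious program classification system and classification method based on file instruction sequences
CN102054149B (zh) * 2009-11-06 2013-02-13 中国科学院研究生院 Method for extracting behavioral features of malicious code
CN101788915A (zh) * 2010-02-05 2010-07-28 北京工业大学 Whitelist updating method based on a trusted process tree
CN101923617B (zh) 2010-08-18 2013-03-20 北京奇虎科技有限公司 Cloud-based method for dynamically maintaining a sample database
CN102567661B (zh) * 2010-12-31 2014-03-26 北京奇虎科技有限公司 Program identification method and device based on machine learning
CN102663077B (zh) * 2012-03-31 2014-03-12 福建师范大学 Method for ranking the security of Web search results based on the HITS algorithm
CN103383720B (zh) 2012-05-03 2016-03-09 北京金山安全软件有限公司 Method and device for identifying loop logic in API logs
CN102708186A (zh) * 2012-05-11 2012-10-03 上海交通大学 Method for identifying phishing websites
CN103440459B (zh) * 2013-09-25 2016-04-06 西安交通大学 Android malicious code detection method based on function calls
CN103761476B (zh) * 2013-12-30 2016-11-09 北京奇虎科技有限公司 Method and device for feature extraction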

Also Published As

Publication number Publication date
US10277617B2 (en) 2019-04-30
CN103761476A (zh) 2014-04-30
CN103761476B (zh) 2016-11-09
US20160335437A1 (en) 2016-11-17
WO2015101044A1 (zh) 2015-07-09
WO2015101097A1 (zh) 2015-07-09

Similar Documents

Publication Publication Date Title
US20170214704A1 (en) Method and device for feature extraction
US10114946B2 (en) Method and device for detecting malicious code in an intelligent terminal
Xu et al. Spain: security patch analysis for binaries towards understanding the pain and pills
CN108763928B (zh) Open-source software vulnerability analysis method, device, and storage medium
Cozzi et al. The tangled genealogy of IoT malware
Alrabaee et al. Fossil: a resilient and efficient system for identifying foss functions in malware binaries
US9348998B2 (en) System and methods for detecting harmful files of different formats in virtual environments
Bao et al. BYTEWEIGHT: Learning to recognize functions in binary code
Carmony et al. Extract Me If You Can: Abusing PDF Parsers in Malware Detectors.
US9525706B2 (en) Apparatus and method for diagnosing malicious applications
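WO2015101042A1 (zh) Method and device for detecting malicious code in a smart terminal
KR20170068814A (ko) Apparatus and method for detecting malicious mobile apps
WO2015101043A1 (zh) Method and device for detecting malicious code in a smart terminal
CN102867144B (zh) Method and device for detecting and removing computer viruses
CN110023938A (zh) System and method using function length statistics to determine file similarity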
US11916937B2 (en) System and method for information gain for malware detection
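CN113987517A (zh) IoT-firmware-based vulnerability mining method, apparatus, device, and storage medium
CN107085684B (zh) Method and device for detecting program features
CN106709350A (zh) Virus detection method and device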
Feichtner et al. Obfuscation-resilient code recognition in Android apps
US20240054210A1 (en) Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program
KR102415494B1 (ko) Emulation-based method for checking and verifying embedded device vulnerabilities
CN106909839B (zh) Method and device for extracting sample code features
Cam et al. Detect repackaged android applications by using representative graphs
US20240054215A1 (en) Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program

Legal Events

Date Code Title Description
AS Assignment

Owner name: BEIJING QIHOO TECHNOLOGY COMPANY LIMITED, CHINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:YANG, KANG;CHEN, ZHOU;REEL/FRAME:039140/0902

Effective date: 20160629

Owner name: QIZHI SOFTWARE (BEIJING) COMPANY LIMITED, CHINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:TANG, HAI;REEL/FRAME:039140/0832

Effective date: 20160629

AS Assignment

Owner name: BEIJING QIHOO TECHNOLOGY COMPANY LIMITED, CHINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:QIZHI SOFTWARE (BEIJING) COMPANY LIMITED;REEL/FRAME:039157/0382

Effective date: 20160629

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION