CN106803040B - Virus characteristic code processing method and device - Google Patents

Publication number
CN106803040B
Authority
CN
China
Prior art keywords
code
code block
function
application program
program interface
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710035588.8A
Other languages
Chinese (zh)
Other versions
CN106803040A (en)
Inventor
罗元海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201710035588.8A priority Critical patent/CN106803040B/en
Publication of CN106803040A publication Critical patent/CN106803040A/en
Application granted granted Critical
Publication of CN106803040B publication Critical patent/CN106803040B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/564 Static detection by virus signature recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Storage Device Security (AREA)

Abstract

The invention discloses a virus characteristic code processing method and device. The method comprises the following steps: disassembling a malicious sample carrying a virus, and segmenting the resulting disassembled code to obtain a plurality of code blocks of the malicious sample; traversing each code block to obtain the function calls executed in it, comparing the target path of each function call with the paths of application program interface functions, and determining the application program interface functions called in the code block and the number of times each is called; constructing a corresponding code block feature based on the application program interface functions called in the code block and the number of times each is called; and combining the code block features of the code blocks of the malicious sample to form a virus characteristic code of the malicious sample. Implementing the invention can improve the broad-spectrum coverage and the timeliness of virus characteristic codes.

Description

Virus characteristic code processing method and device
Technical Field
The present invention relates to security technologies, and in particular, to a method and an apparatus for processing virus signatures.
Background
A computer virus, also simply called a virus, is malicious code that its author implants in a terminal (any computing terminal such as a smartphone, a computer, or a server) to disrupt the terminal's functions or destroy its data.
A virus usually runs in the terminal as an independent application (for example, a packed one) that deceives the user in order to achieve its malicious purpose, or is embedded in a repackaged legitimate application and achieves its malicious purpose while that application runs.
When an existing signature-based antivirus engine scans for viruses, it matches the sample to be detected against the feature code of the virus: the hash value of the sample is matched against the hash value in the feature code, and the byte count of the sample (that is, the size of the sample in bytes) is matched against the file byte count in the feature code.
In practice, however, the feature code is easily invalidated for the following two reasons, which limits the broad-spectrum coverage of feature-code-based virus detection:
on the one hand, a virus author can change the hash value and the file byte count of a virus by slightly modifying its source code, so that a feature code that could originally detect the virus becomes invalid; the feature code must then be updated continually, and detection lags behind the virus;
on the other hand, most compilers apply optimizations such as instruction reordering and register reallocation, so the binary contents of target files compiled from the same source code may differ, and detection based on the byte count in the feature code can produce missed detections or false detections.
It can be seen that the feature code provided by the related art is extremely sensitive to changes in the virus, lacks broad-spectrum detection capability, and lags in detecting new viruses.
Disclosure of Invention
The embodiment of the invention provides a virus characteristic code processing method and device, which can improve the broad spectrum and timeliness of virus characteristic codes.
The technical scheme of the embodiment of the invention is realized as follows:
In a first aspect, an embodiment of the present invention provides a method for processing a virus signature, including:
disassembling a malicious sample carrying viruses to obtain a disassembled code, determining multiple levels of paths in a code tree of the disassembled code, and dividing codes under each level of paths into a single code block to obtain multiple code blocks corresponding to the multiple levels of paths in the malicious sample;
traversing the code block to obtain a function call executed in the code block, comparing a target path of the function call with a path of an application program interface function, and determining the application program interface function called in the code block and used for representing semantic characteristics of the malicious sample for realizing a malicious purpose, and the number of times of calling the application program interface function;
forming feature elements based on the identification of each application program interface function called in the code block and the calling times of the corresponding application program interface function in the code block, and constructing corresponding code block features based on the feature elements corresponding to the application program interface functions called in the code block;
combining the code block characteristics of each code block of the malicious sample to form a virus characteristic code of the malicious sample;
calculating the feature code of the sample to be detected, comparing the feature code of the sample to be detected with the virus feature code to obtain the similarity of the feature code, and judging whether the sample to be detected carries the virus or not based on the similarity.
In a second aspect, an embodiment of the present invention provides a virus signature processing apparatus, including:
a disassembly and segmentation unit, configured to disassemble a malicious sample carrying viruses to obtain a disassembled code, determine multiple levels of paths in a code tree of the disassembled code, and divide codes under each level of paths into a single code block to obtain multiple code blocks corresponding to the multiple levels of paths in the malicious sample;
a function calling unit, configured to traverse the code block to obtain function calls executed in the code block, compare a target path of the function calls with a path of an application program interface function, and determine the application program interface function called in the code block and used for representing semantic characteristics of the malicious sample for realizing malicious purposes, and the number of times of calling the application program interface function;
a feature construction unit, configured to form feature elements based on an identifier of each application program interface function called in the code block and the number of times that the corresponding application program interface function is called in the code block, and construct corresponding code block features based on the feature elements corresponding to each application program interface function called in the code block;
a feature merging unit, configured to merge code block features of each code block of the malicious sample to form a virus feature code of the malicious sample;
and a sample detection unit, configured to calculate the feature code of a sample to be detected, compare the feature code of the sample to be detected with the virus feature code to obtain the similarity of the feature codes, and judge whether the sample to be detected carries the virus based on the similarity.
In a third aspect, an embodiment of the present invention provides a virus signature processing apparatus, including a memory and a processor, where the memory stores executable instructions for causing the processor to execute the virus signature processing method provided in the embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium, which stores executable instructions for causing a processor to execute the virus signature processing method provided in the embodiment of the present invention.
The embodiment of the invention has the following beneficial effects:
the processing can be completed efficiently by relying on the computing power of a device (e.g., a terminal or a server); meanwhile, the feature code is constructed from the features of the API function calls of the malicious sample. Compared with the hash value of the malicious sample used in the related art, the features of the API function calls accurately reflect the semantic characteristics of the malicious sample when it achieves its malicious purpose and are not affected by changes in the hash value or byte count of the malicious sample, so broad-spectrum detection of viruses can be realized; in addition, because the API calls in a malicious sample are relatively stable, a feature code constructed from the features of the API function calls can detect a virus after it has evolved, which avoids the lag in virus detection of the feature code provided by the related art.
Drawings
Fig. 1 is a schematic diagram of an alternative process for extracting a virus signature and detecting whether a sample carries a virus based on the virus signature according to an embodiment of the present invention;
Fig. 2 is an alternative processing diagram of a virus signature processing method according to an embodiment of the present invention;
Fig. 3 is a schematic flow chart of an alternative virus signature processing method according to an embodiment of the present invention;
Fig. 4 is an alternative schematic diagram of a virus signature processing apparatus deployed in a network-side server according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of an alternative software and hardware structure of the virus signature processing apparatus 10 according to the embodiment of the present invention;
Fig. 6 is a schematic flow chart of another alternative feature code processing method according to an embodiment of the present invention;
Fig. 7-1 is a schematic diagram of an alternative flow for extracting and storing API functions provided by an operating system to an API function library according to an embodiment of the present invention;
Fig. 7-2 is an alternative flow chart illustrating the calculation of virus signatures of viruses carried in a malicious sample library according to an embodiment of the present invention;
Fig. 7-3 is a schematic view of an alternative process for detecting whether a sample to be detected carries a virus according to an embodiment of the present invention;
Fig. 8 is an alternative diagram of disassembling an executable file according to an embodiment of the present invention;
Fig. 9-1 is an alternative diagram of splitting disassembled code into code blocks based on a code tree according to an embodiment of the present invention;
Fig. 9-2 is an alternative diagram of splitting disassembled code into code blocks based on a code tree according to an embodiment of the present invention;
Fig. 9-3 is an alternative diagram of splitting disassembled code into code blocks based on a code tree according to an embodiment of the present invention;
Fig. 10 is a schematic diagram of an alternative process for extracting and storing API functions provided by an operating system to an API function library according to an embodiment of the present invention;
Fig. 11 is a schematic diagram of an alternative process for calculating similarity of feature codes according to an embodiment of the present invention;
Fig. 12 is a schematic diagram of an alternative structure of the feature code processing apparatus 20 according to the embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the examples provided herein are merely illustrative of the present invention and are not intended to limit it. In addition, the following embodiments are some, not all, of the embodiments for implementing the invention; technical solutions that those skilled in the art obtain by recombining the following embodiments without creative effort, and other embodiments based on the invention, all fall within the protection scope of the invention.
It should be noted that, in the embodiments of the present invention, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, so that a method or apparatus including a series of elements includes not only the explicitly recited elements but also other elements not explicitly listed or inherent to the method or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other related elements (e.g., steps in a method or units in an apparatus) in the method or apparatus that comprises the element.
For example, the virus signature processing method provided in the embodiment of the present invention includes a series of steps but is not limited to the described steps; similarly, the virus signature processing apparatus provided in the embodiment of the present invention includes a series of units but is not limited to the explicitly described units, and may also include units required to acquire related information or to perform processing based on that information.
Before the present invention is described in further detail, the terms and expressions used in the embodiments of the present invention are explained; the following explanations apply to them.
1) Virus: also called a computer virus or malicious code; binary code that its author implants in a terminal (e.g., any computing terminal such as a smartphone, a tablet computer, a laptop computer, or a desktop computer) for a malicious purpose such as disrupting the terminal's functions, destroying data, or stealing data.
2) Sample: a general term for applications of all types, such as Microsoft Windows applications, Unix applications, iOS applications, and Android applications.
3) Malicious sample: a sample that carries a virus.
4) Normal sample: a sample that does not carry a virus.
5) Code block: a unit formed by dividing the disassembled code of an application at a certain granularity.
6) Application Programming Interface (API): a programming interface, implemented as API functions in various programming languages, that an operating system or a library provides to application programs so that they can use various services (or functions); it helps an application program open a window, draw graphics, use terminal functions (such as taking photos and positioning), and so on.
7) Function: a subprogram that implements a fixed operation and has an entry and an exit. The entry is the set of parameters the function takes, through which parameter values are passed into the subprogram for processing; the exit is the function value, which is returned to the caller once it has been computed.
8) Code block feature, also referred to herein simply as a feature: a digitized feature generated by encoding (e.g., with a hash algorithm or the BASE64 algorithm) the behavior of a code block in calling API functions.
9) Feature code: in its basic form, the set of features of the code blocks of a sample; the feature code may further include overall features of the sample, such as its byte count (i.e., the storage space the sample occupies).
When viruses are detected based on the virus feature codes provided by the related art, the feature code of the sample to be detected is matched against the virus feature codes. For example, the hash value of the sample (e.g., of the application file itself) is matched against the hash value in the feature code, and the byte count of the sample (i.e., the size of the sample in bytes) is matched against the file byte count in the feature code. Virus feature codes provided by the related art typically take one of the following formats:
Format 1: hash string (HashString); number of file bytes (FileSize); malicious software name (MalwareName)
An example of a signature using format 1 is:
507d8f868c27feb88b18e6f8426adf1c;12391;Win.Exploit.CVE_2013_3163
Format 2: malicious software name (MalwareName) = hexadecimal signature (HexSignature)
An example of a signature using format 2 is:
Trojan.URLspoof.gen(Clam)=2e687265663d756e6573636170652827*3a2f2f*
It can be seen that such a virus signature is very sensitive to changes in the sample: as soon as a malicious sample changes slightly, its hash value and byte count change, and a virus signature that could originally detect the virus carried in the malicious sample fails. This limits the broad-spectrum coverage of signature-based detection and causes detection of new viruses to lag.
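The sensitivity described above can be illustrated with a minimal Python sketch. It assumes the hash in format 1 is an MD5 digest (the 32-character example is consistent with that, but the text does not name the algorithm); the function names, sample bytes, and malware name are hypothetical.

```python
import hashlib

def related_art_signature(sample: bytes, malware_name: str) -> str:
    """Build a format 1 style signature: HashString;FileSize;MalwareName.

    MD5 is an assumption made for illustration only.
    """
    return f"{hashlib.md5(sample).hexdigest()};{len(sample)};{malware_name}"

def matches(sample: bytes, signature: str) -> bool:
    """Match a sample against a format 1 signature by hash and byte count."""
    hash_string, file_size, _name = signature.split(";")
    return (hashlib.md5(sample).hexdigest() == hash_string
            and len(sample) == int(file_size))

# Toy bytes standing in for a malicious sample (hypothetical data).
original = b"\x55\x8b\xec\x90\x90\xc3"
# The "same" sample after a trivial change: one padding byte appended.
variant = original + b"\x90"

signature = related_art_signature(original, "Toy.Malware.A")
print(matches(original, signature))  # True
print(matches(variant, signature))   # False: hash and byte count both changed
```

A single appended byte is enough to defeat the match, which is the failure mode the API-call-based feature code is meant to avoid.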
To address the lack of semantic analysis of the malicious sample when the related art extracts a virus feature code, the embodiment of the invention provides a scheme that constructs the feature code from the features of the API calls made by the malicious sample and uses it to detect viruses. The behavior of a code block in calling API functions is largely unaffected by compiler optimization strategies and by a virus author's modifications to the source code, which improves the broad-spectrum coverage of the feature code, avoids lag in virus detection, and improves the efficiency and precision of virus detection.
Referring to fig. 1, fig. 1 is an optional processing schematic diagram, provided in the embodiment of the present invention, for extracting a virus feature code and detecting whether a sample carries a virus based on the virus feature code. It involves three parts, namely API function library generation, virus feature library generation, and sample detection, which are described below.
1) API function library generation: the API functions provided by the operating system of the terminal (i.e., API functions integrated in libraries of the operating system, referred to as library functions) and the third-party API functions in the terminal (e.g., API functions of third-party libraries embedded in the operating system, referred to as third-party library functions, or API functions provided by applications installed in the terminal) are detected, and the extracted API functions are stored in the API function library. For example, each API function is stored in the API function library as a two-tuple of the form <encoding result (e.g., hash encoding or BASE64 encoding) of the path of the API function in the terminal, identifier of the API function>.
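The two-tuple storage described in part 1) can be sketched as a dictionary keyed by the encoded path. This is a minimal illustration, not the patented implementation: SHA-256 as the hash encoding, sequential integers as identifiers, and the example paths are all assumptions.

```python
import hashlib

def encode_path(path: str) -> str:
    """Encode an API function path; SHA-256 is an illustrative choice of hash."""
    return hashlib.sha256(path.encode("utf-8")).hexdigest()

def build_api_library(api_paths):
    """Store each API function as <encoded path, identifier>.

    Identifiers are assigned as sequential integers (an assumption).
    """
    library = {}
    for path in api_paths:
        key = encode_path(path)
        if key not in library:          # skip duplicate paths
            library[key] = len(library)
    return library

def lookup(library, call_target: str):
    """Return the API identifier for a call target, or None if it is not an API."""
    return library.get(encode_path(call_target))

# Hypothetical API paths used only for illustration.
api_library = build_api_library([
    "android/telephony/SmsManager.sendTextMessage",
    "java/net/URL.openConnection",
])
print(lookup(api_library, "java/net/URL.openConnection"))   # 1
print(lookup(api_library, "com/toy/app/Util.helper"))       # None
```

Keying on the encoded path lets a later step test whether a function call targets a known API with a single dictionary lookup.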
2) Feature library generation: the target path of each function call in each code block of a known malicious sample is compared with the paths of the API functions in the API function library to detect the features of the API function calls of the code block (i.e., the code block features), including the identifiers of the API functions called in the code block and the number of times each is called in the code block.
The code block features are stored as a sequence of the form { <identifier of the called API function, number of calls of the API function> …… }; the code block features of the malicious sample are combined to form a virus feature code, which is stored in the virus feature library.
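As a rough sketch of the sequence form above, the following Python builds one feature per code block from the identifiers of the API functions it calls and then combines the features into a feature code. Sorting the pairs by identifier and representing the feature code as a list are assumptions; the text does not fix an ordering or a container.

```python
from collections import Counter

def code_block_feature(called_api_ids):
    """Build a code block feature as a sequence of <API identifier, call count> pairs.

    Pairs are sorted by identifier so that call order does not change the
    feature (an assumption made for this sketch).
    """
    counts = Counter(called_api_ids)
    return tuple(sorted(counts.items()))

def virus_feature_code(blocks):
    """Combine the features of all code blocks of a sample into one feature code."""
    return [code_block_feature(ids) for ids in blocks]

# Hypothetical input: API identifiers called in each of two code blocks.
blocks = [[3, 7, 3], [1]]
print(virus_feature_code(blocks))
# [((3, 2), (7, 1)), ((1, 1),)]
```

Because only identifiers and counts are kept, reordering instructions or changing register allocation inside a block leaves the feature unchanged.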
3) Sample detection: a feature code is extracted from the sample to be detected and compared with the virus feature codes, and whether the sample to be detected carries a virus is judged based on the similarity of the feature codes.
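The comparison in part 3) can be sketched as follows. The similarity measure used here (the fraction of the virus's code block features that also occur in the sample) and the 0.8 threshold are assumptions chosen for illustration; this passage does not specify how the similarity is computed.

```python
def feature_similarity(sample_code, virus_code):
    """Fraction of the virus's code block features that also appear in the sample.

    This measure is a hypothetical stand-in, not the document's own formula.
    """
    if not virus_code:
        return 0.0
    sample_features = set(sample_code)
    matched = sum(1 for feature in virus_code if feature in sample_features)
    return matched / len(virus_code)

def carries_virus(sample_code, virus_code, threshold=0.8):
    """Judge the sample as infected when the similarity reaches the threshold.

    The default threshold is an arbitrary illustrative value.
    """
    return feature_similarity(sample_code, virus_code) >= threshold

# Each feature is a tuple of <API identifier, call count> pairs (toy data).
virus_code = [((1, 1),), ((3, 2), (7, 1))]
clean_like = [((3, 2), (7, 1)), ((9, 1),)]
infected_like = [((1, 1),), ((3, 2), (7, 1)), ((9, 4),)]

print(feature_similarity(clean_like, virus_code))   # 0.5
print(carries_virus(clean_like, virus_code))        # False
print(carries_virus(infected_like, virus_code))     # True
```

A threshold below 1.0 tolerates a variant that adds or alters a few code blocks while keeping most of the original API call behavior.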
Referring to fig. 2 and fig. 3, fig. 2 is an optional processing schematic diagram of the virus feature code processing method provided by the embodiment of the present invention, and fig. 3 is an optional flow schematic diagram of the same method. To extract the feature code of a virus from a malicious sample that contains the virus, the method proceeds as follows: the malicious sample carrying the virus is disassembled, and the resulting disassembled code is segmented to obtain a plurality of code blocks of the malicious sample (step 101); each code block is traversed to obtain the function calls executed in it, the target path of each function call is compared with the paths of application program interface functions, and the application program interface functions called in the code block and the number of times each is called are determined (step 102); a corresponding code block feature is constructed based on the application program interface functions called in the code block and the number of times each is called (step 103); and the virus feature code of the virus carried by the malicious sample is constructed based on the features of the code blocks of the malicious sample (step 104).
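Steps 101 and 102 can be sketched in Python on toy data. The sketch assumes a simplified disassembly given as a mapping from file paths to `call <target>` style instructions, and it groups code by directory path as a stand-in for the path-based code blocks; the instruction syntax, paths, and API set are all hypothetical.

```python
from collections import Counter, defaultdict
import posixpath

# Hypothetical API function paths, standing in for an API function library.
API_PATHS = {
    "android/telephony/SmsManager.sendTextMessage",
    "java/net/URL.openConnection",
}

def split_into_blocks(disassembly):
    """Step 101 (simplified): group instructions by the directory path of their file."""
    blocks = defaultdict(list)
    for file_path, instructions in disassembly.items():
        blocks[posixpath.dirname(file_path)].extend(instructions)
    return dict(blocks)

def count_api_calls(instructions):
    """Step 102 (simplified): keep only calls whose target path is a known API path."""
    calls = Counter()
    for instruction in instructions:
        opcode, _, target = instruction.partition(" ")
        if opcode == "call" and target in API_PATHS:
            calls[target] += 1
    return calls

# Toy disassembled code of a sample (hypothetical).
disassembly = {
    "com/toy/a/Main": [
        "call android/telephony/SmsManager.sendTextMessage",
        "call com/toy/a/Util.helper",            # internal call, not an API
        "call android/telephony/SmsManager.sendTextMessage",
    ],
    "com/toy/a/Util": ["nop"],
    "com/toy/b/Net": ["call java/net/URL.openConnection"],
}

blocks = split_into_blocks(disassembly)
for path, instructions in blocks.items():
    print(path, dict(count_api_calls(instructions)))
```

The per-block counts produced here are the raw material from which steps 103 and 104 build the code block features and the final feature code.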
These steps can be carried out automatically by machine and completed efficiently by relying on the computing power of a device (such as a terminal or a server). Meanwhile, the virus feature code is constructed from the features of the API function calls in each code block of the malicious sample. Compared with a virus feature code constructed from the hash value and byte count of the binary data of the malicious sample, as in the related art, the features of the API function calls accurately reflect the semantic characteristics of the malicious sample when it achieves its malicious purpose and are not affected by changes in the hash value or byte count of its binary data, so viruses can be detected with broad-spectrum coverage. In addition, even if the virus publisher modifies the virus carried by the malicious sample, the features of the API function calls remain relatively stable across samples carrying viruses of the same family; a feature code constructed from these features can therefore detect a virus after it has evolved, which avoids the lag in virus detection of the feature code provided by the related art.
The embodiment of the invention also provides a virus characteristic code processing device for executing the virus characteristic code processing method, and hardware in the virus characteristic code processing device can be completely deployed in a user side terminal or a network side server.
For example, the apparatus is provided in the terminal as an antivirus application. The terminal periodically pulls malicious samples from a malicious sample library, extracts and stores the feature codes of the viruses, performs security scanning on applications installed locally in the terminal and on applications being installed locally (samples to be detected) based on the feature codes of the viruses, and handles the results according to the local security policy of the terminal, including: 1) blocking installation of a to-be-installed application that is detected to contain a virus; 2) quarantining an installed application that is detected to contain a virus; 3) prompting the user and proceeding according to the handling mode the user selects.
For another example, referring to fig. 4, fig. 4 is an optional schematic diagram in which the virus feature code processing apparatus provided in the embodiment of the present invention is deployed in a network-side server that provides a cloud antivirus service. The server periodically pulls malicious samples from a malicious sample library, extracts the feature codes of the viruses, and stores the feature codes extracted from the malicious samples in a virus feature library. Based on the feature codes of the viruses, the server scans the feature code of a sample to be detected that is submitted by the antivirus application of a terminal and sends the scanning result to that antivirus application, which handles it according to the local security policy of the terminal, including: 1) blocking installation of a to-be-installed application that is detected to contain a virus; 2) quarantining an installed application that is detected to contain a virus; 3) prompting the user and proceeding according to the handling mode the user selects.
Referring to fig. 5, which shows an alternative software and hardware structure of the virus signature processing apparatus 10, the virus signature processing apparatus 10 includes a hardware layer, a driver layer, an operating system layer, and an application layer. However, it should be understood by those skilled in the art that the structure of the virus signature processing apparatus 10 shown in fig. 5 is merely an example and does not limit the structure of the virus signature processing apparatus 10. For example, the virus signature processing apparatus 10 may be provided with more components than those shown in fig. 5, or may omit some components, according to implementation requirements.
The hardware layers of the virus signature processing apparatus 10 include a processor 11, an input/output interface 13, a storage medium 14, and a network interface 12, and the components may communicate via a system bus connection.
The processor 11 may be implemented by a Central Processing Unit (CPU), a Microprocessor (MCU), an Application Specific Integrated Circuit (ASIC), or a Field-Programmable Gate Array (FPGA).
The input/output interface 13 may be implemented using input/output devices such as a display screen, a touch screen, a speaker, etc.
The storage medium 14 may be implemented by a nonvolatile storage medium such as a flash memory, a hard disk, or an optical disk, or by a volatile storage medium such as a Double Data Rate (DDR) dynamic cache, and stores executable instructions for executing the virus signature processing method.
For example, the storage medium 14 may be located at the same place (e.g., in a user-side terminal) as the other components of the virus signature processing apparatus 10, or may be distributed relative to them. The network interface 12 gives the processor 11 access to external data, such as a storage medium 14 located elsewhere. For example, the network interface 12 may support short-range communication based on Near Field Communication (NFC), Bluetooth, or ZigBee technology; cellular communication based on schemes such as Code Division Multiple Access (CDMA) and Wideband Code Division Multiple Access (WCDMA) and their evolutions; and Wireless Fidelity (Wi-Fi) communication via a wireless access point (AP) on the network side.
The driver layer includes middleware 15 for the operating system 16 to recognize and communicate with the components of the hardware layer, such as a set of drivers for the components of the hardware layer.
The operating system 16 provides a user-facing graphical interface that includes, for example, plug-in icons, a desktop background, and application icons, and supports the user in controlling the terminal through this interface. The embodiment of the present invention does not limit the software environment of the terminal, such as the type and version of the operating system, which may be, for example, a Linux operating system, a UNIX operating system, or another operating system.
The application layer includes an antivirus application/cloud antivirus service 17 run by the user-side terminal, or a module (or a functional plug-in) that can be coupled with the security software in the terminal, and executable instructions are set therein to execute the above virus feature code processing method.
In the following, the feature code processing method shown in fig. 2 is further described with reference to fig. 6. Note that, based on the following description of fig. 6, those skilled in the art can readily implement the scenario in which the feature code processing apparatus is deployed on the user terminal side.
Referring to fig. 6, fig. 6 is another optional flowchart of the feature code processing method according to the embodiment of the present invention, including the following steps:
Step 201, the server extracts the API functions provided in the terminal operating system and/or the third-party API functions in the terminal.
In one embodiment, the server pulls the API functions provided in different types of operating systems from a database dedicated to collecting and storing API functions, distinguishing between the versions of each type of operating system, and also pulls third-party API functions. Of course, provided that the server and the terminal have established a security authentication mechanism, the server can also pull the API functions directly from the terminal over a secure connection. The different types of API functions are described separately below.
1) API function provided in operating system of terminal
An API function provided in the terminal operating system is a library function that the operating system provides natively. Such functions are stored as libraries in the file system of the terminal and let application programs in the terminal use its basic capabilities. Illustratively, they include the following types of API functions:
1.1) network API functions for creating or closing network connections, enumerating network resources.
1.2) message processing API functions for implementing message passing between windows.
1.3) file processing API functions for implementing operations relating to files, such as creation, copy, and deletion.
1.4) printing API function, which is used for supporting the application program in the terminal to realize the printing function.
1.5) drawing API function for realizing the drawing function.
2) Third party API function in terminal
To implement extended functions, such as setting up various software development environments, API functions of third-party libraries are additionally injected into the operating system of the terminal, for example API functions corresponding to the unique functions provided by third-party application programs on the terminal. Taking a WeChat client as an example, these may be the API functions of the WeChat Software Development Kit (SDK) that provide WeChat payment and sharing to Moments (the circle of friends). The extraction locations of third-party API functions differ according to where the SDK files of the different third-party application programs are stored in the terminal.
Step 202, the server encodes the path of each detected API function, and stores the encoding result of the path and the identifier of the API function in the API function library.
The path of an API function is a character string consisting of a package name, a class name, and an API function name, which uniquely locates and identifies one API function.
For the API functions provided by the terminal operating system, refer to fig. 7-1 and fig. 10. Fig. 7-1 is an optional flowchart of extracting the API functions provided by the operating system and storing them in the API function library according to an embodiment of the present invention, and fig. 10 is an optional processing diagram of the same procedure. Taking an Android operating system on the terminal as an example, the API functions of the function classes defined by the Android system are all in two jar packages, core.
Firstly, the core.jar and frame.jar packages of each version of the Android operating system are collected; the jar packages are then parsed, and the paths of all API functions in them are extracted.
For example, in fig. 10, the path of the API function onCallStateChanged(int state, String incomingNumber) is:
android.telephony.PhoneStateListener
void onCallStateChanged(int state, String incomingNumber)
Secondly, the paths of the functions are converted into the format described by the Smali language (Smali code is the code obtained by decompiling a DEX file, the executable file of the Android Dalvik virtual machine), which facilitates matching during subsequent feature extraction.
Still taking the aforementioned API function onCallStateChanged(int state, String incomingNumber) as an example, one example of the format converted into the Smali language description is:
Landroid/telephony/PhoneStateListener
onCallStateChanged(ILjava/lang/String;)V
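The conversion described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names (type_to_smali, to_smali_path) are hypothetical, unqualified names such as String are assumed to resolve to java.lang, and the output follows the two-line layout used in this description rather than the `Lpkg/Class;->name(args)ret` form that Smali itself uses for method references.

```python
# Hypothetical helpers; not taken from the patent text.
PRIMITIVES = {
    "void": "V", "boolean": "Z", "byte": "B", "short": "S", "char": "C",
    "int": "I", "long": "J", "float": "F", "double": "D",
}
# Assumption: these unqualified class names belong to java.lang.
JAVA_LANG = {"String", "Object"}


def type_to_smali(java_type: str) -> str:
    """Convert one Java type name into a Dalvik type descriptor."""
    java_type = java_type.strip()
    dims = 0
    while java_type.endswith("[]"):          # each [] adds one array dimension
        java_type = java_type[:-2].strip()
        dims += 1
    if java_type in PRIMITIVES:
        desc = PRIMITIVES[java_type]
    else:
        if java_type in JAVA_LANG:
            java_type = "java.lang." + java_type
        desc = "L" + java_type.replace(".", "/") + ";"
    return "[" * dims + desc


def to_smali_path(class_name, return_type, method_name, param_types):
    """Return (owner line, method line) in the layout used in this description."""
    owner = "L" + class_name.replace(".", "/")
    params = "".join(type_to_smali(p) for p in param_types)
    return owner, f"{method_name}({params}){type_to_smali(return_type)}"


owner, method = to_smali_path(
    "android.telephony.PhoneStateListener",
    "void", "onCallStateChanged", ["int", "String"],
)
print(owner)
print(method)
```

Parameter names are dropped because a Smali method descriptor carries only the parameter types and the return type.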
Finally, a hash value is calculated for each path described in the Smali language, and the hash value of the path of the API function and the serial number (identifier) allocated to the API function are stored in the API function library.
Still taking the aforementioned API function onCallStateChanged(int state, String incomingNumber) as an example, encoding the path described in the Smali language and assigning a serial number yields:
<hash value: 4036329264617481551; serial number: 12>.
The processes of converting, encoding, and assigning sequence numbers to paths of other API functions in fig. 10 can be understood based on the above description, and are not described one by one.
Of course, the encoding result of the path of an API function may instead be obtained with another type of encoding algorithm, such as BASE64, rather than as a hash value calculated with a hash algorithm.
Each API function in the API function library is represented as a two-tuple of the encoding result of the function path and the identifier of the API function. One optional data structure in the API function library for storing API function i (i is the serial number of the API function, 1 ≤ i ≤ I, and I is the number of extracted API functions) is:
<hash value of path of API function i, i>.
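A minimal Python sketch of such an API function library is given below. The description does not name the hash algorithm, so the first 8 bytes of SHA-256 are used here purely as an assumption; the function names encode_path and build_api_library are likewise hypothetical, and the values produced will not match the hash value shown in the example above.

```python
import hashlib


def encode_path(path: str) -> int:
    """Assumed encoding: the first 8 bytes of SHA-256, read as an unsigned integer."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_api_library(paths):
    """Map the encoding result of each path to a serial number i, 1 <= i <= I."""
    library = {}
    for path in paths:
        key = encode_path(path)
        if key not in library:               # a repeated path keeps its first serial number
            library[key] = len(library) + 1
    return library


paths = [
    "Landroid/telephony/PhoneStateListener onCallStateChanged(ILjava/lang/String;)V",
    "Ljava/lang/Object toString()Ljava/lang/String;",
]
library = build_api_library(paths)
for key, serial in library.items():
    print(f"<hash value: {key}; serial number: {serial}>")
```

Keying the dictionary by the encoding result lets a later lookup of a call target run in constant time, independent of the path length.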
The processing of encoding the path of the third party API function and storing the encoding result and the serial number of the third party API function in the API function library is the same as the processing of the API function provided by the operating system, and will not be described further here.
As an alternative to step 202, the server stores the detected path of the API function (rather than the encoded result of the path of the API function) in an API function library along with an identification of the API function.
In this case, the API function library stores a two-tuple of the path of the API function and the identifier of the API function. One optional data structure in the API function library for storing API function i (i is the serial number of the API function, 1 ≤ i ≤ I, and I is the number of extracted API functions) is:
<path of API function i, i>.
Step 203, the server pulls the malicious sample from the malicious sample library.
The malicious sample library can interface with existing malicious sample databases, for example with virus databases of different families, including:
1) The system virus database. System viruses in the malicious sample library are generally distinguished by target system, with prefixes such as Win32, PE, Win95, W32, and W95.
2) The worm virus database. The prefix of a worm virus is Worm. The common characteristic of this type of virus is that it spreads through network or system vulnerabilities; a large proportion of worms send out virus-carrying mail and block the network.
3) The script virus database. The prefix of a script virus is Script. The common characteristic of script viruses is that they are written in a scripting language and spread through web pages.
4) The backdoor virus database. The prefix of a backdoor virus is Backdoor. The common characteristic of this type of virus is that it propagates through the network and opens a backdoor into the system.
5) The destructive program virus database. The prefix of a destructive program virus is Harm. The common characteristic of this type of virus is that it uses an attractive icon to entice the user to click; when the user clicks it, the virus directly damages the user terminal.
For example, according to the real-time requirements of virus scanning, the malicious sample library pulls malicious samples containing viruses from the virus databases of the different families at a weekly, daily, or hourly frequency. It may pull from all family databases on one uniform schedule, or pull from each one separately according to the update frequency of that family's virus database.
Step 204, the server disassembles the malicious sample containing the virus to obtain disassembled code.
To disassemble a malicious sample, an executable file is first extracted from it. The format of the executable file differs with the operating system it runs on: an executable file in the Windows operating system is in exe format, an executable file in the Linux operating system is in elf format, and an executable file in the Android operating system is in dex format, elf format, and so on. The executable file is then disassembled. Referring to fig. 8, which is an optional schematic diagram of disassembling the executable file provided by the embodiment of the present invention, the result of the disassembly includes:
1) Uninitialized data (BSS, Block Started by Symbol) segment: a memory area for storing the uninitialized global variables in the program.
2) Data segment: a memory area for storing the initialized global variables in the program, including mutable data segments and immutable data segments.
3) Code segment (code segment/text segment): a memory area typically used to store the executable code (statements).
4) Heap: stores the memory that is dynamically allocated while the process runs; its size is not fixed and it can grow and shrink dynamically. When the process allocates memory with malloc or similar functions, the newly allocated memory is added to the heap (the heap grows); when memory is released with free or similar functions, the released memory is removed from the heap (the heap shrinks).
5) Stack: a stack is created when a process runs, and each process has one process stack. The stack stores the program's temporary local variables, that is, variables defined within functions, excluding static variables.
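The format distinction made before disassembly in step 204 can be illustrated with a short Python sketch that inspects the leading magic bytes of the extracted file. The function name detect_executable_format is hypothetical, and the description does not say how the server recognizes the format; checking magic bytes is only one plausible way to do it.

```python
def detect_executable_format(data: bytes) -> str:
    """Guess the executable format from the leading magic bytes of the file."""
    if data.startswith(b"MZ"):           # DOS/PE header used by Windows exe files
        return "exe"
    if data.startswith(b"\x7fELF"):      # ELF magic used on Linux and Android
        return "elf"
    if data.startswith(b"dex\n"):        # DEX magic used by Android Dalvik executables
        return "dex"
    return "unknown"


print(detect_executable_format(b"dex\n035\x00" + b"\x00" * 8))
```

A disassembler suited to the detected format would then be selected; that choice is outside the scope of this sketch.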
Step 205, the server divides the disassembled code to obtain a plurality of code blocks of the malicious sample.
After disassembly is complete, the code segment of the executable file is traversed and divided into code blocks. Referring to fig. 8, which is also an optional processing diagram of dividing the code segment of the executable file into code blocks in the embodiment of the present invention, the code blocks are obtained by dividing the disassembled code (such as the code segment shown in fig. 8) with a function, or a path at a predetermined level, as the granularity. The following division modes can be used:
Mode 1) Dividing the disassembled code with the function as the granularity to obtain code blocks
The disassembled code segment of the malicious sample is traversed and divided with the function as the granularity, yielding the functions that make up the disassembled code (each function is one code block in this case). Of course, the code segment may also be divided with two or more functions as the granularity (each code block then contains two or more functions).
Functions are the basic logic units of a code segment, and each function contains a complete piece of processing logic. Dividing the code segment at function granularity therefore makes the division easy to implement and also keeps the internal logic of the disassembled code intact.
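As an illustration of mode 1, the following Python sketch splits disassembled code written in Smali into one code block per function, using the `.method` and `.end method` directives that delimit a method body in Smali. The function name split_by_function and the sample text are hypothetical; handling for other disassembly formats is not shown.

```python
def split_by_function(smali_text: str):
    """Return one code block (as text) per method found in the Smali source."""
    blocks, current = [], None
    for line in smali_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(".method"):
            current = [stripped]                     # a new function begins
        elif stripped.startswith(".end method") and current is not None:
            current.append(stripped)
            blocks.append("\n".join(current))        # the function is complete
            current = None
        elif current is not None:
            current.append(stripped)                 # a line inside the function body
    return blocks


# Hypothetical sample used only for demonstration.
SAMPLE = """.class public Lcom/example/Demo;
.method public a()V
    return-void
.end method
.method public b()V
    return-void
.end method
"""

for block in split_by_function(SAMPLE):
    print(block.splitlines()[0])
```

Lines outside any method, such as the `.class` header, are ignored because they belong to no function.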
Mode 2) Dividing with paths at different levels of the code tree as the granularity to obtain code blocks
Referring to fig. 9-1, fig. 9-1 is an optional schematic diagram of dividing the disassembled code into code blocks based on a code tree according to an embodiment of the present invention. According to the paths at a preset level in the code tree (the levels include primary, secondary, and tertiary paths), the code under each primary path is divided into a separate code block; of course, the code under each secondary path of a primary path may instead be divided into a separate code block.
Referring to fig. 9-2, fig. 9-2 is another optional schematic diagram of dividing the disassembled code into code blocks based on a code tree. For an application program running on the Android operating system that serves as the malicious sample, an executable file in Dex format is extracted from the application program and disassembled to obtain disassembled code described in the Smali language, which is then divided into code blocks. For example, the class-level path may be selected, and the Dex is divided into code blocks corresponding to class-level paths, each code block corresponding to one class in the Dex.
In fig. 9-2, each code block corresponds to one class in the disassembled code, specifically:
code block 1: com.android.internal.app.ActionBarImpl,
code block 2: com.android.internal.app.AlertActivity,
code block 3: com.android.internal.app.AlertController,
……
Of course, the server may also divide the disassembled code using a path at any other level. For example, referring to fig. 9-3, fig. 9-3 is an optional schematic diagram of dividing the disassembled code into code blocks based on the code tree provided by the embodiment of the present invention. The disassembled code may be divided according to the four-level paths in the code tree shown in fig. 9-3, each resulting code block corresponding to one four-level path in the code tree, specifically:
code block 1: com.android.internal.app,
code block 2: com.android.internal.appwidget,
code block 3: com.android.internal.backup,
……
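The path-level division of mode 2 can be sketched in Python by grouping class paths on their first few path components. The function name split_by_path_level is hypothetical, and the class names ending in ExampleA and ExampleB are placeholders invented for the demonstration.

```python
from collections import defaultdict


def split_by_path_level(class_names, level):
    """Group classes into code blocks keyed by the first `level` path components."""
    blocks = defaultdict(list)
    for name in class_names:
        key = ".".join(name.split(".")[:level])
        blocks[key].append(name)
    return dict(blocks)


classes = [
    "com.android.internal.app.ActionBarImpl",
    "com.android.internal.app.AlertActivity",
    "com.android.internal.appwidget.ExampleA",   # placeholder class name
    "com.android.internal.backup.ExampleB",      # placeholder class name
]

# Four-level paths give one code block per package, as in fig. 9-3.
for path, members in split_by_path_level(classes, 4).items():
    print(path, len(members))
```

With level set to 5, the same call yields one code block per class, which corresponds to the class-level division of fig. 9-2.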
Step 206, the server traverses each code block of the disassembled code to obtain the function calls executed in each code block, compares the target path of each function call with the paths of the application program interface functions in the API function library, and determines the application program interface functions called in each code block and the number of times each is called.
Depending on the data structure used to store API functions in the API function library, the server can compare the target path of a function call in code block j (1 ≤ j ≤ J, where J is the number of code blocks obtained by dividing the disassembled code) with the paths of the application program interface functions in the following ways:
Mode 1) The API function library stores each API function in the form <path of API function i, i>. The server detects the target path of a function call in code block j and matches it field by field against the path of API function i in the API function library. When all fields match, the function call currently detected in code block j is determined to be a call to API function i, and the call count of API function i in code block j is incremented by 1.
Mode 2) The API function library stores each API function in the form <encoding result (e.g., hash value) of the path of API function i, i>. The server encodes the target path of the function call detected in code block j (using the same encoding as for the API function paths in the library, for example the same hash algorithm) and compares the result with the encoding result of the path of API function i in the library. If the encoding results are the same, the paths are the same; the function call currently detected in code block j is determined to be a call to API function i, and the call count of API function i in code block j is incremented by 1.
Clearly, judging whether paths are consistent by comparing their encoding results is more efficient than comparing the path fields one by one, and the gain is especially marked when the paths of the API functions are long.
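Mode 2 of step 206 can be sketched in Python as a lookup of encoded call targets in the API function library. The names encode_path and count_api_calls are hypothetical, the SHA-256 based encoding is an assumption (the description does not name the hash algorithm), and the call targets are supplied as plain strings rather than parsed from real disassembled code.

```python
import hashlib
from collections import Counter


def encode_path(path: str) -> int:
    """Assumed encoding: the first 8 bytes of SHA-256, read as an unsigned integer."""
    return int.from_bytes(hashlib.sha256(path.encode("utf-8")).digest()[:8], "big")


def count_api_calls(call_targets, api_library):
    """Return {serial number of API function: call count} for one code block."""
    counts = Counter()
    for target in call_targets:
        serial = api_library.get(encode_path(target))
        if serial is not None:        # the target path matches an API function
            counts[serial] += 1
    return dict(counts)


# Hypothetical library with two API functions, serial numbers 12 and 15.
api_library = {encode_path("La/B f()V"): 12, encode_path("La/B g()V"): 15}
calls = ["La/B f()V", "La/B f()V", "La/B g()V", "Lmy/Own h()V", "La/B f()V"]
print(count_api_calls(calls, api_library))
```

The call to Lmy/Own h()V is skipped because its encoding result is not in the library, so functions defined by the sample itself do not contribute to the counts.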
Step 207, constructing corresponding code block characteristics based on the called application program interface function in the code block and the number of times of calling the application program interface function.
In one embodiment, for each code block, the identifier of each API function called in the code block and the number of times that API function is called in the code block form one feature element, so that each called API function contributes one feature element. The feature elements of all API functions called in the code block form a set, and the set is encoded to form the code block feature.
Still taking code block j as an example, the k-th API function called in code block j (1 ≤ k ≤ K, where K is the number of different API functions called in code block j) forms feature element k, recorded in the form <serial number of API function k, number of times API function k is called in code block j>. Code block j then has a set of the form {<serial number of API function k, number of times API function k is called in code block j>; 1 ≤ k ≤ K}. The set is encoded (for example, with a hash algorithm), and the encoding result is used as the code block feature.
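The construction of a code block feature can be sketched in Python as follows. The function name code_block_feature is hypothetical, and two details are assumptions rather than part of the description: the set is serialized in sorted order so that the result does not depend on the order in which calls were detected, and the first 4 bytes of SHA-256 stand in for the unspecified hash algorithm.

```python
import hashlib


def code_block_feature(call_counts):
    """Encode the set of <serial number, call count> elements of one code block."""
    elements = sorted(call_counts.items())       # canonical order for a set
    canonical = ";".join(f"{serial},{count}" for serial, count in elements)
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")     # a 32-bit code block feature


# The same set given in two different orders produces the same feature.
print(code_block_feature({12: 3, 15: 1, 22: 1}))
print(code_block_feature({22: 1, 12: 3, 15: 1}))
```

Because the call counts are part of the encoded set, a code block that calls the same API functions a different number of times receives a different feature.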
Step 208, the code block features of the code blocks of the malicious sample are combined to form the virus feature code of the malicious sample, which is stored in a virus feature library.
Still taking code block j as an example (1 ≤ j ≤ J, where J is the number of code blocks obtained by dividing the disassembled code of the malicious sample), and denoting its code block feature as code block feature j, the virus feature code of the malicious sample carrying the virus can take the form {<code block feature 1>; <code block feature 2>; ……; <code block feature J>}.
The aforementioned steps 204 to 207 are the processing flow for pulling a malicious sample from the malicious sample library and calculating the virus signature of the virus it carries. For the multiple malicious samples in the malicious sample library, this processing is executed in a loop, as shown in fig. 7-2, which is an optional flow diagram of calculating the virus signatures of the viruses carried by the malicious samples in the malicious sample library according to the embodiment of the present invention: a malicious sample is extracted at random from the malicious sample library and the virus signature of the virus it carries is calculated according to steps 204 to 207, until all malicious samples in the library have been traversed.
The server assigns an identifier (serial number VID) to the virus corresponding to each calculated virus signature, and stores all virus signatures in a virus signature library as two-tuples of the form <virus serial number, virus signature>.
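A minimal Python sketch of the virus signature library is shown below. The function name build_signature_library is hypothetical, serial numbers are assumed to be assigned sequentially starting from 1, and each signature is held as a frozenset of code block features, which is one possible reading of the set notation used above.

```python
def build_signature_library(samples):
    """Map a virus serial number (VID) to the signature of one malicious sample.

    `samples` is an iterable in which each item lists the code block features
    already computed for one malicious sample.
    """
    library = {}
    for vid, block_features in enumerate(samples, start=1):
        library[vid] = frozenset(block_features)   # <virus serial number, virus signature>
    return library


library = build_signature_library([
    [1800939131, 1369398484, 2596230670],          # code block features of one sample
])
print(library)
```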
In addition, it should be noted that the foregoing is an example in which the API functions extracted from the terminal (e.g., API functions provided in the operating system of the terminal and/or third-party API functions in the terminal) are stored in a function library. Because the extracted functions are stored in the API function library in advance, calls to API functions in a code block can be located quickly when traversing the code blocks of the disassembled code of a malicious sample, which ensures processing efficiency.
However, when the computing power of the server is sufficient, maintaining the API function library is an optional step that may be omitted in the embodiment of the present invention. The server may instead extract the API functions from the terminal whenever it needs to detect the API function calls made by a code block, and keep the extracted API functions (e.g., the encoding results of their paths and their serial numbers) in a local cache on the server. In other words, the function library need not be maintained separately, so the paths of the API functions are always up to date and the virus feature code does not lag behind changes to the API functions in the terminal.
Step 209, the server extracts the feature code of the sample to be detected, compares it with the virus feature code to obtain the similarity of the feature codes, and judges whether the sample to be detected carries the virus based on the similarity.
Referring to fig. 7-3, fig. 7-3 is a schematic view of an alternative process for detecting whether a sample to be detected carries a virus according to an embodiment of the present invention, which is described below with reference to fig. 7-3.
Firstly, for any sample to be detected, the server extracts a corresponding feature code from the sample to be detected, and the feature code is recorded as df.
Specifically, the server extracts an executable file from the sample to be detected, disassembles it to obtain disassembled code, and divides that code in the same way as for a malicious sample: mode 1) divides the disassembled code with the function as the granularity to obtain code blocks, and mode 2) divides it with paths at different levels of the code tree as the granularity to obtain code blocks.
The server then traverses each code block of the disassembled code to obtain the function calls executed in each code block, compares the target path of each function call with the paths of the application program interface functions in the API function library, and determines the application program interface functions called in each code block and the number of times each is called.
For example, depending on the data structure used to store API functions in the API function library, the server can compare the target path of a function call in code block j (1 ≤ j ≤ J, where J is the number of code blocks obtained by dividing the disassembled code) with the paths of the application program interface functions in the following ways:
Mode 1) The API function library stores each API function in the form <path of API function i, i>. The server detects the target path of a function call in code block j and matches it field by field against the path of API function i in the API function library. When all fields match, the function call currently detected in code block j is determined to be a call to API function i, and the call count of API function i in code block j is incremented by 1.
Mode 2) The API function library stores each API function in the form <encoding result (e.g., hash value) of the path of API function i, i>. The server encodes the target path of the function call detected in code block j (using the same encoding as for the API function paths in the library, for example the same hash algorithm) and compares the result with the encoding result of the path of API function i in the library. If the encoding results are the same, the paths are the same; the function call currently detected in code block j is determined to be a call to API function i, and the call count of API function i in code block j is incremented by 1.
The corresponding code block features are then constructed based on the application program interface functions called in each code block and the number of times each is called. For each code block, the identifier of each API function called in the code block and the number of times that API function is called in the code block form one feature element, so that each called API function contributes one feature element; the feature elements of all API functions called in the code block form a set, and the set is encoded to form the code block feature. The code block features of the code blocks of the sample to be detected are combined to form the feature code of the sample to be detected.
Secondly, a virus feature code and its corresponding serial number are extracted from the virus feature library; let the currently extracted virus feature code be vf and its serial number be VID.
Thirdly, the feature code df of the sample to be detected (for example, a software installation package in the apk format of the Android operating system) is compared with the virus feature code vf to obtain the number S of code block features shared by df and vf.
Here, a specific example of calculating the similarity is described with reference to fig. 11, which is an optional processing diagram for calculating the feature code similarity according to the embodiment of the present invention.
In fig. 11, assume that the disassembled code of malicious sample A carrying a virus is divided into 3 code blocks, denoted A1, A2, and A3.
The API functions called in code block A1 and the corresponding call counts are recorded as two-tuples (API function serial number, number of calls) and expressed as the set {(12,3), (15,1), (22,1)}; calculating the hash of the set gives the code block feature of code block A1: 1800939131.
Similarly, the API functions called by code block A2 and their call counts are expressed as {(56,90)}; calculating the hash gives the code block feature of code block A2: 1369398484.
Similarly, the API functions called by code block A3 and their call counts are expressed as {(32,54), (123,34), (132,36), (645,1)}; calculating the hash gives the code block feature of code block A3: 2596230670.
Combining the code block features of A1, A2, and A3 gives the virus feature code of malicious sample A: A = {1800939131, 1369398484, 2596230670}.
Assume that the sample B to be detected contains 4 code blocks, denoted B1, B2, B3, and B4.
The API functions called by code block B1 and their call counts are expressed as the set {(12,3), (15,1), (22,1)}; calculating the hash gives the code block feature of code block B1: 1800939131.
The API functions called by code block B2 and their call counts are expressed as the set {(32,3), (122,3)}; calculating the hash gives the code block feature of code block B2: 4111055178.
The API functions called by code block B3 and their call counts are expressed as the set {(56,91)}; calculating the hash gives the code block feature of code block B3: 1348286179.
The API functions called by code block B4 and their call counts are expressed as the set {(56,35), (68,9)}; calculating the hash gives the code block feature of code block B4: 281916613.
Therefore, the feature code of sample B is B = {1800939131, 4111055178, 1348286179, 281916613}.
Comparing the virus feature code of sample A with the feature code of the sample B to be detected gives the shared code block features {1800939131}, so similarity(A, B) can be calculated as follows:
similarity(A, B) = count({1800939131}) / count(A) = 1/3 ≈ 0.33.
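The calculation in this example can be reproduced with a short Python sketch. The function name similarity mirrors the notation above; treating each feature code as a Python set is an assumption about the representation.

```python
def similarity(virus_signature, sample_signature):
    """Shared code block features divided by the number of features in the virus signature."""
    virus, sample = set(virus_signature), set(sample_signature)
    if not virus:
        return 0.0                       # an empty virus signature matches nothing
    return len(virus & sample) / len(virus)


A = {1800939131, 1369398484, 2596230670}
B = {1800939131, 4111055178, 1348286179, 281916613}
print(round(similarity(A, B), 2))  # 0.33
```

The measure is asymmetric: it is normalized by the size of the virus feature code, so extra code blocks in the sample to be detected do not lower the score.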
The similarity between the feature code df of the sample to be detected and the virus feature code vf in the virus feature library can be expressed as S/M (where M is the number of code block features included in the virus feature code vf). If the similarity is greater than the similarity threshold N/M (that is, S is greater than N), the sample to be detected carries the virus VID.
If the similarity does not exceed the similarity threshold, the API function calls of the sample to be detected differ substantially from those of the virus, and the next virus feature code is extracted from the virus feature library for comparison. If the similarity to every virus feature code in the library does not exceed the threshold, the sample to be detected does not carry a virus and is a normal sample.
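The comparison loop described above can be sketched in Python as follows. The function name detect, the serial number 7, and the threshold values are hypothetical; the threshold is passed as a ratio corresponding to N/M.

```python
def detect(sample_signature, signature_library, threshold):
    """Return the VID of the first virus whose similarity exceeds the threshold, else None."""
    sample = set(sample_signature)
    for vid, virus_signature in signature_library.items():
        virus = set(virus_signature)
        if not virus:
            continue                                   # skip empty signatures
        shared = len(virus & sample)                   # S, the shared code block features
        if shared / len(virus) > threshold:            # S/M compared with the threshold
            return vid
    return None                                        # no virus matched: a normal sample


library = {7: {1800939131, 1369398484, 2596230670}}    # hypothetical VID 7
sample_b = {1800939131, 4111055178, 1348286179, 281916613}
print(detect(sample_b, library, threshold=0.3))
print(detect(sample_b, library, threshold=0.5))
```

With the sample values from fig. 11, the similarity is 1/3, so the first call reports VID 7 and the second call reports no match.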
Referring to fig. 12, fig. 12 is a schematic diagram of an alternative structure of a signature processing apparatus 20 according to an embodiment of the present invention, which includes: the assembly division unit 21, the function call unit 22, the build feature unit 23, and the feature merge unit 24 are described below.
And the assembly and segmentation unit 21 is configured to perform disassembly processing on a malicious sample with a virus, and segment the obtained disassembly code to obtain a plurality of code blocks of the malicious sample.
For example, in the case of dividing the obtained disassembled code into a plurality of code blocks of malicious samples, the assembly division unit 21 divides the disassembled code into a plurality of code blocks with a predetermined level of paths in a code tree as a granularity according to paths of the code tree of the disassembled code, or divides the disassembled code into a plurality of code blocks with a function as the granularity and obtains a plurality of code blocks according to the function.
And the function calling unit 22 is configured to traverse the code block to obtain a function call executed in the code block, compare a target path of the function call with a path of the application program interface function, and determine the application program interface function called in the code block and the number of times of calling the application program interface function.
For the comparison between the target path of the function call and the path of the application program interface function by the function call unit 72, the function call unit 22 is configured to obtain the application program interface function provided in the operating system of the terminal and/or the application program interface function of the third party in the terminal, assign an identifier (such as a serial number) to each application program interface function, compare the target path of the function call with the obtained path of the application program interface function, that is, compare whether each field of the path is consistent or not, and record the identifier of the application program interface function called in the code block and the corresponding call times.
To compare the target path of a function call with the path of an application program interface function, the function call unit 22 is further configured to encode the obtained paths of the application program interface functions provided by the operating system of the terminal and/or by third parties in the terminal, assign an identifier to each application program interface function, and compare the encoded result of the target path of the function call with the encoded results of the paths of the application program interface functions. If the encoded results are consistent, the currently detected function call is an application program interface function call, and the identifier of the application program interface function called in the code block and the corresponding number of calls are recorded.
The function call unit 22 is further configured to store the application program interface functions in a function library in advance, for example by storing the encoded result of each obtained application program interface function path together with the identifier assigned to the corresponding application program interface function. When the function call unit 22 traverses the code block, it compares the encoded result of the target path of each function call with the encoded results of the application program interface function paths in the function library. If the encoded results are consistent, the currently detected function call is an application program interface function call, and the identifier of the application program interface function called in the code block and the corresponding number of calls are recorded.
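A function library keyed by encoded paths might look like the sketch below. The embodiment does not say which encoding is used here, so SHA-1 is an assumed stand-in, and the paths and the names encode_path, build_function_library, and count_calls_with_library are hypothetical.

```python
import hashlib
from collections import Counter


def encode_path(path):
    """Encode a path; SHA-1 is an assumed choice, not the patent's encoding."""
    return hashlib.sha1(path.encode("utf-8")).hexdigest()


def build_function_library(api_paths):
    """Map each encoded API path to its identifier (a serial number)."""
    return {encode_path(p): i for i, p in enumerate(api_paths, start=1)}


def count_calls_with_library(call_targets, library):
    """Return {api_id: call_count} using one library lookup per function call."""
    counts = Counter()
    for target in call_targets:
        api_id = library.get(encode_path(target))
        if api_id is not None:          # encoded results are consistent
            counts[api_id] += 1
    return dict(counts)


# Hypothetical API paths and call targets.
library = build_function_library([
    "android/telephony/SmsManager/sendTextMessage",
    "android/telephony/TelephonyManager/getDeviceId",
])
library_counts = count_calls_with_library(
    [
        "android/telephony/TelephonyManager/getDeviceId",
        "com/example/Util/log",
        "android/telephony/TelephonyManager/getDeviceId",
    ],
    library,
)
```

Storing the encoded results in advance turns each comparison into a single dictionary lookup instead of a scan over every known path.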
The build feature unit 23 is configured to construct a corresponding code block feature based on the application program interface functions called in the code block and the number of times each is called.
To construct the code block feature, the build feature unit 23 is further configured to form a feature element from the identifier of each application program interface function called in the code block and the corresponding number of calls, form a set from the feature elements corresponding to the application program interface functions called in the code block, and encode the set to form the code block feature.
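One possible way to turn the set of feature elements into a code block feature is sketched below. The serialization format, the use of MD5 as the encoding, and the name code_block_feature are assumptions made for illustration only.

```python
import hashlib


def code_block_feature(api_counts):
    """Encode the set of (identifier, call count) elements of one code block.

    api_counts maps an API identifier to its call count in the block.
    """
    # Sort the elements so the encoding does not depend on insertion order,
    # which keeps the result a function of the set rather than of a sequence.
    elements = sorted(api_counts.items())
    serialized = ";".join(f"{api_id}:{count}" for api_id, count in elements)
    # MD5 is an assumed encoding, used here only to get a fixed-length value.
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


feature_a = code_block_feature({1: 2, 2: 1})
feature_b = code_block_feature({2: 1, 1: 2})   # same set, different order
feature_c = code_block_feature({1: 1, 2: 1})   # different call count
```

Two code blocks that call the same application program interface functions the same number of times receive the same feature, regardless of the order of the calls.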
The feature merge unit 24 is configured to merge the code block features of all code blocks of the malicious sample to form the virus feature code of the malicious sample.
The sample detection unit 25 is configured to calculate the feature code of a sample to be detected, compare the feature code of the sample to be detected with the virus feature code to obtain a feature code similarity, and determine, based on the similarity, whether the sample to be detected carries the virus.
To compare the feature code of the sample to be detected with the virus feature code, the sample detection unit 25 compares the code block features included in the feature code of the sample to be detected with the code block features included in the virus feature code to obtain the code block features common to the sample to be detected and the malicious sample, and calculates the ratio of the number of common code block features to the number of code block features included in the virus feature code.
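The ratio can be illustrated with a short sketch in which each feature code is modeled as a set of code block features. The feature values, the name feature_code_similarity, and the 0.7 threshold are assumptions; no threshold value is taken from this passage.

```python
def feature_code_similarity(sample_code, virus_code):
    """Ratio of shared code block features to the features in the virus feature code."""
    if not virus_code:
        return 0.0                       # avoid division by zero
    return len(sample_code & virus_code) / len(virus_code)


# Hypothetical code block features.
virus_code = {"f1", "f2", "f3", "f4"}
sample_code = {"f1", "f2", "f3", "f9"}

similarity = feature_code_similarity(sample_code, virus_code)  # 3 shared of 4
carries_virus = similarity >= 0.7        # assumed decision threshold
```

Because the denominator is the number of features in the virus feature code, extra code blocks in the sample to be detected do not lower the similarity.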
To calculate the feature code of the sample to be detected, the sample detection unit 25 is configured to compare the target path of each function call in each code block of the sample to be detected with the paths of the predetermined application program interface functions, construct a corresponding code block feature based on the application program interface functions found by the comparison to be called in the code block and the number of times each is called, and merge the code block features of the sample to be detected to form the feature code of the sample to be detected.
In summary, the embodiments of the present invention have the following beneficial effects:
1) the processing relies only on the computing power of the device (e.g., a terminal or a server) and can be completed efficiently;
2) in contrast to the related technology, which uses the hash value of the malicious sample, the feature code is constructed from the API function call features of the malicious sample. Because these features accurately reflect the semantics by which the malicious sample achieves its malicious purpose, and are unaffected by changes in the hash value or byte count of the malicious sample, broad-spectrum virus detection can be achieved;
3) because the API calls in a malicious sample are relatively stable, a feature code constructed from API function call features can detect a virus after it has evolved, which avoids the lag in virus detection that affects the feature codes provided by the related technology;
4) the API function calls of each code block of a sample are extracted and encoded as features. Because this approach takes the semantics and behavior of the program into account, it better resists the interference introduced by compiler optimization strategies and by a virus author's modifications to the source code, which greatly improves the broad-spectrum coverage of the feature code and reduces the difficulty of finding and removing viruses.
Those skilled in the art will understand that all or some of the steps of the method embodiments may be implemented by hardware executing program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The storage medium includes various media capable of storing program code, such as a removable storage device, a random access memory (RAM), a read-only memory (ROM), a magnetic disk, and an optical disc.
Alternatively, if the integrated unit of the present invention is implemented in the form of a software functional module and sold or used as a separate product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods according to the embodiments of the present invention. The storage medium includes a removable storage device, a RAM, a ROM, a magnetic disk, an optical disc, or any other medium that can store program code.
The above description covers only specific embodiments of the present invention, and the scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention shall fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (18)

1. A virus signature processing method is characterized by comprising the following steps:
disassembling a malicious sample carrying viruses to obtain a disassembled code, determining multiple levels of paths in a code tree of the disassembled code, and dividing codes under each level of paths into a single code block to obtain multiple code blocks corresponding to the multiple levels of paths in the malicious sample;
traversing the code block to obtain a function call executed in the code block, comparing a target path of the function call with a path of an application program interface function, and determining the application program interface function called in the code block and used for representing semantic characteristics of the malicious sample for realizing a malicious purpose, and the number of times of calling the application program interface function;
forming feature elements based on the identification of each application program interface function called in the code block and the calling times of the corresponding application program interface function in the code block, and constructing corresponding code block features based on the feature elements corresponding to the application program interface functions called in the code block;
combining the code block features of each code block of the malicious sample to form a virus feature code of the malicious sample;
calculating the feature code of the sample to be detected, comparing the feature code of the sample to be detected with the virus feature code to obtain the similarity of the feature code, and judging whether the sample to be detected carries the virus or not based on the similarity.
2. The method of claim 1, wherein the method further comprises:
dividing the disassembled code with the function as the granularity to obtain a plurality of code blocks according to function.
3. The method of claim 1, wherein comparing the target path of the function call to the path of the application programming interface function comprises:
acquiring an application program interface function provided in an operating system of a terminal and/or an application program interface function of a third party in the terminal, and comparing the target path of the function call with the path of the acquired application program interface function.
4. The method of claim 3, wherein comparing the target path of the function call to the path of the obtained application programming interface function comprises:
encoding the path of the acquired application program interface function provided in the operating system of the terminal and/or of the application program interface function of a third party in the terminal, and comparing the encoded result of the target path of the function call with the encoded result of the path of the application program interface function.
5. The method of claim 4, wherein comparing the encoded result of the target path of the function call to the encoded result of the path of the application programming interface function comprises:
storing the encoded result of the path of the acquired application program interface function and the identifier assigned to the corresponding application program interface function in a function library, and comparing the encoded result of the target path of the function call with the encoded result of the path of the application program interface function in the function library.
6. The method of claim 1, wherein constructing respective code block features based on feature elements corresponding to each of the application program interface functions called in the code block comprises:
forming a set based on the feature elements corresponding to the application program interface functions called in the code block, and encoding the set to form the code block features.
7. The method of claim 1, wherein comparing the signature of the sample to be tested with the signature of the virus to obtain similarity of signatures comprises:
comparing the code block features included in the feature code of the sample to be detected with the code block features included in the virus feature code to obtain the code block features common to the sample to be detected and the malicious sample, and calculating the ratio of the number of common code block features to the number of code block features included in the virus feature code.
8. The method of claim 1, wherein the calculating the signature of the sample to be detected comprises:
comparing the target path of each function call in each code block of the sample to be detected with the path of the application program interface function, and constructing corresponding code block features based on the application program interface function found by the comparison to be called in the code block and the number of times the application program interface function is called; and combining the code block features of the sample to be detected to form the feature code of the sample to be detected.
9. A virus signature processing apparatus, comprising:
the assembly and segmentation unit is used for disassembling a malicious sample carrying viruses to obtain a disassembled code, determining multiple levels of paths in a code tree of the disassembled code, and dividing codes under each level of paths into a single code block to obtain multiple code blocks corresponding to the multiple levels of paths in the malicious sample;
the function calling unit is used for traversing the code block to obtain function calls executed in the code block, comparing a target path of the function calls with a path of an application program interface function, determining the application program interface function called in the code block and used for representing semantic characteristics of the malicious sample for realizing malicious purposes, and calling times of the application program interface function;
a feature construction unit, configured to form feature elements based on an identifier of each application program interface function called in the code block and the number of times that the corresponding application program interface function is called in the code block, and construct corresponding code block features based on the feature elements corresponding to each application program interface function called in the code block;
a feature merging unit, configured to merge code block features of each code block of the malicious sample to form a virus feature code of the malicious sample;
and the sample detection unit is used for calculating the feature code of the sample to be detected, comparing the feature code of the sample to be detected with the virus feature code to obtain the similarity of the feature code, and judging whether the sample to be detected carries the virus or not based on the similarity.
10. The apparatus of claim 9,
the assembly and segmentation unit is further used for dividing the disassembled code with the function as the granularity to obtain a plurality of code blocks according to function.
11. The apparatus of claim 9,
the function calling unit is further configured to acquire an application program interface function provided in an operating system of a terminal and/or an application program interface function of a third party in the terminal, and compare the target path of the function call with the path of the acquired application program interface function.
12. The apparatus of claim 11,
the function calling unit is further configured to encode the path of the acquired application program interface function provided in the operating system of the terminal and/or of the application program interface function of the third party in the terminal, and compare the encoded result of the target path of the function call with the encoded result of the path of the application program interface function.
13. The apparatus of claim 12,
the function calling unit is further configured to store the encoded result of the path of the acquired application program interface function and the identifier assigned to the corresponding application program interface function in a function library, and compare the encoded result of the target path of the function call with the encoded result of the path of the application program interface function in the function library.
14. The apparatus of claim 9,
the feature construction unit is further configured to form a set based on the feature elements corresponding to each of the application program interface functions called in the code block, and encode the set to form the code block features.
15. The apparatus of claim 9,
the sample detection unit is further configured to compare code block features included in the feature code of the sample to be detected with code block features included in the virus feature code, obtain code block features common to the sample to be detected and the malicious sample, and calculate a quantity ratio of the common code block features to the code block features included in the virus feature code.
16. The apparatus of claim 9,
the sample detection unit is further configured to compare the target path of each function call in each code block of the sample to be detected with the path of the application program interface function, and construct corresponding code block features based on the application program interface function found by the comparison to be called in the code block and the number of times the application program interface function is called; and combine the code block features of the sample to be detected to form the feature code of the sample to be detected.
17. A virus signature processing apparatus, the apparatus comprising:
a memory for storing executable instructions;
a processor for implementing the virus signature processing method of any one of claims 1 to 8 when executing computer executable instructions stored in the memory.
18. A computer-readable storage medium having stored thereon computer-executable instructions for implementing the virus signature processing method of any one of claims 1 to 8 when executed.
CN201710035588.8A 2017-01-18 2017-01-18 Virus characteristic code processing method and device Active CN106803040B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710035588.8A CN106803040B (en) 2017-01-18 2017-01-18 Virus characteristic code processing method and device

Publications (2)

Publication Number Publication Date
CN106803040A CN106803040A (en) 2017-06-06
CN106803040B true CN106803040B (en) 2021-08-10

Family

ID=58984570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710035588.8A Active CN106803040B (en) 2017-01-18 2017-01-18 Virus characteristic code processing method and device

Country Status (1)

Country Link
CN (1) CN106803040B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107678968A (en) * 2017-10-18 2018-02-09 北京奇虎科技有限公司 Sample extraction method, apparatus, computing device and the storage medium of source code function
CN108334778B (en) * 2017-12-20 2021-12-31 北京金山安全管理系统技术有限公司 Virus detection method, device, storage medium and processor
CN109165514B (en) * 2018-10-16 2019-08-09 北京芯盾时代科技有限公司 A kind of risk checking method
CN109492396B (en) * 2018-11-12 2021-02-26 杭州安恒信息技术股份有限公司 Malicious software gene rapid detection method and device based on semantic segmentation
CN110647747B (en) * 2019-09-05 2021-02-09 四川大学 False mobile application detection method based on multi-dimensional similarity
CN112579828B (en) * 2019-09-30 2024-10-01 奇安信安全技术(珠海)有限公司 Processing method and device of feature codes, system, storage medium and electronic device
CN112148305B (en) * 2020-10-28 2024-09-10 腾讯科技(深圳)有限公司 Application detection method, device, computer equipment and readable storage medium
CN114881018B (en) * 2022-05-06 2024-10-01 安天科技集团股份有限公司 File processing method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136475A (en) * 2011-11-29 2013-06-05 姚纪卫 Method and device for detecting computer viruses
CN103970523A (en) * 2013-02-05 2014-08-06 中国移动通信集团广东有限公司 Method and device for recognition of JAVA compiling destination file
CN104391798A (en) * 2014-12-09 2015-03-04 北京邮电大学 Software feature information extracting method
CN104751052A (en) * 2013-12-30 2015-07-01 南京理工大学常熟研究院有限公司 Dynamic behavior analysis method for mobile intelligent terminal software based on support vector machine algorithm
CN105184160A (en) * 2015-07-24 2015-12-23 哈尔滨工程大学 API object calling relation graph based method for detecting malicious behavior of application program in Android mobile phone platform
CN106709349A (en) * 2016-12-15 2017-05-24 中国人民解放军国防科学技术大学 Multi-dimension behavior characteristic-based malicious code classification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant