CN113569241A - Virus detection method and device - Google Patents

Publication number
CN113569241A
Authority
CN
China
Prior art keywords
features
executable file
api
byte
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110857502.6A
Other languages
Chinese (zh)
Inventor
唐侃毅
周波
褚军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Technologies Co Ltd
Original Assignee
New H3C Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New H3C Technologies Co Ltd
Priority to CN202110857502.6A
Publication of CN113569241A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/565 Static detection by checking file integrity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Virology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a virus detection method and device. The method comprises: extracting static features of an executable file to be detected; inputting the static features into a static detection model to obtain a result value representing the static detection result; if the result value hits a preset dynamic detection condition threshold, extracting running features of the executable file; and inputting the running features into a dynamic detection model and determining, according to the output of the dynamic detection model, whether the executable file is a virus file. Combining static detection with dynamic detection in this way balances detection efficiency against detection accuracy.

Description

Virus detection method and device
Technical Field
The present application relates to the field of computer security technologies, and in particular, to a method and an apparatus for detecting a virus.
Background
A computer virus generally refers to a deliberately written program that damages computer information or systems. Computer viruses are destructive, infectious and latent, and with the rapid development of network technologies they also spread quickly.
A computer virus, as a program, is executable and therefore, it usually exists in the form of an executable file. In order to distinguish between normal executable files and virus files, virus detection needs to be performed on the executable files.
Currently, virus detection for executable files generally employs a single detection method, i.e., the same detection method is performed for all executable files, such as a static detection method or a dynamic detection method.
The static detection method mainly analyzes the static features of the executable file (features of the file when it is not running) to determine whether the executable file is a virus file. This method has low detection accuracy for virus files that have been packed or encrypted.
The dynamic detection method mainly analyzes the dynamic features of the executable file (features of the file while it is running) to determine whether the executable file is a virus file. This method is time-consuming, so its detection efficiency is low.
Disclosure of Invention
In view of the above, the present application provides a virus detection method and apparatus that balance virus detection accuracy against detection efficiency.
To this end, the application provides the following technical solutions:
in a first aspect, the present application provides a method for virus detection, the method comprising:
extracting static features of an executable file to be detected;
inputting the static features into a static detection model to obtain a result value representing a static detection result;
if the result value hits a preset dynamic detection condition threshold, extracting running features of the executable file;
inputting the running features into a dynamic detection model, and determining whether the executable file is a virus file according to an output result of the dynamic detection model.
Optionally, the method further includes:
if the result value does not hit the dynamic detection condition threshold, determining whether the executable file is a virus file according to the result value.
Optionally, the static features include byte features, import features, text features, and attribute features, where the byte features include a first byte feature determined based on the number of occurrences of the byte value and a second byte feature determined based on the byte entropy.
Optionally, extracting text features of the executable file includes:
for each readable character in the American Standard Code for Information Interchange (ASCII) code table, counting the number of times the character occurs in the executable file;
and performing hash operation of a preset dimension on a data combination consisting of the corresponding occurrence times of each readable character to obtain the text characteristics of the preset dimension, wherein the preset dimension is greater than the number of the readable characters in the ASCII code table.
Optionally, extracting the running features of the executable file includes:
acquiring running information of the executable file during simulated running, wherein the running information comprises the name of each called Application Programming Interface (API), the number of the thread calling the API, and the sequence number of the API call within the thread;
and extracting running features from the running information according to a preset feature extraction rule, wherein the running features comprise global features, local features, API sequence features and API probability features.
Optionally, the dynamic detection model is a pre-trained fusion model composed of a plurality of detection models, the plurality of detection models include at least one Text Convolutional Neural network (Text-CNN) model, and the inputting the operating characteristics into the dynamic detection model includes:
inputting the API sequence features into the at least one Text-CNN model;
inputting the features of the operating features except the API sequence features into the detection models of the dynamic detection model except the at least one Text-CNN model.
In a second aspect, the present application provides a virus detection apparatus, the apparatus comprising:
the extraction unit is used for extracting the static characteristics of the executable file to be detected;
the input unit is used for inputting the static characteristics into a static detection model to obtain a result value for representing a static detection result;
the extraction unit is further configured to extract an operation feature of the executable file if the result value hits a preset dynamic detection condition threshold;
the input unit is further configured to input the operation characteristics into a dynamic detection model, and determine whether the executable file is a virus file according to an output result of the dynamic detection model.
Optionally, the apparatus further comprises:
and the determining unit is used for determining whether the executable file is a virus file according to the result value if the result value does not hit the dynamic detection condition threshold.
Optionally, the static features include byte features, import features, text features, and attribute features, where the byte features include a first byte feature determined based on the number of occurrences of the byte value and a second byte feature determined based on the byte entropy.
Optionally, the extracting unit extracts a text feature of the executable file, including:
for each readable character in the ASCII code table, counting the number of times the character occurs in the executable file;
and performing hash operation of a preset dimension on a data combination consisting of the corresponding occurrence times of each readable character to obtain the text characteristics of the preset dimension, wherein the preset dimension is greater than the number of the readable characters in the ASCII code table.
Optionally, the extracting unit extracts the running features of the executable file, including:
acquiring running information of the executable file during simulated running, wherein the running information comprises the name of each called API, the number of the thread calling the API, and the sequence number of the API call within the thread;
and extracting running features from the running information according to a preset feature extraction rule, wherein the running features comprise global features, local features, API sequence features and API probability features.
Optionally, the dynamic detection model is a pre-trained fusion model composed of a plurality of detection models, the plurality of detection models includes at least one Text-CNN model, and the inputting unit inputs the operation characteristic into the dynamic detection model, including:
inputting the API sequence features into the at least one Text-CNN model;
inputting the features of the operating features except the API sequence features into the detection models of the dynamic detection model except the at least one Text-CNN model.
As can be seen from the above description, in the embodiment of the present application, static detection is performed on an executable file first. When static detection cannot accurately determine the file type (normal file or virus file) of the executable file, dynamic detection is then performed on the executable file, which improves detection accuracy. Conversely, if static detection can accurately determine the file type of the executable file, dynamic detection is not needed, which improves detection efficiency. Detection efficiency and detection accuracy are thus both taken into account.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a virus detection method according to an embodiment of the present disclosure;
FIG. 2 is a block diagram of a static detection flow shown in an embodiment of the present application;
FIG. 3 is a block diagram of a dynamic detection flow shown in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a virus detection apparatus according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application.
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present application. As used in the embodiments of the present application, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the embodiments of the present application. The word "if" as used herein may be interpreted as "at the time of", "when", or "in response to a determination", depending on the context.
The application provides a virus detection method, which combines static detection and dynamic detection to achieve the aim of integrally considering both virus detection efficiency and detection accuracy.
For the purpose of making the objects, aspects and advantages of the present application more apparent, the following detailed description of the present application is made with reference to the accompanying drawings and specific embodiments:
Referring to FIG. 1, a flowchart of a virus detection method according to an embodiment of the present application is shown. As shown in FIG. 1, the process may include the following steps:
step 101, extracting static characteristics of an executable file to be detected.
In the Windows operating system, the executable file may be a Portable Executable (PE) file. Common PE files include executable (EXE) files, Dynamic Link Library (DLL) files, system (SYS) files, and the like.
Herein, the static features of an executable file refer to features of the executable file when it is not running.
For one embodiment, the static features may include byte features, import features, text features, and attribute features.
The following describes the extraction of these several features:
Byte features
Each file actually exists on disk in binary form. As an example, the binary form of an executable file may be represented as: 0x01 0x05 0x03 0x01 ... It can be seen that the executable file consists of a series of bytes.
The embodiment of the application aims at extracting byte characteristics of the executable file in the binary form. Here, the extracted byte features include a first byte feature determined based on the number of occurrences of the byte value and a second byte feature determined based on the byte entropy. It is to be understood that the first byte characteristic and the second byte characteristic are named for convenience of distinguishing and are not used for limitation.
As an example, the number (also referred to as dimension) of the first byte features required to be extracted may be determined according to the value range (0-255) of a single byte, and then, for each byte value (0x00, 0x01, 0x02, … …, 0xff), the number of times each byte value appears in the executable file may be counted separately. For example, a first byte signature of 256 dimensions [0, 2, 0, 1, … …, 5] can be obtained when 0x00 appears 0 times in the file, 0x01 appears 2 times in the file, 0x02 appears 0 times in the file, 0x03 appears 1 time in the file, … …, 0xff appears 5 times in the file. The 256-dimensional first byte characteristic can reflect the byte value distribution of the executable file.
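As a minimal sketch of the counting described above, the 256-dimensional first byte feature could be computed as follows. The function name `byte_histogram` is illustrative and does not come from the application.

```python
from collections import Counter


def byte_histogram(data: bytes) -> list:
    """Return a 256-element list: entry v is how often byte value v occurs."""
    counts = Counter(data)  # iterating over bytes yields integers 0..255
    return [counts.get(value, 0) for value in range(256)]


# Mirrors the example in the text: 0x01 twice, 0x03 once, 0xff five times.
sample = bytes([0x01, 0x01, 0x03] + [0xFF] * 5)
first_byte_feature = byte_histogram(sample)
```

Raw counts grow with file size, so an implementation might normalize them before feeding a model; the application does not say whether it does.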
As an example, the preset moving window size is 1024 bytes and the moving step size is 256 bytes. The moving window slides over the binary form of the executable file one step at a time. The byte entropy of each window is calculated in turn from the 1024 bytes at the window's position; specifically, the byte entropy can be calculated by the following formula:
$$E = -\sum_{i=1}^{n} p_i \log_2 p_i \tag{1}$$
where n is the number of values a byte can take; p_i is the probability of the i-th byte value occurring in the window; and E is the byte entropy of the window, which characterizes the uncertainty of the byte values within the window.
And generating a multidimensional second byte characteristic corresponding to the executable file according to the byte entropy of each window and the occurrence frequency of each byte value in the window. See table 1 for a 256 x 8 dimensional second byte feature example.
[Table image not reproduced: a 256 × 8 grid of counts, with one column per byte value and one row per byte entropy value]
TABLE 1
Wherein, the horizontal axis represents byte values, comprising the 256 byte values from 0x00 to 0xff; the vertical axis represents byte entropy, comprising the 8 byte entropy values from 0 to 7.
For each byte value (256 byte values from 0x00 to 0xff), the number of times each byte value appears in the window is counted, and the byte entropy of the window is calculated according to equation (1).
As an example, if it is determined that 0x00 appears 10 times in the window, 0x01 appears 2 times in the window, 0x02 appears 20 times in the window, … …, 0xfe appears 0 times in the window, 0xff appears 5 times in the window, and the byte entropy of the current window is 1 according to the above statistical and calculation method, the number of occurrences of each byte value of the statistics is recorded in the row having the byte entropy of 1, as shown in table 2.
Byte entropy | 0x00 | 0x01 | 0x02 | ... | 0xfe | 0xff
1 | 10 | 2 | 20 | ... | 0 | 5
TABLE 2 (only the row for byte entropy 1 is shown)
Similarly, the above process is performed for each window and the results of the process are accumulated in the above table until the window slides over all bytes of the executable file to obtain the 256 × 8 dimensional second byte characteristic of the executable file.
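The sliding-window procedure above can be sketched as follows. This is only an illustration under stated assumptions: the entropy uses a base-2 logarithm, a window is assigned to a row by truncating its entropy to an integer and capping it at 7, and trailing bytes that do not fill a further step are ignored. None of these details is confirmed by the application, and all names are hypothetical.

```python
import math
from collections import Counter

WINDOW = 1024      # window size in bytes, as in the example above
STEP = 256         # moving step in bytes, as in the example above
ENTROPY_ROWS = 8   # rows 0..7 of the 256 x 8 table


def window_entropy(window: bytes) -> float:
    """Shannon entropy (base 2) of the byte values in one window."""
    total = len(window)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(window).values())


def byte_entropy_table(data: bytes) -> list:
    """Accumulate per-window byte counts into the row chosen by window entropy."""
    table = [[0] * 256 for _ in range(ENTROPY_ROWS)]
    if not data:
        return table
    last_start = max(len(data) - WINDOW, 0)
    for start in range(0, last_start + 1, STEP):
        window = data[start:start + WINDOW]
        # Assumption: truncate the entropy and cap it so that 8.0 maps to row 7.
        row = min(int(window_entropy(window)), ENTROPY_ROWS - 1)
        for value, count in Counter(window).items():
            table[row][value] += count
    return table


uniform = byte_entropy_table(bytes(range(256)) * 4)   # one window, entropy 8.0
constant = byte_entropy_table(b"\x00" * 2048)         # five windows, entropy 0
```

Flattening the 8 rows gives the 2048-dimensional second byte feature.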
Import features
The import table of the executable file is mainly used for recording system resource information required by file operation, such as a name of a system function required to be called, a name of a dynamic link library, and the like, and the system resource information is recorded in the import table in a character string form. In order to facilitate computer processing, the embodiment of the present application adopts a preset hash algorithm to convert a character string in an import table into a number as an extracted import feature.
The use of hash algorithms to convert strings into numbers is a mature technology and is not described in detail here. However, the dimension of the conversion can be set according to actual requirements, for example, a 256-dimensional hash algorithm is adopted to perform string/number conversion, so as to obtain 256-dimensional import characteristics.
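One common way to turn a variable number of strings into a fixed-length numeric vector is feature hashing. The sketch below assumes that reading; the application only says "a preset hash algorithm", so the choice of SHA-256, the bucket rule and the sample strings are illustrative assumptions.

```python
import hashlib


def hash_strings(strings, dim=256):
    """Map each string to one of `dim` buckets and count the hits per bucket."""
    vector = [0] * dim
    for text in strings:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "little") % dim
        vector[bucket] += 1
    return vector


# Illustrative import-table entries in "library:function" form (made-up sample).
imports = [
    "kernel32.dll:CreateFileA",
    "advapi32.dll:RegOpenKeyExA",
    "kernel32.dll:CreateFileA",
]
import_feature = hash_strings(imports)
```

The same string always lands in the same bucket, so files that import the same functions produce overlapping vectors regardless of how many imports they have.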
Text features
The executable file may also be opened as text, in which case it contains strings of letters, numbers and symbols. These letters, numbers and symbols are usually the characters with ASCII code values 0x20 to 0x7e in the ASCII code table, and are usually called readable characters or printable characters.
The embodiment of the application counts the number of times each readable character in the ASCII code table occurs in the executable file. For example, the character "!" (ASCII code value 0x21) appears 4 times in the file, the character "#" (0x23) appears 17 times, the character "a" (0x61) appears 30 times, the character "b" (0x62) appears 107 times, and so on.
Since the number of readable characters in the ASCII code table is only 95, only 95 statistical values, or 95-dimensional features, can be obtained through the above statistics. In order to improve the importance of the text features, the text features obtained through statistics are expanded to obtain text features with larger dimensionality.
Specifically, hash operation of a preset dimension is performed on a data combination composed of the corresponding occurrence times of each readable character, so as to obtain the text feature of the preset dimension. Here, the preset dimension is larger than the number of readable characters in the ASCII code table, for example, the preset dimension is 256 dimensions.
Through dimension extension, the importance of text features can be improved, all dimension features can be controlled within a certain range, and the phenomenon that some features correspond to too large numerical values and some features correspond to too small numerical values is avoided.
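The two steps above (counting readable characters, then hashing the counts into a larger vector) might look like the sketch below. The application does not specify the hash operation, so the bucket rule here is one plausible reading and the function names are hypothetical. This simple version does not rescale the values, so it does not show the range-control effect mentioned above.

```python
import hashlib


def printable_counts(data: bytes) -> dict:
    """Count occurrences of each of the 95 readable ASCII characters."""
    counts = {chr(code): 0 for code in range(0x20, 0x7F)}
    for byte in data:
        if 0x20 <= byte <= 0x7E:
            counts[chr(byte)] += 1
    return counts


def hashed_text_feature(data: bytes, dim: int = 256) -> list:
    """Spread the 95 counts over `dim` buckets keyed by a hash of the character."""
    vector = [0] * dim
    for char, count in printable_counts(data).items():
        digest = hashlib.sha256(char.encode("ascii")).digest()
        vector[int.from_bytes(digest[:4], "little") % dim] += count
    return vector


sample = b"!!!!ab\x00\xff"
counts = printable_counts(sample)
text_feature = hashed_text_feature(sample)
```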
Attribute features
Here, the attribute feature refers to other accessory features that the file has in addition to the aforementioned main features (byte feature, import feature, text feature), such as a file header feature, a file general feature, a file section feature.
The file header features refer to features extracted based on file header information. The file header information mainly describes the machine on which the file runs, the sections, the link time, and so on. For example, the file header includes a machine code (Machine) field, which identifies the machine code of the Central Processing Unit (CPU) that runs the file; a number-of-sections field, which identifies the number of sections in the file; a creation-time field, which identifies when the file was created; and so on. Among this information, items in numeric format (for example, the number of sections) may be used directly as features, and items in text format may be used after being converted into numeric format.
The file general features refer to features extracted based on general information about the file. The general information typically includes: file size, size of the file in memory, whether the file is in debug format, output information, input information, number of accessed resources, file signature, file flags, and so on. Some of this information is in numeric format (for example, the file size and the size of the file in memory) and can be used directly as features; the rest is in text format (for example, output information, input information, file signatures and file flags) and can be used as features after being converted into numeric format by a hash algorithm.
The file section features refer to features extracted based on information about each section included in the executable file. Information about each section of the executable file, for example whether a section is readable, writable or executable, is recorded in the file section table, so the embodiment of the present application may extract the section features of the executable file from the section information recorded in that table. For example, the number of sections with a length of 0, the number of sections with an empty name, the number of readable and executable sections, the number of writable sections, the section sizes, and so on are counted. As before, information in numeric format can be used directly as features, while information in text format must first be converted into numeric format.
Through the above processing, static features required for static detection are obtained, for example, 2304-dimensional byte features (256-dimensional first byte features +256 × 8-dimensional second byte features), 256-dimensional import features, 256-dimensional text features, and 1024-dimensional attribute features.
Step 102, inputting the static features into a static detection model to obtain a result value representing a static detection result.
As an example, the static detection model may be a Multi-Layer Perceptron (MLP) neural network. The neural network may consist of 1 input layer, 5 hidden layers and 1 output layer. The multidimensional static features obtained in step 101 (for example, 2304 + 256 + 256 + 1024 = 3840-dimensional static features) are input into the input layer, each hidden layer consists of 512 nodes, and the output layer outputs a 1-dimensional result value. The result value is usually between 0 and 1, where 0 represents a normal file and 1 represents a virus file: the closer the result value is to 0, the higher the probability that the file is normal, and the closer it is to 1, the higher the probability that the file is a virus.
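To make the layer structure concrete, the sketch below builds an untrained network of the described shape in plain Python and runs one forward pass. The ReLU hidden activation, the sigmoid output, the random initialization and all names are assumptions; the application states only the layer counts and sizes. A small input size is used in the demonstration so that it runs quickly.

```python
import math
import random


def init_mlp(input_dim, hidden_dim=512, hidden_layers=5, seed=0):
    """Create random (untrained) weights: input -> 5 hidden layers -> 1 output."""
    rng = random.Random(seed)
    dims = [input_dim] + [hidden_dim] * hidden_layers + [1]
    layers = []
    for fan_in, fan_out in zip(dims, dims[1:]):
        scale = 1.0 / math.sqrt(fan_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers


def forward(layers, features):
    """Return a result value between 0 and 1 for one feature vector."""
    activations = list(features)
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activations)) + b
             for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            activations = [max(0.0, v) for v in z]                   # ReLU (assumed)
        else:
            activations = [1.0 / (1.0 + math.exp(-v)) for v in z]   # sigmoid (assumed)
    return activations[0]


# Tiny demo shape; the text describes input_dim=3840 and hidden_dim=512.
demo_layers = init_mlp(input_dim=8, hidden_dim=4)
demo_score = forward(demo_layers, [0.5] * 8)
```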
Referring to fig. 2, a static detection flow diagram is shown in the embodiment of the present application.
Step 103, if the result value hits a preset dynamic detection condition threshold, extracting the running features of the executable file.
As can be seen from the analysis of the result values in step 102, when the result values approach 0 or 1, the file type (normal file or virus file) can be accurately predicted; when the result value is far from 0 or 1, for example, in the interval of 0.2-0.8, the prediction accuracy will be greatly reduced.
In order to meet the requirements of the overall detection efficiency and the detection accuracy, the dynamic detection condition threshold can be preset according to the actual application scene, for example, the dynamic detection condition threshold is preset to be 0.2-0.8.
If the result value output in step 102 does not hit the condition threshold, for example, the result value is 0.9512, which is close to 1, then the executable file can be accurately determined to be a virus file; for another example, if the result value is 0.084 and approaches 0, the executable file can be accurately determined to be a normal file.
If the result value output in step 102 hits the condition threshold, for example, the output result value is 0.4, which indicates that the static detection cannot accurately determine the file type, at this time, the dynamic detection may be used to improve the file detection accuracy. Therefore, the sandbox can be used for simulating the running of the executable file so as to extract the running characteristics of the executable file.
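The routing between the two detectors can be sketched as below. The function and parameter names are hypothetical, the interval is treated as inclusive at both ends (the text does not say), and the dynamic detector is passed in as a callable so that the sandbox only runs when it is needed.

```python
def detect(static_score, run_dynamic_detection, low=0.2, high=0.8):
    """Return True if the file is judged to be a virus file.

    static_score: result value of the static detection model (0..1).
    run_dynamic_detection: zero-argument callable returning True for a virus;
        it is only invoked when the static score falls inside [low, high].
    """
    if low <= static_score <= high:
        # Static detection is inconclusive: fall back to dynamic detection.
        return run_dynamic_detection()
    # Static detection is confident: scores near 1 mean virus, near 0 normal.
    return static_score > high


calls = []

def fake_dynamic():
    """Stand-in for sandbox execution plus the dynamic model (always 'virus')."""
    calls.append("dynamic")
    return True

confident_virus = detect(0.9512, fake_dynamic)   # decided statically
confident_clean = detect(0.084, fake_dynamic)    # decided statically
uncertain = detect(0.4, fake_dynamic)            # decided dynamically
```

Widening the interval sends more files to the sandbox (slower, more accurate); narrowing it does the opposite.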
Specifically, the running information of the executable file is obtained. The running information may include the name of each called API, the number of the thread that called the API, and the sequence number of the API call within the thread. Table 3 shows an example of the running information of an executable file (file1):
File name | API name | Thread number | Call order in thread
file1 | RegKeyExAapi1 | 2332 | 0
file1 | CpFileAapi1 | 2332 | 1
file1 | OpenSCAapi1 | 2332 | 2
file1 | CrtServiceAapi | 2332 | 3
file1 | RegKeyExAapi1 | 2468 | 0
file1 | CpFileAapi1 | 2468 | 1
file1 | OpenSCAapi1 | 2468 | 2
file1 | CrtServiceAapi | 2468 | 3
file1 | StartServiceA | 2468 | 4
file1 | NtCreateThreadEx | 2468 | 5
TABLE 3
Taking the first entry as an example: when executable file1 runs, it first calls RegKeyExAapi1. The call is made by the thread numbered 2332, and RegKeyExAapi1 is the first API called by thread 2332, so its call sequence number is 0.
After all the running information of the executable file has been acquired, the running features are extracted from it according to a preset feature extraction rule. The running features may include global features, local features, API sequence features, and API probability features.
The extraction of each running feature is described below:
Global features
Global features generally refer to features extracted from a single item of running information, such as features extracted only from the thread numbers or only from the call sequence numbers.
Here, a global feature extracted only from the thread numbers is taken as an example. Table 4 shows an example of the global features obtained from the thread numbers in Table 3.
API name | Count | Mean | Variance | Minimum | 25% | 50% | 75% | Maximum
CpFileAapi1 | 2 | 2400 | 96.17 | 2332 | 2366 | 2400 | 2434 | 2468
CrtServiceAapi | 2 | 2400 | 96.17 | 2332 | 2366 | 2400 | 2434 | 2468
NtCreateThreadEx | 1 | 2468 | (empty) | 2468 | 2468 | 2468 | 2468 | 2468
OpenSCAapi1 | 2 | 2400 | 96.17 | 2332 | 2366 | 2400 | 2434 | 2468
RegKeyExAapi1 | 2 | 2400 | 96.17 | 2332 | 2366 | 2400 | 2434 | 2468
StartServiceA | 1 | 2468 | (empty) | 2468 | 2468 | 2468 | 2468 | 2468
TABLE 4
Taking the first entry as an example: the count of 2 indicates that CpFileAapi1 is called 2 times in file1; the mean 2400 is the average of the numbers (2332 and 2468) of the threads that called CpFileAapi1; the "variance" value 96.17 measures the dispersion of those thread numbers (numerically, it equals the sample standard deviation of 2332 and 2468); 2332 is the minimum thread number that called CpFileAapi1; 2468 is the maximum thread number that called CpFileAapi1; 2366 is the thread number at the 25% quantile between the minimum and maximum thread numbers; 2400 is the thread number at the 50% quantile; and 2434 is the thread number at the 75% quantile. Thread number values at other quantiles (e.g., 0.2, 0.4, 0.6, 0.8) may also be extracted as required.
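The values in Table 4 can be reproduced with a short sketch. It assumes linear interpolation for the quantiles and the sample standard deviation for the dispersion column, since those choices match the numbers in the table; the helper names are illustrative.

```python
import statistics
from collections import defaultdict

# (API name, thread number) pairs taken from Table 3.
CALLS = [
    ("RegKeyExAapi1", 2332), ("CpFileAapi1", 2332), ("OpenSCAapi1", 2332),
    ("CrtServiceAapi", 2332), ("RegKeyExAapi1", 2468), ("CpFileAapi1", 2468),
    ("OpenSCAapi1", 2468), ("CrtServiceAapi", 2468), ("StartServiceA", 2468),
    ("NtCreateThreadEx", 2468),
]


def quantile(sorted_values, q):
    """Quantile by linear interpolation between neighbouring sorted values."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    step = sorted_values[upper] - sorted_values[lower]
    return sorted_values[lower] + step * (position - lower)


def global_thread_features(calls):
    """Per-API statistics over the numbers of the threads that called the API."""
    threads_by_api = defaultdict(list)
    for api, thread in calls:
        threads_by_api[api].append(thread)
    features = {}
    for api, threads in threads_by_api.items():
        threads.sort()
        features[api] = {
            "count": len(threads),
            "mean": statistics.mean(threads),
            # Undefined for a single observation, shown as (empty) in Table 4.
            "spread": round(statistics.stdev(threads), 2) if len(threads) > 1 else None,
            "min": threads[0],
            "25%": quantile(threads, 0.25),
            "50%": quantile(threads, 0.50),
            "75%": quantile(threads, 0.75),
            "max": threads[-1],
        }
    return features


table4 = global_thread_features(CALLS)
```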
Local features
Local features generally refer to features extracted from a combination of several items of running information, and may include second-order local features and higher-order local features.
A second-order local feature is a feature extracted from a combination of two items of running information. For example, for each combination of file name and thread number, statistics such as the count, maximum value, minimum value, mean, and variance of the API call sequence numbers in the corresponding thread may be computed; as another example, for each combination of file name and thread number, features such as the number of API calls in the corresponding thread may be computed.
A higher-order local feature is a feature extracted from a combination of more than two items of running information. For example, for each combination of file name, API, and thread number, statistics such as the count, maximum value, minimum value, mean, and variance of the API call sequence numbers may be computed.
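The grouping described above can be sketched as follows. The records, the field names, and the function `local_features` are hypothetical and only illustrate the idea; the sketch also assumes population variance, since the text does not say which variance is intended.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical records: (file_name, api_name, thread_number, call_sequence_number).
RECORDS = [
    ("file1", "RegKeyExAapi1", 2332, 1),
    ("file1", "CpFileAapi1", 2332, 2),
    ("file1", "OpenSCAapi1", 2332, 3),
    ("file1", "CrtServiceAapi", 2332, 4),
    ("file1", "RegKeyExAapi1", 2468, 1),
    ("file1", "CpFileAapi1", 2468, 2),
]

FIELD_INDEX = {"file": 0, "api": 1, "thread": 2}


def local_features(records, key_fields):
    """Aggregate call sequence numbers for each combination of key_fields."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[FIELD_INDEX[name]] for name in key_fields)
        groups[key].append(rec[3])  # index 3 holds the call sequence number
    return {
        key: {
            "count": len(seq),
            "min": min(seq),
            "max": max(seq),
            "mean": mean(seq),
            "variance": pvariance(seq),
        }
        for key, seq in groups.items()
    }


# Second-order: combination of two items (file name, thread number).
second_order = local_features(RECORDS, ("file", "thread"))
# Higher-order: combination of three items (file name, API, thread number).
higher_order = local_features(RECORDS, ("file", "api", "thread"))
```

Passing a different tuple of field names gives other combinations without changing the aggregation code.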
API sequence features
The API sequence features are used to characterize the API call order of the file.
As shown in Table 3, API names are usually represented as character strings. The embodiment of the present application converts the string-form API names into numbers and then characterizes the API call order using those numbers, thereby obtaining API sequence features in numeric form.
As an example, all APIs called by a file may first be sorted by ASCII code, giving the following API order: CpFileAapi1, CrtServiceAapi, NtCreateThreadEx, OpenSCManager, RegKeyExAapi1, StartServiceA. Then 0 may be defined to represent CpFileAapi1, 1 to represent CrtServiceAapi, 2 to represent NtCreateThreadEx, 3 to represent OpenSCManager, 4 to represent RegKeyExAapi1, and 5 to represent StartServiceA, so that the API sequence feature of file1 can be represented as [4, 0, 3, 1, 4, 0, 3, 1, 5, 2].
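The encoding above can be sketched in a few lines. The call order used here is reconstructed from the encoded sequence given in the text (Table 3 is not reproduced), and the function names are illustrative. Python's default string sort compares code points, which for ASCII-only names is the same as sorting by ASCII code.

```python
def build_api_index(api_names):
    """Map each distinct API name to its rank in ASCII (code point) order."""
    return {name: idx for idx, name in enumerate(sorted(set(api_names)))}


def encode_calls(call_order, api_index):
    """Replace each API name in call order with its numeric index."""
    return [api_index[name] for name in call_order]


# Call order of file1, reconstructed from the sequence in the text.
CALL_ORDER = [
    "RegKeyExAapi1", "CpFileAapi1", "OpenSCManager", "CrtServiceAapi",
    "RegKeyExAapi1", "CpFileAapi1", "OpenSCManager", "CrtServiceAapi",
    "StartServiceA", "NtCreateThreadEx",
]

API_INDEX = build_api_index(CALL_ORDER)
API_SEQUENCE = encode_calls(CALL_ORDER, API_INDEX)

if __name__ == "__main__":
    print(API_INDEX)
    print(API_SEQUENCE)
```

In practice the index would presumably be built once over the APIs seen in training, so that the same API maps to the same number across files; that is an assumption, since the text only describes a single file.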
API probability features
As one example, an N-gram model may be employed to extract API probability features.
For example, a 2-gram model assumes that the occurrence of an API is related to the API immediately before it. In this case, the embodiment of the present application may treat each pair of consecutive APIs as a whole and extract features such as the count, maximum value, and minimum value of the thread numbers or call sequence numbers corresponding to that whole in the executable file.
As another example, a 20-gram model assumes that the occurrence of an API is related to the 19 APIs before it. In this case, the embodiment of the present application may treat each run of 20 consecutive APIs as a whole and extract features such as the count, maximum value, and minimum value of the thread numbers or call sequence numbers corresponding to that whole in the executable file.
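A minimal sketch of the N-gram idea follows. It slides a window of length n over the numeric API sequence, counts each window, and estimates a 2-gram conditional probability from those counts. The function names are illustrative, and the probability estimate is one plausible reading of "API probability features"; the statistics over thread numbers or call sequence numbers mentioned in the text would be aggregated per window in the same way.

```python
from collections import Counter


def ngrams(sequence, n):
    """All windows of n consecutive APIs, in call order."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def ngram_counts(sequence, n):
    """How often each window of n consecutive APIs occurs."""
    return Counter(ngrams(sequence, n))


def bigram_probability(sequence, prev_api, next_api):
    """Estimate P(next_api | prev_api) from 2-gram counts."""
    pairs = ngram_counts(sequence, 2)
    total = sum(count for (first, _), count in pairs.items() if first == prev_api)
    return pairs[(prev_api, next_api)] / total if total else 0.0


# Numeric API sequence of file1 from the example above.
SEQ = [4, 0, 3, 1, 4, 0, 3, 1, 5, 2]

if __name__ == "__main__":
    print(ngram_counts(SEQ, 2).most_common(3))
    print(bigram_probability(SEQ, 1, 4))
```

Note that a sequence shorter than n produces no windows at all, so a 20-gram model yields nothing for this 10-call example.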
The above describes the extraction of running features from the executable file.
Step 104: input the running features into the dynamic detection model, and determine whether the executable file is a virus file according to the output result of the dynamic detection model.
In the embodiment of the present application, the dynamic detection model is a pre-trained fusion model composed of a plurality of detection models. The plurality of detection models may include eXtreme Gradient Boosting (XGBoost, abbreviated as XGB), multilayer perceptron (MLP), Light Gradient Boosting Machine (LightGBM, abbreviated as LGB), and Text-CNN models.
Referring to Fig. 3, a block diagram of the dynamic detection process according to an embodiment of the present application is shown. It includes a dynamic detection model composed of a plurality of detection models: 4 XGB models (XGB1 to XGB4), 2 MLP models (MLP1, MLP2), 4 LGB models (LGB1 to LGB4), and 2 Text-CNN models (TEXT-CNN1, TEXT-CNN2).
XGB1 to XGB4 are XGB models trained with different hyperparameters; LGB1 to LGB4 are LGB models trained with different hyperparameters; MLP1 and MLP2 have hidden layers of different depths; TEXT-CNN1 may be a Text-CNN model that uses conventional convolution kernels of 7 different sizes; TEXT-CNN2 may be a Text-CNN model that uses dilated (atrous) convolution kernels of 16 different sizes, that is, kernels whose receptive field is enlarged by inserting holes into the kernel.
Specifically, when the running features obtained in step 103 are input to the dynamic detection model shown in Fig. 3, the API sequence features may be input to TEXT-CNN1 and TEXT-CNN2, and the running features other than the API sequence features may be input to XGB1 to XGB4, MLP1, MLP2, and LGB1 to LGB4, respectively. The outputs of the models are then fused to obtain the final detection result, from which it is determined whether the executable file is a virus file.
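The routing of features to the sub-models can be sketched as follows. The text does not state how the model outputs are fused, so this sketch assumes a plain average of per-model virus probabilities and a hypothetical decision threshold of 0.5. The sub-models are stand-in callables rather than trained XGB, MLP, LGB, or Text-CNN models, and `fuse_predict` is an illustrative name.

```python
from statistics import mean


def fuse_predict(sequence_models, tabular_models, api_sequence, other_features,
                 threshold=0.5):
    """Route features to the sub-models and fuse their scores.

    sequence_models: callables taking the API sequence features
                     (the role played by TEXT-CNN1 and TEXT-CNN2).
    tabular_models:  callables taking the remaining running features
                     (the role played by the XGB, MLP, and LGB models).
    Each callable is assumed to return a virus probability in [0, 1].
    """
    scores = [model(api_sequence) for model in sequence_models]
    scores += [model(other_features) for model in tabular_models]
    fused = mean(scores)  # assumption: simple averaging as the fusion rule
    return fused, fused >= threshold


if __name__ == "__main__":
    # Stand-in models that return fixed scores, for illustration only.
    text_cnns = [lambda seq: 1.0, lambda seq: 0.75]
    others = [lambda feats: 0.5, lambda feats: 0.75]
    print(fuse_predict(text_cnns, others, [4, 0, 3, 1], {"count": 2}))
```

Weighted averaging or a stacked meta-model would fit the same interface; only the line that computes `fused` would change.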
At this point, the virus detection process is completed.
As the virus detection process above shows, the embodiment of the present application first performs static detection on the executable file, and performs dynamic detection only when static detection cannot accurately determine the file type (normal file or virus file) of the executable file, which improves detection accuracy. Conversely, if static detection can accurately determine the file type of the executable file, dynamic detection is not needed, which improves detection efficiency. The embodiment of the present application can therefore balance detection efficiency and detection accuracy.
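The two-stage control flow can be sketched as below. The form of the dynamic detection condition threshold is not restated in this section, so the sketch assumes the static result value is a probability in [0, 1] and that dynamic detection is triggered when the value falls inside an uncertainty band; the bounds 0.3 and 0.7 and the function name `detect` are hypothetical.

```python
def detect(static_score, run_dynamic, lower=0.3, upper=0.7):
    """Return True if the file is judged to be a virus file.

    static_score: result value of the static detection model (assumed in [0, 1]).
    run_dynamic:  zero-argument callable that performs dynamic detection and
                  returns True for a virus file; it is only called when the
                  static result is inconclusive.
    lower, upper: hypothetical bounds of the inconclusive band.
    """
    if lower <= static_score <= upper:
        # Static detection is inconclusive: pay the cost of simulated running.
        return run_dynamic()
    # Static detection is confident: decide directly from the result value.
    return static_score > upper


if __name__ == "__main__":
    print(detect(0.95, lambda: True))   # decided statically
    print(detect(0.50, lambda: False))  # decided dynamically
```

Passing dynamic detection as a callable keeps the expensive simulated run lazy, which is where the efficiency gain described above comes from.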
The method provided by the embodiment of the present application has been described above; the virus detection apparatus provided by the embodiment of the present application is described below.
Referring to Fig. 4, a schematic structural diagram of a virus detection apparatus provided in the embodiment of the present application is shown. The apparatus includes an extraction unit 401 and an input unit 402, wherein:
an extracting unit 401, configured to extract a static feature of an executable file to be detected;
an input unit 402, configured to input the static feature into a static detection model, so as to obtain a result value representing a static detection result;
the extracting unit 401 is further configured to extract an operation feature of the executable file if the result value hits a preset dynamic detection condition threshold;
the input unit 402 is further configured to input the operation characteristic into a dynamic detection model, and determine whether the executable file is a virus file according to an output result of the dynamic detection model.
As an embodiment, the apparatus further comprises:
a determining unit, configured to determine whether the executable file is a virus file according to the result value if the result value does not hit the dynamic detection condition threshold.
As one embodiment, the static features include byte features, import features, text features, and attribute features, wherein the byte features include a first byte feature determined based on a number of occurrences of a byte value and a second byte feature determined based on a byte entropy.
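The two byte features can be illustrated with a small sketch. It assumes the first byte feature is a 256-bin histogram of byte values and the second is the Shannon entropy of the whole file; the application may compute entropy differently (for example over sliding windows), so this is a simplification. The function names are illustrative.

```python
import math
from collections import Counter


def byte_histogram(data: bytes):
    """Number of occurrences of each byte value 0..255."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]


def byte_entropy(data: bytes):
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())


if __name__ == "__main__":
    sample = b"MZ\x90\x00\x03\x00\x00\x00"
    print(byte_histogram(sample)[0], byte_entropy(sample))
```

The entropy ranges from 0 (a single repeated byte) to 8 (all byte values equally likely); packed or encrypted content tends toward the high end.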
As an embodiment, the extracting unit 401 extracts a text feature of the executable file, including:
for each readable character in the ASCII code table, counting the number of times that the readable character occurs in the executable file;
and performing a hash operation of a preset dimension on the data combination consisting of the occurrence counts of the readable characters, to obtain text features of the preset dimension, wherein the preset dimension is greater than the number of readable characters in the ASCII code table.
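The two steps above can be sketched with feature hashing. The text does not specify the hash operation, so the sketch assumes that each readable character is hashed to one of `dim` buckets and its count is added there. It also assumes that "readable characters" means the 95 printable ASCII characters 0x20 to 0x7E and uses a hypothetical preset dimension of 128. `text_feature` is an illustrative name.

```python
import hashlib

# Assumption: readable characters are the printable ASCII range 0x20..0x7E.
READABLE = [chr(code) for code in range(0x20, 0x7F)]


def text_feature(data: bytes, dim: int = 128):
    """Count readable ASCII characters and hash the counts into dim buckets."""
    if dim <= len(READABLE):
        raise ValueError("dim must exceed the number of readable characters")

    # Step 1: occurrence count of every readable character in the file.
    counts = {ch: 0 for ch in READABLE}
    for byte in data:
        ch = chr(byte)
        if ch in counts:
            counts[ch] += 1

    # Step 2: hash each character to a bucket and accumulate its count.
    vector = [0] * dim
    for ch, n in counts.items():
        digest = hashlib.md5(ch.encode("ascii")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vector[bucket] += n
    return vector


if __name__ == "__main__":
    print(sum(text_feature(b"MZ\x00\x90Hello")))
```

Choosing `dim` larger than the number of readable characters reduces, but does not eliminate, the chance that two characters share a bucket.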
As an embodiment, the extracting unit 401 extracts the dynamic feature of the executable file, including:
acquiring running information of the executable file during simulated running, wherein the running information comprises the name of each called API, the number of the thread calling the API, and the sequence number of the API call within the thread;
and extracting running features from the running information according to a preset feature extraction rule, wherein the running features comprise global features, local features, API sequence features, and API probability features of the API.
As an embodiment, the dynamic detection model is a pre-trained fusion model composed of a plurality of detection models, the plurality of detection models includes at least one Text-CNN model, and the inputting unit 402 inputs the operation features into the dynamic detection model, including:
inputting the API sequence features into the at least one Text-CNN model;
inputting the features of the operating features except the API sequence features into the detection models of the dynamic detection model except the at least one Text-CNN model.
Thus, the description of the apparatus is completed. In the embodiment of the application, static detection is performed on the executable file, and when the static detection cannot accurately determine the file type (normal file or virus file) of the executable file, dynamic detection is performed on the executable file, so that the detection accuracy is improved. On the contrary, if the file type of the executable file can be accurately determined by static detection, dynamic detection does not need to be performed on the executable file, so that the detection efficiency is improved. It can be seen that the detection efficiency and the detection accuracy can be effectively considered.
The above description is only a preferred embodiment of the present application, and should not be taken as limiting the present application, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present application shall be included in the scope of the present application.

Claims (10)

1. A method for detecting a virus, the method comprising:
extracting static characteristics of an executable file to be detected;
inputting the static characteristics into a static detection model to obtain a result value for representing a static detection result;
if the result value hits a preset dynamic detection condition threshold value, extracting the running characteristics of the executable file;
inputting the operating characteristics into a dynamic detection model, and determining whether the executable file is a virus file according to an output result of the dynamic detection model.
2. The method of claim 1, wherein the method further comprises:
and if the result value does not hit the dynamic detection condition threshold, determining whether the executable file is a virus file according to the result value.
3. The method of claim 1, wherein the static features comprise byte features, import features, text features, and attribute features, wherein the byte features comprise a first byte feature determined based on a number of occurrences of a byte value and a second byte feature determined based on a byte entropy.
4. The method of claim 3, wherein extracting textual features of the executable file comprises:
counting the occurrence times of each readable character in the American Standard Code for Information Interchange (ASCII) code table in the executable file;
and performing hash operation of a preset dimension on a data combination consisting of the corresponding occurrence times of each readable character to obtain the text characteristics of the preset dimension, wherein the preset dimension is greater than the number of the readable characters in the ASCII code table.
5. The method of claim 1, wherein said extracting dynamic features of said executable file comprises:
acquiring running information of the executable file during simulated running, wherein the running information comprises the name of each called Application Program Interface (API), the number of the thread calling the API, and the sequence number of the API call within the thread;
and extracting operation features from the operation information according to a preset feature extraction rule, wherein the operation features comprise global features, local features, API sequence features and API probability features of the API.
6. The method of claim 5, wherein the dynamic detection model is a pre-trained fusion model consisting of a plurality of detection models, the plurality of detection models including at least one Text convolutional neural network Text-CNN model, the inputting the operational features into the dynamic detection model comprising:
inputting the API sequence features into the at least one Text-CNN model;
inputting the features of the operating features except the API sequence features into the detection models of the dynamic detection model except the at least one Text-CNN model.
7. A virus detection apparatus, the apparatus comprising:
the extraction unit is used for extracting the static characteristics of the executable file to be detected;
the input unit is used for inputting the static characteristics into a static detection model to obtain a result value for representing a static detection result;
the extraction unit is further configured to extract an operation feature of the executable file if the result value hits a preset dynamic detection condition threshold;
the input unit is further configured to input the operation characteristics into a dynamic detection model, and determine whether the executable file is a virus file according to an output result of the dynamic detection model.
8. The apparatus of claim 7, wherein the static features comprise byte features, import features, text features, and attribute features, wherein the byte features comprise a first byte feature determined based on a number of occurrences of a byte value and a second byte feature determined based on a byte entropy.
9. The apparatus of claim 8, wherein the extraction unit extracts a text feature of the executable file, comprising:
counting the occurrence times of each readable character in the American Standard Code for Information Interchange (ASCII) code table in the executable file;
and performing hash operation of a preset dimension on a data combination consisting of the corresponding occurrence times of each readable character to obtain the text characteristics of the preset dimension, wherein the preset dimension is greater than the number of the readable characters in the ASCII code table.
10. The apparatus of claim 7, wherein the extraction unit extracting dynamic features of the executable file comprises:
acquiring running information of the executable file during simulated running, wherein the running information comprises the name of each called Application Program Interface (API), the number of the thread calling the API, and the sequence number of the API call within the thread;
and extracting operation features from the operation information according to a preset feature extraction rule, wherein the operation features comprise global features, local features, API sequence features and API probability features of the API.
CN202110857502.6A 2021-07-28 2021-07-28 Virus detection method and device Pending CN113569241A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110857502.6A CN113569241A (en) 2021-07-28 2021-07-28 Virus detection method and device


Publications (1)

Publication Number Publication Date
CN113569241A true CN113569241A (en) 2021-10-29

Family

ID=78168477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110857502.6A Pending CN113569241A (en) 2021-07-28 2021-07-28 Virus detection method and device

Country Status (1)

Country Link
CN (1) CN113569241A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105897807A (en) * 2015-01-14 2016-08-24 江苏博智软件科技有限公司 Mobile intelligent terminal abnormal code cloud detection method based on behavioral characteristics
CN108090348A (en) * 2017-12-14 2018-05-29 四川长虹电器股份有限公司 Android malware detection method based on sandbox
CN109359439A (en) * 2018-10-26 2019-02-19 北京天融信网络安全技术有限公司 Software detecting method, device, equipment and storage medium
CN110647746A (en) * 2019-08-22 2020-01-03 成都网思科平科技有限公司 Malicious software detection method, system and storage medium
CN111382783A (en) * 2020-02-28 2020-07-07 广州大学 Malicious software identification method and device and storage medium
CN112528284A (en) * 2020-12-18 2021-03-19 北京明略软件系统有限公司 Malicious program detection method and device, storage medium and electronic equipment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination