CN113378156B

CN113378156B - API-based malicious file detection method and system

Info

Publication number: CN113378156B
Application number: CN202110749396.XA
Authority: CN
Inventors: 梁淑云; 殷钱安; 余贤喆; 王启凡; 陶景龙; 徐�明; 刘胜; 马影; 周晓勇; 魏国富; 夏玉明
Original assignee: Information and Data Security Solutions Co Ltd
Current assignee: Information and Data Security Solutions Co Ltd
Priority date: 2021-07-01
Filing date: 2021-07-01
Publication date: 2023-07-11
Anticipated expiration: 2041-07-01
Also published as: CN113378156A

Abstract

The invention provides an API-based malicious file detection method and system. The method comprises the following steps: running the file in a sandbox while recording the name of each API the file calls at run time, the thread ID (tid), and the sequence number (index) of the API call within the thread; data preprocessing, including processing the API names, optimizing low-frequency APIs, generating new fields, and mapping the label codes; constructing feature engineering based on the processed data, wherein the feature engineering comprises global features and local combined features, and the two feature sets are finally spliced into one feature set; according to the initial training result of the model, relabeling some of the files that antivirus software cannot judge as "normal" records, and then training the model again; and model prediction. The invention also provides an API-based malicious file detection system. The method achieves a certain identification rate on various malicious files that bypass feature codes and sandboxes, and can improve the generalization capability of malicious file detection.

Description

API-based malicious file detection method and system

Technical Field

The invention relates to the technical field of information security services, in particular to a malicious file detection method and system based on an API (application program interface).

Background

In recent years, with the development of computer technology, intelligent terminals and network technology have come into wide use, and malicious files have spread and mutated accordingly. Legitimate files are used to strengthen and extend the capabilities of a computer, thereby facilitating people's work and life; malicious files are used to steal or destroy computer data and the like, and can bring economic loss and mental distress to enterprises and individuals. Detecting malicious files in time, blocking the threats they pose, and maintaining a healthy and secure network environment are therefore increasingly important.

Current malicious file detection methods mainly comprise the feature code method, sandbox detection technology and the like. The feature code method is fast, but it cannot detect malicious files containing unknown feature codes, and malicious files can escape feature code detection through deformation, encryption, shell adding (packing) and similar means. In recent years, sandbox technology has become increasingly widely used: it simulates a normal environment for an unknown file to run in, records the behavior of the file at run time, and matches that behavior against a malicious file library to judge whether the unknown file is malicious. With the spread of machine learning applications, methods that construct a machine learning model to detect malicious files have also appeared. For example, patent document 202010572487.6 discloses a method for constructing a detection model of malicious files and detecting malicious files: a plurality of normal samples and a plurality of malicious samples are obtained and respectively labeled; unshelled malicious samples among the malicious samples are filtered out; a static model is established, including: obtaining the PE formats of the plurality of normal samples and the plurality of malicious samples; converting the data into a plurality of feature vectors according to the PE format of each acquired sample; combining the plurality of feature vectors and associating them with a tag; adjusting the random forest model and the LightGBM model to optimal parameters; inputting the feature vectors associated with the tag into a random forest model and a LightGBM model, and respectively establishing the random forest model and the LightGBM model for statically detecting malicious files; a dynamic model is established, including: placing the plurality of normal samples and the plurality of malicious samples into a sandbox to obtain a sandbox report, and obtaining feature vectors of each sample with respect to API, tid, return_value and index in the sandbox report; combining the plurality of feature vectors and associating them with a tag; adjusting the random forest model and the LightGBM model to optimal parameters, and establishing an important-feature random forest model; inputting the feature vectors associated with the tags into a random forest model, an important-feature random forest model and a LightGBM model, and respectively establishing the random forest model, the important-feature random forest model and the LightGBM model for dynamically detecting malicious files; all the static models and all the dynamic models are fused to obtain a fused model; and the total malicious suspicious score obtained from the fused model and the malicious suspicious score obtained from the malheur model are combined to obtain a final malicious score, and a sample is detected according to the final malicious score.

Although sandbox detection technology to some extent overcomes the feature code method's inability to detect unknown malicious files, attackers are always looking for ways to bypass sandbox detection, such as detecting system features or delaying execution, so that the sandbox cannot judge whether some unknown files are malicious.

Existing methods based on machine learning modeling solve, to a certain extent, the problem of sandbox detection being bypassed, but their feature engineering is biased toward statistical features, ignores the differences between files in their API call sequences, and ignores feature sparsity during feature processing, which may reduce the accuracy and efficiency of the model.

Disclosure of Invention

The technical problem to be solved by the invention is how to judge whether an unknown file belongs to a malicious file.

The invention solves the technical problems by the following technical means: a malicious file detection method based on an API comprises the following steps:

s101, classifying the acquired files to confirm the types of the files, putting the files into a sandbox for running, and recording parameters called when each file runs; the file category comprises known files and unknown files; the parameters include: file ID, API name, thread ID, and sequence number of API calls;

s102, preprocessing based on parameters called by the running of the known file to serve as model training data;

s103, constructing a characteristic engineering set based on the preprocessed data, wherein the characteristic engineering set comprises: global features and local combined features;

s104, constructing a model, and correcting the model based on the characteristic engineering set and a preset threshold;

s105, detecting the acquired unknown file based on the corrected model to confirm whether the unknown file is a malicious file or not.

The method focuses on processing the API names: keywords in the API names are extracted and used to construct features, which reduces feature dimension and feature sparsity and improves the efficiency and accuracy of the model. Moreover, through model detection, the probability that an unknown file belongs to each category can be output, so that the probability that the unknown file is malicious is quantified as a score. In addition, when no files labeled "normal" are available, the model is used to generate a pseudo-labeled "normal" data set, and a multi-classification model with a certain ability to identify normal files is trained, so as to predict whether an unknown file is malicious, which category of malicious file it belongs to, and the like.

As an optimized technical solution, in step S102, the step of preprocessing based on parameters called during the running of the known file includes:

dividing the API name to obtain a first word and a second word, and filling a first column and a second column corresponding to the API name based on the first word;

merging the API names based on the content of the first column to reduce feature dimensions;

performing optimization processing of the API based on the number of the files corresponding to the first column;

generating a new field based on the thread ID and the sequence number of the API call in the thread;

and converting the file category into a numerical value and finishing label coding mapping.

As an optimized technical solution, the step of dividing the API name to obtain a first word and a second word, and filling a first column and a second column corresponding to the API name based on the first word includes:

dividing the API name by regular-expression matching according to the upper camel case (large hump) naming rule of the API, to obtain the first word and the second word in the API name;

the first word is filled in a first column corresponding to the API name and the second word is filled in a second column corresponding to the API name.

As an optimized technical solution, the step of generating a new field based on the thread ID and the sequence number of the API call in the thread includes:

generating a first field based on the thread ID and the difference between the sequence number of the API call and the thread ID;

taking the file name and the thread ID as grouping objects, calculating a first difference value between the sequence numbers of two consecutive API calls;

the contents of two adjacent first words corresponding to the same thread ID are spliced to generate a second field.
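The three derived fields can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the record layout (dicts with `file_id`, `tid`, `index`, `first_word`), the field names `inx_tid`, `index_diff` and `API_2N` (borrowed from feature names that appear later in the description), the direction of the subtraction (index minus tid), the "_" separator, and the use of `None` for the first call of a thread are all choices made for the example.

```python
from collections import defaultdict


def add_call_fields(records):
    """Add inx_tid, index_diff and API_2N to each API-call record.

    records: list of dicts with keys file_id, tid, index, first_word
    (a hypothetical layout chosen for this sketch).
    """
    # Group calls per (file, thread); there is no ordering across threads.
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["file_id"], rec["tid"])].append(rec)

    out = []
    for key in sorted(groups):
        # Within one thread, a smaller index means an earlier call.
        calls = sorted(groups[key], key=lambda r: r["index"])
        prev = None
        for rec in calls:
            row = dict(rec)
            # First field: sequence number minus thread ID (assumed direction).
            row["inx_tid"] = rec["index"] - rec["tid"]
            # First difference: gap between consecutive sequence numbers.
            row["index_diff"] = None if prev is None else rec["index"] - prev["index"]
            # Second field: 2-gram of two adjacent first words.
            row["API_2N"] = (
                None if prev is None
                else prev["first_word"] + "_" + rec["first_word"]
            )
            out.append(row)
            prev = rec
    return out
```

Because index values within a thread may not be consecutive, `index_diff` can be larger than 1, which is what makes it usable as a timing-like feature.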

The step S103 of constructing a feature engineering set based on the preprocessed data includes:

taking the file ID as a grouping object to count the global features;

taking the file ID and a preset field as grouping objects to count the local combination characteristics;

and splicing the global features and the local combined features into the feature engineering set by taking the file ID as a primary key.
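The primary-key splice can be pictured as a join of two per-file feature tables. The sketch below assumes each table is a dict of the form `{file_id: {feature_name: value}}` and keeps every file ID that appears in either table; both the layout and the outer-join behavior are assumptions made for the example.

```python
def splice_features(global_feats, local_feats):
    """Join two {file_id: {feature: value}} tables on file_id."""
    feature_set = {}
    for file_id in sorted(set(global_feats) | set(local_feats)):
        row = {}
        # Columns from both tables end up in one row per file ID.
        row.update(global_feats.get(file_id, {}))
        row.update(local_feats.get(file_id, {}))
        feature_set[file_id] = row
    return feature_set
```

A file that has no value for some local feature simply lacks that column here; a tree-based model such as LightGBM can handle missing values, although how missing columns are treated is not stated in the text.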

As an optimized technical solution, the step of taking the file ID as the grouping object to count the global feature includes:

taking the file ID as a grouping object, counting the times of occurrence of the first word, the times of occurrence after duplicate removal and the times of occurrence of the second word after duplicate removal;

taking the file ID as a grouping object, and counting the maximum value, the minimum value, the mean value, the median, the standard deviation, the number of distinct values after deduplication, the dispersion, the coefficient of variation, and the deviation between the median and the mean value of the thread ID;

taking the file ID as a grouping object, and counting the maximum value, the minimum value, the mean value, the median, the standard deviation, the number of distinct values after deduplication, the dispersion, the coefficient of variation, and the deviation between the median and the mean value of the sequence numbers of the API calls;

taking the file ID as a grouping object, and counting the maximum value, the minimum value, the mean value, the median, the standard deviation, the number of distinct values after deduplication, the dispersion, the coefficient of variation, and the deviation between the median and the mean value of the first field;

and taking the file ID as a grouping object, and counting the times of occurrence of the second field of the API and the times of occurrence after duplicate removal.

As an optimized technical scheme, the step of counting the local combination features by taking the file ID and the preset field as grouping objects comprises the following steps:

taking the file ID and the first word as grouping objects, counting the number of occurrences of each second word and the number of distinct second words after deduplication;

taking the file ID and the first word as grouping objects, and counting the maximum value, the minimum value, the median, the standard deviation and the number of distinct values after deduplication of each first difference value;

and counting the occurrence times of each second word by taking the file ID and the second field as grouping objects.
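One of the local combined features, the second-word counts per (file ID, first word) group, can be sketched as below. The row layout and the generated column names such as `Create_API2_count` are hypothetical; the text does not give names for the local features. Each group's statistics are pivoted into columns so that the result has one row per file ID.

```python
from collections import defaultdict


def local_second_word_features(rows):
    """Count second words per (file_id, first_word) and pivot to columns.

    rows: list of dicts with keys file_id, first_word, second_word
    (second_word may be None for API names with a single word).
    """
    seen = defaultdict(list)
    for row in rows:
        if row["second_word"] is not None:
            seen[(row["file_id"], row["first_word"])].append(row["second_word"])

    features = defaultdict(dict)
    for (file_id, first_word), words in seen.items():
        # Total occurrences of second words following this first word.
        features[file_id][first_word + "_API2_count"] = len(words)
        # Distinct second words after deduplication.
        features[file_id][first_word + "_API2_nunique"] = len(set(words))
    return dict(features)
```

Grouping by the first word rather than the full API name keeps the number of generated columns small, which is consistent with the stated goal of reducing feature dimension and sparsity.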

As an optimized technical solution, in step S104, the step of correcting the model based on the feature engineering set and the preset threshold includes:

iterative learning is carried out by taking the characteristic engineering set and the file category corresponding to the file as the input of a model, and the probability of the file category corresponding to each file ID is output;

correcting the file category of which the maximum probability value is smaller than a preset threshold value and the original file category is 'unknown file' to be a pseudo tag 'normal' so as to form a new data set;

and taking the new data set as the input of the model to perform iterative learning for preset times so as to complete the correction of the model.

As an optimized technical solution, in step S101, the step of classifying the collected file includes: and scanning the acquired files through antivirus software, and confirming the types of the files according to the scanning results.

The invention also provides an API-based malicious file detection system, which comprises the following modules:

the parameter confirmation module is used for classifying the acquired files to confirm the types of the files, putting the files into a sandbox to run, and recording parameters called when each file runs; the file category comprises known files and unknown files; the parameters include: file ID, API name, thread ID, and sequence number of API calls;

the preprocessing module is used for preprocessing based on parameters called by the known file in the running process so as to be used as model training data;

the feature construction module is used for constructing a feature engineering set based on the preprocessed data, and the feature engineering set comprises: global features and local combined features;

the correction module is used for constructing a model and correcting the model based on the characteristic engineering set and a preset threshold value;

the detection module is used for detecting the acquired unknown file based on the corrected model so as to confirm whether the unknown file is a malicious file or not.

The invention has the following advantages: it provides an API-based malicious file detection method that constructs feature engineering from the APIs and thread IDs (tid) called when a file runs and trains a classification model, so as to judge whether an unknown file is malicious.

Meanwhile, the method focuses on processing the API names: keywords in the API names are extracted and used to construct features, which reduces feature dimension and feature sparsity and improves the efficiency and accuracy of the model. Furthermore, through model prediction, the probability that an unknown file belongs to each category can be output, so that the probability that the unknown file is malicious is quantified as a score. In addition, when no files labeled "normal" are available, the model is used to generate a pseudo-labeled "normal" data set, and a multi-classification model with a certain ability to identify normal files is trained, so as to predict whether an unknown file is malicious, which category of malicious file it belongs to, and the like.

Drawings

Fig. 1 is a general flow chart of a malicious file detection method based on an API in embodiment 1 of the present invention.

Fig. 2 is a diagram illustrating a malicious file detection system module based on an API according to embodiment 2 of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

As described in the background, existing malicious file detection methods have problems to varying degrees. Starting from actual service needs, the invention addresses these problems: it constructs 2-gram combinations of APIs, splits and merges API names, merges low-frequency APIs, and so on, which avoids feature sparsity, reduces feature dimension, and at the same time extends the features of the API call sequence, thereby improving the accuracy and efficiency of the model. In addition, real environments contain many unlabeled samples but few labeled ones. The invention also solves the problem that a model able to predict a category of normal files cannot be built when there are many unlabeled files but no files labeled normal: the model itself is used to generate normal files by pseudo-labeling.

Example 1

Referring to fig. 1, the invention provides a malicious file detection method based on an API, which specifically includes the following steps:

s101, classifying the collected files to confirm the types of the files, putting the files into a sandbox for running, and recording parameters called when each file runs;

the file category comprises known files and unknown files;

the parameters include: API (application program interface) name, thread ID (tid), and sequence number (index) of API calls in the thread. While a file runs, it generally calls a plurality of APIs across a plurality of tids. There is no ordering relation between different tids; within the same tid, index values increase in call order but may not be consecutive.

Wherein the step of classifying the collected files comprises: and scanning the acquired files through antivirus software, and confirming the types of the files according to the scanning results.

S102, preprocessing based on the parameters called when the known files run, which comprises the following steps:

s1021, dividing the API name to obtain a first word and a second word, and filling a first column and a second column corresponding to the API name based on the first word:

s10211, dividing the API name by regular-expression matching according to the upper camel case (large hump) naming rule of the API, to obtain the first word and the second word in the API name; for example, after "CreateFileW" is divided, the first word is "Create" and the second word is "File";

s10212 populates the first word in a first column corresponding to the API name and populates the second word in a second column corresponding to the API name.

S1022, merging the API names based on the content of the first word to reduce feature dimension;

For an API name that contains no capital letters, the whole API name is filled in as the first word and the second word is left null, so that both the first-word and second-word columns are added. In this way, APIs with the same or similar functionality, such as "CreateFileW" and "CreateFileA", can be merged, thereby reducing feature dimension.

S1023, optimizing the low-frequency API, namely optimizing the API based on the number of the files corresponding to the first column;

s1024, generating a new field based on the thread ID and the sequence number of the API call in the thread, comprising the following steps:

generating a first field based on the thread ID and the difference between the sequence number (index) of the API call and the thread ID (tid);

taking the file ID (file_id) and the thread ID respectively as grouping objects, and calculating a first difference value between the sequence numbers of two consecutive API calls;

the contents of two adjacent first words corresponding to the same file ID are spliced to generate a second field.

S1025, converting the file category into a numerical value, and finishing label coding mapping.

For example, "Trojan" maps to a value of 0, "worm virus" maps to a value of 1, "malicious web page file" maps to a value of 2, and "unknown" maps to a value of 3, etc.

The label (file category) includes, but is not limited to, the following categories: trojan horse, worm virus, macro virus document, downloader, virus program, malicious webpage file, suspicious program, backdoor program, game/play file, unknown, etc., wherein "unknown" means that antivirus software cannot determine whether it is a malicious file, but does not represent it as a normal file.

The upper camel case (large hump) naming convention refers to variable or function names formed by concatenating one or more words with the initial letter of each word capitalized, such as "CreateFileW".
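The split of step S1021 and the fallback for names without capital letters can be sketched with Python's `re` module. The pattern `[A-Z][a-z]+` is an assumption chosen for this example (the text only says that regular matching on the camel-case rule is used), and `split_api_name` is a hypothetical helper name.

```python
import re

# One capital letter followed by lowercase letters counts as a word.
# A trailing single capital such as the "W" in "CreateFileW" is not matched.
_WORD = re.compile(r"[A-Z][a-z]+")


def split_api_name(name):
    """Return (first_word, second_word) for an API name."""
    words = _WORD.findall(name)
    if not words:
        # No camel-case word found: keep the whole name, leave the second null.
        return name, None
    first = words[0]
    second = words[1] if len(words) > 1 else None
    return first, second
```

With this pattern, "CreateFileW" and "CreateFileA" both map to ("Create", "File"), which is the merging effect described in step S1022, while a lowercase name such as "recv" maps to ("recv", None).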

S103, feature engineering, which is to construct feature engineering based on the data processed in the step S102, wherein the feature engineering mainly comprises two parts, namely global features and local combined features, and specifically comprises the following steps:

s1031, taking a file ID (file_id) as a grouping object to count the global feature, wherein the global feature mainly comprises the following parts:

taking the file ID as a grouping object, counting the number of times (fileid_API1_count) of occurrence of a first word (first word) and the number of times (fileid_API1_nunique) of occurrence after repeated elimination; taking the file ID as a grouping object, counting the number of times (fileid_APP2_nunique) of occurrence of a second word (second word) after duplicate removal;

taking the file ID as a grouping object, counting the maximum value (fileid_tid_max), the minimum value (fileid_tid_min), the mean value (fileid_tid_mean), the median (fileid_tid_mean), the standard deviation (fileid_tid_std), the number of distinct values after deduplication (fileid_tid_unique), the dispersion (fileid_tid_dis), the coefficient of variation (fileid_tid_cv) and the deviation of the median from the mean value (fileid_tid_sk) of the thread ID (tid);

taking the file ID as a grouping object, counting the maximum value (fileid_index_max), the minimum value (fileid_index_min), the mean value (fileid_index_mean), the median (fileid_index_mean), the standard deviation (fileid_index_std), the number of distinct values after deduplication (fileid_index_unique), the dispersion (fileid_index_dis), the coefficient of variation (fileid_index_cv) and the deviation of the median from the mean value (fileid_index_sk) of the sequence numbers (index) of API calls;

taking the file ID as a grouping object, counting the maximum value (fileid_inx_tid_max), the minimum value (fileid_inx_tid_min), the mean value (fileid_inx_tid_mean), the median (fileid_inx_tid_mean), the standard deviation (fileid_inx_tid_std), the number of distinct values after deduplication (fileid_inx_tid_unique), the dispersion (fileid_inx_tid_dis), the coefficient of variation (fileid_inx_tid_cv) and the deviation of the median from the mean value (fileid_inx_tid_sk) of the first field;

taking the file ID as a grouping object, counting the number of occurrences of the second field (API_2N) of the API (fileid_API_2N_count) and the number of distinct values after deduplication (fileid_API_2N_nunique).

S1032, taking the combination of a file ID (file_id) and a preset field as grouping objects to count the local combined features, and then, with the file ID as the primary key, pivoting (transposing) the values of the preset field into column names to generate the feature set, which mainly comprises the following parts:

taking the file ID and the first word (first word) as grouping objects, and counting the number of occurrences of each second word (second word) and the number of distinct second words after deduplication;

taking the file ID and the first word as grouping objects, and counting the maximum value, the minimum value, the median, the standard deviation and the number of distinct values after deduplication of each first difference value (index_diff);

taking the file ID and the second field (API_2N) as grouping objects, and counting the number of occurrences of each second word (second word).

S1033, taking the file ID as a main key, and finally splicing the two feature sets of the global feature and the local combined feature into a feature set.

The dispersion (fileid_tid_dis) is the number of distinct tid values after deduplication (fileid_tid_unique) divided by the total number of times (fileid_API1_count);

the coefficient of variation (fileid_tid_cv) is the standard deviation of tid (fileid_tid_std) divided by the mean of tid (fileid_tid_mean);

the deviation of the median from the mean (fileid_tid_sk) is the median of tid divided by the mean of tid (fileid_tid_mean);
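The per-file tid statistics and the three derived ratios can be sketched with the standard `statistics` module. Several points are assumptions made for this example: the input is the non-empty list of tid values of one file, one entry per recorded API call, so its length stands in for the total count; the population standard deviation is used because the text does not say which variant is meant; and the key `fileid_tid_median` is a hypothetical name introduced so that the median and the mean do not share a key.

```python
import statistics


def tid_global_features(tids):
    """Global tid features for one file; tids has one entry per API call."""
    total = len(tids)
    mean = statistics.fmean(tids)
    median = statistics.median(tids)
    std = statistics.pstdev(tids)  # population std, an assumption
    nunique = len(set(tids))
    return {
        "fileid_tid_max": max(tids),
        "fileid_tid_min": min(tids),
        "fileid_tid_mean": mean,
        "fileid_tid_median": median,  # hypothetical key name
        "fileid_tid_std": std,
        "fileid_tid_unique": nunique,
        # Dispersion: distinct tids divided by the total number of calls.
        "fileid_tid_dis": nunique / total,
        # Coefficient of variation: std divided by mean.
        "fileid_tid_cv": std / mean if mean else None,
        # Deviation of the median from the mean: median divided by mean.
        "fileid_tid_sk": median / mean if mean else None,
    }
```

The same function shape applies to the index and first-field statistics by passing the corresponding column instead of the tid values.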

s104, performing model construction, and correcting the model based on the characteristic engineering set and a preset threshold value:

Because the existing data set contains no records labeled "normal", some of the files that antivirus software cannot judge must be relabeled as "normal" records according to the initial training result of the model, and the model is then trained again so that it has a certain ability to identify "normal" files. The specific implementation process is as follows:

taking the feature set extracted in the step S103 and the file category corresponding to the file ID as the input of a LightGBM multi-classification model, and outputting the probability of the file category corresponding to each file ID through repeated iterative learning of the model;

for each file ID whose maximum predicted probability is smaller than a preset threshold (35%) and whose original file category is "unknown", correcting the file category to the pseudo label "normal"; removing the records whose file category is "unknown"; adding the records pseudo-labeled "normal", with the file category "normal" mapped to the value 3; and forming a new data set to be used as the input of the LightGBM multi-classification model;

and taking the new data set as the input of the model to carry out iterative learning for preset times, and storing the model after repeated iterative training to finish the correction of the model.
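The relabeling rule between the two training rounds can be sketched as follows. The example assumes that a first-round model has already produced, for each file ID, a list of class probabilities; the training calls themselves are omitted, and the function name and the dict-based layout are hypothetical. Labels are kept as strings here for readability; in the described method "normal" is then encoded as the value 3.

```python
def build_pseudo_label_dataset(labels, probabilities, threshold=0.35):
    """Return the relabeled {file_id: label} mapping for the second round.

    labels: {file_id: original category string}
    probabilities: {file_id: list of class probabilities from round one}
    """
    new_labels = {}
    for file_id, label in labels.items():
        if label != "unknown":
            # Files with a known malicious category keep their label.
            new_labels[file_id] = label
        elif max(probabilities[file_id]) < threshold:
            # The model is not confident in any category: pseudo-label as normal.
            new_labels[file_id] = "normal"
        # Remaining "unknown" records are removed from the new data set.
    return new_labels
```

The intuition is that a file which antivirus software could not classify, and which the model also fails to assign to any category with confidence, is a plausible candidate for a normal file. That reading is an interpretation of the rule above, not a guarantee about any individual file.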

The LightGBM multi-classification model is a distributed gradient boosting model based on decision trees, whose core ideas mainly include the histogram strategy, the leaf-wise growth strategy, the GOSS (gradient-based one-side sampling) strategy and the like. The histogram idea is to discretize continuous feature values into bins: determine how many bins each feature needs, divide the values equally, replace each sample's value with the value of the bin it falls in, and finally represent the result as a histogram. This addresses the high cost and long time other gradient boosting algorithms spend searching for the optimal split point of each feature. LightGBM adopts a leaf-wise growth strategy: each time, the leaf with the largest split gain among all current leaves is found and split, and the process repeats. For the same number of splits, leaf-wise growth reduces more error than level-wise growth and achieves better precision. The GOSS sampling strategy reduces the data volume while keeping precision relatively balanced: it distinguishes instances by gradient, keeping the instances with larger gradients and randomly sampling those with smaller gradients, which reduces the amount of computation and improves computational efficiency.
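The GOSS idea can be illustrated with a small standalone sketch. This is not LightGBM's internal code: the function name, the parameter names `top_rate` and `other_rate`, and the fixed seed are choices made for the example, and the reweighting factor (1 - top_rate) / other_rate follows the commonly published description of GOSS rather than anything stated in the text above.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {instance_index: weight} for one GOSS sampling round."""
    n = len(gradients)
    # Rank instances by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(n * top_rate)
    n_other = int(n * other_rate)

    top = order[:n_top]    # large-gradient instances are always kept
    rest = order[n_top:]   # small-gradient instances are sampled at random
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))

    # Up-weight the sampled small-gradient instances so that their total
    # contribution roughly matches that of the full small-gradient set.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights
```

With the default rates, only 30% of the instances take part in the split search for a given tree, which is where the reduction in computation comes from.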

S105, model prediction: detecting the acquired unknown file based on the corrected model to confirm whether the unknown file is a malicious file.

Example 2

Referring to fig. 2, the invention provides an API-based malicious file detection system corresponding to the method of the first embodiment, which specifically includes the following modules:

the module 101, a parameter confirmation module, configured to perform the step of step S101 in the first embodiment, that is, to classify the collected files to confirm the file types, and put the files into a sandbox to run, and record the parameters called when each file runs; the file category comprises known files and unknown files; the parameters include: file ID, API name, thread ID, and sequence number of API calls;

the module 102, a preprocessing module, configured to execute the step of step S102 in the first embodiment, that is, perform preprocessing based on parameters called during the running of the known file, so as to serve as model training data;

a module 103, a feature construction module, configured to perform the step of step S103 of the first embodiment, that is, construct a feature engineering set based on the preprocessed data, where the feature engineering set includes: global features and local combined features;

the module 104 is a correction module, configured to execute the step S104 of the first embodiment, that is, perform model construction, and correct the model based on the feature engineering set and a preset threshold;

the module 105, the detection module, is configured to perform the step of step S105 of the first embodiment, that is, is configured to detect the collected unknown file based on the corrected model, so as to confirm whether the unknown file is a malicious file.

The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. The malicious file detection method based on the API is characterized by comprising the following steps of:

s102, preprocessing is carried out based on parameters called by the running process of the known file to be used as model training data, and the preprocessing step comprises the following steps:

generating a first field based on the thread ID and the difference between the sequence number of the API call and the thread ID; taking the file ID and the thread ID respectively as grouping objects, calculating a first difference value between the sequence numbers of two consecutive API calls; and splicing the contents of two adjacent first words corresponding to the same file ID to generate a second field;

converting the file category into a numerical value and finishing label coding mapping;

2. The API-based malicious file detection method as recited in claim 1, wherein the step of partitioning the API name to obtain a first word and a second word, and populating a first column and a second column corresponding to the API name based on the first word comprises:

3. The API-based malicious file detection method as set forth in claim 1, wherein said step S103 of constructing a feature engineering set based on the preprocessed data comprises:

taking the file ID as a grouping object to count the global features;

4. An API-based malicious file detection method in accordance with claim 3, wherein said step of counting said global features with file IDs as grouping objects comprises:

5. The API-based malicious file detection method as recited in claim 3, wherein the step of counting the local combination feature with a file ID and a preset field as a grouping object comprises:

6. The API-based malicious file detection method as set forth in claim 1, wherein in the step S104, the step of modifying the model based on the feature engineering set and a preset threshold value includes:

7. The API-based malicious file detection method as claimed in claim 1, wherein said step S101 of classifying the collected file comprises: and scanning the acquired files through antivirus software, and confirming the types of the files according to the scanning results.

8. An API-based malicious file detection system, comprising:

the preprocessing module is used for preprocessing based on parameters called by the known file in the running process to serve as model training data, and the preprocessing step comprises the following steps: