CN113378156B - API-based malicious file detection method and system - Google Patents


Info

Publication number
CN113378156B
Authority
CN
China
Prior art keywords
file
api
word
files
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110749396.XA
Other languages
Chinese (zh)
Other versions
CN113378156A (en)
Inventor
梁淑云
殷钱安
余贤喆
王启凡
陶景龙
徐�明
刘胜
马影
周晓勇
魏国富
夏玉明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Data Security Solutions Co Ltd
Original Assignee
Information and Data Security Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Data Security Solutions Co Ltd filed Critical Information and Data Security Solutions Co Ltd
Priority to CN202110749396.XA priority Critical patent/CN113378156B/en
Publication of CN113378156A publication Critical patent/CN113378156A/en
Application granted granted Critical
Publication of CN113378156B publication Critical patent/CN113378156B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/52 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity; Preventing unwanted data erasure; Buffer overflow
    • G06F21/53 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity; Preventing unwanted data erasure; Buffer overflow by executing in a restricted environment, e.g. sandbox or secure virtual machine
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/448 Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4488 Object-oriented
    • G06F9/449 Object-oriented method invocation or resolution
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides an API-based malicious file detection method and system. The method comprises the following steps: running a file in a sandbox and recording the name of each API the file calls at run time, the thread ID (tid) of the call, and the sequence number (index) of the API call within the thread; preprocessing the data, including processing the API names, optimizing low-frequency APIs, generating new fields and mapping label codes; constructing a feature engineering set from the processed data, the set comprising global features and local combined features, with the two feature sets finally spliced into one feature set; relabeling, according to the initial training result of the model, some of the files that antivirus software cannot judge as "normal" records, and then training the model again; and model prediction. The invention also provides an API-based malicious file detection system. The method achieves a certain recognition rate for various malicious files that bypass feature codes and sandboxes, and can improve the generalization ability of malicious file detection.

Description

API-based malicious file detection method and system
Technical Field
The invention relates to the technical field of information security services, in particular to a malicious file detection method and system based on an API (application program interface).
Background
In recent years, with the development of computer technology, intelligent terminals and network technology have come into wide use, and malicious files have spread and mutated along with them. Legitimate files strengthen and extend the capabilities of a computer and so make people's work and life easier; malicious files steal or destroy computer data and can cause economic loss and distress to enterprises and individuals. Detecting malicious files promptly and blocking the threats they pose is therefore increasingly important for keeping the network environment healthy and safe.
Current malicious file detection methods mainly comprise the feature code (signature) method, sandbox detection technology and the like. The feature code method is fast, but it cannot detect malicious files containing unknown feature codes, and a malicious file can escape feature code detection once it has been transformed, encrypted, packed or the like. In recent years sandbox technology has been used more and more widely; it simulates a normal environment in which an unknown file runs, records the file's run-time behavior, and matches that behavior against a malicious file library to judge whether the unknown file is malicious. With the spread of machine learning applications, methods that construct a machine learning model to detect malicious files have also appeared.
For example, patent document 202010572487.6 discloses a method for constructing a malicious file detection model and detecting malicious files, in which: a plurality of normal samples and a plurality of malicious samples are obtained and labeled respectively; malicious samples that have not been unpacked are filtered out; a static model is established, which includes: obtaining the PE formats of the normal samples and the malicious samples; converting the data into a plurality of feature vectors according to the PE format of each acquired sample; combining the feature vectors and associating them with a tag; adjusting the random forest model and the LightGBM model to optimal parameters; inputting the feature vectors associated with the tag into the random forest model and the LightGBM model, and respectively establishing a random forest model and a LightGBM model for statically detecting malicious files; a dynamic model is established, which includes: placing the normal samples and the malicious samples into a sandbox to obtain a sandbox report, and obtaining from the sandbox report the feature vectors of each sample with respect to API, tid, return_value and index; combining the feature vectors and associating them with a tag; adjusting the random forest model and the LightGBM model to optimal parameters, and establishing an important-feature random forest model; inputting the feature vectors associated with the tag into the random forest model, the important-feature random forest model and the LightGBM model, and respectively establishing a random forest model, an important-feature random forest model and a LightGBM model for dynamically detecting malicious files; all static models and all dynamic models are fused to obtain a fusion model; and a final malicious score is calculated from the total malicious suspicion score given by the fusion model and the malicious suspicion score given by the malheur model, and samples are detected according to the final malicious score.
Although sandbox detection technology avoids, to a certain extent, the feature code method's inability to detect unknown malicious files, attackers keep looking for ways to bypass sandbox detection, such as probing system characteristics or delaying execution, so the sandbox cannot judge whether some unknown files are malicious.
Existing methods based on machine learning modeling solve the problem of bypassed sandbox detection to a certain extent, but their feature engineering is biased toward statistical features: it ignores how different files differ in their API call time sequences, and it ignores feature sparsity during feature processing, which may reduce the accuracy and efficiency of the model.
Disclosure of Invention
The technical problem to be solved by the invention is how to judge whether an unknown file belongs to a malicious file.
The invention solves the technical problems by the following technical means: a malicious file detection method based on an API comprises the following steps:
S101, classifying the acquired files to confirm the category of each file, putting the files into a sandbox to run, and recording the parameters called when each file runs; the file categories comprise known files and unknown files; the parameters include: file ID, API name, thread ID, and sequence number of the API call;
S102, preprocessing the parameters called when the known files run, to serve as model training data;
S103, constructing a feature engineering set based on the preprocessed data, wherein the feature engineering set comprises: global features and local combined features;
S104, constructing a model, and correcting the model based on the feature engineering set and a preset threshold;
S105, detecting an acquired unknown file with the corrected model to confirm whether the unknown file is a malicious file.
The method focuses on processing the API names: keywords are extracted from the API names and features are constructed from them, which reduces feature dimension and feature sparsity and improves the efficiency and accuracy of the model. Moreover, model detection outputs the probability that an unknown file belongs to each category, which quantifies the probability that the unknown file is malicious, i.e. its score. In addition, when no files labeled "normal" are available, the model is used to generate a pseudo-labeled "normal" data set, and a multi-classification model with a certain ability to recognize "normal" files is trained, so that it can predict whether an unknown file is malicious, which category of malicious file it belongs to, and so on.
As an optimized technical solution, in step S102, the step of preprocessing the parameters called when the known files run includes:
dividing the API name to obtain a first word and a second word, and filling a first column and a second column corresponding to the API name accordingly;
merging the API names based on the content of the first column to reduce feature dimension;
optimizing the APIs based on the number of files corresponding to the first column;
generating new fields based on the thread ID and the sequence number of the API call in the thread;
and converting the file category into a numerical value to complete the label coding mapping.
As an optimized technical solution, the step of dividing the API name to obtain a first word and a second word, and filling a first column and a second column corresponding to the API name, includes:
dividing the API name by regular-expression matching according to the upper camel case naming convention of the API, to obtain the first word and the second word in the API name;
filling the first word into the first column corresponding to the API name and the second word into the second column corresponding to the API name.
As an optimized technical solution, the step of generating new fields based on the thread ID and the sequence number of the API call in the thread includes:
generating a first field from the difference between the sequence number of the API call and the thread ID;
taking the file name and the thread ID as grouping objects, and calculating the first difference between consecutive sequence numbers of the API calls;
splicing the contents of two adjacent first words corresponding to the same thread ID to generate a second field.
As an optimized technical solution, in step S103, the step of constructing a feature engineering set based on the preprocessed data includes:
taking the file ID as the grouping object to compute the global features;
taking the file ID and a preset field as grouping objects to compute the local combined features;
and splicing the global features and the local combined features into the feature engineering set with the file ID as the primary key.
As an optimized technical solution, the step of taking the file ID as the grouping object to compute the global features includes:
taking the file ID as the grouping object, counting the number of occurrences of the first word, the number of distinct first words, and the number of distinct second words;
taking the file ID as the grouping object, computing the maximum, minimum, mean, median, standard deviation, number of distinct values, dispersion, coefficient of variation, and deviation of the median from the mean of the thread ID;
taking the file ID as the grouping object, computing the maximum, minimum, mean, median, standard deviation, number of distinct values, dispersion, coefficient of variation, and deviation of the median from the mean of the sequence numbers of the API calls;
taking the file ID as the grouping object, computing the maximum, minimum, mean, median, standard deviation, number of distinct values, dispersion, coefficient of variation, and deviation of the median from the mean of the first field;
and taking the file ID as the grouping object, counting the number of occurrences of the second field of the API and the number of distinct values of the second field.
As an optimized technical solution, the step of taking the file ID and the preset field as grouping objects to compute the local combined features includes:
taking the file ID and the first word as grouping objects, counting the number of occurrences of each second word and the number of distinct second words;
taking the file ID and the first word as grouping objects, computing the maximum, minimum, median, standard deviation and number of distinct values of each first difference;
and taking the file ID and the second field as grouping objects, counting the number of occurrences of each second word.
As an optimized technical solution, in step S104, the step of correcting the model based on the feature engineering set and the preset threshold includes:
performing iterative learning with the feature engineering set and the file category of each file as the input of the model, and outputting the probability of each file category for each file ID;
relabeling as the pseudo label "normal" every file whose maximum probability value is smaller than the preset threshold and whose original file category is "unknown file", so as to form a new data set;
and performing iterative learning for a preset number of iterations with the new data set as the input of the model, so as to complete the correction of the model.
As an optimized technical solution, in step S101, the step of classifying the acquired files includes: scanning the acquired files with antivirus software, and confirming the category of each file according to the scanning results.
The invention also provides an API-based malicious file detection system, which comprises the following modules:
the parameter confirmation module is used for classifying the acquired files to confirm the types of the files, putting the files into a sandbox to run, and recording parameters called when each file runs; the file category comprises known files and unknown files; the parameters include: file ID, API name, thread ID, and sequence number of API calls;
the preprocessing module is used for preprocessing based on parameters called by the known file in the running process so as to be used as model training data;
the feature construction module is used for constructing a feature engineering set based on the preprocessed data, and the feature engineering set comprises: global features and local combined features;
the correction module is used for constructing a model and correcting the model based on the feature engineering set and a preset threshold;
the detection module is used for detecting the acquired unknown file based on the corrected model so as to confirm whether the unknown file is a malicious file or not.
The invention has the following advantages: it provides an API-based malicious file detection method in which feature engineering is constructed and a classification model is trained from the APIs and thread IDs (tids) called when a file runs, so as to judge whether an unknown file is malicious.
The method also focuses on processing the API names: keywords are extracted from the API names and features are constructed from them, which reduces feature dimension and feature sparsity and improves the efficiency and accuracy of the model. Furthermore, model prediction outputs the probability that an unknown file belongs to each category, which quantifies the probability that the unknown file is malicious, i.e. its score. In addition, when no files labeled "normal" are available, the model is used to generate a pseudo-labeled "normal" data set, and a multi-classification model with a certain ability to recognize "normal" files is trained, so that it can predict whether an unknown file is malicious, which category of malicious file it belongs to, and so on.
Drawings
Fig. 1 is a general flow chart of a malicious file detection method based on an API in embodiment 1 of the present invention.
Fig. 2 is a diagram illustrating a malicious file detection system module based on an API according to embodiment 2 of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As described in the background, existing malicious file detection methods all have problems to some degree. Starting from actual business needs, the invention improves on the problems in the prior art, for example by constructing 2-gram combinations of APIs, segmenting and merging API names, and merging low-frequency APIs. This avoids feature sparsity and reduces feature dimension, while also extending the features of the APIs along the call time sequence, thereby improving the accuracy and efficiency of the model. In addition, a real environment contains many unlabeled samples but few labeled ones. The invention therefore also addresses the case where only a large number of unlabeled files are available and no normal files exist, so that a model able to predict a category containing normal files could not otherwise be built: the model itself is used to generate "normal" files by means of pseudo labels.
Example 1
Referring to fig. 1, the invention provides a malicious file detection method based on an API, which specifically includes the following steps:
S101, classifying the acquired files to confirm the category of each file, putting the files into a sandbox to run, and recording the parameters called when each file runs;
the file categories comprise known files and unknown files;
the parameters include: the API (application program interface) name, the thread ID (tid), and the sequence number (index) of the API call in the thread. A running file generally calls multiple APIs across multiple tids. There is no ordering among different tids; within a single tid, index reflects the call order from smallest to largest, but the values may not be consecutive.
The step of classifying the acquired files comprises: scanning the acquired files with antivirus software, and confirming the category of each file according to the scanning results.
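As a minimal sketch of the record layout described in S101, the following Python snippet models one sandbox record per API call and orders the calls inside each thread by index. The type name ApiCall, the function name calls_by_thread and the sample values are hypothetical and are not taken from the patent.

```python
from collections import defaultdict
from typing import NamedTuple


class ApiCall(NamedTuple):
    """One sandbox record (hypothetical layout): file ID, API name, tid, index."""
    file_id: int
    api: str
    tid: int
    index: int


def calls_by_thread(records):
    """Group calls by (file_id, tid) and sort each group by index.

    Different tids are left unordered relative to each other, because the
    description states that no ordering exists between them.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec.file_id, rec.tid)].append(rec)
    return {key: sorted(calls, key=lambda c: c.index) for key, calls in groups.items()}
```

The index values inside one thread may have gaps, so the sketch only relies on their relative order.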
S102, preprocessing the parameters called when the known files run, to serve as model training data;
the step of preprocessing the parameters called when the known files run comprises the following steps:
S1021, dividing the API name to obtain a first word and a second word, and filling a first column and a second column corresponding to the API name accordingly:
S10211, dividing the API name by regular-expression matching according to the upper camel case naming convention of the API, to obtain the first word and the second word in the API name; for example, dividing "CreateFileW" gives the first word "Create" and the second word "File";
S10212, filling the first word into the first column corresponding to the API name and the second word into the second column corresponding to the API name.
S1022, merging the API names based on the content of the first word to reduce feature dimension;
for an API name that contains no capital letters, the whole API name is filled in as the first word and the second word is set to null, so that every API name has both columns. In this way, APIs with the same or similar functionality, such as "CreateFileW" and "CreateFileA", are merged, thereby reducing feature dimension.
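The splitting and merging rule of S1021 and S1022 can be sketched with a regular expression as follows. The pattern (a capital letter followed by lowercase letters or digits) and the function name split_api are assumptions made for illustration; the patent does not give the actual expression.

```python
import re

# Assumed pattern: one capital letter followed by one or more lowercase
# letters or digits. A trailing single capital such as the "W" or "A"
# suffix does not match, which is what merges CreateFileW and CreateFileA.
_WORD = re.compile(r"[A-Z][a-z0-9]+")


def split_api(name):
    """Return (first_word, second_word) for an API name.

    Names without any capitalized word are kept whole as the first word,
    and the second word is None (null), as described in S1022.
    """
    words = _WORD.findall(name)
    if not words:
        return name, None
    second = words[1] if len(words) > 1 else None
    return words[0], second
```

With this pattern, names that begin with an all-capital acronym would lose the acronym, so a production rule would likely need extra handling for such names.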
S1023, optimizing the low-frequency APIs, that is, optimizing the APIs based on the number of files in which each value of the first column occurs;
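The description does not state how low-frequency APIs are optimized. One plausible reading, sketched here purely as an assumption, is to count the number of distinct files in which each first word occurs and to merge first words below a minimum file count into a single placeholder. The threshold min_files, the placeholder "other" and the function name are all hypothetical.

```python
from collections import defaultdict


def merge_low_frequency(rows, min_files=2, placeholder="other"):
    """rows: iterable of (file_id, first_word) pairs.

    Assumed rule: a first word seen in fewer than min_files distinct
    files is replaced by the placeholder, shrinking the feature space.
    """
    rows = list(rows)
    files_per_word = defaultdict(set)
    for file_id, word in rows:
        files_per_word[word].add(file_id)
    return [
        (file_id, word if len(files_per_word[word]) >= min_files else placeholder)
        for file_id, word in rows
    ]
```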
S1024, generating new fields based on the thread ID and the sequence number of the API call in the thread, comprising the following steps:
generating a first field from the difference between the sequence number (index) of the API call and the thread ID (tid);
taking the file ID (file_id) and the thread ID as grouping objects, and calculating the first difference between consecutive sequence numbers of the API calls;
splicing the contents of two adjacent first words corresponding to the same file ID to generate a second field.
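The three derived fields of S1024 might be computed as in the following sketch. The field names inx_tid, index_diff and API_2N follow the feature names used later in this description; grouping the second field by (file_id, tid) and joining the two words with an underscore are assumptions, since the text does not fix either choice.

```python
from itertools import groupby


def add_fields(rows):
    """rows: list of dicts with keys file_id, tid, index, first_word.

    Adds inx_tid (index minus tid), index_diff (difference to the previous
    index in the same group) and API_2N (previous and current first word
    joined, i.e. a 2-gram). The first row of each group has no predecessor,
    so index_diff and API_2N are None there.
    """
    ordered = sorted(rows, key=lambda r: (r["file_id"], r["tid"], r["index"]))
    out = []
    for _, group in groupby(ordered, key=lambda r: (r["file_id"], r["tid"])):
        prev = None
        for row in group:
            new = dict(row)
            new["inx_tid"] = row["index"] - row["tid"]
            new["index_diff"] = None if prev is None else row["index"] - prev["index"]
            new["API_2N"] = (
                None if prev is None else prev["first_word"] + "_" + row["first_word"]
            )
            out.append(new)
            prev = row
    return out
```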
S1025, converting the file category into a numerical value, and finishing label coding mapping.
For example, "Trojan" maps to the value 0, "worm virus" maps to the value 1, "malicious web page file" maps to the value 2, "unknown" maps to the value 3, and so on.
The label (file category) includes, but is not limited to, the following categories: Trojan, worm virus, macro virus document, downloader, virus program, malicious web page file, suspicious program, backdoor program, game/play file, unknown, and so on, where "unknown" means that the antivirus software cannot determine whether the file is malicious; it does not mean that the file is normal.
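A minimal sketch of the label coding mapping in S1025 follows. Only the four category values stated in the example above are included; the values for the remaining categories are not given in the description and are deliberately left out. The names LABEL_TO_ID and encode_labels are hypothetical.

```python
# Only the mappings stated in the description; other categories would need
# their own values, which the text does not specify.
LABEL_TO_ID = {
    "Trojan": 0,
    "worm virus": 1,
    "malicious web page file": 2,
    "unknown": 3,
}
ID_TO_LABEL = {value: name for name, value in LABEL_TO_ID.items()}


def encode_labels(labels, mapping=LABEL_TO_ID):
    """Convert category names to integer codes; unknown names raise KeyError."""
    return [mapping[label] for label in labels]
```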
The upper camel case naming convention refers to variable names or function names formed by joining one or more words together with the initial letter of each word capitalized, such as "CreateFileW".
S103, feature engineering: a feature engineering set is constructed from the data processed in step S102. It mainly comprises two parts, global features and local combined features, and is built as follows:
S1031, taking the file ID (file_id) as the grouping object to compute the global features, which mainly comprise the following parts:
taking the file ID as the grouping object, counting the number of occurrences of the first word (fileid_API1_count) and the number of distinct first words (fileid_API1_nunique); taking the file ID as the grouping object, counting the number of distinct second words (fileid_APP2_nunique);
taking the file ID as the grouping object, computing the maximum (fileid_tid_max), minimum (fileid_tid_min), mean (fileid_tid_mean), median (fileid_tid_median), standard deviation (fileid_tid_std), number of distinct values (fileid_tid_unique), dispersion (fileid_tid_dis), coefficient of variation (fileid_tid_cv) and deviation of the median from the mean (fileid_tid_sk) of the thread ID (tid);
taking the file ID as the grouping object, computing the maximum (fileid_index_max), minimum (fileid_index_min), mean (fileid_index_mean), median (fileid_index_median), standard deviation (fileid_index_std), number of distinct values (fileid_index_unique), dispersion (fileid_index_dis), coefficient of variation (fileid_index_cv) and deviation of the median from the mean (fileid_index_sk) of the sequence numbers (index) of the API calls;
taking the file ID as the grouping object, computing the maximum (fileid_inx_tid_max), minimum (fileid_inx_tid_min), mean (fileid_inx_tid_mean), median (fileid_inx_tid_median), standard deviation (fileid_inx_tid_std), number of distinct values (fileid_inx_tid_unique), dispersion (fileid_inx_tid_dis), coefficient of variation (fileid_inx_tid_cv) and deviation of the median from the mean (fileid_inx_tid_sk) of the first field;
taking the file ID as the grouping object, counting the number of occurrences of the second field (API_2N) of the API (fileid_API_2N_count) and the number of distinct values of the second field (fileid_API_2N_nunique).
S1032, taking the combination of the file ID (file_id) and a preset field as grouping objects to compute the local combined features, then pivoting the result so that the file ID is the primary key and the values of the preset field become column names, thereby generating a feature set that mainly comprises the following parts:
taking the file ID and the first word as grouping objects, counting the number of occurrences of each second word and the number of distinct second words;
taking the file ID and the first word as grouping objects, computing the maximum, minimum, median, standard deviation and number of distinct values of each first difference (index_diff);
taking the file ID and the second field (API_2N) as grouping objects, counting the number of occurrences of each second word.
S1033, taking the file ID as the primary key, finally splicing the two feature sets, global features and local combined features, into one feature set.
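The grouping, pivoting and splicing in S1032 and S1033 can be illustrated with plain dictionaries. The sketch covers only the first local feature (second-word counts grouped by file ID and first word); the column naming scheme "<first word>_API2_count" and both function names are hypothetical.

```python
from collections import defaultdict


def local_count_features(rows):
    """rows: iterable of (file_id, first_word, second_word).

    Returns {file_id: {column: value}}, one column pair per first word,
    which is the pivoted (wide) form keyed by file ID. Null second words
    are not counted.
    """
    seconds = defaultdict(lambda: defaultdict(list))
    for file_id, first, second in rows:
        if second is not None:
            seconds[file_id][first].append(second)
    features = {}
    for file_id, by_first in seconds.items():
        row = {}
        for first, values in by_first.items():
            row[f"{first}_API2_count"] = len(values)
            row[f"{first}_API2_nunique"] = len(set(values))
        features[file_id] = row
    return features


def join_on_file_id(global_feats, local_feats):
    """Splice two feature dictionaries using the file ID as the primary key."""
    file_ids = sorted(set(global_feats) | set(local_feats))
    return {
        fid: {**global_feats.get(fid, {}), **local_feats.get(fid, {})}
        for fid in file_ids
    }
```

Because each file only gets columns for the first words it actually uses, a tabular model input would still need the missing columns filled, for example with zeros.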
The dispersion (fileid_tid_dis) is the number of distinct tid values (fileid_tid_unique) divided by the total number of occurrences (fileid_API1_count);
the coefficient of variation (fileid_tid_cv) is the standard deviation of tid (fileid_tid_std) divided by the mean of tid (fileid_tid_mean);
the deviation of the median from the mean (fileid_tid_sk) is the median of tid (fileid_tid_median) divided by the mean of tid (fileid_tid_mean);
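The per-file statistics for tid, including the three ratios just defined, could be computed as in this sketch. It assumes that the total count (fileid_API1_count) equals the number of recorded calls of the file and that the standard deviation is the sample standard deviation; neither is stated explicitly. The function name tid_summary and the short dictionary keys are hypothetical.

```python
import statistics


def tid_summary(tids):
    """tids: thread ID of every recorded API call of one file (non-empty)."""
    mean = statistics.fmean(tids)
    median = statistics.median(tids)
    # Assumption: sample standard deviation; a single call yields 0.0.
    std = statistics.stdev(tids) if len(tids) > 1 else 0.0
    nunique = len(set(tids))
    return {
        "max": max(tids),
        "min": min(tids),
        "mean": mean,
        "median": median,
        "std": std,
        "nunique": nunique,
        "dis": nunique / len(tids),                      # dispersion
        "cv": std / mean if mean else float("nan"),      # coefficient of variation
        "sk": median / mean if mean else float("nan"),   # median relative to mean
    }
```

The same function shape would apply to index and to the first field, since the description lists the same nine statistics for each.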
S104, constructing a model and correcting it based on the feature engineering set and a preset threshold:
Because the existing dataset contains no records labeled "normal", the files that some antivirus software could not classify are relabeled as "normal" according to the model's initial training result, and the model is then trained again so that it gains some ability to identify "normal" files. The specific implementation is as follows:
the feature set extracted in step S103 and the file category corresponding to each file ID are taken as the input of a LightGBM multi-classification model, and the model, after repeated iterative learning, outputs the probability of each file category for each file ID;
for each file ID whose maximum probability is below a preset threshold (35%) and whose original file category is "unknown", the file category is corrected to the pseudo label "normal"; the remaining records whose file category is "unknown" are removed, the pseudo-labeled "normal" records are added, and the "normal" category is mapped to the value 3, forming a new dataset that serves as the input of the LightGBM multi-classification model;
the new dataset is taken as the model input for a preset number of learning iterations, and the model is saved after this repeated training, completing the correction of the model.
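The relabeling rule above can be sketched independently of the classifier. The sketch below is an illustration under assumptions: the record layout `(file_id, label, class_probabilities)`, the string placeholder for the "unknown" category, and the function name `apply_pseudo_labels` are all hypothetical; only the 35% threshold and the value 3 for "normal" come from the text.

```python
NORMAL = 3            # numeric code for "normal", as stated in the text
UNKNOWN = "unknown"   # placeholder for the "unknown" category (encoding assumed)

def apply_pseudo_labels(records, threshold=0.35):
    """Build the corrected dataset from the first-round predictions.

    records: list of (file_id, label, class_probabilities).
    Returns a list of (file_id, label) pairs for the second training round.
    """
    corrected = []
    for file_id, label, probs in records:
        if label != UNKNOWN:
            # files with a known category keep their original label
            corrected.append((file_id, label))
        elif max(probs) < threshold:
            # low-confidence unknown file: assign the pseudo label "normal"
            corrected.append((file_id, NORMAL))
        # all other unknown records are dropped from the new dataset
    return corrected
```

The returned pairs would then be joined back to the feature set by file ID before the model is retrained for the preset number of iterations.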
The LightGBM multi-classification model is a distributed gradient boosting model based on decision trees. Its core ideas include the histogram strategy, the leaf-wise growth strategy, and the GOSS sampling strategy. The histogram strategy discretizes continuous feature values into bins: it determines how many bins each feature needs, divides the value range equally, replaces each sample's value with the value of the bin it falls into, and finally represents the feature as a histogram. This avoids the high cost and long running time that other gradient boosting algorithms incur when searching for the optimal split point of each feature. LightGBM grows trees leaf-wise: each time, it finds the leaf with the largest split gain among all current leaves, splits it, and repeats. For the same number of splits, leaf-wise growth reduces more error than level-wise growth and therefore achieves better accuracy. GOSS is a sampling strategy that reduces the amount of data while keeping accuracy roughly balanced: it distinguishes instances by gradient, retains the instances with larger gradients, and randomly samples those with smaller gradients, which reduces computation and improves efficiency.
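The histogram idea can be shown with a short sketch that follows the description above (equal division of the value range) rather than LightGBM's actual bin construction, which is not detailed here. The function name `equal_width_bins` is hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Discretize one continuous feature into equal-width bins.

    Returns the bin index of every sample and the per-bin sample counts.
    """
    lo, hi = min(values), max(values)
    # fall back to width 1.0 when the feature is constant
    width = (hi - lo) / n_bins or 1.0
    # the maximum value is clamped into the last bin
    bin_ids = [min(int((v - lo) / width), n_bins - 1) for v in values]
    histogram = [0] * n_bins
    for b in bin_ids:
        histogram[b] += 1
    return bin_ids, histogram

bin_ids, histogram = equal_width_bins([0.0, 1.0, 2.0, 3.0, 4.0], n_bins=2)
```

Once samples are reduced to bin indices, candidate split points only need to be evaluated at bin boundaries instead of at every distinct feature value, which is where the cost saving comes from.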
S105, model detection: detecting the acquired unknown file based on the corrected model to confirm whether it is a malicious file.
Embodiment two
Referring to fig. 2, the present invention provides an API-based malicious file detection system corresponding to the first embodiment, which specifically includes the following modules:
module 101, a parameter confirmation module, configured to execute step S101 of the first embodiment, that is, to classify the collected files to confirm their file categories, put the files into a sandbox to run, and record the parameters called when each file runs; the file category comprises known files and unknown files; the parameters include: file ID, API name, thread ID, and sequence number of API calls;
module 102, a preprocessing module, configured to execute step S102 of the first embodiment, that is, to perform preprocessing based on the parameters called while the known files run, so as to obtain model training data;
module 103, a feature construction module, configured to execute step S103 of the first embodiment, that is, to construct a feature engineering set based on the preprocessed data, where the feature engineering set includes: global features and local combined features;
module 104, a correction module, configured to execute step S104 of the first embodiment, that is, to construct a model and correct the model based on the feature engineering set and a preset threshold;
module 105, a detection module, configured to execute step S105 of the first embodiment, that is, to detect the collected unknown file based on the corrected model, so as to confirm whether the unknown file is a malicious file.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. The malicious file detection method based on the API is characterized by comprising the following steps of:
S101, classifying the acquired files to confirm the file categories, putting the files into a sandbox for running, and recording the parameters called when each file runs; the file category comprises known files and unknown files; the parameters include: file ID, API name, thread ID, and sequence number of API calls;
S102, preprocessing based on the parameters called while the known files run, to obtain model training data, wherein the preprocessing step comprises the following steps:
splitting the API name to obtain a first word and a second word, and filling a first column and a second column corresponding to the API name based on the first word;
merging the API names based on the content of the first column to reduce feature dimensions;
performing optimization processing of the API based on the number of files corresponding to the first column;
generating a first field based on the thread ID, the sequence number of the API call, and the difference of the thread ID; taking the file ID and the thread ID respectively as grouping objects, calculating a first difference between the sequence numbers of two consecutive API calls; and concatenating the contents of two adjacent first words corresponding to the same file ID to generate a second field;
converting the file category into a numerical value to complete the label encoding mapping;
S103, constructing a feature engineering set based on the preprocessed data, wherein the feature engineering set comprises: global features and local combined features;
S104, constructing a model, and correcting the model based on the feature engineering set and a preset threshold;
S105, detecting the acquired unknown file based on the corrected model to confirm whether the unknown file is a malicious file.
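The first difference and the second field from the preprocessing step of claim 1 can be sketched as follows. This is one possible reading, not the patented implementation: it assumes the difference is taken between consecutive sequence numbers within each (file ID, thread ID) group, and that the second field joins neighbouring first words with an underscore. The function names and the separator are hypothetical.

```python
from collections import defaultdict

def sequence_differences(calls):
    """calls: list of (file_id, tid, index), where index is the API call sequence number.

    Returns {(file_id, tid): [differences between consecutive sequence numbers]}.
    """
    grouped = defaultdict(list)
    for file_id, tid, index in calls:
        grouped[(file_id, tid)].append(index)
    diffs = {}
    for key, indexes in grouped.items():
        indexes.sort()
        diffs[key] = [b - a for a, b in zip(indexes, indexes[1:])]
    return diffs

def adjacent_word_pairs(first_words):
    """first_words: the first words of one file, in call order.

    Returns each pair of neighbouring first words joined into one token.
    """
    return [a + "_" + b for a, b in zip(first_words, first_words[1:])]
```

A thread with a single recorded call yields an empty difference list, so downstream statistics need to tolerate empty groups.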
2. The API-based malicious file detection method as recited in claim 1, wherein the step of partitioning the API name to obtain a first word and a second word, and populating a first column and a second column corresponding to the API name based on the first word comprises:
splitting the API name by regular-expression matching according to the UpperCamelCase naming rule of the API to obtain the first word and the second word of the API;
the first word is filled in a first column corresponding to the API name and the second word is filled in a second column corresponding to the API name.
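As a rough illustration of the splitting step in claim 2, the sketch below uses a regular expression to break an UpperCamelCase API name into words and keeps the first two. The pattern, the treatment of runs of capitals as a single word, and the empty string used when no second word exists are assumptions; the claim does not specify them. The function name `split_api_name` is hypothetical.

```python
import re

# A run of capitals not followed by a lowercase letter (e.g. "WSA", a trailing "W"),
# or one capital followed by lowercase letters and digits (e.g. "Create").
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z0-9]+")

def split_api_name(api_name):
    """Return (first_word, second_word) for an UpperCamelCase API name."""
    words = _WORD.findall(api_name)
    first = words[0] if words else api_name   # names without capitals stay whole
    second = words[1] if len(words) > 1 else ""
    return first, second

first_word, second_word = split_api_name("CreateFileW")
```

Grouping by the first word alone collapses many related API names into one feature, which is how the merging step in claim 1 reduces the feature dimension.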
3. The API-based malicious file detection method as set forth in claim 1, wherein said step S103 of constructing a feature engineering set based on the preprocessed data comprises:
taking the file ID as a grouping object to count the global features;
taking the file ID and a preset field as grouping objects to count the local combination characteristics;
and splicing the global features and the local combined features into the feature engineering set by taking the file ID as a primary key.
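The joining step at the end of claim 3 amounts to a key-based merge. A minimal sketch, assuming both feature sets are held as dictionaries keyed by file ID and that files without local features are kept (a left join, which the claim does not specify); the function name `join_feature_sets` is hypothetical:

```python
def join_feature_sets(global_features, local_features):
    """Join two feature sets on file ID.

    Both arguments map file_id -> {feature_name: value}.
    """
    joined = {}
    for file_id, feats in global_features.items():
        row = dict(feats)
        # add the local combined features of the same file, if any
        row.update(local_features.get(file_id, {}))
        joined[file_id] = row
    return joined
```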
4. The API-based malicious file detection method as recited in claim 3, wherein said step of counting said global features with file IDs as grouping objects comprises:
taking the file ID as a grouping object, counting the number of occurrences of the first word, the number of distinct first words, and the number of distinct second words;
taking the file ID as a grouping object, counting the maximum, minimum, mean, median, standard deviation, number of distinct values, dispersion, coefficient of variation, and deviation of the median from the mean of the thread ID;
taking the file ID as a grouping object, counting the maximum, minimum, mean, median, standard deviation, number of distinct values, dispersion, coefficient of variation, and deviation of the median from the mean of the sequence numbers of the API calls;
taking the file ID as a grouping object, counting the maximum, minimum, mean, median, standard deviation, number of distinct values, dispersion, coefficient of variation, and deviation of the median from the mean of the first field;
and taking the file ID as a grouping object, counting the number of occurrences of the second field of the API and the number of its distinct values.
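The word-count part of the global features in claim 4 can be sketched with plain grouping. The row layout `(file_id, first_word, second_word)`, the function name, and the feature names in the result are hypothetical.

```python
from collections import defaultdict

def global_word_counts(rows):
    """rows: iterable of (file_id, first_word, second_word), one per API call.

    Returns per-file counts: total first words, distinct first words,
    and distinct second words.
    """
    firsts = defaultdict(list)
    seconds = defaultdict(set)
    for file_id, first, second in rows:
        firsts[file_id].append(first)
        seconds[file_id].add(second)
    return {
        file_id: {
            "api_1_count": len(words),
            "api_1_unique": len(set(words)),
            "api_2_unique": len(seconds[file_id]),
        }
        for file_id, words in firsts.items()
    }
```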
5. The API-based malicious file detection method as recited in claim 3, wherein the step of counting the local combination feature with a file ID and a preset field as a grouping object comprises:
taking the file ID and the first word as grouping objects, counting the number of occurrences of each second word and the number of occurrences after deduplication;
taking the file ID and the first word as grouping objects, counting the maximum, minimum, median, standard deviation, and number of distinct values of each first difference;
and counting the occurrence times of each second word by taking the file ID and the second field as grouping objects.
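The local combined features of claim 5 differ from the global ones only in the grouping key, which pairs the file ID with a second column. A minimal sketch for the first grouping, (file ID, first word), follows; the row layout, the function name, and the keys of the result are hypothetical.

```python
from collections import Counter, defaultdict

def local_second_word_counts(rows):
    """rows: iterable of (file_id, first_word, second_word), one per API call.

    Returns, for each (file_id, first_word) group, how often second words
    occur in total, how many distinct second words occur, and the count of each.
    """
    counts = defaultdict(Counter)
    for file_id, first, second in rows:
        counts[(file_id, first)][second] += 1
    return {
        key: {"count": sum(c.values()), "unique": len(c), "per_word": dict(c)}
        for key, c in counts.items()
    }
```

Replacing the first word in the key with the second field (api_2n) gives the grouping described in the last step of the claim.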
6. The API-based malicious file detection method as set forth in claim 1, wherein in the step S104, the step of modifying the model based on the feature engineering set and a preset threshold value includes:
iterative learning is carried out by taking the characteristic engineering set and the file category corresponding to the file as the input of a model, and the probability of the file category corresponding to each file ID is output;
correcting the file category of each file whose maximum probability is below the preset threshold and whose original file category is "unknown file" to the pseudo label "normal", so as to form a new dataset;
and taking the new data set as the input of the model to perform iterative learning for preset times so as to complete the correction of the model.
7. The API-based malicious file detection method as claimed in claim 1, wherein said step S101 of classifying the collected files comprises: scanning the acquired files with antivirus software, and confirming the file categories according to the scanning results.
8. An API-based malicious file detection system, comprising:
the parameter confirmation module is used for classifying the acquired files to confirm the types of the files, putting the files into a sandbox to run, and recording parameters called when each file runs; the file category comprises known files and unknown files; the parameters include: file ID, API name, thread ID, and sequence number of API calls;
the preprocessing module is used for preprocessing based on parameters called by the known file in the running process to serve as model training data, and the preprocessing step comprises the following steps:
dividing the API name to obtain a first word and a second word, and filling a first column and a second column corresponding to the API name based on the first word;
merging the API names based on the content of the first column to reduce feature dimensions;
performing optimization processing of the API based on the number of the files corresponding to the first column;
generating a first field based on the thread ID, the sequence number of the API call, and the difference of the thread ID; taking the file ID and the thread ID respectively as grouping objects, calculating a first difference between the sequence numbers of two consecutive API calls; and concatenating the contents of two adjacent first words corresponding to the same file ID to generate a second field;
converting the file category into a numerical value and finishing label coding mapping;
the feature construction module is used for constructing a feature engineering set based on the preprocessed data, and the feature engineering set comprises: global features and local combined features;
the correction module is used for constructing a model and correcting the model based on the characteristic engineering set and a preset threshold value;
the detection module is used for detecting the acquired unknown file based on the corrected model so as to confirm whether the unknown file is a malicious file or not.
CN202110749396.XA 2021-07-01 2021-07-01 API-based malicious file detection method and system Active CN113378156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110749396.XA CN113378156B (en) 2021-07-01 2021-07-01 API-based malicious file detection method and system

Publications (2)

Publication Number Publication Date
CN113378156A CN113378156A (en) 2021-09-10
CN113378156B true CN113378156B (en) 2023-07-11

Family

ID=77580639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110749396.XA Active CN113378156B (en) 2021-07-01 2021-07-01 API-based malicious file detection method and system

Country Status (1)

Country Link
CN (1) CN113378156B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117193889B (en) * 2023-08-02 2024-03-08 上海澜码科技有限公司 Construction method of code example library and use method of code example library

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639337A (en) * 2020-04-17 2020-09-08 中国科学院信息工程研究所 Unknown malicious code detection method and system for massive Windows software
CN111723371A (en) * 2020-06-22 2020-09-29 上海斗象信息科技有限公司 Method for constructing detection model of malicious file and method for detecting malicious file
CN112241530A (en) * 2019-07-19 2021-01-19 中国人民解放军战略支援部队信息工程大学 Malicious PDF document detection method and electronic equipment

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9344472B2 (en) * 2012-12-28 2016-05-17 Microsoft Technology Licensing, Llc Seamlessly playing a composite media presentation
KR101620931B1 (en) * 2014-09-04 2016-05-13 한국전자통신연구원 Similar malicious code retrieval apparatus and method based on malicious code feature information
US20160241560A1 (en) * 2015-02-13 2016-08-18 Instart Logic, Inc. Client-site dom api access control
CN109508545B (en) * 2018-11-09 2021-06-04 北京大学 Android Malware classification method based on sparse representation and model fusion
CN109543751A (en) * 2018-11-22 2019-03-29 南京中孚信息技术有限公司 Method for mode matching, device and electronic equipment based on multithreading
CN111368289B (en) * 2018-12-26 2023-08-29 中兴通讯股份有限公司 Malicious software detection method and device
KR102317833B1 (en) * 2019-10-31 2021-10-25 삼성에스디에스 주식회사 method for machine LEARNING of MALWARE DETECTING MODEL AND METHOD FOR detecting Malware USING THE SAME
CN110826320B (en) * 2019-11-28 2023-10-13 上海观安信息技术股份有限公司 Sensitive data discovery method and system based on text recognition
CN112464234B (en) * 2020-11-21 2024-04-05 西北工业大学 Malicious software detection method based on SVM on cloud platform
CN112528284A (en) * 2020-12-18 2021-03-19 北京明略软件系统有限公司 Malicious program detection method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN113378156A (en) 2021-09-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant