CN111382783A - Malicious software identification method and device and storage medium - Google Patents


Publication number
CN111382783A
Authority
CN
China
Prior art keywords
model
gcforest
ret
pid
api
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010134497.1A
Other languages
Chinese (zh)
Inventor
张九经
李树栋
吴晓波
韩伟红
方滨兴
田志宏
殷丽华
顾钊铨
仇晶
王乐
李默涵
唐可可
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Application filed by Guangzhou University
Priority to CN202010134497.1A
Publication of CN111382783A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention relates to the technical field of software security and discloses a malicious software identification method, a malicious software identification device and a storage medium. The method comprises the following steps: extracting execution sequence features of sample software, wherein the execution sequence features include an API feature, a PID feature and a RET feature; training a GCForest model by using the API feature, the PID feature and the RET feature, wherein the GCForest model comprises a cascade forest module and the final prediction result of the GCForest model is output by a final decision learner; and identifying malicious software by using the trained GCForest model. The method, the device and the storage medium can improve the accuracy of malicious software identification.

Description

Malicious software identification method and device and storage medium
Technical Field
The present invention relates to the field of software security technologies, and in particular, to a method, an apparatus, and a storage medium for identifying malicious software.
Background
With the popularization and development of networks, people have entered the information age. However, with the development of network attack techniques, the threat that malicious code (represented by computer viruses, computer worms, Trojan horses and the like) poses to networks and information systems has become a significant problem for national, military and social security, and software security has become an important topic in current computer research. Malware identification is a method for judging the security of computer software and is a key part of software security research.
In the prior art, malware identification is mainly performed with deep neural network algorithms. After a malicious sample is analyzed, the malware file can be converted by a conversion algorithm into an image data set and a text sequence data set, so that deep learning models that perform well on image and text tasks, such as the CNN (Convolutional Neural Network) and the GRU (Gated Recurrent Unit), can be applied. Traditional forest-based machine learning algorithms also perform well in data classification tasks; among them, XGBoost (eXtreme Gradient Boosting) and LightGBM (Light Gradient Boosting Machine) are currently used for many problems in the field of network security, such as DDoS (Distributed Denial of Service) attack detection, malicious intrusion detection and click fraud detection.
However, existing identification methods based on deep learning suffer from low identification accuracy, and existing identification methods based on traditional forest machine learning directly take the average of the class probability vectors of the last layer as the output, which also leads to defects such as low accuracy.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a malicious software identification method, a malicious software identification device and a storage medium that identify malicious software with an improved GCForest model and thereby improve identification accuracy.
In order to solve the technical problem, in a first aspect, the present invention provides a malware identification method, including:
extracting execution sequence features of sample software, wherein the execution sequence features include an API feature, a PID feature and a RET feature;
training a GCForest model by using the API feature, the PID feature and the RET feature, wherein the GCForest model comprises a cascade forest module and the final prediction result of the GCForest model is output by a final decision learner;
and identifying malicious software by using the trained GCForest model.
Preferably, extracting the execution sequence features of the sample software specifically comprises:
capturing api_name, call_pid and ret_value from an xml file of the sample software;
and extracting the API feature, the PID feature and the RET feature of the sample software from the api_name, the call_pid and the ret_value by rule matching and frequency statistics.
Specifically, extracting the API feature, the PID feature and the RET feature of the sample software from the api_name, the call_pid and the ret_value by rule matching and frequency statistics comprises:
when the api_name of the sample software contains a first character string, determining that the value of the API feature of the sample software is 1, and otherwise determining that the value is 0, wherein the first character string is any character string in the api_name of the malicious software;
when the call_pid of the sample software contains a second character string, determining the value of the PID feature of the sample software as the frequency of occurrence of the second character string, wherein the second character string is any character string in the call_pid of the malicious software;
when the ret_value of the sample software contains a third character string, determining the value of the RET feature of the sample software as the frequency of occurrence of the third character string, wherein the third character string is any character string in the ret_value of the malicious software.
Preferably, the training of the GCForest model by using the API features, the PID features, and the RET features specifically includes:
S21: merging and standardizing the extracted API feature, PID feature and RET feature into a first feature vector, dividing the first feature vectors into a training set and a cross-validation set, feeding the training set into the GCForest model, and training the base learners of a first forest layer of the GCForest model and the final decision learner;
S22: connecting the first forest layer with the final decision learner to obtain a first GCForest model, predicting the cross-validation set with the first GCForest model, comparing the prediction results with preset labels, and calculating a first accuracy;
S23: concatenating the class probability vector output by the previous forest layer with the first feature vector of the training set to obtain a new feature vector as the input of the next forest layer, training the next forest layer with the new feature vector, connecting the newly trained forest layer with the final decision learner to obtain a new GCForest model, predicting the cross-validation set with the new GCForest model, comparing the prediction results with the preset labels, and calculating the current accuracy;
S24: if the current accuracy is greater than the accuracy of the previous forest layer, updating the current highest accuracy and the forest layer corresponding to the current highest accuracy, and repeating step S23;
S25: when the accuracy no longer increases, stopping training, and connecting the forest layer with the highest accuracy with the final decision learner to obtain the trained GCForest model.
Preferably, the base learner of any forest layer of the GCForest model is at least one of the following algorithms: random forest, extremely randomized trees, extreme gradient boosting (XGBoost), light gradient boosting (LightGBM), categorical boosting (CatBoost), and logistic regression.
In order to solve the same technical problem, in a second aspect, the present invention provides a malware identification apparatus, including a feature extraction module, a model training module and a software identification module, wherein:
the feature extraction module is used for extracting execution sequence features of sample software, wherein the execution sequence features include an API feature, a PID feature and a RET feature;
the model training module is used for training a GCForest model by using the API feature, the PID feature and the RET feature, wherein the GCForest model comprises a cascade forest module and the final prediction result of the GCForest model is output by a final decision learner;
the software identification module is used for identifying malicious software by using the trained GCForest model.
Preferably, the feature extraction module is configured to extract the execution sequence features of the sample software, specifically by:
capturing api_name, call_pid and ret_value from an xml file of the sample software;
and extracting the API feature, the PID feature and the RET feature of the sample software from the api_name, the call_pid and the ret_value by rule matching and frequency statistics.
Specifically, extracting the API feature, the PID feature and the RET feature of the sample software from the api_name, the call_pid and the ret_value by rule matching and frequency statistics comprises:
when the api_name of the sample software contains a first character string, determining that the value of the API feature of the sample software is 1, and otherwise determining that the value is 0, wherein the first character string is any character string in the api_name of the malicious software;
when the call_pid of the sample software contains a second character string, determining the value of the PID feature of the sample software as the frequency of occurrence of the second character string, wherein the second character string is any character string in the call_pid of the malicious software;
when the ret_value of the sample software contains a third character string, determining the value of the RET feature of the sample software as the frequency of occurrence of the third character string, wherein the third character string is any character string in the ret_value of the malicious software.
Preferably, the model training module is configured to train the GCForest model by using the API features, the PID features, and the RET features, specifically:
a: merging and standardizing the extracted API feature, PID feature and RET feature into a first feature vector, dividing the first feature vectors into a training set and a cross-validation set, feeding the training set into the GCForest model, and training the base learners of a first forest layer of the GCForest model and the final decision learner;
b: connecting the first forest layer with the final decision learner to obtain a first GCForest model, predicting the cross-validation set with the first GCForest model, comparing the prediction results with preset labels, and calculating a first accuracy;
c: concatenating the class probability vector output by the previous forest layer with the first feature vector of the training set to obtain a new feature vector as the input of the next forest layer, training the next forest layer with the new feature vector, connecting the newly trained forest layer with the final decision learner to obtain a new GCForest model, predicting the cross-validation set with the new GCForest model, comparing the prediction results with the preset labels, and calculating the current accuracy;
d: if the current accuracy is greater than the accuracy of the previous forest layer, updating the current highest accuracy and the forest layer corresponding to the current highest accuracy, and repeating step c;
e: when the accuracy no longer increases, stopping training, and connecting the forest layer with the highest accuracy with the final decision learner to obtain the trained GCForest model.
In order to solve the same technical problem, in a third aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program for implementing the malware identification method of the first aspect described above.
Compared with the prior art, the malicious software identification method, device and storage medium provided by the invention have the following advantages. Execution sequence features of sample software are extracted and used to train a GCForest model that comprises a cascade forest module and whose final prediction result is output by a final decision learner, and the trained GCForest model is used to identify malicious software. Because the final prediction result is output by the final decision learner, the model achieves higher identification accuracy in malicious software identification than the existing GCForest model. Compared with deep neural network methods, the improved GCForest model trains faster, has fewer parameters to tune, is more robust, and adaptively adjusts the complexity of the model to the data set, so that a relatively lightweight model can be obtained without pruning.
Drawings
In order to more clearly illustrate the technical features of the embodiments of the present invention, the drawings needed to be used in the embodiments of the present invention will be briefly described below, and it is apparent that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on the drawings without inventive labor.
FIG. 1 is a schematic structural diagram of a GCforest model in the prior art;
FIG. 2 is a schematic structural diagram of a GCforest model according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a malware identification method according to a first embodiment of the present invention;
FIG. 4 is a schematic flowchart of a specific process of feature extraction in the malware identification method according to the first embodiment of the present invention;
FIG. 5 is a flowchart illustrating a specific process of step S2 in the malware identification method according to the first embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a malware identification device according to a second embodiment of the present invention.
Detailed Description
In order to clearly understand the technical features, objects and effects of the present invention, the following detailed description of the embodiments of the present invention is provided with reference to the accompanying drawings and examples. The following examples are intended to illustrate the invention, but are not intended to limit the scope of the invention. Other embodiments, which can be derived by those skilled in the art from the embodiments of the present invention without inventive step, shall fall within the scope of the present invention.
It should be noted that GCForest (multi-Grained Cascade Forest) is an ensemble learning algorithm that builds a cascade structure from ensembles of decision trees. It mainly includes a cascade forest module and a multi-grained scanning module; in this document the multi-grained scanning module is not used, and the cascade forest module is used directly.
Fig. 1 is a schematic structural diagram of a GCForest model in the prior art.
As shown in FIG. 1, in the prior art the cascade forest module takes forests as its basic units and has a multi-layer cascade structure. Each layer is composed of base learners such as random forests and completely random forests. For each base learner, the input is the class probability vector generated by the previous layer or the raw input data, and the output of a layer is the combination of the outputs of its base learners. K-fold cross-validation is then performed for each layer, and the cascade stops growing as soon as the cross-validation accuracy no longer increases.
Fig. 2 is a schematic structural diagram of the GCForest model according to the embodiment of the present invention.
As shown in FIG. 2, in the embodiment of the present invention the cascade forest module takes forests as its basic units and has a multi-layer cascade structure. For each base learner, the input is the class probability vector generated by the previous layer or the raw input data, and the output of a layer is the combination of the outputs of its base learners. Validation is then performed for each layer, and the cascade stops growing as soon as the cross-validation accuracy no longer improves. The base learners of each layer may consist of at least one of random forest, extremely randomized trees, extreme gradient boosting (XGBoost), light gradient boosting (LightGBM), categorical boosting (CatBoost) and logistic regression. In addition, a final decision learner is added after the last layer of the cascade forest module: the class probability vector of the last layer is used as its input, and the final predicted value is output by the final decision learner, which is itself also a base learner.
Fig. 3 is a flowchart illustrating a malware identification method according to an embodiment of the present invention.
As shown in FIG. 3, the malware identification method includes the following steps:
S1: extracting execution sequence features of sample software, wherein the execution sequence features include an API feature, a PID feature and a RET feature;
S2: training a GCForest model by using the API feature, the PID feature and the RET feature, wherein the GCForest model comprises a cascade forest module and the final prediction result of the GCForest model is output by a final decision learner;
S3: identifying malicious software by using the trained GCForest model.
In step S1, the specific process is as follows:
capturing api_name, call_pid and ret_value from an xml file of the sample software;
and extracting the API feature, the PID feature and the RET feature of the sample software from the api_name, the call_pid and the ret_value by rule matching and frequency statistics.
Fig. 4 is a schematic diagram illustrating a specific flow of feature extraction in the malware identification method according to the first embodiment of the present invention.
As shown in FIG. 4, the feature extraction process specifically includes:
S11: when the api_name of the sample software contains a first character string, determining that the value of the API feature of the sample software is 1, and otherwise determining that the value is 0, wherein the first character string is any character string in the api_name of the malicious software;
It can be understood that the dynamic behavior of software is mainly realized by calling system APIs, so the API feature is the most important one considered. The api_name consists of a number of character strings that do not repeat within a single sample, so one-hot encoding is used to turn each character string in the api_name of the malicious software into a feature. For the API feature, the value is therefore 1 when the api_name of the sample software contains the first character string, and 0 otherwise.
S12: when the call_pid of the sample software contains a second character string, determining the value of the PID feature of the sample software as the frequency of occurrence of the second character string, wherein the second character string is any character string in the call_pid of the malicious software;
It can be understood that the PID feature represents information such as the type of the process in which the software executes. The call_pid consists of a number of character strings that appear multiple times in a single sample, so each character string in the call_pid of the malicious software is turned into a feature whose value, for the sample software, is the frequency of occurrence of that character string in its call_pid.
S13: when the ret_value of the sample software contains a third character string, determining the value of the RET feature of the sample software as the frequency of occurrence of the third character string, wherein the third character string is any character string in the ret_value of the malicious software.
It can be understood that the RET feature shows the execution result of a system call. The ret_value consists of a number of character strings that appear multiple times in a single sample, so each character string in the ret_value of the malicious software is turned into a feature whose value, for the sample software, is the frequency of occurrence of that character string in its ret_value.
In the first embodiment of the present invention, the Beautiful Soup library in Python is used to capture the api_name, call_pid and ret_value in the xml file exported from the sandbox, but the present invention is not limited thereto.
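As a rough illustration of this capture step, the sketch below pulls the three fields out of a sandbox report. The XML layout (one action element per call, carrying api_name, call_pid and ret_value attributes) is an assumption made for the example, since the report schema is not given in this document, and the standard-library xml.etree.ElementTree module is used in place of Beautiful Soup only to keep the example dependency-free.

```python
import xml.etree.ElementTree as ET

# Hypothetical sandbox report; the real schema is not specified in this document.
SAMPLE_XML = """
<report>
  <action api_name="CreateFileW" call_pid="1204" ret_value="0"/>
  <action api_name="WriteFile" call_pid="1204" ret_value="0"/>
  <action api_name="RegSetValueExW" call_pid="1310" ret_value="5"/>
</report>
"""


def capture_fields(xml_text):
    """Return the api_name, call_pid and ret_value strings of every call, in order."""
    root = ET.fromstring(xml_text)
    fields = {"api_name": [], "call_pid": [], "ret_value": []}
    for action in root.iter("action"):  # "action" is an assumed element name
        for key in fields:
            value = action.get(key)
            if value is not None:
                fields[key].append(value)
    return fields


captured = capture_fields(SAMPLE_XML)
```

The result is a dictionary of three string lists per sample, which is the form assumed by the later feature-extraction step.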
Through the processing of the above steps, a matrix whose features are expressed as only numbers can be extracted.
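Steps S11 to S13 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the vocabulary of character strings is collected from the malware samples, "frequency of occurrence" is taken to mean a raw count, and the function names build_vocabulary and extract_features are hypothetical.

```python
from collections import Counter


def build_vocabulary(malware_samples, key):
    """Collect, in sorted order, every string seen under `key` in the malware samples."""
    return sorted({s for sample in malware_samples for s in sample[key]})


def extract_features(sample, api_vocab, pid_vocab, ret_vocab):
    """API features are 0/1 presence flags; PID and RET features are occurrence counts."""
    api_seen = set(sample["api_name"])
    pid_counts = Counter(sample["call_pid"])
    ret_counts = Counter(sample["ret_value"])
    api = [1 if s in api_seen else 0 for s in api_vocab]   # S11: one-hot presence
    pid = [pid_counts[s] for s in pid_vocab]               # S12: count per string
    ret = [ret_counts[s] for s in ret_vocab]               # S13: count per string
    return api + pid + ret


# Toy data (invented for the example).
malware = [{"api_name": ["CreateFileW", "WriteFile"],
            "call_pid": ["1204", "1204"],
            "ret_value": ["0", "5"]}]
api_vocab = build_vocabulary(malware, "api_name")
pid_vocab = build_vocabulary(malware, "call_pid")
ret_vocab = build_vocabulary(malware, "ret_value")

sample = {"api_name": ["WriteFile"],
          "call_pid": ["1204", "1204", "1310"],
          "ret_value": ["0", "0", "0"]}
vector = extract_features(sample, api_vocab, pid_vocab, ret_vocab)
```

Stacking one such vector per sample gives the purely numeric matrix mentioned above; strings that never occur in the malware vocabulary (here the pid "1310") contribute no feature.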
Fig. 5 is a schematic flowchart illustrating a specific flow of step S2 in the malware identification method according to the first embodiment of the present invention.
As shown in fig. 5, step S2 specifically includes:
S21: merging and standardizing the extracted API feature, PID feature and RET feature into a first feature vector, dividing the first feature vectors into a training set and a cross-validation set, feeding the training set into the GCForest model, and training the base learners of a first forest layer of the GCForest model and the final decision learner;
In the first embodiment of the present invention, the data is labeled as malicious or benign. Malicious software includes, but is not limited to: software that covertly collects private information such as user terminal information and location; software that changes system settings or installs further malicious software; and software that uses fraudulent links to endanger the property safety of users. The invention does not limit the number and ratio of malicious to benign software, nor the ratio of the training set to the test set.
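The data preparation in step S21 might look like the sketch below. The document does not say which standardization is used, so z-score scaling is an assumption here, as are the 25% validation ratio, the fixed seed and the function names standardize and split.

```python
import random
from statistics import mean, pstdev


def standardize(rows):
    """Scale each column to zero mean and unit variance (constant columns become 0.0)."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


def split(rows, labels, validation_ratio=0.25, seed=0):
    """Shuffle the samples and divide them into a training set and a cross-validation set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * (1 - validation_ratio))
    train, valid = indices[:cut], indices[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in valid], [labels[i] for i in valid])


# Toy first feature vectors and labels (1 = malicious, 0 = benign), invented for the example.
features = [[0, 1, 2], [2, 1, 4], [4, 1, 6], [6, 1, 8]]
labels = [0, 1, 0, 1]
x_train, y_train, x_valid, y_valid = split(standardize(features), labels)
```

The sketch standardizes before splitting because that is the order given in step S21; computing the scaling statistics on the training set only is a common alternative.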
S22: connecting the first forest layer with the final decision learner to obtain a first GCForest model, predicting the cross-validation set with the first GCForest model, comparing the prediction results with preset labels, and calculating a first accuracy;
S23: concatenating the class probability vector output by the previous forest layer with the first feature vector of the training set to obtain a new feature vector as the input of the next forest layer, training the next forest layer with the new feature vector, connecting the newly trained forest layer with the final decision learner to obtain a new GCForest model, predicting the cross-validation set with the new GCForest model, comparing the prediction results with the preset labels, and calculating the current accuracy;
It should be understood that, each time step S23 is performed, the class probability vector output by the last forest layer of the existing model is concatenated with the first feature vector to obtain a new feature vector, and a new forest layer is trained with this new feature vector. After training is finished, the newly trained forest layer is appended as the layer following the last forest layer of the existing model, and the final decision learner is then connected to form a new GCForest model.
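The concatenation in step S23 can be pictured with the short sketch below. It assumes that each base learner of the previous layer emits one probability per class and that these are simply appended to the first feature vector; the function name next_layer_input and the toy numbers are hypothetical.

```python
def next_layer_input(first_features, class_probabilities):
    """Append each sample's class probability vector to its first feature vector."""
    return [list(f) + list(p) for f, p in zip(first_features, class_probabilities)]


# Two samples with three original features each (invented values).
first_features = [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0]]
# Output of a previous layer with two base learners, each giving [P(benign), P(malicious)].
class_probabilities = [[0.9, 0.1, 0.8, 0.2], [0.3, 0.7, 0.4, 0.6]]

new_features = next_layer_input(first_features, class_probabilities)
```

Every layer after the first therefore sees the original features plus the previous layer's opinion, so the input width grows by (number of base learners × number of classes) and does not keep growing with depth.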
S24: if the current accuracy is greater than the accuracy of the previous forest layer, updating the current highest accuracy and the forest layer corresponding to the current highest accuracy, and repeating step S23;
wherein the accuracy p is

$$p = \frac{1}{Y}\sum_{i=1}^{Y} I\left(y_{Pred}^{(i)},\, y_{True}^{(i)}\right)$$

Here Y is the number of samples in the cross-validation set, $y_{Pred}^{(i)}$ is the predicted class of the i-th sample in the cross-validation set, and $y_{True}^{(i)}$ is the true class of the i-th sample in the cross-validation set. The function I(x, y) is an indicator function: when the values of x and y are the same, the function value is 1; otherwise, the function value is 0. At the same time, $p_{Best}$ is used to record the current highest cross-validation accuracy and $Index_{Best}$ to record the forest layer corresponding to the current highest accuracy.
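The accuracy used in step S24 is the fraction of cross-validation samples whose predicted class equals the true class. A direct transcription, with the hypothetical function name accuracy, is:

```python
def accuracy(y_pred, y_true):
    """Fraction of samples whose predicted class equals the true class."""
    if len(y_pred) != len(y_true) or not y_true:
        raise ValueError("y_pred and y_true must be non-empty and of equal length")
    # The comparison plays the role of the indicator function I(x, y).
    return sum(1 for p, t in zip(y_pred, y_true) if p == t) / len(y_true)


p = accuracy([1, 0, 1, 1], [1, 0, 0, 1])  # three of four predictions match
```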
It should be understood that updating the value of $Index_{Best}$ in step S24 means increasing the number of layers of the GCForest model. Moreover, when step S23 is repeated from step S24, $p_{Best}$ has already been updated to the prediction accuracy, on the cross-validation set, of the new GCForest model that includes the newly trained forest layer, and $Index_{Best}$ has been updated to the index of that newly trained forest layer. For example, after the second forest layer is trained in step S23, if in step S24 the accuracy of the current model (which includes the first forest layer, the second forest layer and the final decision learner) is higher than that of the model with only the first forest layer and the final decision learner, the values of $p_{Best}$ and $Index_{Best}$ are updated; otherwise, the process proceeds to step S25. In the latter case $Index_{Best}$ still points to the first forest layer, and $p_{Best}$ is still the first accuracy calculated in step S22.
S25: when the accuracy no longer increases, stopping training, and connecting the forest layer with the highest accuracy with the final decision learner to obtain the trained GCForest model.
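The control flow of steps S21 to S25 can be summarized by the sketch below. It only captures the layer-growing loop and its stopping rule: the callables train_layer and evaluate are placeholders for the actual forest training and cross-validation scoring, and max_layers is a hypothetical safeguard that does not appear in this document.

```python
def grow_cascade(train_layer, evaluate, max_layers=20):
    """Add forest layers until the cross-validation accuracy stops improving.

    train_layer(k) trains forest layer k (1-based) on top of the layers already trained.
    evaluate(k) returns the cross-validation accuracy of layers 1..k plus the final
    decision learner. Both callables are supplied by the caller.
    """
    train_layer(1)                       # S21
    p_best, index_best = evaluate(1), 1  # S22: first accuracy
    for k in range(2, max_layers + 1):
        train_layer(k)                   # S23: train the next layer
        p = evaluate(k)
        if p > p_best:                   # S24: keep growing while accuracy improves
            p_best, index_best = p, k
        else:                            # S25: stop at the first non-improvement
            break
    return index_best, p_best


# Stub learners with invented accuracies, used only to exercise the loop.
scores = {1: 0.80, 2: 0.85, 3: 0.90, 4: 0.88}
trained = []
index_best, p_best = grow_cascade(trained.append, scores.get)
```

With these stub scores the loop trains a fourth layer, sees the accuracy fall, and returns the third layer as the one to connect to the final decision learner.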
In the process of training the GCForest model, the base learners of any forest layer of the GCForest model comprise at least one of the following algorithms: random forest, extremely randomized trees, extreme gradient boosting (XGBoost), light gradient boosting (LightGBM), categorical boosting (CatBoost), and logistic regression.
After step S2, an improved GCForest model is obtained, whose final prediction result is output by the final decision learner and whose forest layers are each built from at least one of the base learner algorithms listed above.
After the improved GCForest model is obtained, features are extracted from the software to be identified and input into the improved GCForest model, which classifies the software as malicious or benign.
The malware identification method provided by this embodiment of the invention extracts sample software execution sequence features, trains a GCForest model that comprises a cascade forest module and whose final prediction result is output by a final decision learner, and identifies malware using the trained GCForest model. Because the final prediction result is output by the final decision learner, the model achieves higher identification accuracy in malware identification than the existing GCForest model. Compared with deep neural network methods, identifying malware with the improved GCForest model offers faster training, fewer parameters to tune, relatively simple parameter tuning, robustness, and the ability to adapt the model to the data set, so that a relatively lightweight model can be obtained without pruning.
It should be understood that all or part of the processes of the malware identification method described above may be implemented by a computer program, which may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the malware identification method described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
Fig. 6 is a schematic structural diagram of a malware identification device according to a second embodiment of the present invention.
As shown in Fig. 6, the malware identification device includes: a feature extraction module 61, a model training module 62 and a software identification module 63; wherein
the feature extraction module 61 is configured to extract features of a sample software execution sequence; wherein the sample software execution sequence features include an API feature, a PID feature, and a RET feature;
the model training module 62 is configured to train a GCForest model using the API features, the PID features, and the RET features; the GCforest model comprises a cascade forest module, and a final prediction result of the GCforest model is output by a final decision learner;
the software identification module 63 is configured to identify malware by using the trained GCForest model.
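The division of labour among the three modules can be sketched as a small composition of callables. This is only an illustration under stated assumptions: the class name, the callable signatures, and the convention that label 1 means malicious are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

FeatureVector = List[float]
Model = Callable[[FeatureVector], int]  # assumption: 1 = malicious, 0 = benign


@dataclass
class MalwareIdentificationDevice:
    # Feature extraction module 61: maps a sample to a numeric vector.
    extract: Callable[[str], FeatureVector]
    # Model training module 62: maps vectors and labels to a trained model.
    train: Callable[[Sequence[FeatureVector], Sequence[int]], Model]
    model: Optional[Model] = field(default=None, init=False)

    def fit(self, samples: Sequence[str], labels: Sequence[int]) -> None:
        """Extract features from every sample and train the model."""
        vectors = [self.extract(s) for s in samples]
        self.model = self.train(vectors, labels)

    def identify(self, sample: str) -> str:
        """Software identification module 63: classify one sample."""
        if self.model is None:
            raise RuntimeError("fit() must be called before identify()")
        label = self.model(self.extract(sample))
        return "malicious" if label == 1 else "benign"
```

Keeping the extractor and trainer as injected callables lets the same wrapper be reused with any feature scheme or cascade implementation.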
Preferably, the feature extraction module 61 is configured to extract features of the sample software execution sequence, specifically:
grabbing api_name, call_pid and ret_value from an XML file of the sample software;
and extracting the API features, the PID features and the RET features of the sample software by applying rule matching and frequency statistics to the api_name, the call_pid and the ret_value.
Further, extracting the API features, the PID features and the RET features of the sample software by applying rule matching and frequency statistics to the api_name, the call_pid and the ret_value specifically includes:
when the api_name of the sample software contains a first character string, determining that the value of the API feature of the sample software is 1, and otherwise 0; wherein the first character string is any character string in the api_name of the malware;
when the call_pid of the sample software contains a second character string, determining the value of the PID feature of the sample software as the frequency of occurrence of the second character string; wherein the second character string is any character string in the call_pid of the malware;
when the ret_value of the sample software contains a third character string, determining the value of the RET feature of the sample software as the frequency of occurrence of the third character string; wherein the third character string is any character string in the ret_value of the malware.
Through the above steps, a feature matrix consisting only of numeric values can be extracted.
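The rule matching and frequency statistics described above might look roughly like the following sketch. Several details are assumptions made for illustration: the XML layout (one `<action>` element per call with `api_name`, `call_pid` and `ret_value` attributes), the use of exact string equality as the matching rule, and the three vocabulary lists holding strings previously observed in malware.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Iterable, List, Tuple

Call = Tuple[str, str, str]  # (api_name, call_pid, ret_value)


def grab_calls(xml_text: str) -> List[Call]:
    """Return (api_name, call_pid, ret_value) for every <action> element.

    The element name and attribute layout are assumptions; adjust them
    to the actual report format.
    """
    root = ET.fromstring(xml_text)
    return [
        (a.get("api_name", ""), a.get("call_pid", ""), a.get("ret_value", ""))
        for a in root.iter("action")
    ]


def extract_features(
    calls: Iterable[Call],
    api_vocab: List[str],  # hypothetical: api_name strings seen in malware
    pid_vocab: List[str],  # hypothetical: call_pid strings seen in malware
    ret_vocab: List[str],  # hypothetical: ret_value strings seen in malware
) -> List[int]:
    """Build one numeric row: binary API flags, then PID and RET counts."""
    calls = list(calls)
    api_seen = {c[0] for c in calls}
    pid_counts = Counter(c[1] for c in calls)
    ret_counts = Counter(c[2] for c in calls)

    api_part = [1 if s in api_seen else 0 for s in api_vocab]  # rule matching
    pid_part = [pid_counts[s] for s in pid_vocab]              # frequency
    ret_part = [ret_counts[s] for s in ret_vocab]              # frequency
    return api_part + pid_part + ret_part
```

Stacking one such row per sample yields the purely numeric feature matrix that the training step consumes.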
Preferably, the model training module 62 is configured to train a GCForest model using the API features, the PID features and the RET features, specifically:
a: merging and standardizing the extracted API features, PID features and RET features into a first feature vector, dividing the first feature vector into a training set and a cross-validation set, feeding the training set into the GCForest model, and training the base learner of a first forest layer of the GCForest model and the final decision learner;
b: connecting the first forest layer to the final decision learner to obtain a first GCForest model, predicting the cross-validation set with the first GCForest model, comparing the prediction results with preset labels, and calculating a first accuracy;
c: concatenating the class probability vector output by the previous forest layer with the first feature vector of the training set to obtain a new feature vector as the input of the next forest layer, training the next forest layer with the new feature vector, connecting the next forest layer to the final decision learner to obtain a new GCForest model, predicting the cross-validation set with the new GCForest model, comparing the prediction results with the preset labels, and calculating the current accuracy;
d: if the current accuracy is greater than the accuracy of the previous forest layer, updating the current highest accuracy and the forest layer corresponding to the current highest accuracy, and repeating step c;
wherein the accuracy is
$$p = \frac{1}{Y}\sum_{i=1}^{Y} I\left(y_{Pred}^{(i)},\, y_{True}^{(i)}\right)$$
where Y is the number of samples in the cross-validation set, y_Pred^(i) is the predicted class of the i-th sample in the cross-validation set, and y_True^(i) is the true class of the i-th sample in the cross-validation set. The function I(x, y) is an indicator function: when the values of x and y are the same, the function value is 1; otherwise, the function value is 0. Meanwhile, p_Best records the current highest cross-validation accuracy, and Index_Best records the forest layer corresponding to the current highest accuracy.
It should be understood that an update of the value of Index_Best in step d means an increase in the number of layers of the GCForest model. When the operation of step c is repeated from step d, the value of p_Best has already been updated to the prediction accuracy, on the cross-validation set, of the new GCForest model to which the newly trained forest layer has been added, and the value of Index_Best has been updated to the index of that forest layer. For example, after the second forest layer is trained in step c, if the accuracy of the current model (which includes the first forest layer, the second forest layer and the final decision learner) is found in step d to be higher than that of the model with only the first forest layer and the final decision learner, the values of p_Best and Index_Best are updated; otherwise, step e is performed. In the latter case, the value of Index_Best still points to the first forest layer, and the value of p_Best is still the first accuracy calculated in step b.
e: when the accuracy no longer increases, stopping training, and connecting the forest layer with the highest accuracy to the final decision learner to obtain the trained GCForest model.
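Two small pieces of steps a to e lend themselves to a worked sketch: the accuracy computed on the cross-validation set, and the concatenation in step c that builds the input of the next forest layer. The function names are hypothetical, and the concatenation order (original features first, class probabilities second) is an assumption, since the text does not fix an order.

```python
from typing import List, Sequence


def accuracy(y_pred: Sequence[int], y_true: Sequence[int]) -> float:
    """Fraction of cross-validation samples whose prediction is correct.

    Implements the mean of the indicator I(y_pred[i], y_true[i]) over
    all Y samples.
    """
    if len(y_true) == 0 or len(y_pred) != len(y_true):
        raise ValueError("label sequences must be non-empty and equal in length")
    matches = sum(1 for p, t in zip(y_pred, y_true) if p == t)
    return matches / len(y_true)


def augment(
    first_features: Sequence[Sequence[float]],
    class_probs: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Concatenate each sample's first feature vector with the class
    probability vector output by the previous forest layer (step c).
    """
    if len(first_features) != len(class_probs):
        raise ValueError("one class probability vector is needed per sample")
    return [list(f) + list(c) for f, c in zip(first_features, class_probs)]
```

For example, three correct predictions out of four give an accuracy of 0.75, and a two-feature sample with a two-class probability vector becomes a four-element input row for the next layer.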
In the process of training the GCForest model, the base learner of any forest layer of the GCForest model is composed of at least one of the following algorithms: random forest, extremely randomized trees, extreme gradient boosting (XGBoost), light gradient boosting (LightGBM), categorical boosting (CatBoost), and logistic regression.
After the improved GCForest model is obtained, features are extracted from the software to be identified and input into the improved GCForest model, which classifies the software as malicious or benign.
Compared with the conventional GCForest model, the malware identification device achieves higher identification accuracy in malware identification. Compared with deep neural network methods, identifying malware with the improved GCForest model offers faster training, fewer parameters to tune, relatively simple parameter tuning, robustness, and the ability to adapt the model to the data set, so that a relatively lightweight model can be obtained without pruning.
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and it should be noted that, for those skilled in the art, several equivalent obvious modifications and/or equivalent substitutions can be made without departing from the technical principle of the present invention, and these obvious modifications and/or equivalent substitutions should also be regarded as the scope of the present invention.

Claims (10)

1. A malware identification method, comprising:
extracting sample software execution sequence characteristics; wherein the sample software execution sequence features include an API feature, a PID feature, and a RET feature;
training a GCforest model by using the API characteristics, the PID characteristics and the RET characteristics; the GCforest model comprises a cascade forest module, and a final prediction result of the GCforest model is output by a final decision learner;
and identifying the malicious software by using the trained GCforest model.
2. The malware identification method according to claim 1, wherein the extracting of the sample software execution sequence features specifically is:
grabbing api_name, call_pid and ret_value from an XML file of the sample software;
and extracting the API features, the PID features and the RET features of the sample software by applying rule matching and frequency statistics to the api_name, the call_pid and the ret_value.
3. The malware identification method of claim 2, wherein the extracting of the API features, the PID features and the RET features of the sample software by applying rule matching and frequency statistics to the api_name, the call_pid and the ret_value specifically comprises:
when the api_name of the sample software contains a first character string, determining that the value of the API feature of the sample software is 1, and otherwise 0; wherein the first character string is any character string in the api_name of the malware;
when the call_pid of the sample software contains a second character string, determining the value of the PID feature of the sample software as the frequency of occurrence of the second character string; wherein the second character string is any character string in the call_pid of the malware;
when the ret_value of the sample software contains a third character string, determining the value of the RET feature of the sample software as the frequency of occurrence of the third character string; wherein the third character string is any character string in the ret_value of the malware.
4. The malware identification method according to claim 1, wherein the training of the GCForest model using the API features, the PID features, and the RET features specifically comprises:
S21: merging and standardizing the extracted API features, PID features and RET features into a first feature vector, dividing the first feature vector into a training set and a cross-validation set, feeding the training set into the GCForest model, and training the base learner of a first forest layer of the GCForest model and the final decision learner;
S22: connecting the first forest layer to the final decision learner to obtain a first GCForest model, predicting the cross-validation set with the first GCForest model, comparing the prediction results with preset labels, and calculating a first accuracy;
S23: concatenating the class probability vector output by the previous forest layer with the first feature vector of the training set to obtain a new feature vector as the input of the next forest layer, training the next forest layer with the new feature vector, connecting the next forest layer to the final decision learner to obtain a new GCForest model, predicting the cross-validation set with the new GCForest model, comparing the prediction results with the preset labels, and calculating the current accuracy;
S24: if the current accuracy is greater than the accuracy of the previous forest layer, updating the current highest accuracy and the forest layer corresponding to the current highest accuracy, and repeating step S23;
S25: when the accuracy no longer increases, stopping training, and connecting the forest layer with the highest accuracy to the final decision learner to obtain the trained GCForest model.
5. The malware identification method of claim 4, wherein the base learner of any forest layer of the GCForest model is composed of at least one of the following algorithms: random forest, extremely randomized trees, extreme gradient boosting (XGBoost), light gradient boosting (LightGBM), categorical boosting (CatBoost), and logistic regression.
6. A malware identification device, comprising: a feature extraction module, a model training module and a software identification module; wherein
the characteristic extraction module is used for extracting the characteristics of the execution sequence of the sample software; wherein the sample software execution sequence features include an API feature, a PID feature, and a RET feature;
the model training module is used for training a GCforest model by utilizing the API characteristics, the PID characteristics and the RET characteristics; the GCforest model comprises a cascade forest module, and a final prediction result of the GCforest model is output by a final decision learner;
the software identification module is used for identifying malicious software by using the trained GCforest model.
7. The malware identification device of claim 6, wherein the feature extraction module is configured to extract sample software execution sequence features, specifically:
grabbing api_name, call_pid and ret_value from an XML file of the sample software;
and extracting the API features, the PID features and the RET features of the sample software by applying rule matching and frequency statistics to the api_name, the call_pid and the ret_value.
8. The malware identification device of claim 7, wherein the API features, the PID features and the RET features of the sample software are extracted by applying rule matching and frequency statistics to the api_name, the call_pid and the ret_value, specifically:
when the api_name of the sample software contains a first character string, determining that the value of the API feature of the sample software is 1, and otherwise 0; wherein the first character string is any character string in the api_name of the malware;
when the call_pid of the sample software contains a second character string, determining the value of the PID feature of the sample software as the frequency of occurrence of the second character string; wherein the second character string is any character string in the call_pid of the malware;
when the ret_value of the sample software contains a third character string, determining the value of the RET feature of the sample software as the frequency of occurrence of the third character string; wherein the third character string is any character string in the ret_value of the malware.
9. The malware identification device of claim 6, wherein the model training module is configured to train a GCForest model using the API features, the PID features, and the RET features, specifically:
a: merging and standardizing the extracted API features, PID features and RET features into a first feature vector, dividing the first feature vector into a training set and a cross-validation set, feeding the training set into the GCForest model, and training the base learner of a first forest layer of the GCForest model and the final decision learner;
b: connecting the first forest layer to the final decision learner to obtain a first GCForest model, predicting the cross-validation set with the first GCForest model, comparing the prediction results with preset labels, and calculating a first accuracy;
c: concatenating the class probability vector output by the previous forest layer with the first feature vector of the training set to obtain a new feature vector as the input of the next forest layer, training the next forest layer with the new feature vector, connecting the next forest layer to the final decision learner to obtain a new GCForest model, predicting the cross-validation set with the new GCForest model, comparing the prediction results with the preset labels, and calculating the current accuracy;
d: if the current accuracy is greater than the accuracy of the previous forest layer, updating the current highest accuracy and the forest layer corresponding to the current highest accuracy, and repeating step c;
e: when the accuracy no longer increases, stopping training, and connecting the forest layer with the highest accuracy to the final decision learner to obtain the trained GCForest model.
10. A computer-readable storage medium, characterized in that a computer program implementing the malware identification method according to any one of claims 1 to 5 is stored thereon.
CN202010134497.1A 2020-02-28 2020-02-28 Malicious software identification method and device and storage medium Pending CN111382783A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010134497.1A CN111382783A (en) 2020-02-28 2020-02-28 Malicious software identification method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010134497.1A CN111382783A (en) 2020-02-28 2020-02-28 Malicious software identification method and device and storage medium

Publications (1)

Publication Number Publication Date
CN111382783A true CN111382783A (en) 2020-07-07

Family

ID=71221389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010134497.1A Pending CN111382783A (en) 2020-02-28 2020-02-28 Malicious software identification method and device and storage medium

Country Status (1)

Country Link
CN (1) CN111382783A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102761458A (en) * 2011-12-20 2012-10-31 北京安天电子设备有限公司 Detection method and system of rebound type Trojan
US10360398B2 (en) * 2012-10-19 2019-07-23 Mcafee, Llc Secure disk access control
CN106850582A (en) * 2017-01-05 2017-06-13 中国电子科技网络信息安全有限公司 A kind of APT Advanced threat detection methods based on instruction monitoring
CN107153789A (en) * 2017-04-24 2017-09-12 西安电子科技大学 The method for detecting Android Malware in real time using random forest grader
CN108345793A (en) * 2017-12-29 2018-07-31 北京物资学院 A kind of extracting method and device of software detection feature
CN108319855A (en) * 2018-02-08 2018-07-24 中国人民解放军陆军炮兵防空兵学院郑州校区 A kind of malicious code sorting technique based on depth forest
CN108595955A (en) * 2018-04-25 2018-09-28 东北大学 A kind of Android mobile phone malicious application detecting system and method
CN111062036A (en) * 2019-11-29 2020-04-24 暨南大学 Malicious software identification model construction method, malicious software identification medium and malicious software identification equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DATACON大数据安全分析比赛冠军思路分享: "DataCon大数据安全分析比赛冠军思路分享", 《知乎》 *
企鹅在线: "互联网协会公布恶意软件定义细则", 《IT168软件咨询》 *
徐英杰 等: "基于多粒度级联多层梯度提升树的选票手写字符识别算法", 《计算机应用》 *
石兴华 等: "基于深度森林的安卓恶意软件行为分析与检测", 《软件》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328977A (en) * 2020-11-09 2021-02-05 杭州安恒信息技术股份有限公司 Method, device, equipment and medium for detecting authenticity of application software
CN112328977B (en) * 2020-11-09 2024-03-22 杭州安恒信息技术股份有限公司 Application software authenticity detection method, device, equipment and medium
WO2022227535A1 (en) * 2021-04-29 2022-11-03 广州大学 Method and system for recognizing mining malicious software, and storage medium
CN113569241A (en) * 2021-07-28 2021-10-29 新华三技术有限公司 Virus detection method and device
CN113704409A (en) * 2021-08-31 2021-11-26 上海师范大学 False recruitment information detection method based on cascade forest
CN113704409B (en) * 2021-08-31 2023-08-04 上海师范大学 False recruitment information detection method based on cascading forests

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination