CN112257757A

CN112257757A - Malicious sample detection method and system based on deep learning

Info

Publication number: CN112257757A
Application number: CN202011032770.6A
Authority: CN
Inventors: 弓睿智; 李林
Original assignee: Beijing Ruifuxin Technology Co ltd
Current assignee: Beijing Ruifuxin Technology Co ltd
Priority date: 2020-09-27
Filing date: 2020-09-27
Publication date: 2021-01-22

Abstract

The embodiment of the invention relates to the technical field of network security and discloses a malicious sample detection method and system based on deep learning. The method comprises the following steps: converting sample data into two-dimensional matrix data; training on the two-dimensional matrix data with a CNN to obtain a fully connected layer; performing feature classification on the fully connected layer to obtain a feature classification model; and performing malicious sample detection based on the feature classification model. The embodiment of the invention trains a convolutional neural network on constructed sample data of both normal samples and malicious samples to obtain a feature classification model that can clearly distinguish normal samples from malicious samples. This abandons the traditional approach of identifying normal samples and releasing only those, detects malicious samples more accurately, and improves detection accuracy while ensuring security.

Description

Malicious sample detection method and system based on deep learning

Technical Field

The invention relates to the technical field of network security, in particular to a malicious sample detection method and system based on deep learning.

Background

In network access, malicious samples related to network intrusion are rare compared with the huge volume of normal access traffic, so it is difficult for network security work to start from the malicious samples themselves. Instead, a normal sample model is usually built by analyzing a large number of normal samples, and any sample that does not conform to the normal sample model is regarded as malicious. This approach imposes many constraints on normal samples and may mistakenly detect some unusual normal samples as malicious; it is overly conservative on security and insufficiently accurate.

Disclosure of Invention

The embodiment of the invention discloses a malicious sample detection method and system based on deep learning, which start from the malicious samples themselves and train a convolutional neural network on the sample data to obtain a feature classification model that can clearly distinguish normal samples from malicious samples. This abandons the traditional approach of identifying normal samples and releasing only those, detects malicious samples more accurately, and improves detection accuracy while ensuring security.

The first aspect of the embodiment of the invention discloses a malicious sample detection method based on deep learning, which comprises the following steps:

converting sample data into two-dimensional matrix data;

training on the two-dimensional matrix data with a CNN (Convolutional Neural Network) to obtain a fully connected layer;

performing feature classification on the fully connected layer to obtain a feature classification model;

and performing malicious sample detection based on the feature classification model.

As an optional implementation manner, in the first aspect of this embodiment of the present invention, before the converting the sample data into the two-dimensional matrix data, the method further includes:

constructing a plurality of dynamic behavior samples and running them in a security analysis sandbox to obtain a dynamic behavior report;

converting the dynamic behavior report into a text document, extracting the valid fields, and removing duplicates to obtain a dynamic behavior text;

and constructing a name and a serial number for each dynamic behavior in the dynamic behavior text to obtain the sample data.

As an optional implementation manner, in the first aspect of the embodiment of the present invention, the converting sample data into two-dimensional matrix data includes:

representing the name of each dynamic behavior in the dynamic behavior lexicon by a vector of a preset length;

and combining the serial number of each dynamic behavior with the vector of each dynamic behavior to construct a two-dimensional matrix, thereby obtaining the two-dimensional matrix data.

As an optional implementation manner, in the first aspect of the embodiment of the present invention, the training on the two-dimensional matrix data with a CNN to obtain a fully connected layer includes:

convolving the two-dimensional matrix data with a plurality of convolution kernels to obtain column vectors;

taking the maximum value in the column vector corresponding to each piece of two-dimensional matrix data as the feature value of that two-dimensional matrix data;

and connecting the feature values of the pieces of two-dimensional matrix data to obtain a fully connected layer.

A second aspect of the embodiments of the present invention discloses a malicious sample detection system, including:

a matrix conversion unit, configured to convert sample data into two-dimensional matrix data;

a data training unit, configured to train on the two-dimensional matrix data with a CNN to obtain a fully connected layer;

a feature classification unit, configured to perform feature classification on the fully connected layer to obtain a feature classification model;

and a sample detection unit, configured to perform malicious sample detection based on the feature classification model.

As an optional implementation manner, in the second aspect of the embodiment of the present invention, the system further includes:

a sample operation unit, configured to construct a plurality of dynamic behavior samples and run them in a security analysis sandbox to obtain a dynamic behavior report before the matrix conversion unit converts the sample data into the two-dimensional matrix data;

a text generation unit, configured to convert the dynamic behavior report into a text document, extract the valid fields, and remove duplicates to obtain a dynamic behavior text;

and a data construction unit, configured to construct a name and a serial number for each dynamic behavior in the dynamic behavior text to obtain the sample data.

As an optional implementation manner, in a second aspect of the embodiment of the present invention, the matrix converting unit includes:

a vector characterization subunit, configured to represent the name of each dynamic behavior in the dynamic behavior lexicon by a vector of a preset length;

and a matrix construction subunit, configured to combine the serial number of each dynamic behavior with the vector of each dynamic behavior to construct a two-dimensional matrix, thereby obtaining the two-dimensional matrix data.

As an optional implementation manner, in a second aspect of the embodiment of the present invention, the data training unit includes:

a convolution subunit, configured to convolve the two-dimensional matrix data with a plurality of convolution kernels to obtain column vectors;

a feature value subunit, configured to take the maximum value in the column vector corresponding to each piece of two-dimensional matrix data as the feature value of that two-dimensional matrix data;

and a feature connection subunit, configured to connect the feature values of the pieces of two-dimensional matrix data to obtain a fully connected layer.

A third aspect of the embodiments of the present invention discloses a malicious sample detection system, including:

a memory storing executable program code;

a processor coupled with the memory;

the processor calls the executable program code stored in the memory to execute the malicious sample detection method based on deep learning disclosed by the first aspect of the embodiment of the invention.

A fourth aspect of the embodiments of the present invention discloses a computer-readable storage medium, which stores a computer program, where the computer program enables a computer to execute the method for detecting a malicious sample based on deep learning disclosed in the first aspect of the embodiments of the present invention.

A fifth aspect of embodiments of the present invention discloses a computer program product, which, when run on a computer, causes the computer to perform some or all of the steps of any one of the methods of the first aspect.

A sixth aspect of the present embodiment discloses an application publishing platform, where the application publishing platform is configured to publish a computer program product, where the computer program product is configured to, when running on a computer, cause the computer to perform part or all of the steps of any one of the methods in the first aspect.

Compared with the prior art, the embodiment of the invention has the following beneficial effects:

In the embodiment of the invention, sample data is converted into two-dimensional matrix data; a CNN is trained on the two-dimensional matrix data to obtain a fully connected layer; feature classification is performed on the fully connected layer to obtain a feature classification model; and malicious sample detection is performed based on the feature classification model. The embodiment of the invention trains a convolutional neural network on constructed sample data of both normal samples and malicious samples to obtain a feature classification model that can clearly distinguish normal samples from malicious samples. This abandons the traditional approach of identifying normal samples and releasing only those, detects malicious samples more accurately, and improves detection accuracy while ensuring security.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a malicious sample detection method based on deep learning according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a malicious sample detection system according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of another malicious sample detection system according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first", "second", "third", "fourth", and the like in the description and the claims of the present invention are used for distinguishing different objects, and are not used for describing a specific order. The terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The embodiment of the invention discloses a method and a system for detecting a malicious sample based on deep learning, which improve the detection accuracy of the malicious sample on the premise of ensuring the safety. The following detailed description is made with reference to the accompanying drawings.

Example one

Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a malicious sample detection method based on deep learning according to an embodiment of the present invention. As shown in fig. 1, the malicious sample detection method based on deep learning may include the following steps.

101. Convert the sample data into two-dimensional matrix data.

In the embodiment of the invention, the dynamic behavior samples are run in a security analysis sandbox to obtain sample data in JSON (JavaScript Object Notation) format.

As an optional implementation manner, before the sample data is converted into two-dimensional matrix data, a plurality of dynamic behavior samples are constructed and run in a security analysis sandbox to obtain a dynamic behavior report; the dynamic behavior report is converted into a text document, the valid fields are extracted, and duplicates are removed to obtain a dynamic behavior text; and a name and a serial number are constructed for each dynamic behavior in the dynamic behavior text to obtain the sample data. Specifically, a plurality of dynamic behavior samples are constructed and placed in folders according to their type (white sample, trojan, worm, backdoor, virus, advertisement, or unknown type). A Cuckoo sandbox reads the dynamic behavior samples of each type into a task list in turn and runs them, and an Application Programming Interface (API) is called to obtain the current running state of the sandbox. If the running state is "reported", the dynamic behavior samples of the current type have finished running and the Cuckoo sandbox has generated a dynamic behavior report. The API is then called to obtain the dynamic behavior report, which is named with its MD5 value and stored in JSON format, yielding the dynamic behavior reports of the dynamic behavior samples of every type.
Furthermore, each dynamic behavior report is converted into a text document, the valid fields in it (namely the API information called by the dynamic behavior sample) are extracted, and repeated API call information is de-duplicated, giving a concise dynamic behavior text. A corresponding name and serial number are then constructed for each dynamic behavior (namely each piece of API call information) in the dynamic behavior text of each type of dynamic behavior sample, giving the sample data. In this way, converting the format of the dynamic behavior report, extracting the valid information, and removing duplicates yields concise sample data, so that behavior detection for dynamic behavior samples is reduced to a simple text classification problem. This simplifies sample behavior detection at the data source and improves detection efficiency.
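
The extraction, de-duplication, and numbering described above can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the report layout (`behavior` → `processes` → `calls` → `api`) is an assumption about the sandbox's JSON output, and the function names `extract_behavior_text` and `build_sample_data` are hypothetical.

```python
import json


def extract_behavior_text(report):
    """Return the API names called by a sample, de-duplicated in first-seen order."""
    seen = set()
    names = []
    # Assumed report layout: report["behavior"]["processes"][i]["calls"][j]["api"]
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            api = call.get("api")
            if api and api not in seen:
                seen.add(api)
                names.append(api)
    return names


def build_sample_data(behavior_text, lexicon):
    """Pair each dynamic behavior name with a serial number from a shared lexicon."""
    sample = []
    for name in behavior_text:
        # Unseen names get the next free serial number. Numbering starts at 1
        # (an assumption) so that 0 stays free for padding.
        serial = lexicon.setdefault(name, len(lexicon) + 1)
        sample.append((serial, name))
    return sample


# Hand-written stand-in for a sandbox report, for illustration only.
report = json.loads("""
{"behavior": {"processes": [
    {"calls": [{"api": "NtCreateFile"}, {"api": "NtWriteFile"}, {"api": "NtCreateFile"}]},
    {"calls": [{"api": "RegOpenKeyExW"}]}
]}}
""")

lexicon = {}
behavior_text = extract_behavior_text(report)
sample_data = build_sample_data(behavior_text, lexicon)
print(sample_data)
```

Sharing one `lexicon` dictionary across all reports keeps the serial number of a given API name the same in every sample, which is what lets the later matrix conversion treat the name as a lookup key.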

In the embodiment of the invention, the extracted sample data may then be preprocessed for training.

As an optional implementation manner, a vector of a preset length is used to represent the name of each dynamic behavior in the dynamic behavior lexicon, and the serial number of each dynamic behavior is combined with the vector of each dynamic behavior to construct a two-dimensional matrix, thereby obtaining the two-dimensional matrix data. Specifically, the sample data is converted into matrix data that the CNN can train on directly: the name of each dynamic behavior (namely each piece of API call information) is represented by a vector of a preset length, a two-dimensional matrix is constructed from the vector of the dynamic behavior and its corresponding serial number, and the two-dimensional matrix data is obtained by combining the two-dimensional matrices corresponding to all dynamic behaviors.
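
One way to read this conversion is as a lookup table in which the serial number selects a fixed-length vector and the selected vectors are stacked row by row. The sketch below follows that reading under stated assumptions: the random vectors stand in for whatever vector representation the embodiment uses, and `vector_length`, `max_behaviors`, the zero padding row, and the name `build_matrix` are all hypothetical.

```python
import numpy as np


def build_matrix(serials, vocab_size, vector_length=8, max_behaviors=16, seed=0):
    """Stack one fixed-length vector per dynamic behavior into a 2-D matrix."""
    rng = np.random.default_rng(seed)
    # One row per serial number; row 0 is reserved for padding (assumption).
    table = rng.normal(size=(vocab_size + 1, vector_length))
    table[0] = 0.0
    # Truncate or zero-pad so every sample yields a matrix of the same shape.
    padded = list(serials)[:max_behaviors]
    padded += [0] * (max_behaviors - len(padded))
    return table[np.array(padded)]


matrix = build_matrix([1, 2, 3], vocab_size=3)
print(matrix.shape)
```

Fixing the matrix shape with padding is a design choice of this sketch: it lets samples with different numbers of API calls share one set of convolution kernels.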

102. Train on the two-dimensional matrix data with a CNN to obtain a fully connected layer.

In the embodiment of the invention, the convolutional neural network is trained on the two-dimensional matrix data and converts it into a fully connected layer that represents the data by its features.

As an optional implementation manner, the two-dimensional matrix data is convolved with a plurality of convolution kernels to obtain column vectors; the maximum value in the column vector corresponding to each piece of two-dimensional matrix data is taken as the feature value of that two-dimensional matrix data; and the feature values of the pieces of two-dimensional matrix data are connected to obtain a fully connected layer. Specifically, a plurality of 3 × 3 convolution kernels are selected to convolve the two-dimensional matrix data into one-dimensional column vectors; the maximum value of the column vector of each piece of two-dimensional matrix data (namely its most salient feature point) is taken as the feature value of that two-dimensional matrix data; and the feature values are connected in sequence to obtain the fully connected layer corresponding to the sample data. The fully connected layer data is stored in CSV format; it maps the implicit features of the sample data and makes the sample data easier to classify.
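
The convolution, maximum selection, and connection steps can be illustrated with plain NumPy. This is a sketch under assumptions, not the trained network of the embodiment: the kernels here are fixed rather than learned, the "convolution" is the sliding-window cross-correlation commonly used in CNN layers, and flattening each feature map into a column vector is one possible reading of the text. The names `conv2d_valid` and `feature_layer` are hypothetical.

```python
import numpy as np


def conv2d_valid(matrix, kernel):
    """Slide the kernel over the matrix without padding (cross-correlation)."""
    kh, kw = kernel.shape
    out_h = matrix.shape[0] - kh + 1
    out_w = matrix.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(matrix[i:i + kh, j:j + kw] * kernel)
    return out


def feature_layer(matrix, kernels):
    """Return one feature value per kernel, connected into a single vector."""
    features = []
    for kernel in kernels:
        # Assumption: the feature map is flattened into a one-dimensional column vector.
        column = conv2d_valid(matrix, kernel).reshape(-1)
        # The maximum of the column vector is kept as the feature value.
        features.append(column.max())
    return np.array(features)


matrix = np.arange(16, dtype=float).reshape(4, 4)
kernels = [np.ones((3, 3)), np.zeros((3, 3))]
print(feature_layer(matrix, kernels))
```

Keeping only the maximum per kernel makes the length of the resulting vector depend on the number of kernels alone, not on the size of the input matrix.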

103. Perform feature classification on the fully connected layer to obtain a feature classification model.

In the embodiment of the invention, Softmax logistic regression is used to compute the weight of each class of features in the fully connected layer, and the probability value of each class of features is calculated from its share of the weights. This gives the features corresponding to each type of sample data together with their probability distribution, from which the feature classification model corresponding to the sample data is constructed.
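
As a reminder of what the Softmax step computes, the sketch below turns per-class scores into a probability distribution. The weight matrix and bias are hypothetical placeholders for parameters that would be learned during training; the function names are illustrative only.

```python
import numpy as np


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the maximum for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def classify(features, weights, bias):
    """Softmax regression: one linear score per class, then normalization."""
    return softmax(weights @ features + bias)


# Placeholder parameters: 3 classes, 2 features (assumed values, not trained).
features = np.array([1.0, 2.0])
weights = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
bias = np.zeros(3)
probabilities = classify(features, weights, bias)
print(probabilities, probabilities.argmax())
```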

104. Perform malicious sample detection based on the feature classification model.

In the embodiment of the invention, the features of a received unknown dynamic behavior sample are looked up and analyzed in the feature classification model so that the unknown dynamic behavior sample can be classified, and whether it is a malicious sample is then determined from its classification type. The approach of releasing normal samples through a normal sample model is thus abandoned and malicious samples are detected directly, which avoids mistakenly intercepting some normal samples and makes detection more accurate while ensuring security.
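
The mapping from classification type to a malicious or normal verdict might look like the sketch below. The class list follows the sample types named earlier in this description; treating every type other than the white sample, including the unknown type, as malicious is an assumption of this sketch, and `detect` is a hypothetical name.

```python
CLASS_NAMES = ["white", "trojan", "worm", "backdoor", "virus", "advertisement", "unknown"]


def detect(probabilities, class_names=CLASS_NAMES):
    """Return (predicted type, is_malicious) for one unknown sample."""
    # Pick the class with the highest probability.
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    label = class_names[best]
    # Assumption: only the white-sample class counts as normal.
    return label, label != "white"


print(detect([0.70, 0.10, 0.05, 0.05, 0.05, 0.03, 0.02]))
print(detect([0.10, 0.60, 0.10, 0.05, 0.05, 0.05, 0.05]))
```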

In the embodiment of the invention, independently of the feature classification model built with the convolutional neural network, a second classification model is built on the same batch of sample data with the XGBoost algorithm, and the two classification models are used together to detect and re-check unknown dynamic behavior samples.

As an optional implementation manner, before the security analysis sandbox runs the dynamic behavior samples and the sample data is obtained, VirusTotal is used to analyze the MD5 value of each dynamic behavior sample and generate an analysis report carrying a label for that MD5 value; the analysis report is read and the dynamic behavior samples are classified and stored according to the label, giving labeled sample data. Multi-level branch nodes are set and the sample data is iteratively split according to label features to obtain a plurality of feature clusters; a fitting prediction score is calculated for each piece of sample data; and classification training is performed on the type of the sample data with the highest fitting prediction score in each feature cluster to obtain a prediction classification model. When the feature classification model detects an unknown dynamic behavior sample, the prediction classification model is also used to detect the same sample, and only when the results of the two models agree is a detected normal sample released or a detected malicious sample intercepted, which further improves detection accuracy.
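
The agreement check between the two models can be sketched as a small decision function. The description only states what happens when the two results agree; returning `"review"` on disagreement, and requiring the two predicted types to match exactly, are assumptions of this sketch. `dual_decision` and the label strings are hypothetical.

```python
def dual_decision(cnn_label, xgb_label, normal_label="white"):
    """Combine the verdicts of the feature and prediction classification models."""
    if cnn_label != xgb_label:
        # Assumption: the source does not say how a disagreement is handled.
        return "review"
    # Both models agree: release a normal sample, intercept a malicious one.
    return "release" if cnn_label == normal_label else "intercept"


print(dual_decision("white", "white"))
print(dual_decision("trojan", "trojan"))
print(dual_decision("white", "worm"))
```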

Therefore, the malicious sample detection method based on deep learning described in fig. 1 starts from the malicious samples themselves and trains a convolutional neural network on the sample data to obtain a feature classification model that can clearly distinguish normal samples from malicious samples. This abandons the traditional approach of identifying normal samples and releasing only those, detects malicious samples more accurately, and improves detection accuracy while ensuring security.

Example two

Referring to fig. 2, fig. 2 is a schematic structural diagram of a malicious sample detection system according to an embodiment of the present invention. As shown in fig. 2, the malicious sample detection system may include:

a sample operation unit 201, configured to construct a plurality of dynamic behavior samples and run them in a security analysis sandbox to obtain a dynamic behavior report;

a text generation unit 202, configured to convert the dynamic behavior report into a text document, extract the valid fields, and remove duplicates to obtain a dynamic behavior text;

a data construction unit 203, configured to construct a name and a serial number for each dynamic behavior in the dynamic behavior text to obtain sample data;

a matrix conversion unit 204, configured to convert the sample data into two-dimensional matrix data;

a data training unit 205, configured to train on the two-dimensional matrix data with a CNN to obtain a fully connected layer;

a feature classification unit 206, configured to perform feature classification on the fully connected layer to obtain a feature classification model;

and a sample detection unit 207, configured to perform malicious sample detection based on the feature classification model.

The matrix conversion unit 204 includes:

a vector characterization subunit 2041, configured to represent the name of each dynamic behavior in the dynamic behavior lexicon by a vector of a preset length;

and a matrix construction subunit 2042, configured to combine the serial number of each dynamic behavior with the vector of each dynamic behavior to construct a two-dimensional matrix, thereby obtaining the two-dimensional matrix data.

The data training unit 205 includes:

a convolution subunit 2051, configured to convolve the two-dimensional matrix data with a plurality of convolution kernels to obtain column vectors;

a feature value subunit 2052, configured to take the maximum value in the column vector corresponding to each piece of two-dimensional matrix data as the feature value of that two-dimensional matrix data;

and a feature connection subunit 2053, configured to connect the feature values of the pieces of two-dimensional matrix data to obtain a fully connected layer.

As an optional implementation manner, before the sample data is converted into two-dimensional matrix data, the sample operation unit 201 constructs a plurality of dynamic behavior samples and runs them in the security analysis sandbox to obtain a dynamic behavior report; the text generation unit 202 converts the dynamic behavior report into a text document, extracts the valid fields, and removes duplicates to obtain a dynamic behavior text; and the data construction unit 203 constructs a name and a serial number for each dynamic behavior in the dynamic behavior text to obtain the sample data. Specifically, the sample operation unit 201 constructs a plurality of dynamic behavior samples and places them in folders according to their type (white sample, trojan, worm, backdoor, virus, advertisement, or unknown type). It uses a Cuckoo sandbox to read the dynamic behavior samples of each type into a task list in turn and run them, and calls an Application Programming Interface (API) to obtain the current running state of the sandbox. If the running state is "reported", the dynamic behavior samples of the current type have finished running and the Cuckoo sandbox has generated a dynamic behavior report. The sample operation unit 201 then calls the API to obtain the dynamic behavior report, names it with its MD5 value, and stores it in JSON format, yielding the dynamic behavior reports of the dynamic behavior samples of every type.
Further, the text generation unit 202 converts each dynamic behavior report into a text document, extracts the valid fields in it (namely the API information called by the dynamic behavior sample), and de-duplicates repeated API call information to obtain a concise dynamic behavior text. The data construction unit 203 then constructs a corresponding name and serial number for each dynamic behavior (namely each piece of API call information) in the dynamic behavior text of each type of dynamic behavior sample to obtain the sample data. In this way, converting the format of the dynamic behavior report, extracting the valid information, and removing duplicates yields concise sample data, so that behavior detection for dynamic behavior samples is reduced to a simple text classification problem. This simplifies sample behavior detection at the data source and improves detection efficiency.

As an optional implementation manner, the vector characterization subunit 2041 represents the name of each dynamic behavior in the dynamic behavior lexicon by a vector of a preset length, and the matrix construction subunit 2042 combines the serial number of each dynamic behavior with the vector of each dynamic behavior to construct a two-dimensional matrix, thereby obtaining the two-dimensional matrix data. Specifically, the sample data is converted into matrix data that the CNN can train on directly: the vector characterization subunit 2041 represents the name of each dynamic behavior (namely each piece of API call information) by a vector of a preset length, a two-dimensional matrix is constructed from the vector of the dynamic behavior and its corresponding serial number, and the matrix construction subunit 2042 obtains the two-dimensional matrix data by combining the two-dimensional matrices corresponding to all dynamic behaviors.

As an optional implementation manner, the convolution subunit 2051 convolves the two-dimensional matrix data with a plurality of convolution kernels to obtain column vectors; the feature value subunit 2052 takes the maximum value in the column vector corresponding to each piece of two-dimensional matrix data as the feature value of that two-dimensional matrix data; and the feature connection subunit 2053 connects the feature values of the pieces of two-dimensional matrix data to obtain a fully connected layer. Specifically, the convolution subunit 2051 selects a plurality of 3 × 3 convolution kernels to convolve the two-dimensional matrix data into one-dimensional column vectors; the feature value subunit 2052 takes the maximum value of the column vector of each piece of two-dimensional matrix data (namely its most salient feature point) as the feature value of that two-dimensional matrix data; and the feature connection subunit 2053 connects the feature values in sequence to obtain the fully connected layer corresponding to the sample data. The fully connected layer data is stored in CSV format; it maps the implicit features of the sample data and makes the sample data easier to classify.

As an optional implementation manner, the feature classification unit 206 uses Softmax logistic regression to compute the weight of each class of features in the fully connected layer and calculates the probability value of each class of features from its share of the weights. This gives the features corresponding to each type of sample data together with their probability distribution, from which the feature classification model corresponding to the sample data is constructed.

As an optional implementation manner, for a received unknown dynamic behavior sample, the sample detection unit 207 looks up and analyzes the features of the unknown dynamic behavior sample in the feature classification model so that the sample can be classified, and then determines from its classification type whether it is a malicious sample. The approach of releasing normal samples through a normal sample model is thus abandoned and malicious samples are detected directly, which avoids mistakenly intercepting some normal samples and makes detection more accurate while ensuring security.

Example three

Referring to fig. 3, fig. 3 is a schematic structural diagram of another malicious sample detection system according to an embodiment of the present invention. As shown in fig. 3, the malicious sample detection system may include:

a memory 301 storing executable program code;

a processor 302 coupled to the memory 301;

the processor 302 calls the executable program code stored in the memory 301 to execute the malicious sample detection method based on deep learning of fig. 1.

The embodiment of the invention discloses a computer-readable storage medium which stores a computer program, wherein the computer program enables a computer to execute the malicious sample detection method based on deep learning of figure 1.

Embodiments of the present invention also disclose a computer program product, wherein, when the computer program product is run on a computer, the computer is caused to execute part or all of the steps of the method as in the above method embodiments.

It will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, where the storage medium includes Read-Only Memory (ROM), Random Access Memory (RAM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), One-time Programmable Read-Only Memory (OTPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM), or other memory such as magnetic disk memory or tape memory, or any other computer-readable medium that can be used to carry or store data.

The malicious sample detection method and system based on deep learning disclosed by the embodiments of the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is only intended to help understand the method and core idea of the invention. Meanwhile, for a person skilled in the art, the specific implementations and the scope of application may vary according to the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.

Claims

1. A malicious sample detection method based on deep learning is characterized by comprising the following steps:

converting the sample data into two-dimensional matrix data;

2. The method of claim 1, wherein prior to said converting the sample data into two-dimensional matrix data, the method further comprises:

3. The method of claim 2, wherein said converting the sample data into two-dimensional matrix data comprises:

4. The method of claim 1, wherein the training the two-dimensional matrix data using CNN to obtain a fully connected layer comprises:

5. A malicious sample detection system based on deep learning, the system comprising:

6. The system of claim 5, further comprising:

7. The system of claim 6, wherein the matrix conversion unit comprises:

8. The system of claim 5, wherein the data training unit comprises:

9. The deep learning based malicious sample detection system according to any one of claims 5-8, wherein the system further comprises:

a memory storing executable program code;

a processor coupled with the memory;

the processor calls the executable program codes stored in the memory to execute the malicious sample detection method based on deep learning of any one of claims 1-4.

10. A computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute the deep learning based malicious sample detection method as claimed in any one of claims 1 to 4.