CN117640190A - Botnet detection method based on multi-mode stacking automatic encoder - Google Patents
- Publication number: CN117640190A
- Application number: CN202311596885.1A
- Authority: CN (China)
- Legal status: Pending (an assumption, not a legal conclusion; Google has not performed a legal analysis)
Classifications
- Y02D30/50 — Reducing energy consumption in wire-line communication networks, e.g. low power modes or reduced link rate (climate change mitigation technologies in ICT)
Abstract
The invention discloses a botnet detection method based on a multi-modal stacked automatic encoder. The method comprises the following steps: acquiring executable files of application programs; performing dynamic analysis and static analysis on a data set containing benign programs and bot programs, and extracting flow-based dynamic features and static features based on printable string information (PSI) graphs; pre-training two stacked automatic encoders to encode the flow-based and graph-based features respectively and extract deep features; fusing the dynamic and static features with a multi-modal automatic encoder; fine-tuning the multi-modal stacked automatic encoder model; and using the encoder of the trained multi-modal stacked automatic encoder model as a feature extractor, feeding the output of the shared hidden layer into a softmax layer for bot detection. Through the improved multi-modal stacked automatic encoder, the invention automatically fuses static and dynamic features, learns the complex relationship between the two modalities, fully exploits the advantages of hybrid analysis, and improves the accuracy of botnet detection.
Description
Technical Field
The invention belongs to the field of network security and machine learning, and particularly relates to a botnet detection method based on a multi-mode stacking automatic encoder.
Background
For data acquisition and feature extraction, the botnet detection field mainly adopts two methods: static analysis and dynamic analysis. Static methods extract static features by analyzing the binary code of bot instances without executing the malware. Dynamic methods execute a given bot instance, typically in a sandboxed environment, and extract dynamic features that characterize botnet behavior. Most existing botnet detection methods rely on static features or dynamic features alone. Static analysis is simple and fast, but is susceptible to obfuscation techniques such as encryption. In contrast, dynamic analysis reflects the behavior of the program as it runs, is relatively resistant to obfuscation, and generalizes better to unknown attacks and attack variants, although its data collection process is time-consuming. Static analysis excels at capturing the structure of malware, while dynamic analysis can detect obfuscated malware. Therefore, merging the two types of features in a suitable way can improve the accuracy of botnet detection. Although previous methods have considered fusing multiple features, they perform feature-level or single-modality fusion; simply concatenating the features cannot learn the complex relationship between the two feature types, so the advantages of hybrid analysis are not fully exploited.
Disclosure of Invention
In order to solve the above problems, the invention provides a botnet detection method based on a multi-modal stacked automatic encoder. The method combines the multi-modal features extracted by static analysis and dynamic analysis, learns the complex relationship between the two modalities with a stacked multi-modal automatic encoder, fully exploits the advantages of hybrid analysis, and achieves stronger detection of bot programs.
In order to achieve the technical purpose and the technical effect, the invention is realized by the following technical scheme:
a botnet detection method based on a multi-mode stacking automatic encoder comprises the following steps:
(1) Acquiring an executable file of an application program and storing the executable file in an ELF format;
(2) Respectively carrying out dynamic analysis and static analysis on a data set containing benign programs and bots, and extracting dynamic characteristics based on streams and static characteristics based on Printable String Information (PSI) graphs;
(3) Pre-training two stacked automatic encoders (SAE) to encode the dynamic and static features respectively and extract deep complex features;
(4) Fusing the dynamic and static features based on a multi-modal automatic encoder (MAE);
(5) Fine-tuning the multi-modal stacked automatic encoder (MSAE);
(6) And taking the encoder of the MSAE model with complete training as a feature extractor, and classifying the shared hidden layer output of the model as the input of a softmax layer to realize the detection of the bot program.
The following alternatives are provided not as additional limitations on the overall scheme above, but only as further additions or preferences; each may be combined with the overall scheme individually, or multiple alternatives may be combined with one another, provided no technical or logical contradiction arises.
Preferably, the step (2) of dynamically analyzing the data set containing benign programs and bots specifically includes the following steps:
1) Analyzing network behaviors of the ELF file through a Cuckoo sandbox, and recording network traffic in a pcap format;
2) According to the five-tuple {source IP address, source port number, destination IP address, destination port number, protocol}, dividing the network traffic recorded in the pcap file into flows, aggregating data packets with the same five-tuple into flow data f = {p_1, p_2, ..., p_i}, where p_i represents a data packet having that five-tuple;
3) Aggregating the flow data again: the flow data collected from the same program run are merged by set union to form the flow record of the corresponding ELF file, F = f_1 ∪ f_2 ∪ … ∪ f_k;
4) Extracting statistical features based on the flow records, including the average, maximum and minimum of the total number of data packets contained in a flow; the average, maximum and minimum of the communication duration of a flow; and the average, maximum and minimum of the number of bytes contained in the packets of a flow, for 9 feature dimensions in total, obtaining a flow-based dynamic feature set X_d = {x_d^(1), x_d^(2), ..., x_d^(n)}, where n represents the number of ELF samples and x_d^(i) represents the flow-based features extracted from the i-th ELF sample.
Preferably, in the step (2), the static analysis is performed on the data set containing the benign program and the zombie program, and specifically the method comprises the following steps:
1) Checking whether the ELF file is packed using the packer-detection tool DiE, and then unpacking and disassembling the binary using UPX and IDAPro;
2) Constructing a Function Call Graph (FCG) and a Printable String Information (PSI) graph according to a function caller-callee relationship in the assembly code;
3) Thereafter, the graph2vec graph embedding technique is used to convert the PSI graph into numerical vector data, obtaining a static feature set X_s = {x_s^(1), x_s^(2), ..., x_s^(n)}, where x_s^(i) represents the PSI-graph-based features extracted from the i-th ELF sample.
Preferably, the function call graph is defined as a directed graph G = (V, E), composed of a vertex set V = {v_1, v_2, ..., v_m} and an edge set E = {e_12, e_13, ..., e_ij}, where m represents the number of vertices and e_ij represents function v_i calling function v_j. Vertices in the FCG correspond to the functions contained in the assembly code of the program, and edges represent caller-callee relationships between two functions.
Preferably, the construction process of the function call graph is summarized as follows:
a) Extracting a set of identified functions from the assembly code;
b) Then determining an entry point function;
c) Building the FCG using a breadth-first search algorithm: if functions v_i and v_j are identified as having a caller-callee relationship, vertices v_i and v_j are added to the vertex set V and edge e_ij is added to the edge set E.
Preferably, in order to minimize computational complexity, the PSI graph is constructed by selecting, from the FCG, the functions and relationships close to the bot program's operation steps, specifically:
a) Extracting all printable string information (PSI) present in the binary file through an IDAPro plug-in, and selecting PSI at least three characters in length;
b) Selecting the PSI containing important semantic information (which may reveal the intention of an attacker) to form a set P = {psi_1, psi_2, ..., psi_k};
c) For a vertex v_i in the function call graph, if the function represented by v_i contains at least one important printable string psi_i, adding vertex v_i to the vertex set V' of the PSI graph and continuing with step d); otherwise skipping step d);
d) Traversing all edges e_ij representing call relations of function v_i: if function v_j also contains at least one psi_i and e_ij ∉ E', adding vertex v_j to the vertex set V' of the PSI graph and edge e_ij to the edge set E' of the PSI graph;
e) Repeating steps c) and d) until all vertices in the function call graph have been traversed, finally outputting the PSI graph G' = (V', E').
Preferably, two SAE are pre-trained in the step (3), specifically:
pre-training an SAE by using dynamic characteristics, wherein an encoder consists of two full connection layers and a ReLU activation function, and the decoder and the encoder are of symmetrical structures;
the other SAE is pre-trained by using static characteristics, the encoder consists of two convolution layers, a full connection layer and a ReLU activation function, and the decoder and the encoder are also symmetrical structures;
the pre-trained two SAEs are used to encode dynamic and static data, respectively, to obtain a potential representation of the two modality data.
Preferably, in the step (4), dynamic features and static features are fused based on the multi-mode automatic encoder, specifically:
the final hidden layer outputs of the encoders of the two pre-trained SAE are connected in series to be used as the input of the multi-mode automatic encoder;
fusing the potential representations of the two modal data based on a hidden layer of the multi-modal automatic encoder to generate a shared potential representation;
finally, a stacked multi-modal automatic encoder (MSAE) with all pre-training layers and shared hidden layers is constructed.
Preferably, the fine tuning process in the step (5) specifically includes:
the goal of an automatic encoder is to minimize reconstruction errors of the input and output, let the shared hidden layer learn a shared potential representation of the bimodal data, define a loss function as:
wherein,and->Respectively the dynamic and static feature vectors of the input,/->And->Is the corresponding reconstructed vector of MSAE output;
and fixing parameters of the pre-training layer, training by adopting a gradient descent algorithm, and updating only the weight and the parameters of the shared hidden layer.
Preferably, the encoder of the MSAE model is used as a feature extractor in the step (6), and classified by using a softmax layer, specifically:
unfolding the stacked automatic encoder, adding a softmax output layer on top of the shared hidden layer, and outputting the corresponding predictive label of the ith input
Where W represents the weight of the softmax layer, b represents the bias of the softmax layer, T is the number of object tag categories, z (i) Is the ith output of the shared hidden layer.
Preferably, in order to improve bot detection accuracy, the classification error is also added to the loss function of the fine-tuning stage and minimized with a cross-entropy loss:

L_c = −(1/n) Σ_{i=1}^{n} Σ_{t=1}^{T} y_t^(i) log ŷ_t^(i)

where y^(i) is the true label of the i-th input sample and ŷ^(i) is the corresponding predicted label;
the final MSAE is minimized with respect to the reconstruction error L r And classification error L c Is a weighted sum of:
L=αL r +βL c +λR (5)
wherein R is a regularization term, which is realized by carrying out L2 regularization on the weights of all layers in the network; alpha, beta and lambda are weighting factors.
Preferably, the weighting factors α and β are adaptively calculated using a softmax function.
compared with the prior art, the invention has the following beneficial effects:
1. The invention combines static and dynamic analysis methods, extracting flow-based features and PSI-graph-based features to detect bot programs; by exploiting the complementary advantages of the two analysis methods, it achieves higher accuracy than either feature type alone.
2. The invention utilizes the strong autonomous learning capability of the multi-modal automatic encoder to automatically fuse static and dynamic features through iterative training of the network model; compared with simply fusing the two modalities by direct concatenation, it can extract the complex relationship between the two modalities and fully exploit the advantages of hybrid analysis.
3. The invention trains the MSAE model by pre-training and fine-tuning, which does not require a large labeled data set, and adds a classification-error penalty to the loss function of the fine-tuning stage, further enhancing the performance of the network model.
Drawings
FIG. 1 is a training and testing flow diagram of one embodiment of the present invention;
FIG. 2 is a schematic diagram of a bot detection scheme according to one embodiment of the present invention;
FIG. 3 is a diagram of a pre-training network and a fine-tuning network architecture according to an embodiment of the present invention;
fig. 4 is a schematic diagram of an MSAE network structure according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. The described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The principle of application of the invention is described in detail below with reference to the accompanying drawings.
In the embodiment of the invention, the multi-modal features extracted by static and dynamic analysis are combined: a stacked multi-modal automatic encoder model is constructed from the pre-trained stacked automatic encoders and the fine-tuned multi-modal automatic encoder and used to automatically fuse static and dynamic features, after which accurate bot detection is realized based on the fused features. The embodiment provides a botnet detection method based on a multi-modal stacked automatic encoder which, as shown in fig. 1, comprises the following steps:
(1) An executable file of the application program is obtained and stored in an ELF format.
(2) And respectively carrying out dynamic analysis and static analysis on the data set containing the benign program and the zombie program, and extracting dynamic characteristics based on the stream and static characteristics based on the PSI graph.
(2.1) the specific steps of dynamic feature extraction are as follows:
(2.1.1) network behavior analysis is performed on the ELF file through a Cuckoo sandbox, and network traffic is recorded in a pcap format.
(2.1.2) According to the five-tuple {source IP address, source port number, destination IP address, destination port number, protocol}, dividing the network traffic recorded in the pcap file into flows and aggregating data packets with the same five-tuple into flow data f = {p_1, p_2, ..., p_i}, where p_i represents a data packet having that five-tuple.
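As an illustrative sketch of this flow-division step, packets sharing a five-tuple can be grouped into flows; the packet record layout (`(src_ip, src_port, dst_ip, dst_port, proto, timestamp, bytes)`) is an assumed format, not one prescribed by the patent:

```python
from collections import defaultdict

# Hypothetical packet records: (src_ip, src_port, dst_ip, dst_port, proto, ts, nbytes).
def group_flows(packets):
    """Aggregate packets sharing the same five-tuple into one flow."""
    flows = defaultdict(list)
    for pkt in packets:
        five_tuple = pkt[:5]  # {src IP, src port, dst IP, dst port, protocol}
        flows[five_tuple].append(pkt)
    return dict(flows)

packets = [
    ("10.0.0.1", 1234, "10.0.0.2", 80, "TCP", 0.0, 60),
    ("10.0.0.1", 1234, "10.0.0.2", 80, "TCP", 0.1, 1500),
    ("10.0.0.3", 5353, "224.0.0.251", 5353, "UDP", 0.2, 120),
]
flows = group_flows(packets)  # 2 distinct five-tuples -> 2 flows
```

In practice the packets would be parsed from the pcap file produced by the sandbox before grouping.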
(2.1.3) Re-aggregating the flow data: the flow data collected from the same program run are merged by set union to form the flow record of the corresponding ELF file, F = f_1 ∪ f_2 ∪ … ∪ f_k.
(2.1.4) Extracting statistical features based on the flow records, including the average, maximum and minimum of the total number of data packets contained in a flow; the average, maximum and minimum of the communication duration of a flow; and the average, maximum and minimum of the number of bytes contained in the packets of a flow, for 9 feature elements in total, obtaining a flow-based dynamic feature set X_d = {x_d^(1), x_d^(2), ..., x_d^(n)}, where n represents the number of ELF samples and x_d^(i) represents the flow-based features extracted from the i-th ELF sample.
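The nine statistical features can be sketched as follows; the flow-record input format (each flow as a list of `(timestamp, bytes)` packets) is again an illustrative assumption:

```python
import statistics

def flow_record_features(flow_record):
    """Compute the 9 statistical features of a flow record:
    mean/max/min of packets per flow, of flow duration, and of bytes per packet."""
    pkt_counts = [len(f) for f in flow_record]
    durations = [max(ts for ts, _ in f) - min(ts for ts, _ in f) for f in flow_record]
    byte_counts = [n for f in flow_record for _, n in f]
    feats = []
    for values in (pkt_counts, durations, byte_counts):
        feats += [statistics.mean(values), max(values), min(values)]
    return feats  # 9-dimensional dynamic feature vector x_d^(i)

record = [[(0.0, 60), (0.1, 1500)], [(0.2, 120)]]
feats = flow_record_features(record)
```

Each ELF sample yields one such 9-dimensional vector, forming the set X_d.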
(2.2) the specific steps of static feature extraction are as follows:
(2.2.1) Checking whether the ELF file is packed using the packer-detection tool DiE, and then unpacking and disassembling the binary using UPX and IDAPro.
(2.2.2) constructing a Function Call Graph (FCG) and a Printable String Information (PSI) graph according to the function caller-callee relationship in the assembly code.
The function call graph is defined as a directed graph G = (V, E), composed of a vertex set V = {v_1, v_2, ..., v_m} and an edge set E = {e_12, e_13, ..., e_ij}, where m represents the number of vertices and e_ij represents function v_i calling function v_j. Vertices in the FCG correspond to the unique functions contained in the assembly code of the program, and edges represent caller-callee relationships between two functions. The present embodiment uses an existing function call graph construction method, based on a breadth-first search algorithm, constructing the FCG with a FIFO function queue, specifically:
a) Extracting the boundaries of the identified functions from the assembly code, and storing the functions in a set named FunSet;
b) Then extracting all entry-point functions, storing them in EntryFunSet, and adding all entry-point functions to the vertex set V;
c) Initializing a function queue with the entry-point functions, and setting each queued function's enqueue flag "enQFlag" to true so as to prevent the same vertex from being enqueued repeatedly;
d) While the queue is not empty, dequeuing the head element of the queue, treating the dequeued function v_i as a caller, and then traversing the body of v_i to fetch its set of callees;
e) For each fetched callee, traversing the callee set and checking whether callee v_j already exists in the graph; if not, adding v_j to the vertex set V; then checking whether an edge e_ij from caller v_i to callee v_j already exists in the graph, and if not, adding e_ij to the edge set E;
f) Checking whether the callee has been enqueued; if not, setting its enqueue flag "enQFlag" to true and appending it to the tail of the queue;
g) Repeating steps d), e) and f) until the queue is empty.
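The FIFO-queue breadth-first construction of steps a)-g) might look like this minimal sketch, where `callees_of` stands in for the callee sets that would be extracted from the assembly code:

```python
from collections import deque

def build_fcg(entry_points, callees_of):
    """Breadth-first FCG construction with a FIFO queue (steps a)-g))."""
    V, E = set(entry_points), set()
    queue = deque(entry_points)
    enqueued = set(entry_points)  # plays the role of the 'enQFlag' flags
    while queue:
        caller = queue.popleft()            # dequeue head element v_i
        for callee in callees_of.get(caller, ()):
            V.add(callee)                   # add unseen callee vertex v_j
            E.add((caller, callee))         # add edge e_ij
            if callee not in enqueued:      # prevent re-enqueuing a vertex
                enqueued.add(callee)
                queue.append(callee)
    return V, E

# Hypothetical call relations extracted from disassembly:
calls = {"main": ["connect", "recv"], "connect": ["send"], "recv": ["send"]}
V, E = build_fcg(["main"], calls)
```

Sets make the "already in the graph" checks of step e) constant-time, which matters for the large FCGs the text mentions.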
The function call graph is intended to represent all possible runs of a program. FCGs are therefore often complex, with a large number of nodes and edges, which requires longer computation time and more memory. Although all call relationships of the program are represented in the FCG, some call relationships may never occur during actual running of the program. In order to minimize the computational complexity, the present embodiment selects functions and relationships close to the operation steps of the zombie program from the FCG to construct the PSI graph, specifically:
a) Extracting all printable string information (PSI) present in the binary file through the IDAPro plug-in and, to balance detection precision against computational complexity, selecting PSI at least three characters in length;
b) Then selecting the PSI containing important semantic information (which may reveal the intention of an attacker) to form a set P = {psi_1, psi_2, ..., psi_k};
c) For a vertex v_i in the function call graph, if the function represented by v_i contains at least one important printable string psi_i, adding vertex v_i to the vertex set V' of the PSI graph and continuing with step d); otherwise skipping step d);
d) Traversing all edges e_ij representing call relations of function v_i: if function v_j also contains at least one psi_i and e_ij ∉ E', adding vertex v_j to the vertex set V' of the PSI graph and edge e_ij to the edge set E' of the PSI graph;
e) Repeating steps c) and d) until all vertices in the function call graph have been traversed, finally outputting the PSI graph G' = (V', E').
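A simplified sketch of this PSI-graph projection: it keeps exactly the FCG vertices and edges whose functions contain at least one important string. The traversal order differs from the stepwise description above but yields the same kind of subgraph; the input formats (`strings_of` mapping functions to their printable strings) are assumed for illustration:

```python
def build_psi_graph(fcg_edges, strings_of, important_psi):
    """Project the FCG onto functions containing at least one important PSI."""
    def has_psi(fn):
        return any(s in important_psi for s in strings_of.get(fn, ()))
    V2 = {v for edge in fcg_edges for v in edge if has_psi(v)}          # V'
    E2 = {(vi, vj) for (vi, vj) in fcg_edges if has_psi(vi) and has_psi(vj)}  # E'
    return V2, E2

# Hypothetical FCG and per-function string tables:
fcg_edges = {("main", "connect"), ("connect", "send"), ("recv", "send")}
strings_of = {"main": ["irc.example.com"], "connect": ["PING"]}
important = {"irc.example.com", "PING"}
V2, E2 = build_psi_graph(fcg_edges, strings_of, important)
```

Functions without important strings (here `recv` and `send`) drop out, shrinking the graph as the text intends.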
(2.2.3) Converting the PSI graph into numerical vector data using the graph embedding technique graph2vec, obtaining a static feature set X_s = {x_s^(1), x_s^(2), ..., x_s^(n)}, where x_s^(i) represents the PSI-graph-based features extracted from the i-th ELF sample. The result of this step is a set of numerical vectors of configurable length representing the graph set; in this embodiment, each PSI graph is represented as a numerical vector of length 1024.
(3) Two Stacked Automatic Encoders (SAE) are pre-trained to encode dynamic and static features, respectively, extracting deep complex features.
And dividing the characteristic data set extracted in the step into a training set and a testing set, and then dividing the training set again to obtain a pre-training data set and a fine-tuning data set.
In an unsupervised learning mode, using dynamic features in the pre-training dataset as input, a SAE is pre-trained, the structure of which is shown in FIG. 3 (a). For convenience of explanation, this SAE is referred to as SAE1 in this example. SAE1 is composed of two parts, namely an encoder and a decoder, wherein the encoder is composed of two fully connected layers, a ReLU activation function is adopted between each layer, and the decoder and the encoder are of symmetrical structures. The size of the input layer and the output layer of SAE1 corresponds to the dimension of the dynamic feature, set to 9; the number of neurons of the two concealment layers in its encoder is 8 and 4, respectively, so the output size of the encoder final concealment layer is 4.
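A minimal NumPy sketch of SAE1's forward pass with the stated layer sizes (encoder 9→8→4, mirrored decoder); the random weights stand in for pre-trained parameters, and applying ReLU at every layer including the output is a simplification of the described architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

# Layer sizes follow the text: 9 -> 8 -> 4 (encoder), then 8 -> 9 (decoder).
sizes = [9, 8, 4, 8, 9]
weights = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(sizes, sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def sae1_forward(x):
    h, acts = x, []
    for W, b in zip(weights, biases):
        h = relu(h @ W + b)
        acts.append(h)
    return acts[1], acts[-1]  # 4-dim latent code, 9-dim reconstruction

x = rng.standard_normal(9)      # one flow-based dynamic feature vector
code, recon = sae1_forward(x)
```

Pre-training would minimize the squared error between `x` and `recon` on the unlabeled pre-training set.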
In an unsupervised learning manner, using static features in the pre-training dataset as input, another SAE, here called SAE2, is pre-trained, the structure of which is shown in FIG. 3 (b). SAE2 consists of two parts, encoder and decoder, which are also symmetrical structures, wherein the structure of the encoder is specifically:
(1) a convolution layer C1, the convolution kernel size is 3×3, the channel number is 16, and the output is 8×8×16;
(2) the pooling layer P1 performs a maximum pooling operation of 2×2 once and outputs 4×4×16;
(3) a convolution layer C2, the convolution kernel size is 3×3, the number of channels is 32, and the output is 4×4×32;
(4) the pooling layer P2 performs a maximum pooling operation of 2×2 once and outputs 2×2×32;
(5) the full connection layer FC1 consists of 128 neurons, adopts a ReLU activation function and outputs 128-dimensional vectors;
(6) the fully connected layer FC2, consisting of 10 neurons, uses the ReLU activation function, so the encoder final hidden layer output size is 10.
And then, respectively encoding the dynamic characteristics and the static characteristics in the fine adjustment data set by using the pre-trained two SAEs to obtain potential representations of the two modal data.
(4) The dynamic characteristics and the static characteristics are fused based on the multi-mode automatic encoder.
To fuse the static and dynamic features, the present embodiment concatenates the final hidden layer outputs of the pre-trained encoders of the two SAEs and takes them as inputs to the multi-mode auto-encoder. The implementation of a multi-modal auto-encoder is essentially based on a hidden layer of another auto-encoder fusing the potential representations of the two modal data, generating a shared potential representation, as shown in fig. 3 (c). A stacked multi-modal automatic encoder with all pre-training layers and a shared hidden layer is finally constructed as shown in fig. 4.
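The fusion step can be sketched as concatenating the two latent codes (4-dim dynamic from SAE1, 10-dim static from SAE2) and passing them through a shared hidden layer; the shared-layer width of 8 is an assumed value, not specified in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)

z_dynamic = rng.standard_normal(4)    # latent code from SAE1's encoder
z_static = rng.standard_normal(10)    # latent code from SAE2's encoder
z = np.concatenate([z_dynamic, z_static])  # 14-dim input to the MAE

# Shared hidden layer of the multi-modal auto-encoder (width 8 is illustrative):
W_shared = rng.standard_normal((14, 8)) * 0.1
shared = relu(z @ W_shared)           # shared latent representation
```

The decoder half of the MAE would reconstruct the 14-dim concatenation from `shared`, forcing the layer to capture cross-modal structure.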
(5) A multi-Modal Stacked Automatic Encoder (MSAE) is fine tuned.
The goal of the automatic encoder is to minimize the reconstruction error between input and output, so that the shared hidden layer learns a shared latent representation of the bimodal data; the loss function is defined as:

L_r = (1/n) Σ_{i=1}^{n} ( ‖x_d^(i) − x̂_d^(i)‖² + ‖x_s^(i) − x̂_s^(i)‖² )

where x_d^(i) and x_s^(i) are respectively the input dynamic and static feature vectors, and x̂_d^(i) and x̂_s^(i) are the corresponding reconstructed vectors output by the MSAE;
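Under the assumption that the reconstruction error is the summed squared error over both modalities averaged across the n samples (the exact normalization is not fully recoverable from the text), it can be computed as:

```python
import numpy as np

def reconstruction_loss(xd, xs, xd_hat, xs_hat):
    """L_r: squared reconstruction error over both modalities, averaged over n samples.
    Rows are samples; columns are feature dimensions."""
    n = xd.shape[0]
    return (np.sum((xd - xd_hat) ** 2) + np.sum((xs - xs_hat) ** 2)) / n

xd, xs = np.zeros((2, 3)), np.zeros((2, 5))
loss_zero = reconstruction_loss(xd, xs, xd, xs)        # perfect reconstruction
loss_one = reconstruction_loss(xd, xs, xd + 1.0, xs)   # off by 1 on every dynamic dim
```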
and (3) performing fine adjustment on the MSAE in a semi-supervised learning mode, taking a fine adjustment data set with a label as a model input, fixing parameters of a pre-training layer, and performing optimization updating on the parameters of the shared hidden layer only through an Adam optimization function based on a gradient descent algorithm.
(6) And taking the encoder of the MSAE model with complete training as a feature extractor, and classifying the shared hidden layer output of the model as the input of a softmax layer to realize the detection of the bot program.
The model structure for detecting bot programs based on the MSAE is shown in FIG. 4. Specifically, the stacked automatic encoder is unfolded and a softmax output layer is added on top of the shared hidden layer to output the predicted label of the i-th input:

ŷ^(i) = softmax(W z^(i) + b)

where W represents the weights of the softmax layer, b represents its bias, T is the number of target label categories, and z^(i) is the i-th output of the shared hidden layer.
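A sketch of this softmax output layer on top of the shared hidden layer; the hidden width (8) and the two-class setting (benign vs. bot) are illustrative assumptions:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # subtract max for numerical stability
    return e / e.sum()

def predict(z, W, b):
    """Predicted label distribution: softmax(W z + b) over T classes."""
    return softmax(W @ z + b)

rng = np.random.default_rng(2)
T, hidden = 2, 8  # T label categories (benign / bot); hidden width is assumed
W, b = rng.standard_normal((T, hidden)), np.zeros(T)
probs = predict(rng.standard_normal(hidden), W, b)  # valid probability vector
```

The detection decision is simply the argmax of `probs`.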
In order to improve bot detection precision, this embodiment provides an improved MSAE in which the classification error is added to the loss function of the fine-tuning stage and minimized with a cross-entropy loss:

L_c = −(1/n) Σ_{i=1}^{n} Σ_{t=1}^{T} y_t^(i) log ŷ_t^(i)

where y^(i) is the true label of the i-th input sample and ŷ^(i) is the corresponding predicted label;
the final MSAE is minimized with respect to the reconstruction error L r And divideClass error L c Is a weighted sum of:
wherein R is a regularization term for preventing model overfitting by L2 regularization of weights of layers in the network, L refers to the number of layers of the network, W l Refers to the weight of the corresponding layer; alpha, beta are weighting factors for reconstruction loss and classification loss, respectively, and lambda is a regularization coefficient.
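Formula (5) can be evaluated as in this sketch, with R implemented as the L2 penalty summed over all layer weight matrices:

```python
import numpy as np

def total_loss(L_r, L_c, layer_weights, alpha, beta, lam):
    """L = alpha*L_r + beta*L_c + lambda*R, with R the L2 penalty
    over all layer weight matrices (formula (5))."""
    R = sum(np.sum(W ** 2) for W in layer_weights)
    return alpha * L_r + beta * L_c + lam * R

# Toy values: one 2x2 all-ones weight matrix gives R = 4.
loss_total = total_loss(1.0, 2.0, [np.ones((2, 2))], alpha=0.5, beta=0.5, lam=0.1)
```

During fine-tuning, gradients of this combined objective update the shared hidden layer (and, in the improved MSAE, the softmax layer).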
Further, the weighting factors α and β are adaptively calculated using a softmax function.
as shown in fig. 2, the embodiment combines static analysis and dynamic analysis methods, respectively extracts static features based on PSI graphs and dynamic features based on streams, automatically fuses the static features and the dynamic features based on MSAE, automatically extracts fusion features through iterative training of a network, and detects zombie programs based on the fusion features. Compared with the prior art, the method can extract the complex relation between the bimodal features and fully exert the advantages of mixed analysis.
The foregoing has shown and described the basic principles and main features of the present invention and the advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (10)
1. A botnet detection method based on a multi-modal stacked automatic encoder, comprising the steps of:
(1) Acquiring an executable file of an application program and storing the executable file in an ELF format;
(2) Respectively carrying out dynamic analysis and static analysis on a data set containing benign programs and bot programs, and extracting flow-based dynamic features and static features based on printable string information (PSI) graphs;
(3) Pre-training two stacking automatic encoders to encode dynamic characteristics and static characteristics respectively and extract deep complex characteristics;
(4) Fusing the dynamic characteristics and the static characteristics based on the multi-mode automatic encoder;
(5) Fine-tuning the multi-modal stacked automatic encoder;
(6) Taking the encoder of the fully trained multi-modal stacked automatic encoder model as a feature extractor, and feeding the shared hidden layer output of the model to a softmax layer for classification, thereby realizing detection of bot programs.
2. The botnet detection method based on multi-modal stacked automatic encoders of claim 1, wherein in step (2) the data set containing benign programs and bot programs is dynamically analyzed, specifically:
2-1) analyzing network behaviors of the ELF file through a Cuckoo sandbox, and recording network traffic in a pcap format;
2-2) dividing the network traffic recorded in the pcap file into flows according to the five-tuple {source IP address, source port number, destination IP address, destination port number, protocol}, and aggregating the data packets with the same five-tuple into flow data f = {p_1, p_2, ..., p_i}, where p_i represents a data packet having that five-tuple;
2-3) re-aggregating the flow data: the flow data collected from the same program at runtime are merged to form the flow record of the corresponding ELF file;
2-4) extracting statistical features based on the flow records: the mean, maximum, and minimum of the number of data packets contained in a flow; the mean, maximum, and minimum of the communication duration of a flow; and the mean, maximum, and minimum of the number of bytes in the data packets of a flow, for 9 feature dimensions in total; obtaining a flow-based dynamic feature set in which each of the n ELF samples contributes one flow-based feature vector extracted from that sample.
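Steps 2-2) through 2-4) can be sketched as follows; the packet dictionary keys (`src`, `sport`, `ts`, `size`, etc.) are hypothetical names for illustration, not fields defined by the patent:

```python
from collections import defaultdict
from statistics import mean

def aggregate_flows(packets):
    # Group packets by the five-tuple {src IP, src port, dst IP, dst port, protocol}.
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt["src"], pkt["sport"], pkt["dst"], pkt["dport"], pkt["proto"])
        flows[key].append(pkt)
    return list(flows.values())

def flow_features(flows):
    # 9-dimensional statistical feature vector per ELF sample:
    # mean/max/min of packets per flow, of flow duration, and of packet sizes.
    pkt_counts = [len(f) for f in flows]
    durations = [max(p["ts"] for p in f) - min(p["ts"] for p in f) for f in flows]
    sizes = [p["size"] for f in flows for p in f]
    feats = []
    for vals in (pkt_counts, durations, sizes):
        feats += [mean(vals), max(vals), min(vals)]
    return feats
```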
3. The botnet detection method based on multi-modal stacked automatic encoders of claim 1, wherein in step (2) the data set containing benign programs and bot programs is statically analyzed, specifically:
3-1) checking whether the ELF file is packed using the shell-detection tool DiE, then unpacking the binary with UPX and disassembling it with IDA Pro;
3-2) constructing a function call graph and a printable character string information graph according to the relation between a function caller and a callee in the assembly code;
3-3) converting the printable string information graph into numerical vector data by adopting the graph embedding technique graph2vec, obtaining a static feature set in which each ELF sample contributes one feature vector extracted from its printable string information graph;
the function call graph is defined as a directed graph G = (V, E), composed of a vertex set V = {v_1, v_2, ..., v_m} and an edge set E = {e_12, e_13, ..., e_ij}, where m represents the number of vertices and e_ij represents function v_i calling function v_j; vertices in the function call graph correspond to the functions contained in the assembly code of the program, and edges represent the caller-callee relationship between two functions.
4. The botnet detection method based on the multi-modal stacked automatic encoder of claim 2, wherein the construction process of the function call graph is specifically:
4-1) extracting a set of identified functions from the assembly code;
4-2) then determining an entry point function;
4-3) building the function call graph using a breadth-first search algorithm: if functions v_i and v_j are identified as having a caller-callee relationship, vertices v_i and v_j are added to the vertex set V and edge e_ij is added to the edge set E.
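A minimal sketch of the breadth-first construction in steps 4-1) to 4-3), assuming the caller-callee relations have already been recovered from the assembly code as a `callees` mapping (a hypothetical input format):

```python
from collections import deque

def build_fcg(entry, callees):
    # Breadth-first construction of the function call graph G = (V, E).
    # `callees` maps each function to the functions it calls.
    V, E = {entry}, set()
    queue = deque([entry])
    while queue:
        v_i = queue.popleft()
        for v_j in callees.get(v_i, []):
            E.add((v_i, v_j))  # edge e_ij: v_i calls v_j
            if v_j not in V:
                V.add(v_j)
                queue.append(v_j)
    return V, E
```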
5. The botnet detection method based on the multi-modal stacked automatic encoder of claim 2, wherein the printable string information (PSI) graph is constructed by selecting, from the function call graph, the functions and relations close to the botnet operation steps, specifically:
5-1) extracting all printable string information present in the binary file through the IDA Pro plug-in, and selecting the printable string information that is at least three characters long;
5-2) selecting the printable string information containing important semantic information to compose a set P = {psi_1, psi_2, ..., psi_k};
5-3) for a vertex v_i in the function call graph, if the function represented by v_i contains at least one important printable string psi_i, adding vertex v_i to the vertex set V' of the PSI graph and continuing to execute step 5-4); otherwise, skipping step 5-4);
5-4) traversing all edges e_ij representing call relations of function v_i; if function v_j also contains at least one psi_i, then adding vertex v_j to the vertex set V' of the PSI graph and adding edge e_ij to the edge set E' of the PSI graph;
5-5) repeating steps 5-3) and 5-4) until all vertices in the function call graph are traversed, and finally outputting the PSI graph G' = (V', E').
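The selection logic of steps 5-3) to 5-5) can be sketched as a single filtering pass over the function call graph; the one-pass form is a simplification of the repeat-until-traversed loop described above:

```python
def build_psi_graph(fcg_edges, func_strings, important):
    # Keep only vertices whose function contains at least one important
    # printable string (psi), and edges whose both endpoints qualify.
    # `func_strings` maps each function to the set of strings it references.
    def has_psi(v):
        return bool(func_strings.get(v, set()) & important)
    V = {v for edge in fcg_edges for v in edge if has_psi(v)}
    E = {(vi, vj) for vi, vj in fcg_edges if has_psi(vi) and has_psi(vj)}
    return V, E
```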
6. The botnet detection method based on multi-modal stacked automatic encoders of claim 1, wherein the pre-training of two stacked automatic encoders in step (3) is specifically:
pre-training one stacked automatic encoder using the dynamic features, its encoder consisting of two fully-connected layers with a ReLU activation function, the decoder being symmetrical to the encoder;
pre-training another stacked automatic encoder using the static features, its encoder consisting of two convolutional layers, a fully-connected layer, and a ReLU activation function, the decoder likewise being symmetrical to the encoder;
the two pre-trained stacked automatic encoders are used to encode the dynamic and static data, respectively, obtaining potential representations of both modalities.
7. The botnet detection method based on the multi-modal stacked automatic encoder of claim 1, wherein the dynamic features and the static features are fused based on the multi-modal automatic encoder in step (4), specifically:
concatenating the final hidden-layer codes of the two pre-trained stacked automatic encoders as the input of the multi-modal automatic encoder;
fusing the potential representations of the two modal data based on a hidden layer of the multi-modal automatic encoder to generate a shared potential representation;
a multi-modal stacked automatic encoder with all pre-trained layers and a shared hidden layer is finally built.
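A minimal pure-Python sketch of the fusion step: the final hidden codes of the two pre-trained encoders are concatenated and passed through the shared hidden layer. The plain-list matrix arithmetic stands in for a real deep-learning framework:

```python
def dense(x, W, b, relu=True):
    # One fully-connected layer: y = ReLU(W x + b).
    y = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return [max(0.0, v) for v in y] if relu else y

def fuse(h_dyn, h_sta, W_shared, b_shared):
    # Concatenate the final hidden codes of the two pre-trained SAEs and
    # feed them through the shared hidden layer of the multi-modal AE,
    # producing the shared potential representation.
    return dense(h_dyn + h_sta, W_shared, b_shared)
```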
8. The botnet detection method based on the multi-modal stacked automatic encoder of claim 1, wherein the fine-tuning process of step (5) specifically comprises:
the goal of the automatic encoder is to minimize the reconstruction error between input and output, letting the shared hidden layer learn a shared potential representation of the bimodal data; the loss function is defined as:

L_r = ||x_d − x̂_d||² + ||x_s − x̂_s||²

wherein x_d and x_s are respectively the dynamic and static feature vectors of the input, and x̂_d and x̂_s are the corresponding reconstructed vectors output by the MSAE;
and fixing the parameters of the pre-trained layers, training with a gradient descent algorithm, and updating only the weights and parameters of the shared hidden layer.
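A sketch of the fine-tuning reconstruction objective; the squared-error form is an assumption, since the patent's formula image is not reproduced:

```python
def reconstruction_loss(x_dyn, x_sta, r_dyn, r_sta):
    # L_r: sum of squared reconstruction errors over both modalities,
    # comparing each input vector with its MSAE reconstruction.
    def sq(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return sq(x_dyn, r_dyn) + sq(x_sta, r_sta)
```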
9. The botnet detection method based on the multi-modal stacked automatic encoder as claimed in claim 1, wherein the step (6) specifically comprises:
unfolding the stacked automatic encoder, adding a softmax output layer on top of the shared hidden layer, and outputting the predicted label ŷ^(i) corresponding to the i-th input:

ŷ_t^(i) = exp(w_t·z^(i) + b_t) / ∑_{j=1}^{T} exp(w_j·z^(i) + b_j), t = 1, ..., T

where W represents the weights of the softmax layer, b represents the bias of the softmax layer, T is the number of target label categories, and z^(i) is the output of the shared hidden layer.
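The softmax output layer described above can be sketched as follows, using the W, b, T, and z^(i) named in the claim:

```python
import math

def softmax_predict(z, W, b):
    # y_hat_t = exp(w_t . z + b_t) / sum_j exp(w_j . z + b_j), t = 1..T,
    # where z is the shared hidden layer output for one input sample.
    logits = [sum(w_k * z_k for w_k, z_k in zip(w_t, z)) + b_t
              for w_t, b_t in zip(W, b)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]
```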
10. The botnet detection method based on multi-modal stacked automatic encoders of claim 9, wherein, to improve the detection accuracy of bot programs, a classification error is also added to the loss function of the fine-tuning stage and is minimized based on the cross-entropy loss function:

L_c = -(1/n) ∑_{i=1}^{n} ∑_{t=1}^{T} 1{y^(i) = t} log ŷ_t^(i)
wherein y^(i) is the true label of the i-th input sample, and ŷ^(i) is the corresponding predicted label;
the minimization objective of the final multi-modal stacked automatic encoder is a weighted sum of the reconstruction error L_r and the classification error L_c:
L = αL_r + βL_c + λR (5)
wherein R is a regularization term, realized by applying L2 regularization to the weights of all layers in the network; α and β are weighting factors and λ is the regularization coefficient;
the weighting factors α and β for the reconstruction error and the classification error are adaptively calculated using a softmax function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311596885.1A CN117640190A (en) | 2023-11-28 | 2023-11-28 | Botnet detection method based on multi-mode stacking automatic encoder |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117640190A true CN117640190A (en) | 2024-03-01 |
Family
ID=90022771
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||