CN112565261B - Multi-generator AugGAN-based dynamic malicious API sequence generation method - Google Patents

Info

Publication number
CN112565261B
Authority
CN
China
Prior art keywords
api
malicious
sample
sequence
benign
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011411208.4A
Other languages
Chinese (zh)
Other versions
CN112565261A (en)
Inventor
杨强
杨涛
郝唯杰
阮伟
王文海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202011411208.4A
Publication of CN112565261A
Application granted
Publication of CN112565261B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention provides a method for generating adversarial dynamic malicious API sequences based on a multi-generator AugGAN. A black box detector is inserted between the generators and the discriminator to label the samples for unsupervised learning. A plurality of generators select non-interfering APIs from their corresponding non-interfering API databases; the selected non-interfering APIs are randomly inserted into a malicious API sequence to generate an adversarial dynamic malicious API sample; the adversarial samples and benign samples are input into the black box detector for classification and labeling; the adversarial samples, the benign samples and their corresponding labels are input into the discriminator for training, and gradient information is returned; the generators and the discriminator update their parameters according to the returned gradient information. Through this continuous game, the generators and the discriminator improve their generation and discrimination capabilities until a Nash equilibrium is reached. Once the system is in equilibrium, the generated adversarial dynamic malicious API sequences can bypass detection by the black box detector.

Description

Multi-generator AugGAN-based dynamic malicious API sequence generation method
Technical Field
The invention relates to a method for generating adversarial API sequences, and in particular to a method for generating adversarial dynamic malicious API sequences based on a multi-generator AugGAN.
Background
With the continuing integration of new-generation information technology into work, life and study, the internet has permeated every aspect of human social activity and has become indispensable to people in every country. According to incomplete statistics from the international telecommunication organization, by 2016 the world had 3.5 billion internet users, about half of the world's population. It was expected that by 2020 the number of internet terminals would reach 120 billion. As global network threats keep changing and new network attacks continue to emerge and spread worldwide, the global network security situation remains serious, and protecting cyberspace remains a long and demanding task.
With the continuing development of artificial intelligence, machine learning algorithms have made great progress in the field of network security. However, while machine learning brings great convenience, it also exposes many security issues. By slightly modifying the input data during the inference phase of a machine learning model, an attacker can make the model produce erroneous results within a short time. Such slightly modified inputs are called adversarial samples. Attack methods that target the model can easily find adversarial samples that cause it to misjudge. Typically, an adversarial sample is an input to which a slight, deliberate alteration has been added so that the model is forced to output an erroneous result with high confidence. In classification, an adversarial sample that humans cannot distinguish by eye from the original increases the prediction error of the model, so that a sample that was originally classified correctly moves to the other side of the decision boundary and is assigned to another class.
Malicious files are currently updated rapidly. If an attack detection model is trained only on the original data set, it cannot detect some variants of malicious files, so its miss rate rises. Many machine learning algorithms are also highly vulnerable to deliberate attack; that is, machine-learning-based malicious file detection models are easily bypassed by adversarial techniques. Research on adversarial sample generation therefore helps improve the robustness of machine learning detection models against obfuscation and perturbation.
Generative adversarial networks (GANs) are rarely applied in the field of information security. The main reason is that, when generating adversarial samples, malware and malicious network traffic samples differ markedly from ordinary samples such as images. For an image, the adversarial sample only needs to be visually indistinguishable from the original, without destroying its visual characteristics. For an adversarial dynamic malicious API sequence sample, the adversarial sample must remain as executable and as malicious as the original. In a malicious dynamic API sequence sample, each API has a specific meaning, and randomly adding or removing APIs at sensitive positions may make the sample unusable, for example by producing illegal fields, invalidating check fields, or destroying its attack capability. The conditions for generating adversarial samples of malicious files and network traffic are therefore more demanding. When generating dynamic API sequences, ensuring that the generated adversarial samples remain malicious and executable is a key problem that urgently needs to be solved before GANs can be applied in the field of network security.
Disclosure of Invention
The invention addresses the problem of ensuring that adversarial samples generated in the field of information security are deceptive, malicious and executable. To remedy the shortcomings of existing research on applying generative adversarial networks (GANs) to information security, the invention provides a method for generating adversarial dynamic malicious API sequences based on a multi-generator augmentation generative adversarial network (AugGAN). The method offers guidance on applying GANs in the field of information security and on improving the robustness of machine learning detection models against obfuscation and perturbation.
The purpose of the invention can be realized by the following technical scheme:
A multi-generator AugGAN-based method for generating adversarial dynamic malicious API sequences comprises the following steps:
(1) acquiring malicious API sequence samples and benign API sequence samples, and numbering them in order;
(2) establishing three non-interfering API databases and a multi-generator AugGAN network, wherein each API database stores numbered non-interfering APIs; the multi-generator AugGAN network comprises three parallel generators and one discriminator, and each generator corresponds to one non-interfering API database;
(3) taking random noise as the input of each generator to obtain three random vectors; mapping each of the three random vectors to a series of numbers, and selecting non-interfering APIs from the corresponding non-interfering API database according to those numbers;
(4) attaching an index number, a thread number and a file name to the selected non-interfering APIs to form a complete non-interfering API call sequence;
(5) inserting the complete non-interfering API call sequence into a malicious API sequence sample to generate an adversarial dynamic malicious API sequence sample; extracting feature vectors of the adversarial dynamic malicious API sequence samples and the benign API sequence samples, and classifying them with a black box detector to obtain classification labels, the labels being benign or malicious;
(6) training the discriminator with the labeled feature vectors of the adversarial dynamic malicious API sequence samples and the benign API sequence samples as input, returning gradient information, updating the weight parameters of the generators and the discriminator, and stopping training once a Nash equilibrium is reached;
(7) repeating steps (3) to (4) with the trained generators to generate a complete non-interfering API call sequence, inserting it into a malicious API sequence sample, classifying the resulting API sequence sample with the black box detector, and taking the API sequence samples classified as benign as the final generated adversarial dynamic malicious API sequences.
Further, the method for establishing three interference-free API databases in step (2) includes:
(1.1) selecting n non-interfering APIs and dividing them evenly into three groups, denoted original_D_i, i = 1, 2, 3;
(1.2) using a moving-window method to expand the number of non-interfering APIs in each group by an integer multiple, forming three non-interfering API databases denoted D_1, D_2 and D_3, and renumbering each non-interfering API in D_i; the calculation formula is as follows:
Figure BDA0002818353990000031
sernumber_ij = j
subject to the constraint:
Figure BDA0002818353990000032
where movwindow_w(original_D_i, per) denotes the slice selected from original_D_i on the w-th application of the moving-window method; per is the slice size as a fraction of the original window size; m is the number of non-interfering APIs in the expanded D_i; e is the multiple by which the non-interfering API database original_D_i is expanded; and sernumber_ij indicates that the j-th non-interfering API in the non-interfering API database D_i is numbered j.
Further, the step (3) is specifically as follows:
(3.1) the three generators each take one noise input and each output a vector of dimension 1 × t, calculated as follows:
z_i = random(p)
y_i = Sigmoid(G_i(z_i))
Figure BDA0002818353990000041
where z_i is the noise input to the i-th generator, i = 1, 2, 3; p is a constant and random(p) denotes generating p random numbers; the last layer of generator i is the Sigmoid function, and G_i(z_i) is the output of the layers of generator i that precede it; y_i is the 1 × t vector output by the i-th generator, and its values always lie in the interval [0, 1];
(3.2) mapping the 1 × t vector output by each generator to the interval [0, m] and rounding to integers, calculated as follows:
y'_i = [m × y_i]
where [·] denotes rounding to an integer and y'_i is the 1 × t vector obtained after mapping and rounding;
(3.3) using y'_i to select non-interfering APIs from the corresponding non-interfering API database D_i, expressed as:
Figure BDA0002818353990000042
where
Figure BDA0002818353990000043
denotes a non-interfering API selected from the non-interfering API database D_i, and Y_i denotes the set of t non-interfering APIs selected from D_i.
Further, the step (4) is specifically as follows:
(4.1) assigning a file name, a thread number and an index number to each non-interfering API in Y_i to form a complete non-interfering API call sequence, calculated as follows:
tid_i = random(c_1, c_2)
index_ij = j
id_i = ID_k
YY_i = [id_i, Y_i, tid_i, index_i]
where tid_i is the thread number of the non-interfering APIs in Y_i; index_i is the index numbers of the APIs in Y_i, and index_ij is the index number of the j-th non-interfering API in Y_i; random(c_1, c_2) denotes randomly selecting a positive integer between the positive integers c_1 and c_2; id_i is the name of the file that calls the non-interfering APIs in Y_i, and ID_k is the file name of the malicious API sequence sample numbered k; YY_i is the complete non-interfering API call sequence.
Further, the step (5) is specifically as follows:
(5.1) randomly inserting the complete non-interfering API call sequence into the malicious API sequence sample to generate an adversarial malicious API sequence sample, according to the formula:
Figure BDA0002818353990000051
where Mal_k is the malicious API sample numbered k, AdMal_k is the generated adversarial malicious API sequence sample, and YY_i is a complete non-interfering API call sequence;
(5.2) extracting features from the adversarial malicious API sequence samples and the benign API sequence samples, each sample yielding a feature vector of dimension 1 × q, and inputting the feature vectors into a trained black box detector for classification to obtain a classification label for each sample, according to the formulas:
AdMal'_k = Feature_selection(AdMal_k)
Benign'_k = Feature_selection(Benign_k)
black_box(AdMal'_k) = label_AdMal_k
black_box(Benign'_k) = label_Benign_k
where Feature_selection(·) is the feature extraction algorithm; Benign_k is the benign API sequence sample numbered k; AdMal'_k is the 1 × q feature vector extracted from AdMal_k by the feature extraction algorithm; Benign'_k is the 1 × q feature vector extracted from Benign_k by the feature extraction algorithm; black_box(·) is the black box detector; label_AdMal_k is the label of the adversarial malicious sample AdMal_k, and label_Benign_k is the label of the benign sample Benign_k; a label of 1 means the black box detector judges the input sample to be malicious, and a label of 0 means it judges the input sample to be benign.
Further, the step (6) is specifically as follows:
(6.1) inputting the adversarial malicious sample feature vectors AdMal'_k, the benign sample feature vectors Benign'_k and their corresponding labels into the discriminator for learning; the discriminator and the generators each compute a loss function, as follows:
max L_D = log[D(Benign'_k)] + log[label_AdMal_k - D(AdMal'_k)]
min L_G = -log[label_AdMal_k - D(AdMal'_k)]
where L_D is the loss function value of the discriminator, D(·) is the discriminator function, and L_G is the loss function value of the generator;
(6.2) the generators and the discriminator continuously update their parameters according to the returned gradient information; the discriminator tries to distinguish adversarial malicious samples from benign samples as accurately as possible, while the generators try to select suitable non-interfering APIs so as to generate adversarial malicious samples that can bypass the black box detector; the two sides play against each other until a Nash equilibrium is reached, at which point adversarial dynamic malicious API sequence samples that can bypass the black box detector are generated.
Training stops when the condition
Figure BDA0002818353990000061
is satisfied, where Adsum_N is the number of samples generated in the N-th epoch that can bypass the black box detector; Adsum_{N-1} is the number of samples generated in the (N-1)-th epoch that can bypass the black box detector; sum is the number of malicious samples in the training set; and u is the minimum ratio of the number of bypassing adversarial malicious samples newly added in the N-th epoch to the total number of malicious samples in the training set.
The invention has the beneficial effects that:
(1) The invention establishes a multi-generator AugGAN network comprising three generators. Random noise is used as the input of each generator, the outputs of the three generators are each mapped to a series of numbers, and non-interfering APIs are selected from the corresponding non-interfering API database according to those numbers. If the proposed model used a single generator, training could fall into a local optimum, leading to problems such as low efficiency and a small number of generated adversarial malicious samples. The invention therefore lets three generators cooperatively optimize the selection of non-interfering APIs, which keeps the model as a whole from falling into a local optimum and moves it toward a global optimum.
(2) The method of the invention can be used to generate adversarial dynamic malicious API sequences in the field of information security. Adversarial malicious samples are generated by inserting non-interfering APIs that do not affect the function of the malicious file, which ensures that the adversarial malicious samples remain executable and malicious while being deceptive. This addresses the shortage of malicious samples by greatly expanding their number, and improves the robustness of machine learning models against obfuscation and perturbation.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 shows the variation of the generator loss function;
FIG. 3 shows the variation of the discriminator loss function;
FIG. 4 shows the number of adversarial malicious samples generated in each epoch.
Detailed Description
The objects and effects of the present invention will become more apparent from the following detailed description of the present invention with reference to the accompanying drawings.
Referring to FIG. 1, the multi-generator AugGAN-based method for generating adversarial dynamic malicious API sequences provided by the present invention mainly includes the following steps:
Step 1: acquiring malicious API sequence samples and benign API sequence samples, and numbering them in order;
Step 2: establishing three non-interfering API databases and a multi-generator AugGAN network, wherein each API database stores numbered non-interfering APIs; the multi-generator AugGAN network comprises three parallel generators and one discriminator, and each generator corresponds to one non-interfering API database;
Step 3: taking random noise as the input of each generator to obtain three random vectors; mapping each of the three random vectors to a series of numbers, and selecting non-interfering APIs from the corresponding non-interfering API database according to those numbers;
Step 4: attaching an index number, a thread number and a file name to the selected non-interfering APIs to form a complete non-interfering API call sequence;
Step 5: inserting the complete non-interfering API call sequence into a malicious API sequence sample to generate an adversarial malicious API sequence sample; extracting feature vectors of the adversarial malicious API sequence samples and the benign API sequence samples, and classifying them with a black box detector to obtain classification labels, the labels being benign or malicious;
Step 6: training the discriminator with the labeled feature vectors of the adversarial malicious API sequence samples and the benign API sequence samples as input, returning gradient information, updating the weight parameters of the generators and the discriminator, and stopping training once a Nash equilibrium is reached;
Step 7: repeating steps 3 to 4 with the trained generators to generate a complete non-interfering API call sequence, inserting it into a malicious API sequence sample, classifying the resulting API sequence sample with the black box detector, and taking the API sequence samples classified as benign as the final generated adversarial dynamic malicious API sequences.
In one embodiment of the present invention, the various steps of the present invention are described in more detail.
A large number of malicious and benign file samples are collected and run in a sandbox; the run logs are extracted and parsed to obtain an API call information table. In this embodiment, 11000 malicious files and 11000 benign files are collected. The samples are divided into a training set and a test set: the training set contains 9500 malicious samples and 9500 benign samples, and the test set contains the remaining 1500 malicious samples and 1500 benign samples. The API call information is represented, for example, in the following table:
TABLE 1
Figure BDA0002818353990000081
Figure BDA0002818353990000091
The fields have the following meanings:
TABLE 2
Field    Type    Description
file_id  Int     File number
api      string  Name of the API called by the file
tid      Int     Number of the thread that calls the API
index    string  Sequence number of the API call within the thread
In this embodiment, a total of 42 non-interfering APIs are selected, as shown in the following table:
TABLE 3
GetSystemDirectoryA GetDiskFreeSpaceW
GetSystemWindowsDirectoryW GetVolumeNameForVolumeMountPointW
GetComputerNameA GetUserNameExA
GetSystemWindowsDirectoryA GetVolumePathNamesForVolumeNameW
GetAsyncKeyState GetInterfaceInfo
GetUserName GetShortPathNameW
GetForegroundWindow GetKeyboardState
GetFileSize GetFileVersionInfoExW
GetTimeZoneInformatio GetAdaptersAddresse
GetSystemMetrics GetUserNameW
GetTempPathW GetFileVersionInfoSizeW
GetVolumePathNameW GetSystemDirectoryW
GetFileInformationByHandle GetSystemTimeAsFileTime
GetFileInformationByHandleEx GetFileSizeEx
GetDiskFreeSpaceExW GetFileVersionInfoSizeExW
GetAdaptersInfo GetKeyState
GetSystemInfo GetFileAttributesW
GetAddrInfoW GetFileVersionInfoW
GetFileType GetCursorPos
GetFileAttributesExW GetBestInterfaceE
GetNativeSystemInfo GetComputerNameW
The 42 non-interfering APIs are randomly divided into 3 groups, denoted original_D_i (i = 1, 2, 3), each containing 14 non-interfering APIs. The moving-window method is then used to increase the number of non-interfering APIs in each group to 70. The formula is as follows:
Figure BDA0002818353990000101
sernumber_ij = j
subject to the constraint:
Figure BDA0002818353990000102
where movwindow_w(original_D_i, per) denotes the slice selected from original_D_i on the w-th application of the moving-window method, and per is the slice size as a fraction of the original window size. m is the number of non-interfering APIs in the expanded D_i; e is the multiple by which the non-interfering API database original_D_i is expanded; sernumber_ij = j indicates that the j-th non-interfering API in the non-interfering API database D_i is numbered j. Here, n = 42, per = 1 and m = 70. This yields three non-interfering API databases D_1, D_2 and D_3, in which each non-interfering API has a corresponding number. The non-interfering API database D_1, for example, is shown in the following table:
TABLE 4
Figure BDA0002818353990000103
Figure BDA0002818353990000111
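The expansion formula itself survives only as an image reference, so the following Python sketch is one possible reading of the moving-window step rather than the patented formula. It assumes that the w-th window starts w positions further along the group and wraps around, and that the expanded database is the concatenation of e such slices; the function name and the placeholder API names are hypothetical.

```python
import math

def expand_with_moving_window(original_group, per, e):
    """Expand one group of non-interfering APIs by concatenating e window slices."""
    size = len(original_group)
    window = max(1, math.ceil(per * size))  # slice length as a fraction of the group
    expanded = []
    for w in range(e):
        # Assumption: the w-th window starts w positions later and wraps around.
        expanded.extend(original_group[(w + k) % size] for k in range(window))
    # Renumber from 1 so that sernumber_ij = j.
    return {j: api for j, api in enumerate(expanded, start=1)}

group = [f"Api{n}" for n in range(14)]   # placeholder names for one group of 14
D1 = expand_with_moving_window(group, per=1.0, e=5)
print(len(D1))  # 70
```

With per = 1 and 14 APIs per group, five windows give the m = 70 entries stated in the embodiment.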
Three random noise vectors of dimension 1 × p are input into the three corresponding generators, each of which outputs a 1 × t vector with values in the interval [0, 1]. The formula is as follows:
z_i = random(p)
y_i = Sigmoid(G_i(z_i))
Figure BDA0002818353990000112
where z_i is the noise input to generator i, p is a constant, and random(p) denotes generating p random numbers. The last layer of generator i is the Sigmoid function, and G_i is the function formed by all layers of generator i before the last layer. y_i is the final 1 × t output vector of generator i. Sigmoid(·) is an activation function that keeps the values of y_i within the interval [0, 1]. Here, p = 100 and t = 1000.
The three generators all use the same feedforward neural network structure, as shown in the following table:
TABLE 5
Layer  Input  Activation function  Output
L1     100                         256
L2            Tanh
L3     256                         512
L4            Tanh
L5     512                         1024
L6            Tanh
L7     1024                        1000
L8            Sigmoid
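The layer structure in Table 5 can be sketched as a plain forward pass. The embodiment uses PyTorch; the sketch below uses only the Python standard library so that it runs anywhere, and it makes several assumptions that are not in the source: the weights are random and untrained, biases start at zero, and random(p) is taken to be uniform on [0, 1). The function names are hypothetical.

```python
import math
import random

LAYER_SIZES = [100, 256, 512, 1024, 1000]  # p = 100 inputs, t = 1000 outputs (Table 5)

def init_generator(sizes, rng):
    """Create one weight matrix and bias vector per linear layer (random, untrained)."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        bound = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers

def generator_forward(layers, z):
    """Linear -> Tanh for the hidden layers, Linear -> Sigmoid for the last layer."""
    x = z
    for depth, (weights, biases) in enumerate(layers):
        pre = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if depth < len(layers) - 1:
            x = [math.tanh(a) for a in pre]
        else:
            x = [1.0 / (1.0 + math.exp(-a)) for a in pre]
    return x

rng = random.Random(0)
G1 = init_generator(LAYER_SIZES, rng)
z1 = [rng.random() for _ in range(LAYER_SIZES[0])]   # z_i = random(p), assumed uniform
y1 = generator_forward(G1, z1)                        # y_i = Sigmoid(G_i(z_i))
```

Because the last layer is a Sigmoid, every entry of y1 lies in [0, 1], which is what the subsequent mapping to database numbers relies on.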
The 1 × t vector output by each generator is mapped to the interval [0, m] and rounded, according to the formula:
y'_i = [m × y_i]
where [·] denotes rounding to an integer and y'_i is a vector of dimension 1 × t in which each value is an integer in [0, m]. These integers correspond one-to-one to the non-interfering APIs in non-interfering API database i and represent the numbers of the non-interfering APIs to be selected. Here, t = 1000 and m = 70.
According to y'_i, non-interfering APIs are selected from the corresponding non-interfering API database D_i, as follows:
Figure BDA0002818353990000121
where
Figure BDA0002818353990000122
denotes a non-interfering API selected from the non-interfering API database D_i, and Y_i denotes the t non-interfering APIs selected from D_i, with t = 1000.
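As a small illustration of the mapping and lookup, the sketch below scales a few stand-in generator outputs to database numbers and picks the corresponding entries. Two points are assumptions: the source allows the value 0 while the numbering starts at 1, so the sketch clamps to [1, m]; and Python's round() rounds exact halves to the nearest even integer, whereas the source only says "rounding". The database contents and function names are placeholders.

```python
def map_to_numbers(y, m):
    """y'_i = [m * y_i]: scale each output to [0, m] and round to an integer."""
    # Assumption: a rounded value of 0 is clamped to 1, because numbering starts at 1.
    return [max(1, min(m, round(m * v))) for v in y]

def pick_apis(database, numbers):
    """Y_i: look up each number in the numbered database D_i."""
    return [database[j] for j in numbers]

D1 = {j: f"Api{(j - 1) % 14}" for j in range(1, 71)}   # placeholder database, m = 70
y1 = [0.0, 0.5, 0.99, 1.0]                              # stand-in for a generator output
numbers = map_to_numbers(y1, m=70)
Y1 = pick_apis(D1, numbers)
print(numbers)  # [1, 35, 69, 70]
```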
A thread number and an index number are assigned to each non-interfering API in Y_i to form a complete non-interfering API sequence, according to the formulas:
tid_i = random(c_1, c_2)
index_ij = j
id_i = ID_k
YY_i = [id_i, Y_i, tid_i, index_i]
where tid_i is the thread number of the non-interfering APIs in Y_i; index_i is the index numbers of the APIs in Y_i, and index_ij is the index number of the j-th non-interfering API in Y_i; random(c_1, c_2) denotes randomly selecting a positive integer between the positive integers c_1 and c_2; id_i is the name of the file that calls the non-interfering APIs in Y_i, and ID_k is the file name of the malicious API sequence sample numbered k; YY_i is the complete non-interfering API call sequence, comprising the file name, the non-interfering API, the thread number and the index number. Here, c_1 = 20 and c_2 = 4000.
Part of YY_i is shown in the following table:
TABLE 6
file_id api tid index
09162d115caa6368ca1cf2e9ef85e1af GetAsyncKeyState 2805 1
09162d115caa6368ca1cf2e9ef85e1af GetForegroundWindow 2805 2
09162d115caa6368ca1cf2e9ef85e1af GetForegroundWindow 2805 3
09162d115caa6368ca1cf2e9ef85e1af GetUserName 2805 4
09162d115caa6368ca1cf2e9ef85e1af GetUserName 2805 5
09162d115caa6368ca1cf2e9ef85e1af GetUserName 2805 6
09162d115caa6368ca1cf2e9ef85e1af GetForegroundWindow 2805 7
09162d115caa6368ca1cf2e9ef85e1af GetUserName 2805 8
09162d115caa6368ca1cf2e9ef85e1af GetUserName 2805 9
09162d115caa6368ca1cf2e9ef85e1af GetAsyncKeyState 2805 10
09162d115caa6368ca1cf2e9ef85e1af GetUserName 2805 11
09162d115caa6368ca1cf2e9ef85e1af GetUserName 2805 12
09162d115caa6368ca1cf2e9ef85e1af GetUserName 2805 13
09162d115caa6368ca1cf2e9ef85e1af GetForegroundWindow 2805 14
09162d115caa6368ca1cf2e9ef85e1af GetAsyncKeyState 2805 15
09162d115caa6368ca1cf2e9ef85e1af GetForegroundWindow 2805 16
09162d115caa6368ca1cf2e9ef85e1af GetUserName 2805 17
09162d115caa6368ca1cf2e9ef85e1af GetUserName 2805 18
09162d115caa6368ca1cf2e9ef85e1af GetUserName 2805 19
09162d115caa6368ca1cf2e9ef85e1af GetForegroundWindow 2805 20
09162d115caa6368ca1cf2e9ef85e1af GetUserName 2805 21
09162d115caa6368ca1cf2e9ef85e1af GetAsyncKeyState 2805 22
09162d115caa6368ca1cf2e9ef85e1af GetFileSize 2805 23
09162d115caa6368ca1cf2e9ef85e1af GetFileSize 2805 24
09162d115caa6368ca1cf2e9ef85e1af GetUserName 2805 25
09162d115caa6368ca1cf2e9ef85e1af GetAsyncKeyState 2805 26
09162d115caa6368ca1cf2e9ef85e1af GetFileSize 2805 27
09162d115caa6368ca1cf2e9ef85e1af GetForegroundWindow 2805 28
09162d115caa6368ca1cf2e9ef85e1af GetUserName 2805 29
09162d115caa6368ca1cf2e9ef85e1af GetAsyncKeyState 2805 30
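The rows of Table 6 can be produced by a routine along the following lines. This is a minimal sketch: the function name is hypothetical, a single thread number is drawn for the whole sequence because every row of Table 6 shares one tid, and the bounds c_1 and c_2 are treated as inclusive, which the source does not specify.

```python
import random

def build_call_sequence(file_id, apis, c1, c2, rng):
    """Attach file name, thread number and index number to each selected API."""
    tid = rng.randint(c1, c2)   # one thread number for the whole sequence, as in Table 6
    return [
        {"file_id": file_id, "api": api, "tid": tid, "index": j}
        for j, api in enumerate(apis, start=1)   # index_ij = j
    ]

rng = random.Random(0)
YY1 = build_call_sequence(
    "09162d115caa6368ca1cf2e9ef85e1af",
    ["GetAsyncKeyState", "GetForegroundWindow", "GetUserName"],
    c1=20, c2=4000, rng=rng,
)
for row in YY1:
    print(row["file_id"], row["api"], row["tid"], row["index"])
```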
The selected complete non-interfering API sequence is randomly inserted into a malicious sample, as follows:
Figure BDA0002818353990000141
where Mal_k is the malicious API sample numbered k, AdMal_k is the generated adversarial dynamic malicious API sequence, and YY_i is the complete non-interfering API call sequence selected from the non-interfering API database D_i.
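The insertion formula is only available as an image, so the sketch below shows one plausible way to scatter the non-interfering calls through a malicious sequence. It assumes that the relative order of the original malicious calls must be preserved, since the stated goal is to keep the sample executable and malicious; the function name and the placeholder call names are hypothetical.

```python
import random

def insert_randomly(malicious_calls, filler_calls, rng):
    """Scatter filler calls through the malicious sequence without reordering either list."""
    total = len(malicious_calls) + len(filler_calls)
    # Choose which positions of the merged sequence receive filler calls.
    filler_positions = set(rng.sample(range(total), len(filler_calls)))
    mal_iter, fill_iter = iter(malicious_calls), iter(filler_calls)
    return [next(fill_iter) if pos in filler_positions else next(mal_iter)
            for pos in range(total)]

rng = random.Random(0)
mal_k = ["M1", "M2", "M3", "M4"]   # placeholder malicious API calls
yy = ["F1", "F2", "F3"]            # placeholder non-interfering calls
admal_k = insert_randomly(mal_k, yy, rng)
```

Filtering admal_k for the "M" entries returns mal_k unchanged, which is the property the sketch is meant to demonstrate.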
Features of each API call sequence are extracted with a feature extraction algorithm, which outputs a vector of dimension 1 × q. The vector is input into a trained black box detector model, which classifies the sample, according to the formulas:
AdMal'_k = Feature_selection(AdMal_k)
Benign'_k = Feature_selection(Benign_k)
black_box(AdMal'_k) = label_AdMal_k
black_box(Benign'_k) = label_Benign_k
where Feature_selection(·) is the feature extraction algorithm and Benign_k is the benign sample numbered k. AdMal'_k is the 1 × q feature vector extracted from AdMal_k by the feature extraction algorithm; Benign'_k is the 1 × q feature vector extracted from Benign_k by the feature extraction algorithm. black_box(·) is the black box detector; label_AdMal_k is the label of the adversarial malicious sample AdMal_k, and label_Benign_k is the label of the benign sample Benign_k. A label of 1 means the black box detector judges the input sample to be malicious, and a label of 0 means it judges the input sample to be benign. The black box detector is a random forest classification model with q = 2171, trained in advance on malicious API sequence samples and benign API sequence samples.
The feature vector AdMal'_k of the adversarial malicious sample, the feature vector Benign'_k of the benign sample, and the corresponding labels are input to the discriminator for learning. The discriminator and the generator each compute a loss function and update their parameters, as follows:
max L_D = log[D(Benign'_k)] + log[label_AdMal_k - D(AdMal'_k)]
min L_G = -log[label_AdMal_k - D(AdMal'_k)]
where L_D denotes the loss function value of the discriminator, which should be as large as possible; D(·) denotes the discriminator function; and L_G denotes the loss function value of the generator, which should be as small as possible.
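As a worked example, the two expressions can be evaluated for a single pair of samples in Python. The sketch assumes natural logarithms (the base is not stated in the text) and uses hypothetical discriminator outputs. Note that, as written, both expressions are only defined when label_AdMal_k - D(AdMal'_k) is positive; the text does not say how other cases are handled, so the sketch simply raises an error.

```python
import math

def discriminator_objective(d_benign, d_admal, label_admal):
    """L_D = log D(Benign') + log(label_AdMal - D(AdMal')), to be maximized."""
    gap = label_admal - d_admal
    if d_benign <= 0 or gap <= 0:
        raise ValueError("log argument must be positive")
    return math.log(d_benign) + math.log(gap)

def generator_loss(d_admal, label_admal):
    """L_G = -log(label_AdMal - D(AdMal')), to be minimized."""
    gap = label_admal - d_admal
    if gap <= 0:
        raise ValueError("log argument must be positive")
    return -math.log(gap)

# Hypothetical values: D(Benign') = 0.9, D(AdMal') = 0.2, detector label = 1.
l_d = discriminator_objective(0.9, 0.2, 1)
l_g = generator_loss(0.2, 1)
print(round(l_d, 4), round(l_g, 4))
```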
The discriminator adopts a feedforward neural network structure with the following parameters:
TABLE 7
Layer    Input    Activation function    Output
L1       2171     -                      1000
L2       -        Tanh                   -
L3       1000     -                      500
L4       -        Tanh                   -
L5       500      -                      100
L6       -        Tanh                   -
L7       100      -                      1
L8       -        Sigmoid                -
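The layer sizes in Table 7 can be checked with a short Python sketch that verifies that consecutive layers are shape-compatible and counts the trainable parameters. It assumes that L1, L3, L5 and L7 are fully connected layers with bias terms, which the table does not state explicitly; the names `LAYERS` and `parameter_count` are illustrative.

```python
# (input size, output size, activation applied afterwards), read from Table 7.
LAYERS = [
    (2171, 1000, "Tanh"),
    (1000, 500, "Tanh"),
    (500, 100, "Tanh"),
    (100, 1, "Sigmoid"),
]

def check_shapes(layers):
    """Return True if every layer's output size equals the next layer's input size."""
    return all(prev[1] == nxt[0] for prev, nxt in zip(layers, layers[1:]))

def parameter_count(layers, bias=True):
    """Count weights (in * out) plus optional biases (out) over all layers."""
    return sum(n_in * n_out + (n_out if bias else 0) for n_in, n_out, _ in layers)

shapes_ok = check_shapes(LAYERS)
total_params = parameter_count(LAYERS)
print(shapes_ok, total_params)
```

Under these assumptions the discriminator maps a 1 × 2171 feature vector to a single Sigmoid output and has roughly 2.7 million parameters, most of them in the first layer.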
The generator and the discriminator continuously update their own parameters according to the returned gradient information. The discriminator tries to distinguish adversarial malicious samples from benign samples as correctly as possible, while the generator tries to select suitable non-interfering APIs so as to generate adversarial malicious samples that can bypass the black-box detector. The generator and the discriminator play against each other until Nash equilibrium is finally reached, and adversarial dynamic malicious API sequence samples that can bypass the black-box detector are generated.
The experimental environment is shown in the following table:
TABLE 8
Item                       Version
Operating system           Ubuntu 19.04
GPU                        Nvidia Titan Xp
CPU                        Intel(R) Core(TM) i9-9820X CPU @ 3.30GHz
Deep learning framework    PyTorch 1.5
Memory                     64 GB
In this embodiment, the learning rates of the three generators are all set to 0.003, the learning rate of the discriminator is 0.01, and batch_size is 16.
Training stops when the following condition is satisfied:
Figure BDA0002818353990000151
where Adsum_N is the number of samples generated in the N-th epoch that can bypass the black-box detector, and Adsum_{N-1} is the number of samples generated in the (N-1)-th epoch that can bypass the black-box detector; sum is the number of malicious samples in the training set; u is the minimum percentage, relative to the total number of malicious samples in the training set, of the adversarial malicious samples added in the N-th round that can bypass the black-box detector. Here, u = 0.01 and sum = 9500.
In this embodiment, the condition is satisfied at N = 15 and training stops; a total of 2013 adversarial malicious samples capable of bypassing the black-box detector are generated.
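The stopping rule can be sketched in Python under a stated assumption. The formula image is not reproduced in this text; based on the definitions of Adsum_N, Adsum_{N-1}, sum and u, the sketch assumes the condition has the form (Adsum_N - Adsum_{N-1}) / sum < u. Both this form and the per-epoch counts used below are hypothetical.

```python
def should_stop(adsum_n, adsum_prev, total_malicious, u):
    """Assumed criterion: stop once the relative gain in bypassing samples falls below u."""
    return (adsum_n - adsum_prev) / total_malicious < u

U = 0.01       # value given for this embodiment
TOTAL = 9500   # malicious samples in the training set

# With u = 0.01 and sum = 9500, the threshold corresponds to a gain of 95 samples.
keep_training = should_stop(adsum_n=150, adsum_prev=40, total_malicious=TOTAL, u=U)
stop_training = should_stop(adsum_n=130, adsum_prev=60, total_malicious=TOTAL, u=U)
print(keep_training, stop_training)
```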
The variation of the generator loss function value is shown in FIG. 2.
The variation of the discriminator loss function value is shown in FIG. 3.
The number of adversarial malicious samples generated in each epoch that can bypass the black-box detector is shown in FIG. 4.
The 2013 adversarial malicious samples are divided into a training set and a test set: the training set holds 80% of the samples (1610) and the test set holds 20% (403).
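The split sizes can be reproduced with a short Python sketch. The text does not describe how the split is drawn, so a seeded shuffle with the training size rounded down is assumed here; integer identifiers stand in for the samples.

```python
import random

def split_samples(samples, train_fraction, rng):
    """Shuffle a copy of samples and split it, rounding the training size down."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

adversarial_ids = list(range(2013))  # placeholders for the 2013 generated samples
train_set, test_set = split_samples(adversarial_ids, 0.8, random.Random(0))
print(len(train_set), len(test_set))
```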
The 403 adversarial malicious samples in the adversarial sample test set and the 1500 malicious samples in the original sample test set are input to the random forest detection model, with the following results:
TABLE 9
                      Adversarial malicious samples    Malicious samples
Detection accuracy    0                                0.9
Wherein:
Figure BDA0002818353990000161
The random forest model is then retrained using the training set of adversarial malicious samples. The adversarial malicious sample test set and the 1500 malicious samples in the original sample test set are input to the retrained random forest model, giving the following detection results:
TABLE 10
                      Adversarial malicious samples    Malicious samples
Detection accuracy    0.98                             0.96
The results show that the detection accuracy of the retrained random forest model rises from 0 to 0.98 on adversarial malicious samples and from 0.9 to 0.96 on real malicious samples. This indicates that the detection capability of the retrained random forest classification model is clearly improved and that its resistance to perturbation and obfuscation is greatly enhanced.
The types and numbers of non-interfering APIs inserted into the adversarial malicious samples that can bypass the black-box detector are counted.
Each non-interfering API corresponds to a segment of C++ source code, as exemplified in the following table:
TABLE 11
Figure BDA0002818353990000162
Therefore, the source code corresponding to the inserted non-interfering APIs can be written directly into the source code of the relevant malicious file to generate an executable file. Experiments show that the resulting executable file is malicious, executable and deceptive.
This also demonstrates that the adversarial malicious samples generated by the method of the present invention can deceive the black-box detector while retaining their maliciousness and executability.
The foregoing merely illustrates specific embodiments of the invention. The invention is obviously not limited to the above embodiments, and many variations are possible. All modifications that a person skilled in the art can derive directly or infer from the disclosure of the present invention are to be considered within the scope of the invention.

Claims (6)

1. A multi-generator AugGAN-based method for generating adversarial dynamic malicious API sequences, characterized by comprising the following steps:
(1) acquiring malicious API sequence samples and benign API sequence samples, and numbering them in order;
(2) establishing three non-interfering API databases and a multi-generator AugGAN network, wherein each API database stores numbered non-interfering APIs; the multi-generator AugGAN network comprises three parallel generators and one discriminator, and each generator corresponds to one non-interfering API database;
(3) taking random noise as the input of each generator to obtain three random vectors; mapping the three random vectors respectively to obtain a series of numbers, and selecting non-interfering APIs from the corresponding non-interfering API database according to the numbers;
(4) adding an index number, a thread number and a file name to the selected non-interfering APIs to form a complete non-interfering API call sequence;
(5) inserting the complete non-interfering API call sequence into the malicious API sequence sample to generate an adversarial dynamic malicious API sequence sample; extracting feature vectors of the adversarial malicious API sequence samples and the benign API sequence samples, and classifying them with a black-box detector to obtain classification labels, the classification labels being benign or malicious;
(6) training with the feature vectors of the labeled adversarial dynamic malicious API sequence samples and benign API sequence samples as the input of the discriminator, returning gradient information, updating the weight parameters of the generators and the discriminator, and stopping training after a Nash equilibrium state is reached;
(7) repeating steps (3) to (4) with the trained generators to generate a complete non-interfering API call sequence, inserting it into a malicious API sequence sample, classifying the resulting API sequence sample with the black-box detector, and taking the API sequence samples classified as benign as the finally generated adversarial dynamic malicious API sequences.
2. The multi-generator AugGAN-based method for generating adversarial dynamic malicious API sequences according to claim 1, wherein the three non-interfering API databases in step (2) are established as follows:
(1.1) selecting n non-interfering APIs and dividing them evenly into three groups, denoted original_D_i, i = 1, 2, 3;
(1.2) using a moving window method to expand the number of non-interfering APIs in each group by an integer multiple, forming three non-interfering API databases denoted D_1, D_2, D_3, and renumbering each non-interfering API in D_i; the calculation formulas are as follows:
Figure FDA0003183183560000021
sernumber_ij = j
Figure FDA0003183183560000022
where movwindow_w(original_D_i, per) denotes the slice selected from original_D_i by the moving window method at the w-th time, per denotes the slice size as a percentage of the original window size, m denotes the number of non-interfering APIs in the expanded D_i, e denotes that the non-interfering API database original_D_i is expanded e times, and sernumber_ij denotes that the j-th non-interfering API in the non-interfering database D_i is numbered j.
3. The multi-generator AugGAN-based method for generating adversarial dynamic malicious API sequences according to claim 1, wherein step (3) is specifically:
(3.1) inputting three noise vectors into the three generators respectively and outputting three vectors of dimension 1 × t, which are calculated as follows:
z_i = random(p)
y_i = Sigmoid(G_i(z_i))
Figure FDA0003183183560000023
where z_i is the noise input to the i-th generator, i = 1, 2, 3; p is a constant, and random(p) denotes generating p random numbers; the last layer of generator i is the Sigmoid function, and G_i(z_i) is the output of the second layer of generator i; y_i is the 1 × t vector output by the i-th generator, and its elements always lie within the interval [0, 1];
(3.2) mapping the 1 × t vector output by the generator to the interval [0, m] and rounding to integers, with the calculation formula:
y'_i = [m × y_i]
where [·] denotes rounding to an integer and y'_i is the 1 × t vector obtained after mapping and rounding;
(3.3) using y'_i to pick non-interfering APIs from the corresponding non-interfering API database D_i, expressed as:
Figure FDA0003183183560000032
where
Figure FDA0003183183560000033
denotes a non-interfering API picked from the non-interfering API database D_i, and Y_i denotes the set of t non-interfering APIs picked from the non-interfering API database D_i.
4. The multi-generator AugGAN-based method for generating adversarial dynamic malicious API sequences according to claim 1, wherein step (4) is specifically:
(4.1) assigning a file name, a thread number and an index number to each non-interfering API in the set Y_i, finally forming a complete non-interfering API call sequence, with the calculation formulas:
tid_i = random(c_1, c_2)
index_ij = j
id_i = ID_k
YY_i = [id_i, Y_i, tid_i, index_i]
where Y_i denotes the set of t non-interfering APIs picked from the non-interfering API database D_i; tid_i denotes the thread number of each non-interfering API in Y_i; index_i denotes the index numbers of the APIs in Y_i, and index_ij denotes the index number of the j-th non-interfering API in Y_i; random(c_1, c_2) denotes randomly selecting a positive integer between the positive integers c_1 and c_2; id_i denotes the name of the file that calls each non-interfering API in Y_i, and ID_k denotes the file name of the malicious API sequence sample numbered k; YY_i denotes the complete non-interfering API call sequence.
5. The multi-generator AugGAN-based method for generating adversarial dynamic malicious API sequences according to claim 1, wherein step (5) is specifically:
(5.1) randomly inserting the complete non-interfering API call sequence into the malicious API sequence sample to generate an adversarial malicious API sequence sample, with the formula:
Figure FDA0003183183560000031
where Mal_k denotes the malicious API sample numbered k, AdMal_k denotes the generated adversarial malicious API sequence sample, and YY_i denotes the complete non-interfering API call sequence;
(5.2) extracting the features of the adversarial malicious API sequence samples and the benign API sequence samples, each sample corresponding to a 1 × q feature vector, and inputting the feature vectors into the trained black-box detector for classification to obtain the classification label of each sample, with the formulas:
AdMal'_k = Feature_selection(AdMal_k)
Benign'_k = Feature_selection(Benign_k)
black_box(AdMal'_k) = label_AdMal_k
black_box(Benign'_k) = label_Benign_k
where Feature_selection(·) is the feature extraction algorithm and Benign_k is the benign API sequence sample numbered k; AdMal'_k is the 1 × q feature vector output by applying the feature extraction algorithm to AdMal_k; Benign'_k is the 1 × q feature vector output by applying the feature extraction algorithm to Benign_k; black_box(·) is the black-box detector, label_AdMal_k is the label of the adversarial malicious sample AdMal_k, and label_Benign_k is the label of the benign sample Benign_k; a label of 1 means the black-box detector judges the input sample to be malicious, and a label of 0 means it judges the input sample to be benign.
6. The multi-generator AugGAN-based method for generating adversarial dynamic malicious API sequences according to claim 1, wherein step (6) is specifically:
(6.1) inputting the feature vector AdMal'_k of the adversarial malicious sample, the feature vector Benign'_k of the benign sample, and the corresponding labels into the discriminator for learning, the discriminator and the generator each computing a loss function, with the formulas:
max L_D = log[D(Benign'_k)] + log[label_AdMal_k - D(AdMal'_k)]
min L_G = -log[label_AdMal_k - D(AdMal'_k)]
where L_D denotes the loss function value of the discriminator; D(·) denotes the discriminator function; L_G denotes the loss function value of the generator;
(6.2) the generator and the discriminator continuously updating their own parameters according to the returned gradient information until Nash equilibrium is finally reached, and generating adversarial dynamic malicious API sequence samples that can bypass the black-box detector; training stops when the following condition is satisfied:
Figure FDA0003183183560000041
where Adsum_N is the number of samples generated in the N-th training round that can bypass the black-box detector, and Adsum_{N-1} is the number of samples generated in the (N-1)-th training round that can bypass the black-box detector; sum is the number of malicious samples in the training set; u is the minimum percentage, relative to the total number of malicious samples in the training set, of the adversarial malicious samples added in the N-th round that can bypass the black-box detector.
CN202011411208.4A 2020-12-04 2020-12-04 Multi-generator AugGAN-based dynamic malicious API sequence generation method Active CN112565261B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011411208.4A CN112565261B (en) 2020-12-04 2020-12-04 Multi-generator AugGAN-based dynamic malicious API sequence generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011411208.4A CN112565261B (en) 2020-12-04 2020-12-04 Multi-generator AugGAN-based dynamic malicious API sequence generation method

Publications (2)

Publication Number Publication Date
CN112565261A CN112565261A (en) 2021-03-26
CN112565261B true CN112565261B (en) 2021-11-23

Family

ID=75048715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011411208.4A Active CN112565261B (en) 2020-12-04 2020-12-04 Multi-generator AugGAN-based dynamic malicious API sequence generation method

Country Status (1)

Country Link
CN (1) CN112565261B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221109B (en) * 2021-03-30 2022-06-28 浙江工业大学 Intelligent malicious file analysis method based on generative adversarial network
CN113158190B (en) * 2021-04-30 2022-03-29 河北师范大学 Automatic generation method for malicious code adversarial samples based on generative adversarial network
CN115249048B (en) * 2022-09-16 2023-01-10 西南民族大学 Adversarial sample generation method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111447212A (en) * 2020-03-24 2020-07-24 哈尔滨工程大学 Method for generating and detecting APT (advanced persistent threat) attack sequence based on GAN (generative adversarial network)
CN111881446A (en) * 2020-06-19 2020-11-03 中国科学院信息工程研究所 Method and device for identifying malicious codes of industrial internet

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190028880A (en) * 2017-09-11 2019-03-20 숭실대학교산학협력단 Method and apparatus for generating machine learning data for botnet detection system
WO2019237240A1 (en) * 2018-06-12 2019-12-19 中国科学院深圳先进技术研究院 Enhanced generative adversarial network and target sample identification method
CN110598843B (en) * 2019-07-23 2023-12-22 中国人民解放军63880部队 Training method for generative adversarial network organization structure based on discriminator sharing
CN110826059B (en) * 2019-09-19 2021-10-15 浙江工业大学 Method and device for defending black box attack facing malicious software image format detection model
CN111259393B (en) * 2020-01-14 2023-05-23 河南信息安全研究院有限公司 Malicious software detector concept drift resistance method based on generative adversarial network
CN111832019B (en) * 2020-06-10 2024-02-23 国家计算机网络与信息安全管理中心 Malicious code detection method based on generative adversarial network

Also Published As

Publication number Publication date
CN112565261A (en) 2021-03-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant