CN116340944B - Malicious code classification method and system based on RGB image and lightweight model - Google Patents

Malicious code classification method and system based on RGB image and lightweight model

Publication number
CN116340944B
Authority
CN
China
Prior art keywords
operation code
image
file
markov
byte
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310608993.XA
Other languages
Chinese (zh)
Other versions
CN116340944A (en)
Inventor
赵大伟
孙晨宇
杨淑棉
徐丽娟
李鑫
张雨鑫
徐庆玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Qilu University of Technology
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology and Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202310608993.XA
Publication of CN116340944A
Application granted
Publication of CN116340944B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/561 Virus type analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Virology (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of malicious code classification and provides a method and a system for classifying malicious code based on RGB images and a lightweight model. The method comprises the following steps: decompiling an original malicious code file to generate an asm file and a bytes file; extracting the operation code sequence from the asm file and the byte sequence from the bytes file; fusing the gray-level image and the Markov image generated from the operation code sequence with the Markov image generated from the byte sequence to obtain a fused RGB image; and inputting the fused RGB image into a trained lightweight model for classification. The invention thus obtains a gray-level image based on operation code frequency, a Markov image based on the operation code sequence, and a Markov image based on the byte sequence. Visualizing the operation code sequence as a Markov image preserves the integrity of the extracted features as far as possible and improves the generalization capability of the model.

Description

Malicious code classification method and system based on RGB image and lightweight model
Technical Field
The invention belongs to the technical field of malicious code classification, and particularly relates to a method and a system for classifying malicious codes based on RGB images and a lightweight model.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Traditional static detection of malicious code is simple and fast and does not consume many resources, but it cannot detect unknown families and is easily defeated by polymorphism, metamorphism and similar techniques, which leads to missed detections and false alarms. As the number of variants grows ever faster, traditional static detection methods have become ineffective. In recent years, visualization techniques have been widely applied to malicious code detection. Compared with traditional static detection, a visualization method can retain the feature information of a malicious sample in full, makes the differences between samples easier to observe, and mitigates the impact of obfuscation. Existing research, however, usually converts the malicious binary file into a gray-level image, extracts features from that image and trains a neural network on them. A gray-level image is single-channel data; the malicious code features it carries are few and of a single kind, so in a neural network it is less expressive and performs worse than a three-channel image.
Deep learning networks are deep enough to capture rich semantic information and save the considerable manual effort otherwise spent on feature analysis, and their performance keeps improving as the amount of data grows. However, they also suffer from heavy computation and high cost. Existing methods cannot make full use of the rich information contained in malicious samples, and deep learning networks are inefficient, so the real-time and accuracy requirements of code classification cannot be met.
Disclosure of Invention
To address the problems that existing detection methods cannot make full use of the rich information contained in a malicious sample and that deep learning networks are inefficient, the invention discloses a malicious code classification method and system based on an RGB image and a lightweight model, which can fully extract the feature information in the malicious sample and greatly reduce the number of parameters and the amount of computation while maintaining accuracy.
To achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
A first aspect of the invention provides a malicious code classification method based on an RGB image and a lightweight model, which comprises the following steps:
decompiling an original malicious code file to generate an asm file and a bytes file;
extracting an operation code sequence in the asm file, and respectively converting the extracted operation code sequence into a gray level image and a Markov image;
extracting byte sequences in the bytes file, and visualizing the byte sequences into Markov images;
carrying out feature fusion on a gray level image and a Markov image generated based on an operation code sequence and a Markov image generated based on a byte sequence to obtain a fused RGB image;
and inputting the fused RGB images into a trained lightweight model for classification.
A second aspect of the present invention provides a malicious code classification system based on RGB images and lightweight models, comprising:
a decompilation module configured to: decompiling an original malicious code file to generate an asm file and a bytes file;
an asm file extraction module configured to: extracting an operation code sequence in the asm file, and respectively converting the extracted operation code sequence into a gray level image and a Markov image;
a bytes file extraction module configured to: extracting byte sequences in the bytes file, and visualizing the byte sequences into Markov images;
a fusion module configured to: carrying out feature fusion on a gray level image and a Markov image generated based on an operation code sequence and a Markov image generated based on a byte sequence to obtain a fused RGB image;
a classification module configured to: and inputting the fused RGB images into a trained lightweight model for classification.
The one or more of the above technical solutions have the following beneficial effects:
(1) The invention extracts the operation code sequence from the asm file and the byte sequence from the bytes file to obtain a gray-level image based on operation code frequency, a Markov image based on the operation code sequence, and a Markov image based on the byte sequence. Visualizing the operation code sequence as a Markov image preserves the integrity of the extracted features as far as possible and improves the generalization capability of the model. All three generated images have a fixed size of 256×256, so no image normalization step is needed: the three single-channel images can be filled directly into the R, G and B channels and fused into a 256×256×3 RGB image. This fully extracts the feature information in a malicious sample and lets the different features complement one another.
(2) The invention designs a malicious code detection framework that combines MobileNet V2 with the CBAM attention mechanism. On the basis of the original MobileNet V2 model, a CBAM module is added after the point convolution of the seventh inverse residual structure. The module effectively locates key information in the RGB image and increases its weight, and the lightweight design addresses the problem that existing neural networks are becoming ever larger with a markedly growing amount of computation.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
Fig. 1 is a flowchart of RGB image acquisition according to a first embodiment.
Fig. 2 is a structural diagram of a lightweight model of the first embodiment.
Fig. 3 is a schematic diagram of converting an extracted operation code sequence into a two-dimensional transition probability matrix according to the first embodiment.
Fig. 4 is a schematic diagram of converting the extracted byte sequence into a two-dimensional transition probability matrix according to the first embodiment.
Fig. 5 is a graph comparing performance of a lightweight model trained with different images according to the first embodiment.
Fig. 6 (a) and 6 (b) are graphs showing the training set accuracy and loss value variation of the first embodiment, respectively.
Fig. 7 (a) and 7 (b) are graphs showing the test set accuracy and loss value variation of the first embodiment, respectively.
Detailed Description
The general conception of the invention is as follows:
The invention uses the IDA Pro tool to convert a PE file into an asm file and a bytes file. The original operation code sequence in the asm file is extracted and converted into two different gray-level images by two different methods; the byte sequence in the bytes file is likewise extracted and converted into a gray-level image. Finally, the three gray-level images are fused into one RGB color image, which is then input into an improved MobileNet V2 lightweight model for classification.
Example 1
As shown in fig. 1, the embodiment discloses a malicious code classification method based on an RGB image and a lightweight model, which includes:
step 1, decompiling an original malicious code file to generate an asm file and a bytes file;
step 2, extracting an operation code sequence in an asm file, and respectively converting the extracted operation code sequence into a gray level image and a Markov image;
step 3, extracting byte sequences in the bytes file, and visualizing the byte sequences into Markov images;
step 4, respectively filling a gray level image and a Markov image generated based on the operation code sequence and a Markov image generated based on the byte sequence into R, G, B channels to obtain a fused RGB image;
step 5, inputting the fused RGB image into a trained lightweight model for classification.
In step 1, an IDA tool is used to decompile (disassemble) the original malicious code file and generate an asm file and a bytes file, from which feature information is then extracted.
In step 2, a single opcode in an asm file has no harmful effect on its own, whereas an opcode sequence made up of several opcodes may be harmful. The invention therefore uses an N-gram algorithm to extract opcode sequences of three consecutive opcodes as features. Experiments show that subsequences of three opcodes extracted with the N-gram algorithm work best, since a single opcode carries little meaning.
On the other hand, an N-gram model lacks long-range dependence: it can only condition on the preceding N-1 words, so some features are lost. Moreover, extracting meaningful opcode sequences with a large N leads in practice to the curse of dimensionality. The frequency of opcode sequences made up of three opcodes is therefore counted to generate the gray-level image.
In the first (R) channel: the malicious file is first disassembled into an asm file with the disassembler IDA Pro, and opcode sequences made up of three opcodes are extracted as features using the N-gram algorithm.
All opcodes of each malicious asm file are extracted, and every three consecutive opcodes are taken as a subsequence. For example, if the opcodes of a malicious asm sample are "pop, push, mov, call", the generated subsequences are "pop, push, mov" and "push, mov, call".
The frequency of each subsequence is then calculated, the 256 most frequent subsequences in each malicious asm sample are selected (padded with 0 if there are fewer than 256), a 256×256 matrix is filled, and a gray-level image is finally generated.
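The subsequence extraction and frequency count described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `top_trigram_frequencies` is hypothetical, and because the text does not say how the 256 frequencies are laid out in the 256×256 matrix, the sketch stops at the zero-padded frequency vector.

```python
from collections import Counter

def top_trigram_frequencies(opcodes, k=256):
    """Return (all 3-grams, counts of the k most frequent 3-grams padded with 0)."""
    # Slide a window of three consecutive opcodes over the sequence.
    trigrams = [tuple(opcodes[i:i + 3]) for i in range(len(opcodes) - 2)]
    counts = Counter(trigrams)
    # most_common(k) returns at most k (trigram, count) pairs, highest count first.
    top = [count for _, count in counts.most_common(k)]
    # Pad with zeros when the sample has fewer than k distinct subsequences.
    return trigrams, top + [0] * (k - len(top))

# The example from the text: "pop, push, mov, call".
trigrams, freqs = top_trigram_frequencies(["pop", "push", "mov", "call"])
```

Here `trigrams` holds the two subsequences named in the text, and `freqs` is a 256-entry vector whose first two entries are 1 and whose remainder is zero padding.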
In the second (G) channel: all operation code sequences of the text field in a malicious asm file are extracted, a two-dimensional transition probability matrix is generated from them, and the matrix is visualized as a Markov image. Visualizing the operation code sequence as a Markov image preserves the integrity of the extracted features as far as possible and improves the generalization capability of the model. The asm file is obtained by disassembling an executable file and has a structure consistent with that of the executable file; the text field refers to the code section of the executable file, that is, the memory area that stores the program's executable code.
Specifically, counting the operation codes that occur most often in the text fields identifies the operation codes that play a prominent role across the data set as a whole. Extracting the operation code sequence of a single malicious asm file and taking the difference against that list is equivalent to keeping the operation codes that play a prominent role within that single file; it is a feature selection process.
As shown in fig. 3, in the second (G) channel: first, the operation code sequences of the text fields of all malicious asm files are extracted, the 255 operation codes with the highest occurrence counts are collected into a one-dimensional array, the placeholder "aaa" is appended at the tail, and the array is named a. Next, the operation code sequence of the text field of a single malicious asm file is extracted and stored in array b. The operation code types in array a and array b are then compared by taking the difference of the two arrays with the setdiff1d function: if the two arrays contain the same operation code types, the difference is empty; otherwise the values that appear in array a but not in array b are output. Finally, these unique values are set to "aaa", yielding 256 operation code types.
Finally, a sliding window of size 2 is set. For each pair of consecutive operation codes, the first is taken as the row index and the second as the column index, and 1 is added at the corresponding position. The window slides until the last operation code of the single file, which produces a two-dimensional transition probability matrix; this matrix is then visualized as a Markov image.
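The vocabulary construction and sliding-window count for the G channel can be sketched as below. Several points are assumptions rather than statements of the patent: the function names are hypothetical, operation codes outside the vocabulary are mapped to the "aaa" placeholder (one reading of the difference step), and the matrix is sized to the vocabulary so that the example stays small. The sketch stops at raw transition counts.

```python
from collections import Counter

PLACEHOLDER = "aaa"  # placeholder name taken from the text

def build_vocabulary(all_files_opcodes, size=256):
    """Most frequent (size - 1) opcodes over all files, placeholder at the tail."""
    counts = Counter(op for ops in all_files_opcodes for op in ops)
    vocab = [op for op, _ in counts.most_common(size - 1)]
    return vocab + [PLACEHOLDER]

def opcode_transition_counts(opcodes, vocab):
    """Count transitions between consecutive opcodes of a single file."""
    index = {op: i for i, op in enumerate(vocab)}
    fallback = index[PLACEHOLDER]
    n = len(vocab)
    matrix = [[0] * n for _ in range(n)]
    # Assumption: opcodes that are not in the vocabulary share the placeholder slot.
    ids = [index.get(op, fallback) for op in opcodes]
    # Window of size 2: first opcode selects the row, second selects the column.
    for row, col in zip(ids, ids[1:]):
        matrix[row][col] += 1
    return matrix

# Tiny demonstration with a 3-entry vocabulary instead of 256.
corpus = [["mov", "push", "mov", "call"], ["mov", "push", "nop"]]
vocab = build_vocabulary(corpus, size=3)
counts = opcode_transition_counts(["mov", "push", "nop", "mov"], vocab)
```

With `size=256` the same functions yield the 256×256 matrix described in the text, provided the corpus contains at least 255 distinct operation codes.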
As shown in fig. 4, in step 3, for the third (B) channel: for consistency with the G-channel image, a 256×256 matrix initialized to zero is likewise created, and the hexadecimal numbers in the malicious bytes file are converted into decimal numbers in the range 0-255.
After conversion, a sliding window of size 2 is set. For each pair of consecutive bytes, the first is taken as the row index and the second as the column index, and 1 is added at the corresponding position. The window slides until the last byte, which produces a two-dimensional transfer frequency matrix.
The ratio of the frequency at each position to the sum of the frequencies in its row is then calculated. Because floating-point values are cumbersome to handle, the ratio is multiplied by 255, converted to an integer and filled into the zero matrix, producing a two-dimensional transition probability matrix. The value at each position of this matrix is used as the corresponding pixel of the Markov image.
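The byte-level steps above can be sketched in plain Python. This is an illustration under stated assumptions: the function name `byte_markov_image` is hypothetical, the input is assumed to be already split into two-digit hexadecimal strings, rows that never occur are left at zero, and the conversion to an integer is done by truncation.

```python
def byte_markov_image(hex_tokens):
    """Build a 256x256 image of row-normalised byte transitions scaled to 0-255."""
    # Convert hexadecimal strings such as "4D" to integers in 0-255.
    values = [int(tok, 16) for tok in hex_tokens]

    # Transfer frequency matrix: first byte is the row, second byte is the column.
    counts = [[0] * 256 for _ in range(256)]
    for row, col in zip(values, values[1:]):
        counts[row][col] += 1

    # Divide each count by its row sum, scale by 255 and truncate to an integer.
    image = [[0] * 256 for _ in range(256)]
    for r, row_counts in enumerate(counts):
        total = sum(row_counts)
        if total == 0:
            continue  # assumption: a byte value that never leads a pair stays all zero
        for c, n in enumerate(row_counts):
            image[r][c] = (255 * n) // total
    return image

img = byte_markov_image(["4D", "5A", "4D", "00", "4D", "5A"])
```

In this small example the byte 0x4D is followed twice by 0x5A and once by 0x00, so the pixel at row 77, column 90 is 170 and the pixel at row 77, column 0 is 85.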
In step 4, the invention takes into account that a single-feature image performs differently on different data sets. To perform well on all data sets, the three single-feature images are filled into the three RGB channels and fused into one color image, so that the resulting color image carries three times the effective information of a single image.
All three images generated by the method have a fixed size of 256×256, so no image normalization step is needed. The three single-channel images can be filled directly into the R, G and B channels and fused into a 256×256×3 RGB image, which fully extracts the feature information in a malicious sample and lets the different features complement one another.
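The channel filling can be illustrated with a short dependency-free sketch. The function name `fuse_rgb` is hypothetical, and nested lists are used only to keep the example self-contained; with NumPy arrays the same result would be a stack of the three matrices along the last axis.

```python
def fuse_rgb(r, g, b):
    """Fuse three equally sized single-channel images into one H x W x 3 image."""
    height, width = len(r), len(r[0])
    for name, channel in (("r", r), ("g", g), ("b", b)):
        if len(channel) != height or any(len(row) != width for row in channel):
            raise ValueError(f"channel {name} does not match {height}x{width}")
    # Each pixel becomes an (R, G, B) triple taken from the same position.
    return [[(r[y][x], g[y][x], b[y][x]) for x in range(width)]
            for y in range(height)]

# 2x2 stand-ins for the 256x256 opcode gray image and the two Markov images.
gray = [[10, 20], [30, 40]]
opcode_markov = [[1, 2], [3, 4]]
byte_markov = [[255, 0], [0, 255]]
rgb = fuse_rgb(gray, opcode_markov, byte_markov)
```

Because all three inputs already share one fixed size, no resizing is needed before fusion, which is the point made in the text.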
In step 5, on the basis of the original MobileNet V2 model, the invention adds a CBAM (Convolutional Block Attention Module, a lightweight attention module) after the point convolution of the seventh inverse residual structure. The module effectively locates the key features for classification and detection and increases their weight, which reduces information loss, improves model performance and increases the interpretability of the neural network model.
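For reference, the attention computation of CBAM as defined in the original CBAM publication can be written as follows. The patent text does not spell these formulas out, so the symbols are assumptions introduced here: F is the input feature map, M_c and M_s are the channel and spatial attention maps, σ is the sigmoid function, ⊗ is element-wise multiplication and f^{7×7} is a convolution with a 7×7 kernel.

```latex
\begin{aligned}
F'  &= M_c(F) \otimes F, \\
F'' &= M_s(F') \otimes F', \\
M_c(F) &= \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big), \\
M_s(F) &= \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)])\big).
\end{aligned}
```

Channel attention is applied first and spatial attention second, so the module reweights which channels matter and then which positions matter.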
The MobileNet V2 model structure comprises an input layer, seven inverse residual structures, two standard convolution layers, a global average pooling layer, a fully connected layer and an output layer. Each inverse residual structure first applies a 1×1 convolution with the ReLU6 activation function, then a depthwise convolution with a 3×3 kernel and the ReLU6 activation function, and finally a 1×1 convolution with a linear activation function. Table 1 shows the structure of the modified MobileNet V2 model of the present invention.
TABLE 1
Here t is the expansion factor, that is, the factor by which the first 1×1 convolution expands the number of channels; c is the number of output feature channels; n is the number of times the bottleneck is repeated; s is the stride, which controls the feature map size.
As shown in table 2, the input image size is 256×256×3. The output size is 128×128×32 after one standard convolution layer, 128×128×16 after the first inverse residual structure, 64×64×24 after the second, 32×32×32 after the third, 16×16×64 after the fourth, 16×16×96 after the fifth, 8×8×160 after the sixth and 8×8×320 after the seventh. A convolution layer with a 1×1 kernel then produces an output of size 8×8×1280, and global average pooling reduces it to 1×1280. Finally, the result is output through a convolution layer that acts as the fully connected layer, in which the 1280 output neurons are fully connected to the 1000 neurons of the Softmax layer.
Training process: RGB images of size 256×256×3 are input into the improved MobileNet V2 model for training; after two convolution layers, seven inverse residual structures, one average pooling layer and one fully connected layer, a Softmax layer is attached for family classification. The batch size, loss function, optimizer, number of training epochs and learning rate are set according to the malware samples. The hyper-parameter settings of the improved MobileNet V2 model are shown in figure 2.
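The layer-by-layer sizes given above can be reproduced with a short calculation. The (t, c, n, s) rows below are an assumption: they are the standard MobileNet V2 configuration, used here because the contents of Table 1 are not available in this text. The function name `trace_shapes` is hypothetical, and the CBAM module is omitted because an attention module does not change the feature map size.

```python
# (t, c, n, s) per inverse residual structure; assumed standard MobileNet V2 values.
BOTTLENECKS = [
    (1, 16, 1, 1),
    (6, 24, 2, 2),
    (6, 32, 3, 2),
    (6, 64, 4, 2),
    (6, 96, 3, 1),
    (6, 160, 3, 2),
    (6, 320, 1, 1),
]

def trace_shapes(size=256):
    """List (height, width, channels) after each stage for a square RGB input."""
    shapes = [(size, size, 3)]
    size //= 2  # first standard convolution: stride 2, 32 output channels
    shapes.append((size, size, 32))
    for _t, channels, _n, stride in BOTTLENECKS:
        # Only the first repeat of each structure uses stride s; the rest use 1.
        size //= stride
        shapes.append((size, size, channels))
    shapes.append((size, size, 1280))  # final 1x1 convolution
    return shapes

for shape in trace_shapes():
    print("x".join(str(v) for v in shape))
```

Under these assumptions the printed sizes agree with those stated in the description, from 256x256x3 at the input to 8x8x1280 before pooling.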
TABLE 2
This embodiment is verified on the public Big2015 data set. To determine which family an unknown malicious file belongs to, the features extracted from the unknown file are compared with those of malicious files from known families.
To explore the detection performance of gray-level images and RGB images in a neural network, the gray-level image based on operation code frequency, the Markov image based on the byte sequence, the Markov image based on the operation code sequence and the fused RGB image are each input into the improved lightweight model for classification training, to determine whether gray-level and color images differ in classification performance. The experimental results are shown in fig. 5.
As fig. 5 shows, the individual gray-level images perform differently when used to train the model of the invention, but none performs as well overall as the fused RGB image: training with the RGB image reaches an accuracy of up to 99.9% and an F1-score of up to 99.5%.
To explore the effectiveness of the lightweight model proposed in this embodiment, the invention is compared with traditional classification models. The curves of training set accuracy and loss against the number of training epochs are shown in fig. 6 (a) and fig. 6 (b), and those of test set accuracy and loss in fig. 7 (a) and fig. 7 (b).
As the figures show, compared with the other traditional classification models, the curves of the model of the invention remain stable, with the highest accuracy and the lowest loss on both the training set and the test set. The lightweight model proposed by the invention therefore outperforms the other models (AlexNet, VGG, ResNet50 and MobileNet V1) and converges relatively quickly.
Example 2
The embodiment discloses a malicious code classification system based on RGB image and lightweight model, includes:
a decompilation module configured to: decompiling an original malicious code file to generate an asm file and a bytes file;
an asm file extraction module configured to: extracting an operation code sequence in the asm file, and respectively converting the extracted operation code sequence into a gray level image and a Markov image;
a bytes file extraction module configured to: extracting byte sequences in the bytes file, and visualizing the byte sequences into Markov images;
a fusion module configured to: carrying out feature fusion on a gray level image and a Markov image generated based on an operation code sequence and a Markov image generated based on a byte sequence to obtain a fused RGB image;
a classification module configured to: and inputting the fused RGB images into a trained lightweight model for classification.
The lightweight model is an improved MobileNet V2 model; the improvement to the MobileNet V2 model includes adding an attention module after the point convolution of the seventh inverse residual structure of the MobileNet V2 model.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented by general-purpose computer means, alternatively they may be implemented by program code executable by computing means, whereby they may be stored in storage means for execution by computing means, or they may be made into individual integrated circuit modules separately, or a plurality of modules or steps in them may be made into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims (5)

1. A malicious code classification method based on an RGB image and a lightweight model, comprising:
decompiling an original malicious code file to generate an asm file and a bytes file;
extracting an operation code sequence in the asm file, and respectively converting the extracted operation code sequence into a gray level image and a Markov image;
wherein extracting the operation code sequence in the asm file, and converting the extracted operation code sequence into a markov image comprises:
extracting all operation code sequences of text fields;
counting the first 255 operation code sequences with the largest occurrence number, and filling "aaa" at the tail of the operation code sequences;
creating a matrix of size 256×256 initialized to zero;
judging whether the operation code types of the single file are the same as the 256 operation codes; if so, setting the difference to empty, otherwise outputting the operation codes that differ;
setting the output operation codes whose types differ from those of the single file to "aaa";
setting a sliding window with the size of 2, regarding the first operation code as a row and the second operation code as a column in two continuous operation codes, adding 1 to the corresponding position, sliding until the last operation code of a single file is finished, and then generating a two-dimensional transition probability matrix;
visualizing the two-dimensional transition probability matrix as a Markov image;
extracting byte sequences in the bytes file, and visualizing the byte sequences into Markov images;
the extracting the byte sequence in the bytes file, visualizing the byte sequence into a Markov image, and comprises the following steps:
creating a matrix of size 256×256 initialized to zero;
converting hexadecimal numbers in the bytes file into decimal numbers of 0-255;
setting a sliding window with the size of 2, regarding the first byte as a row and the second byte as a column in two continuous bytes, adding 1 at a corresponding position, sliding until the last byte is finished, and generating a two-dimensional transfer frequency matrix;
calculating the ratio of the frequency number of each position on the two-dimensional transfer frequency matrix to the sum of the frequency numbers of each row;
multiplying the ratio by 255, forming integer, and filling the integer into a zero matrix to generate a two-dimensional transition probability matrix;
visualizing the two-dimensional transition probability matrix as a Markov image;
fusing a gray level image and a Markov image generated based on the operation code sequence and a Markov image generated based on the byte sequence to obtain a fused RGB image;
inputting the fused RGB images into a trained lightweight model for classification;
the lightweight model is an improved MobileNet V2 model; the improvement to the MobileNet V2 model includes adding an attention module after the point convolution of the seventh inverse residual structure of the MobileNet V2 model.
2. The method of claim 1, wherein extracting the operation code sequence in the asm file, and converting the extracted operation code sequence into a gray scale map comprises:
extracting all operation codes of each malicious sample in the asm file by using an N-gram algorithm;
selecting every three operation codes as subsequences, and calculating the frequency of the subsequences;
selecting the 256 subsequences with the highest frequency in each malicious sample, padding with 0 if there are fewer than 256, filling a 256×256 matrix, and finally generating a gray level image.
3. The method for classifying malicious code based on an RGB image and a lightweight model according to claim 1, wherein performing feature fusion on the gray level image and the Markov image generated from the operation code sequence and the Markov image generated from the byte sequence to obtain the fused RGB image comprises: filling the gray level image, the Markov image generated from the operation code sequence, and the Markov image generated from the byte sequence into an R channel, a G channel and a B channel respectively, and fusing the R channel, the G channel and the B channel into a color RGB image.
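The channel assignment of claim 3 can be shown with a short sketch, assuming each of the three images is a list of rows of integers in 0-255 and that all three share the same shape. The function name `fuse_rgb` and the 2×2 inputs are illustrative only.

```python
def fuse_rgb(gray, opcode_markov, byte_markov):
    """Stack three single-channel images into one image of (R, G, B) pixels."""
    height, width = len(gray), len(gray[0])
    for channel in (opcode_markov, byte_markov):
        if len(channel) != height or any(len(row) != width for row in channel):
            raise ValueError("all three images must have the same shape")
    # R = gray level image, G = opcode Markov image, B = byte Markov image.
    return [
        [(gray[y][x], opcode_markov[y][x], byte_markov[y][x])
         for x in range(width)]
        for y in range(height)
    ]


# Tiny 2x2 inputs standing in for the 256x256 images.
rgb = fuse_rgb([[1, 2], [3, 4]], [[10, 20], [30, 40]], [[100, 110], [120, 130]])
```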
4. The method for classifying malicious code based on an RGB image and a lightweight model according to claim 1, wherein inputting the fused RGB image into the trained lightweight model for classification comprises: inputting the generated RGB images into the improved MobileNet V2 model for training to obtain a trained classification model; performing feature fusion on a malicious file to be detected to generate an RGB image; and inputting that RGB image into the trained classification model to obtain a malicious family classification result.
5. A malicious code classification system based on RGB images and lightweight models, comprising:
a decompilation module configured to: decompiling an original malicious code file to generate an asm file and a bytes file;
an asm file extraction module configured to: extracting an operation code sequence in the asm file, and converting the extracted operation code sequence into a gray level image and a Markov image respectively;
wherein extracting the operation code sequence in the asm file and converting the extracted operation code sequence into a Markov image comprises:
extracting all operation code sequences of text fields;
counting the 255 operation codes with the largest number of occurrences, and appending "aaa" to the end of these operation codes;
creating a matrix of size 256×256 initialized to zero;
determining whether the operation code types of a single file are the same as the 256 operation codes; if so, setting the output to empty, and otherwise outputting the operation codes that differ;
determining whether each operation code of the single file is the same type as an output operation code, and setting the operation codes that differ from the 256 operation codes to "aaa";
setting a sliding window of size 2; for each two consecutive operation codes, treating the first operation code as the row and the second operation code as the column and adding 1 at the corresponding position; sliding until the last operation code of the single file is reached, and then generating a two-dimensional transition probability matrix;
visualizing the two-dimensional transition probability matrix as a Markov image;
a bytes file extraction module configured to: extracting byte sequences in the bytes file, and visualizing the byte sequences into Markov images;
wherein extracting the byte sequence in the bytes file and visualizing the byte sequence as a Markov image comprises:
creating a matrix of size 256×256 initialized to zero;
converting the hexadecimal numbers in the bytes file into decimal numbers in the range 0-255;
setting a sliding window of size 2; for each two consecutive bytes, treating the first byte as the row and the second byte as the column and adding 1 at the corresponding position; sliding until the last byte is reached, thereby generating a two-dimensional transition frequency matrix;
calculating, for each position of the two-dimensional transition frequency matrix, the ratio of its frequency to the sum of the frequencies in its row;
multiplying the ratio by 255, converting the result to an integer, and filling the integer into a zero matrix to generate a two-dimensional transition probability matrix;
visualizing the two-dimensional transition probability matrix as a Markov image;
a fusion module configured to: performing feature fusion on the gray level image and the Markov image generated from the operation code sequence and the Markov image generated from the byte sequence to obtain a fused RGB image;
a classification module configured to: inputting the fused RGB image into a trained lightweight model for classification;
wherein the lightweight model is an improved MobileNet V2 model;
the improvement to the MobileNet V2 model includes adding an attention module after the pointwise convolution of the seventh inverted residual structure of the MobileNet V2 model.
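The operation code Markov steps recited for the asm file extraction module can be read as the following sketch. It reflects one interpretation of the claim wording: the 255 most frequent operation codes plus the placeholder "aaa" form a 256-entry vocabulary, and any operation code outside that vocabulary is replaced by "aaa" before counting transitions. The function names, the `top_k` and `size` parameters, and the sample data are hypothetical; the sketch returns raw transition counts and leaves the row-wise scaling to the same procedure recited for the byte sequence.

```python
from collections import Counter


def opcode_vocabulary(corpus, top_k=255, filler="aaa"):
    """Most frequent opcodes across all samples, with the filler appended."""
    counts = Counter(op for sample in corpus for op in sample)
    # Assumption: skip the filler itself so it appears only once, at the end.
    ranked = [op for op, _ in counts.most_common() if op != filler]
    return ranked[:top_k] + [filler]


def opcode_transition_counts(opcodes, vocab, filler="aaa", size=256):
    """Count opcode-to-opcode transitions in one file on a size x size grid."""
    index = {op: i for i, op in enumerate(vocab)}
    # Opcodes outside the vocabulary are replaced by the filler.
    mapped = [op if op in index else filler for op in opcodes]
    counts = [[0] * size for _ in range(size)]
    # Sliding window of size 2: first opcode is the row, second the column.
    for first, second in zip(mapped, mapped[1:]):
        counts[index[first]][index[second]] += 1
    return counts


# Hypothetical two-sample corpus with a tiny vocabulary (top_k=2).
corpus = [["mov", "push", "mov", "call"], ["mov", "push", "nop"]]
vocab = opcode_vocabulary(corpus, top_k=2)
counts = opcode_transition_counts(corpus[0], vocab)
```

Here `call` is not among the two most frequent operation codes, so the transition `mov` to `call` is counted as `mov` to `aaa`.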
CN202310608993.XA 2023-05-29 2023-05-29 Malicious code classification method and system based on RGB image and lightweight model Active CN116340944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310608993.XA CN116340944B (en) 2023-05-29 2023-05-29 Malicious code classification method and system based on RGB image and lightweight model

Publications (2)

Publication Number Publication Date
CN116340944A (en) 2023-06-27
CN116340944B (en) 2023-08-18

Family

ID=86889812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310608993.XA Active CN116340944B (en) 2023-05-29 2023-05-29 Malicious code classification method and system based on RGB image and lightweight model

Country Status (1)

Country Link
CN (1) CN116340944B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861431B (en) * 2023-09-05 2023-11-21 国网山东省电力公司信息通信公司 Malicious software classification method and system based on multichannel image and neural network
CN117034274A (en) * 2023-10-08 2023-11-10 广东技术师范大学 Malicious software classification method, device, equipment and medium based on feature fusion
CN117972701B (en) * 2024-04-01 2024-06-07 山东省计算中心(国家超级计算济南中心) Anti-confusion malicious code classification method and system based on multi-feature fusion

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468531A (en) * 2021-07-15 2021-10-01 杭州电子科技大学 Malicious code classification method based on deep residual error network and mixed attention mechanism
CN113806746A (en) * 2021-09-24 2021-12-17 沈阳理工大学 Malicious code detection method based on improved CNN network
CN115630358A (en) * 2022-07-20 2023-01-20 哈尔滨工业大学(深圳) Malicious software classification method and device, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11461468B2 (en) * 2019-11-06 2022-10-04 Mcafee, Llc Visual identification of malware
US11790085B2 (en) * 2020-10-29 2023-10-17 Electronics And Telecommunications Research Institute Apparatus for detecting unknown malware using variable opcode sequence and method using the same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A PE header-based method for malware detection using clustering and deep embedding techniques;Tina Rezaei等;《Journal of Information Security and Applications》;第60卷;1-12 *

Also Published As

Publication number Publication date
CN116340944A (en) 2023-06-27

Similar Documents

Publication Publication Date Title
CN116340944B (en) Malicious code classification method and system based on RGB image and lightweight model
EP3451165B1 (en) Neural network operation device and method supporting few-bit floating-point number
CN108985317B (en) Image classification method based on separable convolution and attention mechanism
CN109086722B (en) Hybrid license plate recognition method and device and electronic equipment
CN109344618B (en) Malicious code classification method based on deep forest
CN113806746B (en) Malicious code detection method based on improved CNN (CNN) network
CN115937655B (en) Multi-order feature interaction target detection model, construction method, device and application thereof
CN111027576B (en) Cooperative significance detection method based on cooperative significance generation type countermeasure network
CN108304573A (en) Target retrieval method based on convolutional neural networks and supervision core Hash
EP4237977B1 (en) Method for detection of malware
CN111461129B (en) Context prior-based scene segmentation method and system
CN111259397A (en) Malware classification method based on Markov graph and deep learning
Maryum et al. Cassava leaf disease classification using deep neural networks
Shen et al. Feature fusion-based malicious code detection with dual attention mechanism and BiLSTM
CN111400713B (en) Malicious software population classification method based on operation code adjacency graph characteristics
CN111241550B (en) Vulnerability detection method based on binary mapping and deep learning
Zhu et al. Malware homology determination using visualized images and feature fusion
CN115828248B (en) Malicious code detection method and device based on interpretive deep learning
CN116258917B (en) Method and device for classifying malicious software based on TF-IDF transfer entropy
CN116595525A (en) Threshold mechanism malicious software detection method and system based on software map
CN116975864A (en) Malicious code detection method and device, electronic equipment and storage medium
CN114861178B (en) Malicious code detection engine design method based on improved B2M algorithm
Cho Dynamic RNN-CNN based malware classifier for deep learning algorithm
CN111079143B (en) Trojan horse detection method based on multi-dimensional feature map
CN114358058A (en) Wireless communication signal open set identification method and system based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Zhao Dawei

Inventor after: Sun Chenyu

Inventor after: Yang Shumian

Inventor after: Xu Lijuan

Inventor after: Li Xin

Inventor after: Zhang Yuxin

Inventor after: Xu Qingling

Inventor before: Zhao Dawei

Inventor before: Sun Chenyu

Inventor before: Yang Shumian

Inventor before: Xu Lijuan

Inventor before: Li Xin

Inventor before: Zhang Yuxin

Inventor before: Xu Qingling

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20230627

Assignee: Shandong Geek Security Technology Co.,Ltd.

Assignor: SHANDONG COMPUTER SCIENCE CENTER(NATIONAL SUPERCOMPUTER CENTER IN JINAN)|Qilu University of Technology (Shandong Academy of Sciences)

Contract record no.: X2024980000068

Denomination of invention: A Malicious Code Classification Method and System Based on RGB Images and Lightweight Models

Granted publication date: 20230818

License type: Common License

Record date: 20240104