CN117852030A - Method and device for generating training sample and training code classification model - Google Patents

Method and device for generating training sample and training code classification model

Info

Publication number
CN117852030A
CN117852030A (application CN202311218705.6A)
Authority
CN
China
Prior art keywords
coding
training
encoding
sample data
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311218705.6A
Other languages
Chinese (zh)
Inventor
王雅娴
徐晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Network Security Technology Co Ltd
Original Assignee
Beijing Topsec Network Security Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Network Security Technology Co Ltd filed Critical Beijing Topsec Network Security Technology Co Ltd
Priority to CN202311218705.6A priority Critical patent/CN117852030A/en
Publication of CN117852030A publication Critical patent/CN117852030A/en
Pending legal-status Critical Current

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Some embodiments of the present application provide a method and apparatus for generating training samples and training a code classification model. The method includes: encoding an encoding object in a program code file to obtain an encoding table, where the encoding object includes at least one of an opcode, an operand, and a flag register; performing sample encoding on the program code file based on the encoding table to obtain sample data; and inputting the sample data into a trained target encoding model to obtain training sample data, where the training sample data is related to the association information of the encoding object. Some embodiments of the present application may simplify the encoding scheme and reduce the sample footprint.

Description

Method and device for generating training sample and training code classification model
Technical Field
The present application relates to the technical field of malicious code classification, and in particular to a method and apparatus for generating training samples and training a code classification model.
Background
In recent years, the amount of malicious code on the Internet has increased dramatically; according to VirusTotal reports, thousands of malicious code samples are uploaded to VirusTotal for detection every day, posing a great threat to network security.
Currently, malicious code detection relies on code detection rules, which security personnel derive by analyzing large amounts of data. However, this approach cannot keep pace with the rate at which malicious code is updated, so detection accuracy cannot be guaranteed.
Therefore, how to detect malicious code more effectively is a technical problem that remains to be solved.
Disclosure of Invention
An object of some embodiments of the present application is to provide a method and apparatus for generating training samples and training a code classification model. The technical solution of the embodiments of the present application can reduce the storage space occupied during training-sample generation and improve the data quality of the training samples, thereby ensuring the training effect of the model and its detection accuracy.
In a first aspect, some embodiments of the present application provide a method of generating training samples, comprising: encoding an encoding object in a program code file to obtain an encoding table, where the encoding object includes at least one of an opcode, an operand, and a flag register; performing sample encoding on the program code file based on the encoding table to obtain sample data; and inputting the sample data into a trained target encoding model to obtain training sample data, where the training sample data is related to the association information of the encoding object.
In some embodiments of the present application, an encoding table is first obtained by encoding the encoding objects in a program code file; sample encoding is then performed on the whole program code file to obtain sample data; finally, the sample data is input into a target encoding model to obtain training sample data. Some embodiments of the present application can encode different types of encoding objects simply, reducing the space occupied during training-sample generation, improving the data quality of the training samples, and thus ensuring the effect of the trained model.
In some embodiments, encoding the encoding object in the program code file to obtain the encoding table includes: extracting operation sentences from the program code file, and counting the word frequency of the encoding objects in the operation sentences to generate a vocabulary corresponding to the encoding objects; and sorting and encoding the encoding objects based on the word frequency to obtain the encoding table.
In these embodiments, the vocabulary is obtained by analyzing the operation sentences in the program code file, and the encoding table is obtained by processing the vocabulary; this encoding scheme is simple, easy to implement, and occupies little space.
In some embodiments, sorting and encoding the encoding objects based on the word frequency to obtain the encoding table includes: sorting the encoding objects in descending order of word frequency to obtain an encoding sequence; if the number of characters in the encoding sequence exceeds a preset threshold, encoding a leading portion of the characters sequentially starting from a preset value to obtain a first sub-encoding table, and encoding the remaining characters into a set value to obtain a second sub-encoding table, where the first sub-encoding table and the second sub-encoding table form an initial encoding table; and determining the encoding table based on the initial encoding table.
In these embodiments, characters are sorted by word frequency and the encoding strategy is chosen according to the number of characters; the scheme is simple, easy to implement, and occupies little space.
In some embodiments, when the encoding object does not include the flag register, determining the encoding table based on the initial encoding table includes using the initial encoding table as the encoding table; when the encoding object includes the flag register, determining the encoding table based on the initial encoding table includes appending flag-register-change characters at the end of the initial encoding table to obtain the encoding table.
In this way, some embodiments of the present application can encode different types of encoding objects, giving the scheme wider applicability.
In some embodiments, before inputting the sample data into the trained target encoding model, the method further comprises: acquiring positive encoding sample data and negative encoding sample data, where the positive encoding sample data comprises a center word and text information associated with the center word, and the negative encoding sample data comprises the center word and text information not associated with the center word; training a positive word embedding layer of an initial encoding model with the positive encoding sample data to obtain a positive word embedding layer to be verified; training a negative word embedding layer of the initial encoding model with the negative encoding sample data to obtain a negative word embedding layer to be verified; calculating the total loss of the positive and negative word embedding layers to be verified through a loss function; and outputting the target encoding model if the total loss is not larger than a preset loss value.
In these embodiments, the initial encoding model is trained with both associated and unassociated text information (positive and negative sampling), yielding an optimal target encoding model whose output is strongly correlated with the context information, which improves the data quality of the training samples.
In a second aspect, some embodiments of the present application provide a method of training a code classification model, comprising: acquiring training sample data obtained by the method according to any of the embodiments of the first aspect; adjusting the dimension of the training sample data to obtain a training data set; and training an initial neural network model with the training data set to obtain an object code classification model.
In these embodiments, the model is trained with the obtained training sample data, giving a good classification effect.
In some embodiments, after obtaining the object code classification model, the method further comprises: training an overall model formed by the object code classification model and an attention module with the training data set to obtain a trained target attention module; inputting sample data into the target attention module to obtain an enhanced sample and an attention matrix; and optimizing the object code classification model by analyzing the attention matrix.
In this way, the attention module performs attribution analysis on the object code classification model, which can improve the detection accuracy of the object code classification model.
In a third aspect, some embodiments of the present application provide an apparatus for generating training samples, including: the encoding module is configured to encode an encoding object in a program code file to obtain an encoding table, wherein the encoding object comprises: at least one of an opcode, an operand, and a flag register; the file coding module is configured to sample code the program code file based on the coding table to obtain sample data; and the sample output module is configured to input the sample data into a trained target coding model to obtain training sample data, wherein the training sample data is related to the association information of the coding object.
In a fourth aspect, some embodiments of the present application provide an apparatus for training a code classification model, comprising: a sample acquisition module configured to acquire training sample data obtained by the method according to any of the embodiments of the first aspect; a sample processing module configured to adjust the dimension of the training sample data to obtain a training data set; and a training module configured to train an initial neural network model with the training data set to obtain an object code classification model.
In a fifth aspect, some embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs a method according to any of the embodiments of the first aspect.
In a sixth aspect, some embodiments of the present application provide an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, can implement a method according to any of the embodiments of the first aspect.
In a seventh aspect, some embodiments of the present application provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor, is adapted to carry out the method according to any of the embodiments of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of some embodiments of the present application, the drawings required for some embodiments of the present application are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should not be considered limiting in scope; a person of ordinary skill in the art may obtain other related drawings from these drawings without inventive effort.
FIG. 1 is a system diagram of code detection provided by some embodiments of the present application;
FIG. 2 is a flow chart of a method of generating training samples provided in some embodiments of the present application;
FIG. 3 is a graph illustrating the variation of loss values of a word2vec model provided by some embodiments of the present application;
FIG. 4 is one of the flow charts of the method of training a code classification model provided in some embodiments of the present application;
FIG. 5 is a second flowchart of a method for training a code classification model according to some embodiments of the present application;
FIG. 6 is a schematic diagram of the number of lines of assembly code provided by some embodiments of the present application;
FIG. 7 is a block diagram of an apparatus for generating training samples according to some embodiments of the present application;
FIG. 8 is a block diagram of an apparatus for training a code classification model according to some embodiments of the present application;
fig. 9 is a schematic diagram of an electronic device according to some embodiments of the present application.
Detailed Description
The technical solutions in some embodiments of the present application will be described below with reference to the drawings in some embodiments of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
In the related art, to cope with the sharp increase in the number of malicious code samples and relieve the pressure of manual detection, YARA (Yet Another Recursive Acronym) rules have been introduced into the field of malicious code detection. However, security personnel must perform a large amount of analysis to distill reasonable and effective YARA rules, and this detection approach often cannot keep pace with the emergence of new malicious code. To identify malicious code families quickly, help network security personnel analyze them further, and display the relatedness of malicious code families intuitively, existing research often converts fixed-length byte data of a malicious binary into a grayscale image, or decompiles the malicious binary into an .asm file and then performs N-gram slicing on the opcodes. These feature extraction methods capture only a limited set of features and do not consider the file structure. For example, one approach considers only the opcode N-gram sequence among opcode-related features; neither the operands nor the relationship between opcodes and their context is taken into account. For selecting opcode sequences, that scheme adopts TF-IDF (Term Frequency-Inverse Document Frequency), a simple statistical method that can quickly identify the importance of each opcode sequence within some portion of the data; however, the scheme involves various region-division operations and does not specify at which part-to-whole processing stage TF-IDF is applied. Moreover, TF-IDF extracts only word-frequency information from the text and covers no further semantic information. After the opcode sequence correspondence table is obtained, that scheme matches opcode sequences by similarity, and sequences matched this way are not guaranteed to be functionally similar. The scheme then applies a stacked denoising autoencoder to reduce the dimensionality of the opcode sequence table; although an autoencoder can in theory capture higher-level semantic information from a sequence, TF-IDF has already filtered out most of the contextual links.
As can be seen from the above, the prior art does not consider the correlation between an opcode and its context when processing opcodes, which degrades detection performance, and longer opcode sequences occupy more space.
In view of this, some embodiments of the present application provide a method for generating training samples and training a code classification model. The method extracts operation sentences from a program code file, counts the encoding objects in them, and encodes them by category to obtain an encoding table; this encoding scheme is simple, easy to implement, and occupies little space. The program code file is then converted according to the encoding table to obtain sample data, the sample data is input into a target encoding model, and association information related to the context of the encoding objects is output as training sample data of higher quality. Finally, an initial neural network model is trained with this higher-quality training sample data to obtain an object code classification model, which improves the model's detection of malicious code.
The overall composition of the system for code detection provided in some embodiments of the present application is described below by way of example in conjunction with fig. 1.
As shown in fig. 1, some embodiments of the present application provide a system for code detection, comprising a terminal 100 and a detection server 200. The detection server 200 is pre-deployed with a trained target code detection model. It may acquire the code to be detected from the terminal 100, detect and classify that code with the target code detection model, output the family of the code to be detected, and confirm whether it is malicious code.
In some embodiments of the present application, the terminal 100 may be a mobile terminal or a non-portable computer terminal. The embodiments of the present application are not specifically limited herein.
It will be appreciated that the target code detection model must be trained before being deployed on the detection server 200. A good target code detection model requires high-quality training sample data; therefore, the training-sample generation performed by the detection server 200 in some embodiments of the present application is described below with reference to fig. 2.
Referring to fig. 2, fig. 2 is a flowchart of a method for generating a training sample according to some embodiments of the present application, where the method for generating a training sample includes:
S210, encoding an encoding object in a program code file to obtain an encoding table, where the encoding object includes at least one of an opcode, an operand, and a flag register.
For example, in some embodiments of the present application, the encoding object may be one or any combination of an opcode, an operand, and a flag register, which makes the scheme widely adaptable. The program code file is an .asm file (as a specific example of a program code file) of assembly code generated by disassembling a Windows executable.
In some embodiments of the present application, S210 may include:
s211, extracting an operation sentence from the program code file, counting the word frequency of the coding object in the operation sentence, and generating a word list corresponding to the coding object.
For example, in some embodiments of the present application, operation sentences are extracted from the .asm file, and the vocabulary is obtained by recording each occurring encoding object (e.g., an opcode or operand) as a character together with its frequency of occurrence (i.e., its word frequency). For example, if only opcodes are used as encoding objects, each opcode and its number of occurrences is recorded; if the encoding objects also include operands, each operand and its number of occurrences is recorded as well.
To reduce the encoding length and simplify the encoding scheme, in some embodiments of the present application, the operands are compressed.
For example, the following operations are performed on operands: a. Each operand is split on spaces; qualifiers such as dword, ptr, and offset are removed, and only the core operand following them is retained. b. The [ ] symbols in operands are removed, keeping the contents inside the brackets. c. For operands containing an underscore, the portion before the underscore is dropped and the numeric information after the underscore is retained. d. In operands containing '+' or '-', the signs are uniformly replaced with '+' and the numeric sections are replaced with "Number"; operands containing '*' are stripped of the data following '*'. e. All immediate-type operands are replaced with "Number".
Through the above processing, the number of characters to be encoded can be greatly reduced, and the operand type information is concentrated.
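For illustration only, the compression rules (a)-(e) might be sketched in Python as follows; the helper name, the qualifier set, and the IDA-style numeric-literal pattern are assumptions, and rule (c) is approximated as keeping the numeric part after an underscore:

```python
import re

QUALIFIERS = {"dword", "word", "byte", "ptr", "offset"}  # assumed qualifier list

def normalize_operand(op: str) -> str:
    # (a) split on spaces, drop qualifiers such as dword/ptr/offset, keep the core operand
    tokens = [t for t in op.strip().split() if t.lower() not in QUALIFIERS]
    op = tokens[-1] if tokens else op.strip()
    # (b) remove the [ ] symbols but keep the contents inside
    op = op.replace("[", "").replace("]", "")
    # (c) keep only the numeric information after an underscore
    if "_" in op:
        op = op.rsplit("_", 1)[-1]
    # (d) strip data following '*', unify '+'/'-', replace numeric sections with "Number"
    if "*" in op:
        op = op.split("*", 1)[0]
    if "+" in op or "-" in op:
        op = op.replace("-", "+")
        op = re.sub(r"\b[0-9][0-9A-Fa-f]*h?\b", "Number", op)
    # (e) replace pure immediate operands with "Number"
    if re.fullmatch(r"[0-9][0-9A-Fa-f]*h?", op):
        op = "Number"
    return op

print(normalize_operand("dword ptr [eax+4]"))  # -> eax+Number
```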
S212, sorting and encoding the encoding objects based on the word frequency to obtain the encoding table.
In some embodiments of the present application, S212 may include: sorting the encoding objects in descending order of word frequency to obtain an encoding sequence; if the number of characters in the encoding sequence exceeds a preset threshold, encoding a leading portion of the characters sequentially starting from a preset value to obtain a first sub-encoding table, and encoding the remaining characters into a set value to obtain a second sub-encoding table, where the first sub-encoding table and the second sub-encoding table form an initial encoding table; and determining the encoding table based on the initial encoding table.
For example, in some embodiments of the present application, the characters in the vocabulary are arranged from high to low word frequency. If the number of characters to be encoded exceeds 10000 (as a specific example of the preset threshold), the first 70% of the characters (as a specific example of the partial characters) are encoded sequentially starting from 2 (as a specific example of the preset value) to obtain the first sub-encoding table; the remaining characters (as a specific example of the remaining characters) are collectively mapped to 'UNK' (as a specific example of the set value) with code 1, and code 0 is reserved to indicate a blank or a partition, giving the second sub-encoding table. If the number of characters to be encoded does not exceed 10000, all characters are encoded. This yields the initial encoding table.
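A minimal Python sketch of this table construction, using the threshold, ratio, and start index from the example above (the function name and token-list input are assumptions):

```python
from collections import Counter

def build_encoding_table(tokens, threshold=10000, keep_ratio=0.7, start=2):
    # sort characters from high to low word frequency; 0 is reserved for
    # blanks/partitions and 1 for 'UNK'
    ordered = [ch for ch, _ in Counter(tokens).most_common()]
    table = {}
    if len(ordered) > threshold:
        cutoff = int(len(ordered) * keep_ratio)
        for i, ch in enumerate(ordered[:cutoff]):   # first sub-encoding table
            table[ch] = start + i
        for ch in ordered[cutoff:]:                 # second sub-encoding table: 'UNK'
            table[ch] = 1
    else:
        for i, ch in enumerate(ordered):            # encode all characters
            table[ch] = start + i
    return table
```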
In some embodiments of the present application, when the encoding object does not include the flag register, S212 may include: the initial coding table is used as the coding table;
for example, in some embodiments of the present application, where the encoded object includes an opcode and/or an operand, the initial encoding table obtained above is the final encoding table.
In some embodiments of the present application, when the encoding object includes the flag register, S212 may include: appending flag-register-change characters at the end of the initial encoding table to obtain the encoding table.
For example, in some embodiments of the present application, when the encoding object contains the flag register, numbering continues at the end of the initial encoding table: 8 flag-register-change characters are appended, 7 corresponding to the respective effects on the flag register of 7 groups of operations (add, adc, sub, sbb, cmp, mul, imul, rol, ror, rcl, rcr, shl, shr, sal, sar, and, or, xor, test, neg, inc, dec) and 1 for unknown flag register changes. The complete encoding table is then obtained.
S220, performing sample encoding on the program code file based on the encoding table to obtain sample data.
For example, in some embodiments of the present application, the .asm file is sample-encoded according to the encoding table to obtain sample data. Specifically, each sample in the .asm file is converted into 1-D or 2-D integer data. The end of each sample, and each logical break within it, is filled with 0s.
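A hedged sketch of this step; the fixed sample length is an assumption (the experiments below cap samples at 4000 lines), and codes 1 and 0 follow the 'UNK' and blank/partition conventions above:

```python
def encode_sample(tokens, table, length=4000, pad=0):
    # map each extracted token to its integer code; unseen tokens fall back to 'UNK' (1)
    codes = [table.get(tok, 1) for tok in tokens][:length]
    codes += [pad] * (length - len(codes))  # fill the end with 0s
    return codes
```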
S230, inputting the sample data into a trained target coding model to obtain training sample data, wherein the training sample data is related to the association information of the coding object.
For example, in some embodiments of the present application, the sample data may be input into a pre-trained target word2vec model (as a specific example of the target encoding model) for encoding, yielding the training sample data. The code corresponding to each character in the training sample data then carries context information (as a specific example of the associated information). In addition, if only opcodes are recorded in the vocabulary, the number of characters to be encoded is small, the encoding table can cover all characters, and the mapping dimension of the target word2vec model can be set to a low value, for example 16. The specific setting can be chosen according to the actual situation, and the embodiments of the present application are not specifically limited here.
To further compress the number of characters to be encoded, some embodiments of the present application use the target word2vec model described above for encoding; the method for obtaining this target word2vec model is therefore described below as an example.
In some embodiments of the present application, before S230, the method for obtaining the target word2vec model includes: acquiring positive coding sample data and negative coding sample data, wherein the positive coding sample data comprises a center word and text information associated with the center word, and the negative coding sample data comprises the center word and text information not associated with the center word; training a forward word embedding layer of an initial coding model by utilizing the forward coding sample data to obtain a forward word embedding layer to be verified; training a negative word embedding layer of an initial coding model by utilizing the negative coding sample data to obtain a negative word embedding layer to be verified; calculating the total loss of the positive word embedding layer to be verified and the negative word embedding layer to be verified through a loss function; and outputting the target coding model if the total loss is not larger than a preset loss value.
For example, in some embodiments of the present application, according to compilation principles and the basic structure of binary code, the general structure of the binary code corresponding to an assembly code sample (as a specific example of the positive and negative encoding sample data) is shown in Table 1.
TABLE 1
Change identification (optional) | Operation code | Source-destination address | Operand 1 | Operand 2
The same function may be implemented by multiple assembly code variants, and some assembly code can be equivalently replaced in simple ways; however, the operation type (calculation, movement, etc.) and the source-destination addresses of the opcodes involved remain substantially the same.
The assembly code samples may then be classified before training the model. For example: 1) the operation types of the opcodes in the assembly code samples are roughly classified into calculation, transfer, jump, stack, bit operation, test, and others. Specifically, following the assembly manual, if the number of opcodes is limited, they may be left unclassified; the embodiments of the present application are not specifically limited here. 2) The operand types in the assembly code samples are roughly classified into immediate values, registers, pointers, offsets, variables, memory addresses, stack addresses, and others. To capture finer features, the flag register characters may be left unsimplified in the encoding.
Finally, the word2vec model (as a specific example of the initial encoding model) is trained on the classified assembly code samples. Specifically, the word2vec model is given two embedding layers (a positive word embedding layer and a negative word embedding layer) used for positive and negative encoding, respectively. The positive word embedding layer takes the center word of the positive encoding sample data as input and outputs its context information (as a specific example of text information associated with the center word); the negative word embedding layer takes the center word of the negative encoding sample data as input and outputs a negative sample (as a specific example of text information not associated with the center word). A negative sample is a word that does not appear in the context of the center word; the present application may randomly select a code from the encoding table as a negative sample. Finally, the losses of the two embedding layers are computed separately with BCEWithLogitsLoss (as a specific example of the loss function) and summed as the total loss of the model. Training of the word2vec model stops once the total loss is no greater than a certain value (as a specific example of the preset loss value), and the target word2vec model is output.
For example, as a specific example of the present application, the relevant parameters of the word2vec model are a window size of 3, a learning rate of 0.0001, and an embedding layer width of 16. Excluding the most probable context words when generating negative samples increases training time without significantly reducing the loss value, so negative samples are generated randomly. The loss curve of the trained target word2vec model is shown in fig. 3: the final positive-sample word embedding loss is about 0.09, the negative-sample word embedding loss is about 0.12, and the total loss is about 0.20.
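A hedged PyTorch sketch of this dual-embedding skip-gram with random negative sampling; the dot-product pairing of center and context vectors and the batch construction are assumptions, and window handling is omitted:

```python
import torch
import torch.nn as nn

class DualEmbeddingWord2Vec(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 16):
        super().__init__()
        self.pos_emb = nn.Embedding(vocab_size, dim)  # positive word embedding layer
        self.neg_emb = nn.Embedding(vocab_size, dim)  # negative word embedding layer

    def forward(self, center, context, negative):
        pos_logit = (self.pos_emb(center) * self.pos_emb(context)).sum(-1)
        neg_logit = (self.neg_emb(center) * self.neg_emb(negative)).sum(-1)
        return pos_logit, neg_logit

vocab_size = 10000                                    # illustrative vocabulary size
model = DualEmbeddingWord2Vec(vocab_size)             # embedding width 16, as in the text
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)  # learning rate from the text

# one illustrative step: context pairs labelled 1, random negatives labelled 0
center = torch.randint(0, vocab_size, (32,))
context = torch.randint(0, vocab_size, (32,))         # would come from a window of size 3
negative = torch.randint(0, vocab_size, (32,))        # negatives drawn at random
pos_logit, neg_logit = model(center, context, negative)
loss = (criterion(pos_logit, torch.ones_like(pos_logit))
        + criterion(neg_logit, torch.zeros_like(neg_logit)))  # total loss = sum of both
optimizer.zero_grad()
loss.backward()
optimizer.step()
```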
After training is complete, the target word2vec model may be used for high-dimensional mapping; for example, the character mov is mapped to [0.1770, 0.1449, -0.1570, 0.1922, 0.1773, 0.1659, -0.1612, 0.1649, 0.1676, 0.1758, -0.1736, 0.1563, 0.1535, 0.1798, 0.1920, 0.1612].
After training sample data has been generated based on the embodiments provided above, a model must be trained to obtain the object code classification model used for code detection. Accordingly, the process of training a code classification model provided in some embodiments of the present application is described below with reference to fig. 4.
Referring to fig. 4, fig. 4 is a flowchart of a method for training a code classification model according to some embodiments of the present application, where the method for training the code classification model includes:
S410, training sample data is acquired.
For example, in some embodiments of the present application, the training sample data is obtained by any of the method embodiments provided above.
S420, adjusting the dimension of the training sample data to obtain a training data set.
For example, in some embodiments of the present application, the training sample data is reshaped to dimensions (n, m, m), where n and m only need to satisfy that n × m × m equals the size of a single training datum, to obtain the training data set.
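For instance, a sketch of this reshaping step (the helper is hypothetical; the (10, 80, 80) example in S550 below fits a 64000-element sample):

```python
import torch

def to_training_tensor(sample, n: int, m: int) -> torch.Tensor:
    t = torch.as_tensor(sample, dtype=torch.float32)
    assert t.numel() == n * m * m, "n * m * m must equal the size of one training datum"
    return t.reshape(n, m, m)  # e.g. (10, 80, 80)
```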
S430, training an initial neural network model with the training data set to obtain an object code classification model.
For example, in some embodiments of the present application, the conv1 layer of the ResNet50 model (as a specific example of the initial neural network model) is replaced with nn.Conv2d(n, 64, kernel_size=7, stride=2, padding=3, bias=False), and the fc layer is replaced with nn.Sequential(nn.Linear(resnet50.fc.in_features, 6), nn.LogSoftmax(dim=1)). The ResNet50 model is then trained with the training data set and optimized with BCEWithLogitsLoss as the loss function, finally yielding a target ResNet50 model (as a specific example of the object code classification model) that meets the requirements.
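Restated as runnable PyTorch for illustration (the torchvision import and n = 10 are assumptions based on the (10, 80, 80) example in S550 below):

```python
import torch.nn as nn
from torchvision.models import resnet50

n = 10  # assumed number of input channels
model = resnet50()
model.conv1 = nn.Conv2d(n, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, 6),  # 6 malicious-code families
    nn.LogSoftmax(dim=1),
)
```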
To improve the accuracy of the target ResNet50 model, the model may also be tuned through model attribution. Therefore, in some embodiments of the present application, after S430 the method may further include: training an overall model formed by the object code classification model and an attention module with the training data set to obtain a trained target attention module; inputting sample data into the target attention module to obtain an enhanced sample and an attention matrix; and optimizing the object code classification model by analyzing the attention matrix.
For example, in some embodiments of the present application, feature attribution for the classification model may be implemented for the ResNet50 model by means of an attention module based on SE (squeeze-and-excitation); the module can also be used directly during training to adjust the influence of the relevant features on the results. The structure of the attention module is shown in Table 2 below.
TABLE 2
Global pooling layer | AdaptiveAvgPool1d(1)
Fully connected layer | Linear(input_channels, input_channels // reduction_ratio)
Activation layer | ReLU(inplace=True)
Fully connected layer | Linear(input_channels // reduction_ratio, input_channels)
Output layer | Sigmoid()
The first fully connected layer compresses the information to reduce its dimension, the second maps the compressed information back to the original dimension, and the output layer converts everything to values in [0, 1]. Once the attention module's parameters are stable, its output values reflect whether each corresponding feature should be enhanced, i.e., whether the feature has a greater impact on the classification accuracy of the target ResNet50 model. Features corresponding to larger values in the attention module's output are generally considered more important to the target ResNet50 model, so the target ResNet50 model can be adjusted accordingly, finally yielding a target ResNet50 model with higher detection accuracy.
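A hedged PyTorch sketch of the module in Table 2; the reduction_ratio value and the forward pass (how the [0, 1] weights produce the enhanced sample) are assumptions:

```python
import torch.nn as nn

class SEAttention(nn.Module):
    def __init__(self, input_channels: int, reduction_ratio: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)  # global pooling layer
        self.fc = nn.Sequential(
            nn.Linear(input_channels, input_channels // reduction_ratio),  # compress
            nn.ReLU(inplace=True),                                         # activation
            nn.Linear(input_channels // reduction_ratio, input_channels),  # map back
            nn.Sigmoid(),                                                  # values in [0, 1]
        )

    def forward(self, x):                          # x: (batch, channels, length)
        w = self.fc(self.pool(x).squeeze(-1))      # per-channel importance in [0, 1]
        return x * w.unsqueeze(-1), w              # enhanced sample, attention weights
```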
The specific process of training a code classification model provided in some embodiments of the present application is described below by way of example in conjunction with fig. 5.
Referring to fig. 5, fig. 5 is a flowchart of a method for training a code classification model according to some embodiments of the present application.
The implementation of the above is exemplarily described below.
S510, extracting an operation sentence from the program code file, and counting the word frequency of the coding object in the operation sentence to generate a word list corresponding to the coding object.
For example, as a specific example of the present application, experiments were performed on the big2015 dataset. First, families with fewer than 500 samples were filtered out of big2015, and at most 2000 samples were selected per family. After removing samples without assembly code, 8512 samples across 6 families were kept for the experiment. Fig. 6 plots the number of lines of assembly code in these samples; considering storage space limitations, at most the first 4000 lines of assembly code per sample were taken as the experimental subject. The occurrence frequencies of the opcodes and operands in the assembly code were then counted to obtain the corresponding vocabulary.
S520, sorting and encoding the encoding objects based on the word frequency in the word list to obtain an encoding table.
For example, as a specific example of the present application, the characters are arranged from high to low word frequency and encoded.
S530, performing sample encoding on the program code file based on the encoding table to obtain sample data.
S540, inputting the sample data into a trained target word2vec model to obtain training sample data.
S550, adjusting the dimension of the training sample data to obtain a training data set.
For example, as a specific example of the present application, each sample in the training sample data is reshaped into a tensor of dimension (10, 80, 80) using torch.
S560, training the ResNet50 model by using the training data set to obtain a target ResNet50 model.
For example, as a specific example of the present application, the training data set is fed to the ResNet50 model in a picture-like structure. The loss function weights are set so that the loss weight of each family = total number of samples / number of samples of that family. The target ResNet50 model is obtained through training.
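For illustration, the weighting rule might be computed as follows (the function name and label format are assumptions):

```python
import torch
from collections import Counter

def family_loss_weights(labels):
    # loss weight of each family = total number of samples / samples of that family
    counts = Counter(labels)
    total = len(labels)
    return torch.tensor([total / counts[f] for f in sorted(counts)], dtype=torch.float32)

print(family_loss_weights([0, 0, 1, 2, 2, 2]))  # hypothetical labels -> tensor([3., 6., 2.])
```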
S570, updating the object code classification model through the attention module.
For example, as a specific example of the present application, after the target ResNet50 model is trained, the attention module is stacked with the trained ResNet50 model and retrained. In this embodiment, the model with loss value 0.4269 and accuracy 0.8438 is used for analysis. After training, the target attention module is extracted, sample data is input, and an enhanced sample and an attention matrix are obtained. The features that ResNet50 deems most helpful for code classification can be read directly from this matrix. Specifically, inputting the data of 04EjIdbPV5e1Xrofopin.asm into the target attention module yields an attention matrix; according to it, for example, channels 1 and 10 of this file are more important, corresponding to the first 400 lines and lines 3900 to 4000 of assembly code in the .asm file, while channel 5 is less important, corresponding to lines 1900 to 2000. Because no irreversible modification occurs when converting the .asm sample into the model's input sample, model attribution can be completed by mapping the identified features back to the corresponding data in the .asm file.
It should be noted that the specific implementation process of S510 to S570 may refer to the method embodiments provided above, and are not repeated here for avoiding repetition.
As can be seen from the above description of some embodiments of the present application, the present application filters operands in a specific manner and then encodes them together with the opcodes. After surveying how each opcode affects the flag register, the flag register changes are encoded and associated with the respective opcodes, each opcode corresponding to a fixed flag register change. Part of the opcode and operand encodings are consolidated in a specific way, reducing the number of characters that must be encoded. The whole process reduces the encoding volume and the occupied space, is simple to operate, and widens the applicable scenarios. The present application trains a word2vec model on part of the big2015 dataset with positive and negative sampling, and the pre-trained model can be reused through migration. Furthermore, an attention module based on the SE architecture can output the importance of features to the classification task and output features adjusted according to that importance. The embodiments of the present application can thus classify malicious code families based on the assembly sentences in .asm files; a high-dimensional mapping model for the opcodes in .asm files (i.e., the target word2vec model) is trained for reuse through migration; a high-accuracy classification model is obtained by adapting the ResNet50 model; and an SE-based attention module can be used to interpret the model and perform feature attribution, improving the accuracy of subsequent code detection.
Referring to fig. 7, fig. 7 illustrates a block diagram of an apparatus for generating training samples according to some embodiments of the present application. It should be understood that this apparatus corresponds to the above method embodiments and can perform the steps involved in them; for its specific functions, refer to the description above, which is omitted here as appropriate to avoid redundancy.
The apparatus for generating training samples of fig. 7 comprises at least one software functional module that can be stored in a memory in the form of software or firmware, or solidified in the apparatus. The apparatus comprises: an encoding module 710 configured to encode an encoding object in a program code file to obtain an encoding table, where the encoding object includes at least one of an opcode, an operand, and a flag register; a file encoding module 720 configured to perform sample encoding on the program code file based on the encoding table to obtain sample data; and a sample output module 730 configured to input the sample data into a trained target encoding model to obtain training sample data, where the training sample data is related to the association information of the encoding object.
Referring to fig. 8, fig. 8 illustrates a block diagram of an apparatus for training a code classification model according to some embodiments of the present application. It should be understood that the apparatus for training a code classification model corresponds to the above method embodiments, and can perform the steps related to the above method embodiments, and specific functions of the apparatus for training a code classification model may be referred to the above description, and detailed descriptions thereof are omitted herein as appropriate to avoid redundancy.
The apparatus for training a code classification model of fig. 8 comprises at least one software functional module that can be stored in a memory in the form of software or firmware, or solidified in the apparatus. The apparatus comprises: a sample acquisition module 810 configured to acquire training sample data; a sample processing module 820 configured to adjust the dimension of the training sample data to obtain a training data set; and a training module 830 configured to train an initial neural network model with the training data set to obtain an object code classification model.
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding procedure in the foregoing method for the specific working procedure of the apparatus described above, and this will not be repeated here.
Some embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program, which when executed by a processor, may implement operations of the method corresponding to any of the above-described methods provided by the above-described embodiments.
Some embodiments of the present application further provide a computer program product, where the computer program product includes a computer program, where the computer program when executed by a processor may implement operations of a method corresponding to any of the foregoing methods provided by the foregoing embodiments.
As shown in fig. 9, some embodiments of the present application provide an electronic device 900, the electronic device 900 comprising: memory 910, processor 920, and a computer program stored on memory 910 and executable on processor 920, wherein processor 920 may implement a method as in any of the embodiments described above when the program is read from memory 910 and executed by processor 920 via bus 930.
The processor 920 may process the digital signals and may include various computing structures. Such as a complex instruction set computer architecture, a reduced instruction set computer architecture, or an architecture that implements a combination of instruction sets. In some examples, the processor 920 may be a microprocessor.
Memory 910 may be used for storing instructions to be executed by processor 920 or data related to execution of instructions. Such instructions and/or data may include code to implement some or all of the functions of one or more modules described in embodiments of the present application. The processor 920 of embodiments of the present disclosure may be configured to execute instructions in the memory 910 to implement the methods shown above. Memory 910 includes dynamic random access memory, static random access memory, flash memory, optical memory, or other memory known to those skilled in the art.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application; various modifications and variations may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application shall be included in the protection scope of the present application.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims (11)

1. A method of generating training samples, comprising:
encoding an encoding object in a program code file to obtain an encoding table, wherein the encoding object comprises: at least one of an opcode, an operand, and a flag register;
sample coding is carried out on the program code file based on the coding table, and sample data are obtained;
and inputting the sample data into a trained target coding model to obtain training sample data, wherein the training sample data is related to the associated information of the coding object.
2. The method of claim 1, wherein encoding the encoded object in the program code file to obtain the encoding table comprises:
extracting an operation sentence from the program code file, and counting the word frequency of the coding object in the operation sentence to generate a word list corresponding to the coding object;
and sequencing and encoding the encoding objects based on the word frequency to obtain the encoding table.
3. The method of claim 2, wherein the ordering and encoding the encoded objects based on the word frequency to obtain the encoding table comprises:
Sequencing the coding objects according to the sequence from high to low of the word frequency to obtain a coding sequence;
if the number of characters in the coding sequence exceeds a preset threshold, sequentially coding partial characters in the coding sequence from the preset value to obtain a first sub-coding table, and coding the rest characters except the partial characters in the coding sequence into a set value to obtain a second sub-coding table; wherein the first sub-coding table and the second sub-coding table form an initial coding table;
the encoding table is determined based on the initial encoding table.
4. The method of claim 3, wherein when the encoding object does not include the flag register, the determining the encoding table based on the initial encoding table comprises:
the initial coding table is used as the coding table;
when the encoding object includes the flag register, the determining the encoding table based on the initial encoding table includes:
and adding a flag register change character at the tail end of the initial coding table to obtain the coding table.
5. A method according to any of claims 1-3, wherein prior to said inputting said sample data into the trained target coding model, the method further comprises:
Acquiring positive coding sample data and negative coding sample data, wherein the positive coding sample data comprises a center word and text information associated with the center word, and the negative coding sample data comprises the center word and text information not associated with the center word;
training a forward word embedding layer of an initial coding model by utilizing the forward coding sample data to obtain a forward word embedding layer to be verified;
training a negative word embedding layer of an initial coding model by utilizing the negative coding sample data to obtain a negative word embedding layer to be verified;
calculating the total loss of the positive word embedding layer to be verified and the negative word embedding layer to be verified through a loss function;
and outputting the target coding model if the total loss is not larger than a preset loss value.
6. A method of training a code classification model, comprising:
acquiring training sample data obtained by the method of any one of claims 1-5;
the dimension of the training sample data is adjusted to obtain a training data set;
and training the initial neural network model by using the training data set to obtain an object code classification model.
7. The method of claim 6, wherein after the obtaining the object code classification model, the method further comprises:
Training an overall model formed by the object code classification model and the attention module by utilizing the training data set to obtain a trained object attention module;
inputting sample data to the target attention module to acquire an enhanced sample and an attention matrix;
the object code classification model is optimized by analyzing the attention matrix.
8. An apparatus for generating training samples, comprising:
the encoding module is configured to encode an encoding object in a program code file to obtain an encoding table, wherein the encoding object comprises: at least one of an opcode, an operand, and a flag register;
the file coding module is configured to sample code the program code file based on the coding table to obtain sample data;
and the sample output module is configured to input the sample data into a trained target coding model to obtain training sample data, wherein the training sample data is related to the association information of the coding object.
9. An apparatus for training a code classification model, comprising:
a sample acquisition module configured to acquire training sample data resulting from the method of any one of claims 1-5;
The sample processing module is configured to adjust the dimension of the training sample data to obtain a training data set;
and the training module is configured to train the initial neural network model by using the training data set and acquire an object code classification model.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program, wherein the computer program when run by a processor performs the method according to any of claims 1-7.
11. An electronic device comprising a memory, a processor, and a computer program stored on the memory and running on the processor, wherein the computer program when run by the processor performs the method of any one of claims 1-7.
CN202311218705.6A 2023-09-19 2023-09-19 Method and device for generating training sample and training code classification model Pending CN117852030A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311218705.6A CN117852030A (en) 2023-09-19 2023-09-19 Method and device for generating training sample and training code classification model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311218705.6A CN117852030A (en) 2023-09-19 2023-09-19 Method and device for generating training sample and training code classification model

Publications (1)

Publication Number Publication Date
CN117852030A 2024-04-09

Family

ID=90546587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311218705.6A Pending CN117852030A (en) 2023-09-19 2023-09-19 Method and device for generating training sample and training code classification model

Country Status (1)

Country Link
CN (1) CN117852030A (en)

Similar Documents

Publication Publication Date Title
CN109271521B (en) Text classification method and device
CN106778241B (en) Malicious file identification method and device
US20170124435A1 (en) Method for Text Recognition and Computer Program Product
CN107229627B (en) Text processing method and device and computing equipment
CN111523314B (en) Model confrontation training and named entity recognition method and device
CN107341143A (en) A kind of sentence continuity determination methods and device and electronic equipment
CN110580308A (en) information auditing method and device, electronic equipment and storage medium
CN109993216B (en) Text classification method and device based on K nearest neighbor KNN
CN114282527A (en) Multi-language text detection and correction method, system, electronic device and storage medium
CN114416979A (en) Text query method, text query equipment and storage medium
CN110826298A (en) Statement coding method used in intelligent auxiliary password-fixing system
CN113486178A (en) Text recognition model training method, text recognition device and medium
CN113761875B (en) Event extraction method and device, electronic equipment and storage medium
CN113282717B (en) Method and device for extracting entity relationship in text, electronic equipment and storage medium
CN111241271B (en) Text emotion classification method and device and electronic equipment
CN109472020B (en) Feature alignment Chinese word segmentation method
CN114266251A (en) Malicious domain name detection method and device, electronic equipment and storage medium
CN112732863B (en) Standardized segmentation method for electronic medical records
CN112134858B (en) Sensitive information detection method, device, equipment and storage medium
CN109902162B (en) Text similarity identification method based on digital fingerprints, storage medium and device
CN113836297B (en) Training method and device for text emotion analysis model
CN110874398B (en) Forbidden word processing method and device, electronic equipment and storage medium
CN117852030A (en) Method and device for generating training sample and training code classification model
CN115294593A (en) Image information extraction method and device, computer equipment and storage medium
CN113282746B (en) Method for generating variant comment countermeasure text of network media platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination