CN114253866A

CN114253866A - Malicious code detection method and device, computer equipment and readable storage medium

Info

Publication number: CN114253866A
Application number: CN202210190516.1A
Authority: CN
Inventors: 王东升; 彭涛; 赵立伟; 王健; 王特; 阎博
Original assignee: Ziguang Hengyue Technology Co ltd
Current assignee: Ziguang Hengyue Technology Co ltd
Priority date: 2022-03-01
Filing date: 2022-03-01
Publication date: 2022-03-29
Anticipated expiration: 2042-03-01
Also published as: CN114253866B

Abstract

The application belongs to the technical field of detection, and discloses a method, a device, computer equipment and a readable storage medium for detecting malicious codes, wherein the method comprises the steps of extracting data characteristics of data to be detected according to the position relation among instructions in the data to be detected; and inputting the data characteristics into a pre-trained malicious code detection model to obtain a code detection result output by the malicious code detection model, wherein the malicious code detection model is constructed based on machine learning. Therefore, a malicious code detection model is established based on machine learning, data features are extracted according to the position relation of each instruction in the data to be detected, the trained malicious code detection model is adopted, malicious code detection is carried out on the data based on the data features, a code detection result is obtained, more precise data features can be extracted, and the accuracy of code detection is further improved.

Description

Malicious code detection method and device, computer equipment and readable storage medium

Technical Field

The present application relates to the field of detection technologies, and in particular, to a method and an apparatus for malicious code detection, a computer device, and a readable storage medium.

Background

Malicious code (Shellcode) is a piece of code that is executed by exploiting a software vulnerability. After the instruction pointer register (EIP) is overwritten through an overflow, a piece of Shellcode machine code that can be executed by the CPU is injected, so that the attacker can make the device execute arbitrary instructions.

In the prior art, the detection of the presence of the Shellcode is usually performed by means of abnormal behavior detection, code matching and the like.

However, with these methods, the accuracy of malicious code detection is low and the false negative rate is high.

Disclosure of Invention

The embodiment of the application aims to provide a method and a device for detecting malicious codes, computer equipment and a readable storage medium, which are used for improving the accuracy of malicious code detection and reducing the false negative rate of malicious code detection when the malicious codes are detected.

In one aspect, a method for malicious code detection is provided, including:

extracting data characteristics of the data to be detected according to the position relation among the instructions in the data to be detected;

and inputting the data characteristics into a pre-trained malicious code detection model to obtain a code detection result output by the malicious code detection model, wherein the malicious code detection model is constructed based on machine learning.

In the implementation process, a malicious code detection model is established based on machine learning, data features are extracted according to the position relation of each instruction in the data to be detected, the trained malicious code detection model is adopted, malicious code detection is carried out on the data based on the data features, a code detection result is obtained, more precise data features can be extracted, and the code detection accuracy is further improved.

In one embodiment, after inputting the data characteristics into a pre-trained malicious code detection model and obtaining a code detection result output by the malicious code detection model, the method further includes:

respectively carrying out malicious code detection on the data to be detected based on at least one malicious code detection mode to obtain at least one code detection category, wherein the construction principle of each malicious code detection mode is different from that of the malicious code detection model;

respectively determining the probability of each code detection category based on at least one code detection category and the code detection categories in the code detection result;

and generating a comprehensive detection result of the data to be detected according to the probability of each code detection category.

In the implementation process, malicious code detection is performed on the data by combining with a malicious code detection mode constructed based on other principles to obtain a code detection type, so that the code detection type output by a malicious code detection model and the code detection type output by other malicious code detection modes are combined to obtain a comprehensive code detection result, and the accuracy of malicious code detection is further improved.
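As a rough illustration of how the categories reported by several detectors might be combined, the following sketch assumes that the probability of a category is simply the share of detectors (the model plus the other detection modes) that reported it. This formula is an assumption for illustration only, and the function name `combine_detections` is hypothetical.

```python
from collections import Counter

def combine_detections(model_category, other_categories):
    """Combine the model's category with the categories from other detection modes.

    Assumption: the probability of a category is the share of detectors
    (model + other modes) that reported it.
    """
    votes = [model_category] + list(other_categories)
    counts = Counter(votes)
    probabilities = {cat: n / len(votes) for cat, n in counts.items()}
    # The comprehensive result is the category with the highest probability.
    best = max(probabilities, key=probabilities.get)
    return best, probabilities

result, probs = combine_detections("malicious", ["malicious", "non-malicious"])
```

Here two of the three detectors report "malicious", so that category receives the larger probability and becomes the comprehensive detection result.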

In one embodiment, extracting data features of data to be detected according to a position relationship between instructions in the data to be detected includes:

respectively acquiring the instruction category of each instruction in the data to be detected;

generating an instruction category combination corresponding to every two adjacent instructions based on the instruction category of each instruction in the data to be detected, wherein each instruction category combination comprises the instruction category of one instruction and the instruction category of the instruction that follows it;

determining the combination occurrence times of each instruction type combination based on each instruction type combination;

respectively determining the average instruction interval between every two instruction categories based on the instruction categories of all instructions in the data to be detected, wherein the two instruction categories are the same category or different categories;

and acquiring the data characteristics of the data to be detected based on the combination occurrence times of the instruction category combinations and the instruction average interval between every two instruction categories.

In the implementation process, according to the repeated occurrence times of the instruction class combination corresponding to every two instructions and the instruction average interval between every two instruction classes, finer data characteristics can be obtained.

In one embodiment, obtaining data characteristics of data to be detected based on the combined occurrence number of each instruction category combination and the instruction average interval between every two instruction categories includes:

generating an instruction transfer matrix based on the combination occurrence times of each instruction category combination;

generating an instruction interval matrix based on the average interval of the instructions between every two instruction categories;

and splicing the instruction transfer matrix and the instruction interval matrix to obtain a feature matrix, wherein the feature matrix is a data feature.

In the implementation process, the instruction transfer matrix and the instruction interval matrix are spliced to obtain the data characteristics of the malicious code detection model, so that convenience is provided for subsequent code detection.

In one embodiment, the method further comprises:

acquiring a training sample data set, wherein the training sample data set comprises a plurality of training sample data and corresponding actual code types;

respectively obtaining data characteristics corresponding to each training sample data according to the position relation of each instruction in each training sample data in the training sample data set;

and training the malicious code detection model based on the data characteristics and the actual code category respectively corresponding to each training sample data to obtain the trained malicious code detection model.

In the implementation process, an initial malicious code detection model is constructed based on machine learning, data features are extracted according to the position relation of the instructions, and model training is performed according to the extracted data features, so that a malicious code detection model with higher detection precision can be obtained.

In one aspect, an apparatus for malicious code detection is provided, including:

the extraction unit is used for extracting the data characteristics of the data to be detected according to the position relation among the instructions in the data to be detected;

and the detection unit is used for inputting the data characteristics into a pre-trained malicious code detection model to obtain a code detection result output by the malicious code detection model, and the malicious code detection model is constructed based on machine learning.

In one embodiment, the detection unit is further configured to:

In one embodiment, the extraction unit is configured to:

In one embodiment, the extraction unit is further configured to:

In one aspect, a computer device is provided, comprising a processor and a memory, the memory storing computer readable instructions which, when executed by the processor, perform the steps of the method provided in any of the various alternative implementations of malicious code detection described above.

In one aspect, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, performs the steps of the method as provided in any of the various alternative implementations of malicious code detection described above.

In one aspect, a computer program product is provided which, when run on a computer, causes the computer to perform the steps of the method as provided in any of the various alternative implementations of malicious code detection described above.

In the method, the device, the computer equipment and the readable storage medium for detecting the malicious code, provided by the embodiment of the application, the data characteristics of the data to be detected are extracted according to the position relation among the instructions in the data to be detected; and inputting the data characteristics into a pre-trained malicious code detection model to obtain a code detection result output by the malicious code detection model, wherein the malicious code detection model is constructed based on machine learning. Therefore, a malicious code detection model is established based on machine learning, data features are extracted according to the position relation of each instruction in the data to be detected, the trained malicious code detection model is adopted, malicious code detection is carried out on the data based on the data features, a code detection result is obtained, more precise data features can be extracted, and the accuracy of code detection is further improved.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.

Fig. 1 is a flowchart illustrating an implementation of a method for training a malicious code detection model according to an embodiment of the present disclosure;

Fig. 2 is an exemplary diagram of a list of instruction categories provided by an embodiment of the present application;

Fig. 3 is a flowchart of a method for detecting malicious code according to an embodiment of the present disclosure;

Fig. 4 is a schematic architecture diagram of a malicious code detection system according to an embodiment of the present disclosure;

Fig. 5 is a block diagram illustrating a structure of an apparatus for detecting malicious code according to an embodiment of the present disclosure;

Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.

First, some terms referred to in the embodiments of the present application will be described to facilitate understanding by those skilled in the art.

Terminal device: may be a mobile terminal, a fixed terminal, or a portable terminal such as a mobile handset, station, unit, device, multimedia computer, multimedia tablet, internet node, communicator, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, personal communication system device, personal navigation device, personal digital assistant, audio/video player, digital camera/camcorder, positioning device, television receiver, radio broadcast receiver, electronic book device, gaming device, or any combination thereof, including the accessories and peripherals of these devices. It is also contemplated that the terminal device can support any type of user interface (e.g., a wearable device).

Server: may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, and big data and artificial intelligence platforms.

In order to improve the accuracy of malicious code detection and reduce the false negative rate of malicious code detection when malicious code detection is performed, embodiments of the present application provide a method and an apparatus for malicious code detection, a computer device, and a readable storage medium.

In the embodiment of the present application, the execution subject is a computer device, and optionally, the computer device may be a server or a terminal device.

In the embodiment of the application, model training is performed on the basis of a training sample data set to obtain a trained malicious code detection model, and then malicious code detection is performed on the basis of the trained malicious code detection model.

Referring to fig. 1, an implementation flow chart of a method for training a malicious code detection model according to an embodiment of the present application is shown, and a specific implementation flow of the method is as follows:

Step 100: acquiring a training sample data set.

Specifically, a plurality of training sample data are obtained, the actual code category of each training sample data is set, and a training sample data set including the plurality of training sample data and the corresponding actual code category is generated.

The training sample data comprises positive sample data and negative sample data. The positive sample data is normal data that does not contain malicious code and may also be referred to as white samples. The negative sample data is abnormal data that contains malicious code. The actual code category indicates whether the true category of a piece of training sample data is malicious code or non-malicious code. The actual code category corresponding to the positive sample data is non-malicious code, and the actual code category corresponding to the negative sample data is malicious code.

In one embodiment, data is collected from channels such as malicious code publishing websites, and at least one of source code extraction, obfuscation extraction, code extraction, and the like is used to extract Shellcode instruction codes from the collected data, thereby obtaining the negative sample data.

The malicious code publishing website is a website for publishing malicious codes.

In one embodiment, normal transmission traffic, such as at least one of information such as files, data, pictures, and web pages, is collected, and a packet capture library (pcap) or other manners may be used to perform data analysis on the collected traffic and extract encoded data to obtain normal data.

Alternatively, the encoded data may be hexadecimal (Hex) data.

For example, the training sample data set includes 2138 pieces of negative sample data and 3004 pieces of positive sample data.

In practical application, the collection quantity, the collection channel and the extraction mode of the positive sample data and the negative sample data can be set according to a practical application scene, and are not limited herein.

In one embodiment, the actual code category of the training sample data may be set in a labeling manner.

For example, by means of labels, the actual code category of positive sample data is marked as 1 and the actual code category of negative sample data is marked as 0.

In this way, a set of training sample data for model training may be obtained.

Further, a verification data set for model testing can be generated based on a principle similar to that of obtaining the training sample data set, so that in the subsequent step, the accuracy of code detection can be evaluated based on the verification data set.

Optionally, the data in the verification data set may be different from the data in the training sample data set. The data ratio between the verification data set and the training sample data set can be set according to the actual application scenario, for example, 3:7, which is not limited herein.
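As a minimal sketch of the 3:7 split mentioned above, the following code shuffles labelled samples and separates them into a training part and a verification part. The function name `split_dataset`, the fixed random seed, and the placeholder samples are assumptions made only for illustration.

```python
import random

def split_dataset(samples, train_ratio=0.7, seed=0):
    """Shuffle labelled samples and split them into training and verification sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Placeholder samples: (hex payload, label); 1 = non-malicious, 0 = malicious.
samples = [(format(i, "02x") * 4, i % 2) for i in range(10)]
train_set, verification_set = split_dataset(samples)
```

With ten samples, seven end up in the training set and three in the verification set, and no sample appears in both.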

In one embodiment, the acquired training sample data set and verification data set can be read into a DataFrame using the data analysis library Pandas, so as to facilitate subsequent data processing.

Step 101: respectively obtaining the data characteristics corresponding to each training sample data according to the position relation of each instruction in each training sample data in the training sample data set.

Specifically, when step 101 is executed, the following steps may be adopted:

S1011: respectively acquiring the instruction category of each instruction in the training sample data.

Specifically, according to the corresponding relation between the machine code and the assembly instruction, the assembly instruction corresponding to each instruction in each training sample data is respectively obtained, and according to the corresponding relation between the assembly instruction and the instruction category, the instruction category to which the assembly instruction corresponding to each instruction belongs is respectively obtained.

This is because the negative sample data, i.e., the Shellcode data, is hexadecimal machine code in the form "\x90\x90\x16\x6b", and there are 256 hexadecimal machine codes from 00 to FF (corresponding to decimal 0 to 255). Therefore, the assembly instruction corresponding to a hexadecimal machine code can be found by querying the correspondence table between assembly instructions and machine codes; for example, the hexadecimal machine code "\x06" corresponds to the assembly instruction PUSH ES.
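The lookup described above can be sketched as follows. The helper name `parse_shellcode` is hypothetical, and the table contains only two entries for illustration; a real table would cover all 256 byte values. Treating every byte as one machine code follows the description above and is a simplification of real variable-length instruction decoding.

```python
def parse_shellcode(text):
    r"""Convert a string such as '\x90\x90\x16\x6b' into a list of byte values (0-255)."""
    hex_digits = text.replace("\\x", "")
    return list(bytes.fromhex(hex_digits))

# A tiny excerpt of a machine-code-to-assembly table; only these entries are filled in.
OPCODE_TO_MNEMONIC = {0x06: "PUSH ES", 0x90: "NOP"}

codes = parse_shellcode(r"\x90\x90\x16\x6b")
first_instruction = OPCODE_TO_MNEMONIC.get(codes[0], "UNKNOWN")
```

Each element of `codes` is a value between 0 and 255, which can be used directly as a key into the correspondence table.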

In one embodiment, the assembly instructions are classified according to their operation types and functions to obtain their instruction classes, and a correspondence between the assembly instructions and the instruction classes is established, so that a correspondence between hexadecimal instructions and instruction classes can be established.

The instruction categories may include at least one of the following: data transfer instructions, arithmetic operation instructions, logical operation instructions, string instructions, program branch instructions, pseudo-instructions, flag processing instructions, and others.

In one embodiment, the assembly instructions are divided into 8 categories, namely data transfer instructions, arithmetic operation instructions, logical operation instructions, string instructions, program branch instructions, pseudo-instructions, flag processing instructions, and others.

The data transfer instructions may include MOV, PUSH, POP, and the like. The arithmetic operation instructions may include ADD, SUB, SBB, and the like. The logical operation instructions may include AND, OR, XOR, and the like. The string instructions may include MOVS, CMPS, REPC, and the like. The program branch instructions may include JMP, CALL, RET, and the like. The pseudo-instructions may include PROC, ENDP, ENDS, and the like. The flag processing instructions may include CLC, CMC, STD, and the like. Others may include NOT USED and the like.

In practical applications, the instruction category may be set according to practical application scenarios, and is not limited herein.
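A minimal sketch of the correspondence between assembly instructions and instruction categories is given below. The numbering of the categories from 1 to 8 and the function name `instruction_category` are assumptions for illustration; the mnemonics are the examples listed above.

```python
# Category numbers 1-8 are an assumed numbering, not taken from the description.
CATEGORY_OF_MNEMONIC = {
    "MOV": 1, "PUSH": 1, "POP": 1,      # data transfer
    "ADD": 2, "SUB": 2, "SBB": 2,       # arithmetic operation
    "AND": 3, "OR": 3, "XOR": 3,        # logical operation
    "MOVS": 4, "CMPS": 4, "REPC": 4,    # string
    "JMP": 5, "CALL": 5, "RET": 5,      # program branch
    "PROC": 6, "ENDP": 6, "ENDS": 6,    # pseudo-instructions
    "CLC": 7, "CMC": 7, "STD": 7,       # flag (tag) processing
}
OTHER = 8                               # everything that is not listed above

def instruction_category(assembly):
    """Return the category number of an assembly instruction such as 'PUSH ES'."""
    mnemonic = assembly.split()[0].upper()
    return CATEGORY_OF_MNEMONIC.get(mnemonic, OTHER)

category = instruction_category("PUSH ES")
```

Only the mnemonic (the first word) decides the category in this sketch, so operands such as ES are ignored.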

S1012: generating an instruction category combination corresponding to every two adjacent instructions based on the instruction category of each instruction in each training sample data.

Specifically, the instruction type combination is a binary combination composed of two instruction types, and each instruction type combination includes an instruction type of one instruction and an instruction type of a next instruction of the instruction.

S1013: respectively determining the combination occurrence count n of each instruction category combination in each training sample data based on the instruction category combinations in that training sample data.

That is, the times of repeated occurrences of each instruction class combination in the same training sample data are respectively counted.

For example, the instructions in a piece of training sample data are in turn: instruction 1, instruction 2, instruction 3, instruction 4, instruction 1, instruction 2. The instruction types corresponding to the instructions are as follows in sequence: category 1, category 2, category 3, category 4, category 1, category 2. Based on the instruction category of each instruction in the training sample data, the generated instruction category combination sequentially comprises: (class 1, class 2), (class 2, class 3), (class 3, class 4), (class 4, class 1), (class 1, class 2). The instruction category combination (category 1, category 2) appears twice, that is, the number of occurrences of the combination n =2, and the number of occurrences of the combination n of the other instruction category combinations is 1.

Optionally, a matrix form may be adopted, and the number of times n of combination occurrence of each instruction category combination is counted.

In one embodiment, a matrix tool is used to construct a k × k instruction transfer matrix, where the initial value of each element is zero, each row corresponds to the instruction category of the current instruction, and each column corresponds to the instruction category of the instruction following the current instruction. k is an integer representing the number of rows (or columns) of the instruction transfer matrix; for example, k may be 8.

Each piece of read training sample data is traversed with a sliding window starting from the first instruction. The window size is 2 (two instructions are counted each time) and the sliding step is 1 instruction; at each position, the instruction categories of the current instruction and the next instruction are determined. For example, if the current instruction belongs to category 2 and the next instruction belongs to category 4, the matrix position to be processed is recorded as (2, 4), that is, the element at row index 1 and column index 3 of the instruction transfer matrix (matrix subscripts start from 0), and the value of that element is incremented by 1. Proceeding in this way until the whole piece of data has been traversed yields the completed instruction transfer matrix, in which each element is the combination occurrence count n of the corresponding instruction category combination. The instruction transfer matrices of the other training sample data can be obtained and stored in the same way.

Therefore, the occurrence times of the combination of each instruction type combination in each corresponding training sample data can be respectively counted through each instruction transfer matrix, and the transfer relation among the instruction types in the training sample data is reflected through the instruction transfer matrix.
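The sliding-window counting described above can be sketched as follows, using plain nested lists instead of a matrix tool. The function name `transfer_matrix` is hypothetical, and categories are assumed to be numbered from 1 to k so that category c maps to index c - 1.

```python
def transfer_matrix(categories, k=8):
    """Build a k x k matrix; element [i][j] counts category i+1 followed by category j+1."""
    matrix = [[0] * k for _ in range(k)]
    # zip pairs each instruction with its successor: window size 2, step 1.
    for current, following in zip(categories, categories[1:]):
        matrix[current - 1][following - 1] += 1
    return matrix

# Category sequence from the earlier example: 1, 2, 3, 4, 1, 2.
matrix = transfer_matrix([1, 2, 3, 4, 1, 2])
```

For this sequence the element for the combination (category 1, category 2) is 2 and the elements for the other three combinations that occur are 1, in line with the counts given in the example.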

S1014: respectively determining the instruction average interval between every two instruction categories in each training sample data based on the instruction category of each instruction in that training sample data.

Specifically, every two instruction categories (i.e. the first instruction category and the second instruction category) in the training sample data may be the same category or different categories.

In one embodiment, according to the order of the instruction categories of each instruction in training sample data, an instruction category list corresponding to the training sample data is generated, and the following steps are executed for each two instruction categories in the instruction category list:

obtaining a first position, at which the first instruction category first appears in the instruction category list, and each second position, at which the second instruction category appears after the first position; determining the instruction interval between each second position and the first position; and taking the average of these instruction intervals as the instruction average interval between the first instruction category and the second instruction category.

Referring to fig. 2, an exemplary diagram of an instruction category list provided in the embodiment of the present application is shown. In fig. 2, the instruction categories corresponding to the instructions in one piece of training sample data are, in order: [1, 2, 4, 1, 5, 7, 7, 8, 2, 4, 7, 1, 3, 5, 6].

For example, when the first instruction category and the second instruction category are both 1, the first position is the position where 1 appears for the first time in fig. 2, and the second positions are the positions where 1 appears after the first position. In this training sample data, the instruction intervals between the first position of the first instruction category 1 and the second positions of the second instruction category 1 are 3 and 11, which can be denoted as [3, 11]. The instruction average interval between the first instruction category 1 and the second instruction category 1 is (3 + 11)/2 = 7.

For another example, when the first instruction category is 4 and the second instruction category is 7, the first position is the position where 4 appears for the first time in fig. 2, and the second positions are the positions where 7 appears after the first position. The instruction intervals between the first position of the first instruction category 4 and the second positions of the second instruction category 7 are 3, 4 and 8, which can be denoted as [3, 4, 8]. The instruction average interval between the first instruction category 4 and the second instruction category 7 is (3 + 4 + 8)/3 = 5.
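The calculation in these two examples can be sketched as follows. The function name `average_interval` is hypothetical. The sequence below is an assumption: it is a category list that reproduces the worked intervals [3, 11] and [3, 4, 8]. Returning 0.0 when no interval exists is also an assumption, since the description does not state how such pairs are encoded.

```python
def average_interval(categories, first, second):
    """Average distance from the first occurrence of `first` to each later `second`."""
    if first not in categories:
        return 0.0                      # assumption: a missing category yields 0
    start = categories.index(first)     # first position of the first category
    gaps = [pos - start
            for pos in range(start + 1, len(categories))
            if categories[pos] == second]
    return sum(gaps) / len(gaps) if gaps else 0.0

# Assumed sequence, chosen so that the intervals match the worked examples.
sequence = [1, 2, 4, 1, 5, 7, 7, 8, 2, 4, 7, 1, 3, 5, 6]
interval_1_1 = average_interval(sequence, 1, 1)
interval_4_7 = average_interval(sequence, 4, 7)
```

Positions are counted from 0, so the interval is simply the difference between the index of a second position and the index of the first position.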

Specifically, the following steps are executed for each training sample data:

For a piece of training sample data, an r × r instruction interval matrix is constructed, in which each row corresponds to the first instruction category and each column corresponds to the second instruction category. r is an integer representing the number of rows (or columns) of the instruction interval matrix; for example, r may be 8. That is, the rows of the instruction interval matrix correspond to the 8 instruction categories, and the columns of the instruction interval matrix also correspond to the 8 instruction categories.

The matrix position information in the instruction interval matrix may be represented as (first instruction class, second instruction class). The element corresponding to the matrix position (the first instruction type, the second instruction type) is the instruction average interval between the first instruction type and the second instruction type in the training sample data.

Therefore, the instruction interval matrix corresponding to each training sample data can be obtained, and the position interval relation among the instruction types in the training sample data is reflected through the instruction interval matrix.
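A minimal sketch of filling the r × r instruction interval matrix is shown below. The function name `interval_matrix` is hypothetical, categories are assumed to be numbered from 1 to r, and leaving an element at 0 when a pair never occurs is an assumption of this sketch.

```python
def interval_matrix(categories, r=8):
    """Build an r x r matrix of average intervals; rows = first category, columns = second."""
    matrix = [[0.0] * r for _ in range(r)]
    for first in range(1, r + 1):
        if first not in categories:
            continue                        # assumption: absent categories stay at 0
        start = categories.index(first)     # first position of the first category
        gaps = {}                           # second category -> list of intervals
        for pos in range(start + 1, len(categories)):
            gaps.setdefault(categories[pos], []).append(pos - start)
        for second, values in gaps.items():
            matrix[first - 1][second - 1] = sum(values) / len(values)
    return matrix

# Assumed category sequence used only to exercise the function.
sequence = [1, 2, 4, 1, 5, 7, 7, 8, 2, 4, 7, 1, 3, 5, 6]
intervals = interval_matrix(sequence)
```

Each row is filled in a single pass over the part of the list that follows the first position, so all second categories for a given first category are handled together.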

S1015: respectively obtaining the data characteristics of each training sample data based on the combination occurrence count of each instruction category combination in that training sample data and the instruction average interval between every two instruction categories.

Specifically, the following steps are respectively executed for each training sample data:

obtaining the data characteristics of a piece of training sample data based on the combination occurrence count of each instruction category combination in that training sample data and the instruction average interval between every two instruction categories.

In one embodiment, an instruction transfer matrix is generated based on the number of occurrences of a combination of each instruction class in a piece of training sample data, an instruction interval matrix is generated based on an average interval of instructions between every two instruction classes in the training sample data, and the instruction transfer matrix and the instruction interval matrix are spliced to obtain a feature matrix, that is, a data feature of the training sample data.

In one embodiment, the instruction transfer matrix is converted into a one-dimensional transfer matrix, the instruction interval matrix is converted into a one-dimensional interval matrix, and the one-dimensional transfer matrix and the one-dimensional interval matrix are spliced to obtain a data feature matrix, i.e., a data feature.

In this way, the data features of each training sample data can be extracted.

Step 102: and training the malicious code detection model based on the data characteristics and the actual code category respectively corresponding to each training sample data to obtain the trained malicious code detection model.

Specifically, when step 102 is executed, the following steps may be executed in a loop:

S1021: inputting the data characteristics of each training sample data into the malicious code detection model, and outputting the code detection category corresponding to each training sample data.

The malicious code detection model is constructed based on machine learning.

In one embodiment, the AdaBoost iterative algorithm in the machine learning library scikit-learn is adopted to construct the malicious code detection model.

S1022: and determining a detection error based on the code detection category and the actual code category of each training sample data.

Specifically, the code detection category includes: malicious code and non-malicious code to indicate that the data is malicious code or non-malicious code.

In one embodiment, the total number of training sample data and the number of training sample data whose code detection category is inconsistent with the actual code category are counted, and the ratio of the inconsistent number to the total number is determined as the detection error.

In practical applications, the detection error may be determined in other manners, which is not limited herein.
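The ratio-based detection error described in the embodiment above can be sketched as follows. The function name `detection_error` and the category labels are hypothetical and chosen only for illustration.

```python
def detection_error(predicted, actual):
    """Fraction of samples whose detected category differs from the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    mismatches = sum(1 for p, a in zip(predicted, actual) if p != a)
    return mismatches / len(actual)


error = detection_error(
    ["malicious", "non-malicious", "malicious", "non-malicious"],
    ["malicious", "malicious", "malicious", "non-malicious"],
)
```

Here one of the four samples is misclassified, so the detection error is 0.25.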

S1023: and judging whether the detection error meets an error threshold value, if so, executing S1024, and otherwise, executing S1025.

In practical applications, the error threshold may be set according to practical application scenarios, and is not limited herein.

S1024: and obtaining a trained malicious code detection model.

S1025: and adjusting the model parameters of the malicious code detection model according to the detection error, and executing S1021.
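The loop formed by S1021 to S1025 can be sketched as follows. This is an illustration of the control flow only. The class `ToyCutoffModel` is a hypothetical stand-in and is not AdaBoost; its parameter update is deliberately crude. The sketch also assumes that "meets the error threshold" means the error is less than or equal to the threshold, and it adds a `max_rounds` cap that the description does not mention.

```python
MALICIOUS, BENIGN = "malicious", "non-malicious"


class ToyCutoffModel:
    """Hypothetical stand-in model: flags a sample whose first feature exceeds a cutoff."""

    def __init__(self, cutoff):
        self.cutoff = cutoff

    def predict(self, features):
        return MALICIOUS if features[0] > self.cutoff else BENIGN

    def adjust(self, error):
        # Crude parameter update, purely for illustration.
        self.cutoff -= 1


def train_until_threshold(model, features, labels, error_threshold, max_rounds=100):
    """Repeat S1021-S1025 until the detection error meets the threshold."""
    error = 1.0
    for round_no in range(1, max_rounds + 1):
        predicted = [model.predict(x) for x in features]        # S1021
        wrong = sum(p != y for p, y in zip(predicted, labels))
        error = wrong / len(labels)                             # S1022
        if error <= error_threshold:                            # S1023
            return round_no, error                              # S1024
        model.adjust(error)                                     # S1025
    return max_rounds, error


model = ToyCutoffModel(cutoff=8)
rounds, error = train_until_threshold(
    model,
    [[1], [2], [5], [6]],
    [BENIGN, BENIGN, MALICIOUS, MALICIOUS],
    error_threshold=0.0,
)
```

With this toy data the cutoff is lowered once per round until all four samples are classified correctly.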

Therefore, the instruction transfer characteristics and the instruction position characteristics in the data can be extracted, and model training is carried out based on the instruction transfer characteristics and the instruction position characteristics, so that code detection can be carried out through a trained malicious code detection model in the subsequent steps, and the detection precision of shellcode detection is improved.

Referring to fig. 3, a flowchart of a method for detecting malicious codes according to an embodiment of the present application is shown, and a specific implementation flow of the method is as follows:

Step 300: extracting the data characteristics of the data to be detected according to the position relation among the instructions in the data to be detected.

Specifically, when step 300 is executed, the following steps may be adopted:

S3001: respectively acquiring the instruction type of each instruction in the data to be detected.

In one embodiment, the following steps are performed for each instruction in the data to be detected:

and obtaining an instruction type corresponding to one instruction according to the preset corresponding relation between the instruction and the instruction type.

S3002: and generating an instruction type combination corresponding to every two adjacent instructions based on the instruction type of each instruction in the data to be detected.

Specifically, each instruction type combination includes an instruction type of one instruction and an instruction type of a next instruction of the instruction.

In this way, a binary combination of instruction class composition, i.e. an instruction class combination, for every two adjacent instructions can be obtained.

S3003: and determining the combined occurrence times of the instruction type combinations based on the instruction type combinations.

Specifically, the number of repeated occurrences of each instruction type combination is counted in sequence, and the number of combined occurrences of each instruction type combination is obtained respectively.

Therefore, according to the position relation among the instructions in the data to be detected and the corresponding relation between the instructions and the instruction types, the instruction type combination corresponding to every two adjacent instructions can be generated, and the repeated occurrence frequency of each instruction type combination, namely the combination occurrence frequency, is counted.
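Steps S3001 to S3003 can be sketched as follows. The mapping `CATEGORY_OF`, the category names, and the fallback category "other" are hypothetical; the preset correspondence between instructions and instruction types used by the method is not reproduced here.

```python
from collections import Counter

# Hypothetical mapping from instruction mnemonic to instruction category.
CATEGORY_OF = {
    "mov": "data_transfer",
    "push": "stack",
    "pop": "stack",
    "add": "arithmetic",
    "xor": "logic",
    "jmp": "control",
    "call": "control",
}


def category_pair_counts(mnemonics, category_of=CATEGORY_OF, default="other"):
    """Count how often each (category, next category) combination occurs."""
    categories = [category_of.get(m, default) for m in mnemonics]  # S3001
    pairs = zip(categories, categories[1:])                        # S3002
    return Counter(pairs)                                          # S3003


counts = category_pair_counts(["push", "pop", "mov", "xor", "push", "pop"])
```

A sequence of n instructions yields n - 1 adjacent combinations, so the six instructions above produce five combinations, two of which are ("stack", "stack").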

S3004: and respectively determining the average instruction interval between every two instruction categories based on the instruction categories of the instructions in the data to be detected.

Specifically, each two instruction categories (i.e., the first instruction category and the second instruction category) may be the same category or different categories.

In one embodiment, an instruction category list corresponding to data to be detected is generated according to the order of instruction categories of each instruction in the data to be detected, and the following steps are executed respectively for every two instruction categories in the instruction category list:

In practical applications, the average interval between every two instruction categories may also be determined in other manners, and is not limited herein.

S3005: and acquiring the data characteristics of the data to be detected based on the combination occurrence times of the instruction category combinations and the instruction average interval between every two instruction categories.

Specifically, when S3005 is executed, the following steps may be adopted:

S30051: generating an instruction transfer matrix based on the combined occurrence times of the instruction type combinations.

S30052: an instruction interval matrix is generated based on the average interval of instructions between each two instruction classes.

S30053: and splicing the instruction transfer matrix and the instruction interval matrix to obtain the data characteristics.

For example, an 8x8 instruction transition matrix is converted into a 1x64 one-dimensional transition matrix, an instruction interval matrix is converted into a 1x64 one-dimensional interval matrix, and the one-dimensional transition matrix and the one-dimensional interval matrix are spliced to obtain a 1x128 data feature matrix.
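Steps S30051 to S30053 can be sketched as follows, under the assumption that instruction classes are encoded as integers 0..r-1 and that the matrices are flattened in row-major order (the description does not state the flattening order). The function names are hypothetical.

```python
def build_transfer_matrix(classes, r=8):
    """Count transitions from each instruction class to the class that follows it."""
    matrix = [[0] * r for _ in range(r)]
    for current, following in zip(classes, classes[1:]):
        matrix[current][following] += 1
    return matrix


def flatten(matrix):
    """Row-major flattening of a 2-D list into a 1-D list."""
    return [value for row in matrix for value in row]


def splice_features(transfer, interval):
    """Concatenate the flattened transfer matrix and the flattened interval matrix."""
    same_shape = len(transfer) == len(interval) and all(
        len(a) == len(b) for a, b in zip(transfer, interval)
    )
    if not same_shape:
        raise ValueError("both matrices must have the same shape")
    return flatten(transfer) + flatten(interval)


# Two 8x8 matrices give a feature vector with 64 + 64 = 128 elements.
zeros = [[0] * 8 for _ in range(8)]
features_8 = splice_features(zeros, zeros)

# Small 3-class example with a placeholder (all-zero) interval matrix.
transfer_3 = build_transfer_matrix([0, 1, 0, 1, 2], r=3)
features_3 = splice_features(transfer_3, [[0.0] * 3 for _ in range(3)])
```

Keeping the transfer part first and the interval part second gives every sample the same feature layout, which a model trained on fixed-length vectors requires.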

In the embodiment of the present application, the data features of the data to be detected may be extracted based on a principle similar to that used for the training sample data; that is, for the specific steps of S3001-S3005, reference may be made to step 101, which is not described herein again.

In this way, data features in the data to be detected can be extracted.

Step 301: and inputting the data characteristics into a pre-trained malicious code detection model to obtain a code detection result output by the malicious code detection model.

Specifically, the malicious code detection model is used for identifying whether the data to be detected is malicious code.

In one embodiment, the data characteristics of the data to be detected are input into a pre-trained malicious code detection model, the code detection category of the data to be detected is obtained, and then the code detection result containing the code detection category is obtained.

Furthermore, in order to improve the accuracy of code detection, code detection can be performed by combining other detection modes to obtain a comprehensive detection result of the data to be detected.

Specifically, when determining the comprehensive detection result, the following steps may be adopted:

S3011: respectively carrying out malicious detection on the data to be detected based on at least one malicious code detection mode to obtain at least one code detection category.

Specifically, the malicious code detection mode is a mode different from detection based on the malicious code detection model, that is, the malicious code detection mode and the malicious code detection model are different in principle.

Optionally, the malicious code detection mode may be a trained code detection auxiliary model. The number of the code detection assistance models may be one or more.

Optionally, the construction principle of the code detection auxiliary model may be based on at least one of the following algorithms:

the term frequency-inverse document frequency (TF-IDF) algorithm, N-gram statistical language model algorithms (e.g., 3-gram, 4-gram, and 5-gram), the Convolutional Neural Network (CNN) algorithm, and the Bag of Words (BOW) algorithm.

In one embodiment, 6 trained code detection auxiliary models are obtained, and data to be detected is detected by the 6 trained code detection auxiliary models respectively, so as to obtain a code detection category output by each code detection auxiliary model.

The 6 different code detection auxiliary models are respectively constructed based on a TF-IDF algorithm, a CNN algorithm, a BOW algorithm, a 3-gram, a 4-gram and a 5-gram algorithm.

Before executing S3011, each code detection assist model is trained in advance based on a training sample data set, and trained code detection assist models are obtained.

In practical application, the malicious code detection mode may be set according to a practical application scenario, which is not limited herein.

S3012: and respectively determining the probability of each code detection category based on at least one code detection category and the code detection categories in the code detection result.

In one embodiment, the total number of times of detection of the determined code detection categories is counted, the number of times of occurrence of each code detection category is counted, and the probability of each code detection category is obtained according to the ratio of the number of times of occurrence of each code detection category to the total number of times of detection.

For example, the 6 different code detection auxiliary models each output one code detection category, and the code detection category in the code detection result is malicious code, so the total number of detections is 7. If malicious code occurs 5 times and non-malicious code occurs 2 times among these 7 code detection categories, the probability of the malicious code is 5/7, and the probability of the non-malicious code is 2/7.

In one embodiment, the occurrence frequency of each code detection category is counted, and the probability of each code detection category is obtained according to the occurrence frequency and the weight of each code detection category.

In this way, the probability of each code detection class can be obtained.
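Both embodiments of S3012 can be sketched in one function. The unweighted case is the ratio of occurrences to total detections. The weighted case is only one possible reading, since the description does not specify how the weights are applied: here each detector's vote counts by its weight and the result is normalized by the total weight. The function name and category labels are hypothetical.

```python
from collections import Counter


def category_probabilities(categories, weights=None):
    """Probability of each code detection category from a list of detector outputs."""
    if not categories:
        raise ValueError("at least one detection is required")
    if weights is None:
        weights = [1.0] * len(categories)  # unweighted: every detector counts once
    if len(weights) != len(categories):
        raise ValueError("one weight per detection is required")

    totals = Counter()
    for category, weight in zip(categories, weights):
        totals[category] += weight
    total_weight = sum(weights)
    return {category: w / total_weight for category, w in totals.items()}


votes = ["malicious"] * 5 + ["non-malicious"] * 2
probs = category_probabilities(votes)
weighted = category_probabilities(["malicious", "non-malicious", "non-malicious"],
                                  weights=[3.0, 1.0, 1.0])
```

With 5 malicious and 2 non-malicious outputs the unweighted probabilities are 5/7 and 2/7, matching the ratio defined above.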

S3013: and generating a comprehensive detection result of the data to be detected according to the probability of each code detection category.

Specifically, the maximum probability of the probabilities is determined, and a comprehensive detection result of the data to be detected is generated based on the code detection category corresponding to the maximum probability.

In one embodiment, the code detection category corresponding to the maximum probability is determined as the comprehensive detection result.

For example, if the malicious code type is 0, the non-malicious code type is 1, and 7 prediction results (i.e., code detection types) of a certain piece of data to be detected are sequentially [1, 0, 1, and 0], the prediction result is finally output as [1], i.e., the non-malicious code.
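Selecting the comprehensive detection result in S3013 can be sketched as follows, using the encoding from the example above (0 for malicious code, 1 for non-malicious code). The list of seven predictions is made up for illustration and is not taken from the application. On a tie, this sketch returns the category that appears first in the list, which is an assumption.

```python
from collections import Counter

MALICIOUS, NON_MALICIOUS = 0, 1  # category encoding used in the example above


def comprehensive_result(predictions):
    """Return the code detection category with the highest probability."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    total = len(predictions)
    probabilities = {category: n / total for category, n in counts.items()}
    return max(probabilities, key=probabilities.get)


# Hypothetical outputs of 7 detectors for one piece of data to be detected.
result = comprehensive_result([1, 0, 1, 1, 0, 1, 0])
```

Four of the seven hypothetical predictions are 1, so the comprehensive result is the non-malicious category.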

Furthermore, the prediction accuracy can be determined according to the code detection category of the data to be detected and the actual code category.

In one embodiment, a verification data set including at least one piece of data to be detected and the corresponding actual code category is obtained, and the total amount of data to be detected in the verification data set is counted. The code detection category of each piece of data to be detected in the verification data set is obtained based on the malicious code detection model. The data to be detected whose code detection category is consistent with the corresponding actual code category is screened out, and the number of screened data is counted. The ratio of the screened number to the total amount of data to be detected in the verification data set is determined as the prediction accuracy of the malicious code detection model.

The verification data set may be obtained by using a principle similar to that of obtaining the training sample data set, which is not described herein again.
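The prediction accuracy on a verification data set can be sketched as follows. The callable `toy_detect` is a hypothetical placeholder for the trained malicious code detection model; it merely flags byte strings containing the byte 0x90 and is not a real detector.

```python
def prediction_accuracy(samples, detect):
    """Share of verification samples whose detected category matches the actual one.

    `samples` is a list of (data, actual_category) pairs and `detect` is any
    callable that returns a code detection category for one piece of data.
    """
    if not samples:
        raise ValueError("the verification data set must not be empty")
    correct = sum(1 for data, actual in samples if detect(data) == actual)
    return correct / len(samples)


def toy_detect(data):
    # Placeholder rule for illustration only.
    return "malicious" if b"\x90" in data else "non-malicious"


verification_set = [
    (b"\x90\x90\xcc", "malicious"),
    (b"hello", "non-malicious"),
    (b"\x90abc", "non-malicious"),
]
accuracy = prediction_accuracy(verification_set, toy_detect)
```

Two of the three samples are classified consistently with their actual category, so the accuracy is 2/3. The same function can score the comprehensive prediction by passing a callable that combines the outputs of several detectors.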

In one embodiment, a verification data set including at least one piece of data to be detected and the corresponding actual code category is obtained, and the total amount of data to be detected in the verification data set is counted. Based on the malicious code detection model and the code detection auxiliary models, the code detection categories of each piece of data to be detected in the verification data set are obtained and counted to determine a comprehensive code detection category. The data to be detected whose comprehensive code detection category is consistent with the corresponding actual code category is screened out, and the number of screened data is counted. The ratio of the screened number to the total amount of data to be detected in the verification data set is determined as the prediction accuracy of the comprehensive prediction.

Fig. 4 is a schematic diagram of an architecture of a malicious code detection system provided in the embodiment of the present application. The malicious code detection system comprises: the device comprises a data acquisition module, a feature extraction module, a feature splicing module, a first model output module, an auxiliary model construction module, a second model output module, a probability statistics module and a comprehensive result output module.

Data acquisition module: used for acquiring a training sample data set, data to be detected, or a verification data set.

Feature extraction module: used for respectively extracting the instruction transfer matrix and the instruction interval matrix according to the instruction positions in the data (training sample data or data to be detected) and the corresponding relation between the instructions and the instruction categories.

Feature splicing module: used for splicing the instruction transfer matrix and the instruction interval matrix to obtain the feature matrix.

First model output module: used for inputting the feature matrix into the malicious code detection model to obtain the output code detection category.

Specifically, the first model output module is used for performing model training or model prediction based on the feature matrix.

Auxiliary model construction module: used for constructing at least one code detection auxiliary model based on at least one of the TF-IDF algorithm, the CNN algorithm, the BOW algorithm, and the 3-gram, 4-gram and 5-gram algorithms.

Second model output module: used for respectively inputting the acquired data into each code detection auxiliary model and obtaining the code detection category output by each code detection auxiliary model.

Probability statistics module: used for respectively determining the probability of each code detection category based on the code detection categories of the data to be detected determined by the second model output module and the code detection category of the data to be detected output by the first model output module.

Comprehensive result output module: used for generating a comprehensive detection result of the data to be detected according to the probability of each code detection category.

In the embodiment of the present application, for the specific implementation steps of each module in the malicious code detection system, reference may be made to step 100 to step 102 and step 300 to step 301, which are not described herein again.

In the embodiment of the application, a malicious code detection model is built based on machine learning, the instruction transfer characteristics and the instruction interval characteristics of data are extracted, malicious code detection is performed on the data based on the instruction transfer characteristics and the instruction interval characteristics of the data by adopting the malicious code detection model, a code detection category is obtained, and the accuracy of code detection is improved through more precise characteristic extraction. Furthermore, malicious code detection can be performed on the data by combining with a code detection auxiliary model constructed based on other principles to obtain a code detection category, so that the code detection category output by the malicious code detection model and the code detection category output by the code detection auxiliary model are combined to obtain a comprehensive code detection result, and the accuracy of malicious code detection is further improved.

Based on the same inventive concept, the embodiment of the present application further provides a device for detecting malicious codes, and because the principle of solving the problem of the device and the equipment is similar to that of a method for detecting malicious codes, the implementation of the device can refer to the implementation of the method, and repeated details are not repeated.

As shown in fig. 5, which is a schematic structural diagram of an apparatus for detecting malicious codes according to an embodiment of the present application, the apparatus includes:

the extracting unit 501 is configured to extract data features of the data to be detected according to a position relationship between instructions in the data to be detected;

the detection unit 502 is configured to input the data characteristics into a pre-trained malicious code detection model, and obtain a code detection result output by the malicious code detection model, where the malicious code detection model is constructed based on machine learning.

In one embodiment, the detection unit 502 is further configured to:

In one embodiment, the extracting unit 501 is configured to:

In one embodiment, the extracting unit 501 is further configured to:

Fig. 6 shows a schematic structural diagram of a computer device 6000. Referring to fig. 6, the computer device 6000 includes a processor 6010 and a memory 6020, and may optionally further include a power supply 6030, a display unit 6040, and an input unit 6050.

The processor 6010 is the control center for the computer device 6000, connects various components using various interfaces and lines, and performs various functions of the computer device 6000 by running or executing software programs and/or data stored in the memory 6020, thereby monitoring the computer device 6000 as a whole.

In the embodiment of the present application, the processor 6010 executes the steps in the above embodiments when calling the computer program stored in the memory 6020.

Optionally, processor 6010 may include one or more processing units; preferably, processor 6010 may integrate an application processor, which mainly handles the operating system, user interfaces, applications, and the like, and a modem processor, which mainly handles wireless communications. It is to be appreciated that the modem processor described above may not be integrated into processor 6010. In some embodiments, the processor and the memory may be implemented on a single chip, or in some embodiments, they may be implemented separately on separate chips.

The memory 6020 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, various applications, and the like; the data storage area may store data created according to the use of the computer device 6000, and the like. In addition, the memory 6020 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.

Computer device 6000 may also include a power supply 6030 (e.g., a battery) for powering the various components. The power supply may be logically connected to processor 6010 via a power management system, so that functions such as managing charging, discharging, and power consumption are performed via the power management system.

The display unit 6040 can be used to display information input by a user or information provided to the user, various menus of the computer apparatus 6000, and the like, and in the embodiment of the present invention, is mainly used to display a display interface of each application in the computer apparatus 6000 and objects such as texts and pictures displayed in the display interface. The display unit 6040 may include a display panel 6041. The Display panel 6041 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.

The input unit 6050 may be used to receive information such as numbers or characters input by a user. The input unit 6050 may include a touch panel 6051 and other input devices 6052. Touch panel 6051, also referred to as a touch screen, may collect touch operations by a user on or near touch panel 6051 (e.g., operations by a user on or near touch panel 6051 using a finger, a stylus, or any other suitable object or attachment).

Specifically, the touch panel 6051 may detect a touch operation by the user, detect signals resulting from the touch operation, convert the signals into touch point coordinates, send the touch point coordinates to the processor 6010, receive a command sent from the processor 6010, and execute the command. In addition, the touch panel 6051 can be implemented by various types such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. Other input devices 6052 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, power on and off keys, etc.), a trackball, a mouse, a joystick, and the like.

Of course, the touch panel 6051 may cover the display panel 6041, and when the touch panel 6051 detects a touch operation thereon or nearby, the touch operation is transmitted to the processor 6010 to determine the type of the touch event, and then the processor 6010 provides a corresponding visual output on the display panel 6041 according to the type of the touch event. Although in fig. 6, touch panel 6051 and display panel 6041 are two separate components to implement the input and output functions of computer device 6000, in some embodiments, touch panel 6051 and display panel 6041 may be integrated to implement the input and output functions of computer device 6000.

Computer device 6000 may also include one or more sensors such as pressure sensors, gravitational acceleration sensors, proximity light sensors, and the like. Of course, the computer device 6000 may also include other components such as a camera, which are not shown in fig. 6 and will not be described in detail since they are not the components used in the embodiments of the present application.

Those skilled in the art will appreciate that fig. 6 is merely an example of a computer device and is not intended to limit the computer device, which may include more or fewer components than those shown, combine some of the components, or use different components.

In an embodiment of the present application, a computer-readable storage medium has a computer program stored thereon, and when the computer program is executed by a processor, the computer device may perform the steps in the above embodiments.

For convenience of description, the above parts are separately described as modules (or units) according to functional division. Of course, the functionality of the various modules (or units) may be implemented in the same one or more pieces of software or hardware when implementing the present application.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A method of malicious code detection, comprising:

2. The method of claim 1, wherein after the inputting the data features into a pre-trained malicious code detection model and obtaining a code detection result output by the malicious code detection model, the method further comprises:

respectively carrying out malicious detection on the data to be detected based on at least one malicious code detection mode to obtain at least one code detection category, wherein the malicious code detection mode is different from the construction principle of the malicious code detection model;

respectively determining the probability of each code detection category based on the at least one code detection category and the code detection categories in the code detection result;

3. The method according to claim 1, wherein the extracting data features of the data to be detected according to the position relationship between the instructions in the data to be detected comprises:

generating an instruction category combination corresponding to every two adjacent instructions based on the instruction categories of the instructions in the data to be detected, wherein each instruction category combination comprises an instruction category of one instruction and an instruction category of a next instruction of the one instruction;

respectively determining the average instruction interval between every two instruction categories based on the instruction categories of the instructions in the data to be detected, wherein the two instruction categories are the same category or different categories;

4. The method according to claim 3, wherein the obtaining the data characteristics of the data to be detected based on the combined occurrence number of the combination of the instruction classes and the instruction average interval between every two instruction classes comprises:

and splicing the instruction transfer matrix and the instruction interval matrix to obtain a feature matrix, wherein the feature matrix is the data feature.

5. The method of any one of claims 1-4, further comprising:

6. An apparatus for malicious code detection, comprising:

7. The apparatus of claim 6, wherein the detection unit is further to:

8. The apparatus of claim 6, wherein the extraction unit is to:

9. The apparatus of claim 8, wherein the extraction unit is to:

10. The apparatus of any of claims 6-9, wherein the extraction unit is further to:

11. A computer device comprising a processor and a memory, the memory storing computer readable instructions that, when executed by the processor, perform the method of any one of claims 1-5.

12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 5.