CN114357443A - Malicious code detection method, equipment and storage medium based on deep learning

Info

Publication number: CN114357443A
Application number: CN202111519261.0A
Authority: CN
Inventors: 于金龙; 卯路宁; 王智民; 王高杰
Original assignee: Beijing 6Cloud Technology Co Ltd; Beijing 6Cloud Information Technology Co Ltd
Current assignee: Beijing 6Cloud Technology Co Ltd; Beijing 6Cloud Information Technology Co Ltd
Priority date: 2021-12-13
Filing date: 2021-12-13
Publication date: 2022-04-15

Abstract

The invention discloses a malicious code detection method, device, and storage medium based on deep learning, and belongs to the technical field of information security. The method comprises: obtaining code to be detected; inputting the code to be detected into a preprocessing module for data processing to obtain feature data; inputting the feature data into a pre-constructed detection model; and outputting a malicious code judgment result from the detection model. The code to be detected is thus examined by the detection model, the malicious code detection result is generated automatically, and the accuracy of malicious code detection is improved.

Description

Malicious code detection method, equipment and storage medium based on deep learning

Technical Field

The invention relates to the technical field of information security, in particular to a malicious code detection method, equipment and a storage medium based on deep learning.

Background

As Web applications built on the browser/server (B/S) architecture have come to dominate the mainstream market, browsers and Web pages have become an important channel for spreading malicious code. Attackers exploit vulnerabilities in website code, third-party applications, browsers, and operating systems to carry out cross-site scripting attacks on websites, inject Web trojans, tamper with web pages, conduct phishing, steal personal information, and so on.

JavaScript (JS) is a dynamic, lightweight scripting language used to add functionality and enhance the user experience. It is a ubiquitous technology that allows client-side scripts to interact with users and makes websites dynamic and interactive, which avoids round trips between the client and the server, provides immediate feedback, and enables rich interfaces. Despite these advantages, JavaScript has been used to carry out various cyber attacks. Attackers use malicious JavaScript code for keystroke logging, browser cookie theft, hacking, website tampering, and Trojan horse delivery. In addition, botnets may be created by using social engineering to entice users into downloading malware.

In the prior art, malicious code is detected by anti-virus software and intrusion detection systems (IDS), which rely in particular on signatures, pattern matching, and heuristics to detect attacks. Signature-based detection maintains a library of known malicious code and compares the signature of a code sample to be detected against the signatures in that library; heuristic rule-based detection relies on professional analysts to extract rules from existing malicious code and checks code samples against the extracted rules. However, these methods have many shortcomings; for example, a 0-day attack or another type of attack that targets a new vulnerability cannot be effectively detected with existing detection rules.

Disclosure of Invention

The invention mainly aims to provide a malicious code detection method, equipment and a storage medium based on deep learning, and aims to solve the problem that malicious codes cannot be effectively detected in the prior art.

In order to achieve the above object, the present invention provides a malicious code detection method based on deep learning, which includes the following steps:

acquiring a code to be detected, and preprocessing the code to be detected to obtain data to be detected after data processing;

inputting the data to be detected into a pre-constructed detection model, and outputting a malicious code judgment result by the detection model, wherein the detection model is constructed on the basis of token-level word vector representations and character-level one-hot encoding.

Optionally, before the step of inputting the data to be detected into a pre-constructed detection model and outputting a malicious code judgment result by the detection model, where the detection model is constructed based on a token-level word vector representation and a character-level one-hot code, the method further includes:

acquiring a labeled data set and an unlabeled data set, and dividing the labeled data set into a labeled training set and a labeled testing set;

and dividing the sample codes in the unlabeled data set and the labeled training set to obtain corresponding sample lemmas, and inputting the sample lemmas into a pre-training model to obtain a word vector representation data set of the sample lemmas.

Optionally, after the step of dividing the code in the unlabeled data set and the code in the labeled training set to obtain corresponding sample lemmas, inputting the sample lemmas into a pre-training model to obtain word vector representation data sets of the sample lemmas, the method further includes:

taking the word vector representation data set as the parameters of the first layer of a deep learning model;

dividing the labeled training set to obtain training lemmas corresponding to the labeled training set;

inputting the training lemmas into the deep learning model to obtain a first word vector representation of the training lemmas;

carrying out one-hot encoding on the characters in the labeled training set to obtain a first one-hot encoded data set;

and inputting the first word vector representation and the first one-hot encoded data set into the deep learning model for supervised training, and calculating the accuracy on the labeled testing set, so as to construct the detection model.

Optionally, the step of acquiring the code to be detected, and preprocessing the code to be detected to obtain data after data processing includes:

acquiring a code to be detected, and performing data processing on the code to be detected to obtain a target code after data processing;

dividing the target code to obtain corresponding target lemmas;

converting the target code into a one-hot encoding format to obtain a second one-hot encoded data set;

and taking the target lemmas and the second one-hot encoded data set as the data to be detected.

Optionally, the step of performing data processing on the code to be detected to obtain a target code after data processing includes:

acquiring numerical data in the code to be detected;

and carrying out standardization processing on the numerical data to obtain the target code.

Optionally, the step of dividing the target code to obtain corresponding target lemmas includes:

acquiring a preset number of special characters as separators, and dividing the target code into first lemmas;

screening first lemmas with the length not less than a preset threshold value in the first lemmas to obtain second lemmas;

and converting the second lemmas to lowercase to obtain the target lemmas.

Optionally, the step of converting the target code into a one-hot encoding format to obtain a second one-hot encoded data set includes:

acquiring a preset number of characters in the target code;

and converting the characters into a one-hot encoding format to obtain the second one-hot encoded data set.

In addition, to achieve the above object, the present invention further provides a malicious code detection apparatus based on deep learning, including:

the acquisition module is used for acquiring a code to be detected, preprocessing the code to be detected and acquiring data to be detected after data processing;

and the detection module is used for inputting the data to be detected into a pre-constructed detection model and outputting a malicious code judgment result by the detection model, wherein the detection model is constructed based on token-level word vector representations and character-level one-hot encoding.

Optionally, the detection module is further configured to:

Optionally, the obtaining module is further configured to:

dividing the target code to obtain a corresponding target word element;

Optionally, the obtaining module is further configured to:

acquiring numerical data in the code to be detected;

Optionally, the obtaining module is further configured to:

acquiring a preset number of characters in the target code;

In addition, to achieve the above object, the present invention further provides a malicious code detection device based on deep learning, the device including: a memory, a processor, and a deep learning based malicious code detection program stored on the memory and executable on the processor, wherein the deep learning based malicious code detection program is configured to implement the steps of the deep learning based malicious code detection method described above.

In addition, in order to achieve the above object, the present invention further provides a storage medium, wherein the storage medium stores thereon a deep learning-based malicious code detection program, and when the deep learning-based malicious code detection program is executed by a processor, the steps of the deep learning-based malicious code detection method are implemented.

According to the malicious code detection method, device and system based on deep learning and the storage medium, the code to be detected is obtained and preprocessed to obtain the data to be detected after data processing, the data to be detected is input into the pre-constructed detection model, and the malicious code judgment result is output by the detection model. The detection model constructed based on deep learning is used for detecting the codes to be detected, the detection model automatically generates malicious code detection results, and the accuracy of malicious code detection is improved.

Drawings

FIG. 1 is a schematic structural diagram of a deep learning-based malicious code detection device of a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a malicious code detection method based on deep learning according to a first embodiment of the present invention;

FIG. 3 is a flowchart illustrating a detailed process of step S10 in the first embodiment of the method for detecting malicious codes based on deep learning according to the present invention;

FIG. 4 is a schematic diagram of a pre-training model generation flow of a third embodiment of the deep learning-based malicious code detection method according to the present invention;

FIG. 5 is a schematic diagram illustrating a relationship between a pre-training model and a detection model in a third embodiment of the deep learning-based malicious code detection method according to the present invention;

FIG. 6 is a schematic diagram of a model structure of an embodiment of a deep learning-based malicious code detection method according to the present invention;

FIG. 7 is a functional module diagram of an embodiment of a malicious code detection method based on deep learning according to the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Referring to fig. 1, fig. 1 is a schematic structural diagram of a malicious code detection device based on deep learning in a hardware operating environment according to an embodiment of the present invention.

As shown in FIG. 1, the deep learning based malicious code detection device may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 enables communication among these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a random access memory (RAM) or a non-volatile memory (NVM), such as a disk memory. Alternatively, the memory 1005 may be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the structure shown in FIG. 1 does not limit the deep learning based malicious code detection device, which may include more or fewer components than shown, combine some components, or arrange the components differently.

As shown in fig. 1, a memory 1005, which is a storage medium, may include therein an operating system, a data storage module, a network communication module, a user interface module, and a deep learning-based malicious code detection program.

In the deep learning based malicious code detection device shown in FIG. 1, the network interface 1004 is mainly used for data communication with other devices, and the user interface 1003 is mainly used for data interaction with a user. The processor 1001 and the memory 1005 are disposed in the device; through the processor 1001, the device invokes the deep learning based malicious code detection program stored in the memory 1005 and executes the deep learning based malicious code detection method provided by the embodiments of the present invention.

An embodiment of the present invention provides a malicious code detection method based on deep learning, and referring to fig. 2, fig. 2 is a schematic flowchart of a first embodiment of a malicious code detection method based on deep learning according to the present invention.

In this embodiment, the malicious code detection method based on deep learning includes:

step S10, acquiring a code to be detected, and performing data processing on the code to be detected to obtain characteristic data after data processing;

step S20, inputting the characteristic data into a pre-constructed detection model;

and step S30, outputting a judgment result of the malicious code by the detection model.

This embodiment provides a deep learning based malicious code detection method for use in a malicious code detection system that detects malicious JavaScript code in browsers and web pages. JavaScript is a lightweight, interpreted or just-in-time compiled programming language with first-class functions; it can implement complex functionality on web pages, for example changing HTML content. Because of its popularity, JavaScript has been used to carry out various cyber attacks. An attacker can use JavaScript to automatically download and install malware on a computer without the user's knowledge; to perform a cross-site scripting (XSS) attack, injecting malicious scripts into legitimate websites and stealing sensitive user information (such as passwords, cookies, and accounts) when users visit those sites; to perform a cross-site request forgery (XSRF) attack, luring users who are authenticated by a Web application into performing unwanted operations; or to insert illegitimate malicious advertisements into web pages. Therefore, in this embodiment, the code to be detected is obtained and processed to obtain feature data, the feature data is input into a pre-constructed detection model, and the detection result for the code to be detected is obtained from the detection model.

The respective steps will be described in detail below:

step S10, acquiring a code to be detected, and preprocessing the code to be detected to obtain data to be detected after data processing;

In one embodiment, the code to be detected is obtained and processed to obtain the corresponding data to be detected. Specifically, when an unknown web page is examined, the JavaScript code embedded in the page is extracted as the code to be detected. This code is first input into a preprocessing module for data processing: the lemmas (tokens) of the code are extracted, the characters in the code are converted into a one-hot encoding format, and the one-hot encoded data together with the tokens serve as the feature data after data processing.

Step S20, inputting the data to be detected into a pre-constructed detection model, and outputting a malicious code judgment result by the detection model, wherein the detection model is constructed based on token-level word vector representations and character-level one-hot encoding.

In one embodiment, the data to be detected produced by the preprocessing module is input into a pre-constructed detection model, which determines whether the data to be detected is malicious code. After the data is input, the first layer of the detection model produces the word vector representation of each token, which is then fed into the second layer of the detection model together with the character-level one-hot encoding. It should be noted that the detection model is constructed to match its input: in this embodiment, the input feature data consists of token-level word vector representations and character-level one-hot encodings, so feature data at the same two levels is used when the model is constructed. Once the detection model has been built, detection only requires loading the stored model and supplying the data to be detected from the preprocessing module as input in order to determine whether that data is malicious code.
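The role of the first layer described above can be illustrated with a minimal sketch. It assumes the pre-trained word vectors are available as a plain lookup table; the table contents, the vector dimension, the zero vector for unknown tokens, and the names `EMBEDDINGS`, `embed_tokens`, and `build_second_layer_input` are all hypothetical and are not taken from the embodiment.

```python
from typing import Dict, List

# Hypothetical pre-trained word vectors; in the described method these would
# come from the pre-training model rather than being written by hand.
EMBED_DIM = 2
EMBEDDINGS: Dict[str, List[float]] = {
    "eval": [0.9, 0.1],
    "unescape": [0.8, 0.3],
    "var": [0.1, 0.7],
}


def embed_tokens(tokens: List[str]) -> List[List[float]]:
    """First layer: look up the word vector of each token.

    Assumption: tokens missing from the table map to an all-zero vector.
    """
    zero = [0.0] * EMBED_DIM
    return [list(EMBEDDINGS.get(tok, zero)) for tok in tokens]


def build_second_layer_input(tokens: List[str],
                             char_one_hot: List[List[int]]) -> dict:
    """Bundle the two feature levels that are passed on to the second layer."""
    return {
        "token_vectors": embed_tokens(tokens),  # token-level representation
        "char_one_hot": char_one_hot,           # character-level representation
    }


features = build_second_layer_input(["var", "eval", "foo"], [[1, 0], [0, 1]])
print(features["token_vectors"])
```

How the second layer combines the two inputs is not specified in this passage, so the sketch stops at assembling them.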

In this embodiment, the data to be detected is obtained by preprocessing the code to be detected; the detection model automatically extracts token features to obtain the word vector representation of each token; the token word vectors and the character one-hot encodings are then processed by the detection model, whose prediction yields the malicious code detection result. Automatic detection of malicious code is thereby achieved, and detection accuracy is improved.

Further, based on the first embodiment of the malicious code detection method based on deep learning, the second embodiment of the malicious code detection method based on deep learning is provided.

Referring to FIG. 3, FIG. 3 is a detailed flowchart of step S10 in the first embodiment of the present invention. The second embodiment of the malicious code detection method based on deep learning differs from the first embodiment in that the step of obtaining the code to be detected and preprocessing it to obtain the data to be detected includes:

step S11, acquiring a code to be detected, and performing data processing on the code to be detected to obtain a target code after data processing;

step S12, dividing the target code to obtain corresponding target lemma;

step S13, converting the target code into a one-hot encoding format to obtain a second one-hot encoded data set;

step S14, taking the target lemmas and the second one-hot encoded data set as the data to be detected.

In this embodiment, the code to be detected is obtained and preprocessed, where the preprocessing includes data processing, division into target lemmas, and extraction of character-level one-hot encodings. Feature data of different dimensions is thereby extracted, which addresses the problem that, when the text to be detected is short, a detection model relying only on word vector representations produces inaccurate results, and improves the accuracy of detection with the detection model.

The following is a detailed description of each step:

Step S11, acquiring a code to be detected, and performing data processing on the code to be detected to obtain a target code after data processing;

In an embodiment, data processing is performed on the code to be detected to obtain the target code. The data processing includes, for example, screening the code to be detected and deleting garbled characters from it. It can be understood that when a script to be detected is fetched from a website, the fetched data may contain errors; for example, some emoticons and Chinese characters may turn into garbled text in JavaScript. The fetched data is therefore processed to screen out useless data, which reduces the subsequent detection workload.

Step S12, dividing the target code to obtain corresponding target lemma;

In an embodiment, the preprocessed target code is divided as a character string to obtain the target lemmas. It can be understood that, to obtain features from text, the text must first be split. The features here are the lemmas (tokens) of the text: a token is an arbitrary combination of characters, that is, a contiguous character segment, and what tokens have in common is that each one, whether a word or a phrase, corresponds to exactly one semantic vector (embedding). The target code may be divided by splitting the string wherever it matches a given regular expression, or by using a function such as substr(start, length), which extracts a substring of the given length beginning at position start, and so on.

Step S13, converting the target code into a one-hot encoding format to obtain a second one-hot encoded data set;

In one embodiment, characters in the target code are converted into a one-hot encoding format to obtain a second one-hot encoded data set. The second one-hot encoded data set differs from the first in that the second is obtained from the code to be detected, whereas the first is obtained from the labeled training set. One-hot encoding, also known as one-bit-effective encoding, represents a categorical variable as a binary vector in which exactly one bit is 1 and all others are 0. It can be understood that before features to be detected are input into a machine learning algorithm, the data usually has to be digitized and then converted into one-hot codes; the conversion itself can be carried out with existing techniques and is not described further here.

Step S14, taking the target lemmas and the second one-hot encoded data set as the data to be detected.

In one embodiment, the target lemmas and the one-hot codes are used as the feature data. Preprocessing thus yields data from the target code at two levels: character-level one-hot encodings and lemmas. The word vector representation of each lemma is obtained through the pre-constructed detection model, and detection is then performed on both the character-level one-hot encodings and the token-level word vector representations; the two levels of data complement each other and yield a more accurate malicious code detection result.

Further, in an embodiment, the step of performing data processing on the code to be detected to obtain a target code after data processing includes:

step S111, acquiring numerical data in the code to be detected;

step S112, performing normalization processing on the numerical data to obtain the target code.

In an embodiment, the data processing of the code to be detected further includes acquiring the numerical data in the code and normalizing it. Specifically, to better handle numerical data in the code, such as random values, IP addresses, random domain names (which in many cases contain digits), dates, and version numbers, the numerical values are normalized by replacing the digits with the '+' sign. This prevents identical (or nearly identical) code that differs only in such values from affecting subsequent processing, thereby improving the detection result.
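The normalization of numerical data can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it assumes that each maximal run of digits is replaced by a single '+', which the text does not state explicitly, and the function name `normalize_numbers` is hypothetical.

```python
import re

# Assumption: every maximal run of digits collapses to one placeholder.
_DIGIT_RUN = re.compile(r"\d+")


def normalize_numbers(code: str, placeholder: str = "+") -> str:
    """Replace each run of digits in `code` with `placeholder`."""
    return _DIGIT_RUN.sub(placeholder, code)


a = normalize_numbers('var ip = "192.168.0.1"; var v = 42;')
b = normalize_numbers('var ip = "10.0.0.7"; var v = 7;')
print(a)       # var ip = "+.+.+.+"; var v = +;
print(a == b)  # True
```

Under this assumption, two scripts that differ only in their IP address and version value normalize to the same text, which is the effect the paragraph describes.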

Further, in an embodiment, the step of dividing the target code to obtain corresponding target lemmas includes:

step S121, acquiring a preset number of special characters as separators, and dividing the target code into first lemmas;

In one embodiment, a preset number of special characters are obtained as separators to divide the target code. Because the target code is one whole text, examining it as a whole is impractical, and features must be extracted from the text to facilitate subsequent detection, so the text is divided first. Specifically, JavaScript code contains punctuation in addition to character strings, and a string must begin and end with a double quotation mark (") or a single quotation mark ('), so single and double quotation marks can serve as special characters for splitting. Other special characters can of course also be used, for example: the space, the bracket (}), the semicolon (;), the colon (:), the equals sign (=), the backslash (\\), and backslash-escaped special characters such as the carriage return (\r). The number of special characters can be set according to the actual situation.

Step S122, screening first lemmas with the length not less than a preset threshold value in the first lemmas to obtain second lemmas;

In an embodiment, the first lemmas obtained from the division are filtered, and those whose length is not less than a preset threshold are retained as the second lemmas. Because a single character carries no meaning, in an embodiment only tokens with a length of at least 2 are retained; alternatively, only tokens with a length of at least 3 may be retained. The preset threshold can be set according to actual conditions, but if it is set too large, missed detections are likely.

And S123, converting the second lemma into lower case to obtain a target lemma.

In one embodiment, the filtered second lemmas are converted to lowercase. Because JavaScript code is case-sensitive, the tokens are converted to lowercase to unify their format, which facilitates subsequent processing.
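Steps S121 to S123 can be sketched together as a small tokenizer. The separator set below (whitespace, quotation marks, brackets, semicolon, colon, equals sign, backslash) is one possible choice based on the examples given above, and the function name `tokenize` and the default threshold of 2 are assumptions made for illustration.

```python
import re
from typing import List

# Hypothetical separator set: whitespace (including \r and \n), quotation
# marks, brackets, ';', ':', '=', and the backslash.
_SEPARATORS = re.compile(r"""[\s"'(){}\[\];:=\\]+""")


def tokenize(code: str, min_len: int = 2) -> List[str]:
    """Split code into lemmas, drop short ones, and lowercase the rest."""
    first = _SEPARATORS.split(code)                    # step S121
    second = [t for t in first if len(t) >= min_len]   # step S122
    return [t.lower() for t in second]                 # step S123


print(tokenize("var a = 'Hello'; eval(a);"))  # ['var', 'hello', 'eval']
```

In this example the single-character identifier `a` is discarded by the length filter, and `Hello` is lowercased.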

Further, in an embodiment, the step of converting the target lemma into a one-hot encoding format to obtain a second one-hot encoded data set includes:

step S131, acquiring the preset number of characters;

step S132, converting the characters into a one-hot coding format to obtain a second one-hot coding data set.

In one embodiment, a preset number of characters is obtained and then converted into a one-hot encoding format to obtain the second one-hot encoded data set. JavaScript can exist as standalone .js files or be embedded in an HTML page, and embedded scripts are usually short. When the text is short or there is little text data, only a few tokens may be produced by the division, so the one-hot encoding of the characters is extracted as a supplement to improve prediction. If the script to be detected is short, the preset number of characters may cover all of its characters. Because the conversion is performed at the character level, converting every character of a text into a one-hot code would produce a huge amount of data and occupy a large amount of memory during computation, so only part of the characters in the lemmas is extracted. The specific preset number can be determined from experimental data; according to the researchers' tests, the first 1000 characters can be selected for conversion. Even if the characters were limited to the 26 English letters, 1000 characters would already yield a very large amount of data, so the data volume must be taken into account when the preset number is determined. Accordingly, the preset number of characters is obtained first, and these characters are then converted into the one-hot encoding format to obtain the second one-hot encoded data set. It should be noted that the preset number of characters is taken from top to bottom across all lemmas: assuming the preset number is n, the first n characters of the lemmas are obtained.
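The character-level conversion can be sketched as below. Several details are assumptions rather than part of the embodiment: the alphabet is taken to be Python's printable ASCII set, characters outside it become all-zero rows, short inputs are padded with all-zero rows so that every sample has the same shape, and the input is passed as a single string. The names `ALPHABET`, `CHAR_INDEX`, and `one_hot_chars` are hypothetical.

```python
import string
from typing import List

# Assumed alphabet: the 100 printable ASCII characters.
ALPHABET = string.printable
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_chars(text: str, n: int = 1000) -> List[List[int]]:
    """One-hot encode the first n characters of `text` as an n x |ALPHABET| matrix."""
    rows = []
    for ch in text[:n]:
        row = [0] * len(ALPHABET)
        idx = CHAR_INDEX.get(ch)
        if idx is not None:      # unknown characters stay all-zero (assumption)
            row[idx] = 1
        rows.append(row)
    # Pad short inputs with all-zero rows (assumption) to keep a fixed shape.
    rows.extend([0] * len(ALPHABET) for _ in range(n - len(rows)))
    return rows


matrix = one_hot_chars("ab", n=4)
print(len(matrix), len(matrix[0]))
```

With n = 1000 and this 100-character alphabet, each sample already holds 100,000 entries, which illustrates why the preset number has to be chosen with the data volume in mind.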

In this embodiment, the code to be detected is obtained and processed: the numerical data in the code is acquired and normalized to obtain the target code; a preset number of special characters are used as separators to divide the target code into first lemmas; the first lemmas whose length is not less than a preset threshold are retained as second lemmas; the second lemmas are converted to lowercase to obtain the target lemmas; and a preset number of characters is obtained and converted into a one-hot encoding format. Processing the acquired code in this way improves data quality, and feeding the detection model two levels of data, character-level one-hot encodings and lemmas, avoids the problem of too few features when the code to be detected is short and improves detection accuracy.

Further, referring to fig. 4, fig. 4 is a schematic diagram of a generation flow of a pre-training model of a third embodiment of the malicious code detection method based on deep learning of the present invention, and proposes the third embodiment of the malicious code detection method based on deep learning of the present invention.

The third embodiment of the malicious code detection method based on deep learning of the present invention is different from the first and second embodiments in that before the step of obtaining the code to be detected, inputting the code to be detected into a pre-constructed preprocessing module, and obtaining the feature data after data processing, the method further comprises:

step S40, acquiring a labeled data set and an unlabeled data set, and dividing the labeled data set into a labeled training set and a labeled testing set;

and step S50, dividing the sample codes in the unlabeled data set and the labeled training set to obtain corresponding sample lemmas, and inputting the sample lemmas into a pre-training model to obtain word vector representation data sets of the sample lemmas.

In this embodiment, some samples are downloaded from websites and are clearly known to be malicious. Some normal code is also collected; for example, a web page obtained from Baidu is clearly known to be normal. For code obtained from other, infrequently visited web pages, however, it is not known whether the pages contain malicious code, and this portion of the data, for which the presence of malicious code is unknown, constitutes the unlabeled data set. The labeled samples are divided into a training set and a testing set; a model is trained on the training set, and its accuracy is then calculated on the testing set. Obtaining the word vectors of all lemmas is equivalent to building a dictionary. The existing malicious code data set thus comprises a labeled data set and an unlabeled data set; the labeled data set is divided into a labeled training set and a labeled testing set, then the unlabeled data set and part of the labeled training set are input into an initial pre-training model, and training yields the pre-training model, which is used to construct the word vector corresponding to each lemma.

The respective steps will be described in detail below:

step S40, acquiring a malicious code data set, including a labeled data set and a non-labeled data set, and dividing the labeled data set into a labeled training set and a labeled testing set;

in one embodiment, a malicious code data set is obtained, which includes a labeled data set and an unlabeled data set, and the labeled data set is further divided into a labeled training set and a labeled testing set. The malicious code data set is a collection of code samples with and without malicious code; because malicious code is detected through machine learning, malicious code samples need to be obtained for training. Generally, there are several ways to obtain malicious code samples. (1) User-side sampling, which is the main acquisition method of most antivirus software vendors: end users of the antivirus software upload malicious code samples to the vendor. This method has good real-time performance, and sample data can be acquired in cooperation with a security vendor. (2) Public network databases such as VirusBulletin, Open mail, and VX Heavens, which share such data. (3) Other technical means, such as capturing data with honeypots (e.g., Nepenthes honeypots) or crawler tools. Here, the label is what we want to predict, i.e., the y variable in simple linear regression, and the feature is the input variable, i.e., the x variable in simple linear regression. A labeled sample therefore contains both the features and the label, whereas an unlabeled sample contains the features but no label. In order to evaluate the effect of the model, the labeled data set is further divided into a labeled training set and a labeled testing set. Specifically, the division depends on whether the hold-out method or cross-validation is subsequently used; if the hold-out method is used, the data set can be statically divided into a training set and a testing set according to a certain proportion.
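The static division of the labeled data set described above can be sketched as follows. This is only an illustrative sketch: the function name `holdout_split`, the 20% test proportion, the per-class (stratified) shuffling, and the file names are assumptions and are not specified in this embodiment.

```python
import random

def holdout_split(samples, labels, test_ratio=0.2, seed=0):
    """Split labeled samples into training and testing sets, class by class."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        # Keep at least one sample of every class in the testing set.
        n_test = max(1, round(len(group) * test_ratio))
        test += [(s, label) for s in group[:n_test]]
        train += [(s, label) for s in group[n_test:]]
    return train, test

# Hypothetical labeled data: 1 = malicious, 0 = normal.
samples = [f"script_{i}.js" for i in range(10)]
labels = [1] * 5 + [0] * 5
train_set, test_set = holdout_split(samples, labels)
```

Splitting per class keeps both malicious and normal samples in the testing set, so the accuracy computed later is not based on a single class.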

In one embodiment, the code in the unlabeled data set and the labeled training set is divided to obtain sample lemmas (tokens), and the sample lemmas are input into the pre-training model to obtain the word vector representation data set of the sample lemmas. Specifically, a word2vec or fastText model is applied to the labeled training set and the unlabeled data set to obtain the word vector corresponding to each token. The word vectors corresponding to the tokens act as a dictionary, in which the tokens of the code to be detected later look up their corresponding word vectors. In this embodiment, both the unlabeled data set and the labeled data set are fed into the pre-training model to generate word vectors, so that a large amount of vector data is obtained, which makes word vector extraction for the code to be detected convenient. The pre-training model is constructed in the model training stage.
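The division into tokens and the dictionary lookup can be illustrated with a minimal sketch. The token pattern `TOKEN_RE`, the `<UNK>` fallback entry, and the two-dimensional vectors are assumptions made for illustration; in the embodiment the vectors would come from a trained word2vec or fastText model rather than being written by hand.

```python
import re

# Assumed token pattern: identifiers, numbers, string literals,
# then any other single non-space symbol.
TOKEN_RE = re.compile(r"""[A-Za-z_$][\w$]*|\d+(?:\.\d+)?|"[^"]*"|'[^']*'|\S""")

def tokenize(code):
    """Divide JavaScript source text into a list of tokens."""
    return TOKEN_RE.findall(code)

# Stand-in for the dictionary a pre-training model would produce.
word_vectors = {
    "eval": [0.9, 0.1],
    "(": [0.0, 0.2],
    ")": [0.0, 0.3],
    "<UNK>": [0.0, 0.0],  # fallback for tokens missing from the dictionary
}

def lookup(tokens, table):
    """Look up each token's word vector, using the fallback when absent."""
    return [table.get(tok, table["<UNK>"]) for tok in tokens]

tokens = tokenize("eval(payload)")
vectors = lookup(tokens, word_vectors)
```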

Further, in an embodiment, after the pre-training model is obtained by training, the method further includes:

step S51, taking the word vector representation data set as the parameters of the first layer deep learning model;

step S52, dividing the labeled training set to obtain training lemmas corresponding to the labeled training set;

step S53, inputting the training lemmas into the deep learning model to obtain a first word vector representation of the training lemmas;

step S54, performing one-hot encoding conversion on the characters in the labeled training set to obtain a first one-hot encoded data set;

and step S55, inputting the first word vector representation and the first one-hot encoded data set into the deep learning model for supervised training, and calculating the accuracy through the labeled testing set to construct a detection model.

This embodiment further embeds the word vector representation data set into a deep learning model, divides the labeled training set into lemmas, obtains the word vector representation of the training lemmas from the deep learning model, performs one-hot encoding conversion on the characters in the labeled training set to obtain a one-hot encoded data set, inputs the one-hot encoded data set and the lemma word vector representation into a deep learning network, performs supervised training with the labeled training set, and constructs a detection model.

The respective steps will be described in detail below:

in one embodiment, the word vector representation data set obtained from the pre-training model is used as the parameters of the first layer of the deep learning model, i.e., it is embedded in the deep learning model. Once it is embedded, when a lemma is input, the deep learning model can find the word vector corresponding to that lemma for subsequent calculation.
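One way to picture the word vectors acting as first-layer parameters is an embedding matrix indexed by token id. The sketch below is an assumption-laden illustration: the reserved padding and unknown rows, the fixed sequence length `max_len`, and all function names are hypothetical and not taken from the embodiment.

```python
PAD_ID, UNK_ID = 0, 1  # assumed reserved rows for padding and unknown tokens

def build_embedding(word_vectors, dim):
    """Return (token -> row index, embedding matrix) from pre-trained vectors."""
    token_to_id = {}
    matrix = [[0.0] * dim, [0.0] * dim]  # rows 0 and 1: padding, unknown
    for token, vector in word_vectors.items():
        token_to_id[token] = len(matrix)
        matrix.append(list(vector))
    return token_to_id, matrix

def embed(tokens, token_to_id, matrix, max_len):
    """Map tokens to ids, truncate or pad to max_len, and fetch matrix rows."""
    ids = [token_to_id.get(t, UNK_ID) for t in tokens][:max_len]
    ids += [PAD_ID] * (max_len - len(ids))
    return [matrix[i] for i in ids]

# Hypothetical pre-trained vectors for two tokens.
pretrained = {"document": [0.2, 0.7], "write": [0.5, 0.1]}
token_to_id, matrix = build_embedding(pretrained, dim=2)
embedded = embed(["document", ".", "write"], token_to_id, matrix, max_len=4)
```

In a framework such as Keras, the same matrix would typically be supplied as the initial weights of an embedding layer.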

in an embodiment, the labeled training set is divided to obtain the training lemmas (tokens) corresponding to the labeled training set. Because the detection model to be constructed is based on character-level one-hot encoding and token-level word vectors, the data used during construction needs to be in corresponding form; the data in the labeled training set is therefore divided to obtain the training lemmas.

in one embodiment, the first word vector representation of the training lemmas is obtained from the deep learning model. In step S51, the word vector representation data set was already used as the parameters of the first layer of the deep learning model, so the training lemmas are input directly into the deep learning model, and the word vectors corresponding to the training lemmas are obtained inside the model. Subsequent calculation is performed with the word vectors corresponding to the tokens.

in one embodiment, the characters in the labeled training set are subjected to one-hot conversion to obtain a first one-hot encoded data set in binary one-hot encoded format. Because the data input for detection is in one-hot encoded format, the model needs to learn during construction how to distinguish malicious code in that format. Therefore, the training set data is converted to one-hot encoding.
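Character-level one-hot conversion can be sketched as below. The character set (`string.printable`), the fixed length `max_chars`, and the choice of all-zero rows for unknown characters and padding are assumptions for illustration; the embodiment does not fix these details.

```python
import string

ALPHABET = string.printable  # assumed character set: printable ASCII
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def one_hot_chars(code, max_chars):
    """Encode the first max_chars characters as binary one-hot rows.

    Characters outside ALPHABET and padding positions become all-zero rows.
    """
    rows = []
    for ch in code[:max_chars]:
        row = [0] * len(ALPHABET)
        idx = CHAR_TO_INDEX.get(ch)
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    rows += [[0] * len(ALPHABET) for _ in range(max_chars - len(rows))]
    return rows

encoded = one_hot_chars("a=1", max_chars=5)
```

Each script is thus turned into a `max_chars` by `len(ALPHABET)` binary matrix, which gives every sample the same input shape.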

In one embodiment, the first word vector representation obtained from the first layer of the model and the first one-hot encoded data set are input into a deep learning network, supervised training is performed with the labeled training set, and accuracy is calculated with the labeled testing set to obtain the detection model. Specifically, in the training stage, the word vectors and the one-hot encodings of the labeled training set are input into the deep learning model to construct the detection model. In the testing stage, the labeled testing set is treated as code to be detected in order to calculate the detection accuracy of the model. By extracting character-level one-hot encoding and token-level word vector representation at the same time, the deep learning model can learn features based on the combination of the two levels, which improves the performance of the model. The deep learning framework employed in this embodiment is Keras; other deep learning frameworks such as TensorFlow or PyTorch may be employed instead. A deep learning framework is an interface, library, or tool that allows a deep learning model to be constructed more easily and quickly without deep knowledge of the details of the underlying algorithms. It defines the model using a set of pre-built and optimized components and provides a clear and concise way to implement the model. Common deep learning models mainly include the Fully Connected (FC) network structure, the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN), and the like.
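The accuracy calculation of the testing stage can be sketched as follows. The `toy_predict` function is a hypothetical placeholder standing in for the trained detection model, and the four test samples are invented for illustration only.

```python
def accuracy(predict, test_set):
    """Fraction of labeled test samples whose predicted label is correct."""
    if not test_set:
        raise ValueError("test set is empty")
    correct = sum(1 for code, label in test_set if predict(code) == label)
    return correct / len(test_set)

def toy_predict(code):
    # Placeholder for the detection model: flags any script that calls eval.
    return 1 if "eval(" in code else 0

# Hypothetical labeled testing set: 1 = malicious, 0 = normal.
test_set = [
    ("eval(unescape(x))", 1),
    ("console.log('hi')", 0),
    ("document.write(s)", 1),
    ("eval(a)", 1),
]
score = accuracy(toy_predict, test_set)
```

Here the placeholder misses the third sample, so three of the four predictions match their labels.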

Fig. 5 is a diagram of the relationship between the pre-training model and the detection model according to the third embodiment of the malicious code detection method based on deep learning of the present invention; it illustrates the whole model generation process, which includes two stages. In the first stage, the pre-training model is trained using the unlabeled data set and part of the labeled training set to obtain the word vector representation (embedding) of each token. In the second stage, a deep learning model is constructed: the embedding obtained in the first stage is used as the parameters of the first layer for token input, followed by several layers of CNN and LSTM-RNN neural network units. The input of the deep neural network is composed of character-level one-hot encoding and token-level word vector representation. The labeled training set is then used for supervised training, and the resulting detection model is used to detect malicious JavaScript code.

Referring to fig. 6, fig. 6 is a schematic diagram of the model structure of an embodiment of malicious code detection based on deep learning, which illustrates the malicious code detection method of the present invention. The structure converts the input JavaScript code into a character-level one-hot encoding and a token-level word vector representation (the same as the token embedding in the diagram). The character-level one-hot encoding and the token-level word vector representation are each passed through a convolution layer, a pooling layer, and a Dropout layer; the two outputs are then concatenated and input into an LSTM layer, and finally the prediction result is output.

The convolution layer is used for feature extraction. Max pooling, which selects the maximum value within a pooling window, is one of the most common downsampling operations in CNN models for NLP. For the feature values extracted by a given filter, only the value with the highest score is retained by the pooling layer and all other feature values are discarded; that is, only the strongest feature is kept and weaker features are dropped, which reduces dimensionality. Max pooling can reduce the number of model parameters and helps mitigate overfitting. Global max pooling sets the pool size equal to the input size, so that the maximum over the entire input is taken as the output value. The Dropout layer temporarily drops neural network units from the network with a certain probability during training of the deep learning network, which prevents overfitting and improves the stability and robustness of the model. Bidirectional Long Short-Term Memory (BiLSTM) is a type of recurrent neural network (RNN) for sequential data. It should be noted that convolution, pooling, Dropout, and LSTM are each only one part of the neural network structure, and similar neural network layers, such as RNN or GRU, may be substituted.
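The pooling and Dropout operations described above can be illustrated on a one-dimensional feature map. This is a plain-Python sketch rather than the Keras layers used in the embodiment; the non-overlapping windows and the rescaling of surviving units ("inverted" dropout, a common convention) are assumptions not stated in the text.

```python
import random

def max_pool_1d(values, pool_size):
    """Keep the largest value in each non-overlapping window."""
    last_start = len(values) - pool_size + 1
    return [max(values[i:i + pool_size]) for i in range(0, last_start, pool_size)]

def global_max_pool(values):
    """Pool size equals the input size: one maximum for the whole input."""
    return max(values)

def dropout(values, rate, rng, training=True):
    """Zero each unit with probability `rate` during training, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

feature_map = [0.1, 0.8, 0.3, 0.5, 0.9, 0.2]     # hypothetical filter output
pooled = max_pool_1d(feature_map, pool_size=2)   # [0.8, 0.5, 0.9]
overall = global_max_pool(feature_map)           # 0.9
dropped = dropout(pooled, rate=0.5, rng=random.Random(0))
```

Outside training (`training=False`) the dropout function returns its input unchanged, which matches the description that units are discarded only during the training process.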

In this embodiment, a detection model is constructed by obtaining a malicious code data set including a labeled data set and an unlabeled data set, dividing the labeled data set into a labeled training set and a labeled testing set, inputting the unlabeled data set and part of the labeled training set into a pre-training model, obtaining the word vector representation of the lemmas from the pre-training model, performing one-hot encoding conversion on the characters in the labeled training set to obtain a one-hot encoded data set, inputting the word vector representation and the one-hot encoded data set into a deep learning network, performing supervised training with the labeled training set, and calculating accuracy with the labeled testing set. The constructed detection model achieves three beneficial effects: (1) The unlabeled malicious code data set is put into the pre-training model, which enhances the performance of the pre-training model. (2) Character-level one-hot encoding and token-level word vector representation are extracted at the same time, so that the deep learning model can learn features based on the combination of the two levels, which improves the performance of the model. (3) The deep learning model automatically extracts features through character-level one-hot encoding and token-level word vectors, without manually designing features or extracting semantic features of the code through abstract syntax trees and the like.

Referring to fig. 7, fig. 7 is a functional module schematic diagram of an embodiment of the malicious code detection device based on deep learning according to the present invention. The malicious code detection device of the invention comprises:

the acquisition module 10 is configured to acquire a code to be detected, and preprocess the code to be detected to obtain data to be detected after data processing;

the detection module 20 is configured to input the data to be detected into a pre-constructed detection model, and output a malicious code judgment result by the detection model, where the detection model is constructed based on a token-level word vector representation and a character-level one-hot encoding.

Optionally, the detection module is further configured to:

Optionally, the obtaining module is further configured to:

dividing the target code to obtain a corresponding target word element;

Optionally, the obtaining module is further configured to:

acquiring numerical data in the code to be detected;

Optionally, the obtaining module is further configured to:

acquiring a preset number of characters in the target code;

In addition, the embodiment of the invention also provides a storage medium. The storage medium of the present invention stores a malicious code detection program, and the malicious code detection program implements the steps of the malicious code detection method as described above when executed by a processor.

The method implemented when the malicious code detection program running on the processor is executed may refer to each embodiment of the malicious code detection method of the present invention, and details are not described here.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A malicious code detection method based on deep learning is characterized by comprising the following steps:

2. The deep learning-based malicious code detection method according to claim 1, wherein the data to be detected is input into a pre-constructed detection model, and a malicious code judgment result is output by the detection model, and before the detection model is constructed based on the token-level word vector representation and the character-level one-hot encoding, the method further comprises:

3. The deep learning-based malicious code detection method according to claim 2, wherein after the step of dividing the code in the unlabeled data set and the labeled training set to obtain corresponding sample lemmas, inputting the sample lemmas into a pre-training model to obtain word vector representation data sets of the sample lemmas, the method further comprises:

4. The deep learning-based malicious code detection method according to claim 1, wherein the step of obtaining the code to be detected, and preprocessing the code to be detected to obtain the data to be detected after data processing comprises:

dividing the target code to obtain a corresponding target word element;

5. The malicious code detection method based on deep learning of claim 4, wherein the step of performing data processing on the code to be detected to obtain a target code after data processing comprises:

acquiring numerical data in the code to be detected;

6. The deep learning-based malicious code detection method according to claim 4, wherein the step of dividing the target code to obtain corresponding target lemmas comprises:

7. The deep learning-based malicious code detection method according to claim 4, wherein the step of converting the target code into a one-hot encoding format to obtain a second one-hot encoded data set comprises:

acquiring a preset number of characters in the target code;

8. A malicious code detection device based on deep learning, characterized in that the device comprises: a memory, a processor, and a deep learning based malicious code detection program stored on the memory and executable on the processor, the deep learning based malicious code detection program configured to implement the steps of the deep learning based malicious code detection method according to any one of claims 1 to 7.

9. A storage medium having stored thereon a deep learning based malicious code detection program, which when executed by a processor implements the steps of the deep learning based malicious code detection method according to any one of claims 1 to 7.