CN113987496A

CN113987496A - Malicious attack detection method and device, electronic equipment and readable storage medium

Info

Publication number: CN113987496A
Application number: CN202111300573.2A
Authority: CN
Inventors: 吕晋
Original assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Current assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Priority date: 2021-11-04
Filing date: 2021-11-04
Publication date: 2022-01-28

Abstract

The application belongs to the technical field of network security and discloses a malicious attack detection method, a malicious attack detection device, electronic equipment and a readable storage medium. The method comprises: extracting character strings from a target script file to be detected to obtain at least one target character string; and acquiring a malicious attack detection result of the target script file based on the at least one target character string and a pre-trained malicious attack detection model, wherein the malicious attack detection model is constructed based on a convolutional neural network and a gated recurrent unit (GRU) and is used for detecting whether the target script file is a malicious script file. The time cost and labor cost of malicious attack detection are thereby reduced, and the accuracy of malicious attack detection is improved.

Description

Malicious attack detection method and device, electronic equipment and readable storage medium

Technical Field

The present application relates to the field of network security technologies, and in particular, to a malicious attack detection method, apparatus, electronic device, and readable storage medium.

Background

In practical applications, an attacker usually uses a browser or a client to execute the malicious commands of a malicious attack script that has been uploaded to a server, thereby gaining malicious remote control over the server, for example illegally escalating access privileges, browsing the file directory, tampering with arbitrary files, executing malicious commands, and downloading confidential data. Therefore, malicious attack detection on script files is often required to improve network security.

In the prior art, a rule matching detection mode is usually adopted to detect malicious attacks on a script file.

However, in this way, detection rules need to be constructed manually according to expert experience, which consumes a lot of labor cost and time cost, and it is difficult to detect unknown malicious script files, so that the accuracy of malicious attack detection is low.

Therefore, when detecting malicious attacks, how to reduce the cost and improve the accuracy of detecting the malicious attacks is a technical problem to be solved.

Disclosure of Invention

The embodiment of the application aims to provide a malicious attack detection method, a malicious attack detection device, electronic equipment and a readable storage medium, which are used for reducing the cost and improving the accuracy of malicious attack detection when malicious attack detection is carried out.

In one aspect, a method for malicious attack detection is provided, including:

extracting character strings of a target script file to be detected to obtain at least one target character string;

and acquiring a malicious attack detection result of the target script file based on the at least one target character string and a pre-trained malicious attack detection model, wherein the malicious attack detection model is constructed based on a convolutional neural network and a gated recurrent unit (GRU) and is used for detecting whether the target script file is a malicious script file.

In the implementation process, the consumed time cost and the labor cost are reduced, and the accuracy of malicious script attack detection is improved.

In one embodiment, extracting a character string of a target script file to be detected to obtain at least one target character string includes:

extracting character strings from the target script file to obtain at least one character string;

obtaining a sample character string set, wherein the sample character string set is generated based on sample character strings in non-malicious samples and malicious samples;

screening out character strings contained in the sample character string set from at least one character string;

and determining the screened character strings as target character strings.

In the implementation process, the target character strings which are valuable for detecting the malicious attacks are screened out, so that the consumed storage resources are reduced, and the accuracy of subsequent malicious attack detection is improved.
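The screening step above can be illustrated with a minimal Python sketch. The function name `select_target_strings` and the toy sample character string set are assumptions made for illustration only; they do not come from the application.

```python
def select_target_strings(extracted, sample_string_set):
    """Keep only the extracted strings that occur in the sample string set.

    Order and repetitions are preserved so that the result can still be
    treated as a sequence by a downstream model.
    """
    return [s for s in extracted if s in sample_string_set]


# Hypothetical sample string set and extracted strings.
sample_string_set = {"eval", "gzinflate", "echo"}
extracted = ["eval", "my_custom_fn", "gzinflate", "eval"]

targets = select_target_strings(extracted, sample_string_set)
print(targets)
```

Strings such as `my_custom_fn` that are absent from the sample character string set are simply dropped, which is how obfuscating custom names are kept out of the model input.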

In one embodiment, extracting a character string from a target script file to obtain at least one character string includes:

acquiring a scripting language type of a target script file;

judging whether the script language type is a non-specified type, if so, extracting at least one character string in the target script file by adopting a regular expression;

otherwise, analyzing the target script file to obtain a byte code file corresponding to the target script file, and extracting character strings from the byte code file by adopting a regular expression to obtain at least one character string.

In the implementation process, different character string extraction modes can be adopted for different script files, so that the application range of malicious attack detection is widened.

In one embodiment, the malicious attack detection model includes a word vector embedding module, a convolutional neural network model, a gated recurrent unit model, and a fully connected layer, and obtaining a malicious attack detection result based on at least one target character string and a pre-trained malicious attack detection model includes:

obtaining an index corresponding to at least one target character string based on a first corresponding relation between preset sample character strings and the indexes;

obtaining a word vector corresponding to at least one target character string based on a second corresponding relation between the index and the word vector and the index corresponding to the at least one target character string through a word vector embedding module;

obtaining a first feature output by a convolutional neural network model based on an index and a word vector corresponding to at least one target character string and the convolutional neural network model;

obtaining a second feature output by the gated recurrent unit model based on the index and the word vector corresponding to the at least one target character string and the gated recurrent unit model;

concatenating the first feature and the second feature to obtain a third feature;

and inputting the third feature into the fully connected layer to obtain a malicious attack detection result output by the fully connected layer.

In the implementation process, the convolutional neural network model and the gated recurrent unit model are combined to construct the malicious attack detection model, so that the false alarm rate of malicious attack detection is reduced, the accuracy and generalization capability of malicious attack detection are improved, and the detection performance and security detection capability are enhanced.
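The data flow of the model described above (embedding lookup, a convolutional branch, a gated recurrent unit branch, concatenation, and a fully connected layer) can be sketched in plain Python. This is only an illustration under stated assumptions: the layer sizes, the random weights standing in for trained parameters, the omission of bias terms, the ReLU and global max pooling in the convolutional branch, and the sigmoid output are all choices made here and are not specified by the application.

```python
import math
import random

random.seed(0)
EMB, CONV_OUT, KERNEL, HID = 4, 3, 2, 3  # toy sizes (assumptions)


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Random parameters stand in for trained weights (biases omitted for brevity).
embedding = rand_matrix(10, EMB)                          # index -> word vector
conv_w = rand_matrix(CONV_OUT, KERNEL * EMB)              # 1-D convolution filters
gru_w = {g: rand_matrix(HID, EMB + HID) for g in "zrh"}   # update, reset, candidate
fc_w = rand_matrix(1, CONV_OUT + HID)                     # fully connected layer


def cnn_branch(vectors):
    """1-D convolution over the sequence, ReLU, then global max pooling."""
    windows = [sum(vectors[i:i + KERNEL], []) for i in range(len(vectors) - KERNEL + 1)]
    feature_maps = [[max(0.0, a) for a in matvec(conv_w, w)] for w in windows]
    return [max(column) for column in zip(*feature_maps)]


def gru_branch(vectors):
    """Run a single-layer GRU and return its final hidden state."""
    h = [0.0] * HID
    for x in vectors:
        z = [sigmoid(a) for a in matvec(gru_w["z"], x + h)]        # update gate
        r = [sigmoid(a) for a in matvec(gru_w["r"], x + h)]        # reset gate
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a) for a in matvec(gru_w["h"], x + reset_h)]
        h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
    return h


def detect(indexes):
    """Return a score in (0, 1); higher is read here as 'more likely malicious'."""
    vectors = [embedding[i] for i in indexes]   # word vector embedding module
    first = cnn_branch(vectors)                 # first feature
    second = gru_branch(vectors)                # second feature
    third = first + second                      # concatenation -> third feature
    return sigmoid(matvec(fc_w, third)[0])      # fully connected layer


score = detect([1, 4, 2, 7])
print(0.0 < score < 1.0)
```

The two branches read the same embedded sequence independently; the convolutional branch summarises local patterns of adjacent strings, while the recurrent branch summarises their order, and the fully connected layer sees both summaries side by side.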

In one embodiment, before obtaining an input matrix corresponding to at least one target character string based on a correspondence relationship among preset sample character strings, indexes, and word vectors, the method further includes:

respectively generating corresponding indexes aiming at each sample character string in the sample character string set;

establishing a first corresponding relation according to the indexes corresponding to the sample character strings;

respectively inputting each sample character string in the sample character string set into a pre-trained word vector model to obtain a corresponding word vector;

and establishing a second corresponding relation according to the word vector corresponding to each sample character string and the first corresponding relation.

In the implementation process, a first corresponding relation between the sample character string and the index and a second corresponding relation between the index and the word vector are constructed, and storage resources consumed by model input are reduced.
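A minimal sketch of the two correspondences, assuming a plain dictionary stands in for the pre-trained word vector model. The function name `build_correspondences`, the toy vectors, and the choice of assigning indexes in sorted order are illustrative assumptions.

```python
def build_correspondences(sample_strings, word_vector_model):
    """Return (first, second): string -> index, and index -> word vector."""
    first = {s: i for i, s in enumerate(sorted(sample_strings))}
    second = {i: word_vector_model[s] for s, i in first.items()}
    return first, second


# Hypothetical stand-in for a pre-trained word vector model.
toy_vectors = {
    "eval": [0.9, 0.1],
    "echo": [0.2, 0.3],
    "gzinflate": [0.8, 0.4],
}

first, second = build_correspondences(toy_vectors.keys(), toy_vectors)

# At detection time a target string is mapped to its index, then to its vector.
indexes = [first[s] for s in ["eval", "echo"]]
vectors = [second[i] for i in indexes]
print(indexes, vectors)
```

Passing integer indexes around instead of the strings or vectors themselves is what keeps the model input compact.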

In one embodiment, before performing character string extraction on a target script file to be detected to obtain at least one target character string, the method further includes:

respectively extracting character strings of each sample in the sample set by adopting a regular expression to obtain a sample character string corresponding to each sample;

screening sample character strings corresponding to each sample by adopting at least one of a first screening mode, a second screening mode and a third screening mode to obtain a sample character string set;

the first screening mode is to screen character strings according to malicious attack categories of samples, the second screening mode is to screen character strings according to document frequency of the character strings in each document, and the third screening mode is to screen the character strings based on machine learning.

In the implementation process, the character strings are screened according to the malicious attack category of the sample, the document frequency of the character strings in each document and machine learning, and when the malicious attack detection is carried out on the target script file, the character strings of the target script file are screened according to the sample character string set, so that a large amount of confusing custom functions maliciously introduced in the script file are avoided, the storage resources consumed by the sample character string set are reduced, and a lightweight model is favorably constructed.

In one embodiment, a first screening method is adopted to screen a sample string corresponding to each sample to obtain a sample string set, and the method includes:

obtaining a non-malicious character string set based on the sample character string of each non-malicious sample;

acquiring a malicious character string set based on sample character strings in each malicious sample;

and obtaining a sample character string set based on the intersection of the non-malicious character string set and the malicious character string set.

In the implementation process, the first screening mode is adopted, so that the character strings which are significant to malicious attack detection can be screened out, and the consumed storage resources can be reduced.

In one embodiment, the method for screening sample character strings corresponding to each sample by using a second screening method to obtain a sample character string set includes:

acquiring a document frequency corresponding to each sample character string;

and screening out the sample character strings with the corresponding document frequency higher than a preset document frequency threshold value from each sample character string to obtain a sample character string set.

In the implementation process, the sample character strings belonging to the common words can be screened out through the second screening mode.

In one embodiment, the method for screening sample character strings corresponding to each sample by using a third screening method to obtain a sample character string set includes:

screening each sample character string based on the variance and chi-square test to obtain a test character string set;

screening each sample character string based on logistic regression to obtain a regression character string set;

screening each sample character string based on a random forest to obtain a forest character string set;

and obtaining a sample character string set based on the test character string set, the regression character string set and the forest character string set.

In the implementation process, an important feature, namely a sample character string, can be screened out.

In one aspect, an apparatus for malicious attack detection is provided, including:

an obtaining unit, configured to extract character strings from a target script file to be detected to obtain at least one target character string;

a detection unit, configured to acquire a malicious attack detection result of the target script file based on the at least one target character string and a pre-trained malicious attack detection model, wherein the malicious attack detection model is constructed based on a convolutional neural network and a gated recurrent unit (GRU) and is used for detecting whether the target script file is a malicious script file.

In one embodiment, the obtaining unit is configured to:

and determining the screened character strings as target character strings.

In one embodiment, the obtaining unit is configured to:

acquiring a scripting language type of a target script file;

In one embodiment, the malicious attack detection model includes a word vector embedding module, a convolutional neural network model, a gated recurrent unit model, and a fully connected layer, and the detection unit is configured to:

In one embodiment, the detection unit is further configured to:

In one embodiment, the obtaining unit is further configured to:

acquiring a document frequency corresponding to each sample character string;

In one embodiment, the obtaining unit is further configured to:

In one aspect, an electronic device is provided that includes a processor and a memory, the memory storing computer readable instructions that, when executed by the processor, perform the steps of the method provided in any of the various alternative implementations of malicious attack detection described above.

In one aspect, a readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the steps of the method as provided in any of the various alternative implementations of malicious attack detection as described above.

In one aspect, a computer program product is provided which, when run on a computer, causes the computer to perform the steps of the method as provided in any of the various alternative implementations of malicious attack detection as described above.

In the method, the device, the electronic device and the readable storage medium for detecting the malicious attack, the character string extraction is carried out on the target script file to be detected to obtain at least one target character string; and acquiring a malicious attack detection result of the target script file based on at least one target character string and a pre-trained malicious attack detection model. Therefore, the consumed labor cost and time cost are reduced, and the accuracy of malicious attack detection is improved.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.

Fig. 1 is a flowchart of an implementation of a method for constructing a sample string set according to an embodiment of the present application;

Fig. 2 is a flowchart of an implementation of a method for detecting malicious attacks according to an embodiment of the present application;

Fig. 3 is a flowchart of an implementation of a method for training a malicious attack detection model according to an embodiment of the present application;

Fig. 4 is a block diagram of the structure of a malicious attack detection apparatus according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.

First, some terms referred to in the embodiments of the present application will be described to facilitate understanding by those skilled in the art.

Terminal device: may be a mobile terminal, a fixed terminal, or a portable terminal such as a mobile handset, station, unit, device, multimedia computer, multimedia tablet, internet node, communicator, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, personal communication system device, personal navigation device, personal digital assistant, audio/video player, digital camera/camcorder, positioning device, television receiver, radio broadcast receiver, electronic book device, gaming device, or any combination thereof, including the accessories and peripherals of these devices. It is also contemplated that the terminal device can support any type of user interface (e.g., a wearable device).

Server: may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, and big data and artificial intelligence platforms.

In order to reduce consumed time cost and labor cost and improve accuracy of malicious attack detection when detecting whether a script file is a malicious script, embodiments of the present application provide a malicious attack detection method, an apparatus, an electronic device, and a readable storage medium.

In the embodiment of the present application, the description takes a server as the execution subject by way of example only; in practical applications, the execution subject may also be an electronic device such as a terminal device, which is not limited herein.

In the embodiment of the application, before malicious attack detection is performed, a sample character string set for screening character strings in a target file is constructed. Referring to fig. 1, an implementation flow chart of a method for constructing a sample string set provided in the embodiment of the present application is shown, and a specific implementation flow of the method is as follows:

Step 100: Obtain a sample set.

Specifically, a non-malicious sample and a malicious sample are obtained, and a sample set is formed based on the non-malicious sample and the malicious sample.

A script is an executable file written in a specific descriptive language according to a certain format. In practical applications, a script can be written in different scripting languages, which is not limited herein. The samples (i.e., the non-malicious samples as well as the malicious samples) may be webshells (background management scripts). There may be one or more non-malicious samples and one or more malicious samples. The scripting language of a sample may be at least one of Java Server Pages (JSP), Active Server Pages (ASP), and Hypertext Preprocessor (PHP), or may be any other scripting language, which is not limited herein.

In one embodiment, the non-malicious samples are obtained from any one or any combination of a content management system (WordPress), website management software (phpcms), a database management tool (phpMyAdmin), a template engine (Smarty), and a high-performance PHP framework (Yii). The malicious samples are obtained from GitHub, a hosting platform for open-source and private software projects. The number of non-malicious samples is 48,000 and the number of malicious samples is 2,800.

Further, sample deduplication can be performed on samples in the sample set.

Wherein, when carrying out the sample duplication elimination, the following steps can be adopted:

S1001: Determine the hash value of each sample by using a hash algorithm.

For example, the MD5 value (i.e., the hash value) of each sample may be determined by using the MD5 message-digest algorithm (Message-Digest Algorithm 5, MD5).

S1002: Deduplicate samples that have the same hash value.

That is, samples with the same hash value are treated as identical, so the redundant copies are removed and only one sample is retained.

In this way, a sample set may be generated.
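Steps S1001 and S1002 can be sketched with Python's standard `hashlib` module. The function name `deduplicate` and the in-memory byte strings standing in for sample files are assumptions for illustration.

```python
import hashlib


def deduplicate(samples):
    """Keep the first sample seen for each distinct MD5 digest of its content."""
    seen = set()
    unique = []
    for content in samples:
        digest = hashlib.md5(content).hexdigest()  # S1001: hash value of the sample
        if digest not in seen:                     # S1002: drop repeated hash values
            seen.add(digest)
            unique.append(content)
    return unique


# Hypothetical sample contents; the first and third are byte-for-byte identical.
samples = [
    b"<?php echo 1; ?>",
    b"<?php eval($_POST['x']); ?>",
    b"<?php echo 1; ?>",
]
print(len(deduplicate(samples)))
```

Hash-based deduplication only removes exact duplicates; two scripts that differ by a single byte are kept as separate samples.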

Step 101: Extract character strings from each sample in the sample set by using a regular expression to obtain the sample character strings corresponding to each sample.

Variables and functions that are meaningful for malicious attack detection are usually composed of uppercase letters, lowercase letters and underscores, whereas custom variables that are not meaningful for malicious attack detection are typically composed of digits, letters and underscores. In addition, variables with few characters are typically custom variables, and function names with too many characters are generally meaningless for malicious attack detection. Moreover, if digits are retained, the number of functions and variables becomes huge, which consumes a great deal of time, storage resources and system resources during model training and is prone to false alarms. Therefore, a regular expression is constructed to extract the variables, functions and special characters in the script file that are composed of uppercase letters, lowercase letters and underscores, so that the accuracy of malicious attack detection is improved when detection is later performed based on the extracted character strings.
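The application does not give the regular expression itself, so the following is only one possible sketch. It keeps tokens made solely of letters and underscores, skips any token that contains a digit, and applies length bounds; the bounds of 3 and 30 characters and the name `extract_strings` are hypothetical choices made here.

```python
import re

# Tokens of 3 to 30 letters/underscores that are not part of a longer
# identifier containing digits. The length bounds are assumptions.
TOKEN_RE = re.compile(r"(?<![A-Za-z0-9_])[A-Za-z_]{3,30}(?![A-Za-z0-9_])")


def extract_strings(source):
    """Return letter/underscore-only tokens found in the script text."""
    return TOKEN_RE.findall(source)


# Hypothetical script content.
script = "<?php $a1 = file_get_contents($_GET['u']); eval(gzinflate($a1)); ?>"
print(extract_strings(script))
# ['php', 'file_get_contents', '_GET', 'eval', 'gzinflate']
```

The short custom variable `$a1` is skipped both because it contains a digit and because it is below the minimum length, while the function names that matter for detection are retained.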

When extracting the sample character string corresponding to the sample, the following steps may be adopted:

S1011: Obtain the scripting language type of the sample.

Specifically, the scripting language type is the scripting language in which the script is written; for example, the scripting language type may be Hypertext Preprocessor (PHP).

S1012: Judge whether the scripting language type is a non-specified type; if so, execute S1013; otherwise, execute S1014.

In one embodiment, the specified type is PHP, and the non-specified types are scripting languages other than PHP.

In practical applications, the specified type may be set according to the practical application scenario, and is not limited herein.

S1013: Extract at least one sample character string from the sample by using the regular expression.

S1014: Parse the sample to obtain a byte code file corresponding to the sample, and extract sample character strings from the byte code file by using a regular expression to obtain at least one sample character string.

Specifically, the sample may be a PHP script file written in the PHP language, and the sample character strings may be byte codes (opcodes) extracted from the PHP script file.

In one embodiment, a runtime environment for extracting the sample character strings of a PHP script file is installed in advance, a script file parsing command is executed to obtain a byte code file, and the byte codes, which appear in uppercase letters in the byte code file, are extracted by a regular expression.

For example, the runtime environment may be PHP with the vld (Vulcan Logic Dumper) extension installed, and the file parsing command may be cmd = "php" + " -dvld.active=1 -dvld.active=0 " + filepath, output = commands.

Further, a regular expression can be used to extract sample character strings from both the sample and the byte code file, and the extracted sample character strings are determined to be the at least one sample character string.

In one embodiment, regular expressions are applied to both the sample and the byte code file to extract the byte codes written in uppercase letters as well as the variable and function names written in uppercase and lowercase letters.

In the embodiment of the present application, only the sample character string in one sample is taken as an example for explanation, and in practical application, the sample character strings of other samples can be extracted by using the same principle, which is not described herein again.

In this way, the sample character string in each sample can be extracted.
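The branching in S1012 to S1014 can be sketched as follows. To keep the example runnable without a PHP installation, the byte code dump is produced by a caller-supplied function; `dump_bytecode`, `fake_dump`, the `include_source` flag, the two regular expressions, and the dump text are all hypothetical stand-ins and do not reproduce real tool output.

```python
import re

IDENT_RE = re.compile(r"(?<![A-Za-z0-9_])[A-Za-z_]+(?![A-Za-z0-9_])")
OPCODE_RE = re.compile(r"(?<![A-Za-z0-9_])[A-Z][A-Z_]+(?![A-Za-z0-9_])")


def extract_sample_strings(source, language, dump_bytecode, include_source=False):
    """Choose the extraction path from the scripting language type.

    Non-specified types (anything other than PHP here) are scanned directly.
    PHP samples are first turned into a byte code dump, from which the
    uppercase opcodes are taken; optionally the source is scanned as well.
    """
    if language != "php":                        # S1013
        return IDENT_RE.findall(source)
    opcodes = OPCODE_RE.findall(dump_bytecode(source))   # S1014
    if include_source:
        return IDENT_RE.findall(source) + opcodes
    return opcodes


def fake_dump(_source):
    # Hypothetical byte code dump text, for illustration only.
    return "INIT_FCALL 'eval'\nSEND_VAR !0\nDO_ICALL\nRETURN 1"


print(extract_sample_strings("<?php eval($x); ?>", "php", fake_dump))
print(extract_sample_strings("<% Response.Write(x) %>", "asp", fake_dump))
```

In a real deployment the `dump_bytecode` argument would wrap whatever command produces the byte code file; injecting it as a function keeps the branching logic testable on its own.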

Step 102: Screen the sample character strings corresponding to the samples by using at least one of a first screening mode, a second screening mode and a third screening mode to obtain a sample character string set.

Specifically, the first screening mode is to screen character strings according to malicious attack categories, the second screening mode is to screen character strings according to the document frequency of the character strings in each document, and the third screening mode is to screen character strings based on machine learning. Optionally, the machine learning may include at least one of variance, chi-square test, logistic regression, and random forest, and in practical application, the machine learning may also be set according to a practical application scenario.

When Step 102 is executed, any one or any combination of the following modes may be adopted:

Mode 1: Use the first screening mode to screen the sample character strings corresponding to the samples according to the malicious attack categories of the samples, to obtain a sample character string set.

For example, the size of the sample string set is 9507.

Mode 2: Use the second screening mode to screen the sample character strings corresponding to the samples according to the document frequency of the sample character strings in the documents, to obtain a sample character string set.

Mode 3: Use the third screening mode to screen the sample character strings corresponding to the samples based on machine learning, to obtain a sample character string set.

Mode 4: Obtain the sample character string sets produced by the first, second and third screening modes respectively, take the union of the sets produced by the first and second screening modes, and take the intersection of that union with the set produced by the third screening mode to obtain a new sample character string set.

Further, a new sample character string set may be obtained as the intersection or the union of the sample character string sets produced by at least two of the first screening mode, the second screening mode and the third screening mode.

In this way, a better sample string set can be obtained without requiring expert experience.
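Mode 4 amounts to a union followed by an intersection. A short sketch with Python sets; the function name `combine_mode_4` and the three toy sets are assumptions for illustration.

```python
def combine_mode_4(set_a, set_b, set_c):
    """Union of the first two screening results, intersected with the third."""
    return (set_a | set_b) & set_c


# Hypothetical outputs of the first, second and third screening modes.
set_a = {"eval", "echo", "system"}
set_b = {"echo", "gzinflate"}
set_c = {"eval", "gzinflate", "assert"}

print(sorted(combine_mode_4(set_a, set_b, set_c)))
# ['eval', 'gzinflate']
```

A string survives only if at least one of the first two modes selected it and the machine-learning-based mode selected it too.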

The malicious attack category indicates whether a sample is a malicious script; according to this category, the samples are divided into non-malicious samples and malicious samples. The first screening mode may also be called black-and-white intersection vocabulary extraction.

When Mode 1 is executed, the following steps may be adopted:

Step 1: Obtain a non-malicious character string set based on the sample character strings of the non-malicious samples.

Step 2: Obtain a malicious character string set based on the sample character strings of the malicious samples.

Step 3: Obtain a sample character string set based on the intersection of the non-malicious character string set and the malicious character string set.

Most of the words outside the intersection are self-defined functions and variables, or function names that have been deliberately scrambled for obfuscation and evasion. Such character strings are numerous, contribute little to malicious attack detection, and introduce noise in the subsequent model training. The first screening mode therefore retains the character strings that are significant for malicious attack detection and reduces the storage resources consumed.

In one embodiment, two lists are created, namely a data table (data) and a label table (lables). The sample character strings of each script sample file are stored in the data table, and the corresponding malicious attack categories are stored in the lables table. According to the malicious attack categories, the sample character strings in the data table are divided into a non-malicious character string set and a malicious character string set. The intersection of the non-malicious character string set and the malicious character string set is then taken to obtain the sample character string set.

Optionally, the malicious attack categories of the non-malicious samples and the malicious samples may be represented by 0 and 1, respectively, or by other labels, which is not limited herein. The sample character string set obtained by the first screening mode may be denoted set_a.
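The first screening mode can be sketched directly from the description of the two lists. The names `data`, `lables` and `set_a` follow the text above; the function name and the toy contents are assumptions.

```python
def black_white_intersection(data, lables):
    """Split strings by label (0 = non-malicious, 1 = malicious) and intersect."""
    non_malicious, malicious = set(), set()
    for strings, label in zip(data, lables):
        (malicious if label == 1 else non_malicious).update(strings)
    return non_malicious & malicious


# Hypothetical per-sample strings and their labels.
data = [
    ["echo", "strlen", "render_page"],      # non-malicious sample
    ["echo", "eval", "xQzWv"],              # malicious sample
    ["eval", "strlen", "log_request"],      # non-malicious sample
]
lables = [0, 1, 0]

set_a = black_white_intersection(data, lables)
print(sorted(set_a))
# ['echo', 'eval']
```

Names that occur on only one side, such as the scrambled `xQzWv` or the project-specific `render_page`, fall outside the intersection and are discarded.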

When the method 2 is executed, the following steps may be adopted:

step 1: and acquiring the document frequency corresponding to each sample character string.

Specifically, the document frequency corresponding to each sample character string is the number of documents containing each sample character string. Further, the document frequency may also be the total number of times the sample string appears in each document.

Step 2: and screening out the sample character strings with the corresponding document frequency higher than a preset document frequency threshold value from each sample character string to obtain a sample character string set.

In practical applications, the preset document frequency threshold may be set according to practical application scenarios, for example, 5, which is not limited herein.

For example, the sample character strings corresponding to the document frequency higher than 5 (i.e., the preset document frequency threshold) are screened out, and a sample character string set is obtained.

Once the size of the sample set is determined, a sample character string whose document frequency is higher than the preset document frequency threshold is usually an Application Programming Interface (API) name; otherwise, it is usually a user-defined, non-generic word. Non-generic words make the samples overly distinguishable and reduce the generalization capability of the model, so the second screening mode retains the sample character strings that are generic words.
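The second screening mode can be sketched as follows, using the first definition of document frequency (the number of documents that contain a string). The function name, the variable `set_b` and the toy threshold of 1 are assumptions made for illustration only.

```python
from collections import Counter


def filter_by_document_frequency(documents, threshold):
    """Keep strings whose document frequency is strictly above the threshold."""
    document_frequency = Counter()
    for strings in documents:
        # set() ensures a string is counted once per document.
        document_frequency.update(set(strings))
    return {s for s, n in document_frequency.items() if n > threshold}


documents = [
    ["eval", "eval", "foo1"],
    ["eval", "strlen"],
    ["strlen", "eval", "bar2"],
]
set_b = filter_by_document_frequency(documents, threshold=1)
```

Here `eval` appears in three documents and `strlen` in two, so both are kept, while `foo1` and `bar2` appear in only one document each and are discarded.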

When mode 3 is executed, the following steps may be adopted:

Step 1: Screen each sample character string based on the variance and the chi-square test to obtain a test character string set.

Specifically, sample character strings with high information content are selected based on the machine learning library scikit-learn (sklearn) and the variance, sample character strings highly correlated with the malicious attack detection result are selected based on sklearn and the chi-square test, and the selected sample character strings form the test character string set.

In this way, the sample character strings with high information content or high correlation with the malicious attack detection result can be screened out, but the screened test character string set is easily influenced by the number of the samples.

Step 2: Screen each sample character string based on a logistic regression algorithm to obtain a regression character string set.

In this way, the logistic regression algorithm selects sample character strings related to malicious attack detection based on their regression coefficients.

Step 3: Screen each sample character string based on a random forest to obtain a forest character string set.

In this way, the random forest fits nonlinear data by constructing a forest of decision trees over the sample character strings, and important sample character strings are selected by the Gini coefficient.

With this method, important features, i.e., sample character strings, can be selected. However, the method favors variables with more categories, and the importance of other features decreases sharply once a feature is selected, so that other important features may be lost.

Step 4: Obtain the sample character string set based on the test character string set, the regression character string set and the forest character string set.

Specifically, the test character string set, the regression character string set and the forest character string set are combined to obtain the sample character string set.

Further, the sample character string set may also be obtained based on an intersection or a union of any one or any combination of the test character string set, the regression character string set, and the forest character string set.

Further, before mode 3 is executed, numerical conversion and normalization may be performed on each sample character string to obtain a corresponding character string numerical value, and mode 3 may then be executed on these numerical values to obtain the sample character string set.

Optionally, the numerical conversion may employ the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm.
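As an illustration of the numerical conversion, the sketch below computes TF-IDF weights in the plain textbook form, tf = count / document length and idf = log(N / document frequency). This is an assumption: library implementations such as scikit-learn apply additional smoothing and normalization, so their values differ. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {string: weight} dict per document (unsmoothed TF-IDF)."""
    n_docs = len(documents)
    document_frequency = Counter()
    for doc in documents:
        document_frequency.update(set(doc))

    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            s: (c / len(doc)) * math.log(n_docs / document_frequency[s])
            for s, c in counts.items()
        })
    return weights


docs = [["eval", "eval", "strlen"], ["strlen", "count"]]
weights = tf_idf(docs)
```

A string that occurs in every document, such as `strlen` here, receives a weight of zero, while a string concentrated in one document receives a positive weight.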

For example, the sample character string set obtained by mode 3 may be denoted as set_C.
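The chi-square test used in step 1 of mode 3 can be illustrated with a hand-written statistic. The sketch below treats "the string is present in the sample" as a binary feature and computes the standard 2x2 contingency chi-square against the malicious/non-malicious label. This is an assumption made for illustration: it is not the scoring function of sklearn, which works on feature values rather than a presence table, and the function name is hypothetical.

```python
def chi_square(documents, labels, term):
    """2x2 chi-square between presence of `term` and the malicious label."""
    a = b = c = d = 0
    for strings, label in zip(documents, labels):
        present = term in strings
        if present and label == 1:
            a += 1      # malicious, contains term
        elif present:
            b += 1      # non-malicious, contains term
        elif label == 1:
            c += 1      # malicious, lacks term
        else:
            d += 1      # non-malicious, lacks term
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0      # term is in all samples or in none: no information
    return n * (a * d - b * c) ** 2 / denominator


documents = [["eval"], ["eval"], ["strlen"], ["strlen"]]
labels = [1, 1, 0, 0]
score_eval = chi_square(documents, labels, "eval")
score_missing = chi_square(documents, labels, "never_seen")
```

A higher score indicates a stronger association between the string and the label; strings scoring above a chosen cutoff would enter the test character string set.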

Therefore, in the subsequent malicious attack detection step, the character strings of the target script file can be screened based on the acquired sample character string set so as to screen out effective target character strings.

Referring to fig. 2, an implementation flow chart of a method for detecting malicious attacks provided in the embodiment of the present application is shown, and a specific implementation flow of the method is as follows:

Step 200: Extract character strings from the target script file to be detected to obtain at least one target character string.

Specifically, when step 200 is executed, the following steps may be adopted:

S2001: Extract character strings from the target script file to obtain at least one character string.

Specifically, when S2001 is executed, the following steps may be adopted:

Step 1: Acquire the script language type of the target script file.

Step 2: Judge whether the script language type is a non-specified type; if so, execute step 3, otherwise execute step 4.

Step 3: Extract at least one character string from the target script file by using a regular expression.

Step 4: Parse the target script file to obtain a byte code file corresponding to the target script file, and extract character strings from the byte code file by using a regular expression to obtain at least one character string.

Specifically, the target script file may be a PHP script file written in a PHP language, and the character strings may be byte codes (opcodes) extracted from the PHP script file.

Further, the regular expressions can be adopted to respectively extract the character strings in the target script file and the byte code file, and the extracted character string set is determined as the at least one character string.

In one embodiment, regular expressions are used to extract character strings from the target script file and the byte code file respectively, extracting variable and function strings written in upper-case letters or in a mix of upper-case and lower-case letters.

It should be noted that, based on a principle similar to the sample string extraction, a string of the target script file may be extracted, which is not described herein again.
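Regular-expression extraction from script source text can be sketched as follows. The pattern is an assumption, since the regular expression itself is not given here: it matches identifier-like tokens made of letters, digits and underscores. The byte-code branch is not shown because it depends on an external PHP parsing step.

```python
import re

# Hypothetical pattern: identifier-like tokens (letters, digits, underscores).
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def extract_strings(script_text):
    """Return identifier-like character strings in order of appearance."""
    return IDENTIFIER.findall(script_text)


sample = '<?php $cmd = $_POST["c"]; eval(base64_decode($cmd)); ?>'
tokens = extract_strings(sample)
```

The order of appearance is preserved because the later model consumes the strings as a sequence.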

S2002: a sample set of strings is obtained.

S2003: and screening out character strings contained in the sample character string set from the at least one character string.

Therefore, effective character strings can be screened out through the sample character string set.

S2004: and determining the screened character strings as target character strings.
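S2002 to S2004 amount to a membership filter, sketched below. The function name and the example sets are hypothetical; keeping duplicates and the original order is an assumption made so that the result can still be treated as a sequence.

```python
def select_target_strings(extracted, sample_string_set):
    """Keep only extracted strings that belong to the sample string set."""
    return [s for s in extracted if s in sample_string_set]


extracted = ["php", "cmd", "eval", "base64_decode", "cmd"]
sample_string_set = {"eval", "base64_decode", "php"}
target_strings = select_target_strings(extracted, sample_string_set)
```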

Step 201: Acquire a malicious attack detection result of the target script file based on the at least one target character string and a pre-trained malicious attack detection model.

Specifically, the malicious attack detection model is constructed based on a convolutional neural network and a gated recurrent unit (GRU) and is used for detecting whether the target script file is a malicious script file.

When step 201 is executed, the following steps may be adopted:

S2011: Obtain an index corresponding to the at least one target character string based on a preset first corresponding relation between sample character strings and indexes.

In one embodiment, before S2011 is executed, an index of each sample character string is generated in advance by the tokenizer (Tokenizer) of the open-source artificial neural network library Keras, and the first corresponding relation between sample character strings and indexes is established based on these indexes.

Further, a first input matrix including each index may be generated based on the index corresponding to each target character string.

Further, if the number of indexes in the first input matrix is higher than the specified matrix length threshold, dividing each index according to the specified matrix length threshold to obtain a plurality of first input matrices, and if the number of indexes in the first input matrix is lower than the specified matrix length threshold, performing data filling on the first input matrix to enable the number of indexes of the first input matrix to reach the specified matrix length threshold.

In one embodiment, the dimensions of the first input matrix may be [batch_size, time_steps], i.e., index batch and index.

For example, if the specified matrix length threshold is 1500 and the number of indexes in the first input matrix is lower than the specified matrix length threshold, the first input matrix is data-filled, and the obtained first input matrix is [ 17653514 … … 000 ], for a total of 1500 indexes.

In practical applications, the specified matrix length threshold may be set according to practical application scenarios, and is not limited herein.
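The index lookup, splitting and padding described for S2011 can be sketched in plain Python. The assumptions are that index 0 is reserved for padding, that padding is appended at the end of a row, and that strings missing from the first corresponding relation are skipped; the function name and the tiny length threshold of 3 are hypothetical.

```python
def to_index_rows(target_strings, string_to_index, max_len, pad=0):
    """Map strings to indexes, split into rows of max_len, pad the last row."""
    indexes = [string_to_index[s] for s in target_strings if s in string_to_index]
    # Split when there are more indexes than the length threshold.
    rows = [indexes[i:i + max_len] for i in range(0, len(indexes), max_len)] or [[]]
    # Fill short rows with the padding index up to the length threshold.
    return [row + [pad] * (max_len - len(row)) for row in rows]


string_to_index = {"eval": 1, "base64_decode": 2, "strlen": 3}
rows = to_index_rows(
    ["eval", "strlen", "eval", "base64_decode", "eval"],
    string_to_index,
    max_len=3,
)
```

Five indexes with a threshold of 3 yield two rows, the second of which is padded with one zero.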

S2012: Obtain, through the word vector embedding module, a word vector corresponding to the at least one target character string based on a second corresponding relation between indexes and word vectors and the index corresponding to the at least one target character string.

Specifically, a first input matrix generated based on the index corresponding to each target character string is input into a malicious attack detection model, a word vector corresponding to the index of each target character string is obtained through a word vector embedding module in the malicious attack detection model based on the second corresponding relation, and a second input matrix is generated based on the index and the word vector corresponding to each target character string.

In one embodiment, the dimensions of the second input matrix are [batch_size, time_steps, embedded_size], i.e., index batch, index and word vector.

The malicious attack detection model comprises a word vector embedding module. The word vector embedding module is an embedding (Embedding) layer.

Wherein, before performing S2012, a second corresponding relationship between the index and the word vector is pre-established. When the second corresponding relationship is established, the following steps may be adopted:

Input each sample character string into a pre-trained word vector training model, output a word vector corresponding to each sample character string, and establish the second corresponding relation between indexes and word vectors based on the word vector corresponding to each sample character string and the first corresponding relation between sample character strings and indexes.

Optionally, the word vector training model may be constructed based on Google's open-source word vector model (Word2vec). The word vector training model is a shallow neural network that aims to capture the contextual meaning of words within a certain window range; through it, high-dimensional sparse target character strings can be converted into low-dimensional dense word vectors.

In one embodiment, a word vector matrix is established based on the second correspondence between the index and the word vector, and S2012 is performed based on the word vector matrix.
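The lookup performed by the embedding layer can be sketched as a row selection from the word vector matrix. The matrix values below are made up for illustration, and placing an all-zero vector at row 0 for the padding index is an assumption.

```python
def embed(index_rows, word_vector_matrix):
    """Replace every index by its word vector: [batch, steps] -> [batch, steps, size]."""
    return [[word_vector_matrix[i] for i in row] for row in index_rows]


# Row i holds the word vector for index i; row 0 is the padding vector.
word_vector_matrix = [
    [0.0, 0.0],
    [0.1, 0.2],
    [0.3, 0.4],
]
second_input = embed([[1, 2, 0]], word_vector_matrix)
```

The result has one batch entry, three time steps and a word vector size of two.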

S2013: Obtain a first feature output by the convolutional neural network model based on the index and the word vector corresponding to the at least one target character string and the convolutional neural network model.

Specifically, the second input matrix is input into the convolutional neural network model to obtain the first feature output by the convolutional neural network model.

The convolutional neural network model is constructed based on a convolutional neural network, which is commonly used to extract abstract features. Optionally, the convolutional neural network model may be constructed based on a Deep Pyramid Convolutional Neural Network (DPCNN).

The first feature may be denoted output_A.
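To illustrate how a convolution turns a sequence of word vectors into a feature, the sketch below applies a single one-dimensional filter with ReLU activation followed by global max pooling. This is a deliberately simplified stand-in and not a DPCNN; the kernel values are made up, and the sequence is assumed to be at least as long as the kernel.

```python
def conv1d_max(sequence, kernel, bias=0.0):
    """One 1-D convolution filter + ReLU + global max pooling.

    sequence: time_steps x size list of word vectors
    kernel:   width x size list of filter weights
    """
    width = len(kernel)
    activations = []
    for start in range(len(sequence) - width + 1):
        total = bias
        for k in range(width):
            total += sum(x * w for x, w in zip(sequence[start + k], kernel[k]))
        activations.append(max(0.0, total))   # ReLU
    return max(activations)                   # global max pooling


sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]
feature = conv1d_max(sequence, kernel)
```

A real model would use many such filters, so that the first feature is a vector rather than a single number.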

S2014: Obtain a second feature output by the gated recurrent unit model based on the index and the word vector corresponding to the at least one target character string and the gated recurrent unit model.

Specifically, the second input matrix is input into the gated recurrent unit model to obtain the second feature output by the gated recurrent unit model.

Specifically, the gated recurrent unit model may be constructed based on a Gated Recurrent Unit (GRU). The second feature may be denoted output_B.

The GRU is a recurrent neural network for capturing sequence information. It is a variant of the Long Short-Term Memory (LSTM) network that retains the LSTM's capability of capturing long-distance information while reducing the number of parameters.
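A single GRU time step can be written out in plain Python to show the update gate, the reset gate and the candidate state. This is a sketch under stated assumptions: the parameter names (`Wz`, `Uz`, `bz`, and so on) are hypothetical, and the update convention used is h' = (1 - z) * h + z * candidate, whereas some libraries swap the roles of z and 1 - z.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def gate(x, h, w, u, b, activation):
    """activation(W x + U h + b), computed element by element."""
    return [activation(wx + uh + bi)
            for wx, uh, bi in zip(matvec(w, x), matvec(u, h), b)]


def gru_step(x, h, p):
    z = gate(x, h, p["Wz"], p["Uz"], p["bz"], sigmoid)        # update gate
    r = gate(x, h, p["Wr"], p["Ur"], p["br"], sigmoid)        # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    candidate = gate(x, reset_h, p["Wh"], p["Uh"], p["bh"], math.tanh)
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, candidate)]


def zero_params(input_size, hidden_size):
    """All-zero parameters, used here only to make the arithmetic easy to follow."""
    params = {}
    for name in "zrh":
        params["W" + name] = [[0.0] * input_size for _ in range(hidden_size)]
        params["U" + name] = [[0.0] * hidden_size for _ in range(hidden_size)]
        params["b" + name] = [0.0] * hidden_size
    return params


h_next = gru_step([0.3], [1.0, -2.0], zero_params(1, 2))
```

With all-zero parameters both gates equal 0.5 and the candidate state is zero, so the hidden state is simply halved. Running `gru_step` over every time step and keeping the final hidden state gives one possible form of the second feature.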

S2015: Splice the first feature and the second feature to obtain a third feature.

In this way, the first feature and the second feature are spliced along the index dimension, which enhances the character string features. Compared with using a convolutional neural network model or a gated recurrent unit model alone, this approach achieves a better malicious attack detection effect.

S2016: Input the third feature into the full connection layer to obtain the malicious attack detection result output by the full connection layer.

Specifically, the third feature is input into a dense (Dense) full connection layer to obtain a malicious attack detection value, from which the malicious attack detection result, i.e., whether the target script file is a malicious script file, is determined.
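S2015 and S2016 can be sketched as a concatenation followed by one dense unit. Several points are assumptions made for illustration: the features are treated as flat vectors, the dense layer uses a sigmoid activation, and a detection value of at least 0.5 is read as malicious. The weights are made up.

```python
import math


def detect(output_a, output_b, weights, bias, threshold=0.5):
    """Concatenate two feature vectors and apply one sigmoid dense unit."""
    third_feature = list(output_a) + list(output_b)
    logit = sum(f * w for f, w in zip(third_feature, weights)) + bias
    detection_value = 1.0 / (1.0 + math.exp(-logit))
    return detection_value, detection_value >= threshold


detection_value, is_malicious = detect(
    output_a=[2.0],
    output_b=[0.5, -1.0],
    weights=[1.0, 1.0, 1.0],
    bias=-0.5,
)
```

In a trained model the weights and bias come from training, and the threshold can be tuned to trade false positives against recall.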

The malicious attack detection model comprises a word vector embedding module, a convolutional neural network model, a gated recurrent unit model and a full connection layer.

Before step 201 is executed, the malicious attack detection model is trained in advance through sample data.

In one embodiment, cross entropy is used as the loss function, Adam is selected as the optimizer, the learning rate is set to 0.001, and training runs for 50 epochs. ReduceLROnPlateau is set to 2 so that the model converges to a better position, and EarlyStopping is set to 5 to reduce the time overhead of model training. On the existing test set, the false positive rate of the model is 0.004 and the recall is 0.94.
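The effect of early stopping on training time can be sketched independently of any framework. The sketch assumes that the EarlyStopping value of 5 is a patience, i.e., the number of consecutive epochs without improvement in validation loss after which training stops; the loss values are made up.

```python
def train_with_early_stopping(validation_losses, patience=5):
    """Return (epochs actually run, best validation loss seen)."""
    best = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best:
            best, stale_epochs = loss, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return epoch, best      # stop before the epoch budget is used up
    return len(validation_losses), best


losses = [0.9, 0.5, 0.4, 0.41, 0.42, 0.43, 0.44, 0.45, 0.3]
epochs_run, best_loss = train_with_early_stopping(losses, patience=5)
```

Training stops after the eighth epoch because the loss has not improved for five epochs, so the later value 0.3 is never reached; this is the usual trade-off between training time and the chance of a late improvement.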

In the embodiment of the application, a sample character string set is constructed in advance based on the malicious attack category of each sample, the document frequency of the character strings in each document, and machine learning. When malicious attack detection is performed on a target script file, the character strings of the target script file are screened according to the sample character string set. This avoids introducing the large number of obfuscated user-defined functions and words found in malicious script files, reduces the storage resources consumed by the sample character string set, and facilitates building a lightweight model. Different character string extraction modes can be selected according to the script language type of the target script file, so that malicious attack detection can be performed on script files of different types based on script word segmentation or byte codes. Because the corresponding relations among sample character strings, indexes and word vectors are established in advance, only the indexes of the target character strings need to be used as the input of the malicious attack detection model, which reduces the storage resources consumed and improves data processing efficiency. Further, the malicious attack detection model is constructed by combining a convolutional neural network model and a gated recurrent unit model, which reduces the false alarm rate of malicious attack detection, improves its accuracy and generalization capability, and improves the overall detection performance and security detection capability.

Fig. 3 is a flowchart illustrating an implementation of a method for training a malicious attack detection model according to an embodiment of the present disclosure.

Step 300: Construct a sample set.

Specifically, a plurality of non-malicious samples and a plurality of malicious samples are collected, and sample deduplication is performed based on the hash value of each sample to obtain the sample set.

Further, if a sample is a PHP script file, the PHP script file is parsed to obtain a byte code file, and the obtained byte code file is added to the sample set.

Step 301: Construct a regular expression.

Step 302: Extract the sample character strings of each sample in the sample set based on the regular expression.

Step 303: Screen the sample character strings according to the malicious attack category corresponding to each sample character string, the document frequency, and machine learning, to obtain a sample character string set.

Step 304: Generate a word vector corresponding to each sample character string through a word vector training model.

Specifically, Word2vec is trained based on each sample character string in the sample character string set to obtain a trained word vector training model, and a word vector corresponding to each sample character string is generated through the word vector training model.

Step 305: Determine an index corresponding to each sample character string, and construct a first corresponding relation between sample character strings and indexes and a second corresponding relation between indexes and word vectors.

Step 306: Generate a first input matrix corresponding to the sample character strings of each sample based on the first corresponding relation and the specified matrix length threshold.

Step 307: Generate a second input matrix based on the first input matrix and the second corresponding relation.

Step 308: Input the second input matrix into the convolutional neural network model to obtain a first feature.

Step 309: Input the second input matrix into the gated recurrent unit model to obtain a second feature.

Step 310: Splice the first feature and the second feature to obtain a third feature.

Step 311: Input the third feature into the full connection layer to obtain a malicious attack detection result output by the full connection layer.

Step 312: Adjust the parameters of the malicious attack detection model according to the actual malicious attack result and the malicious attack detection result of each sample to obtain the trained malicious attack detection model.
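The hash-based sample deduplication in step 300 can be sketched as follows. SHA-256 is an assumption, since the hash function is not named here; the function name and the sample contents are hypothetical.

```python
import hashlib


def deduplicate(samples):
    """Keep the first occurrence of each distinct file content.

    samples: list of (content_bytes, label) pairs
    """
    seen = set()
    unique = []
    for content, label in samples:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((content, label))
    return unique


samples = [
    (b"<?php eval($_POST['c']); ?>", 1),
    (b"<?php echo 'hi'; ?>", 0),
    (b"<?php eval($_POST['c']); ?>", 1),   # exact duplicate of the first sample
]
unique_samples = deduplicate(samples)
```

Only byte-identical files are merged this way; samples that differ by a single character are kept as separate entries.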

Based on the same inventive concept, the embodiment of the present application further provides a malicious attack detection device. Because the principle by which the device solves the problem is similar to that of the malicious attack detection method, the implementation of the device can refer to the implementation of the method, and repeated details are omitted.

As shown in fig. 4, a schematic structural diagram of a malicious attack detection apparatus provided in an embodiment of the present application includes:

an obtaining unit 401, configured to perform character string extraction on a target script file to be detected, so as to obtain at least one target character string;

a detection unit 402, configured to acquire a malicious attack detection result of the target script file based on the at least one target character string and a pre-trained malicious attack detection model, wherein the malicious attack detection model is constructed based on a convolutional neural network and a gated recurrent unit and is used for detecting whether the target script file is a malicious script file.

In one embodiment, the obtaining unit 401 is configured to:

and determining the screened character strings as target character strings.

In one embodiment, the obtaining unit 401 is configured to:

acquiring a scripting language type of a target script file;

In one embodiment, the malicious attack detection model includes a word vector embedding module, a convolutional neural network model, a gated cyclic unit model, and a full connection layer, and the detection unit 402 is configured to:

In one embodiment, the detection unit 402 is further configured to:

In one embodiment, the obtaining unit 401 is further configured to:

acquiring a document frequency corresponding to each sample character string;

In one embodiment, the obtaining unit 401 is further configured to:

In the method, the device, the electronic device and the readable storage medium for detecting the malicious attack, the character string extraction is carried out on the target script file to be detected to obtain at least one target character string; and acquiring a malicious attack detection result of the target script file based on at least one target character string and a pre-trained malicious attack detection model, wherein the malicious attack detection model is constructed based on a convolutional neural network and a gated cyclic unit and is used for detecting whether the target script file is a malicious script file. In this way, the cost of consumption is reduced and the accuracy of malicious attack detection is improved.

Fig. 5 shows a schematic structural diagram of an electronic device 5000. Referring to fig. 5, the electronic device 5000 includes a processor 5010 and a memory 5020, and may optionally include a power supply 5030, a display unit 5040, and an input unit 5050.

The processor 5010 is a control center of the electronic apparatus 5000, connects various components using various interfaces and lines, and performs various functions of the electronic apparatus 5000 by running or executing software programs and/or data stored in the memory 5020, thereby monitoring the electronic apparatus 5000 as a whole.

In the embodiment of the present application, the processor 5010 executes the method for detecting malicious attacks provided by the embodiment shown in fig. 2 when calling the computer program stored in the memory 5020.

Optionally, the processor 5010 can include one or more processing units; preferably, the processor 5010 can integrate an application processor, which mainly handles operating systems, user interfaces, applications, etc., and a modem processor, which mainly handles wireless communications. It is to be appreciated that the modem processor described above may not be integrated into the processor 5010. In some embodiments, the processor and the memory may be implemented on a single chip; in other embodiments, they may be implemented on separate chips.

The memory 5020 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, various applications, and the like; the storage data area may store data created according to the use of the electronic device 5000, and the like. Further, the memory 5020 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.

The electronic device 5000 also includes a power supply 5030 (e.g., a battery) that provides power to the various components and that may be logically connected to the processor 5010 via a power management system to provide management of charging, discharging, and power consumption via the power management system.

The display unit 5040 may be configured to display information input by a user or information provided to the user, and various menus of the electronic device 5000, and in the embodiment of the present invention, the display unit is mainly configured to display a display interface of each application in the electronic device 5000 and objects such as texts and pictures displayed in the display interface. The display unit 5040 may include a display panel 5041. The Display panel 5041 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.

The input unit 5050 may be used to receive information such as numbers or characters input by a user. Input units 5050 may include touch panel 5051 as well as other input devices 5052. Among other things, the touch panel 5051, also referred to as a touch screen, may collect touch operations by a user on or near the touch panel 5051 (e.g., operations by a user on or near the touch panel 5051 using a finger, a stylus, or any other suitable object or attachment).

Specifically, the touch panel 5051 can detect a touch operation by a user, detect signals resulting from the touch operation, convert the signals into touch point coordinates, transmit the touch point coordinates to the processor 5010, and receive and execute a command transmitted from the processor 5010. In addition, the touch panel 5051 may be implemented in various types, such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. Other input devices 5052 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, power on/off keys, etc.), a trackball, a mouse, a joystick, and the like.

Of course, the touch panel 5051 may cover the display panel 5041, and when the touch panel 5051 detects a touch operation thereon or thereabout, it is transmitted to the processor 5010 to determine the type of touch event, and then the processor 5010 provides a corresponding visual output on the display panel 5041 according to the type of touch event. Although in fig. 5, the touch panel 5051 and the display panel 5041 are implemented as two separate components to implement input and output functions of the electronic device 5000, in some embodiments, the touch panel 5051 and the display panel 5041 may be integrated to implement input and output functions of the electronic device 5000.

The electronic device 5000 may also include one or more sensors, such as pressure sensors, gravitational acceleration sensors, proximity light sensors, and the like. Of course, the electronic device 5000 may further include other components such as a camera according to the requirements of a specific application, and these components are not shown in fig. 5 and are not described in detail since they are not components used in this embodiment of the present application.

Those skilled in the art will appreciate that fig. 5 is merely an example of an electronic device and is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or different components.

In an embodiment of the present application, a readable storage medium has a computer program stored thereon, and when the computer program is executed by a processor, the electronic device may perform the steps in the above embodiments.

For convenience of description, the above parts are separately described as modules (or units) according to functional division. Of course, the functionality of the various modules (or units) may be implemented in the same one or more pieces of software or hardware when implementing the present application.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A method of malicious attack detection, comprising:

and acquiring a malicious attack detection result of the target script file based on the at least one target character string and a pre-trained malicious attack detection model, wherein the malicious attack detection model is constructed based on a convolutional neural network and a gated loop unit and is used for detecting whether the target script file is a malicious script file.

2. The method of claim 1, wherein the extracting the character strings of the target script file to be detected to obtain at least one target character string comprises:

screening out character strings contained in the sample character string set from the at least one character string;

and determining the screened character strings as the target character strings.

3. The method of claim 2, wherein said extracting the character string of the target script file to obtain at least one character string comprises:

acquiring the script language type of the target script file;

judging whether the script language type is a non-specified type or not, if so, extracting the at least one character string in the target script file by adopting a regular expression;

otherwise, analyzing the target script file to obtain a byte code file corresponding to the target script file, and extracting character strings from the byte code file by adopting the regular expression to obtain the at least one character string.

4. The method according to claim 2 or 3, wherein the malicious attack detection model includes a word vector embedding module, a convolutional neural network model, a gated recurrent unit model and a fully connected layer, and the obtaining a malicious attack detection result based on the at least one target character string and a pre-trained malicious attack detection model includes:

obtaining an index corresponding to the at least one target character string based on a first correspondence between preset sample character strings and indexes;

obtaining, by the word vector embedding module, a word vector corresponding to the at least one target character string based on a second correspondence between an index and a word vector and the index corresponding to the at least one target character string;

obtaining a first feature output by the convolutional neural network model based on the index and the word vector corresponding to the at least one target character string and the convolutional neural network model;

obtaining a second feature output by the gated recurrent unit model based on the index and the word vector corresponding to the at least one target character string and the gated recurrent unit model;

splicing the first feature and the second feature to obtain a third feature;

inputting the third feature into the fully connected layer, and obtaining the malicious attack detection result output by the fully connected layer.
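The data flow of claim 4 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented model: the weights are random and untrained, biases are omitted, the convolution uses a single window width with ReLU and global max pooling, the recurrent branch returns only its final hidden state, and every name (`cnn_feature`, `gru_feature`, `detect`, `params`, `vocab`, `embedding`) is hypothetical.

```python
import math
import random
from typing import Dict, List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def rand_matrix(rng: random.Random, rows: int, cols: int) -> List[Vector]:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def cnn_feature(vectors: List[Vector], kernels: List[Vector], width: int) -> Vector:
    """1-D convolution over the sequence, ReLU, then global max pooling."""
    windows = [sum(vectors[i:i + width], []) for i in range(len(vectors) - width + 1)]
    return [max(max(dot(k, w), 0.0) for w in windows) for k in kernels]


def gru_feature(vectors, Wz, Wr, Wh) -> Vector:
    """Single-layer gated recurrent unit (no biases); returns the final state."""
    h = [0.0] * len(Wz)
    for x in vectors:
        z = [sigmoid(dot(row, x + h)) for row in Wz]   # update gate
        r = [sigmoid(dot(row, x + h)) for row in Wr]   # reset gate
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(dot(row, x + gated)) for row in Wh]
        h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
    return h


def detect(target_strings, string_to_index: Dict[str, int],
           index_to_vector: Dict[int, Vector], params) -> float:
    # First correspondence: string -> index (unknown strings are dropped).
    indexes = [string_to_index[s] for s in target_strings if s in string_to_index]
    # Second correspondence: index -> word vector.
    vectors = [index_to_vector[i] for i in indexes]
    dim = len(next(iter(index_to_vector.values())))
    while len(vectors) < params["width"]:       # zero-pad short inputs (assumption)
        vectors.append([0.0] * dim)
    first = cnn_feature(vectors, params["kernels"], params["width"])
    second = gru_feature(vectors, params["Wz"], params["Wr"], params["Wh"])
    third = first + second                      # splice the two features
    return sigmoid(dot(params["fc"], third))    # fully connected layer, 1 output


rng = random.Random(0)
dim, hidden, n_kernels, width = 4, 3, 2, 2
vocab = {"eval": 1, "base64_decode": 2, "echo": 3}
embedding = {i: [rng.uniform(-1, 1) for _ in range(dim)] for i in vocab.values()}
params = {
    "width": width,
    "kernels": rand_matrix(rng, n_kernels, width * dim),
    "Wz": rand_matrix(rng, hidden, dim + hidden),
    "Wr": rand_matrix(rng, hidden, dim + hidden),
    "Wh": rand_matrix(rng, hidden, dim + hidden),
    "fc": [rng.uniform(-0.5, 0.5) for _ in range(n_kernels + hidden)],
}
score = detect(["eval", "unknown", "base64_decode", "echo"], vocab, embedding, params)
```

Here `score` is a value between 0 and 1 that a trained model would interpret as the likelihood that the script is malicious; with the random weights above it carries no meaning.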

5. The method of claim 4, wherein before obtaining the index corresponding to the at least one target character string based on the first correspondence between preset sample character strings and indexes, the method further comprises:

generating a corresponding index for each sample character string in the sample character string set;

establishing the first correspondence according to the index corresponding to each sample character string;

and establishing the second correspondence according to the word vector corresponding to each sample character string and the first correspondence.
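The two lookup tables of claim 5 can be sketched as Python dictionaries. This is a minimal illustration with several assumptions: index 0 is reserved for padding, indexes are assigned in sorted order, and the word vectors are randomly initialized placeholders rather than trained embeddings. The name `build_correspondences` and its parameters are hypothetical.

```python
import random
from typing import Dict, Iterable, List, Tuple


def build_correspondences(
    sample_strings: Iterable[str], dim: int = 4, seed: int = 0
) -> Tuple[Dict[str, int], Dict[int, List[float]]]:
    """Build string -> index and index -> word vector tables."""
    # First table: one index per distinct sample string, starting at 1.
    string_to_index = {
        s: i for i, s in enumerate(sorted(set(sample_strings)), start=1)
    }
    # Second table: one vector per index; index 0 is an all-zero padding vector.
    rng = random.Random(seed)
    index_to_vector = {0: [0.0] * dim}
    for idx in string_to_index.values():
        index_to_vector[idx] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return string_to_index, index_to_vector


s2i, i2v = build_correspondences(["eval", "exec", "eval"])
```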

6. The method according to claim 2 or 3, wherein before the extracting the character strings of the target script file to be detected to obtain at least one target character string, the method further comprises:

the first screening mode screens character strings according to the malicious attack categories of the samples, the second screening mode screens character strings according to the document frequency of the character strings, and the third screening mode screens character strings based on machine learning.

7. The method according to claim 6, wherein the screening the sample character strings corresponding to the samples by using the first screening method to obtain the sample character string set comprises:

obtaining the sample character string set based on the intersection of the non-malicious character string set and the malicious character string set.

8. The method according to claim 7, wherein the screening the sample character strings corresponding to the samples by using the second screening method to obtain the sample character string set comprises:

acquiring a document frequency corresponding to each sample character string;

and screening out the sample character strings with the corresponding document frequency higher than a preset document frequency threshold value from each sample character string to obtain the sample character string set.
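The document frequency screening of claim 8 can be sketched as follows. The sketch assumes that document frequency means the fraction of sample documents containing a string at least once; the claim does not fix the exact definition, and the name `screen_by_document_frequency` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def screen_by_document_frequency(
    docs: List[Iterable[str]], threshold: float
) -> Set[str]:
    """Keep strings whose document frequency exceeds the threshold."""
    counts: Counter = Counter()
    for strings in docs:
        # Count each string at most once per document.
        counts.update(set(strings))
    n_docs = len(docs)
    return {s for s, c in counts.items() if c / n_docs > threshold}


samples = [["eval", "echo"], ["eval", "eval"], ["print"]]
kept = screen_by_document_frequency(samples, threshold=0.5)
```

With these three sample documents, only `eval` appears in more than half of them, so it is the only string retained.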

9. The method according to claim 8, wherein the screening the sample character strings corresponding to the samples by using the third screening method to obtain the sample character string set comprises:

obtaining the sample character string set based on the test character string set, the regression character string set and the forest character string set.

10. An apparatus for malicious attack detection, comprising:

an obtaining unit, configured to extract character strings of a target script file to be detected to obtain at least one target character string;

and a detection unit, configured to acquire a malicious attack detection result of the target script file based on the at least one target character string and a pre-trained malicious attack detection model, wherein the malicious attack detection model is constructed based on a convolutional neural network and a gated recurrent unit and is used for detecting whether the target script file is a malicious script file.

11. The apparatus of claim 10, wherein the obtaining unit is configured to:

and determining the screened character strings as the target character strings.

12. The apparatus of claim 11, wherein the obtaining unit is configured to:

acquiring the script language type of the target script file;

13. The apparatus according to claim 11 or 12, wherein the malicious attack detection model includes a word vector embedding module, a convolutional neural network model, a gated recurrent unit model, and a fully connected layer, and the detection unit is configured to:

splicing the first feature and the second feature to obtain a third feature;

14. The apparatus of claim 13, wherein the detection unit is further configured to:

15. The apparatus of claim 11 or 12, wherein the obtaining unit is further configured to:

16. The apparatus of claim 15, wherein the obtaining unit is further configured to: obtain a non-malicious character string set based on the sample character strings of each non-malicious sample;

17. The apparatus of claim 16, wherein the obtaining unit is further configured to: acquire a document frequency corresponding to each sample character string;

18. The apparatus of claim 17, wherein the obtaining unit is further configured to: screen each sample character string based on variance and a chi-square test to obtain a test character string set;

19. An electronic device comprising a processor and a memory, the memory storing computer readable instructions that, when executed by the processor, perform the method of any of claims 1-9.

20. A readable storage medium storing a computer program which, when executed by a processor, carries out the method according to any one of claims 1-9.